BACKGROUND AND OBJECTIVES: NICUs vary in the quality of care delivered to very low birth weight (VLBW) infants. NICU performance on 1 measure of quality only modestly predicts performance on others. Composite measurement of quality of care delivery may provide a more comprehensive assessment of quality. The objective of our study was to develop a robust composite indicator of quality of NICU care provided to VLBW infants that accurately discriminates performance among NICUs.
METHODS: We developed a composite indicator, Baby-MONITOR, based on 9 measures of quality chosen by a panel of experts. Measures were standardized, equally weighted, and averaged. We used the California Perinatal Quality Care Collaborative database to perform a cross-sectional analysis of care given to VLBW infants between 2004 and 2010. Performance on the Baby-MONITOR is not an absolute marker of quality but indicates overall performance relative to that of the other NICUs. We used sensitivity analyses to assess the robustness of the composite indicator by varying assumptions and methods.
RESULTS: Our sample included 9023 VLBW infants in 22 California regional NICUs. We found significant variations within and between NICUs on measured components of the Baby-MONITOR. Risk-adjusted composite scores discriminated performance among this sample of NICUs. Sensitivity analysis that included different approaches to normalization, weighting, and aggregation of individual measures showed the Baby-MONITOR to be robust (r = 0.89–0.99).
CONCLUSIONS: The Baby-MONITOR may be a useful tool to comprehensively assess the quality of care delivered by NICUs.
- CPQCC — California Perinatal Quality Care Collaborative
- VLBW — very low birth weight
What’s Known on This Subject:
The traditional process-focused approach to quality improvement has not remedied NICUs’ inconsistency in quality of care delivery across clinically important measures. Global measurement of quality may induce broad, systems-based improvement, but must be formally studied.
What This Study Adds:
We present a systematically developed and robust composite indicator, the Baby-MONITOR, to assess the quality of care delivered to very low birth weight infants in the NICU setting.
Neonatal intensive care is a complex and multidimensional activity, and the measurement of its quality should reflect that complexity. There is value in summarizing performance by combining the information from multiple measures, as such a summary can convey quality from many different perspectives.1 The Institute of Medicine noted that composite measures can extend measurement beyond tracking performance on individual measures and can provide a potentially deeper view of the reliability of the care system.2 A multidimensional measure may also provide new insights into effective improvement strategies.
The National Quality Forum defines composite measures as “a combination of two or more individual measures into a single measure that results in a single score.”1 They are created by compiling individual measures into a single indicator, on the basis of an underlying model of the multidimensional concept that is being measured.3 Their primary appeal is the ability to simplify and summarize otherwise complex issues, and to provide global insights and trends about quality of care.
On the other hand, composite indicators may be susceptible to unsound conceptual or statistical approaches, and they may be less transparent than individual measures of quality. Therefore, the construction of composites requires that explicit and transparent methods are used to ensure conceptual and statistical soundness,4 so that they do not (1) fall short of their aim to improve quality of care, (2) fail to elicit buy-in from providers, (3) misclassify providers as outliers, or (4) encourage overly simplistic conclusions.
In this article, we describe the construction of a composite indicator of quality of care delivered to very low birth weight (VLBW; <1500 g) infants, building on previous work. We have coined the term Baby-MONITOR (Measure Of Neonatal InTensive care Outcomes Research) for the instrument,14 which we present as a prototype for the next generation of quality assessment. Our primary objective was to test whether the Baby-MONITOR would discriminate global NICU performance on quality of care delivery. Our secondary objective was to test the robustness of the Baby-MONITOR.
We followed a systematic and explicit approach based on recommendations by the European Commission Joint Research Center and the Organization for Economic Cooperation and Development, thought leaders in this area.3,4,15 Preliminary steps in the development of the Baby-MONITOR included development of a theoretical framework4,15; expert-informed selection of its measure components14,16; initial data analysis (1) to investigate the completeness of the data, (2) to develop and test adequate measure definitions and restrictions, and (3) to minimize systematic selection and transfer biases; and construction of risk adjustment models.17
With these building blocks in place, we standardized and risk-adjusted outcomes, weighted the individual components, and aggregated measures to form a composite indicator. After defining a base-case composite, we evaluated its sensitivity to the underlying assumptions and explored the effects of alternative computational approaches.
Patient data for this analysis were obtained from the California Perinatal Quality Care Collaborative (CPQCC). Local NICU personnel are trained to abstract data. Annual training sessions help to promote accuracy and uniformity in data abstraction. Each record has range and logic checks both at the time of data collection and data closeout, with auditing of records with excessive missing data. A detailed description of the patient-selection criteria has been published elsewhere.14 In brief, the goal was to create a sample of VLBW infants that would represent the “common” preterm infant. For this study, 9023 unique VLBW infants cared for at 22 California regional NICUs between 2004 and 2010 met the inclusion criteria. Of these centers, 15 are designated as level 4 (access to pediatric surgical subspecialists) and the remainder as level 3.18 We used multiyear analyses because of the small number of VLBW infants cared for in some institutions.
To ensure that patient outcomes reflected the quality of care of the NICU under observation, we excluded infants who died before 12 hours of life, those transferred in after 3 days of age, those transferred out for reasons other than convalescent and chronic care, and those who had severe congenital anomalies. Finally, to avoid systematic bias based on decisions to withhold resuscitation at the threshold of viability, we restricted the analysis to infants born after 24 completed weeks of gestation.
Quality-of-care measures: Measures were selected by an expert panel via a formal modified Delphi process,4 and subsequently affirmed by a sample of practicing neonatologists.16 Measure definitions were derived from standard CPQCC/Vermont Oxford Network (VON) algorithms. The measures were expressed as binary variables at the patient level and as proportions at the unit level. They included: (1) any antenatal steroid administration; (2) moderate hypothermia (<36°C) on admission; (3) non–surgically-induced pneumothorax; (4) health care-associated bacterial or fungal infection; (5) chronic lung disease (oxygen requirement at 36 weeks’ gestational age); (6) timely eye exam (retinopathy of prematurity screening at the age recommended by the American Academy of Pediatrics); (7) discharge on any human breast milk; (8) mortality during the birth hospitalization; and (9) growth velocity (above or below the median of 12.4 g/kg/day). Growth velocity was determined according to a logarithmic function described by Patel.19 We aligned variables so that a higher value represented a better outcome. Other restrictions with regard to transfers and hospital of birth are described elsewhere.14
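The growth velocity measure can be illustrated with a short sketch. This is not the study's code: the formula below is the commonly cited form of Patel's exponential model, and the helper name `growth_velocity`, the example weights, and the day range are assumptions for illustration only.

```python
import math

def growth_velocity(weight_start_g, weight_end_g, day_start, day_end):
    """Approximate exponential (logarithmic) growth velocity in g/kg/day.

    Sketch of the commonly cited form of Patel's exponential model:
    GV = 1000 * ln(W_end / W_start) / (day_end - day_start).
    Which weights and days to use is an assumption here, not the
    study's operational definition.
    """
    return 1000.0 * math.log(weight_end_g / weight_start_g) / (day_end - day_start)

# A hypothetical infant growing from 1000 g to 1600 g over 38 days:
gv = growth_velocity(1000, 1600, 0, 38)

# The study dichotomizes at the sample median of 12.4 g/kg/day:
meets_median = gv >= 12.4
```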
We applied CPQCC standard operational definitions for all independent variables, including gender, weight for gestational age below the 10th percentile, birth outside a regional center, and Cesarean birth. Gestational age at birth was categorized into 25 weeks to 27 weeks, 6 days; 28 weeks to 29 weeks, 6 days; and 30 weeks or more gestation groups, based on similar patient numbers among groups. Apgar score was categorized as 3 or below, between 4 and 6, and above 6.
Because some NICUs will have higher morbidity and mortality rates simply because they care for sicker infants, we developed risk-adjustment models20–22 for all measures, except the eye examination (which is a process that should be performed on all infants independent of illness severity at birth; Section 1 in the Supplemental Web Appendix gives additional details on the risk-adjustment model). Variables included in these models include a combination of prenatal care, gestational age at birth, small for gestational age status, multiple birth, cesarean delivery, inborn or outborn, and 5-minute Apgar score.17
We used the Draper-Gittoes21 method of risk adjustment, which has long been used successfully in the UK higher education system. With this method, a standardized z score is constructed that is suitable for combining via unweighted or weighted averaging. These z scores should be approximately normally distributed with mean 0 and SD 1. Additional details are available in the Supplemental Web Appendix, Section 2.
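As an illustration of the standardization step, a generic observed-versus-expected z score can be sketched as follows. The actual Draper-Gittoes computation builds the expectation and its variance from patient-level risk models (details are in the paper's appendix); the function name and example rates below are assumptions.

```python
def z_score(observed_rate, expected_rate, expected_sd):
    """Standardize a NICU's risk-adjusted performance on one measure.

    Generic observed-vs-expected standardization: positive z means
    better than expected for this NICU's case mix. The Draper-Gittoes
    method differs in how the expectation and SD are constructed.
    """
    return (observed_rate - expected_rate) / expected_sd

# A NICU whose antenatal-steroid rate of 0.90 exceeds its
# case-mix-expected rate of 0.84 (SD 0.03) scores z = +2:
z = z_score(0.90, 0.84, 0.03)
```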
We adopted an equal weighting scheme for the base case. In sensitivity analyses, we explored a variety of weighting schemes based on expert opinion. Our panel of experts14 was asked to distribute 100 imaginary dollars across the measures, according to the relative contribution of each measure to overall NICU quality (see Table 1). Mean and median weights derived from this exercise were applied in sensitivity analyses.
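The conversion of the experts' $100 allocations into normalized mean and median weights might look like the following sketch; the allocations shown are hypothetical, not those of the study panel (which appear in Table 1).

```python
from statistics import mean, median

def normalized_weights(allocations):
    """Turn each expert's $100 allocation across measures into
    mean- and median-based weight sets that each sum to 1.

    `allocations` maps measure name -> list of dollar amounts,
    one per expert.
    """
    means = {m: mean(v) for m, v in allocations.items()}
    medians = {m: median(v) for m, v in allocations.items()}
    mean_total = sum(means.values())
    median_total = sum(medians.values())
    return (
        {m: w / mean_total for m, w in means.items()},
        {m: w / median_total for m, w in medians.items()},
    )

# Three hypothetical experts weighting two measures:
mean_w, median_w = normalized_weights({
    "mortality": [40, 50, 30],
    "steroids": [10, 10, 20],
})
```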
Aggregation and Discrimination
Measures were aggregated by averaging the 9 z scores for each NICU. The 95% confidence intervals were computed via bootstrapping23 (a simulation in which each NICU’s patients are resampled with replacement 1000 times) and are plotted in Fig 1; failure of 2 such intervals to overlap corresponds to a highly statistically significant difference. This criterion was used to discriminate between NICUs. In addition, we assigned star ratings to groups of NICUs depending on whether their entire confidence interval fell below 0 (3 stars), overlapped 0 (4 stars), or fell entirely above 0 (5 stars).
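The bootstrap intervals and star assignment can be sketched as below, assuming patient-level composite scores are available per NICU. This is a plain percentile bootstrap, which may differ in detail from the study's implementation.

```python
import random

def bootstrap_ci(patient_scores, n_boot=1000, seed=0):
    """95% percentile-bootstrap CI for a NICU's mean score:
    resample its patients with replacement n_boot times."""
    rng = random.Random(seed)
    n = len(patient_scores)
    means = sorted(
        sum(rng.choice(patient_scores) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

def star_rating(ci_low, ci_high):
    """3 stars if the whole CI lies below 0, 5 if above, else 4."""
    if ci_high < 0:
        return 3
    if ci_low > 0:
        return 5
    return 4
```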
We investigated the effects of our methodological choices on the composite score by varying measure weights and by alternative methods of measure aggregation. In the base case all measures were equally weighted and averaged; this is an easily understood format. However, different approaches are possible, and if they result in substantially different performance assessments, might be favored on theoretical grounds.
We tested several alternative weighting schemes, including the mean and median weights derived by our group of experts (median weights are less prone to outlier opinions). In addition, rather than adding the z scores, we assigned a rank to each NICU on each measure and then summed the ranks across measures. This method may provide more separation between NICUs, although one must bear in mind that a 1-place difference in rank could reflect either a large or a small difference in z scores.
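The rank-based alternative can be sketched as follows; ties are ignored for simplicity, and how the study handled them is not stated.

```python
def rank_sum_scores(z_by_nicu):
    """Aggregate by ranking NICUs on each measure and summing ranks.

    `z_by_nicu` maps NICU id -> list of z scores in a fixed measure
    order. Rank 1 is worst on a measure, so higher totals are better.
    Ties are broken arbitrarily (a simplification).
    """
    nicus = list(z_by_nicu)
    n_measures = len(next(iter(z_by_nicu.values())))
    totals = {u: 0 for u in nicus}
    for m in range(n_measures):
        ordered = sorted(nicus, key=lambda u: z_by_nicu[u][m])
        for rank, u in enumerate(ordered, start=1):
            totals[u] += rank
    return totals
```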
We also explored a different aggregation method by multiplying rather than adding the z scores. Geometric (multiplicative) aggregation is theoretically appealing. Whereas linear addition of measures allows NICUs to fully trade off low performance in 1 measure with high performance in another, multiplicative aggregation allows only for partial compensation. Consider a scenario including 2 NICUs and 2 quality measures. Suppose NICU A achieves a score of 1 on 1 measure and a score of 9 on the other, whereas NICU B achieves a score of 5 on both. Under additive aggregation both NICUs perform equally (9 + 1 = 5 + 5). However, the extreme performance of NICU A results in a much lower rating under multiplicative aggregation (9 × 1 = 9 vs 5 × 5 = 25). Multiplicative aggregation is thus intriguing for settings in which policymakers aim to promote broad standards of care.
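The two-NICU scenario above can be verified directly. Note that multiplying raw z scores is not meaningful when scores can be negative or zero; in practice the scores must first be shifted to a positive scale, a detail this sketch flags but does not resolve.

```python
def additive(scores):
    """Linear aggregation: full compensation across measures."""
    return sum(scores)

def geometric(scores):
    """Multiplicative aggregation: only partial compensation.
    Assumes scores are positive (z scores would need shifting first)."""
    prod = 1.0
    for s in scores:
        prod *= s
    return prod

nicu_a = [9, 1]  # extreme: one very high, one very low measure
nicu_b = [5, 5]  # consistent performance on both measures
# additive(nicu_a) == additive(nicu_b) == 10, but
# geometric(nicu_a) == 9 < geometric(nicu_b) == 25
```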
We assessed robustness (our use of the term robustness in this article is synonymous with the term stability) of NICU performance under these different scenarios with the base case using both Pearson and Spearman rank correlation coefficients. Correlation coefficients >0.7 imply strong correlation.24 See Supplemental Web Appendix, Section 3, for additional details.
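The robustness check reduces to correlating the NICUs' composite scores from the base case with each alternative. A stdlib-only sketch of both coefficients (Spearman computed as Pearson on ranks; ties ignored for simplicity):

```python
from statistics import mean, stdev

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

def spearman(x, y):
    """Spearman rank correlation = Pearson applied to ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    return pearson(ranks(x), ranks(y))
```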
We evaluated stability over time of the base case as follows: the measures were generated separately using data from 2004 to 2007 and 2008 to 2010, and the results were compared. Parry24 showed that mortality alone was not a good indicator of quality in that NICUs tended to bounce between top and bottom performance. Because we do expect some drift in quality of care over time, our analysis aimed to find moderate correlations of performance between the 2 time periods.
Twenty-two NICUs met the inclusion criteria in 2004 to 2007 and 21 in 2008 to 2010; the analyses were conducted on these units. First, we conducted Pearson and Spearman correlation analyses across the 2 time periods. Second, we applied the nonparametric Wilcoxon signed-rank test to examine the extent, if any, of the temporal differences. We also counted the NICUs that changed quartiles between the 2 time periods. Finally, we computed the κ statistic for the quartiles.
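The quartile-agreement statistic can be sketched as Cohen's kappa on paired quartile assignments. Whether the study used simple or weighted kappa is not stated, so this unweighted version is an assumption.

```python
def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two categorical ratings of the
    same units, e.g. each NICU's performance quartile in 2004-2007
    (list a) vs 2008-2010 (list b)."""
    n = len(a)
    cats = set(a) | set(b)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in cats)
    return (observed - expected) / (1 - expected)
```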
Human Subjects Compliance
This study was approved by the CPQCC and the Baylor College of Medicine Internal Review Board.
Table 2 shows the population and NICU characteristics for the combined sample, as well as for the 2 study periods 2004 to 2007 and 2008 to 2010. Of note is the improvement in absolute and risk-adjusted components of the Baby-MONITOR between the time periods in all measures, except for the rate of occurrence of pneumothoraces (which roughly held constant).
Performance on Individual Measures of Quality
Table 3 shows the standardized z scores for the clinical measures, with units ordered with regard to ascending composite score. The variation in performance within and between these regional NICUs is notable (see Supplemental Web Appendix, Section 4): it would be difficult to draw inferences on overall performance based on any individual measure of quality.
Base Case Baby-MONITOR
The base-case composite indicator was derived by averaging the z scores from each NICU; measures were assigned equal weights. Performance on the Baby-MONITOR is not an absolute marker of quality but indicates overall performance relative to that of the other NICUs (Fig 1). NICUs can evaluate their absolute performance by investigating the composite’s individual components.
Several observations can be made regarding California regional NICUs:
1. Considerable variation, evinced by non-overlapping confidence intervals, exists.
2. The composite scores for NICU V are significantly better than those for NICUs A to Q.
3. The scores for NICUs A to G are below 0 (lower than expected), include 0 for NICUs H to N, and are above 0 (better than expected) for NICUs O to V.
A classification system was derived based on item 3 above, in which NICUs A to G are assigned 3 stars, NICUs H to N are assigned 4 stars, and NICUs O to V are assigned 5 stars. Special recognition was awarded to NICU V as the top performer, with its composite score exceeding the upper limit of the next best NICU’s 95% confidence interval.
Figure 2 presents composite scores for the 22 NICUs based on 5 methods of weighting and aggregation across outcomes, including the base case (equal weights), mean and median expert weights, the base case using ranks, and the geometric mean (Supplemental Web Appendix Table B provides additional detail). Results are presented as ranks and show that the base case exhibits a high degree of stability. The Pearson and Spearman correlations between the base case and the other 4 approaches in Supplemental Table B varied from 0.89 to 0.99.
However, the figure also juxtaposes interesting characteristics of multiplicative aggregation in the geometric composite vis-à-vis additive aggregation in the base case. Note that NICU K, ranked 12th in the base case, was second lowest using the geometric approach. Table 3 shows that this NICU had the lowest score in the “no hypothermia” measure and the highest in the “timely eye exam” measure; because the geometric composite rewards stable performance across all measures, NICU K is unable to compensate for low performance on 1 measure with good results on others.
Robustness of the Baby-MONITOR Over Time
The Pearson correlation between base-case composites derived from 2004 to 2007 and 2008 to 2010 data was 0.67; the Spearman (rank) correlation was 0.74. The P value for the Wilcoxon signed-rank test was .68, consistent with the null hypothesis of no difference between the 2 time periods. Of the 21 NICUs, 3 changed quartiles in the positive direction and 3 in the negative direction. The κ statistic was 0.43, indicating moderate agreement.25
We present the first iteration of the Baby-MONITOR, a composite indicator of neonatal intensive care quality provided to VLBW infants. The development of the Baby-MONITOR followed a formal, stepwise, and explicit process that has been peer reviewed and is widely applied in health and non-health settings.3,26–28
In previous work, we developed a theoretical framework for the Baby-MONITOR,4,15 selected measures of quality using rigorous methods,14 validated the selection,16 conducted initial data analyses, and developed risk models for individual measures.17,29 In this study, we aggregated the measures, assessed the composite’s ability to discriminate among NICUs, and conducted extensive sensitivity analyses with regard to weighting and aggregation. We found the Baby-MONITOR to be a robust measure and able to discriminate overall quality of care delivery.
Composite measurement of quality is becoming more prominent in health care and is being used to support consumer choices. Already, composites such as the US News and World Report hospital rankings exert great influence on hospital strategic planning and marketing. Whether they accurately discriminate higher from lower performing hospitals is less evident. In fact, several studies have highlighted the variable nature of performance ratings according to different methods.30,31 Such divergence can be addressed only by adopting explicit and transparent standards for indicator development.3,32
In this study, we demonstrated high correlation with alternative methods of composite construction. We therefore retained the base case with equal weighting and additive aggregation. Clinicians may think that equal weighting poorly reflects quality priorities in the NICU, as few would consider mortality on par with hypothermia on admission. However, an equal weighting scheme is supported by the literature, which shows that unit weights would have to differ dramatically to substantially affect performance assessments.33 Figure 2 confirms this literature and supports our decision for equal weighting, as it demonstrates little effect of various weighting schemes on NICU performance.
With regard to aggregation, we decided against grouping measures into sub-dimensions despite our previous research showing that the 9 measures assessed only 4 latent factors.17 Grouping would add to the composite’s complexity and require decisions about weighting within and between sub-indicators. If replicated in larger datasets we may revise the Baby-MONITOR, but for this initial iteration, we selected a simpler format.
We were intrigued by the results generated by the geometric composite. In the base case, the additive computation allows a high score in 1 measure to cancel a low score in another; this compensation does not occur in the geometric composite. In our sample, NICU K would have been classified as a 3-star rather than a 4-star NICU under multiplicative aggregation, revealing its quality deficits in several domains. Arguably, from the standpoint of policymakers, achievement of a certain performance benchmark across all aspects of care, as promoted by the multiplicative composite, may be more desirable. A multiplicative approach to aggregation may also align better with the premise of composite measurement, namely its role in incentivizing systems-based, multidimensional improvement. However, before a recommendation to use multiplicative rather than additive aggregation can be made, these findings require affirmation in a larger, more diverse sample of NICUs, as well as validation against other indicators.
Users of the Baby-MONITOR must respect its limitations and recognize that this initial iteration is merely a first step toward comprehensively measuring NICU quality of care delivery. Interpretation of composite ratings must take into account that absolute differences may be small or not statistically significant. This is a particular concern for rank-based performance assessments that include only point estimates of ranks. We therefore chose to present the Baby-MONITOR results as a caterpillar plot, and for further user-friendliness included a star rating based on normative criteria (ie, overall performance falling below, meeting, or exceeding expectations), even though simple star ratings may overstate quality differences and must be interpreted with caution. We suggest that the composite be used to generate system-based improvement efforts that “lift the boat” on multiple measures simultaneously. Such efforts should accompany, not replace, traditional quality improvement efforts.
One important concern for the Baby-MONITOR is validity, that is, does it measure overall NICU quality? Support for its validity has several sources. Content validity (the measure represents all facets of the underlying construct) was conferred by a panel of independent experts in the original measure selection process14 and strengthened by a sample of recognized clinicians.16 Construct validity (the composite actually measures what it is supposed to measure) is supported by formally including each measure’s reliability, validity, importance, scientific soundness, and usability in the selection process. In addition, each component of the Baby-MONITOR achieves statistical separation of NICUs and so does the composite. Nevertheless, given the absence of a gold standard comparison, additional research is needed to further solidify construct validity, including comparison with other measures of quality. We are currently investigating convergent validity of the Baby-MONITOR and NICU safety culture. In addition, future research will need to address whether performance on the Baby-MONITOR correlates with long-term infant outcomes (predictive validity).
For reasons of sample size, we combined 4 years of data to generate the initial estimates for the Baby-MONITOR. For a composite to meet the needs of clinicians at the frontline, additional research will need to focus on generating real-time estimates of the composite results based on moving averages or Bayesian updating methods.
Finally, although our current analysis is restricted to regional, or mostly level 4, NICUs, recent improvements in data collection, with linkage of outcomes across hospitals and post-discharge outcomes, will allow us to generalize assessment beyond the regional NICUs to the larger universe of lower level NICUs in California and beyond.
We present the first iteration of the Baby-MONITOR and display information regarding its ability to discriminate quality of care and robustness to different assumptions in its construction and over time.
We thank Aloka Patel and Rush University Medical Center for granting Dr Profit a nonexclusive license to use Rush’s exponential infant growth model for noncommercial research purposes.
- Accepted March 24, 2014.
- Address correspondence to Jochen Profit, MD, MPH, Perinatal Epidemiology and Health Outcomes Research Unit, Division of Neonatology, Department of Pediatrics, Stanford University School of Medicine, MSOB Room x115, 1165 Welch Rd, Stanford, CA 94305. E-mail:
Dr Profit acquired funding for this study, conceptualized and designed the study, selected data for inclusion in analyses, analyzed the data, assisted with interpretation of the results, and drafted the initial manuscript; Dr Kowalkowski analyzed the data, assisted with interpretation of the results, and revised the manuscript; Dr Zupancic conceptualized and designed the study, selected data for inclusion in analyses, assisted with interpretation of the results, and revised the manuscript; Dr Pietz helped acquire funding for the study, conceptualized the study, analyzed the data, assisted with interpretation of the results, and revised the manuscript; Dr Richardson analyzed the data, assisted with interpretation of the results, and revised the manuscript; Dr Draper assisted with designing the analysis and interpretation of the results and revised the manuscript; Dr Hysong conceptualized and helped design the study, assisted with interpretation of the results, and revised the manuscript; Dr Thomas helped acquire funding for the study, conceptualized and helped design the study, assisted with interpretation of the results, and revised the manuscript; Dr Petersen helped acquire funding for the study, conceptualized and helped design the study, assisted with interpretation of the results, and revised the manuscript; Dr Gould helped acquire funding for the study, conceptualized and helped design the study, selected data for inclusion in analyses, assisted with interpretation of the results, and revised the manuscript; and all authors approved the final manuscript as submitted.
Dr Profit and Dr Pietz had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.
At the time of the research, Dr Profit was on faculty at Baylor College of Medicine, Texas Children’s Hospital, Department of Pediatrics, Section of Neonatology. He held a secondary appointment in the Department of Medicine, Section of Health Services Research and conducted his research at the VA Health Services Research and Development Center of Excellence where he collaborated with Dr Kowalkowski.
FINANCIAL DISCLOSURE: Drs Profit, Zupancic, and Gould have served in consultant roles to the Vermont Oxford Network NICQ 7 and 8 Quality Improvement Collaboratives. The other authors have indicated they have no financial relationships relevant to this article to disclose.
FUNDING: Dr Profit’s contribution is supported in part by the Eunice Kennedy Shriver National Institute of Child Health and Human Development grant 1 K23 HD056298 (Principal Investigator: Dr Profit). Dr Petersen was a recipient of the American Heart Association Established Investigator Award (0540043N) at the time this work was conducted. Drs Petersen and Hysong also receive support from a Veterans Administration Center Grant (VA HSR&D CoE HFP90-20). Dr Hysong’s contribution is supported in part by the Department of Veterans Affairs Health Services Research and Development Program (CD2-07-0181). Dr Thomas’s effort was supported in part by grants from the Eunice Kennedy Shriver National Institute of Child Health and Human Development grant 1 K24 HD053771-01 (Principal Investigator: Dr Thomas). Funded by the National Institutes of Health (NIH).
POTENTIAL CONFLICT OF INTEREST: Dr Gould is the Principal Investigator for the California Perinatal Quality Care Collaborative. Dr Profit is a researcher at the California Perinatal Quality Care Collaborative. The other authors have indicated they have no potential conflicts of interest to disclose.
- NQF. Composite Measure Evaluation Framework and National Voluntary Consensus Standards for Mortality and Safety–Composite Measures: A Consensus Report. Washington, DC: National Quality Forum; 2009
- Medicare. Hospital Compare. 2013 [cited March 31, 2013]. Available at: www.medicare.gov/hospitalcompare/
- Kelley ET, Hurst J. Health Care Quality Indicator Project—conceptual framework. 2006;23. Available at: www.oecd.org/dataoecd/1/36/36262363.pdf
- Profit J, Gould JB, Draper D, Zupancic JA, Kowalkowski MA, Woodard L, et al. Variations in definitions of mortality have little influence on neonatal intensive care unit performance ratings. J Pediatr. 2013;162:50–55.e2
- Copyright © 2014 by the American Academy of Pediatrics