Abstract
Background. Risk adjustment for severity of illness is frequently used in clinical research and quality assessment. Although multiple methods have been designed for neonates, they have been infrequently compared, and some have not been assessed in large samples of very low birth weight (VLBW; <1500 g) infants.
Objectives. To test and compare published neonatal mortality prediction models, including Clinical Risk Index for Babies (CRIB), Score for Neonatal Acute Physiology (SNAP), SNAP-Perinatal Extension (SNAP-PE), Neonatal Therapeutic Intervention Scoring System, the National Institute of Child Health and Human Development (NICHD) network model, and other individual admission factors such as birth weight, low Apgar score (<7 at 5 minutes), and small for gestational age status in a cohort of VLBW infants from the Washington, DC area.
Methods. Data were collected on 476 VLBW infants admitted to 8 neonatal intensive care units between October 1994 and February 1997. The calibration (closeness of total observed deaths to the predicted total) of models with published coefficients (SNAP-PE, CRIB, and NICHD) was assessed using the standardized mortality ratio. Discrimination was quantified as the area under the curve (AUC) for the receiver operating characteristic curves. Calibrated models were derived for the current database using logistic regression techniques. Goodness-of-fit of predicted to observed probabilities of death was assessed with the Hosmer-Lemeshow goodness-of-fit test.
Results. The calibration of published algorithms applied to our data was poor. The standardized mortality ratios for the NICHD, CRIB, and SNAP-PE models were .65, .56, and .82, respectively. Discrimination of all the models was excellent (range: .863–.930). Surprisingly, birth weight performed much better than in previous analyses, with an AUC of .869. The best models using 12-hour and 24-hour postadmission data significantly outperformed the best model based on birth data only but were not significantly different from each other. The variables in the best model were birth weight, birth weight squared, low 5-minute Apgar score, and SNAP (AUC = .930).
Conclusion. Published models for severity of illness overpredicted hospital mortality in this set of VLBW infants, indicating a need for frequent recalibration. Discrimination for these severity of illness scores remains excellent. Birth variables should be reevaluated as a method to control for severity of illness in predicting mortality.
- severity of illness index
- neonatal mortality
- risk adjustment
- infant
- very low birth weight
- intensive care
- neonatal
Severity of illness assessment has been important for a wide range of pediatric, neonatal, and adult uses, including quality assessment, controlling for severity of illness in clinical studies, and studies of resource utilization and management.1–3 National programs such as the ORYX initiative of the Joint Commission on Accreditation of Healthcare Organizations are premised on controlling evaluations for severity of illness.4 Although severity of illness is a familiar medical concept, it is sometimes difficult to assess. In the context of intensive care, a rational and objective way to define and quantify severity of illness is through the development of probabilistic models predicting mortality risk.5 Such predictive models have been developed for all age groups.6–11
Severity measurements in neonatal intensive care have traditionally used birth weight and Apgar scores, but the relationship between mortality and these parameters has been insufficiently precise to use for quality assessment. Further, the rapid evolution of neonatal care has made the relationships between mortality and these variables unstable. In 1993, the Score for Neonatal Acute Physiology (SNAP),9 the SNAP-Perinatal Extension (SNAP-PE),10 and the Clinical Risk Index for Babies (CRIB)11 scores were proposed for use in assessing severity with sufficient precision to allow an expansion of the applications to include quality assessment. SNAP uses the worst recorded values of more than 2 dozen routinely measured physiologic variables during the first 24 hours of stay; SNAP-PE supplements the SNAP with additional scoring for birth weight, small for gestational age (SGA) status, and low Apgar score (<7 at 5 minutes). CRIB uses information on base excess and oxygen requirements during the first 12 hours of life, as well as birth weight, gestational age, and congenital malformations. Although these models have been shown to perform better than birth weight alone in predicting hospital mortality, each suffers from certain limitations. SNAP and SNAP-PE were developed with a relatively small representation of very low birth weight (VLBW; <1500 g) infants, the group in which most deaths occur and most medical advancements have been made. CRIB was developed before widespread use of surfactant and is highly dependent on respiratory status. The Neonatal Therapeutic Intervention Scoring System (NTISS) assesses severity based on the intensity of therapeutic intervention during the first 24 hours of stay, using information on 62 specific therapeutic interventions. Finally, a model dependent on variables collected before admission to the neonatal intensive care unit (NICU) has been proposed by the National Institute of Child Health and Human Development (NICHD) neonatal research network.8 Although this NICHD model, based on birth weight, SGA status, gender, race (black vs other), and 1-minute Apgar ≤3, has the advantage of better separating therapy from severity of illness measurement, the published performance has been substantially worse than the CRIB or SNAP models.
We evaluated the relative ability of severity measures to discriminate between hospital deaths and survivors in a large group of VLBW infants admitted to 8 NICUs in Washington, DC. Because most neonatal mortality occurs in this group, assessment of quality using measures such as standardized mortality ratios (SMRs) must depend on reliable predictors. The measures, examined singly or in combination using models developed from our data, included birth weight, low Apgar score (<7 at 5 minutes), the NICHD neonatal network's preadmission score, CRIB, SNAP, SNAP-PE, and NTISS. Because separating severity assessment from therapy as much as possible is beneficial in quality assessment models, we also tested the performance of the SNAP and SNAP-PE scores derived from the first 12 hours of care. These analyses address the tradeoffs in discriminatory power with respect to issues of complexity and duration of scoring. We also compared our mortality experience to that predicted by published logistic regression equations for the NICHD neonatal network's preadmission model, SNAP-PE, and CRIB to test whether they are sufficiently calibrated in VLBW infants.
METHODS
In 1992, NICHD, in conjunction with the National Institutes of Health, Office of Research on Minority Health and the National Institute of Nursing Research, sponsored the cooperative agreement: “The National Institutes of Health DC Initiative to Reduce Infant Mortality in Minority Populations in the District of Columbia.” The purpose of the 5-year initiative was to develop coordinated projects designed to better understand the reasons for the high rate of infant mortality in the District of Columbia and to design and evaluate intervention projects aimed at reducing the number of infants in the District of Columbia at increased risk of dying in their first year of life. As part of this initiative, 9 of the 10 NICUs in the District of Columbia caring for low birth weight infants agreed to study the various aspects of neonatal intensive care. Eight of the NICUs provided acute care, and the ninth provided chronic care only. All protocols were approved by the institutional review boards of each site, as well as by the National Institutes of Health.
Eligibility criteria for inclusion in the study included the following: 1) live birth, with birth weight from 500 to 1499 g; 2) birth to a resident of the District of Columbia; 3) care in a participating NICU or intermediate/step-down unit; and 4) date of birth and date of first admission to a participating NICU during the enrollment period, October 1, 1994 to February 19, 1997. These criteria implicitly excluded delivery room deaths. Exclusion criteria included: 1) neonates not expected to survive attributable to extreme prematurity (only comfort care given) or inevitably lethal malformations; 2) neonates admitted after discharge to home; 3) infants who died before 2 hours of age; and 4) infants transferred from a nonparticipating hospital after 24 hours of age.
Data were collected prospectively from November 13, 1995 through April 30, 1997, and retrospectively from October 1, 1994 through November 12, 1995. All prospective data were abstracted from the clinical records using a process designed for each site to ensure complete data collection. Retrospective data collection used the existing medical records. Clinical data relevant to this report included maternal information (age, socioeconomic information, prenatal care, antenatal steroids, and multiple gestation), infant data (gender, race, gestational age, and congenital malformations), delivery room status (birth weight, SGA status, Apgar scores, and mode of delivery), transportation status, and physiologic data required for the SNAP and CRIB (mean blood pressure, heart rate, respiratory rate and temperature, blood gas data, blood chemistries and blood counts, urine output, presence of seizures, apnea, stool guaiac, base excess, and minimum and maximum fraction of inspired oxygen [Fio2] values). In addition, therapeutic and monitoring data for the NTISS were collected for the first 24 hours.12 Physiologic and NTISS data were collected separately for the first and second 12 hours after admission. Physiologic measurements in the 2 hours before death were excluded from data collection for the SNAP score. Transfers between participating NICUs were tracked to provide a complete record of the birth hospitalization until the occurrence of either in-hospital death or discharge to home, non-NICU chronic care, or a nonparticipating hospital. If infants reached boarder status (ie, retained in the hospital for nonmedical reasons), they were considered discharged. Daily and outcome data collection for the project as a whole were more extensive than described here.
Before data collection, each site was visited and general and site-specific data collection procedures were developed, including a detailed, site-specific operations manual. In addition to the project's principal investigator, project coordinator, field manager, and data collectors, each site had a staff nurse assigned as a facilitator to help with data collection and a staff physician to serve as a site liaison to the principal investigator. Data collectors were registered nurses who were oriented in depth to the data-recording methods at each site. Each research nurse covered multiple sites to ensure consistency in data collection during periods of absence of the other data collectors. Data were entered directly onto computers, which applied immediate error checks, and were transmitted via modem to the data management center each evening. The completeness of the sample was assessed using the NICU and delivery room logs.
Excluded from this analysis were infants who were transferred to or among participating hospitals during the first 24 hours of life, because the protocol's data collection forms precluded assigning components of the severity of illness scores in less than 12- or 24-hour blocks. This effectively excluded outborn infants and reduced the number of acute care NICUs where infants originated to 7. The characteristics of these 7 NICUs are shown in Table 1. Excluded from the analysis of hospital mortality were infants who were missing any of the severity of illness measures. Infants who were transferred to a nonparticipating hospital were assumed to have survived the birth hospitalization.
NICUs (n = 7)
The different methods proposed for calculating neonatal severity of illness, including the NICHD network score based on preadmission variables,8 the SNAP score in the form of the SNAP-PE (best-fit model),10 and the CRIB score13 have published formulas that predict mortality. To assess how well each of these formulas actually predicted total mortality (ie, calibration) in our sample, we computed the ratio of the observed to the predicted total number of hospital deaths and, using a z test,14,15 examined whether this standardized mortality ratio (SMR) was significantly different from 1. It is also important to assess the ability of the individual severity of illness scores to distinguish between neonates who are likely to die in the hospital and those likely to survive (ie, discrimination). To do this, one can calculate for each score the true-positive rate (of the total number of hospital deaths, the proportion identified to be at high risk) and the false-positive rate (of the total number of survivors, the proportion identified to be at high risk). The magnitude of these quantities depends on the threshold value of the score that defines high risk. A less stringent threshold value would increase sensitivity but decrease specificity. This tradeoff is illustrated by receiver operating characteristic (ROC) curves, which plot the true-positive rate against the false-positive rate for different threshold values. Points on the diagonal line from (0,0) to (1,1) indicate an equal chance of being labeled high risk for both deaths and survivors. The higher the ROC curve is above this chance line, the better the discrimination. The area under the ROC curve (AUC) can be viewed as a quantitative measure of this discrimination, ranging from .5 for chance discrimination to 1.0 for perfect discrimination. The AUC can also be interpreted as the probability that a randomly selected hospital death has a higher severity score than a randomly selected hospital survivor. We assessed the discriminatory ability of individual scores by calculating their AUCs, a measurement that is model-free and depends only on the ability of the scores to rank infants in increasing likelihood of hospital death. We also calibrated each score to our data by fitting a logistic regression with the score as the single predictor variable. Estimating the best intercept and slope in the logistic regression ensures that the predicted total number of deaths in the sample equals the observed total.
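As an illustration of these two performance measures, the sketch below (in Python rather than the SAS used for the original analysis) computes an SMR with an approximate z test and the AUC for a single severity score. The variable names and the binomial variance used for the z test are illustrative assumptions, not necessarily the exact formulation of references 14 and 15.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

def smr_with_z_test(died, predicted_prob):
    """Standardized mortality ratio (observed/expected deaths) with an
    approximate z test of the null hypothesis SMR = 1.  `died` is a 0/1
    array; `predicted_prob` holds a published model's predicted death
    probabilities."""
    observed = died.sum()
    expected = predicted_prob.sum()
    smr = observed / expected
    # Binomial variance of the expected count -- a common approximation,
    # not necessarily the exact formulation of references 14 and 15.
    var_expected = np.sum(predicted_prob * (1.0 - predicted_prob))
    z = (observed - expected) / np.sqrt(var_expected)
    return smr, z, 2.0 * stats.norm.sf(abs(z))

def discrimination_auc(died, score):
    """AUC: the probability that a randomly chosen death has a higher
    severity score than a randomly chosen survivor."""
    return roc_auc_score(died, score)

# Illustrative use with simulated data (a deliberately miscalibrated model
# can still discriminate well, mirroring the pattern reported in this study).
rng = np.random.default_rng(0)
score = rng.normal(size=500)
died = (rng.random(500) < 1.0 / (1.0 + np.exp(-(score - 2.0)))).astype(int)
prob = 1.0 / (1.0 + np.exp(-(score - 1.0)))
print(smr_with_z_test(died, prob))       # SMR well below 1: overprediction
print(discrimination_auc(died, score))   # AUC close to 1: good discrimination
```

A well-calibrated model yields an SMR close to 1; an SMR below 1 indicates that the model overpredicts deaths, the pattern found for the published models in this sample.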
However, the regression model may still systematically overpredict for some illness severities and underpredict for others. We applied the Hosmer-Lemeshow goodness-of-fit test (PROC LOGISTIC, SAS Institute, Cary, NC)16,17 to determine whether there was systematic underestimation or overestimation of mortality depending on the severity of illness. The Hosmer-Lemeshow test does this by comparing the observed number of deaths with the predicted number of deaths within each of 10 approximately equally sized subgroups of infants, as ordered by the predicted probabilities of death. Ruttimann18 provides a useful review of the concepts of logistic regression, calibration, discrimination, ROC analysis and AUCs, and predictive accuracy/goodness-of-fit.
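A minimal Python sketch of the decile-of-risk Hosmer-Lemeshow statistic described above, assuming a 0/1 death indicator and model-predicted probabilities as inputs (SAS PROC LOGISTIC produces this test directly; the sketch is for illustration only):

```python
import numpy as np
from scipy import stats

def hosmer_lemeshow(died, predicted_prob, groups=10):
    """Decile-of-risk Hosmer-Lemeshow test: compare observed and expected
    deaths in `groups` bins of infants ordered by predicted probability of
    death.  Inputs are numpy arrays; `died` is 0/1."""
    order = np.argsort(predicted_prob)
    died, predicted_prob = died[order], predicted_prob[order]
    chi2 = 0.0
    for idx in np.array_split(np.arange(len(died)), groups):
        obs = died[idx].sum()                 # observed deaths in the bin
        exp = predicted_prob[idx].sum()       # expected deaths in the bin
        n = len(idx)
        # Contribution of the bin (covers both the death and survival cells)
        chi2 += (obs - exp) ** 2 / (exp * (1.0 - exp / n))
    df = groups - 2        # conventional degrees of freedom for a fitted model
    return chi2, stats.chi2.sf(chi2, df)
```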
To explore the performance of combinations of severity measures, we assessed their discrimination (AUC) and predictive accuracy (Hosmer-Lemeshow goodness-of-fit test) using logistic regression models. Because the sample size was insufficient to validate the various models on an independent sample, the analysis focused on predictors and their combinations that could be used to control for severity of illness in this sample. A model for each predictor (birth weight, NICHD preadmission score, CRIB, SNAP, SNAP-PE, and NTISS) was fit using an intercept and a linear term; nonlinearity was tested through the addition of a quadratic term. Significant (P < .05) quadratic terms were retained unless they became nonsignificant in subsequent modeling. Terms for birth weight, low Apgar score (<7 at 5 minutes, after the formulation in the SNAP-PE), and SGA were added to each model unless their information was already contained in the score (ie, NICHD preadmission score, CRIB, and SNAP-PE). These additional terms were then removed as indicated by backward elimination using P < .05 to remain in the model. We compared the AUCs from the different models by using a test that makes few model assumptions and is valid for correlated AUCs.19
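The modeling strategy can be illustrated as follows. This is a sketch of one candidate model (quadratic birth weight, low 5-minute Apgar, and 24-hour SNAP), not the authors' SAS code; the data frame columns (`bw_g`, `apgar5_lt7`, `snap24`, `died`) are hypothetical names.

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

def fit_severity_model(df):
    """Fit one candidate logistic model: quadratic birth weight, low 5-minute
    Apgar, and 24-hour SNAP.  Column names are hypothetical."""
    X = pd.DataFrame({
        "bw_kg": df["bw_g"] / 1000.0,
        "bw_kg_sq": (df["bw_g"] / 1000.0) ** 2,   # quadratic term for nonlinearity
        "apgar5_lt7": df["apgar5_lt7"].astype(float),
        "snap24": df["snap24"].astype(float),
    })
    X = sm.add_constant(X)
    fit = sm.Logit(df["died"].astype(float), X).fit(disp=0)
    auc = roc_auc_score(df["died"], fit.predict(X))
    return fit, auc

# Backward elimination as described above: inspect fit.pvalues, drop any added
# term with p >= .05, and refit until all remaining terms are significant.
```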
Descriptive statistics are reported as means ± standard deviation, medians, and proportions. All P values for single parameters are 2-sided.
RESULTS
The cohort consisted of 552 infants. Exclusions for inevitably lethal malformations (n = 0), infants given comfort care only (n = 10), deaths before 2 hours of life (n = 2), transfers from nonparticipating sites after 24 hours of age (n = 10), and missing charts (n = 8; no deaths) left a total study population of 522 infants. In addition, 46 infants who were transferred into or among the participating hospitals in the first 24 hours of life were excluded from the data analysis, leaving 476 infants born to 450 women. Of these, 250 infants were in the prospective cohort and 226 in the retrospective cohort. All infants were admitted to the NICU within 2.4 hours of birth (mean: .37 ± .30 hours). Complete follow-up data were available for 465 infants; 6 infants discharged to nonparticipating hospitals at ages ranging from 6 to 88 days of life were assumed to have survived, as were 5 infants with ages ranging from 85 to 159 days who were still in the hospital when the study ended. Eight infants (1 death) were excluded from the mortality analysis because missing values for 1 or more Apgar scores precluded the calculation of the NICHD score or the SNAP-PE. For the construction of the CRIB score, minimum and maximum appropriate Fio2 values were unavailable for 52 infants because none of the blood gases had a partial pressure of arterial oxygen in the appropriate range (50–80 mm Hg).
One of the investigators (M.M.P.) assigned point scores for minimum and maximum appropriate Fio2 based on clinical assessment of the pattern of partial pressure of arterial oxygen and Fio2 values for available blood gas measurements. Alternative analyses (data not shown) that excluded infants with missing CRIB components gave qualitatively similar results to those reported below. Race was not recorded for 7 infants; given the racial composition of our population, we computed the NICHD score for these infants as if they were black.
The mothers were mostly black (90.3%) and single (78.9%); 13.4% were <20 years old, and 19.5% had received no prenatal care. Antenatal steroids were received by 71.2% of mothers. Infant characteristics are shown in Table 2. The infants had a mean birth weight of 1048 ± 285 g and a mean gestational age of 28.6 ± 2.8 weeks, and 20.4% were categorized as SGA. The distributions of weight and gestational age are also shown in Table 2. Almost half of the neonates were delivered by cesarean section. Low Apgar scores were common, with 27.9% having Apgar scores ≤3 at 1 minute and 24.8% having Apgar scores <7 at 5 minutes.
Infant Descriptive Characteristics (n = 476)
The severity of illness data are shown in Table 3. The mean 1-minute and 5-minute Apgar scores were 4.9 ± 2.3 and 7.3 ± 1.8, respectively. The mean CRIB score was 5.2 ± 5.1. The intensity of therapy and monitoring as measured by the NTISS was 19.5 ± 7.4. By definition, most of the components of SNAP and SNAP-PE scores for the first 12 hours cannot exceed their 24-hour counterparts, and this is reflected in the mean difference between the 12-hour and 24-hour scores (SNAP: 11.8 ± 7.6 vs 13.5 ± 8.3; SNAP-PE: 23.6 ± 19.5 vs 25.3 ± 20.2). The mean (24-hour) SNAP and SNAP-PE scores (13.5 and 25.3) were similar to previously reported mean scores (13.9 and 24.6) for infants weighing <1500 g.20
Severity of Illness Measures (n = 476)
The preestablished predictive equations significantly overpredicted hospital mortality in this sample (Table 4). There were 66 observed hospital deaths. In contrast, the preadmission model proposed by the NICHD network predicted 102.1 deaths (SMR = .65; 95% confidence interval [CI] = .50, .79; P < .0001), the CRIB model predicted 117.3 deaths (SMR = .56; 95% CI = .46, .67; P < .0001), and the SNAP-PE model predicted 80.2 deaths (SMR = .82; 95% CI = .68, .97; P = .017).
Performance of Published Neonatal Mortality Risk Predictors (n = 468; Observed Deaths = 66)
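The reported SMRs can be checked directly from the observed and predicted totals; the short calculation below reproduces them, using a rough Poisson-type interval that only approximates the published confidence limits.

```python
# Observed deaths and model-predicted totals reported in Table 4
observed = 66
predicted = {"NICHD": 102.1, "CRIB": 117.3, "SNAP-PE": 80.2}

for name, expected in predicted.items():
    smr = observed / expected
    # Rough Poisson-type standard error on the observed count; the published
    # z test and CIs (refs 14,15) may use a different variance.
    se = observed ** 0.5 / expected
    print(f"{name}: SMR = {smr:.2f} "
          f"(approximate 95% CI {smr - 1.96 * se:.2f} to {smr + 1.96 * se:.2f})")
```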
Selected results for discrimination and goodness of fit are shown in Table 5. Birth weight alone was a very good discriminator (AUC = .869), but its goodness-of-fit was poor, with observed numbers of deaths deviating significantly from those predicted (P = .0005). However, the addition of a quadratic birth weight term achieved an excellent fit at the cost of a slight decrease in discrimination. No other quadratic terms were retained, and SGA was not predictive in this sample. The term for Apgar score <7 at 5 minutes was significant in every model in which it was included. Adding this term improved the performance of the quadratic birth weight model to an AUC of .892, also with excellent goodness-of-fit. The AUC for this model (birth weight, its quadratic term, and Apgar score <7 at 5 minutes) was equivalent to the performance of the CRIB score and was higher than the AUC for the preadmission score proposed by the NICHD network, even when the latter was enhanced by adding the 5-minute Apgar score term.
Discriminatory Ability and Goodness-of-Fit for Fitted Models (n = 468)
Figure 1 illustrates the fit of the linear (birth weight alone) and quadratic (birth weight and birth weight squared) models to the observed mortality, by deciles of birth weight. The model with birth weight alone overpredicted mortality for the second through fifth deciles (birth weights between 631 and 1070 g) and tended to underpredict for the remainder. The improvement in fit attributable to the inclusion of the birth weight-squared term is clear, especially for the lower birth weights where most of the deaths occur.
Observed and predicted mortality by birth weight decile groups.
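Figure 1 can be reproduced from individual-level data roughly as follows; `bw_g` and `died` are hypothetical column names, and `linear_pred`/`quadratic_pred` are the predicted death probabilities from the two fitted models.

```python
import pandas as pd

def mortality_by_bw_decile(df, linear_pred, quadratic_pred):
    """Observed versus model-predicted mortality within birth weight deciles,
    as plotted in Figure 1.  `df` has hypothetical columns `bw_g` and `died`;
    the two prediction arrays come from the linear and quadratic models."""
    decile = pd.qcut(df["bw_g"], 10, labels=False)
    return pd.DataFrame({
        "observed": df.groupby(decile)["died"].mean(),
        "linear_model": pd.Series(linear_pred, index=df.index).groupby(decile).mean(),
        "quadratic_model": pd.Series(quadratic_pred, index=df.index).groupby(decile).mean(),
    })
```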
Within the group of models employing 12-hour data, ∼.01 increments in the AUC were achieved between CRIB, CRIB enhanced with the low 5-minute Apgar term, the 12-hour SNAP-PE, and the best discriminating 12-hour model that included quadratic birth weight, low Apgar, and 12-hour SNAP (AUC = .920). All models displayed satisfactory goodness-of-fit, as did the models using 24-hour data. The AUC for 24-hour SNAP-PE (.916) was similar to that of the best 12-hour model, and the best 24-hour model (quadratic birth weight, low Apgar, and 24-hour SNAP) achieved an AUC of .930. Models containing NTISS performed slightly less well than did those with 24-hour SNAP.
There were no significant differences among AUCs within the groups (birth data, birth plus 12-hour data, and birth plus 24-hour data) shown in Table 5, although the difference in AUCs for CRIB versus the best discriminating 12-hour model approached significance at P = .056. Table 6 displays P values for the comparison of AUCs among predefined severity of illness measures. The AUCs for the 12-hour and 24-hour SNAP-PE scores were significantly different from those for birth weight (P = .016 and .0050) and the NICHD score (P = .0043 and .0019), but were not significantly different from the AUC for CRIB (P = .30 and .088) or each other (P = .21). The AUC for CRIB did not significantly differ from that of birth weight (P = .29) or the NICHD score (P = .35).
P Values for Comparison of AUCs Among Severity of Illness Measures (n = 468)
Figure 2 shows the empirical ROC curves for birth weight alone and for the best discriminating birth, 12-hour, and 24-hour models. The curves are similar for the best 12- and 24-hour models, which use SNAP in addition to quadratic birth weight and Apgar score <7 at 5 minutes. Both dominate the curves for the models based on birth data only. Table 7 provides P values for the comparison of AUCs among these fitted models. The best discriminating 12- and 24-hour models had significantly better AUCs than the models using birth weight alone (P = .010 and .0056) or quadratic birth weight and Apgar score <7 at 5 minutes (P = .058 and .035), but did not differ significantly from each other (P = .20).
ROC curves for hospital mortality for birth weight and best discriminatory models using birth, 12-hour, and 24-hour data (n = 468).
P Values for Comparison of AUCs Among BW and Best Discriminatory Models Using Birth, 12-Hour, and 24-Hour Data (n = 468)
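The published comparisons use the nonparametric test for correlated AUCs of reference 19. As a stand-in that conveys the same idea, the sketch below bootstraps the paired difference in AUCs by resampling infants jointly, which preserves the correlation between the two scores; it is not the test used in the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def paired_bootstrap_auc_diff(died, score_a, score_b, n_boot=2000, seed=0):
    """Bootstrap the difference in AUCs of two scores measured on the same
    infants.  Resampling cases jointly preserves the correlation between the
    scores.  Inputs are numpy arrays; `died` is 0/1."""
    rng = np.random.default_rng(seed)
    n = len(died)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if died[idx].min() == died[idx].max():
            continue                      # need both deaths and survivors
        diffs.append(roc_auc_score(died[idx], score_a[idx]) -
                     roc_auc_score(died[idx], score_b[idx]))
    diffs = np.array(diffs)
    # Two-sided bootstrap p value for the null hypothesis of equal AUCs
    p = 2.0 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return diffs.mean(), min(p, 1.0)
```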
DISCUSSION
Neonatal severity of illness measures have several important uses. First, they are valuable in comparing outcomes across hospitals or NICUs. Accurate and reliable measures of severity of illness are required for unbiased and reliable comparisons such as those involving benchmarking or comparative quality of care studies. Second, they can serve to control for population differences when performing studies such as clinical trials, outcome evaluations, and evaluations of resource utilization. Severity measures must perform satisfactorily if they are to fulfill these roles. Interinstitutional comparisons and benchmarking are especially sensitive to the performance of the severity of illness measures because comparisons of the number of observed outcomes to the number of predicted outcomes are central to the interpretation of the institutional performance. Mortality is generally the outcome used to validate the severity of illness measure because mortality is clearly defined and occurs with sufficient frequency for use in prediction models.
Models developed to measure neonatal severity of illness have improved in calibration and discriminatory power. Yet questions remain concerning the applicability and validity of these methods. Are they applicable to all neonatal populations? For example, the scores have been developed in specific geographic regions, and the proportions of low birth weight and extremely low birth weight infants differed in each of the populations used to develop them. The rapid medical evolution in perinatal and neonatal care could have altered the relationship between mortality and the scores' predictor variables. Measurement bias may be present because most scores require the collection of a broad spectrum of physiologic measurements that may not be routinely collected in each NICU at the same level of intensity. Lead-time bias may be present because of transport issues and differing patterns of delivery room care. Each of these issues has the potential to limit the utility of these instruments.21
We compared the calibration and discriminatory power of existing severity of illness scales with respect to hospital mortality in VLBW infants in a contemporary urban population. Our assessments were substantially more detailed than other assessments of their performance,22–24 providing comparisons of discrimination and goodness-of-fit for all well-established indices of neonatal severity of illness and important birth variables, as well as calibration of published prediction equations. Of note, all existing prediction models were poorly calibrated for this population. Each overpredicted the number of deaths by at least 20%, with standardized mortality ratios ranging from .56 to .82. Some of the issues detailed above could account for this poor performance. Specifically, this study population received care in an environment of widespread surfactant (88%) and antenatal steroid (71%) use, both of which have significantly reduced neonatal mortality in VLBW infants.25,26 Such differences may have challenged the calibration of previously developed illness severity models for hospital death. Additionally, the population consisted predominantly of urban black (90.3%) infants, who have lower neonatal mortality rates in each individual low birth weight category.27 Despite this, the NICHD score, which includes race as a component, did not perform significantly better than the other models. Its development in an inclusive population of VLBW newborns using preadmission factors may have contributed to its poor performance in this population, which was intentionally limited to NICU admissions and excluded infants who were given only comfort care as well as early (<2 hour) deaths.
Surprisingly, relatively simple predictors discriminated extremely well in this population. The single variable of birth weight achieved very good discrimination (AUC = .869). Previous studies8,10,13,22,24 have not found such strong performance, with AUCs for birth weight alone ranging from .72 to .83. Goodness-of-fit for our associated logistic model was poor (P = .0005). A similar lack of fit for birth weight alone has been found previously.8 The addition of a quadratic birth weight term achieved satisfactory goodness-of-fit (P = .89) with little loss in discrimination (AUC = .863). Previous research conducted across all birth weights also indicated that the logistic relationship for birth weight is nonlinear.10 Including Apgar score <7 at 5 minutes as an additional predictor increased the AUC to .892, a value comparable to the CRIB score and better than the score developed by the NICHD network. Similar discriminatory power (AUC = .87) was previously observed for a birth data model incorporating birth weight, gestational age, and 5-minute Apgar score.24 However, none of the differences in AUCs among birth data models reached statistical significance.
Little discriminatory ability was lost when using the 12-hour version of SNAP-PE, compared with its 24-hour counterpart, and the difference was not statistically significant. Using different criteria, a qualitatively similar conclusion has recently been published.23 This small loss should be contrasted with the potential benefit of reducing the influence of therapy on the score. Among 12-hour predictors, the AUC for CRIB was somewhat lower than that of the 12-hour SNAP-PE but did not differ significantly. This loss should be contrasted with the advantage in simplicity of the CRIB. The discriminatory performance of CRIB, however, did not differ significantly from that of birth weight or the NICHD score. Confirmation of the intermediate performance of CRIB, as opposed to a true performance more similar to birth scores or to 12-hour SNAP-PE, would require a larger sample.
Similar patterns were seen for the best discriminating models: 12-hour and 24-hour best models (quadratic birth weight, low Apgar at 5 minutes, and relevant version of SNAP) had significantly higher AUCs than the best birth data model (SNAP removed) but did not differ significantly from each other. Because these models use coefficients derived from the current data to weight the contributions of individual terms, AUCs for these models may be somewhat overestimated compared with their performance in a validation sample.28 Consequently, comparisons between these models and the performance of the predefined severity of illness measures may be overstated to some extent and should be interpreted with caution.
Severity of illness methods are now commonly used in intensive care for children and adults. Adjustments for severity of illness as well as other case-mix variables are required for comparisons of mortality rates and efficiency of care measures. Our results suggest that use of physiologic variables may not be necessary in adjusting for severity of illness in VLBW infants. Relatively simple and commonly available data including birth weight and Apgar scores should be reexamined for their utility in these endeavors.
ACKNOWLEDGMENTS
This work was supported in part by a grant from the National Institute of Child Health and Human Development and the National Institutes of Health, Office of Research on Minority Health.
The study could not have been completed without the collaboration of the District of Columbia Neonatal Network, whose members include: Murray M. Pollack, MD; Billie Short, MD; Kantilal M. Patel, PhD; Julie Ziegler, MA; Joyce Williams, RN; Doris Bartel, MSN, from Children's National Medical Center; Kenneth Harkavy, MD, from the Columbia Hospital for Women; Michal Young, MD, from DC General Hospital; Annette Heiser Ficker, MD, from Hospital for Sick Children; Fariborz Rahbar, MD, and Davene White, RN, NNP, from Howard University Hospital; Ayman A. E. El-Mohandes, MD, from George Washington University Medical Center; K. N. Siva Subramanian, MD, and Ramasubbareddy Dhanireddy, MD, from Georgetown University Medical Center; Maria Paz Ruiz, MD, from Providence Hospital; John P. Grausz, MD, from Washington Hospital Center; Nancy Taplin McCall, ScD, from Health Economics Research, Inc; and Vijaya Rao, PhD, and Matthew A. Koch, MD, PhD, from Research Triangle Institute.
We gratefully acknowledge the significant contributions of the following people: LaSonji Holman, Margaret Crosby, Gloria Seymour, Priscilla Johnson, Linda Font, Karol Duffy, and Kate Collins Wooddell for their work on abstracting the records; Sheilia O'Brien for field management; Patricia Higdon, Jean Gilroy, Eva M. Bell, and Linda L. Ivey as site coordinators; Pamela A. Angelus, Elizabeth Estrada Jarosz, Susan Novosel, Lisa Wright, Chita Taylor, Maria J. Floyd, Mary Ellen Lynch, Judith Stark, Jane Devine as nurse facilitators; Connie L. Hobbs, Donna Hewitt for data form design and manual of operations; Scott E. Schaefer and Margo F. Brinkley for data entry, data management, and reporting systems; Arthur Macaraeg, Mhairi MacDonald, Antoine Fomufod, and Susan McCabe, for the help with the design of the protocol.
This study was part of the National Institutes of Health DC Initiative to Reduce Infant Mortality in Minority Populations in the District of Columbia and was funded by The National Institutes of Health Office of Research on Minority Health and the National Institute of Child Health and Human Development. The following institutions were participants in the National Institutes of Health DC Initiative to Reduce Infant Mortality in Minority Populations in the District of Columbia: Children's National Medical Center: P. Scheidt and M. Pollack (principal investigators); DC Department of Public Health: B. Hatcher (principal investigator); DC General Hospital: L. Johnson (principal investigator); Georgetown University Medical Center: K. N. Sivasubramanian (principal investigator); Howard University: B. Wesley (principal investigator); University of the District of Columbia: V. Melnick (principal investigator); Research Triangle Institute: V. Rao (principal investigator); and National Institute of Child Health and Human Development: H. Berendes (program officer), A. Herman (scientific coordinator), and B. Wingrove (program coordinator).
Footnotes
- Received January 19, 1999.
- Accepted August 9, 1999.
Reprint requests to (M.M.P.) Department of Critical Care Medicine, Children's National Medical Center, 111 Michigan Ave NW, Washington, DC 20010. E-mail: mpollack@cnmc.org
- SNAP = Score for Neonatal Acute Physiology
- SNAP-PE = SNAP Perinatal Extension
- CRIB = Clinical Risk Index for Babies
- SGA = small for gestational age
- VLBW = very low birth weight
- NTISS = Neonatal Therapeutic Intervention Scoring System
- NICU = neonatal intensive care unit
- NICHD = National Institute of Child Health and Human Development
- Fio2 = fraction of inspired oxygen
- SMR = standardized mortality ratio
- ROC = receiver operating characteristic
- AUC = area under the ROC curve
- CI = confidence interval
REFERENCES
- Copyright © 2000 American Academy of Pediatrics