Prediction of Death for Extremely Low Birth Weight Neonates
Objective. To compare multiple logistic regression and neural network models in predicting death for extremely low birth weight neonates at 5 time points with cumulative data sets, as follows: scenario A, limited prenatal data; scenario B, scenario A plus additional prenatal data; scenario C, scenario B plus data from the first 5 minutes after birth; scenario D, scenario C plus data from the first 24 hours after birth; scenario E, scenario D plus data from the first 1 week after birth.
Methods. Data for all infants with birth weights of 401 to 1000 g who were born between January 1998 and April 2003 in 19 National Institute of Child Health and Human Development Neonatal Research Network centers were used (n = 8608). Twenty-eight variables were selected for analysis (3 for scenario A, 15 for scenario B, 20 for scenario C, 25 for scenario D, and 28 for scenario E) from those collected routinely. Data sets censored for prior death or missing data were created for each scenario and divided randomly into training (70%) and test (30%) data sets. Logistic regression and neural network models for predicting subsequent death were created with training data sets and evaluated with test data sets. The predictive abilities of the models were evaluated with the area under the curve of the receiver operating characteristic curves.
Results. The data sets for scenarios A, B, and C were similar, and prediction was best with scenario C (area under the curve: 0.85 for regression; 0.84 for neural networks), compared with scenarios A and B. The logistic regression and neural network models performed similarly well for scenarios A, B, D, and E, but the regression model was superior for scenario C.
Conclusions. Prediction of death is limited even with sophisticated statistical methods such as logistic regression and nonlinear modeling techniques such as neural networks. The difficulty of predicting death should be acknowledged in discussions with families and caregivers about decisions regarding initiation or continuation of care.
Extremely low birth weight (ELBW) infants continue to have a disproportionately high mortality rate, compared with larger, more mature infants, despite advances in perinatal and neonatal care.1–3 Decisions regarding initiation or continuation of support, as well as decisions regarding aggressiveness of management options, are difficult for many of these infants, and guidelines have been developed to assist with clinical management4 and counseling of families5 around the time of birth. These guidelines are dependent on the best estimate of gestational age. However, many prenatal and postnatal factors associated with outcomes (eg, multiple gestation, Apgar scores, birth weight, gender, and prenatal steroid use)3,6 modify the risk of death for individual neonates. The risk of death also changes with postnatal age, because ELBW neonates who survive beyond the first days of life have a higher likelihood of survival.7 Therefore, during consideration of management options and counseling at different times, such as just before birth, just after birth, and in the days after birth, it is necessary to take these additional factors and the preceding clinical course into account. Prediction of death is also useful for auditing or benchmarking, comparison of outcomes among NICUs, controlling for population differences during clinical trials, and evaluation of resource utilization.3,8,9
Clinical intuition, scoring systems such as the Score for Neonatal Acute Physiology (SNAP),10 regression analyses,3,11 and nonlinear statistical models such as neural networks11,12 have been evaluated previously for the prediction of death but have not been shown to have sufficiently high sensitivity and specificity for clinical purposes. Neural networks, more properly called “artificial neural networks,” are nonparametric, pattern-recognition techniques that can recognize complex nonlinear relationships or “hidden patterns” between independent and dependent variables, as well as possible interactions between independent variables.13,14 It is possible that, with a sufficiently large sample size and high-quality data, novel and clinically important models using either regression models or neural networks for the prediction of death among extremely premature neonates can be developed.
The aim of this study was to develop and to compare multiple logistic regression and neural network models for the prediction of ELBW death in multiple scenarios at different time points, using only prenatal data (with either a limited or expanded set of variables) or adding data available soon after birth, data available after completion of the first 24 hours of life, or data available at the end of the first 1 week of life. It was hypothesized that the best prediction models would be those using data from just after birth, because most of the deaths occur in the first days of life and are associated with variables known soon after birth. It was also hypothesized that nonparametric, pattern-recognition techniques such as neural networks would prove superior to standard logistic regression models.
Study Centers and Population
Data for all live-born infants with birth weights of 401 to 1000 g who were born between January 1, 1998, and April 9, 2003, and admitted to the 19 centers of the National Institute of Child Health and Human Development Neonatal Research Network were included in this study. The data analyzed are collected routinely and systematically, stored in a database, and used for surveillance of the care and outcomes of high-risk infants in NICUs. The identity of the patients is kept highly confidential. The collection of data for the Neonatal Research Network was approved by the institutional review boards of the participating institutions. All network centers are tertiary care centers.
Data Collection and Analysis
All statistical analyses were performed at the Research Triangle Institute (Research Triangle Park, NC). Twenty-eight variables were selected from the database for analysis on the basis of the existing literature, which indicated that these variables were associated with death among premature infants (Table 1). All continuous (eg, birth weight) and logical (eg, gender) data variables were used unaltered, whereas ordinal data (eg, Apgar scores) were converted to categorical variables (eg, Apgar score at 5 minutes of >6: yes or no).
Five data sets were created to reflect 5 time points (scenarios), as follows: scenario A, limited prenatal data using only 3 variables; scenario B, scenario A plus additional prenatal data (to determine whether additional variables improved predictive ability); scenario C, scenario B plus data obtained 5 minutes after birth; scenario D, scenario C plus data obtained at 24 hours of life; scenario E, scenario D plus data obtained at 7 days of age. Scenario A had 3 variables, whereas scenario B had 15 variables (the 3 variables of scenario A plus 12 additional variables) (Table 1). Scenario C had 20 variables (the 15 of scenario B plus 5 additional variables), whereas scenario D had 25 (20 of scenario C plus 5 additional variables) and scenario E had 28 (25 of scenario D plus 3 additional variables) (Table 1). The data sets were censored for prior death and missing data, so that only infants who survived to 24 hours were included in scenario D and those who survived to 1 week were included in scenario E. The 5 data sets (1 for each scenario) were each divided into 10 pairs of training (70%) and test (30%) data sets, by assigning observations randomly to the training and test data sets (Table 2). Logistic regression and neural network models were created in S-PLUS (Insightful Corp, Seattle, WA) with training data sets, and mortality probabilities were calculated for test data. The neural network models for all 5 scenarios were back-propagation models with sigmoid transformation using 1 hidden layer with 6 nodes. This process of development and testing of the models was repeated with each of the 10 replicate data sets, and the results were averaged. 
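The replicate design described above (10 random 70%/30% splits per scenario, with a back-propagation network of 1 hidden layer and 6 sigmoid nodes) can be sketched roughly as follows. This is only an illustration in Python, not the S-PLUS code used in the study; the function names, the random seed, and the weight layout are assumptions introduced here, and no model fitting is shown.

```python
import math
import random


def random_splits(n_obs, n_reps=10, train_frac=0.7, seed=0):
    """Return n_reps (train_idx, test_idx) pairs from independent random shuffles.

    The seed is an illustrative assumption so that the splits are reproducible.
    """
    rng = random.Random(seed)
    n_train = round(n_obs * train_frac)
    splits = []
    for _ in range(n_reps):
        idx = list(range(n_obs))
        rng.shuffle(idx)
        splits.append((sorted(idx[:n_train]), sorted(idx[n_train:])))
    return splits


def sigmoid(z):
    """Logistic (sigmoid) transformation."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a network with 1 hidden layer of sigmoid nodes.

    w_hidden holds one weight row per hidden node (6 rows for 6 nodes);
    the return value is a predicted probability of death.
    """
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


if __name__ == "__main__":
    # Scenario A had 8608 observations and 3 input variables.
    splits = random_splits(8608)
    print(len(splits), len(splits[0][0]), len(splits[0][1]))

    rng = random.Random(1)
    n_inputs, n_hidden = 3, 6
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
    b_hidden = [0.0] * n_hidden
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    print(forward([0.2, 1.0, 0.0], w_hidden, b_hidden, w_out, 0.0))
```

In such a design, each model would be fitted on the training indices of a split and scored on the corresponding test indices, and the 10 test-set results would then be averaged.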
The predictive abilities of the regression and neural network models were compared by using the area under the curve (AUC) of the receiver operating characteristic (ROC) curves, calculated with the method described by Hanley and McNeil.15 ROC curves plot sensitivity versus 1 − specificity; the more the AUC approaches 1, the greater is the predictive value. The matched-pair t test (SAS, Cary, NC) was used to compare the AUC for the logistic regression analysis with that for the neural network.
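As a rough illustration of this evaluation step, the sketch below computes the AUC as a Wilcoxon-type statistic, its standard error with the formula commonly attributed to Hanley and McNeil, and a matched-pair t statistic across replicate AUC values. It assumes that the cited method is this rank-based formulation; the function names are hypothetical, and the study itself used S-PLUS and SAS rather than Python.

```python
import math


def auc_mann_whitney(scores_died, scores_survived):
    """AUC as the probability that an infant who died receives a higher
    predicted risk than one who survived (ties count as one half)."""
    wins = 0.0
    for d in scores_died:
        for s in scores_survived:
            if d > s:
                wins += 1.0
            elif d == s:
                wins += 0.5
    return wins / (len(scores_died) * len(scores_survived))


def hanley_mcneil_se(auc, n_died, n_survived):
    """Standard error of the AUC (Hanley-McNeil approximation)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc * auc / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_died - 1) * (q1 - auc * auc)
        + (n_survived - 1) * (q2 - auc * auc)
    ) / (n_died * n_survived)
    return math.sqrt(variance)


def paired_t(auc_a, auc_b):
    """Matched-pair t statistic for two equally long lists of AUC values,
    e.g. regression versus neural network over the replicate data sets."""
    diffs = [a - b for a, b in zip(auc_a, auc_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)


if __name__ == "__main__":
    # Made-up predicted risks, for illustration only.
    died = [0.9, 0.7, 0.6, 0.4]
    survived = [0.5, 0.3, 0.2, 0.2, 0.1]
    auc = auc_mann_whitney(died, survived)
    print(auc, hanley_mcneil_se(auc, len(died), len(survived)))
```

The t statistic would be referred to a t distribution with n − 1 degrees of freedom (9 for 10 replicates) to obtain a P value, which is not computed here.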
The total observations for each scenario ranged from 8608 for scenario A to 5973 for scenario E, because of censoring for prior death or missing data (Table 2). The median birth weight for the study population for scenario A was 735 g (mean: 735 g; SD: 158 g), the median gestation was 25 weeks (mean: 25.5 weeks; SD: 2.3 weeks), 43% of patients were non-Hispanic black (range: 5–84% by center), 50% of patients were male, and 82% of patients received mechanical ventilation (range: 72–96% by center). When the total study population was considered, 14.3% of patients had died by 24 hours, 22.4% by 7 days, and 35% by discharge. Although similar numbers of infants were analyzed for scenarios A, B, and C, death between birth and 24 hours and death between 24 hours and 7 days of age reduced the sample sizes significantly for scenarios D and E, respectively. The infants with missing data (mostly because of nonrecording of ≥1 variable) were comparable to those with recorded data.
To assess the discrimination and calibration of the models, AUCs and Hosmer-Lemeshow statistics were calculated for the training and test sets. We noticed little discrepancy between the training and test sets. For most models, the AUC was slightly higher for the training set. For scenarios A and D, however, the AUC was higher for the test set, although the values were within the confidence interval of the training set AUC (data not shown); this might be expected because of the large sample size, which makes the models quite robust. The Hosmer-Lemeshow statistic was good only for scenarios D (statistic = .12) and E (statistic = .2) and was poor for scenarios A, B, and C (statistic < .01), indicating significant differences between model-predicted and observed values.
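For readers unfamiliar with the Hosmer-Lemeshow approach, the following sketch shows one common way to compute the goodness-of-fit statistic, by grouping observations by predicted risk and comparing observed with expected deaths in each group. The choice of 10 groups and the function name are assumptions made here for illustration; the article does not state how the groups were formed.

```python
def hosmer_lemeshow(probs, outcomes, n_groups=10):
    """Hosmer-Lemeshow chi-square statistic.

    probs: predicted probabilities of death; outcomes: 1 = died, 0 = survived.
    Observations are sorted by predicted probability and cut into n_groups
    groups of roughly equal size (an illustrative assumption).
    """
    pairs = sorted(zip(probs, outcomes), key=lambda t: t[0])
    n = len(pairs)
    stat = 0.0
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue
        n_g = len(group)
        expected = sum(p for p, _ in group)   # expected number of deaths
        observed = sum(y for _, y in group)   # observed number of deaths
        mean_p = expected / n_g
        denom = n_g * mean_p * (1.0 - mean_p)
        if denom > 0:
            stat += (observed - expected) ** 2 / denom
    return stat


if __name__ == "__main__":
    # Made-up values, for illustration only.
    probs = [0.1, 0.2, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9]
    outcomes = [0, 0, 1, 0, 1, 0, 1, 1]
    print(hosmer_lemeshow(probs, outcomes, n_groups=4))
```

The statistic is usually compared with a chi-square distribution with (number of groups − 2) degrees of freedom; a small P value suggests that predicted and observed values differ. The P value calculation is omitted from this sketch.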
The models for scenarios A, B, and C could be compared with each other because the data sets were similar, but a direct comparison of these models with scenarios D and E was not possible because the data sets were dissimilar. Model C had a larger AUC, compared with models A and B (Table 3). The multiple logistic regression model had a larger AUC than the neural network, indicating better predictive ability, for scenario C, but the models had similar AUC values for other scenarios (Table 3). Although the AUC was statistically significantly greater in the regression model for scenario C, the magnitude of the difference (difference: 0.01) is unlikely to be clinically relevant. Larger magnitudes of differences in AUC values (difference: 0.07–0.09) between the regression and neural network models for scenarios D and E were not statistically significant because the variation was greater. Although neural networks produced better predictions and had excellent Hosmer-Lemeshow goodness-of-fit statistics with the training sets, they failed to produce better predictions with the test data.
The models were compared at 50% and 90% sensitivity; these levels of sensitivity were chosen arbitrarily so that the models could be compared when a higher specificity (lower sensitivity) and a higher sensitivity are required. At 50% sensitivity (infants predicted to die of those who died), the regression models for scenarios A and B had a specificity (infants predicted to survive of those who survived) of 93%, a positive predictive value (PPV) (infants predicted to die who actually died) of 80%, and a negative predictive value (NPV) (infants predicted to survive who survived) of 78%, whereas scenario C had a specificity of 95%, a PPV of 84%, and a NPV of 79%. The model for scenario D had 89% specificity, 58% PPV, and 85% NPV, whereas that for scenario E had 90% specificity, 49% PPV, and 90% NPV.
At a higher sensitivity, the specificity and PPV naturally declined. At 90% sensitivity, the regression models for scenarios A and B had a specificity of 35%, a PPV of 43%, and a NPV of 87%. At the same sensitivity, the model for scenario C had a specificity of 55%, a PPV of 51%, and a NPV of 93%, whereas that for scenario D had 49% specificity, 35% PPV, and 94% NPV and that for scenario E had 43% specificity, 25% PPV, and 96% NPV.
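The comparison at fixed sensitivity can be illustrated with the sketch below, which lowers the probability cutoff until the target sensitivity is reached and then reports specificity, PPV, and NPV at that cutoff. The function name and the cutoff search are assumptions for illustration; the article does not describe how the operating points were located.

```python
def metrics_at_sensitivity(probs, died, target_sens):
    """Lower the cutoff until sensitivity reaches target_sens, then report metrics.

    probs: predicted probabilities of death; died: 1 = died, 0 = survived.
    Assumes at least one death and one survivor in the data.
    """
    n_pos = sum(died)
    n_neg = len(died) - n_pos
    tp = fp = 0
    cutoff = None
    for cutoff in sorted(set(probs), reverse=True):
        # Infants at or above the cutoff are predicted to die.
        tp = sum(1 for p, y in zip(probs, died) if p >= cutoff and y == 1)
        fp = sum(1 for p, y in zip(probs, died) if p >= cutoff and y == 0)
        if tp / n_pos >= target_sens:
            break
    fn, tn = n_pos - tp, n_neg - fp
    return {
        "cutoff": cutoff,
        "sensitivity": tp / n_pos,
        "specificity": tn / n_neg,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn) if (tn + fn) else None,
    }


if __name__ == "__main__":
    # Made-up predictions, for illustration only.
    probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    died = [1, 1, 0, 1, 0, 1, 0, 0]
    for target in (0.5, 0.9):
        print(target, metrics_at_sensitivity(probs, died, target))
```

With finite data the achieved sensitivity may exceed the target slightly, because only cutoffs that occur among the predicted probabilities are tried.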
The regression equation coefficients and odds ratios showed that the contributions of different variables to the outcome varied with the scenario (Table 4). It can also be seen that some of the variables used were not associated significantly with the outcome in the models (Table 4). For example, in scenario C, for which the logistic regression model performed best, the variables associated with a significantly lower risk of death were use of prenatal steroids, black race, older gestational age, presence of pregnancy-induced hypertension, higher birth weight, and higher 5-minute Apgar score (≥3) and the variables associated with a higher risk of death were higher center mortality rate, presence of prepartum hemorrhage, multiple births, and male gender (Table 4). Center mortality rate was considered an aggregated measure, and a multilevel modeling approach was not implemented for the sake of simplicity and consistency. Exploratory analyses were also performed (data not shown) with varying numbers of hidden layers for neural networks and stepwise selection of variables for regression models, but the increase in complexity of the models did not improve performance, indicating that the models were quite robust.
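The relationship between a logistic regression coefficient, its odds ratio, and the predicted probability can be sketched as follows. The odds ratio of 0.55 used in the example is a hypothetical value, chosen within the range of 0.5 to 0.6 per 100-g increase in birth weight quoted in the discussion; the baseline risk of 50% is likewise only illustrative, and the function names are assumptions.

```python
import math


def odds_ratio(beta, units=1.0):
    """Odds ratio for a change of `units` in a predictor with coefficient beta."""
    return math.exp(beta * units)


def apply_odds_ratio(prob, or_value):
    """Convert a baseline probability to the probability after multiplying
    the odds by or_value."""
    odds = prob / (1.0 - prob)
    new_odds = odds * or_value
    return new_odds / (1.0 + new_odds)


if __name__ == "__main__":
    beta_per_100g = math.log(0.55)  # hypothetical coefficient per 100 g
    for steps in (1, 2):
        or_value = odds_ratio(beta_per_100g, steps)
        risk = apply_odds_ratio(0.50, or_value)
        print(f"+{steps * 100} g: odds ratio={or_value:.3f}, risk={risk:.3f}")
```

Because odds ratios multiply, a 200-g increase corresponds to the square of the 100-g odds ratio, and the resulting change in absolute risk depends on the baseline risk.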
The identification of ELBW infants at high risk of death is of increasing importance, particularly because many of these infants are at high risk for neurodevelopmental impairment.10 The current study demonstrates that the ability to predict death is significantly better (both statistically and of a clinically relevant magnitude) at 5 minutes of age, rather than at or before birth with only prenatal data. However, the ability to predict death does not improve with increasing age among infants who avoid early death, because early variables do not have lingering effects. Also, the contribution of the different variables to subsequent death varies with the time period. Prediction with multiple logistic regression proved comparable to that with neural networks for most of the time periods.
There are important strengths to this study. The data sets evaluated in this study included many thousands of ELBW infants, making this the largest of any such prediction study to date. Infants from multiple level III centers in the United States were evaluated during a recent period in which mortality rates did not change significantly, making the results comparable to current clinical practice. In addition, the statistical models were developed with one data set and tested with another data set, which ensured that performance was assessed with data not used for model development. Developing and testing a model with the same set may lead to excellent performance with a high AUC for an overtrained model that may not be able to predict outcomes in a different data set.
There are also some limitations to this study. Only variables that already existed in the database could be used for analysis. Other variables that may be associated with death (eg, chorioamnionitis, timing of prenatal steroid therapy, fetal biophysical profile, and resuscitation variables such as parental or physician wishes regarding resuscitation) could not be evaluated because they were not part of the data collected. It must also be noted that the models for the different scenarios used different data sets, because infants who died before the scenario could not be considered for the prediction of subsequent death. Scenario A was approximately comparable to scenarios B and C (which were identical), and these scenarios included almost all live-born infants. However, scenario D included only infants who had survived to 24 hours of age, and scenario E included only infants who had survived to 1 week of age; therefore, scenarios D and E must be considered in isolation and not in comparison with scenario A, B, or C. Another limitation is that, in addition to prior death, a few infants were excluded because of missing data, which resulted in smaller data sets (mostly for scenarios D and E) than were accounted for by earlier death alone. It is known that ELBW infants are at highest risk of death in the first 3 days.7 Therefore, the overall likelihood of death was higher in scenarios A, B, and C and decreased with scenario D and further with scenario E. Because the PPV of a test also depends on the prevalence of the outcome in the population (with the same sensitivity and specificity, PPVs are low for rare outcomes and higher for common outcomes), the PPVs of these models diminish in the later scenarios, because mortality rates are lower after the immediate postnatal period.
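The dependence of the PPV on prevalence follows from Bayes' theorem and can be illustrated with a short sketch. The sensitivity of 50% and specificity of 95% are the values reported for scenario C; the prevalences of 20% and 10% are hypothetical, and using the overall 35% mortality by discharge as the prevalence for scenario C is an approximation made here, not a figure from the article's tables.

```python
def ppv_npv(sensitivity, specificity, prevalence):
    """Positive and negative predictive values from Bayes' theorem."""
    tp = sensitivity * prevalence                   # true positives per infant
    fp = (1.0 - specificity) * (1.0 - prevalence)   # false positives
    tn = specificity * (1.0 - prevalence)           # true negatives
    fn = (1.0 - sensitivity) * prevalence           # false negatives
    return tp / (tp + fp), tn / (tn + fn)


if __name__ == "__main__":
    # Same test characteristics, decreasing prevalence of death.
    for prevalence in (0.35, 0.20, 0.10):
        ppv, npv = ppv_npv(0.50, 0.95, prevalence)
        print(f"prevalence={prevalence:.2f}  PPV={ppv:.2f}  NPV={npv:.2f}")
```

At a prevalence of 35%, the formula gives a PPV of approximately 0.84, in line with the 84% reported for scenario C at 50% sensitivity; with the same sensitivity and specificity, the PPV falls as the outcome becomes rarer, whereas the NPV rises.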
Another limitation of this study is that statistical methods such as regression analysis or neural networks are not easy to use in the clinical setting. It would be possible to optimize the regression models by evaluating nonlinear relationships and interactions and incorporating them into the logistic regression models, but this would increase model complexity and might decrease clinical utility. Neural networks especially are considered a “black box,” the inner working of which is difficult to determine.
A limited number of prenatal variables (gestational age, race, and prenatal steroid use) performed as well as a larger collection of prenatal variables, which indicates that a parsimonious model is often preferable to a large model, although some of the additional included variables (tocolysis, antibiotic use, and singleton birth) were associated significantly with a lower probability of death in this scenario. At 5 minutes of age, the addition of birth weight, gender, and 5-minute Apgar score improved the model. As expected, there was a proportional increase in survival rates with increasing Apgar scores. The odds ratio for gestational age was less significant in scenario C, compared with scenario B, because part of its contribution to outcome was taken over by the inclusion of birth weight (with which gestational age is correlated strongly). The odds ratio per 100-g increase in birth weight is ∼0.5 to 0.6, which is highly statistically significant and clinically relevant. It is possible that, if the birth weight and gender of the fetus were determined prenatally with good accuracy, those factors could also be used as predictors in the prenatal period and could be used in discussions of mortality risk with the parents before birth, when discussions about resuscitation are held. Some variables (such as tocolysis and prenatal antibiotic use) that were significant in scenario B were no longer significant after birth. For infants who survived to 24 hours of age, the effects of race were less significant, whereas the effect of prenatal steroid use was diminished by the seventh day. The maximal oxygen concentrations at 24 hours and at 7 days were strong predictors of death, probably because they were good indicators of the underlying severity of the respiratory illness. 
Clinicians do not normally use mathematical equations and ROC curves to predict outcomes for individual neonates, but knowledge of these predictors and how their contributions vary over time may assist clinical judgment and influence decision-making.
A comparison of the current study with prior similar studies3,6,16 reveals some important similarities and a few differences. Horbar et al16 developed a logistic regression model for the prediction of death among very low birth weight neonates. As in our study, admission factors such as lower birth weight, male gender, and nonblack race were associated with higher mortality rates. However, clinical practices (prenatal steroid use and surfactant use) have changed significantly in the decade since the publication of that report, with corresponding improvements in mortality and morbidity rates for larger preterm infants. Tyson et al3 evaluated risk factors known at birth among infants with birth weights of 501 to 800 g who were born between 1994 and 1995. Female infants, small-for-gestational age infants, and infants whose mothers received prenatal steroids demonstrated improvements in survival rates if mechanical ventilation was administered.3 However, race and multiple birth were not associated significantly with death in that study, perhaps because only infants with birth weights of 501 to 800 g were evaluated and the sample size was smaller.3 Shankaran et al6 investigated, with logistic regression analysis, the risk factors for early death (<12 hours of age) among 5986 ELBW infants born between 1993 and 1997. 
Similar to our study, factors associated with early death were absence of prenatal steroid use, absence of tocolytic treatment, male gender, lower gestational age and birth weight, nonblack race, and absence of hypertension/preeclampsia.6 A very wide range of differences in care, including the use of delivery room intubation, ventilatory support, intravenous fluid administration, antibiotic treatment, pressor support, and surfactant therapy, was noted between centers for the infants who died early.6 It is probable that a similar explanation regarding different care practices is responsible for the center mortality rate being a determinant of ELBW death in the current study.
Meadow et al10 showed that the predictive ability of serial SNAP scores and clinical intuition for neonatal death declines with time. Even with the advantage of a large data set and sophisticated statistical techniques, predictive ability declined with time in our study, possibly because of a decrease in the number of rational or logical predictors. Variables that are major risk factors for death (eg, birth weight, gestational age, gender, race, and Apgar scores) are mostly determinants of early death and can be identified or measured easily, whereas the risk factors for later death (eg, sepsis, necrotizing enterocolitis, and bronchopulmonary dysplasia) are not well defined or cannot be determined sufficiently in advance, leading to an attenuation in predictive ability for death. Meadow et al10 also demonstrated the importance of prediction; infants who were predicted to die but who actually survived were at high risk (82%) of neurodevelopmental impairment.10 It is likely that infants with a higher probability of death would also have a higher probability of morbidity in the current study, and this will be investigated when follow-up data are available for these infants. Pollack et al9 compared Clinical Risk Index for Babies, SNAP, SNAP-Perinatal Extension, and other models with prenatal, birth, and first 12- and 24-hour data. The discriminatory ability of these models was very good. However, the same data set was used for development and testing of the models and larger infants (up to 1500 g) were evaluated, both of which would increase the apparent predictive ability. Another issue to consider with prediction models is the “self-fulfilling prophecy,” ie, if an infant is considered at high risk of death, then there may be a bias against provision of aggressive resuscitative measures. 
It is difficult to determine in these studies the extent to which death was attributable directly to the magnitude of the variable (eg, birth weight of 500 g) or to clinicians’ perceptions of that variable as a predictor of death (eg, less aggressive resuscitation of infants who were ≤500 g at birth).
Regression analysis has limitations in some clinical situations, because the relationships between independent and dependent variables may be nonlinear. Neural networks are nonparametric, pattern-recognition techniques capable of identifying hidden patterns and interactions.13,14 Cross et al13 provided an introduction to neural networks for clinicians, and Tu14 reviewed the advantages and disadvantages of neural networks versus regression models for predicting medical outcomes. Neural networks have been found to be suitable and superior to logistic regression for the prediction of outcomes for critically ill adult patients.17 There have been 2 single-center studies comparing multiple regression models with neural networks for the prediction of death among premature neonates.11,12 In 1 of those studies,12 which used admission data for prediction of death among very low birth weight infants, the neural network performed significantly better than the logistic regression; in the other,11 which used data from admission and the first 6 hours of life for prediction of ELBW death, the performance was equivalent. It is difficult to compare those single-center studies with our multicenter study, because variations in clinical practices might induce greater variance in the relationship of a variable (dependent on those clinical practices) to death. For example, infants who might have been removed from support within hours after birth at one institution might be more often resuscitated aggressively at another institution, which might lead to postponement of death beyond the first day or even survival to discharge with impairment. Therefore, it may be necessary for each center to develop and to test its own model for the prediction of outcomes.
One major implication of this study is that it may be better to postpone decisions about initiation of support or withdrawal of care until 5 minutes after birth, rather than making decisions before birth with only prenatal data, because immediate postnatal variables such as Apgar scores (reflecting status at birth), birth weight, and gender add significantly (20 percentage points higher specificity and 8 percentage points higher PPV at 90% sensitivity) to the ability to predict death. The other major implication is that it is difficult to predict death (or survival) for individual neonates with certainty. Clinicians and parents need to be aware of inherent biases and uncertainty in trying to foretell the future, especially when such predictions are used for clinical decision-making. It has been demonstrated that obstetricians and pediatricians who underestimate the possibility of survival of a neonate are less likely to resuscitate the neonate or to use mechanical ventilation, inotropes, or other standard therapies.18 Despite these limitations in prognostication, these predictive models indicate the contribution of the known major risk factors to death and are useful for the generation of hypotheses that can be tested in controlled trials (eg, the benefits of tocolysis and maternal antibiotic therapy in preterm labor and the effects on long-term outcomes of care practices responsible for variations in center mortality rates).
Financial support was provided by National Institutes of Health grants U10 HD27851, U01 HD36790, U10 HD21364, U10 HD34216, U10 HD27871, M01 RR06022, U10 HD27856, M01 RR00750, U10 HD27853, M01 RR08084, U10 HD34167, M01 RR02635, M01 RR02172, M01 RR01032, U10 HD21373, U10 HD27904, U10 HD21397, U10 HD21415, U10 HD21385, U10 HD40689, U10 HD27880, M01 RR00070, U10 HD27881, U10 HD40461, and M01 RR00997.
The participating National Institute of Child Health and Human Development Neonatal Research Network Centers, principal investigators, and research coordinators are indicated in the Appendix.
- Accepted February 28, 2005.
- Address correspondence to Namasivayam Ambalavanan, MD, Department of Pediatrics, 525 New Hillman Building, 619 South 20th St, University of Alabama at Birmingham, Birmingham, AL 35249. E-mail:
No conflict of interest declared.
- Lemons JA, Bauer CR, Oh W, et al. Very low birth weight outcomes of the National Institute of Child Health and Human Development Neonatal Research Network, January 1995 through December 1996. Pediatrics. 2001;107(1). Available at: www.pediatrics.org/cgi/content/full/107/1/e1
- Victorian Infant Collaborative Study Group. Improved outcome into the 1990s for infants weighing 500–999 g at birth. Arch Dis Child Fetal Neonatal Ed. 1997;77:F91–F94
- American Academy of Pediatrics, Committee on Fetus and Newborn. Perinatal care at the threshold of viability. Pediatrics. 2002;110:1024–1027
- Meadow W, Reimshisel T, Lantos J. Birth weight-specific mortality for extremely low birth weight infants vanishes by four days of life: epidemiology and ethics in the neonatal intensive care unit. Pediatrics. 1996;97:636–643
- Pollack MM, Koch MA, Bartel DA, et al. A comparison of neonatal mortality risk prediction models in very low birth weight infants. Pediatrics. 2000;105:1051–1057
- Meadow W, Frain L, Ren Y, Lee G, Soneji S, Lantos J. Serial assessment of mortality in the neonatal intensive care unit by algorithm and intuition: certainty, uncertainty, and informed consent. Pediatrics. 2002;109:878–886
- Zernikow B, Holtmannspoetter K, Michel E, et al. Artificial neural network for risk assessment in preterm neonates. Arch Dis Child Fetal Neonatal Ed. 1998;79:F129–F134
- Morse SB, Haywood JL, Goldenberg RL, Bronstein J, Nelson KG, Carlo WA. Estimation of neonatal outcome and perinatal therapy use. Pediatrics. 2000;105:1046–1050
- Copyright © 2005 by the American Academy of Pediatrics