Revalidation of the Score for Neonatal Acute Physiology in the Vermont Oxford Network
OBJECTIVES. Our specific objectives were (1) to document the performance of the revised Score for Neonatal Acute Physiology and the revised Score for Neonatal Acute Physiology Perinatal Extension in predicting death in the Vermont Oxford Network, compared with published normative values; (2) to determine whether this performance could be improved through recalibration of the weights for individual score items; (3) to determine the impact of including congenital anomalies in the predictive model; and (4) to compare performance against that of the Vermont Oxford Network risk adjustment, separately and in combination.
METHODS. Fifty-eight Vermont Oxford Network centers collected data prospectively for the revised Score for Neonatal Acute Physiology in the first 12 hours after admission of infants in 2002.
RESULTS. Data were collected for 10469 infants, and analyses were undertaken for 9897 who met inclusion criteria. The median revised Score for Neonatal Acute Physiology was 5, and the mean birth weight was 1951 g. Recalibration of the revised Score for Neonatal Acute Physiology and revised Score for Neonatal Acute Physiology Perinatal Extension resulted in minimal changes in their discriminatory abilities. The Vermont Oxford Network risk adjustment performed similarly, compared with the revised Score for Neonatal Acute Physiology Perinatal Extension.
CONCLUSIONS. Current score performance was similar to that observed previously, which suggests that the revised Score for Neonatal Acute Physiology and revised Score for Neonatal Acute Physiology Perinatal Extension have not decalibrated over the 7 years since the first cohort was assembled, despite advances in neonatal care during that period. Addition of congenital anomalies to the revised Score for Neonatal Acute Physiology Perinatal Extension improved discrimination significantly, particularly for infants with birth weights of >1500 g. The Vermont Oxford Network risk adjustment performed similarly, compared with the revised Score for Neonatal Acute Physiology Perinatal Extension.
Large-scale quality-improvement efforts in neonatology typically involve comparison of practices and outcomes between NICUs. For such comparisons to be valid, investigators and clinicians must account for differences in patient populations with respect to factors such as birth weight and severity of illness. The revised Score for Neonatal Acute Physiology (SNAP-II) quantifies illness severity, whereas the SNAP-II Perinatal Extension (SNAPPE-II) also incorporates information on birth weight, small-for-gestational age status, and Apgar score.1 SNAP-II is designed for measurement of physiologic severity of illness, whereas SNAPPE-II is more appropriate for risk adjustment, because it takes into account the independent effects of nonphysiologic baseline characteristics (Table 1). The scores predicted death accurately in a large cohort of 14610 infants in the original validation.1 Since then, SNAP-II and SNAPPE-II have facilitated comparison of practices and outcomes in 2 very large neonatal networks, namely, the Canadian Neonatal Network and the Kaiser Permanente network of NICUs in California.2–5
To remain relevant and useful, such illness severity scores must fulfill 2 criteria. First, they must maintain previously demonstrated performance. Deterioration in predictive performance may occur if incremental improvements in care change the relationship between illness severity and mortality rates. For example, a new treatment for hypotension might decrease the probability of death for any given level of blood pressure and thus eliminate or reduce the contribution of the lowest mean blood pressure criterion. Second, the scores must perform similarly in other settings. Even if the relationship between illness severity and mortality rates remains constant, there may be idiosyncrasies in how physiologic measurements are made in a particular setting, or such measurements may be less reliable outside the research environment. Although the original Score for Neonatal Acute Physiology (SNAP) has been tested in multiple small data sets, the more-parsimonious SNAP-II and SNAPPE-II have not been revalidated.
Revalidation has assumed more importance now that large organizations are launching aggressive benchmarking and quality-improvement efforts. We investigated the current performance and feasibility of SNAP-II and SNAPPE-II in one such enterprise, the Vermont Oxford Network (VON). Our specific objectives were as follows: (1) to document the performance of SNAP-II and SNAPPE-II against published normative values; (2) to determine whether this performance could be improved through recalibration of the weights for individual score items; (3) to determine the impact of a new item, namely, congenital anomalies; (4) to compare performance against that of VON risk adjustment (VON-RA), separately and in combination; and (5) to examine the relative performance of the scores in higher and lower birth weight cohorts.
The VON is a voluntary collaborative group of health professionals committed to improving the effectiveness and efficiency of medical care for newborn infants and their families, through a coordinated program of research, education, and quality-improvement projects.6 The network maintains a clinical database of information about patients at >500 participating NICUs. The SNAP-II project involved a self-selected group of 58 units; the centers and site investigators are listed in “Acknowledgments.” The project was approved by institutional research review boards at the University of Vermont and Beth Israel Deaconess Medical Center and by boards at the participating hospitals.
Study Patients and Data Collection
Data collection began between January and March 2002 and ended for all centers in December 2002. All patients who were entered into the VON clinical database at participating centers after each center’s declared start date were registered for the SNAP-II study. In some centers, these included only infants weighing ≤1500 g at birth; in other (“expanded”) centers, infants of all birth weights were enrolled.
We collected data for calculation of SNAP-II and SNAPPE-II over the first 12 hours after NICU admission. For infants who died or were transferred before 12 hours, we calculated scores up to that point but analyzed the results separately. Physiologic data included the following: lowest mean blood pressure, lowest core body temperature, lowest serum pH of a capillary or arterial blood gas sample, presence of multiple-seizure activity (suspected by 2 clinicians or a neurologist or confirmed with electroencephalography), total urine output, arterial blood gas results with the lowest Pao2, highest mean airway pressure, and highest inspired oxygen concentration.
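To make the scoring mechanics concrete, the sketch below computes a SNAP-II-style total as the sum of points for the worst observed derangement of each item. The thresholds and point values here are hypothetical placeholders for illustration only; the published SNAP-II item weights appear in Table 1 of the original derivation.

```python
# Illustrative sketch of scoring physiologic derangements as a weighted sum,
# in the style of SNAP-II. All thresholds and point values below are
# HYPOTHETICAL placeholders, not the published SNAP-II weights.

def score_item(value, bands):
    """Return the points for the most severe band that `value` falls into.

    `bands` is a list of (threshold, points) pairs, ordered from most to
    least severe; the first threshold that `value` falls below wins.
    """
    for threshold, points in bands:
        if value < threshold:
            return points
    return 0

def snap_like_score(lowest_mean_bp, lowest_temp_c, lowest_ph,
                    seizures, urine_ml_kg_h):
    total = 0
    total += score_item(lowest_mean_bp, [(20, 19), (30, 9)])     # mm Hg
    total += score_item(lowest_temp_c, [(35.0, 15), (35.6, 8)])  # degrees C
    total += score_item(lowest_ph, [(7.10, 16), (7.20, 7)])
    total += 19 if seizures else 0                               # multiple seizures
    total += score_item(urine_ml_kg_h, [(0.1, 18), (1.0, 5)])
    return total

# A moderately ill infant under these hypothetical weights:
print(snap_like_score(lowest_mean_bp=25, lowest_temp_c=35.2,
                      lowest_ph=7.15, seizures=False,
                      urine_ml_kg_h=0.5))  # → 29
```

A well infant with no derangements scores 0, so higher totals correspond monotonically to greater physiologic instability.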
VON has traditionally adjusted for baseline characteristics by adding demographic and clinical variables to the regression equations used to analyze outcomes in the clinical data set. These VON variables include gestational age, gestational age squared, multiple gestation, outborn status, Apgar score, gender, cesarean section delivery, and presence of a congenital anomaly (Table 1). We used a multilevel modification of a previously published VON definition for congenital anomalies.7 This earlier definition relied on empirical observations of outcomes of infants with anomalies reported to the VON database, by using a predefined list of conditions. The mortality rate for infants with major birth defects was 58%, whereas the mortality rate for infants without major birth defects was 13% (P < .001). The more-recent definition used in the current study stratifies risk according to 5 categories (no defect, moderately severe defect, severe defect, very severe defect, or most severe defect). Although the mortality rates have not been published for this expanded definition, it has been shown to be more discriminatory than the earlier definition.
Our data sets for analysis of SNAP and VON differed slightly from each other because SNAP requires a period of time for accumulation of physiologic data, whereas VON-RA does not. The VON data set consisted of all infants with complete clinical data registered in participating centers during the study period. The SNAP data set excluded infants who died before NICU admission, were missing critical data necessary for calculation of SNAP-II, or were moribund at admission (that is, receiving only comfort care, without intubation, mechanical ventilation, pressor treatment, or cardiac compressions). To provide optimal estimates of test characteristics, we report the performance of VON-RA with the VON sample and the performance of SNAP-II with the SNAP sample. For direct comparisons of the 2 scores, we used the smaller SNAP sample.
We report analyses for all birth weights and stratified analyses for infants ≤1500 g and >1500 g separately. Because some centers submit only data on ≤1500-g infants to VON, the full data set has an overweighting of smaller infants. To avoid bias toward very low birth weight infants, we restricted the nonstratified analyses to the expanded centers that enrolled infants of all birth weights.
Researchers at each site entered data into an Internet-based data entry program, which performed initial data validity checks. We then merged SNAP-II data with the VON clinical data set by using the unique VON identifier and performed additional validity and completeness checks.
As in previous validations, we elected to test score performance in predicting death, which we defined as any death occurring before discharge home. We chose death rather than other outcomes such as bronchopulmonary dysplasia, retinopathy of prematurity, or length of stay because it is measured reliably and applies to infants of all gestational ages.
We tested calibration, or the agreement between predicted and observed mortality rates, by applying the Hosmer-Lemeshow goodness-of-fit test.8 The Hosmer-Lemeshow test compares observed and predicted deaths over successive intervals of risk; a well-calibrated model yields a nonsignificant result in this analysis. We quantified discrimination, or the extent to which the model or score distinguishes infants who died from those who survived, by using the area under the receiver operating characteristic (ROC) curve (AUC).9 The ROC curve plots test sensitivity against the false-positive rate.
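The two performance measures can be sketched in a few lines of Python. Both implementations below are simplified illustrations of the standard formulas (the Mann-Whitney formulation of the AUC and the Hosmer-Lemeshow chi-square statistic), not the software used in the study.

```python
# Minimal pure-Python sketches of the two performance measures: the area
# under the ROC curve (discrimination) and the Hosmer-Lemeshow statistic
# (calibration). Simplified for illustration.

def auc(died, scores):
    """AUC via the Mann-Whitney formulation: the probability that a randomly
    chosen infant who died scores higher than one who survived (ties = 0.5)."""
    pos = [s for d, s in zip(died, scores) if d == 1]
    neg = [s for d, s in zip(died, scores) if d == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def hosmer_lemeshow(died, predicted, groups=10):
    """Chi-square statistic comparing observed vs expected deaths across
    `groups` equal-sized intervals of predicted risk; compare the result
    against a chi-square distribution with groups - 2 degrees of freedom."""
    paired = sorted(zip(predicted, died))
    n = len(paired)
    stat = 0.0
    for g in range(groups):
        chunk = paired[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        obs = sum(d for _, d in chunk)
        exp = sum(p for p, _ in chunk)
        nk = len(chunk)
        denom = exp * (1 - exp / nk)
        if denom > 0:
            stat += (obs - exp) ** 2 / denom
    return stat

# Perfect discrimination: every death scored above every survivor.
print(auc([0, 0, 1, 1], [3, 5, 12, 20]))  # → 1.0
```

An AUC of 0.5 corresponds to chance discrimination, 1.0 to perfect separation; a small Hosmer-Lemeshow statistic (nonsignificant against chi-square) indicates adequate calibration.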
In the initial “revalidation” analysis, we calculated SNAP-II and SNAPPE-II as the sum of the previously reported weights for the various levels of derangement for each of the 6 components.1 We determined discrimination and calibration and compared results with those in the original publication. We then performed a “recalibration,” by performing logistic regression analysis of death and the covariates from the SNAP-II and SNAPPE-II, rather than the numerical scores themselves. This approach allowed the β coefficient weights for each item in the score, such as lowest mean blood pressure or pH, to change. Test characteristics might be expected to differ from the initial analysis if the relationship of death to score items had changed. We compared performance of several alternative methods of risk adjustment, including (1) SNAP-II, (2) SNAPPE-II, (3) SNAPPE-II plus congenital anomalies, (4) VON-RA, and (5) a combination of VON-RA with SNAP-II.
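The recalibration step can be illustrated with a minimal logistic regression fit on item indicators. The gradient-descent fitter and the synthetic “hypotension”/“acidosis” indicators below are illustrative assumptions; the study itself used standard statistical software and the actual score items.

```python
import math
import random

# Sketch of the "recalibration" idea: rather than summing fixed published
# point values, refit the weight (beta coefficient) of each score item
# against death by logistic regression. Minimal full-batch gradient
# descent, for illustration only.

def fit_logistic(X, y, lr=0.5, iters=1500):
    """Fit intercept + coefficients for binary outcome y given rows X."""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)  # beta[0] is the intercept
    for _ in range(iters):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            z = beta[0] + sum(b * x for b, x in zip(beta[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi
            grad[0] += err
            for j, x in enumerate(xi):
                grad[j + 1] += err * x
        beta = [b - lr * g / n for b, g in zip(beta, grad)]
    return beta

# Synthetic example: two hypothetical binary item indicators (e.g.
# hypotension, acidosis), with the first carrying more mortality risk.
random.seed(0)
X, y = [], []
for _ in range(1000):
    hypo, acid = random.random() < 0.3, random.random() < 0.3
    logit = -2.0 + 2.0 * hypo + 0.8 * acid
    X.append([float(hypo), float(acid)])
    y.append(1 if random.random() < 1 / (1 + math.exp(-logit)) else 0)

beta = fit_logistic(X, y)
# Recalibrated weights: beta[1] (hypotension) should exceed beta[2] (acidosis),
# mirroring how refitted coefficients track each item's current mortality risk.
```

If neonatal care had shifted the mortality risk attached to an item, the refitted coefficient would shrink or grow accordingly, which is exactly the change the recalibration analysis tests for.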
We compared discrimination of competing tests through nonparametric Mann-Whitney comparison of the respective AUC values.10 We performed population comparisons by using Student’s t test and χ2 tests, as appropriate. All analyses were completed with SAS 9 (SAS Institute, Cary, NC) and Stata 7 (Stata Corp, College Station, TX) software.
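The study compared AUCs with a nonparametric Mann-Whitney-based test; a paired bootstrap, sketched below as a simplified stand-in for that procedure, asks the same question of two scores measured on the same infants.

```python
import random

# Paired-bootstrap sketch (an illustrative simplification, not the test
# used in the study): resample infants with replacement and ask how often
# score A's AUC exceeds score B's on the same resampled cohort.

def auc(died, scores):
    pos = [s for d, s in zip(died, scores) if d == 1]
    neg = [s for d, s in zip(died, scores) if d == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_comparison(died, score_a, score_b, reps=500, seed=0):
    """Return the fraction of bootstrap replicates in which score A
    discriminates death better than score B."""
    rng = random.Random(seed)
    n = len(died)
    a_wins = 0
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        d = [died[i] for i in idx]
        if 0 < sum(d) < len(d):  # need both deaths and survivors
            if auc(d, [score_a[i] for i in idx]) > auc(d, [score_b[i] for i in idx]):
                a_wins += 1
    return a_wins / reps
```

A fraction near 1.0 (or near 0.0) indicates a consistent discrimination advantage for one score; values near 0.5 indicate performances too similar to distinguish.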
Of the original 62 centers participating, 58 continued through project completion. Median enrollment was 105 infants per center, with a maximum of 1011 infants and a minimum of 6 infants. As shown in Fig 1, after exclusion of infants with missing clinical data, the VON sample included 10439 infants. After exclusion of infants with missing SNAP data and those who were moribund at admission or died before NICU admission, the final sample for SNAP analysis included 9897 infants.
Table 2 compares infants in study centers with those in nonparticipating centers. Infant characteristics fell within a typical range for such a mixed group of NICUs. The characteristics of the centers themselves did not differ significantly. Mean annual deliveries were 3571 in participating centers and 3112 in nonparticipating centers (P = .116). Mean NICU admissions were 614 in participating centers and 556 in nonparticipating centers (P = .381). For both participating and nonparticipating centers, 19% were centers that had restrictions on ventilation and performed only minor surgery, whereas 59% were centers that had no restrictions on assisted ventilation and performed major surgery not requiring cardiopulmonary bypass (P = .998). The corresponding obstetrics services were type 3 (providing services for all serious illnesses and abnormalities, supervised by a full-time maternal-fetal specialist) in 76% of participating centers and 77% of nonparticipating centers (P = .79). Expanded data on infants with birth weights of >1500 g were collected by 31% of participating centers and 11% of nonparticipating centers.
Details of deaths in the VON data set and the SNAP subset are shown in Table 3. The mortality rate for the VON data set was slightly higher because the data set included delivery room deaths, whereas SNAP data excluded such deaths.
Figure 2 shows that there was significant interinstitution variability in severity of illness, as measured with SNAP-II. Across all patients, the mean SNAP-II was 11.7 (SD: 14.2), and the median was 5. The median center, according to score, had a mean SNAP-II of 15, with a minimum of 6 and a maximum of 33.
Revalidation of Score Performance
Results of Hosmer-Lemeshow and ROC analyses for SNAP-II are shown in Table 4. In each birth weight stratum, there was a modest increase in discriminatory power, as shown by the AUC, between SNAP-II and SNAPPE-II (P < .001 for all weight categories). Discrimination was best for the smaller infants and the cohort of all birth weights and was moderate for larger infants. Goodness of fit was adequate for all strata. Comparison of SNAPPE-II performance with the originally published results showed negligible differences for the all-birth weight group and ≤1500-g group and a modest decrease in discrimination for the >1500-g group.
The recalibrated SNAP-II and SNAPPE-II results, in which the β coefficients for each item were allowed to vary in logistic regression analyses, are reported in Table 4. As shown, the recalibrated scores performed very similarly to the scores with the original weights. This implies that the modest differences between the current results and the original publication are most likely attributable to the population, rather than to decalibration of the underlying score.
Performance of VON-RA
Test characteristics of VON-RA with the data set that included delivery room deaths are given in Table 5. VON-RA demonstrated excellent discrimination across all birth weight strata. Goodness of fit was adequate across all strata in this data set.
Comparisons of scores in the SNAP data set, which excluded delivery room deaths, are shown in Table 6 and Fig 3. The addition of congenital anomalies to SNAPPE-II resulted in a substantial improvement in discrimination, compared with SNAPPE-II without congenital anomalies, particularly in the all-birth weight and >1500-g strata (AUC comparison: P < .001). Discrimination of SNAPPE-II with congenital anomalies was statistically superior to that of VON-RA, although the difference was of limited clinical significance (AUC comparison: P = .003 for all birth weights, P = .007 for ≤1500 g, and P = .065 for >1500 g). The addition of SNAP-II to VON-RA also improved performance slightly, especially in the >1500-g stratum (AUC comparison: P < .001 for all birth weights, P < .001 for ≤1500 g, and P = .002 for >1500 g). This combined score showed discrimination similar to that of SNAPPE-II when congenital anomalies were included in the latter (P = .058 for all birth weights, P = .001 for ≤1500 g, and P = .253 for >1500 g). It should be noted that, although its average discrimination was excellent, there was a pattern of a slightly lower observed/expected mortality ratio at the extremes of risk for SNAPPE-II plus congenital anomalies, as there was for VON-RA in the ≤1500-g cohort (data not shown).
Although SNAP-II and SNAPPE-II were applied recently to a small population of very low birth weight infants,11 our study represents the first revalidation of the scores in a novel large cohort of all birth weights. SNAPPE-II performed similarly, compared with the original report, which provides evidence for both its longevity and its generalizability to other populations. We speculate that the well-maintained performance is attributable to the fairly consistent patterns of neonatal intensive care since the scores were derived and to the large unselected cohorts in the original publication. The results indicate that the scores are likely to provide acceptable risk adjustment without frequent recalibration, provided that neonatal practices change only incrementally.
Both SNAP-II and SNAPPE-II rely mainly on physiologic measurements for risk adjustment, although SNAPPE-II does take baseline characteristics into consideration. In contrast, VON-RA, like the Clinical Risk Index for Babies, consists almost exclusively of covariates that do not measure physiologic illness severity directly. Our comparisons of VON-RA and SNAPPE-II yielded several interesting results. First, as with earlier comparisons between SNAP and Clinical Risk Index for Babies, the performances of the 2 classes of scores in predicting death were similar. Although there are theoretical reasons for assuming that physiologic risk adjustment might be superior,12 especially given the discretionary nature of certain risk-adjustment factors (such as cesarean section), these did not translate into better test characteristics in the VON setting. Given the complementary nature of the 2 types of scores, we hypothesized that a combination of SNAP-II and VON-RA would improve risk adjustment. The addition of SNAP-II did indeed improve prediction significantly in statistical terms, although the clinical significance of an increase in AUC from 0.93 to 0.95, as shown for the all-birth weight category, is questionable.
The results across birth weight categories are particularly interesting. Before the study, we hypothesized that illness severity scoring would make its greatest contribution in VON among the larger infants, because increments of birth weight or gestation are likely to be less linked to death closer to term. We observed the opposite, however; VON-RA performed best in the >1500-g cohort, whereas performances were similar for the 2 scores in the ≤1500-g cohort.
Part of the performance difference is related to the inclusion of congenital anomalies in VON-RA. Indeed, when congenital anomalies were combined with SNAPPE-II, performance was superior to that of VON-RA. Richardson et al1 deferred inclusion of congenital anomalies in SNAPPE-II because of concerns that the definition of this variable was inconsistent. The VON definition of congenital anomalies is based on a definition that is standardized and well validated. Although the congenital anomaly factor certainly seemed to affect performance in the current study, it must be noted that it is an empirically derived ranking of risks. Therefore, the congenital anomaly factor may be more dependent on practice styles and may decalibrate more quickly as technology changes, compared with other score components.
Although the VON-RA approach yielded results similar to those of SNAPPE-II, illness severity scoring might still provide benefits not seen with other approaches. As shown in Fig 2, quantification of illness severity provides a concrete description of a NICU population. Unlike de novo regression methods of risk adjustment, SNAPPE-II now is well standardized and has been shown to be generalizable. In contrast to “a priori” items such as outborn status and cesarean section rate, the score items are not discretionary. Moreover, because it measures a dynamic property rather than immutable or historical characteristics, physiologic scoring is responsive to changes in management. With the advent of computerized data collection, this property might allow daily or serial risk adjustment. Similarly, such scores may be used as outcome variables for obstetric or earlier neonatal care.
These advantages of numerical severity-of-illness scores over the logistic regression approach of VON-RA must be balanced against the disadvantage of the necessity of excluding delivery room deaths. For risk adjustment of elements of postadmission neonatal care, such as the incidence of bronchopulmonary dysplasia, such an exclusion may not be significant. However, VON provides confidential, comparative, performance data for use in quality improvement, and the full range of neonatal deaths must be considered. The optimal strategy is likely to be a combination of the 2 scores, with delivery room VON-RA for hospital-level comparisons, followed by physiologic severity-of-illness scoring for the subset of infants admitted to the NICU. In this context, SNAP would serve as an outcome measure for the quality of labor and delivery, transport, or NICU stabilization. Similar combinations might be used to guide certain clinical decisions, such as those regarding transport.13 It should be emphasized, however, that the broad confidence intervals for individual patients make such scoring systems inappropriate for guiding life-support decisions at the patient level.
The differences between the 2 scores are perhaps not as significant as the fact that they both performed very well. Payers and third-party health management organizations are beginning to use proprietary approaches to risk management to guide management of clinicians’ resource utilization. In the future, such uses might extend to the pay-for-performance arena. In contrast to both SNAP-II and VON-RA, these approaches may offer inadequate or misleading risk adjustment. We encourage organizations that are informing quality-of-care comparisons or payment decisions with neonatal risk adjustment to use only approaches that have been subjected to large-scale, evidence-based validation.
We dedicate this article to Doug Richardson, our teacher and friend.
The steering committee members included Esmond Arrindell (Baptist Memorial Hospital for Women, Memphis, TN); David Corcoran (Rotunda Hospital, Dublin, Ireland); Douglas Dransfield (Barbara Bush Children’s Hospital, Portland, ME); Keith Gallaher (Cape Fear Valley Medical Center, Fayetteville, NC); Jeffrey Gerdes (Pennsylvania Hospital, Philadelphia, PA); Roger Hinson (Woman’s Hospital, Baton Rouge, LA); David Hoffman (Reading Hospital and Medical Center, Reading, PA); Patrick Lewallen (Emanuel Children’s Hospital, Portland, OR); Allen Merritt (St Charles Medical Center, Bend, OR); and Jeanne Webb (Miller Children’s Hospital, Long Beach, CA).
Participating sites and site investigators included the following: Al Corniche Hospital (Vijay Baichoo, Gregory Samson); Albany Medical Center (Su Boynton, Pauline Graziano, Joaquim Pinheiro); Albert Einstein Medical Center (Agnes Salvador, David Schutzman); Aultman Hospital (Brenda Douglass, Kim Reese); Baptist Memorial Hospital for Women (Esmond Arrindell, Dianna Garner); Barbara Bush Children’s Hospital at Maine Medical Center (Douglas Dransfield, Dan Sobel); Baylor Healthcare System (Pam McKinley, Jonathan Whitfield); Beth Israel Deaconess Medical Center (John Zupancic); Cape Fear Valley Medical Center (Keith J. Gallaher, Anne Sheaves); Childrens Hospital Los Angeles, Center for Newborn and Infant Critical Care (Cyndi Atkinson, Philippe S. Friedlich); Children’s Hospital Medical Center Akron (Tina Bair, Judy Ohlinger); Children’s Hospital of Illinois at OSF at St Francis Medical Center (Howard S. Cohen, Constance McConnell); Children’s Hospital of Philadelphia Neonatology at Chester County Hospital (Michael Friedman, Lloyd Tinianow); Christiana Care Health Services (Kathy Leef, David Paul); Columbia University Medical Center (Jack Lorenz, Kiyoko Ohira-Kist); Columbus Children’s Hospital (Patty Lore, Rick McClead); Crozer-Chester Medical Center (Cynthia Dembofsky, Sonia Hulman); DeVos Children’s Hospital/Spectrum Health (Ed Beaumont, Dinah Sutton); Doctor’s Hospital West; Evanston Hospital (William MacKendrick, Sue Wolf); Fitzgerald Mercy Medical Center (David Shutzman); Grant Medical Center (Craig Anderson, Nancy Wagner); Hennepin County Medical Center (Raul F. Cifuentes, MaryAnn Tyler); Henry Ford Hospital (Savitri Kumar, Bonnie Malmberg); Hospital for Children and Adolescents (Sture Andersson, Marita Suni); Hospital of University of Pennsylvania (Judy Burke, Jeffrey Merrill); Howard County General Hospital (Bharti Razdan, Misrak Tadesse); Inova Fairfax Hospital for Children (Robin Baker, Rebecca Beck); Janet Weis Children’s Hospital at Geisinger Medical Center (Lauren Johnson-Robbins); Mercy Hospital and Medical Center (Jagjit Teji, Rohitkumar Vasa); Miller Children’s Hospital (Arthur Strauss, Jeanne Webb); Naval Medical Center-San Diego (Douglas Carbine); New Hanover Regional Medical Center (Robert McArtor, Jane Ranney); NICU Ospedale S. Anna (Daniele Merazzi); Norton Suburban Hospital; Parkview Memorial Hospital (Ihor Bilyk); Pennsylvania Hospital (Soraya Abbasi, Jeffrey Gerdes); Riverside Methodist Hospital (Rick McClead); Rockford Memorial Hospital (Wendy Boehm, Patricia Ittmann); Rotunda Hospital (David Corcoran); Sioux Valley Children’s Hospital; Sisters of Charity (Anthony Barone, Anantham Harin); Sparrow Hospital (Carolyn Herrington, Padmani Karna); St Agnes Hospital (Barbara Long, Arturo Santos); St Charles Medical Center (Maryanne Merritt, T. Allen Merritt); St John Hospital and Medical Center (Maria Duenas); St John’s Hospital (Brenda Bigley, Dennis Crouse); St Joseph’s Regional Medical Center (Jeffrey Garland, Susan Kannenberg); Sunnybrook and Women’s College Health Sciences Centre (Michael Dunn, Allyson Nichols); Brooklyn Hospital Center (Patrick LeBlanc, Meena LaCorte); Reading Hospital and Medical Center (Gerald D. Brown, David J. Hoffman); University of Massachusetts Memorial Health Care (Francis J. Bednarek, Mary L. Naples); University Kebangsaan Malaysia (Nem-Yun Boo, Ismail Juriza); University of Michigan-Holden NICU (Al Cain, Ronald Dechert); University of Puerto Rico Hospital NICU; Wesley Medical Center (Barry Bloom, Paula Delmore); Woman’s Hospital (Roger M. Hinson); Wyckoff Heights Medical Center.
We acknowledge gratefully the guidance of the VON SNAP Pilot Project Steering Committee and the efforts of the site investigators to ensure the accuracy and completeness of data.
- Accepted July 26, 2006.
- Address correspondence to John A. F. Zupancic, MD, ScD, Department of Neonatology, Beth Israel Deaconess Medical Center, 330 Brookline Ave, Rose Building, Room 318, Boston, MA 02215. E-mail:
The authors have indicated they have no financial relationships relevant to this article to disclose.
- Escobar GJ, Greene JD, Hulac P, et al. Rehospitalisation after birth hospitalisation: patterns among infants of all gestations. Arch Dis Child. 2005;90:125–131
- Horbar JD. The Vermont Oxford Network: evidence-based quality improvement for neonatology. Pediatrics. 1999;103(suppl E):350–359
- Hosmer D, Lemeshow S. Applied Logistic Regression. New York, NY: John Wiley; 1989
- Gagliardi L, Cavazza A, Brunelli A, et al. Assessing mortality risk in very low birthweight infants: a comparison of CRIB, CRIB-II, and SNAPPE-II. Arch Dis Child Fetal Neonatal Ed. 2004;89:F419–F422
- Richardson D, Tarnow-Mordi WO, Lee SK. Risk adjustment for quality improvement. Pediatrics. 1999;103(suppl E):255–265
- Copyright © 2007 by the American Academy of Pediatrics