ARTICLE |

a Department of Neonatology, Beth Israel Deaconess Medical Center, Boston, Massachusetts
b Division of Newborn Medicine, Harvard Medical School, Boston, Massachusetts
c Vermont Oxford Network, Burlington, Vermont
d Department of Pediatrics, University of Vermont College of Medicine, Burlington, Vermont
e Integrated Centre for Care Advancement through Research Edmonton (iCare), University of Alberta, Edmonton, Alberta, Canada
f Perinatal Research Unit, Kaiser Permanente Medical Care Program Division of Research, Oakland, California
| ABSTRACT |
|---|
METHODS. Fifty-eight Vermont Oxford Network centers collected data prospectively for the revised Score for Neonatal Acute Physiology in the first 12 hours after admission of infants in 2002.
RESULTS. Data were collected for 10469 infants, and analyses were undertaken for 9897 who met inclusion criteria. The median revised Score for Neonatal Acute Physiology was 5, and the mean birth weight was 1951 g. Recalibration of the revised Score for Neonatal Acute Physiology and revised Score for Neonatal Acute Physiology Perinatal Extension resulted in minimal changes in their discriminatory abilities. The Vermont Oxford Network risk adjustment performed similarly, compared with the revised Score for Neonatal Acute Physiology Perinatal Extension.
CONCLUSIONS. Current score performance was similar to that observed previously, which suggests that the revised Score for Neonatal Acute Physiology and revised Score for Neonatal Acute Physiology Perinatal Extension have not decalibrated over the 7 years since the first cohort was assembled, despite advances in neonatal care during that period. Addition of congenital anomalies to the revised Score for Neonatal Acute Physiology Perinatal Extension improved discrimination significantly, particularly for infants with birth weights of >1500 g.
Key Words: infant, newborn; predictive value of tests; illness severity
Abbreviations: AUC, area under receiver operating characteristic curve; ROC, receiver operating characteristic; SNAP, Score for Neonatal Acute Physiology; SNAP-II, revised Score for Neonatal Acute Physiology; SNAPPE-II, revised Score for Neonatal Acute Physiology Perinatal Extension; VON, Vermont Oxford Network; VON-RA, Vermont Oxford Network risk adjustment
Large-scale quality-improvement efforts in neonatology typically involve comparison of practices and outcomes between NICUs. For such comparisons to be valid, investigators and clinicians must account for differences in patient populations with respect to factors such as birth weight and severity of illness. The revised Score for Neonatal Acute Physiology (SNAP-II) quantifies illness severity, whereas the SNAP-II Perinatal Extension (SNAPPE-II) also incorporates information on birth weight, small-for-gestational age status, and Apgar score.1 SNAP-II is designed for measurement of physiologic severity of illness, whereas SNAPPE-II is more appropriate for risk adjustment, because it takes into account the independent effects of nonphysiologic baseline characteristics (Table 1). The scores predicted death accurately in a large cohort of 14610 infants in the original validation.1 Since then, SNAP-II and SNAPPE-II have facilitated comparison of practices and outcomes in 2 very large neonatal networks, namely, the Canadian Neonatal Network and the Kaiser Permanente network of NICUs in California.2–5
Revalidation has assumed more importance now that large organizations are launching aggressive benchmarking and quality-improvement efforts. We investigated the current performance and feasibility of SNAP-II and SNAPPE-II in one such enterprise, the Vermont Oxford Network (VON). Our specific objectives were as follows: (1) to document the performance of SNAP-II and SNAPPE-II against published normative values; (2) to determine whether this performance could be improved through recalibration of the weights for individual score items; (3) to determine the impact of a new item, namely, congenital anomalies; (4) to compare performance against that of VON risk adjustment (VON-RA), separately and in combination; and (5) to examine the relative performance of the scores in higher and lower birth weight cohorts.
| METHODS |
|---|
Study Patients and Data Collection
Data collection began between January and March 2002 and ended for all centers in December 2002. All patients who were entered into the VON clinical database at participating centers after each center's declared start date were registered for the SNAP-II study. In some centers, these included only infants weighing <1500 g at birth; in other ("expanded") centers, infants of all birth weights were enrolled.
We collected data for calculation of SNAP-II and SNAPPE-II over the first 12 hours after NICU admission. For infants who died or were transferred before 12 hours, we calculated scores up to that point but analyzed the results separately. Physiologic data included the following: lowest mean blood pressure, lowest core body temperature, lowest serum pH of a capillary or arterial blood gas sample, presence of multiple-seizure activity (suspected by 2 clinicians or a neurologist or confirmed with electroencephalography), total urine output, arterial blood gas results with the lowest PaO2, highest mean airway pressure, and highest inspired oxygen concentration.
VON has traditionally adjusted for baseline characteristics by adding demographic and clinical variables to the regression equations used to analyze outcomes in the clinical data set. These VON variables include gestational age, gestational age squared, multiple gestation, outborn status, Apgar score, gender, cesarean section delivery, and presence of a congenital anomaly (Table 1). We used a multilevel modification of a previously published VON definition for congenital anomalies.7 This earlier definition relied on empirical observations of outcomes of infants with anomalies reported to the VON database, by using a predefined list of conditions. The mortality rate for infants with major birth defects was 58%, whereas the mortality rate for infants without major birth defects was 13% (P < .001). The more-recent definition used in the current study stratifies risk according to 5 categories (no defect, moderately severe defect, severe defect, very severe defect, or most severe defect). Although the mortality rates have not been published for this expanded definition, it has been shown to be more discriminatory than the earlier definition.
Our data sets for analysis of SNAP and VON differed slightly from each other because SNAP requires a period of time for accumulation of physiologic data, whereas VON-RA does not. The VON data set consisted of all infants with complete clinical data registered in participating centers during the study period. The SNAP data set excluded infants who died before NICU admission, were missing critical data necessary for calculation of SNAP-II, or were moribund at admission (that is, receiving only comfort care, without intubation, mechanical ventilation, pressor treatment, or cardiac compressions). To provide optimal estimates of test characteristics, we report the performance of VON-RA with the VON sample and the performance of SNAP-II with the SNAP sample. For direct comparisons of the 2 scores, we used the smaller SNAP sample.
We report analyses for all birth weights and stratified analyses for infants ≤1500 g and >1500 g separately. Because some centers submit only data on ≤1500-g infants to VON, the full data set has an overweighting of smaller infants. To avoid bias toward very low birth weight infants, we restricted the nonstratified analyses to the expanded centers that enrolled infants of all birth weights.
Researchers at each site entered data into an Internet-based data entry program, which performed initial data validity checks. We then merged SNAP-II data with the VON clinical data set by using the unique VON identifier and performed additional validity and completeness checks.
Analyses
As in previous validations, we elected to test score performance in predicting death, which we defined as any death occurring before discharge home. We chose death rather than other outcomes such as bronchopulmonary dysplasia, retinopathy of prematurity, or length of stay because it is measured reliably and applies to infants of all gestational ages.
We tested calibration, or the extent to which the deaths predicted by a model or score agree with those observed, by applying the Hosmer-Lemeshow goodness-of-fit test.8 The Hosmer-Lemeshow test compares observed and predicted death over successive intervals of risk. A well-calibrated test has a nonsignificant result in this analysis. We quantified discrimination, or the extent to which the model or score distinguishes infants who died from those who survived, by using the area under the receiver operating characteristic (ROC) curve (AUC).9 The ROC curve plots test sensitivity against the false-positive rate.
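As a minimal illustration of these 2 measures (a sketch under stated assumptions, not the SAS or Stata routines used in the study), the Python below computes the AUC as the Mann-Whitney probability that an infant who died received a higher predicted risk than a survivor, and the Hosmer-Lemeshow statistic over equal-sized groups of ascending risk. The function names, the default of 10 groups, and the input format (parallel lists of predicted risk and a 0/1 death indicator) are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def auc_mann_whitney(risk: Sequence[float], died: Sequence[int]) -> float:
    """AUC as the probability that a randomly chosen death has a higher
    predicted risk than a randomly chosen survivor (ties count one half)."""
    deaths = [r for r, d in zip(risk, died) if d == 1]
    survivors = [r for r, d in zip(risk, died) if d == 0]
    if not deaths or not survivors:
        raise ValueError("need at least one death and one survivor")
    wins = 0.0
    for r_death in deaths:
        for r_surv in survivors:
            if r_death > r_surv:
                wins += 1.0
            elif r_death == r_surv:
                wins += 0.5
    return wins / (len(deaths) * len(survivors))


def hosmer_lemeshow(
    risk: Sequence[float], died: Sequence[int], groups: int = 10
) -> Tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom).

    Infants are sorted by predicted risk and cut into `groups` bins of
    nearly equal size; each bin contributes
    (observed - expected)^2 / (n * p * (1 - p)), with p the mean risk.
    """
    pairs = sorted(zip(risk, died))
    n = len(pairs)
    statistic = 0.0
    bins_used = 0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        expected = sum(r for r, _ in chunk)   # predicted deaths in the bin
        observed = sum(d for _, d in chunk)   # actual deaths in the bin
        mean_risk = expected / len(chunk)
        if 0.0 < mean_risk < 1.0:
            statistic += (observed - expected) ** 2 / (
                len(chunk) * mean_risk * (1.0 - mean_risk)
            )
            bins_used += 1
    return statistic, max(bins_used - 2, 0)
```

A small statistic relative to its degrees of freedom (a nonsignificant P value) indicates that predicted and observed deaths agree across the risk groups, whereas an AUC near 1 indicates that deaths are consistently ranked above survivors.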
In the initial "revalidation" analysis, we calculated SNAP-II and SNAPPE-II as the sum of the previously reported weights for the various levels of derangement for each of the 6 components.1 We determined discrimination and calibration and compared results with those in the original publication. We then performed a "recalibration," by performing logistic regression analysis of death and the covariates from the SNAP-II and SNAPPE-II, rather than the numerical scores themselves. This approach allowed the β coefficient weights for each item in the score, such as lowest mean blood pressure or pH, to change. Test characteristics might be expected to differ from the initial analysis if the relationship of death to score items had changed. We compared performance of several alternative methods of risk adjustment, including (1) SNAP-II, (2) SNAPPE-II, (3) SNAPPE-II plus congenital anomalies, (4) VON-RA, and (5) a combination of VON-RA with SNAP-II.
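The distinction between the 2 analyses can be sketched in code. In the revalidation, each infant receives a single number, the sum of fixed points for the derangement level of each item; in the recalibration, the same items enter a logistic regression as separate indicator covariates, so that each level can acquire its own coefficient. The Python below is a minimal sketch of that data preparation only. The point values are hypothetical placeholders and are not the published SNAP-II weights; the item names, level labels, and function names are likewise assumptions made for illustration.

```python
from typing import Dict, Mapping

# HYPOTHETICAL placeholder points per item and derangement level.
# These are NOT the published SNAP-II weights; they only show the structure.
PLACEHOLDER_WEIGHTS: Dict[str, Dict[str, int]] = {
    "mean_blood_pressure": {"normal": 0, "moderate": 2, "severe": 4},
    "temperature": {"normal": 0, "moderate": 1, "severe": 3},
    "serum_ph": {"normal": 0, "moderate": 2, "severe": 4},
    "oxygenation": {"normal": 0, "moderate": 1, "severe": 5},
    "urine_output": {"normal": 0, "moderate": 1, "severe": 4},
    "multiple_seizures": {"normal": 0, "severe": 4},
}


def additive_score(
    levels: Mapping[str, str],
    weights: Mapping[str, Mapping[str, int]] = PLACEHOLDER_WEIGHTS,
) -> int:
    """Revalidation-style score: sum the fixed points for each item's level.
    Items missing from `levels` are treated as normal."""
    return sum(weights[item][levels.get(item, "normal")] for item in weights)


def indicator_covariates(
    levels: Mapping[str, str],
    weights: Mapping[str, Mapping[str, int]] = PLACEHOLDER_WEIGHTS,
) -> Dict[str, int]:
    """Recalibration-style inputs: one 0/1 column per non-normal level, so a
    logistic regression can estimate a separate coefficient for each."""
    columns: Dict[str, int] = {}
    for item, by_level in weights.items():
        for level in by_level:
            if level == "normal":
                continue  # normal is the reference category
            columns[f"{item}={level}"] = int(levels.get(item, "normal") == level)
    return columns


infant = {"mean_blood_pressure": "severe", "serum_ph": "moderate"}
print(additive_score(infant))                      # single numerical score
print(sum(indicator_covariates(infant).values()))  # number of active indicators
```

If the relationship between an item and death had shifted since the original cohort, the refitted coefficients for the indicator columns would diverge from the fixed points, and the recalibrated model would be expected to discriminate better than the summed score.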
We compared discrimination of competing tests through nonparametric Mann-Whitney comparison of the respective AUC values.10 We performed population comparisons by using Student's t test and χ² tests as appropriate. All analyses were completed using SAS 9 (SAS Institute, Cary, NC) and Stata 7 (College Station, TX) software.
| RESULTS |
|---|
≤1500-g group and a modest decrease in discrimination for the >1500-g group.
Performance of VON-RA
Test characteristics of VON-RA with the data set that included delivery room deaths are given in Table 5. VON-RA demonstrated excellent discrimination across all birth weight strata. Goodness of fit was adequate across all strata in this data set.
≤1500 g, and P = .065 for >1500 g). The addition of SNAP-II to VON-RA did improve performance slightly, especially in the >1500-g stratum (AUC comparison: P < .001 for all birth weights, P < .001 for ≤1500 g, and P = .002 for >1500 g). This combined score showed discrimination similar to that of SNAPPE-II when congenital anomalies were included in the latter (P = .058 for all birth weights, P = .001 for ≤1500 g, and P = .253 for >1500 g strata). It should be noted that, although its average discrimination was excellent, there was a pattern of a slightly lower observed/expected mortality ratio at the extremes of risk for SNAPPE-II plus congenital anomalies, as there was for VON-RA in the ≤1500-g cohort (data not shown).
| DISCUSSION |
|---|
Both SNAP-II and SNAPPE-II rely mainly on physiologic measurements for risk adjustment, although SNAPPE-II does take baseline characteristics into consideration. In contrast, VON-RA, like the Clinical Risk Index for Babies, consists almost exclusively of covariates that do not measure physiologic illness severity directly. Our comparisons of VON-RA and SNAPPE-II yielded several interesting results. First, as with earlier comparisons between SNAP and Clinical Risk Index for Babies, the performances of the 2 classes of scores in predicting death were similar. Although there are theoretical reasons for assuming that physiologic risk adjustment might be superior,12 especially given the discretionary nature of certain risk-adjustment factors (such as cesarean section), these did not translate into better test characteristics in the VON setting. Given the complementary nature of the 2 types of scores, we hypothesized that a combination of SNAP-II and VON-RA would improve risk adjustment. The addition of SNAP-II did indeed improve prediction significantly in statistical terms, although the clinical significance of an increase in AUC from 0.93 to 0.95, as shown for the all-birth weight category, is questionable.
The results across birth weight categories are particularly interesting. Before the study, we hypothesized that illness severity scoring would have its greatest contribution in VON among the larger infants, because increments of birth weight or gestation are likely to be less linked to death closer to term. We observed the opposite, however; VON-RA performed best in the >1500-g cohort, whereas performances were similar for the 2 scores in the <1500-g cohort.
Part of the performance difference is related to the inclusion of congenital anomalies in VON-RA. Indeed, when congenital anomalies were combined with SNAPPE-II, performance was superior to that of VON-RA. Richardson et al1 deferred inclusion of congenital anomalies in SNAPPE-II because of concerns that the definition of this variable was inconsistent. The VON definition of congenital anomalies is based on a definition that is standardized and well validated. Although the congenital anomaly factor certainly seemed to affect performance in the current study, it must be noted that it is an empirically derived ranking of risks. Therefore, the congenital anomaly factor may be more dependent on practice styles and may decalibrate more quickly as technology changes, compared with other score components.
Although the VON-RA approach yielded results similar to those of SNAPPE-II, illness severity scoring might still provide benefits not seen with other approaches. As shown in Fig 2, quantification of illness severity provides a concrete description of a NICU population. Unlike de novo regression methods of risk adjustment, SNAPPE-II now is well standardized and has been shown to be generalizable. In contrast to "a priori" items such as outborn status and cesarean section rate, the score items are not discretionary. Moreover, because it measures a dynamic property rather than immutable or historical characteristics, physiologic scoring is responsive to changes in management. With the advent of computerized data collection, this property might allow daily or serial risk adjustment. Similarly, such scores may be used as outcome variables for obstetric or earlier neonatal care.
These advantages of numerical severity-of-illness scores over the logistic regression approach of VON-RA must be balanced against the disadvantage of the necessity of excluding delivery room deaths. For risk adjustment of elements of postadmission neonatal care, such as the incidence of bronchopulmonary dysplasia, such an exclusion may not be significant. However, VON provides confidential, comparative, performance data for use in quality improvement, and the full range of neonatal deaths must be considered. The optimal strategy is likely to be a combination of the 2 scores, with delivery room VON-RA for hospital-level comparisons, followed by physiologic severity-of-illness scoring for the subset of infants admitted to the NICU. In this context, SNAP would serve as an outcome measure for the quality of labor and delivery, transport, or NICU stabilization. Similar combinations might be used to guide certain clinical decisions, such as those regarding transport.13 It should be emphasized, however, that the broad confidence intervals for individual patients make such scoring systems inappropriate for guiding life-support decisions at the patient level.
The differences between the 2 scores are perhaps not as significant as the fact that they both performed very well. Payers and third-party health management organizations are beginning to use proprietary approaches to risk management to guide management of clinicians' resource utilization. In the future, such uses might extend to the pay-for-performance arena. In contrast to both SNAP-II and VON-RA, these approaches may offer inadequate or misleading risk adjustment. We encourage organizations that are informing quality-of-care comparisons or payment decisions with neonatal risk adjustment to use only approaches that have been subjected to large-scale, evidence-based validation.
| ACKNOWLEDGMENTS |
|---|
The steering committee members included Esmond Arrindell (Baptist Memorial Hospital for Women, Memphis, TN); David Corcoran (Rotunda Hospital, Dublin, Ireland); Douglas Dransfield (Barbara Bush Children's Hospital, Portland, ME); Keith Gallaher (Cape Fear Valley Medical Center, Fayetteville, NC); Jeffrey Gerdes (Pennsylvania Hospital, Philadelphia, PA); Roger Hinson (Woman's Hospital, Baton Rouge, LA); David Hoffman (Reading Hospital and Medical Center, Reading, PA); Patrick Lewallen (Emanuel Children's Hospital, Portland, OR); Allen Merritt (St Charles Medical Center, Bend, OR); and Jeanne Webb (Miller Children's Hospital, Long Beach, CA). Participating sites and site investigators included the following: Al Corniche Hospital (Vijay Baichoo, Gregory Samson); Albany Medical Center (Su Boynton, Pauline Graziano, Joaquim Pinheiro); Albert Einstein Medical Center (Agnes Salvador, David Schutzman); Aultman Hospital (Brenda Douglass, Kim Reese); Baptist Memorial Hospital for Women (Esmond Arrindell, Dianna Garner); Barbara Bush Children's Hospital at Maine Medical Center (Douglas Dransfield, Dan Sobel); Baylor Healthcare System (Pam McKinley, Jonathan Whitfield); Beth Israel Deaconess Medical Center (John Zupancic); Cape Fear Valley Medical Center (Keith J. Gallaher, Anne Sheaves); Children's Hospital Los Angeles, Center for Newborn and Infant Critical Care (Cyndi Atkinson, Philippe S. Friedlich); Children's Hospital Medical Center Akron (Tina Bair, Judy Ohlinger); Children's Hospital of Illinois at OSF at St Francis Medical Center (Howard S. Cohen, Constance McConnell); Children's Hospital of Philadelphia Neonatology at Chester County Hospital (Michael Friedman, Lloyd Tinianow); Christiana Care Health Services (Kathy Leef, David Paul); Columbia University Medical Center (Jack Lorenz, Kiyoko Ohira-Kist); Columbus Children's Hospital (Patty Lore, Rick McClead); Crozer-Chester Medical Center (Cynthia Dembofsky, Sonia Hulman); DeVos Children's Hospital/Spectrum Health (Ed Beaumont, Dinah Sutton); Doctors Hospital West; Evanston Hospital (William MacKendrick, Sue Wolf); Fitzgerald Mercy Medical Center (David Shutzman); Grant Medical Center (Craig Anderson, Nancy Wagner); Hennepin County Medical Center (Raul F. Cifuentes, MaryAnn Tyler); Henry Ford Hospital (Savitri Kumar, Bonnie Malmberg); Hospital for Children and Adolescents (Sture Andersson, Marita Suni); Hospital of University of Pennsylvania (Judy Burke, Jeffrey Merrill); Howard County General Hospital (Bharti Razdan, Misrak Tadesse); Inova Fairfax Hospital for Children (Robin Baker, Rebecca Beck); Janet Weis Children's Hospital at Geisinger Medical Center (Lauren Johnson-Robbins); Mercy Hospital and Medical Center (Jagjit Teji, Rohitkumar Vasa); Miller Children's Hospital (Arthur Strauss, Jeanne Webb); Naval Medical Center-San Diego (Douglas Carbine); New Hanover Regional Medical Center (Robert McArtor, Jane Ranney); NICU Ospedale S. Anna (Daniele Merazzi); Norton Suburban Hospital; Parkview Memorial Hospital (Ihor Bilyk); Pennsylvania Hospital (Soraya Abbasi, Jeffrey Gerdes); Riverside Methodist Hospital (Rick McClead); Rockford Memorial Hospital (Wendy Boehm, Patricia Ittmann); Rotunda Hospital (David Corcoran); Sioux Valley Children's Hospital; Sisters of Charity (Anthony Barone, Anantham Harin); Sparrow Hospital (Carolyn Herrington, Padmani Karna); St Agnes Hospital (Barbara Long, Arturo Santos); St Charles Medical Center (Maryanne Merritt, T. Allen Merritt); St John Hospital and Medical Center (Maria Duenas); St John's Hospital (Brenda Bigley, Dennis Crouse); St Joseph's Regional Medical Center (Jeffrey Garland, Susan Kannenberg); Sunnybrook and Women's College Health Sciences Centre (Michael Dunn, Allyson Nichols); Brooklyn Hospital Center (Patrick LeBlanc, Meena LaCorte); Reading Hospital and Medical Center (Gerald D. Brown, David J. Hoffman); University of Massachusetts Memorial Health Care (Francis J. Bednarek, Mary L. Naples); University Kebangsaan Malaysia (Nem-Yun Boo, Ismail Juriza); University of Michigan-Holden NICU (Al Cain, Ronald Dechert); University of Puerto Rico Hospital NICU; Wesley Medical Center (Barry Bloom, Paula Delmore); Woman's Hospital (Roger M. Hinson); Wyckoff Heights Medical Center.
We acknowledge gratefully the guidance of the VON SNAP Pilot Project Steering Committee and the efforts of the site investigators to ensure the accuracy and completeness of data.
| FOOTNOTES |
|---|
Address correspondence to John A. F. Zupancic, MD, ScD, Department of Neonatology, Beth Israel Deaconess Medical Center, 330 Brookline Ave, Rose Building, Room 318, Boston, MA 02215. E-mail: jzupanci@bidmc.harvard.edu
The authors have indicated they have no financial relationships relevant to this article to disclose.
| REFERENCES |
|---|
34 weeks' gestation. J Pediatr. 2004;145:754–760