In the July 2002 issue of Pediatrics, Engle et al1 conclude that in a population of infants at risk for hyperbilirubinemia who are referred for retesting, transcutaneous bilirubinometry has limited usefulness. This study and many like it are quite necessary, but in this case, the result was very predictable provided one accepts the following:
No test is perfect.
Patients with signs (eg, jaundice) are more likely to be referred for testing.
Patients with signs (eg, jaundice) are more likely to have high transcutaneous bilirubin values than those without signs.
Engle’s population consisted of infants in whom “the primary caregiver determined that clinically apparent jaundice necessitated retesting.” In contrast, Bhutani et al2 (who reported on the same device) selected a population who, for the most part, had testing done at the time of routine metabolic screening; essentially, all were “well” newborns.2 Presumably, some of these newborns will later become jaundiced and some will not. Thus, the spectrum of patients tested in each study was different.
The specificity of a test answers the question, “If a patient does not have disease, how likely is he to have a negative test?” It is measured as the true-negative rate, or (true-negatives)/(true-negatives + false-positives). In Engle’s study, because infants were referred on the basis of jaundice, we would expect relatively more false-positives (jaundiced but not hyperbilirubinemic) and fewer true-negatives (not jaundiced, not hyperbilirubinemic) in this population than in the Bhutani study; that is, among infants without hyperbilirubinemia, a smaller share of true-negatives and a larger share of false-positives. We would predict specificity to drop. It did. Similarly, sensitivity is calculated as (true-positives)/(true-positives + false-negatives). As the proportion of jaundiced versus nonjaundiced patients tested is increased, the number of false-negative patients referred should be a bit lower, and sensitivity should improve.
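The arithmetic behind this prediction can be sketched with two 2 × 2 tables. The counts below are hypothetical, invented only to illustrate the direction of the effect; they are not data from Engle et al or Bhutani et al, and the names `TwoByTwo`, `screening`, and `referred` are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Counts from comparing a test with the reference standard."""
    tp: int  # test positive, hyperbilirubinemic
    fn: int  # test negative, hyperbilirubinemic
    tn: int  # test negative, not hyperbilirubinemic
    fp: int  # test positive, not hyperbilirubinemic

    def sensitivity(self) -> float:
        # (true-positives) / (true-positives + false-negatives)
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # (true-negatives) / (true-negatives + false-positives)
        return self.tn / (self.tn + self.fp)


# Hypothetical counts, for illustration only.
screening = TwoByTwo(tp=17, fn=3, tn=915, fp=85)  # mostly "well" newborns
referred = TwoByTwo(tp=75, fn=5, tn=67, fp=33)    # referred for visible jaundice

for name, table in (("screening", screening), ("referred", referred)):
    print(f"{name}: sensitivity={table.sensitivity():.2f}, "
          f"specificity={table.specificity():.2f}")
```

With these invented counts, the referred group shows higher sensitivity and lower specificity than the screening group, which is the pattern the argument predicts.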
This phenomenon, whereby the discriminative power of a diagnostic test is influenced by referral patterns, is well described.3–5 In general, when populations are selected for referral on the basis of signs or symptoms, such selection usually raises sensitivity, reduces specificity, and decreases the likelihood ratio for both negative and positive results.6 In other words, the diagnostic accuracy of tests is often “used up” with referral of patients from primary to secondary or tertiary settings.7 This is not always true but will usually be so for diagnostic tests when patients are referred on the basis of signs or symptoms. When evaluating diagnostic research for use in clinical practice, it is important for the practitioner to ask the question: “Was this diagnostic test evaluated in an appropriate spectrum of patients?”
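The effect on likelihood ratios follows from their standard definitions: the positive ratio is sensitivity/(1 − specificity), and the negative ratio is (1 − sensitivity)/specificity. A minimal sketch, using hypothetical sensitivity and specificity values chosen only for illustration (they are not taken from any of the studies cited):

```python
def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Return (positive likelihood ratio, negative likelihood ratio)."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


# Hypothetical values: referral raises sensitivity and lowers specificity.
primary_care = likelihood_ratios(sensitivity=0.85, specificity=0.915)
referred = likelihood_ratios(sensitivity=0.9375, specificity=0.67)

print(f"primary care: LR+={primary_care[0]:.2f}, LR-={primary_care[1]:.3f}")
print(f"referred:     LR+={referred[0]:.2f}, LR-={referred[1]:.3f}")
```

In this invented example, both ratios are numerically smaller in the referred group than in the primary care group.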
Rubaltelli et al8 evaluated the same device in yet another study. An adequate description of the spectrum of patients was not provided (“… newborn infants who underwent tests … as part of their normal care in 6 different hospitals …”), so the effect of spectrum bias cannot be well estimated.
Both Engle and Rubaltelli used receiver operating characteristic (ROC) curves to describe results, but each in a different way. ROC curves plot sensitivity against 1 − specificity (the false-positive rate) across all possible cutoffs and can be a nice way to visualize the global accuracy of a test. Using routine laboratory-measured serum bilirubin as the “gold standard” and cutpoint values (outcomes of interest) of >10 mg/dL and >15 mg/dL, Engle et al used the curves to downplay the predictive accuracy of the transcutaneous bilirubin test. Thus, while Engle finds the BiliCheck (Respironics, Marietta, GA) to have the global capability to distinguish populations of hyperbilirubinemic from nonhyperbilirubinemic infants, the test lacks the ability to sufficiently estimate the probability of hyperbilirubinemia in individuals. Rubaltelli used the ROC curves to compare 2 diagnostic tests (standard laboratory-measured serum bilirubin and transcutaneous bilirubin) with the gold standard of serum bilirubin as measured by high-performance liquid chromatography (HPLC). Rubaltelli et al conclude (appropriately) that both standard lab measures of bilirubin and the BiliCheck are prone to similar error; that is, you can “take your pick.” These investigators also point out that the true outcome of interest is neither “blood bilirubin” nor “skin bilirubin” but “brain bilirubin.” Whether skin/tissue bilirubin or blood bilirubin is a better predictor of brain bilirubin is certainly a question worth answering.
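For readers less familiar with how an ROC curve is built, the following is a minimal sketch. The paired readings are hypothetical and invented for illustration: `tcb` stands for transcutaneous readings in mg/dL, and `serum_high` marks whether the reference serum value exceeded the cutpoint of interest. The functions `roc_points` and `area_under_curve` are assumptions of this sketch, not the method of either study.

```python
def roc_points(scores, labels):
    """Return (false-positive rate, sensitivity) pairs, one per cutoff.

    A reading counts as test-positive when it is >= the cutoff.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cutoff above every reading: nothing is positive
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical paired measurements, for illustration only.
tcb = [6.1, 7.4, 8.0, 9.2, 9.8, 10.5, 11.0, 11.9, 12.6, 14.3]
serum_high = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

points = roc_points(tcb, serum_high)
print(f"area under the curve: {area_under_curve(points):.2f}")
```

The area summarizes how well the test separates the two groups overall; it says nothing directly about how close an individual reading is to that infant's serum value, which is the distinction drawn above.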
Decision analysis tells us that to make a clinical decision one needs 2 basic pieces of information. The first is a probability estimate of outcomes, and the second is a utility or value assigned to outcomes. That is, knowing not only “What is the likelihood that serum bilirubin is >10 mg/dL at 36 hours?” but also, “What is the value of knowing that the bilirubin is >10 mg/dL at 36 hours?” As it stands now, the poor clinician is told by some that transcutaneous bilirubin values cannot be of worth because they may not accurately predict serum bilirubin values greater than “X” (your important level here), which in turn may or may not accurately identify the infant whose bilirubin value is likely to surpass “Y” (your important action threshold value here), which may or may not accurately predict the “gold standard” HPLC value “Z,” which may or may not accurately predict the brain bilirubin content, and so on. Studies such as those described here help to clear up this confusion by refining estimates of probabilities of outcomes, but they provide no help in assigning utility to the outcomes. Thus, the situation is even “fuzzier” than it appears. Astute clinicians know this and factor in their own values; hence the remarkable variation in their decision-making behaviors.
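A minimal sketch of how the two pieces of information combine: each action is scored by its expected utility, the probability-weighted average of the utilities of its possible outcomes. Every number below is hypothetical, as are the action names (`retest`, `observe`) and the two clinicians; utilities are on an arbitrary 0 to 1 scale chosen for illustration.

```python
def expected_utility(p_high, utility_if_high, utility_if_not_high):
    """Probability-weighted average of the utilities of the two outcomes."""
    return p_high * utility_if_high + (1 - p_high) * utility_if_not_high


def best_action(p_high, utilities):
    """Pick the action with the highest expected utility.

    `utilities` maps action -> (utility if bilirubin is high,
                                utility if bilirubin is not high).
    """
    return max(utilities, key=lambda a: expected_utility(p_high, *utilities[a]))


# Same probability estimate, different (hypothetical) values placed on outcomes.
p_high = 0.10
clinician_a = {"retest": (0.95, 0.90), "observe": (0.20, 1.00)}
clinician_b = {"retest": (0.95, 0.97), "observe": (0.20, 1.00)}

print("clinician A chooses:", best_action(p_high, clinician_a))
print("clinician B chooses:", best_action(p_high, clinician_b))
```

In this invented example the two clinicians share the same probability estimate yet choose differently, because they weigh the burden of an unnecessary retest differently.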
In the case of transcutaneous bilirubinometry, what can one say? At present, the BiliCheck seems to have useful diagnostic accuracy when used as a mass-screening device (à la Bhutani et al) in the normal newborn nursery. It can help establish a risk estimate and answer the question, “Should I worry about this infant?” If this is the way one practices, then this might be the right job for the tool. The findings of Engle et al suggest that in infants referred for additional testing, the device cannot answer the question “How high is the serum bilirubin?” with an acceptable degree of diagnostic accuracy. This is the wrong job for the tool.
- Received April 29, 2002.
- Accepted April 29, 2002.
- Reprint requests to (R.E.S.) F5790 Mott Hospital, 1500 E Medical Center Dr, University of Michigan Health Systems, Ann Arbor, MI 48109-0254. E-mail:
- Engle WD, Jackson GL, Sendelbach D, Manning D, Frawley WH. Assessment of a transcutaneous device in the evaluation of neonatal hyperbilirubinemia in a primarily Hispanic population. Pediatrics. 2002;110:61–67
- Bhutani VK, Gourley GR, Adler S, Kreamer B, Dalin C, Johnson LH. Noninvasive measurement of total serum bilirubin in a multiracial predischarge newborn population to assess the risk of severe hyperbilirubinemia. Pediatrics. 2000;106(2). Available at: www.pediatrics.org/cgi/content/full/106/2/e17
- Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med. 1978;299(17):926–930
- Reid MC, Lachs MS, Feinstein AR. Use of methodological standards in diagnostic test research. Getting better but still not good. JAMA. 1995;274(8):645–651
- Irwig L, Bossuyt P, Glasziou P, Gatsonis C, Lijmer J. Designing studies to ensure that estimates of test accuracy are transferable. BMJ. 2002;324:669–671
- Knottnerus JA, Leffers P. The influence of referral patterns on the characteristics of diagnostic tests. J Clin Epidemiol. 1992;45(10):1143–1154
- Sackett DL, Haynes RB. The architecture of diagnostic research. BMJ. 2002;324:539–541
- Rubaltelli FF, Gourley GR, Loskamp N, et al. Transcutaneous bilirubin measurement: a multicenter evaluation of a new device. Pediatrics. 2001;107(6):1264–1271
- Copyright © 2002 by the American Academy of Pediatrics