eLetters to:
- ELECTRONIC ARTICLE:
Michael J. Vincer, Heather Cake, Michael Graven, Linda Dodds, Shelly McHugh, and Theresa Fraboni
- A Population-Based Study to Determine the Performance of the Cognitive Adaptive Test/Clinical Linguistic and Auditory Milestone Scale to Predict the Mental Developmental Index at 18 Months on the Bayley Scales of Infant Development-II in Very Preterm Infants
Pediatrics 2005; 116: e864-e867
eLetters published:
-
Assessing bias in validity studies of developmental screening tests
- Bonnie W. Camp
(3 August 2006)
Bonnie W. Camp, Professor Emeritus, University of Colorado School of Medicine
I would like to commend Dr. Vincer and colleagues and the editors of Pediatrics for a model report of a developmental screening test validity study [1]. The completeness of the report is very helpful for the reader who wishes to evaluate a validity study for potential bias, and it offers a rare opportunity to apply the Standards for Reporting of Diagnostic Accuracy (STARD) [2] guidelines to a published study.
The STARD guidelines were designed to help researchers and editors know what information needs to be included in reports of validity studies so that readers can assess the utility of the test and identify sources of bias that threaten the reliability of the conclusions and the generalizability of the results. Five major sources of bias that seem particularly applicable to evaluating the accuracy of developmental screening tests are the following: uncertainty in estimates of sensitivity and specificity, verification bias, inappropriate test bias, bias in conduct of the study, and spectrum bias.
Uncertainty in estimates of prevalence, sensitivity and specificity is primarily a function of sample size. Altman [3] recommends a minimum sample size of 200 for setting reference standards, and this should apply equally well to validity studies of screening tests. Estimating sensitivity and specificity from unacceptably small samples is especially egregious in the developmental/behavioral screening literature. This is particularly problematic because sensitivity is based only on the number of children who are positive on the reference test, not on the whole sample. In some reports this number has been as low as 4, which is too small even to calculate the 95% confidence interval by conventional methods.
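As a minimal sketch of why the number of reference-test positives matters, the following Python computes the conventional (normal-approximation, or Wald) 95% interval for sensitivity alongside the Wilson score interval. The counts are invented for illustration and are not taken from any cited study; the function names are hypothetical.

```python
import math

Z95 = 1.96  # two-sided 95% normal quantile (rounded)

def wald_interval(hits, n, z=Z95):
    """Conventional normal-approximation interval for a proportion."""
    p = hits / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_interval(hits, n, z=Z95):
    """Wilson score interval, which stays inside [0, 1] even for small n."""
    p = hits / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical counts: the same observed sensitivity (75%) estimated
# from different numbers of reference-test positives.
for hits, n in [(3, 4), (15, 20), (75, 100)]:
    lo_w, hi_w = wald_interval(hits, n)
    lo_s, hi_s = wilson_interval(hits, n)
    print(f"positives={n:3d}  Wald {lo_w:.2f} to {hi_w:.2f}   "
          f"Wilson {lo_s:.2f} to {hi_s:.2f}")
```

With only four reference positives the conventional interval runs past 100%, which is one sense in which it cannot sensibly be calculated; the interval narrows as the number of positives grows.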
Verification bias can result when, by design or accident, a non-random sample receives the reference test or gold standard. A survey of diagnostic tests reported in the pediatric literature over a 3-year period found that 36% of the studies were subject to verification bias [4]. Bias from this source is usually examined by reviewing participant flow through the study.
Bias from use of inappropriate tests may occur because of problems with either the screening test or the reference test. It is especially important in developmental screening, where secular changes in the age of achieving milestones (the “Flynn” effect) affect results obtained with tests that have outdated norms.
There are many ways in which conduct of the study can lead to bias. Two that are often overlooked are review bias, resulting from lack of blinding or from incorporation of information from one test into the other, and “construct irrelevant” [5] bias, resulting from factors such as fatigue, stranger anxiety, or previous experience with either test.
Spectrum bias results when one or more characteristics of the sample affect estimates of accuracy. The most common form occurs when there is a “limited challenge” resulting from failure to include a broad range of disease severity. Spectrum bias is also present when there are comorbid conditions that increase the number of false negatives and false positives [2] and when results depend on factors that are specific to the setting in which the study is done. Failure to recognize the “setting specific issues” that lead to spectrum bias has been cited as a major weakness of many new diagnostic tests [6].
Vincer et al. reported a validity study of the Cognitive Adaptive Test/Clinical Linguistic and Auditory Milestone Scale (C/C) developmental screening test in identifying children who obtained a Mental Developmental Index (MDI) of <70 on the Bayley Scales of Infant Development-II (BSID-II) at 18 months. Their sample of 147 children with complete data represented 85% of the surviving infants born at <31 weeks gestation in Nova Scotia and Prince Edward Island during the time period selected. Even though the total number tested was slightly lower than optimal, the number of reference test positives was large enough for the 95% confidence interval for sensitivity to be narrower than ±0.13. Their accounting for children lost to follow-up was satisfactory enough to rule out verification bias. They diminished the inappropriate test bias by using an up-to-date reference test, discussing and adjusting for the out-of-date norms on the screening test, and giving a convincing and informative discussion of the rationale for the choice of cut-off points. Incorporation bias was present in that a number of items appear on both the screening test and the reference test. Some test review bias was also present because physicians administering the C/C in 14% (20) of the cases knew the results of the BSID-II before administering the screening test. The authors noted that these latter two sources of bias did not change the results, although, when present, they have been noted to inflate sensitivity. Furthermore, the estimates of utility went beyond reports of sensitivity, specificity, ROC analysis and computation of likelihood ratios: the authors also computed posterior probabilities and related them to the expected prevalence (as opposed to the study prevalence) based on previous observations in the geographical area covered.
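The step from sensitivity and specificity to posterior probabilities can be sketched in Python using likelihood ratios and Bayes' theorem. This is an illustration of the general calculation, not a reconstruction of the authors' analysis: the sensitivity, specificity and both prevalence values below are placeholder assumptions, and the function name is hypothetical.

```python
def posterior_probability(prevalence, sensitivity, specificity, test_positive=True):
    """Probability of the condition after a screening result (Bayes' theorem)."""
    if test_positive:
        likelihood_ratio = sensitivity / (1 - specificity)   # LR+
    else:
        likelihood_ratio = (1 - sensitivity) / specificity   # LR-
    prior_odds = prevalence / (1 - prevalence)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Placeholder values, assumed for illustration only.
sensitivity, specificity = 0.90, 0.90
scenarios = [("study prevalence", 0.20), ("expected prevalence", 0.10)]

for label, prevalence in scenarios:
    after_fail = posterior_probability(prevalence, sensitivity, specificity, True)
    after_pass = posterior_probability(prevalence, sensitivity, specificity, False)
    print(f"{label} {prevalence:.0%}: "
          f"P(delay | failed screen) = {after_fail:.2f}, "
          f"P(delay | passed screen) = {after_pass:.3f}")
```

The same test yields a lower probability of delay after a failed screen when the expected prevalence is lower than the study prevalence, which is why relating posterior probabilities to the prevalence expected in practice is informative.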
With the exception of spectrum bias, the study appears to have successfully addressed most of the sources of bias that render validity studies of screening tests unusable. Some type of spectrum bias is present because Vincer et al. restricted their sample to children born at <31 weeks gestation. If this were the extent of spectrum bias, however, one would expect results of the Vincer et al. study to be comparable to those of other recent studies of neonatal follow-up using the same screening test and reference test with comparable cut-off scores. Yet even among such studies there are still large differences in estimates of sensitivity. Some of the other studies cannot be evaluated because of incompleteness in reports. Others also differ significantly in prevalence and other important aspects.
One study, that of Macias et al. [7], is similar to Vincer et al. in prevalence (17% observed in Vincer et al. and 14% in Macias et al.) but yields a much lower estimate of sensitivity: 36% compared to 88% in the Vincer et al. study. The Macias et al. study has a small sample, but the difference in sensitivity is so great that the 95% confidence intervals for sensitivity in these two studies do not even overlap (Vincer et al., 0.75-1.01; Macias et al., 0.06-0.64). Finding a difference only in sensitivity is a little unusual and suggests that some factor or factors are either (1) inflating sensitivity in the Vincer et al. study or (2) predisposing children in the Macias et al. study to pass the screening test without affecting the reference test. A closer comparison of these two studies might help to clarify how such a discrepancy could arise and how it affects the meaningfulness of the study results.
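The non-overlap of the two intervals can be checked with a few lines of Python. The counts below are assumptions chosen to be roughly consistent with the percentages quoted above (about 25 reference positives for Vincer et al. and about 11 for Macias et al.); they are not taken from either paper, so the computed bounds only approximate the reported ones.

```python
import math

def wald_95(hits, n):
    """Normal-approximation 95% interval for a proportion."""
    p = hits / n
    half = 1.96 * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def intervals_overlap(a, b):
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Assumed counts (hypothetical, for illustration only).
vincer = wald_95(22, 25)   # 22/25 = 0.88 sensitivity
macias = wald_95(4, 11)    # 4/11 is about 0.36 sensitivity

print(f"Vincer et al.: {vincer[0]:.2f} to {vincer[1]:.2f}")
print(f"Macias et al.: {macias[0]:.2f} to {macias[1]:.2f}")
print("Intervals overlap:", intervals_overlap(vincer, macias))
```

Under these assumed counts the upper bound for the higher sensitivity exceeds 1, as in the reported interval of 0.75-1.01. That is a known limitation of the normal approximation near 100% rather than a sign of a calculation error.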
The Macias et al. study was performed in a neonatal follow-up clinic serving both low birth weight infants and infants who had had other neonatal problems. Enough details are provided to indicate that verification bias was minimal, and adjustment was made for out-of-date norms on the C/C, though the choice of cut-off points in the two studies was slightly different. In the Vincer et al. study the cut-off on the C/C was slightly higher (DQ<83) than in the Macias et al. study (DQ<80), and the rationale for selection was different. For this difference to account for the difference in sensitivity, however, most of the children with MDI<70 in the Macias et al. study would have had to cluster in the range of DQ 80-83, which seems unlikely, though ultimately observable.
Differences in conduct of the study could also have produced different results. In the Macias et al. study, incorporation bias is present, as in the Vincer et al. study, but test review bias is avoided: all psychologists administering the reference test were blind to the results of the screening test. Some type of construct irrelevant variance [5] may also have been present in the Macias et al. study, where children were given three tests in one visit with the BSID-II always last. Mounting fatigue, uncooperativeness and inattention can result in poor performance on the last test given. But this would lead to increased prevalence and not just decreased sensitivity. Uncertainty in the estimates resulting from a small sample size is clearly a problem in the Macias et al. study; but, as noted previously, the differences between the two studies remain after taking this into account.
This leaves some aspect of spectrum bias, unrelated to the neonatal characteristics of the samples, as the more serious source of discrepancy between the two studies. The Vincer et al. sample was 97% white and 32% lower socioeconomic status (LSES), with 85% of the mothers having education of 12+ years and an average age of 29 years. All were born at <31 weeks gestation and were tested at 18 months of age. The Macias et al. sample was 62% African-American and 65% lower socioeconomic status, with ~50% of the mothers having education of 12+ years. Maternal age was not reported. They included non-premature children with perinatal insults as well as prematures and reported an age range of 6-24 months, with half of the children below 12 months of age.
Thus the two study populations differed both in age range at follow-up and in socio-demographic characteristics. Two possible explanations will be considered for why these differences resulted only in more “false negatives” in the Macias et al. study, thereby affecting only sensitivity.
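To make the arithmetic concrete, here is a small Python sketch with an invented 2x2 table. The counts are hypothetical and are not data from either study. Reclassifying some true positives as false negatives lowers sensitivity while leaving prevalence and specificity untouched, which is the pattern described above.

```python
from dataclasses import dataclass

@dataclass
class TwoByTwo:
    tp: int  # failed screen, positive on reference test
    fn: int  # passed screen, positive on reference test
    fp: int  # failed screen, negative on reference test
    tn: int  # passed screen, negative on reference test

    @property
    def prevalence(self):
        return (self.tp + self.fn) / (self.tp + self.fn + self.fp + self.tn)

    @property
    def sensitivity(self):
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self):
        return self.tn / (self.tn + self.fp)

# Hypothetical tables: ten children move from "true positive" to "false negative".
baseline = TwoByTwo(tp=18, fn=2, fp=10, tn=70)
shifted = TwoByTwo(tp=8, fn=12, fp=10, tn=70)

for name, table in [("baseline", baseline), ("more false negatives", shifted)]:
    print(f"{name}: prevalence={table.prevalence:.2f}, "
          f"sensitivity={table.sensitivity:.2f}, "
          f"specificity={table.specificity:.3f}")
```

By contrast, anything that depresses scores on the reference test would move children between the reference-negative and reference-positive rows and so change prevalence as well.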
One possible explanation is that the meaning of ratio DQs (used by the C/C) varies with age, and age has a much wider range (6-24 months) in the Macias et al. study. Goddard invented the ratio IQ, (Mental Age/Chronological Age) x 100, when he introduced the Binet test to the USA in 1905. It was used uncritically in the early IQ tests until evidence mounted of wide variation in the standard deviation of ratio IQs at different ages, and it became evident that a ratio IQ did not mean the same thing at different ages. Although most well-designed psychological tests have replaced ratio IQs with standard scores based on (Score - Mean)/SD, this problem has been largely overlooked in the developmental screening literature. Vincer et al. illustrate this problem to some extent with their ROC analyses at ages below 18 months, where the best cut-off on the C/C varied from 109 at 4 months to 98 at 8 months and 81 at 12 months. Using the same cut-off score at all ages in a study collapsing across ages 6-24 months, as in the Macias et al. study, might have resulted in more false negatives at some but not all ages.
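A short Python sketch of the ratio-quotient problem follows. The age-specific standard deviations are made-up numbers used only to show the mechanism; they are not norms for the C/C or any other test, and the function names are hypothetical.

```python
def ratio_quotient(developmental_age_months, chronological_age_months):
    """Ratio DQ/IQ: (developmental age / chronological age) x 100."""
    return 100 * developmental_age_months / chronological_age_months

def z_score(quotient, mean, sd):
    """Position of a quotient relative to same-age peers, in SD units."""
    return (quotient - mean) / sd

# Hypothetical standard deviations of the ratio quotient at each age (months).
HYPOTHETICAL_SD_BY_AGE = {6: 20.0, 12: 15.0, 24: 10.0}

cutoff = 80  # one fixed ratio-quotient cut-off applied at every age
for age, sd in HYPOTHETICAL_SD_BY_AGE.items():
    z = z_score(cutoff, mean=100.0, sd=sd)
    print(f"age {age:2d} months: DQ {cutoff} lies {z:+.2f} SD from the mean")

# A 12-month-old performing at a 9-month level has a ratio quotient of 75.
print(ratio_quotient(9, 12))
```

If the spread of ratio quotients shrinks with age, as assumed here, a single cut-off flags a mild deficit at 6 months but only a marked one at 24 months, so the same rule misses different proportions of delayed children at different ages.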
Another possible explanation is that one or more cultural or genetic factors (“comorbid conditions”) predisposing to acceleration in the age of developing some skills are present in the Macias et al. study population and that these skills affect the screening test more than the reference test. Accelerated development of selected skills in infancy has been reported in a variety of cultural and ethnic groups. For example, Frankenburg et al. [8] found that under 20 months of age, Anglo children from unskilled worker families were significantly more likely to pass 39 of the items on the Denver Developmental Screening Test (DDST) than Anglo children in the cross-sectional comparison group. Variations in age of walking have also been reported across different countries and racial groups [9-11].
But even these factors would not differentially affect only the screening test unless the skills in question were emphasized more on the screening test than on the reference test. This is, of course, possible. Despite significant overlap in items between the C/C and the BSID-II, the BSID-II is not only longer and more thorough; it also contains some assessment, at younger ages, of the more powerful, yet more difficult to assess, infant predictors of later development such as attention and novelty preference.
Pewsner et al. [12], citing a review of 27 studies of the “Ottawa ankle rules” (used in deciding on the need for an X-ray), note that in many, but not all, settings, “this decision can indeed, exclude fractures and reduce the number of unnecessary radiographs.” However, they also stress the fact that results may not always be transferable to other populations and settings and recommend that information required to judge transferability and applicability should be provided in reports of validity studies.
The present study presents convincing evidence that the C/C can be a “useful decision aid” for identifying 12- and 18-month-old Nova Scotia prematures who should be referred for assessment of developmental delay. Perhaps generalization to an equally homogeneous white population in the United States and elsewhere would yield similar results. Until the specific features of the population that contribute to this result are better understood, however, further generalization is risky. Issues regarding choice of cut-off points and use of ratio DQs across several ages are technical and can be easily addressed. The question of how cultural/genetic diversity may lead to a mismatch between screening test and reference test is thornier. We cannot begin to understand the problem of spectrum bias, however, until all reports of validity studies are as complete as that of Vincer et al.
Reference List
1. Vincer MJ, Cake H, Graven M, et al. A Population-Based Study to Determine the Performance of the Cognitive Adaptive Test/Clinical Linguistic and Auditory Milestone Scale to Predict the Mental Developmental Index at 18 Months on the Bayley Scales of Infant Development-II in Very Preterm Infants. Pediatrics 2005;116:e864-e867.
2. Bossuyt PM, Reitsma JB, Bruns DE, et al. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Annals of Internal Medicine 2003;138:1-12.
3. Altman DG. Some common problems in medical research. In: Altman DG. Practical Statistics for Medical Research. New York, NY: Chapman and Hall; 1991:396-438.
4. Bates AS, Margolis PA, Evans AT. Verification bias in pediatric studies evaluating diagnostic tests. J Pediatr 1993;122:585-590.
5. Messick S. Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. Am Psychol 1995;50:741-749.
6. Delaney B, Wilson S, Fitzmaurice D, et al. Near-patient tests in primary care: setting the standards for evaluation. Journal of Health Services & Research Policy 2000;5:37-41.
7. Macias MM, Saylor CF, Greer MK, et al. Infant screening: the usefulness of the Bayley Infant Neurodevelopmental Screener and the Clinical Adaptive Test/Clinical Linguistic Auditory Milestone Scale. Journal of Developmental & Behavioral Pediatrics 1998;19:155-161.
8. Frankenburg WK, Dick NP, Carland J. Development of preschool-aged children of different social and ethnic groups: implications for developmental screening. Journal of Pediatrics 1975;87:125-132.
9. Dennis W, Dennis MG. The effect of cradling practices upon the onset of walking in Hopi children. J Genet Psychol 1991:563-572.
10. Grantham-McGregor SM, Back EH. Gross motor development in Jamaican infants. Developmental Medicine & Child Neurology 1971;13:79-87.
11. Hindley CB. Growing up in five countries: a comparison of data on weaning, elimination training, age of walking and IQ in relation to social class from European longitudinal studies. Developmental Medicine & Child Neurology 1968;10:715-724.
12. Pewsner D, Battaglia M, Minder C, et al. Ruling a diagnosis in or out with "SpPIn" and "SnNOut": a note of caution. BMJ 2004;329:209-213.
Conflict of Interest:
None declared