eLetters is an online forum for ongoing peer review. To submit an eLetter, please go to the article you wish to respond to and click on the link that reads "eLetters: Submit a Response." Submission of eLetters is open to all health care professionals and experts in related fields.

eLetters to:

ARTICLES:
Kevin Marks, Hollie Hix-Small, Kathy Clark, and Judy Newman
Lowering Developmental Screening Thresholds and Raising Quality Improvement for Preterm Children
Pediatrics 2009; 123: 1516-1523

eLetters published:

Interpreting measures of agreement
Bonnie W. Camp, James R. Murphy   (11 September 2009)

Bonnie W. Camp,
Professor emeritus
University of Colorado School of Medicine,
James R. Murphy

campbw{at}msn.com Bonnie W. Camp, et al.

Interpreting measures of agreement

This paper reports a secondary analysis of a very important data set describing the outcome at age three to five years of 1393 term and 63 preterm children referred for evaluation of developmental delay based on screening by the clinician impression (PDI) and the age-appropriate ASQ at 12 and 24 months of age. The authors report that the agreement between PDI and ASQ for terms (82.4%) was significantly higher than for preterms (66.4%) and later note that "52.9% of referred preterm cases would not have promptly occurred without the ASQ." The authors interpret these differences as evidence that clinicians are less aware of delay in preterms than in terms.

This is an important finding if substantiated. However, this type of interpretation requires information usually presented in 2x2 tables to rule out plausible rival interpretations. No relevant 2x2 tables are presented in the paper, and none can be calculated from the data provided. Presentation of results is confusing, important information is sometimes missing or hard to find, and numbers sometimes differ among the narrative, the table and the figure.

Without information for the 2x2 tables, there is no way for the reader to determine whether the observed difference truly reflects the pediatricians' failure to recognize cases, or whether the difference is an artifact resulting from the high false positive rate with the ASQ or, more likely, an artifact of the difference in the chance of disagreement between the term and preterm groups. Methodological discussions of measuring agreement for categorical data(1,2) note that analysis of raw agreement is inappropriate and that whatever measure is used should be accompanied by presentation of the 2x2 tables of data. The most common method recommended for this analysis is Cohen’s kappa, which corrects for chance. The chance of agreement differs between terms and preterms: 23% (14/65) of preterms were verified as EI eligible for service or monitoring vs. 12% (159/1363) of terms. (These numbers increase to 32% and 14%, respectively, when adjusted for verification bias.) A significant difference between the kappas would support the authors’ interpretation. However, one would still need to look at the 2x2 tables because kappa can be misleading when the distribution is skewed, as seems likely with 86-88% of the terms likely to be in one category. Meade et al.(3) describe the problem as follows: “…when the proportion of positive ratings is extreme, the possible agreement above chance agreement is small and it is difficult to achieve even a moderate value of kappa.” Meade et al.(3) and McGinn et al.(2) also describe the use of φ (phi), which is a chance-independent measure of agreement based on the odds ratio. This measure is not influenced by extremes in the distribution of positive and negative results and has several mathematical advantages over kappa. Regardless of the method used, the 2x2 tables need to be presented to support the interpretation.
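To make the distinction among these measures concrete, the following is a minimal sketch in Python, not an analysis of the study data. The two 2x2 tables are hypothetical counts chosen only for illustration, the function names are our own, and phi is assumed to be computed from the odds ratio as (sqrt(OR) - 1)/(sqrt(OR) + 1), which is our reading of the odds-ratio-based measure cited above. The helper chance_agreement further assumes, purely for illustration, that both raters rate cases positive at the same rate.

```python
from math import sqrt


def agreement_stats(a, b, c, d):
    """Agreement measures for a 2x2 table comparing two raters.

    a: both positive          b: rater 1 positive only
    c: rater 2 positive only  d: both negative
    """
    n = a + b + c + d
    raw = (a + d) / n                      # raw (observed) agreement
    r1_pos = (a + b) / n                   # rater 1 positive rate
    r2_pos = (a + c) / n                   # rater 2 positive rate
    # Agreement expected by chance from the marginal rates alone
    chance = r1_pos * r2_pos + (1 - r1_pos) * (1 - r2_pos)
    kappa = (raw - chance) / (1 - chance)  # Cohen's kappa
    odds_ratio = (a * d) / (b * c)         # undefined if b or c is zero
    # Assumed form of the odds-ratio-based phi (see lead-in)
    phi = (sqrt(odds_ratio) - 1) / (sqrt(odds_ratio) + 1)
    return {"raw": raw, "chance": chance, "kappa": kappa, "phi": phi}


def chance_agreement(pos_rate_1, pos_rate_2):
    """Chance agreement implied by two raters' positive rates."""
    return pos_rate_1 * pos_rate_2 + (1 - pos_rate_1) * (1 - pos_rate_2)


# Hypothetical tables with the same odds ratio (16) but different skew
balanced = agreement_stats(40, 10, 10, 40)
skewed = agreement_stats(8, 8, 8, 128)

for name, stats in (("balanced", balanced), ("skewed", skewed)):
    print(name, {key: round(value, 3) for key, value in stats.items()})

# Chance agreement if both raters were positive at the verified rates
# quoted above (23% for preterms, 12% for terms); an assumption only.
print(round(chance_agreement(0.23, 0.23), 4))  # 0.6458
print(round(chance_agreement(0.12, 0.12), 4))  # 0.7888
```

In this hypothetical pair, the skewed table has the higher raw agreement (about 0.89 vs. 0.80) but the lower kappa (about 0.44 vs. 0.60), while phi is 0.6 for both. Which pattern applies to the term and preterm groups cannot be determined without the actual 2x2 tables.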

One should also consider how the difference in verification rate for the PDI and the ASQ contributed to differences in the agreement rate between term and preterm. In the original publication(4), referrals were verified 96% of the time for the PDI and 71% of the time for the ASQ. This information is not explicit in the present article but raises the possibility that disagreement between the two measures reflects the greater inaccuracy in the ASQ.

Another problem occurs with the apparent disparagement of the clinicians’ decisions. Anecdotal data in this article elaborate on the 4 or 5 (specific number is not clear) of the preterms whose “disorder” was “missed by the board certified pediatrician at 12 or 24 months” but identified by the ASQ so that referral “more promptly occurred,” as judged by status assessed from 1 to 4 years later. It is difficult to assess this kind of statement when data are presented with confusing (Table 1) or unclear (Figure 1) labels; however, it appears, at least in Figure 1, that referrals made by the PDI alone are reflected in the numbers for referrals made when the ASQ was not returned. When the ASQ was returned, 40% (17/43) of preterms were referred and 82% (14/17, adjusted) were verified; when the ASQ was not returned, 32% (7/22) of preterms were referred (presumably by the PDI alone) and 86% (6/7, adjusted) were verified. Among terms, 22% (161/733) were referred when the ASQ was returned and the verification rate was 66% (106/161, adjusted), whereas 19% (122/630) were referred by the PDI alone with a verification rate of 70% (86/122, adjusted). Viewed from this perspective, the results are much more balanced than one would expect from the narrative. What useful purpose to science is served by such one-sided reporting? Surely, the most important point is to find the best, most feasible combination of procedures to identify children early. One of the most important findings reported in this article is the information that even “low-risk” preterms are recognized to have a significantly higher rate of qualifying for developmental services than term infants as early as 12 and 24 months. What is not clear is how the other information can be interpreted to guide the most appropriate use of resources available. This is impossible without clearer and more complete presentation of the data, such as appropriate 2x2 tables and rates of verification for PDI alone, ASQ alone and the combination of ASQ and PDI.

Bonnie W. Camp, MD, PhD
Professor Emeritus of Pediatrics and Psychiatry
University of Colorado School of Medicine

James R. Murphy, PhD
Professor of Biostatistics
Head, Division of Biostatistics and Bioinformatics
National Jewish Health
and Adjunct Professor, Colorado School of Public Health

Reference List

(1) Altman DG. Some common problems in medical research. Practical statistics for medical research. New York, NY: Chapman and Hall; 1991. p. 396-438.

(2) McGinn T, Guyatt G, Cook R, Korenstein D, Meade M. Measuring agreement beyond chance. In: Guyatt G, Rennie D, Meade MO, Cook DJ, editors. Users' Guides to the Medical Literature. 2nd ed. Chicago, IL: McGraw-Hill Professional; 2008. p. 481-9.

(3) Meade MO, Cook RJ, Guyatt G, Groll R, Kachura JR, Bedard M, et al. Interobserver variation in interpreting chest radiographs for the diagnosis of acute respiratory distress syndrome. Am J Resp Crit Care Med 2000;161:85-90.

(4) Hix-Small H, Marks K, Squires J, Nickel R. Impact of implementing developmental screening at 12 and 24 months in a pediatric practice. Pediatrics 2007 Aug;120(2):381-9.

Conflict of Interest:

None declared