Interpreting measures of agreement
This paper reports a secondary analysis of a very important data set
describing the outcome at age three to five years of 1393 term and 63
preterm children referred for evaluation of developmental delay based on
screening by clinician impression (PDI) and the age-appropriate ASQ
at 12 and 24 months of age. The authors report that the agreement between
PDI and ASQ for terms (82.4%) was significantly higher than for pre-terms
(66.4%) and later note that "52.9% of referred preterm cases would not
have promptly occurred without the ASQ." The authors interpret these
differences as evidence that clinicians are less aware of delay in
preterms than in terms.
This is an important finding if substantiated. However, this type of
interpretation requires information usually presented in 2x2 tables to
rule out plausible rival interpretations. No relevant 2x2 tables are
presented in the paper, and none can be calculated from the data given.
The presentation of results is confusing, important information is
sometimes missing or hard to find, and numbers sometimes differ among the
narrative, the table and the figure.
Without information for the 2x2 tables, there is no way for the
reader to determine whether the observed difference truly reflects the
pediatricians' failure to recognize cases, or whether it is an artifact
of the high false positive rate with the ASQ, or, more likely, an artifact
of the difference in the chance of disagreement between the term and
preterm groups.
Methodological discussions of measuring agreement for categorical
data(1;2) note that analysis of raw agreement is inappropriate and that
whatever measure is used should be accompanied by presentation of the 2x2
tables of data.
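The dependence of raw agreement on base rates can be illustrated with a minimal sketch. It assumes, purely for illustration, that two screening methods flag children independently of one another and each at the same base rate. The function name and the independence assumption are illustrative and do not come from the paper; the 23% and 12% figures are the eligibility rates quoted in this letter, used here only as stand-in base rates.

```python
def chance_agreement(p_positive: float) -> float:
    """Expected raw agreement between two raters who act independently,
    each calling a case positive with probability p_positive.
    (Assumption: identical, independent marginal rates.)"""
    both_positive = p_positive ** 2
    both_negative = (1 - p_positive) ** 2
    return both_positive + both_negative


# Stand-in base rates taken from the eligibility figures quoted in the letter.
for label, rate in [("preterm", 0.23), ("term", 0.12)]:
    print(f"{label}: chance agreement = {chance_agreement(rate):.3f}")
```

Under these assumptions, the raw agreement expected by chance alone is about 0.65 at a 23% base rate and about 0.79 at a 12% base rate, so a gap in raw agreement between two groups can arise without any difference in clinician performance.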
The most common method recommended for this analysis is Cohen’s kappa,
which corrects for chance. The chance of agreement differs between terms
and preterms because 23% (14/65) of preterms were verified as EI eligible
for service or monitoring vs. 12% (159/1363) of terms. (These numbers
increase to 32% and 14%, respectively, when adjusted for verification
bias.) A significant difference between the kappas would support the
authors’ interpretation. However, one would still need to look at the 2x2
tables because kappa can be misleading when the distribution is skewed, as
seems likely here, with 86-88% of the terms expected to fall in one
category. Meade et al.(3) describe the problem as follows: “…when the proportion of
positive ratings is extreme, the possible agreement above chance agreement
is small and it is difficult to achieve even a moderate value of kappa.”
Meade et al.(3) and McGinn et al.(2) also describe the use of phi (φ), which is a
chance-independent measure of agreement based on the odds ratio. This
measure is not influenced by extremes in the distribution of positive and
negative results and has several mathematical advantages over kappa.
Regardless of the method used, the 2x2 tables need to be presented to
support the interpretation.
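As a hedged illustration of why the choice of measure matters, the sketch below computes Cohen's kappa and the odds-ratio-based chance-independent measure (phi) from a 2x2 table. Phi is taken here as (sqrt(OR) - 1) / (sqrt(OR) + 1), with OR the odds ratio of the table. The cell counts are hypothetical, chosen so that both tables have the same odds ratio; they are not data from the paper, and the function names are illustrative.

```python
from math import sqrt


def cohen_kappa(a: int, b: int, c: int, d: int) -> float:
    """Cohen's kappa for a 2x2 table:
    a = both positive, b and c = the two kinds of disagreement, d = both negative."""
    n = a + b + c + d
    observed = (a + d) / n
    # Agreement expected by chance from the row and column totals.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (observed - expected) / (1 - expected)


def phi_agreement(a: int, b: int, c: int, d: int) -> float:
    """Chance-independent agreement based on the odds ratio.
    Undefined when b * c == 0; a real analysis would need a zero-cell rule."""
    root_or = sqrt((a * d) / (b * c))
    return (root_or - 1) / (root_or + 1)


# Hypothetical tables with the same odds ratio (12) but different skew.
balanced = (20, 10, 10, 60)  # 30% rated positive by each method
skewed = (5, 5, 5, 60)       # about 13% rated positive by each method

for name, table in [("balanced", balanced), ("skewed", skewed)]:
    print(f"{name}: kappa = {cohen_kappa(*table):.2f}, "
          f"phi = {phi_agreement(*table):.2f}")
```

With these hypothetical counts, kappa falls from about 0.52 to about 0.42 as the table becomes more skewed, while phi stays at about 0.55. Neither number can be checked by a reader unless the underlying 2x2 table is reported.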
One should also consider how the difference in verification rate for
the PDI and the ASQ contributed to differences in the agreement rate
between term and preterm. In the original publication(4), referrals were
verified 96% of the time for the PDI and 71% of the time for the ASQ. This
information is not explicit in the present article but raises the
possibility that disagreement between the two measures reflects the
greater inaccuracy in the ASQ.
Another problem occurs with the apparent disparagement of the
clinicians’ decisions. Anecdotal data in this article elaborate on the 4
or 5 (specific number is not clear) of the preterms whose “disorder” was
“missed by the board certified pediatrician at 12 or 24 months” but
identified by the ASQ so that referral “more promptly occurred,” as judged
by status assessed from 1 to 4 years later. It is difficult to assess
this kind of statement when data are presented with confusing (Table 1) or
unclear (Figure 1) labels; however, it appears, at least in Figure 1, that
referrals made by the PDI alone are reflected in the numbers for referrals
made when the ASQ was not returned. When the ASQ was returned, 40%
(17/43) of preterms were referred and 82% (14/17, adjusted) were verified;
when the ASQ was not returned, 32% (7/22) of preterms were referred
(presumably by the PDI alone) and 86% (6/7, adjusted) were verified.
Among terms, 22% (161/733) were referred when the ASQ was returned and the
verification rate was 66% (106/161, adjusted) whereas 19% (122/630) were
referred by the PDI alone with a verification rate of 70% (86/122,
adjusted). Viewed from this perspective, the results are much more
balanced than one would expect from the narrative.
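The percentages in this comparison can be rechecked directly from the counts quoted above. The short sketch below does only that arithmetic; the dictionary layout and function name are illustrative, and the counts are exactly those cited in this letter (verification counts as adjusted).

```python
def pct(numerator: int, denominator: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * numerator / denominator)


# (referred, screened) and (verified, referred) as quoted in the letter.
counts = {
    "preterm, ASQ returned": {"referred": (17, 43), "verified": (14, 17)},
    "preterm, ASQ not returned": {"referred": (7, 22), "verified": (6, 7)},
    "term, ASQ returned": {"referred": (161, 733), "verified": (106, 161)},
    "term, ASQ not returned": {"referred": (122, 630), "verified": (86, 122)},
}

for group, c in counts.items():
    print(f"{group}: referred {pct(*c['referred'])}%, "
          f"verified {pct(*c['verified'])}%")
```

Laid out this way, referral and verification rates with and without a returned ASQ differ by only a few percentage points within each group.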
What useful purpose to science is served by such one-sided reporting?
Surely, the most important point is to find the best, most feasible,
combination of procedures to identify children early. One of the most
important findings reported in this article is the information that even
“low-risk” preterms are recognized to have a significantly higher rate of
qualifying for developmental services than term infants as early as 12 and
24 months. What is not clear is how the other information can be
interpreted to guide the most appropriate use of available resources.
Such interpretation is impossible without clearer and more complete
presentation of the data, such as appropriate 2x2 tables and rates of
verification for PDI alone, ASQ alone and the combination of ASQ and PDI.
Bonnie W. Camp, MD, PhD
Professor Emeritus of Pediatrics and Psychiatry
University of Colorado School of Medicine
James R. Murphy, PhD
Professor of Biostatistics
Head, Division of Biostatistics and Bioinformatics
National Jewish Health and
Adjunct Professor Colorado School of Public Health
Reference List
(1) Altman DG. Some common problems in medical research. In: Practical
Statistics for Medical Research. New York, NY: Chapman and Hall; 1991. p.
396-438.
(2) McGinn T, Guyatt G, Cook R, Korenstein D, Meade M. Measuring
agreement beyond chance. In: Guyatt G, Rennie D, Meade MO, Cook DJ,
editors. Users' Guides to the Medical Literature. 2 ed. Chicago, IL:
McGraw-Hill Professional; 2008. p. 481-9.
(3) Meade MO, Cook RJ, Guyatt G, Groll R, Kachura JR, Bedard M, et
al. Interobserver variation in interpreting chest radiographs for the
diagnosis of acute respiratory distress syndrome. Am J Resp Crit Care Med
2000;161:85-90.
(4) Hix-Small H, Marks K, Squires J, Nickel R. Impact of
implementing developmental screening at 12 and 24 months in a pediatric
practice. Pediatrics 2007 Aug;120(2):381-9.
Conflict of Interest:
None declared