OBJECTIVES: The primary objectives were to evaluate the quality of development and reporting of American Academy of Pediatrics (AAP) guidelines and to determine the level of evidence underlying the recommendations.
METHODS: Two reviewers scored each guideline by using the Appraisal of Guidelines for Research and Evaluation II (AGREE-II) instrument and determined the level of evidence for each recommendation in each guideline. Subgroup analyses compared AAP guidelines published before and after key changes in AAP guideline development policy and compared internal with endorsed guidelines.
RESULTS: For the 28 current guidelines, the highest average scores on AGREE-II were in scope and purpose (75%) and clarity of presentation (73%). The lowest average scores were in editorial independence (17%) and applicability (30%). The only domain that improved after AAP policy updates was editorial independence (P = .01). Of the 190 treatment recommendations, 43% were based on experimental studies, 30% on observational studies, and 27% on expert opinion or no reference. Compared with early guidelines, late guidelines included a higher proportion of treatment recommendations based on experimental studies (P = .05).
CONCLUSIONS: There was no clear improvement in the quality of development and reporting of AAP clinical practice guidelines over time. Routine application of AGREE-II to guideline development could enhance guideline quality. The proportion of guideline recommendations based on experimental evidence has increased slightly over time. Pediatric research agendas should be matched to vital gaps in the evidence underlying pediatric guidelines.
- AAP — American Academy of Pediatrics
- AGREE-II — Appraisal of Guidelines for Research and Evaluation II
- CCT — controlled clinical trial
- RCT — randomized controlled trial
What’s Known on This Subject:
In the only previous cross-sectional study, the quality of pediatric guidelines was rated low on the AGREE-II (Appraisal of Guidelines for Research and Evaluation II) scale. The levels of evidence used in pediatric clinical practice guidelines have never been described.
What This Study Adds:
American Academy of Pediatrics guidelines score low on the AGREE-II scale. Approximately one-quarter of recommendations are based on expert opinion or no reference. These findings support the adoption of standards for guideline development and research targeted toward unsupported recommendations.
The implementation of recommendations from high-quality guidelines should lead to improvements in patient care.1–3 Strikingly, “evidence-based” recommendations in important domains of adult medicine (oncology, cardiology, infectious disease, and screening) are based on low levels of evidence.4–6 The method of development and reporting of guidelines also affects quality. Guidelines differ markedly in the process that led to the formulation of the guideline, the depth of description of this process, and the scales used to present the levels of evidence: for example, a 2001 publication described 121 different systems to grade evidence.7
There is limited formal study of the current status of pediatric guideline development. The only previous study that used an objective scoring system to evaluate the quality of formulation of pediatric guidelines included those published up to 2004 and revealed many to be of low quality.8 There is no published study on the level of evidence used to derive the recommendations in pediatric guidelines.
It is important to examine pediatric guidelines directly because their evidence base differs from that of adult guidelines. There is reluctance to test and use pharmacologic interventions in growing, asymptomatic children.9 Other barriers to pediatric research include the complexity of surrogate consent,10 clinical heterogeneity, the limited ability of children to communicate symptoms, and the latency of adverse outcomes.11
The goals of this study were to assess the overall quality of development and reporting of American Academy of Pediatrics (AAP) guidelines, to determine the level of underlying evidence, and to determine if there has been any improvement over time that corresponds to changes in AAP guideline development policy. An additional goal was to compare AAP internal with endorsed guidelines. Our hypothesis was that both guideline quality and the level of supporting evidence would have improved over time.
Selection of Guidelines
We performed a cross-sectional review of the clinical practice guidelines on the AAP Web site on October 1, 2011.12 The review included all clinical practice guidelines developed by the AAP as well as those produced by other organizations and endorsed by the AAP. Clinical reports were excluded because they contain few definite recommendations. AAP technical reports were considered part of clinical practice guidelines only if the guideline explicitly referenced them.
Key AAP policy statements related to guideline development were gathered, and guidelines were divided into those published before key policy statements (pre–policy guidelines) and those published afterward (post–policy guidelines).
The 2 reviewers (A.I. and M.S.) piloted the data extraction form on 1 guideline. They then independently reviewed each guideline.
Quality of Development and Reporting
The Appraisal of Guidelines for Research and Evaluation II (AGREE-II) instrument was used to assess the guidelines13 (Table 1). Scores (from 1 to 7) on each of the 23 AGREE-II items and on overall quality were assigned independently by the 2 reviewers. All scoring discrepancies were then discussed and scores changed accordingly, but no attempt was made to reach consensus. All item scores in a particular domain were then averaged for each guideline, and a percentage score was derived for each domain according to the method specified by the AGREE-II instrument.
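The percentage score for each domain follows the scaled-score formula specified in the AGREE-II manual: the obtained score minus the minimum possible score, divided by the range between the maximum and minimum possible scores. A minimal sketch (the item scores below are illustrative, not taken from any guideline in the study):

```python
# Scaled AGREE-II domain score, per the AGREE-II manual:
# (obtained - min possible) / (max possible - min possible).
# Items are scored 1-7 by each appraiser.

def agree2_domain_percent(scores_by_appraiser):
    """scores_by_appraiser: one list of item scores (1-7) per appraiser."""
    n_appraisers = len(scores_by_appraiser)
    n_items = len(scores_by_appraiser[0])
    obtained = sum(sum(items) for items in scores_by_appraiser)
    min_possible = 1 * n_items * n_appraisers
    max_possible = 7 * n_items * n_appraisers
    return 100 * (obtained - min_possible) / (max_possible - min_possible)

# Two reviewers scoring a hypothetical 3-item domain
print(round(agree2_domain_percent([[5, 6, 6], [5, 5, 7]]), 1))  # 77.8
```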
The reviewers recorded the scale used by the guideline developers to rate the level of supporting evidence (eg, Grading of Recommendations Assessment, Development and Evaluation [GRADE], US Preventive Services Task Force system, previously unpublished scale).
Level of Evidence Behind the Recommendations
For each recommendation in each guideline, the reviewers extracted information about the study designs on which the recommendation was based by using information from the guideline itself and from the abstracts of the original referenced articles. The highest level of evidence on which the recommendation was based was recorded. A recommendation was deemed to have been based on a particular study if the referenced study’s population, intervention, and outcome were applicable to the recommendation. Consensus was reached for each recommendation by discussion, with discrepancies settled by discussion with a third author (J.L.R.). The level-of-evidence categories used, from highest to lowest quality, were as follows: randomized controlled trial (RCT), controlled clinical trial without randomization (CCT), observational cohort study, case-control study, pre-post study, cross-sectional study, time series, case series, expert opinion, and no reference.14 Recommendations were classified as diagnostic or therapeutic and subgrouped into those published before and after the median year of publication.
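The "highest level of evidence" rule above amounts to ranking the cited study designs on the hierarchy from the Methods and keeping the best one. An illustrative sketch (not the authors' code; the example designs are hypothetical):

```python
# Study-design hierarchy from the Methods, highest to lowest quality.
HIERARCHY = [
    "RCT", "CCT", "cohort", "case-control", "pre-post",
    "cross-sectional", "time series", "case series",
    "expert opinion", "no reference",
]
RANK = {design: i for i, design in enumerate(HIERARCHY)}

def highest_level(cited_designs):
    """Return the highest-quality design among the studies cited
    for one recommendation; 'no reference' if nothing is cited."""
    if not cited_designs:
        return "no reference"
    return min(cited_designs, key=RANK.__getitem__)

print(highest_level(["case series", "cohort", "cross-sectional"]))  # cohort
```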
The 2 reviewers’ interrater reliability in assigning AGREE-II scores using the scores recorded before discussion was quantified by using an intraclass correlation coefficient.
Although the AGREE-II manual emphasizes domain scores over total scores, total AGREE-II scores were used as a practical means to compare guidelines. Although AGREE-II does not provide interpretation of specific scores, a score <50% was considered to indicate a deficiency. Spearman’s rank order correlation was used to correlate AGREE-II scores with years of publication. The Wilcoxon rank-sum test was used to compare guideline subgroups in terms of AGREE-II scores and recommendation subgroups in terms of supporting levels of evidence. In addition, a series of pairwise z tests were used to compare subgroups of guidelines in terms of the proportion of recommendations in each category of supporting evidence, as follows: (1) experimental evidence (RCT or CCT), (2) observational evidence (cohort, case-control, pre-post, cross-sectional, time series, case series), or (3) expert opinion or no reference. The same analysis was performed after separating recommendations into diagnosis-based and treatment-based recommendations.
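The pairwise z test above is the standard pooled two-proportion z test. A minimal sketch, using hypothetical counts rather than the study's data:

```python
# Pooled two-proportion z test, as used to compare guideline subgroups
# (e.g., share of recommendations backed by experimental evidence in
# early vs late guidelines). Counts below are illustrative only.
from math import sqrt, erf

def two_proportion_z(x1, n1, x2, n2):
    """Return (z statistic, two-sided P value) for proportions x1/n1 vs x2/n2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided P value from the standard normal CDF (via the error function)
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

z, p = two_proportion_z(30, 196, 60, 198)  # hypothetical counts
print(round(z, 2), round(p, 4))
```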
Analyses were performed by using MS Excel (Microsoft Corporation, Redmond, WA) and SPSS Statistics 19 (IBM SPSS Statistics, IBM Corporation, Armonk, NY).
There were 28 clinical practice guidelines that met the inclusion criteria published from 1997 through 2011 (14 internal and 14 endorsed guidelines). Two key AAP policy statements relating to clinical practice guideline development were identified. One such statement, entitled “Classifying Recommendations for Clinical Practice Guidelines” (AAP Policy Statement 1), was published in 2004,15 and another entitled “Toward Transparent Clinical Policies” (AAP Policy Statement 2) was published in 2008.16
There was strong overall agreement between the 2 raters on AGREE-II scores before discussion, with an intraclass correlation coefficient of 0.77 (95% confidence interval: 0.73, 0.80).
Quality of Development and Reporting
The highest overall scores were recorded in scope and purpose (75%) and clarity of presentation (73%) (Table 2). The lowest scores were recorded consistently in applicability (30%) and editorial independence (17%). Notably, 89% and 86% of guidelines scored <50% on applicability and editorial independence, respectively; and 46% scored <50% on stakeholder involvement, rigor of development, and total score. Overall quality was <50% in 29% of guidelines, whereas only 1 guideline scored <50% on scope and purpose, and all were >50% for clarity of presentation.
There were no statistically significant differences between pre– and post–policy guidelines on any of the AGREE-II domain scores except for improvement of editorial independence after each of the policy statements: AAP Policy Statements 1 (P = .002) and 2 (P = .01). There was also no significant difference in overall quality or in total AGREE-II score between guidelines published before and after each AAP policy statement. Figure 1 plots AAP guideline total scores by the year of publication, which showed no statistically significant trend in quality over time (rs = 0.15, P = .46).
However, improvements were found in 2 of the 5 individual AGREE-II items addressed by each AAP policy statement (AGREE-II 9, 10, and 11 for AAP Policy Statement 1 and AGREE-II 11 and 12 for AAP Policy Statement 2). After AAP Policy Statement 1, the mean score on AGREE-II item 10 (“the methods for formulating the recommendations are clearly described”) increased from 2.5 to 4.5 (P = .009). After AAP Policy Statement 2, the mean score on AGREE-II item 12 (“there is an explicit link between the recommendations and the supporting evidence”) increased from 5.2 to 6.5 (P < .001).
Internal guidelines had significantly higher total scores than did endorsed guidelines (P = .01; Table 2); however, the only AGREE-II domain that differed significantly between the groups was scope and purpose, with internal guidelines again achieving higher scores (P = .01). A standard scale (versus a previously unpublished scale) was used to grade evidence in 10 of 15 late guidelines but in only 3 of 13 early guidelines. The standard scales (n = number of guidelines using each scale) were from the Canadian Preventive Services (n = 1),17 Centers for Disease Control Community Preventive Services (n = 1),18 GRADE (n = 2),19 Jadad (n = 2),20 Oxford Centre for Evidence Based Medicine (n = 3),14 Sackett (n = 3),21 and the US Preventive Services Task Force (n = 1).22
Level of Evidence
There were 394 recommendations in the 28 guidelines: 201 diagnostic recommendations and 193 treatment recommendations. The total number of recommendations from early guidelines (published before 2004; n = 196; mean of 15 per guideline) was similar to those from late guidelines (published from 2004 onward; n = 198; mean of 13 per guideline). There were 139 internal guideline recommendations and 255 endorsed guideline recommendations.
The levels of evidence for the 394 recommendations were, by recommendation, 23% experimental studies, 46% observational studies, and 31% expert opinion or no reference. Of the 201 diagnostic recommendations, 5% were based on experimental studies, 62% on observational studies, and 33% on expert opinion or no reference. Of the 190 treatment recommendations, 43% were based on experimental studies, 30% on observational studies, and 27% on expert opinion or no reference (Table 3).
There was no significant difference between early and late recommendations in terms of their overall levels of evidence (P = .26) or between internal and endorsed guidelines (P = .97). However, a larger proportion of recommendations in later guidelines were based on RCTs (P = .05 for all recommendations and .05 for treatment recommendations only) and on RCTs or CCTs combined (P = .04 for all recommendations and .05 for treatment recommendations only). A smaller proportion of recommendations in later guidelines were based on expert opinion or no reference (P = .01 for all recommendations and .03 for treatment recommendations only). There was no statistically significant change over time in the levels of evidence for diagnostic recommendations.
Clinical practice guidelines aim to represent the most current thinking on the basis of the highest level of evidence available. Because evidence-based medicine has flourished and methods of guideline development and evidence review processes have become more sophisticated, we expected that the overall quality of reporting and evidence base of clinical practice guidelines would have improved. Instead, our results revealed an extremely wide range in AAP guideline quality scores with no statistically significant trend toward higher scores over time. The pattern of high and low domain scores was the same as in an older study,8 despite an AAP policy statement published in 2004 that outlined a standardized method for developing and classifying guideline recommendations15 and another statement published in 2008 that focused on increasing transparency in guideline development.16 Although these policy statements resulted in an increase in certain AGREE-II item scores, the overall quality of the guidelines remained unchanged.
Within AGREE-II, the highest scores were consistently recorded in scope and purpose and clarity of presentation. The only domain that clearly improved after publication of AAP Policy Statements 1 and 2 was editorial independence, which is likely due to increased awareness of the need to disclose competing interests.23,24 However, the improved score was still low in this category. AAP guidelines at best mentioned efforts to disclose competing interests but did not delineate the nature of the conflicts or how they were dealt with. Editorial independence will remain a challenge, because AAP guidelines are written by volunteers who are among a small number of recognized experts; many receive some industry funding for research or serve on advisory boards for pharmaceutical companies. Therefore, AAP guidelines should openly disclose “the types of competing interests considered, the methods by which potential competing interests were sought, a description of competing interests, and a description of how the competing interests influenced the guideline process and development of recommendations.”13 The Institute of Medicine recommendations on editorial independence and disclosure of conflicts of interest for guideline authors can serve as a guide.25
Stressing “barriers and facilitators to implementation, strategies to improve uptake, and resource implications of applying the guideline”13 would improve the guidelines’ low domain scores on applicability. Improved efforts to address the practical issues of knowledge translation for primary care guidelines have been shown to lead to improved patient outcomes.26,27 To improve in the domains of applicability and stakeholder involvement, the AAP should consider soliciting comments on a late draft of guidelines from a wide variety of users. In addition to improving guideline quality, this could improve guideline dissemination by promoting a broader awareness of the guidelines. Also in the realm of stakeholder involvement, the traditional physician-only guideline panels miss the potential valuable input of allied health professionals, patients, and their parents, especially on the practicality of recommendations.
AAP guidelines would benefit from comprehensive and continually updated standards for development. AGREE-II is a possible standard. The recent AAP policy statements on guideline development (1 and 2) do not address most items on the 23-item AGREE-II instrument. In particular, they do not adequately address knowledge translation, a previously recognized weakness of pediatric guidelines.8 The 2004 endorsement of reporting guidelines for RCTs by medical journals improved the quality of RCT reporting over a short period of time.28,29 We found evidence for improvements in the areas targeted by AAP statements on guideline development. Explicit AAP endorsement of a more comprehensive standard for guidelines could likewise succeed in improving the quality of guidelines across several more domains.
Level of Evidence
There was a statistically significant trend toward more experimental evidence in later guidelines. Even in later guidelines, however, more than one-quarter of therapeutic recommendations were based on expert opinion alone, which may be due in part to the previously mentioned barriers to clinical research in pediatric populations. Also, an RCT is not always feasible for answering diagnostic questions or for testing interventions favored by long-term clinical experience and worldwide consensus.20 Nonetheless, as suggested by the limited correlation between topics chosen for pediatric systematic reviews and the conditions commonly seen in primary care,30 there is a need for greater efforts to match the research agenda to clinical need.
In comparison with adult guideline recommendations, pediatric recommendations are based on slightly higher levels of evidence (23% based on experimental evidence in the current study versus 12% in adult cardiology,4 6% in adult oncology,5 and 14% in adult infectious disease guidelines6). However, there were almost 7 times as many adult cardiology recommendations4 as there were AAP guideline recommendations in all domains of pediatrics. As guidelines cover more scenarios, there may be no option but to resort to lower levels of evidence, which could ultimately erode clinician confidence in guidelines. As the AAP has suggested (“Toward Transparent Clinical Policies,” AAP Policy Statement 2), acknowledging the limits of the evidence base is key in maintaining organizational credibility when sharing expertise in the face of uncertainty.
Sample size and study design were limitations of this study. Although the sample size of guidelines was similar to that of previously published studies,4–6 the small number of included guidelines (n = 28) limited the power to detect differences between guidelines. The level of evidence behind each recommendation was not always transparent in the guidelines, and full reports on the primary studies were not obtained. Conclusions about trends in guideline quality were limited by the cross-sectional design. AAP guidelines automatically expire after 5 years unless reaffirmed, revised, or retired; older guidelines that remain active and so were included in this study may be of higher quality than those that were revised or retired. Matching current guidelines to future guideline updates (in a cohort study) would allow for a better assessment of temporal change in guideline quality than did our cross-sectional assessment.
AGREE-II is widely used31 and has shown high interrater reliability and construct validity: it seems that guidelines with higher scores will be more widely applied.32 However, AGREE-II focuses on methods of guideline development and the transparency of reporting as opposed to the strength of evidence behind the recommendations. Also, AGREE-II score interpretation is left entirely to the user.
Another limitation is that it was beyond the scope of the study to fully assess the quality of the studies that led to the evidence; a high-quality observational study may yield more useful information than a low-quality RCT.
There remains room for improvement in the development and reporting of AAP guidelines. Editorial independence and applicability demand special attention in this regard. Just as the endorsement of RCT-reporting guidelines has improved RCT reporting,28,29 AAP endorsement of established, broad standards might lead to higher-quality pediatric guidelines to additionally improve pediatric care. Because guidelines frequently relied on expert opinion, primary research targeted toward important evidence gaps will be vital in the long term.33
- Accepted December 4, 2012.
- Address correspondence to Joan L. Robinson, MD, 3-588D Edmonton Clinic Health Academy (ECHA), 11405-87 Ave, Edmonton, AB, Canada T6R 1C9. E-mail:
Mr Isaac contributed to writing the protocol and data collection and analysis and wrote the first draft of the manuscript; Dr Saginur conceived the study, contributed to writing the protocol and data collection and analysis, and revised the manuscript; Dr Hartling contributed to revising the protocol and reviewed the final manuscript; and Dr Robinson contributed to revising the protocol, supervised all aspects of the project, and revised the manuscript.
FINANCIAL DISCLOSURE: The authors have indicated they have no financial relationships relevant to this article to disclose.
FUNDING: No external funding.
COMPANION PAPER: A companion to this article can be found on page 794, and online at www.pediatrics.org/cgi/doi/10.1542/peds.2013-0125.
- Richardson WC, Berwick DM, Bisgard JC, Bristow LR
- Poonacha TK, Go RS
- West S, King V, Carey TS, et al. Systems to rate the strength of scientific evidence. Evidence Report/Technology Assessment No. 47 (prepared by the Research Triangle Institute–University of North Carolina Evidence-based Practice Center under Contract No. 290-97-0011). 2002:64–88. Agency for Healthcare Research and Quality Publication No. 02-E016
- Boluyt N, Lincke CR, Offringa M
- Simpson LA, Peterson L, Lannon CM, et al
- American Academy of Pediatrics. AAP Policy. Available at: http://pediatrics.aappublications.org/site/aappolicy/index.xhtml. Accessed June 1, 2011
- AGREE Collaboration. Appraisal of Guidelines for Research and Evaluation II. Available at: www.agreetrust.org/?o=1397. Accessed June 1, 2011
- Oxford Centre for Evidence-Based Medicine Levels of Evidence Working Group. The Oxford 2011 levels of evidence. Available at: www.cebm.net/index.aspx?o=5653. Accessed June 1, 2011
- American Academy of Pediatrics Steering Committee on Quality Improvement and Management
- Shiffman RN, Marcuse EK, Moyer VA, et al., American Academy of Pediatrics Steering Committee on Quality Improvement and Management
- Centers for Disease Control and Prevention. The community guide: guide to community preventive services. 1996. Available at: www.thecommunityguide.org/PDFS/EC ONEVAL_version 3.pdf. Accessed May 1, 2011
- US Preventive Services Task Force
- Institute of Medicine
- Licskai C, Sands T, Ong M, Paolatto L, Nicoletti I
- Hopewell S, Dutton S, Yu LM, Chan AW, Altman DG
- Tobe SW, Stone JA, Brouwers M, et al
- Brouwers M, Kho ME, Browman GP, et al
- Copyright © 2013 by the American Academy of Pediatrics