Divisions of a Emergency Medicine
e Critical Care Medicine
c Clinical Research Program, Children's Hospital Boston, Boston, Massachusetts
b Division of Pediatric Emergency Medicine, Boston Medical Center, Boston, Massachusetts
d Department of Health Research and Policy, Stanford University, and VA Palo Alto Cooperative Studies Programs Coordinating Center, Palo Alto, California
| ABSTRACT |
|---|
OBJECTIVE. The purpose of this work was to develop a simulation-based tool for the assessment of pediatric residents' resuscitation competency and to evaluate the tool's reliability and, preliminarily, its validity in a pilot study.
METHODS. We developed a 72-question yes-or-no questionnaire, the Tool for Resuscitation Assessment Using Computerized Simulation, representing 4 domains of resuscitation competency: basic resuscitation, airway support, circulation and arrhythmia management, and leadership behavior. We enrolled 25 subjects, 5 at each of 5 different training levels, all of whom participated in 3 standardized code scenarios using the Laerdal SimMan universal patient simulator. Performances were videotaped and then reviewed by 2 independent expert raters.
RESULTS. The final version of the tool is presented. The intraclass correlation coefficient between the 2 raters ranged from 0.70 to 0.76 for the 4 domain scores and was 0.80 for the overall summary score. Between the 2 raters, the mean percent exact agreement across items in each domain ranged from 81.0% to 85.1% and averaged 82.1% across all of the items in the tool. Across subject groups, there was a trend toward increasing scores with increased training, which was statistically significant for the airway and summary scores.
CONCLUSIONS. In this pilot study, the Tool for Resuscitation Assessment Using Computerized Simulation demonstrated good interrater reliability within each domain and for summary scores. Performance analysis shows trends toward improvement with increasing years of training, providing preliminary construct validity.
Key Words: assessment, competency, medical education, pediatric resuscitation, simulation
Abbreviations: ACGME, Accreditation Council for Graduate Medical Education; TRACS, Tool for Resuscitation Assessment Using Computerized Simulation; ICC, intraclass correlation coefficient
Competency in pediatric resuscitation is an essential objective of pediatric residency training.1 It is mandated not only by residency review committee regulations2 but by the exigencies of patient care, given low survival rates from pediatric cardiopulmonary arrest.1,3,4 In September 1999, the Accreditation Council for Graduate Medical Education (ACGME) formally identified and endorsed 6 general competencies for medical education, among them competencies in direct patient care, medical knowledge, and professionalism; residency review committees mandated the implementation and assessment of these competencies as training program requirements effective July 2002. The current evolution of physician assessment is toward performance evaluation, symbolized by the pyramid described by Miller,5 emphasizing a progression from "knows" to "knows how," "shows how," and "does." This highest level corresponds with competency in the performance of a skill, beyond and distinct from the knowledge of how to do so.
Because resuscitation is a rare event in pediatrics, residency curricula must rely on simulated experiences rather than actual patient care to teach this topic and determine competency. High-fidelity medical simulation offers a realistic training environment that poses no threat to patient safety. Long recognized and used in aviation training, simulation was initially adopted for medical training and crisis management in the field of anesthesia6,7 but has since been used in an increasingly diverse group of medical specialties, including emergency medicine, surgery, and obstetrics-gynecology.8–12 The use of simulation has been extended beyond training to assessment,13 and the ACGME assessment toolbox recognizes simulations and models as the most desirable assessment methods for medical procedures in patient care.14
Objective assessment requires reliable and valid tools. Reliability refers to the reproducible assessment of the same performance under different circumstances. Interrater reliability in particular refers to consistency in ratings between different examiners. Interrater reliability is an indication of the generalizability of the tool, that is, whether the tool would have the same properties when applied by multiple users. Elements of validity include the selection of appropriate subject matter (content validity), appropriate variation in concert with expectations (construct validity), and accurate comparison with the established gold standard for the behavior in question (criterion validity). Simulator-based rating systems have been used in anesthesia with demonstration of good reliability and strong construct validity.15–18
Although there are standard courses in pediatric resuscitation, no published, validated assessment tool exists for the performance of pediatric resuscitation; a neonatal resuscitation-specific tool has recently been developed.19 The need for such a tool is illustrated by deficiencies in the resuscitation knowledge and skills of pediatric house officers.13,20–24 We sought to develop a reliable and valid tool for the assessment of pediatric residents' resuscitation competency using simulation. We present our novel tool, the Tool for Resuscitation Assessment Using Computerized Simulation (TRACS), and report the results of a pilot study to assess its reliability and validity. Our study's primary goal was to conduct an initial assessment of the TRACS' interrater reliability. Our secondary objective was to preliminarily evaluate the TRACS' construct validity by comparing performance at different training levels.
| METHODS |
|---|
Tool Development
Using the American Heart Association Pediatric Advanced Life Support curriculum,25 combined with standard resuscitation management, we delineated 4 domains of resuscitation performance (basic resuscitation, airway support, circulation and arrhythmia management, and leadership behavior) and identified specific objective elements for each domain. We elected to use a checklist rather than global rating style for this tool in an effort to formally evaluate specific item-level competency and avoid any masking of deficits that could potentially occur with a global rating score. To provide content validity, these behaviors were reviewed by a panel of 13 experts (including Drs Vinci and Weiner) in the fields of pediatric emergency medicine, pediatric critical care, and medical education using a modified Delphi process.26 The panel was asked to confirm that the selected items should be expected of pediatric residents, to eliminate inappropriate items, and to add any missing items. After trial runs in the simulator suite, the selected items were modified to clarify them and facilitate scoring. The final version of the TRACS used 72 yes-or-no (performed or not performed) items: basics (6 items), airway (32 items), circulation (25 items), and leadership (9 items). The number of items in each domain is not meant to precisely reflect its relative importance but rather to gauge the complexity of its mastery, hence the increased number of items in the more procedural airway and circulation domains. Although leadership is perhaps an infinitely complex task, only a basic competency in leadership is being assessed here. The TRACS is shown in Fig 1.
Subject Enrollment
We recruited a convenience sample from a pool of all of the available members of 5 training groups: intern, junior resident, senior resident, recent graduate (pediatric residency graduates in their first year postresidency), and senior fellow (second- and third-year fellows in pediatric critical care or emergency medicine). We enrolled 25 subjects, 5 from each of the 5 training levels. All of the subjects in the first 4 groups were current residents or graduates of the Boston Combined Residency in Pediatrics based at Children's Hospital Boston and Boston Medical Center. The recent graduate pool included chief residents, first-year fellows, and other pediatricians still working clinically at these hospitals.
Sessions
We conducted sessions in the Children's Hospital Boston simulator suite between July and October 2003. In each session, subjects provided written informed consent and completed a training information questionnaire. After simulator suite orientation, each subject participated in 3 standardized code scenarios written by the authors (Drs Brett-Fleegler and Kleinman) using the Laerdal SimMan universal patient simulator (Laerdal Corporation, Stockholm, Sweden). At the beginning of each scenario, a brief scripted vignette with pertinent clinical information was provided (see Fig 2). At the time of this study, only an adult mannequin was available for use, so all of the scenarios involved patients age 12 years and older. Subjects functioned in 3 different code team roles, 1 for each scenario. In the first scenario, subjects served as the code team leader for a 14-year-old male near-drowning victim. Next, they managed the airway of a 12-year-old girl in status asthmaticus. Finally, they provided circulation and arrhythmia management for a 15-year-old girl with a tricyclic antidepressant overdose. Subjects performed the resuscitation along with a code team consisting of 3 to 5 additional health care providers (physicians, nurses, respiratory therapists, and/or pharmacists). These additional personnel were familiar with the clinical scenario and its expected management; all were trained through practice sessions to provide assistance to the subject when requested to do so but were explicitly instructed not to direct care in order to isolate the performance of the subject. All of the sessions were videotaped for evaluation. No code cards were allowed, because our intent was to assess competency independent of such aids. Afterward, subjects received feedback from faculty and answered a second questionnaire soliciting their feedback.
50% of subjects by both raters. The score in each domain was determined by the percentage of answered items scored as "yes" if ≥75% of items were answered in that domain. If <75% of items were answered, the domain score was set to missing. Because the number of items in each domain is meant to be reflective of the complexity of tasks involved, we chose to calculate an overall summary score that was a weighted average of the 4 domain scores, in which each domain was given a weight proportional to the number of items scored in that domain. Final subject scores were the average of the 2 raters' scores.
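The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names, the use of `None` for an unscored item, and the handling of a missing domain in the summary score (it is simply skipped) are assumptions, because the text does not specify them.

```python
from __future__ import annotations

from typing import Optional

MIN_ANSWERED_FRACTION = 0.75  # threshold stated in the text


def domain_score(ratings: list[Optional[bool]]) -> Optional[float]:
    """Percentage of answered items rated 'yes'; None (missing) if <75% were answered."""
    answered = [r for r in ratings if r is not None]
    if not ratings or len(answered) / len(ratings) < MIN_ANSWERED_FRACTION:
        return None
    return 100.0 * sum(answered) / len(answered)


def summary_score(domains: dict[str, list[Optional[bool]]]) -> Optional[float]:
    """Average of domain scores, each weighted by the number of items scored in it."""
    weighted_sum = 0.0
    total_items = 0
    for ratings in domains.values():
        score = domain_score(ratings)
        if score is None:
            continue  # assumption: a missing domain is left out of the summary
        n_scored = sum(r is not None for r in ratings)
        weighted_sum += score * n_scored
        total_items += n_scored
    return weighted_sum / total_items if total_items else None


def final_score(rater1: Optional[float], rater2: Optional[float]) -> Optional[float]:
    """Mean of the 2 raters' scores; assumption: missing if either rater's score is missing."""
    if rater1 is None or rater2 is None:
        return None
    return (rater1 + rater2) / 2.0


# Hypothetical ratings from one rater for a shortened 2-domain checklist.
example = {
    "basics": [True, True, False, True],
    "airway": [True, None, False, True, True, True, None, True],
}
```

With these hypothetical ratings, the basics score is 75% (3 of 4) and the airway score is about 83.3% (5 of 6 answered items). Weighting by the number of items scored makes the summary equal to the proportion of "yes" ratings among all scored items, here 8 of 10.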
Statistical Analyses
To evaluate interrater reliability, intraclass correlation coefficients (ICCs) were calculated for domain and summary scores. These have been used frequently to assess interrater reliability when the variables being compared have ordinal or continuous score distributions.27 For each individual, dichotomous item within the domains, we calculated the most commonly used measure of agreement for nominal or categorical variables, the Cohen's κ.28 However, Cohen's κ value has a number of weaknesses that limit its usefulness in certain circumstances. Specifically, the κ value is highly affected by skewed distributions, which cause it to have extremely low values, and it is unable to be calculated at all when there are rows or columns missing from a table.29–31 Several of the TRACS items had extremely skewed distributions (mostly positive ratings) or had 0 rows or columns because of 1 rater answering all "yes" or all "no"; in addition, some items had a 0 on the diagonal resulting in a negative κ value. Therefore, we also chose to examine some raw agreement indices, including the overall percentage of agreement (proportion of all of the cases that received the same rating by both raters), the percentage of agreement for positive ratings, and the percentage of agreement on negative ratings. These latter 2 indices are analogous to sensitivity and specificity and are useful because they provide more specific information about the direction and type of rating inconsistencies. In addition, high agreement proportions for both positive and negative ratings would indicate that the observed level of agreement is higher than would occur by chance.29,30 To summarize item-level reliability across domains and for the overall tool, we then calculated the mean percentage of the exact agreement across items within each of the 4 domains and across all of the items in the tool.
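The behavior of κ on skewed items can be illustrated with a short Python sketch that computes κ and the raw agreement indices from one item's 2 × 2 table. The function name and the example counts are hypothetical. The positive and negative agreement formulas used here, 2a/(2a + b + c) and 2d/(2d + b + c), are one common definition of specific agreement and are an assumption; the text does not state the exact formulas used.

```python
from __future__ import annotations

from typing import Optional


def agreement_indices(a: int, b: int, c: int, d: int) -> dict[str, Optional[float]]:
    """Agreement indices for one dichotomous item rated by 2 raters.

    a: both raters 'yes'; b: rater 1 'yes', rater 2 'no';
    c: rater 1 'no', rater 2 'yes'; d: both raters 'no'.
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("no rated subjects")
    overall = (a + d) / n
    # Chance-expected agreement from the raters' marginal totals, scaled by n * n.
    expected_num = (a + b) * (a + c) + (c + d) * (b + d)
    if expected_num == n * n:
        kappa = None  # both raters gave the identical single answer throughout
    else:
        expected = expected_num / (n * n)
        kappa = (overall - expected) / (1 - expected)
    # Specific agreement (assumed definition, see lead-in).
    positive = 2 * a / (2 * a + b + c) if (2 * a + b + c) else None
    negative = 2 * d / (2 * d + b + c) if (2 * d + b + c) else None
    return {"overall": overall, "kappa": kappa, "positive": positive, "negative": negative}


# Hypothetical skewed item: 25 subjects, mostly 'yes', no joint 'no' ratings.
skewed = agreement_indices(a=22, b=1, c=2, d=0)
```

For this hypothetical item the raters agree on 22 of 25 subjects (88%), yet κ is slightly negative because chance-expected agreement from the marginals is 88.64%. Positive agreement is high and negative agreement is 0, which shows where the disagreement lies.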
To preliminarily evaluate construct validity, we assessed whether the TRACS could differentiate between trainees at different levels of training, although the study was not powered for this comparison. The domain and overall summary scores were compared among the 5 training levels using the nonparametric Kruskal-Wallis test. Because the scores are expected to increase as the level of training increases, a nonparametric trend test, the Jonckheere-Terpstra test, which tests for monotone increasing trends, was also conducted.32 All of the tests were 2 sided.
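As an illustration of the trend test, the following Python sketch computes the Jonckheere-Terpstra statistic for groups listed in their hypothesized order. It is not the authors' analysis code: the function name is hypothetical, and the 2-sided P value uses the large-sample normal approximation without a correction for ties, which is an assumption. With only 5 subjects per group, an exact or permutation version may be preferable.

```python
import math
from itertools import combinations


def jonckheere_terpstra(groups):
    """Return (J, z, two_sided_p) for groups ordered by hypothesized increasing level."""
    # J counts, over every ordered pair of groups, the pairs of observations in which
    # the value from the later group exceeds the value from the earlier group.
    j_stat = 0.0
    for lower, higher in combinations(groups, 2):
        for x in lower:
            for y in higher:
                if x < y:
                    j_stat += 1.0
                elif x == y:
                    j_stat += 0.5  # ties count as half
    sizes = [len(g) for g in groups]
    n = sum(sizes)
    # Null mean and variance of J (no tie correction; assumption noted above).
    mean = (n * n - sum(s * s for s in sizes)) / 4.0
    variance = (n * n * (2 * n + 3) - sum(s * s * (2 * s + 3) for s in sizes)) / 72.0
    z = (j_stat - mean) / math.sqrt(variance)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return j_stat, z, p_two_sided


# Hypothetical scores for 3 training levels, ordered from least to most training.
j, z, p = jonckheere_terpstra([[1, 2], [3, 4], [5, 6]])
```

A positive z indicates scores that tend to increase across the ordered groups, and a negative z indicates a decreasing trend. Unlike the Kruskal-Wallis test, the statistic depends on the order in which the groups are supplied.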
| RESULTS |
|---|
1 rater and were, therefore, eliminated from current analyses. Four items, 2 each from the airway and circulation domains, were infrequently scored, because the objective was performed by a team member instead of the subject. Two additional items in the airway domain referred to the administration of premedications for endotracheal intubation, which were infrequently given. Because this was deemed clinically acceptable by the raters, they independently elected not to score those items rather than rating them as incorrect. Another airway domain item (ensures suction available) was infrequently scored because of a typographical error on the scoring sheet. These 7 items were eliminated from the TRACS, leaving a total of 65 items for analysis: 6 in basic resuscitation, 27 in airway, 23 in circulation, and 9 in leadership behavior.
Scoring
The response rate for a given item is the percentage of subjects scored. After elimination of items as above, response rates for the 65 items by rater 1 ranged from 76% to 100%, with 55 of 65 items scored for ≥95% of subjects; for rater 2, response rates ranged from 68% to 100%, with 55 of 65 items scored for ≥95% of subjects. Reasons for inability to score a subject included difficulty visualizing relevant activity on the videotape and performance of the item by another team member. All of the domains were scored for 24 of the 25 subjects by both raters, given that ≥75% of items were answered in each domain; basic resuscitation was not scored for 1 subject by either rater because of a videotaping error.
In evaluating domain and summary scores, the ICC between the 2 raters ranged from 0.70 to 0.76 for the 4 domain scores and was 0.80 for the overall summary score (see Table 2). In comparing individual item scoring between raters, κ values were 0 or negative for 19 of the 65 items. The mean of the remaining κ values was 0.51. In evaluating all of the items (including those with incalculable κ values), the mean percentage of exact agreement across items in each domain was 85.5% (range: 77.3%–95.7%) for basics, 81.2% (range: 52.0%–100.0%) for airway, 81.0% (range: 25.0%–100.0%) for circulation, and 85.1% (range: 52.0%–100.0%) for leadership and averaged 82.1% across all of the items in the tool. The mean percentage of positive agreement overall was 86.5%, and the mean percentage of negative agreement overall was 48.2%.
.05) but not for the other domains or the overall summary score. With regard to trends, increasing scores with increased training were seen in all of the domains except for basics and were statistically significant using the Jonckheere-Terpstra trend test for the airway and summary score trends (P = .01).
| DISCUSSION |
|---|
0.70 for all of the domain and summary scores. An average percentage of exact agreement of ≥81.0% across all of the items also supports good interrater reliability. The percentage of positive agreement of 86.5% supports good sensitivity of the tool. In considering those items with very low κ values, the majority have low agreement in negative ratings. These items will be revised in future iterations of the tool, and guidelines for reviewers will be clarified to improve agreement. Consideration will be given to eliminating the leadership behavior section of the tool, because it is the least objective element of the tool, and although critically important, perhaps it is better suited to another assessment modality. We, thus, anticipate even better reliability in future iterations of this tool. Reviewing videotaped performances to establish interrater reliability has ample precedent.18,33–35 The use of 2 reviewers provides appropriate opportunity to compare interrater reliability across a range of performances and is supported in the literature, as long as rater training is provided and objectives are well defined.33,36,37
The validity of an assessment tool depends on content, construct, and criterion validity. Content validity of the TRACS is based on its roots in the widely accepted Pediatric Advanced Life Support curriculum and is supported by expert panel review. Construct validity for this tool is supported by the performance analysis that shows trends toward improved performance with increasing years of training. More pronounced differences based on training were likely not seen because of the small sample size used in this pilot study and the marked heterogeneity of mock code experience within that sample. Other pertinent variations in training are almost certainly present and likely underlie this performance variation in subjects at the same level of training, thus emphasizing the importance of formal assessment. Among these may be the varied career interests of these subjects; should this be the case, formal assessment and training are all the more important to ensure that even those trainees not intending a career in critical care possess basic competency in resuscitation. No standard exists on which to base criterion validity. Recent research on previously validated oral board and objective structured clinical examination-based assessment tools used in emergency medicine has supported their validity in a simulator setting,38,39 but comparable tools do not exist that are specific for pediatric resuscitation.
In considering the limitations of this study, it should be noted that, whereas the use of 2 reviewers provides reasonable opportunity to assess interrater reliability, comparisons of additional raters would strengthen this assessment. The lack of blinding of our raters to the training status of the subjects is another potential limitation to our study but is expected to be the usual situation in routine use. In addition, the heterogeneity of scores at each level suggests that the raters were not unduly influenced by a priori assumptions about skill level. Including additional, blinded raters may be useful in evaluating the next version of this tool. We will also try to improve interrater agreement by targeting individual items with lower interrater agreement. The majority of these items were those involving assessment by the trainee, such as assessment of oxygenation status or blood pressure. These items are less observable and ask the rater to evaluate a thought process that may not be verbalized by the trainee. This highlights the challenges in assessing the performance of higher-order cognitive tasks. In our convenience sample, selection bias may have occurred for volunteers with either strong or weak resuscitation skills. In reviewing the intended careers of participants, using this as a proxy for resuscitation skills, our population seems diverse, because it includes both future critical care and noncritical care providers and, most importantly, reflects the target trainee population. With regard to our secondary objective of assessment of construct validity, the primary limitation of this study is its small sample size. The study was not powered to detect significant intergroup differences.
A reliable and valid assessment tool for pediatric resuscitation competency is important in addressing the mandate of the ACGME patient care competency. In graduate medical education, learning may be assessment driven,40 and poor clinical performers may overestimate their capabilities.41 Thus, assessment of pediatric resuscitation competency is essential both to facilitate learning and to safeguard patient care. This type of assessment tool will also be useful for both formative and summative purposes in physician training. Specific, formative feedback is highly valued by trainees; individual items from the TRACS can be used to direct constructive feedback regarding areas needing improvement. As a summative tool, the TRACS will enable the evaluation of residents over time and provide residency programs with objective measurements of trainee competency as they advance to positions of greater responsibility for patient care. Ultimately, trends seen in groups of residents may provide feedback at a programmatic level and guide curriculum modification.
Simulators offer a unique and valuable training and assessment environment. The subjects in our study found the simulator experience and feedback useful and rated these sessions as more helpful than mock resuscitations in a nonsimulator setting. This is consistent with previous experience in which simulators have been well received by trainees at many levels and in diverse settings.42–47 In many ways, simulation mirrors the original bedside teaching model while also incorporating the principles of problem-based learning.8,43 Simulators combine an interactive learning opportunity with a realistic setting and the chance to manage rare events; we feel this justifies the logistic and economic challenges of developing simulator capabilities, as is de facto supported by the rapid expansion of simulation centers48 and the explosive development of simulation as an educational environment.49,50
We intend to revise the TRACS on the basis of the strengths and limitations discovered in this pilot study. The next evaluation phase for the TRACS will involve further assessment of its interrater reliability and its construct validity via a study powered to evaluate intergroup differences. A reliable and valid assessment tool is crucial to the training and evaluation of pediatric residents in resuscitation.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Address correspondence to Marisa B. Brett-Fleegler, MD, Division of Emergency Medicine, Main South Basement, Room CB0120, Children's Hospital Boston, 300 Longwood Ave, Boston, MA 02115. E-mail: marisa.brett{at}childrens.harvard.edu
The authors have indicated they have no financial relationships relevant to this article to disclose.
Results from this study were presented at the annual meeting of the Pediatric Academic Societies; May 3, 2004; San Francisco, CA.
| REFERENCES |
|---|