February 2015, VOLUME 135 / ISSUE 2

Identifying Autism in a Brief Observation

  1. Terisa P. Gabrielsen, PhD, NCSPa,
  2. Megan Farley, PhDa,
  3. Leslie Speer, PhD, NCSPa,
  4. Michele Villalobos, PhDa,
  5. Courtney N. Baker, PhDb, and
  6. Judith Miller, PhDa
  1. aDepartments of Psychiatry and Educational Psychology, University of Utah, Salt Lake City, Utah; and
  2. bDepartment of Child and Adolescent Psychiatry and Behavioral Sciences, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania


BACKGROUND: Pediatricians, neurologists, and geneticists are important sources for autism surveillance, screening, and referrals, but practical time constraints limit the clinical utility of behavioral observations. We analyzed behaviors under favorable conditions (ie, video of autism evaluations reviewed by experts) to determine what is optimally observable within 10-minute samples, asked for referral impressions, and compared these to formal screening and developmental testing results.

METHODS: Participants (n = 42, aged 15 to 33 months) were typically developing controls and children who screened positive during universal autism screening within a large community pediatric practice. Diagnostic evaluations were performed after screening to determine group status (autism, language delay, or typical). Licensed psychologists with toddler and autism expertise, unaware of diagnostic status, analyzed two 10-minute video samples of participants’ autism evaluations, measuring 5 behaviors: Responding, Initiating, Vocalizing, Play, and Response to Name. Raters were asked for autism referral impressions based solely on individual 10-minute observations.

RESULTS: Children who had autism showed more typical behavior (89% of the time) than atypical behavior (11%) overall. Working from brief but highly focused observations, expert raters failed to identify 39% of cases in the autism group as needing autism referrals. Significant differences in cognitive and adaptive development existed among groups, with receptive language skills differentiating the 3 groups.

CONCLUSIONS: Brief clinical observations may not provide enough information about atypical behaviors to reliably detect autism risk. High prevalence of typical behaviors in brief samples may distort clinical impressions of atypical behaviors. Formal screening tools and general developmental testing provide critical data for accurate referrals.

  • autism spectrum disorder
  • early identification
  • community pediatrics
  • screening-early childhood
  • referral

What’s Known on This Subject:

Behavioral observations influence a clinician’s decision to diagnose or refer, and may even override formal screening results. In the case of autism spectrum disorder, an expected rate of atypical behavior during the span of a medical visit is unknown.

What This Study Adds:

We are the first to quantify the high base rates of typical behavior in young children who have autism and language delay. When observation times are brief, the preponderance of typical behaviors may negatively impact referral decision accuracy.

Symptoms of autism spectrum disorder (ie, social/communication deficits and restricted interests/repetitive behaviors)1 become apparent over time, as gaps between typical and atypical development widen in childhood.2–7 Thus, both typical and atypical behaviors present simultaneously. Little is known empirically about ratios of typical to atypical behaviors in children who have typical or delayed development, leaving physicians to gather observational data during brief observations (10–20 minutes) with limited reference points. Surveillance and screening during primary care visits3 and clinical judgment of neurologists and geneticists (referrals for neurodevelopmental concerns) are key to early detection and referral. Although standardized parent report screening tools for autism are available,3 clinical impression is critical in decision-making, and often overrides information obtained from screening tools.8 Standardized observational screening measures are promising (eg, Systematic Observation of Red Flags,9 Screening Tool for Autism in Two-Year Olds10), but have not been adopted and may not be practical in primary care.

Little research exists to determine ratios of typical or atypical behaviors exhibited by children who have autism spectrum disorder during the time-span of an average medical visit.11 During brief observations, low-frequency atypical behaviors may not stand out among high frequencies of typical behavior. Thus, we aimed to determine the ratio of signal (atypical behaviors) to noise (typical behaviors) in behaviors of young children who have autism in brief observations, and relate this information to clinical judgment and standardized test data.



Methods

Institutional review boards of participating institutions approved all methods, and parents gave written consent for screening, evaluation, and video recording. Children aged 15 to 33 months were recruited through a 3-tiered autism screening process in a large suburban pediatric practice.12 Screening had high participation rates (80% of families [n = 796] completed screening questionnaires, verified against clinic schedules). The sample was representative of many community clinics, comprising middle- to lower-socioeconomic status families from diverse racial and ethnic backgrounds, although African Americans were underrepresented. Participants were recruited, screened, and evaluated in English or Spanish.

Screening Process and Group Assignments

Participants were screened with the Modified Checklist for Autism in Toddlers (M-CHAT)13 and the Infant Toddler Checklist.14 Children who screened positive on at least 1 questionnaire or whose parents or providers were concerned despite negative screening (n = 192) were contacted by phone for follow-up, then invited for in-person evaluation, if warranted, at no cost. In-person evaluations included the gold standard autism observational measure (Autism Diagnostic Observation Schedule [ADOS]15,16), a developmental measure (Mullen Early Learning Scales),17 and a measure of adaptive functioning (Vineland Adaptive Behavior Scales, Second Edition, Survey Interview).18 After evaluation, 14 children were identified with early signs of autism spectrum disorder and 16 with suspected language delays, but not autism (14 were selected based on age match to the Autism group). One child was identified as typically developing. Thirteen additional age-matched typically developing children were recruited from the same neighborhoods using the same screening instruments and test battery. Table 1 shows screening results and group assignment.


Table 1. Screening Results of Children in Pediatric Primary Care12

Table 2 shows characteristics of the sample. A χ2 analysis revealed no significant differences on demographic variables, but lower rates of subsidized insurance (proxy for socioeconomic status) in the Autism group are noted.


Table 2. Demographic Characteristics

Cognitive testing (Table 3) showed average abilities in the Typical group, mean scores 1 SD below average in the Language group, and mean scores 2 SDs below average in the Autism group. The mean Gross Motor score in the Autism group was 1 SD below average. Autism and Language groups showed lower abilities in Communication, Socialization, and Motor adaptive domains than the Typical group. Receptive Language on the Mullen Early Learning Scales was the only score among adaptive and cognitive measures that differed significantly in every pairwise comparison between groups.


Table 3. Cognitive and Adaptive Development

Study Procedures

Video Segments

Two samples (10-minute segments from clinical evaluations) were chosen for analysis to examine whether children behaved differently after becoming familiar with the examiner and room: (1) the first 10 minutes of an ADOS,15,16 and (2) 30 minutes after starting the ADOS. Each 10-minute video was divided into sixty 10-second clips, viewed consecutively with 4-second breaks for behavioral coding (5040 intervals across 42 children, 2 videos each).


Five behavioral categories were rated to reflect broad interactional behaviors that might be noted by providers familiar with autism, but not necessarily specialists. Behavioral categories were based on diagnostic criteria,19 ADOS scoring algorithms,15,16 and the Systematic Observation of Red Flags.9 These included social responding, vocalizations, play, social initiations, and a discrete behavior, response to name. Table 4 contains operational definitions used to determine whether behaviors were typical or atypical.


Table 4. Ratios of Typical to Atypical Behaviors


Behaviors were rated for each category in each interval by using partial interval recording (a method of recording occurrence at any time during the interval).20,21 Each behavior category received only 1 rating per interval, even if multiple behaviors were observed. Atypical behaviors were prioritized to maximize detection, even if typical behaviors were present in the same segment. The order of priority for ratings was “atypical,” “typical,” “unclear,” and “no opportunity.” “Unclear” was the code for behaviors that were not visible (eg, off camera or back turned); “unclear” and “no opportunity” codes were not of interest for this study. Raters could review segments without restriction. After completing each 10-minute video, raters indicated whether they would refer the child for an autism evaluation based solely on the observation.
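The priority rule described above can be sketched in a few lines. This is only an illustration of the stated ordering, not the study's coding software; the function name and the list-of-labels input format are hypothetical.

```python
# Priority order stated in the text: the first label found wins the interval.
PRIORITY = ["atypical", "typical", "unclear", "no opportunity"]


def code_interval(observations):
    """Return the single rating for one behavior category in one interval.

    `observations` is a hypothetical list of labels noted at any point
    during the 10-second interval (partial interval recording). An empty
    list means the behavior had no opportunity to occur.
    """
    for label in PRIORITY:
        if label in observations:
            return label
    return "no opportunity"


# A single atypical behavior outranks any number of typical behaviors.
print(code_interval(["typical", "atypical", "typical"]))  # atypical
print(code_interval(["unclear", "typical"]))              # typical
print(code_interval([]))                                  # no opportunity
```

Because atypical codes take precedence, this scheme can only overstate, never understate, the share of intervals containing atypical behavior.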

Raters and Reliability

Two licensed psychologists, expert in early childhood development and autism spectrum disorder (and ADOS15 research reliable), rated behaviors. Both were unaware of study hypotheses and child-specific information other than age. Raters achieved initial reliability through practice videos. Inter-rater reliability was calculated by exact agreement (number of agreements/total observations) on 20% of study videos (5040 of 25 200 individual codes). Inter-rater reliability was 82% overall. Agreement was 84% on presence of behavior, 87% on absence of behavior. Agreement on typical behaviors was 97%, and on atypical behaviors it was 35%, which is discussed in more detail later. The κ between raters was 0.67.
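The counts and agreement statistics in this paragraph can be illustrated as follows. The sketch assumes that κ refers to Cohen's κ for 2 raters (the text says only "κ") and uses made-up toy ratings, not study data; the function names are hypothetical.

```python
from collections import Counter

# Code counts implied by the design: 42 children x 2 videos x 60 intervals
# x 5 behavior categories; 20% of those codes were double-rated.
total_codes = 42 * 2 * 60 * 5          # 25200
double_rated = total_codes * 20 // 100  # 5040


def exact_agreement(rater_a, rater_b):
    """Proportion of codes on which both raters gave the same label."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement (Cohen's kappa) for 2 raters."""
    n = len(rater_a)
    p_observed = exact_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if each rater assigned labels independently
    # at his or her own marginal rates.
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy ratings (illustrative only).
rater_a = ["typical"] * 8 + ["atypical"] * 2
rater_b = ["typical"] * 7 + ["atypical", "atypical", "typical"]
print(exact_agreement(rater_a, rater_b))
print(cohens_kappa(rater_a, rater_b))
```

In the toy data the raters agree on 80% of codes, yet κ is much lower because most of that agreement is expected by chance when one label dominates. The same effect is relevant here, where typical behavior made up the large majority of codes.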

Analytic Approach


One video of a child (age 33 months) in the Autism group was excluded from analyses other than Vocalizations because 46% of intervals were off camera. The child’s second video was included.


We analyzed rates of typical and atypical behaviors using non-parametric Kruskal-Wallis H and Mann-Whitney U tests to accommodate significantly skewed distributions and occasional outliers. Demographic data and referral impressions were analyzed using χ2 tests. ANOVA was used to analyze normally distributed developmental testing standard scores and relationships between referral impressions and age. Correlations were calculated using Spearman’s ρ. In all analyses, P values ≤ .01 were considered significant. Non-significant results are not reported in the text.
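For readers unfamiliar with the Kruskal-Wallis test, the following is a minimal standard-library sketch of the H statistic with the usual tie correction. It is not the authors' analysis code, and the example data are invented. Under the null hypothesis, H is approximately χ2 distributed with (number of groups − 1) degrees of freedom, which is why 3-group comparisons are reported as χ2 (2).

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H with tie correction.

    Raises ZeroDivisionError if every pooled value is identical,
    because the statistic is undefined in that case.
    """
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    weighted = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * weighted - 3 * (n + 1)
    ties = Counter(pooled).values()
    correction = 1 - sum(t ** 3 - t for t in ties) / (n ** 3 - n)
    return h / correction


# Invented example: three fully separated groups of 3 scores each.
print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Because the test operates on ranks rather than raw percentages, a few extreme children have limited influence, which suits the skewed distributions described above.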


Results

Behavioral Category Results

Differences in Opportunities

Kruskal-Wallis and Mann-Whitney tests determined that opportunities for social interactions were different among groups in Responding and Initiating, with fewer opportunities in the Autism group. Opportunities for Response to Name were greater in the Autism group (shown in Table 4). Because of these differences, coded behaviors were converted to percentage scores (eg, Atypical Percentage score = Atypical codes/[Atypical + Typical codes]) to standardize comparisons. No significant effects of time were found between samples at 10 minutes and 30 minutes, so the 2 observations were combined and analyzed on a per-child basis, limiting the effects of anomalous data in single samples.
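The percentage score defined above can be written out directly. In this sketch the counts are invented, and pooling the codes from a child's 2 videos before dividing is an assumption; the text says only that the 2 observations were combined per child.

```python
def atypical_percentage(atypical, typical):
    """Atypical codes / (Atypical + Typical codes).

    "Unclear" and "no opportunity" codes are left out of the denominator.
    Returns None when the child had no codable opportunities.
    """
    total = atypical + typical
    if total == 0:
        return None
    return atypical / total


# Invented counts for one child: (atypical, typical) codes per video.
video_at_0_min = (2, 13)
video_at_30_min = (1, 14)

# Assumption: counts from both videos are pooled before dividing.
pooled_atypical = video_at_0_min[0] + video_at_30_min[0]
pooled_typical = video_at_0_min[1] + video_at_30_min[1]
score = atypical_percentage(pooled_atypical, pooled_typical)
print(score)      # atypical share of codable behavior
print(1 - score)  # complementary typical share
```

Dividing by the number of codable behaviors, rather than by the number of intervals, keeps groups comparable even when one group had fewer opportunities to respond or initiate.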

Atypical behaviors occurred in 11% of intervals within the Autism group, compared with 2% of intervals in both Language and Typical groups. Group differences were significant for total number of atypical behaviors, χ2 (2) = 12.602; P = .002 (Autism > Language, P = .005; and Typical, P = .008). Because typical and atypical percentage scores are complementary, differences in total typical behaviors in all analyses reflect significance for Atypical behaviors, but in the opposite direction (eg, total typical behaviors were significant for group, but Autism < Language or Typical). Results by behavioral category are shown in Fig 1 and Table 4.


Figure 1. Differences in behavior occurrence between groups. Bars on Typical behavior columns illustrate the interquartile range, with dots indicating medians. In the case of an interquartile range where both 25th and 75th percentile scores were 100, only the dot is shown. Atypical score ranges are reflective of Typical score ranges and are not shown. Behaviors were rated as typical or atypical, then divided by the total behaviors (eg, typical/[typical + atypical]) to calculate percentage of behavior. Analysis comprised 3 groups, 2 time points (collapsed into 1 with no significant differences), and 12 dependent variables, including Total Atypical and Typical behaviors reported in the text. The majority of intervals included Play and/or Responding behaviors, whereas few intervals contained response to name (RTN) behaviors. Significant differences shown for Atypical behaviors are reflected in Typical behaviors, but in the opposite direction.

Social Responding

Groups differed on typical and atypical responding overall, χ2 (2) = 9.899; P = .01 (atypical responding: Autism > Language; P = .01).


Vocalizations

The quality and repetitive nature of sounds differed significantly among groups, χ2 (2) = 13.624; P = .001 (atypical vocalizations: Autism > Language; P = .01).


Play

Although raw count ratios of play behaviors (Table 4) differ, within-child percentages of typical and atypical play behaviors did not differ significantly among groups. Mean ranks for atypical play were 27 in the Autism group, and 17.3 and 20.2 in the Language and Typical groups, respectively. Mean percentage scores of atypical play in all groups were <10%.

Social Initiation

Opportunities to initiate to the examiner (eg, to request desired toys or initiate another turn in a social game) were rated. The mean percentage scores of atypical initiating behavior were <10% and not significantly different between groups.

Response to Name

Differences between groups in the percentage of typical and atypical responses to name (RTN) were significant, χ2 (2) = 9.899; P = .007 (Autism [mean rank, 29.68] > Language [16.64]; P = .01). Opportunities for RTN occurred at a low base rate (6% of intervals). All children in the Autism group responded to their names at least once (typical). Many children in the Language (57%) and Typical (50%) groups failed to respond to their names at least once (atypical).

In 10-minute videos, the mean number of RTN opportunities was 5 (SD, 3.6) in the Autism group, 2.4 (SD, 2) in the Language group, and 2.9 (SD, 2) in the Typical group. Typical responses to the first opportunity for RTN in the original codes were calculated as a more standardized measure of RTN. Typical overall RTN percentage scores in the Autism group (mean, 0.58; SD, 0.26) are similar to the overall percentage RTN on the first opportunity (0.56). In the Typical group, the percentage of typical responses (mean, 0.80; SD, 0.30) compares with overall response to the first opportunity (0.75). In the Language group, the percentage score for typical responses (mean, 0.86; SD, 0.15) compares with the percentage of responses to first RTN bids (0.64). Consistent with fewer opportunities in the Language group, 6 out of 28 Language group videos contained no RTN opportunities, compared with 2 out of 28 in both the Autism and Typical groups. RTN opportunities on video did not necessarily include administration of the RTN item on the ADOS, but correlations of the Atypical Percentage scores with ADOS scores were moderate for RTN and algorithm scores, as shown in Table 5.


Table 5. Correlations Between Atypical Behavior Percentages and ADOS Algorithm Scores

Referral Decision

At the end of each coding session, raters were asked, “Based on this 10-minute observation alone, would you refer this child for autism evaluation?” “Yes” or “No” responses were converted to “Correct” or “Incorrect” according to diagnostic group. Figure 2 shows rater judgments by group. Two referral decisions were made per child (2 videos). Rater judgment was least accurate for the Autism group (11 out of 28 videos [39%] incorrect). In the Language group, 7 out of 28 videos were incorrect (25%), and in the Typical group, 3 out of 28 were incorrect (11%). Accuracy was not attributable to rater or time (0–10 minutes vs 30–40 minutes into the evaluation). Within this small sample, sensitivity of the referral impression (ASD or not ASD) was 0.61, specificity was 0.82, and positive and negative predictive values were 0.63 and 0.81, respectively. There was a significant interaction between age and time (F [1,40] = 7.22; P = .01; Effect Size = 0.965). Incorrect decisions were more likely to be made for younger children (mean, 20.2 months; SD, 4) in the first 10 minutes, compared with correct decisions for later observations (at 30 minutes) or older children (mean, 24.7 months; SD, 5.1). All videos were included in analyses. For the video excluded from atypical/typical analysis, referral impression was Correct.
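The reported sensitivity, specificity, and predictive values can be reproduced from the video counts given above. This sketch treats an "Incorrect" decision in the Autism group as a missed referral (false negative) and an "Incorrect" decision in the Language or Typical group as an unneeded autism referral (false positive); the function and variable names are illustrative.

```python
def diagnostic_accuracy(tp, fn, tn, fp):
    """Standard 2 x 2 accuracy measures for a yes/no referral decision."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Counts from the text: 28 videos per group (14 children x 2 videos).
autism_videos, autism_incorrect = 28, 11
language_videos, language_incorrect = 28, 7
typical_videos, typical_incorrect = 28, 3

tp = autism_videos - autism_incorrect          # autism, referred
fn = autism_incorrect                          # autism, not referred
fp = language_incorrect + typical_incorrect    # no autism, referred
tn = (language_videos + typical_videos) - fp   # no autism, not referred

metrics = diagnostic_accuracy(tp, fn, tn, fp)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
# sensitivity: 0.61
# specificity: 0.82
# ppv: 0.63
# npv: 0.81
```

The unit of analysis here is the video, not the child, so each child contributes 2 decisions to the table.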


Figure 2. Accuracy of referral decisions. Children were observed twice (0 to 10 minutes; 30 to 40 minutes), each with a separate referral decision. Raters disagreed on autism referrals for all of the split decisions (1 view generated an autism referral, the other did not) except 1. In 1 Language case, the same rater gave different decisions for each view.

In 1 instance, 1 rater noted a basis for an autism referral decision. For 1 video of a child in the Typical group, the rater recommended autism referral based solely on the child’s atypical RTN in the 10-minute sample. The clinical impression of this child would otherwise not have indicated an autism referral.


Discussion

Our results suggest that, during brief observations, typical behaviors in some children who have autism can exceed atypical behaviors in frequency, to such a degree that it was often difficult even for clinicians experienced in autism spectrum disorder assessment to correctly determine whether enough atypical behavior existed to merit a referral for autism evaluation. With low agreement between raters on atypical behaviors (35%), it is possible that some atypical or typical behaviors were misidentified, but given the high rate of agreement on typical behaviors (97%), and low base rates of atypical behaviors (11% in the Autism group, 2% in non-Autism groups), it is unlikely any misidentified behaviors would overshadow the finding that typical behaviors were predominant, even in the Autism group. Rather, the low agreement on atypical behaviors highlights the difficulty of detecting atypical behaviors in brief observations, even for experts.

Although the Autism group demonstrated statistically higher rates of atypical behaviors and lower rates of typical behaviors compared with Language and Typical children overall, the ratio of behaviors in young children was such that typical behaviors in the Autism group still far exceeded atypical behaviors. Much attention has been drawn to atypical behaviors associated with autism.6,7,22–31 However, children who have autism do not engage in unusual behavior exclusively, and at the individual level, even experienced clinicians did not always agree on atypicality of behavior. Likewise, typical development is characterized by occasional periods of repetitive play, rigidity, and failure to respond.32 Normative data regarding ratios of typical to atypical behaviors have been absent, leaving clinicians to rely on their own judgment about whether a child’s behavior is excessively atypical.

We found that children in the Autism group responded to their name over half the time (58%). Although this was significantly lower than in the Language (86%) or Typical (80%) groups, the response rate in the Autism group is consistent with research establishing RTN as a highly specific but insensitive predictor of an autism diagnosis.22 Clinically, a single example of typical or atypical behavior (eg, RTN or making eye contact) may exert undue influence in referral decisions. Our results suggest that a positive RTN during a brief observation would be a poor single method of ruling out autism risk.

Less overt behaviors (ie, lower rates of social initiations, lower responsiveness, and repetitive behaviors) may be more difficult to detect than failure to respond to one’s name. In these behaviors, ratios of atypical:typical in the Autism group ranged from 1:6 (Vocalization) to 1:18 (Initiating). Atypical behaviors occur so rarely they may easily escape clinical detection during a brief observation. When weighing evidence in a decision-making process, atypical behaviors noted may be overshadowed by the many typical behaviors also likely to be observed during the same period of time.

Expert clinicians, with advantages of focused and repeated observation conditions, identified only 61% of the Autism group observations as indicating need for autism referrals based on 10-minute behavioral samples. Identification of autism risk was more difficult with younger children during the first 10 minutes of the evaluation than with older children or during the period 30 to 40 minutes into the evaluation. Missed referral decisions may be related to inherent difficulties in characterizing atypical behaviors (agreement between expert raters was lowest for atypical behavior), low frequencies of atypical behavior relative to typical behavior, or the high degree of behavioral variability seen in the Autism group. If atypical behavior is difficult to pinpoint (even for experts), too infrequent to stand out in a brief visit, or occurring at different rates across children who have autism, there may not be enough reliable observational data available on a consistent basis for clinicians to develop their own threshold of concern. The fact that typical behaviors, in contrast, were so frequent in all children, and so salient (agreement on typical behavior was very high), may mean that typical behaviors create significant noise around which it is difficult to weigh the importance of the signal of infrequent atypical behavior.

Receptive language scores were the underlying characteristic differentiating the groups. In toddlers, receptive language assessment includes asking children to respond to familiar commands, give something, or point to something. These tasks have distinctly social components to them (ie, social reciprocity, sharing, and joint attention), and may provide opportunities to observe social communication before expressive language delays become evident. The finding that receptive language ability is an important distinguishing characteristic suggests the impact of receptive language deficits on interactions between examiner and child may have been more clinically meaningful than other atypical behaviors. This may also explain false-positive autism referrals in the Language group.

Study limitations include the absolute number of children observed and the fact that referral decisions were artificially based on observation alone, rather than observation in combination with developmental history, parent concerns, direct interaction, and formal screening results. Sensitivity and specificity may not be replicable, given the small sample and exposure to focused coding of behaviors before the referral decision. Raters were autism experts, with opportunity for focused, detailed, and repeated observations (300 discrete ratings per observation), a much more detailed experience than is available during busy medical visits. The final sample size was small, but derived from a community population, selected through universal screening. The sample is more likely to be representative of the range of clinical presentations facing general pediatricians and first-line specialty referrals (eg, neurology, Early Intervention, audiology) than other samples of referred populations with higher incidence rates.


Conclusions

Children who have autism display high rates of typical behavior alongside atypical behavior. Children who do not have autism show atypical behavior at times, albeit at a statistically smaller ratio than children who have autism. Even clinicians who have experience and expertise in autism may not detect differences in the atypical:typical behavior ratios in a 10- to 20-minute observation. Receptive language abilities are an important area of focus in early diagnosis of developmental delays.

These data suggest that decision-making processes for possible autism symptoms should include consideration of all available data. In addition to behavioral observation, autism screening tools, parent observations, developmental testing, and a detailed history should be considered when making referral decisions. The high rate of false-negative referral impressions in the study is consistent with findings indicating that in developmental disabilities generally, clinical judgment does not improve on screening test results.8 Even a small number of clear examples of unusual behaviors might be sufficient to prompt further questions, or to begin the process of gathering additional developmental testing data (evaluation). The presence of typical behavior, or absence of clearly atypical behavior, during a brief period of observation is not sufficient or reliable enough to override other data indicating concern for autism.


We acknowledge the valuable contributions of the families in the study, Granger Medical Pediatrics, Wee Care Pediatrics, Developmental Disabilities Inc, Paul Carbone, MD, and Lane Fischer, PhD, to data collection and the manuscript.


  • Accepted November 20, 2014.
  • Address correspondence to Terisa Gabrielsen, PhD, 340-A MCKB, Brigham Young University, Provo, UT 84602. E-mail: terisa_gabrielsen{at}
  • Drs Gabrielsen and Miller conceptualized and designed the study, recruited and evaluated participants, trained raters to reliability, analyzed the data, and drafted the original manuscript; Drs Farley and Speer generated data for the study as blinded raters and critically reviewed the manuscript; Dr Villalobos participated in design and implementation of data collection, evaluated participants, and critically reviewed the manuscript; Dr Baker designed the analysis and reviewed and revised the manuscript; and all authors approved the final manuscript as submitted.

  • Dr Gabrielsen’s current affiliation is Department of Counseling, Psychology and Special Education, Brigham Young University, Provo, UT. Dr Farley’s current affiliation is the Waisman Center, University of Wisconsin-Madison, Madison, WI. Dr Speer’s current affiliation is Center for Autism, Cleveland Clinic Children’s, Cleveland, OH. Dr Villalobos’s current affiliation is Department of Psychiatry, University of North Carolina, Chapel Hill, Asheville, NC. Dr Baker’s current affiliation is Department of Psychology, Tulane University, New Orleans, LA. Dr Miller’s current affiliation is Center for Autism Research, Children’s Hospital of Philadelphia and University of Pennsylvania, Philadelphia, PA.

  • FINANCIAL DISCLOSURE: The authors have indicated they have no financial relationships relevant to this article to disclose.

  • FUNDING: Portions of this study were funded by Centers for Disease Control and Prevention U01DD000068-01, The EACH CHILD Study, Judith Miller, Principal Investigator; University of Utah School of Medicine, Department of Pediatrics; University of Utah Graduate School, Steffensen Cannon Fellowship.

  • POTENTIAL CONFLICT OF INTEREST: The authors have indicated they have no potential conflicts of interest to disclose.