Comparison of Clinical, Maternal, and Self Pubertal Assessments: Implications for Health Studies
BACKGROUND: Most epidemiologic studies of puberty have only 1 source of pubertal development information (maternal, self or clinical). Interpretation of results across studies requires data on reliability and validity across sources.
METHODS: The LEGACY Girls Study, a 5-site prospective study of girls aged 6 to 13 years (n = 1040), collected information on breast and pubic hair development from mothers (for all daughters) and daughters (if ≥10 years) according to Tanner stage (T1–5) drawings. At 2 LEGACY sites, girls (n = 282) were also examined in the clinic by trained professionals. We assessed agreement (κ) and validity (sensitivity and specificity) with the clinical assessment (gold standard) for both the mothers’ and daughters’ assessments in the subcohort of 282. In the entire cohort, we examined the agreement between mothers and daughters.
RESULTS: Compared with clinical assessment, sensitivity of maternal assessment for breast development was 77.2% and specificity was 94.3%. In girls aged ≥11 years, self-assessment had higher sensitivity and specificity than maternal report. Specificity for both mothers and self, but not sensitivity, was significantly lower for overweight girls. In the overall cohort, maternal and daughter agreement for breast development and pubic hair development (T2+ vs T1) was similar (κ = 0.66 [95% confidence interval 0.58–0.75] and κ = 0.69 [95% confidence interval 0.61–0.77], respectively) but declined with age. Mothers were more likely to report a lower Tanner stage for both breast and pubic hair compared with self-assessments.
CONCLUSIONS: These differences in validity should be considered in studies measuring pubertal changes longitudinally when they do not have access to clinical assessments.
- CI: confidence interval
- OR: odds ratio
What’s Known on This Subject:
Mothers and girls underreport pubertal development relative to clinical measurements. Many epidemiologic studies base pubertal assessment on a single source (clinical, maternal, or self) and/or change sources over time as girls age and clinical and maternal assessments become more difficult.
What This Study Adds:
Maternal breast Tanner assessments were more valid than self-assessments compared with clinical only before age 11 years. Among girls ≥11 years, self-assessments had higher sensitivity and specificity. These differences in validity should be considered in studies measuring pubertal changes longitudinally.
Early age at menarche is associated with increased risk of breast cancer, ovarian cancer, Type 2 diabetes, and other health conditions.1–9 Earlier menarche has also been shown to be associated with higher rates of depression, anxiety, eating disorders, smoking, and substance abuse.10–15 Studies have also found that early menarche was associated with a 13% to 16% increased risk of all-cause mortality, even after adjusting for BMI.16,17
Age at menarche has been used as a proxy for onset of pubertal development. Markers of puberty, including breast and pubic hair development, often begin several years before first menses.18 Age at menarche started to decline in the early 1900s but has remained fairly constant at 12 to 13 years for the past 60 years.19 Over the past generation, there has been a dramatic decline in the age at onset of breast development.20,21 Because the age of onset of breast development is decreasing without corresponding decreases in age at first menses, menarche is an inaccurate indicator of pubertal onset.
Independent of age at menarche, earlier age at breast development has been associated with a 20% increased breast cancer risk in a prospective cohort of 104 931 women.22 Importantly, the study suggested that the number of years between onset of breast development and menarche, referred to as tempo, may affect risk over and above the age at attainment of any single pubertal milestone.22 Moreover, the window between onset of breast development and first menses has become wider in most populations worldwide,23,24 suggesting a possible future increase in breast cancer incidence.21,25
Pubertal onset, defined as the beginning of breast and/or pubic hair development, is often assessed by using Tanner staging,26 which is routinely used in clinical evaluations. Tanner stages range from T1 to T5, with T1 referring to prepubertal development and T5 indicating full development. T2 is the first appearance of either breast buds or pubic hair and is used to indicate the onset of puberty. Tanner stage is generally assessed by a clinician but can also be evaluated by self- or maternal report using drawings of Tanner stages with explanatory text.27
Most studies of pubertal development use only a single source of Tanner staging. For example, of the large epidemiologic studies, 4 use clinical staging by a trained professional,20,21,28,29 1 uses self-assessment,30 and 1 uses maternal staging, self-staging, or a combined measure.31 To interpret data across studies, it is important to determine whether pubertal development information differs by source, which can only be examined within the few cohorts that collect pubertal data from multiple sources.32–34 It is also important to determine whether factors such as age, family history of breast cancer, BMI, and race/ethnicity affect measurements to assess whether there could be any differential bias in pubertal assessment by source of Tanner staging (eg, clinical, maternal, or self). We report results from reliability and validity analyses comparing maternal and self-assessment to clinical staging in a large study of girls’ health and development.
LEGACY Girls Study
The LEGACY Girls Study is a 5-site prospective study of pubertal development in girls ages 6 to 13 years at recruitment, half of whom have a family history of breast cancer (for details, see John et al35). Classification of pubertal timing is based on the Growth and Development Questionnaire completed every 6 months by mothers/guardians for girls of all ages and by girls aged ≥10 years. It includes questions on age at menarche and breast and pubic hair development using line drawings that show 5 stages of development, Tanner T1 through T5, for breast and pubic hair.26 Because 97% of girls participated in LEGACY with their biological mother,35 we will refer to the guardian as the mother from here on. The girls’ self-assessment will be used for sensitivity analyses.
At 2 sites, we also collected clinical measures of breast development. Three clinical raters from New York and 1 from Utah were trained concurrently on the determination of Tanner breast stage using visual inspection along with palpation when necessary. Palpation was used in addition to visual assessment in a subset of girls, if they consented, to help the clinical raters distinguish between Tanner stages 1 and 2. Palpation was used in 32.2% of baseline clinical Tanner measures. The addition of palpation did not change the clinical Tanner rating in 92.1% of instances when palpation was used. The clinical raters did not evaluate pubic hair Tanner stage. Clinician interrater reliability for breast Tanner stage was almost perfect, with weighted κ scores ranging from 0.93 to 1.00 and κ for T2+ versus T1 ranging from 0.94 to 1.00 (based on 181 assessments with 2 clinical raters, see Supplemental Table 7).
We calculated measures of validity by treating the clinical assessment as the gold standard: sensitivity (percentage correctly identifying the onset of breast development, T2+) and specificity (percentage correctly identifying prepubertal stage, T1), separately for mothers and daughters. We calculated concordance (overall agreement), κ (T1 vs T2+) and weighted κ (for T1–T5) for the first visit with clinical staging available for New York and Utah girls. For both breast and pubic hair Tanner staging, we calculated κ between maternal and self-assessment for girls ages ≥10 years from all study sites. κ statistics were interpreted by strength of agreement as follows: <0.00, poor; 0.00–0.20, slight; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, substantial; 0.81–1.00, almost perfect.36
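The validity and agreement measures defined above can be sketched in a few lines of Python. This is an illustrative sketch only, not the study's analysis code: the function names and the 2x2 counts are hypothetical, and the table is assumed to cross-classify clinical staging (gold standard) against a maternal or self-report of onset (T2+ vs T1).

```python
def validity_and_kappa(tp, fn, fp, tn):
    """Validity and agreement for onset (T2+ vs T1) from a 2x2 table.

    tp: clinic T2+, report T2+    fn: clinic T2+, report T1
    fp: clinic T1,  report T2+    tn: clinic T1,  report T1
    """
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)   # share of clinical T2+ reported as T2+
    specificity = tn / (tn + fp)   # share of clinical T1 reported as T1
    concordance = (tp + tn) / n    # overall agreement

    # Agreement expected by chance, from the marginal proportions.
    p_report_pos = (tp + fp) / n
    p_clinic_pos = (tp + fn) / n
    expected = (p_report_pos * p_clinic_pos
                + (1 - p_report_pos) * (1 - p_clinic_pos))
    kappa = (concordance - expected) / (1 - expected)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "concordance": concordance,
        "kappa": kappa,
    }


def interpret_kappa(kappa):
    """Map kappa to the strength-of-agreement labels quoted in the text."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical counts, chosen only to illustrate the arithmetic.
result = validity_and_kappa(tp=40, fn=10, fp=5, tn=45)
print(result, interpret_kappa(result["kappa"]))
```

Sensitivity and specificity condition on the clinical stage, whereas κ treats the two raters symmetrically, which is why a report can show similar κ in two subgroups and still differ in validity.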
We examined differences in validity and agreement by age, breast cancer family history in first- and second-degree relatives, BMI, race/ethnicity, and study site. We calculated BMI percentiles and z scores for each girl based on age and gender using the Centers for Disease Control and Prevention SAS source code37 and compared girls with a BMI <85th percentile with those with a BMI ≥85th percentile.38 Race/ethnicity was mother-reported and categorized as non-Hispanic white, non-Hispanic black, Hispanic, Asian/Pacific Islander, or other for analyses using the full cohort. We combined girls identified as non-Hispanic black, Asian/Pacific Islander, or other into 1 group for analyses using clinical staging because of small numbers in each category. We formally tested differences in sensitivity and specificity of maternal and self-assessment by each characteristic using a 2-sample test of proportions.
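A 2-sample test of proportions of the kind mentioned above can be sketched as a pooled z test. The text does not specify which variant or software was used, so the pooled-variance form, the function name, and the counts below are all assumptions for illustration.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided pooled z test for a difference between two proportions.

    x1/n1 and x2/n2 could be, for example, the specificity (correct T1
    reports over all clinical T1 girls) in two BMI groups.
    """
    p1 = x1 / n1
    p2 = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 30 of 40 correct in one group, 90 of 100 in the other.
z, p = two_proportion_z_test(30, 40, 90, 100)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```

With small subgroups, as in some strata here, the normal approximation can be poor and an exact test may be preferable.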
We used polytomous logistic regression to examine factors (ie, girl’s age, family history status, BMI at visit [<85th percentile vs ≥85th percentile], race/ethnicity, and study site) associated with discordant clinical and maternal assessments of breast onset compared with the referent group of girls with concordant staging.
Clinical Versus Maternal Assessment of Breast Development
Girls with clinical assessments (n = 282) were slightly younger and smaller than girls in the overall cohort (n = 1040; Table 1). Of the clinical and maternal assessments, 73% were in agreement (Table 2). When there was disagreement, mothers were more likely to underestimate than overestimate their daughter’s breast stage. Unweighted κ for all 5 breast Tanner stages was 0.54 (95% confidence interval [CI] 0.47 to 0.62), and weighted κ was 0.72 (95% CI 0.67 to 0.78), indicating that discrepant assessments typically differed by only 1 stage. The κ for breast T2+ compared with T1 was 0.73 (95% CI 0.65 to 0.81), indicating substantial agreement between maternal and clinical assessments for the onset of breast development. Seventy-seven percent of mothers accurately identified when their daughters were T2+ (sensitivity), and 94.3% accurately identified when their daughters were T1 (specificity).
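The gap between unweighted and weighted κ reported above arises because weighted κ gives partial credit to near misses. The sketch below shows this on a made-up 5x5 table in which every disagreement is off by exactly 1 stage. Linear weights are an assumption (the text does not state which weighting scheme was used), and the counts are hypothetical.

```python
def cohen_kappa(table, weighted=False):
    """Cohen's kappa for a k x k table of counts.

    table[i][j] is the number of girls staged i+1 by rater A and j+1 by
    rater B. With weighted=True, disagreements are penalized linearly by
    their distance in stages (assumed scheme, for illustration).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        if weighted:
            return abs(i - j) / (k - 1)
        return 0.0 if i == j else 1.0

    observed = sum(disagreement(i, j) * table[i][j]
                   for i in range(k) for j in range(k)) / n
    expected = sum(disagreement(i, j) * row_tot[i] * col_tot[j]
                   for i in range(k) for j in range(k)) / (n * n)
    return 1 - observed / expected


# Hypothetical counts: all disagreements are exactly 1 stage apart.
table = [
    [16, 4, 0, 0, 0],
    [0, 16, 4, 0, 0],
    [0, 0, 16, 4, 0],
    [0, 0, 0, 16, 4],
    [0, 0, 0, 0, 20],
]
print(round(cohen_kappa(table), 2), round(cohen_kappa(table, weighted=True), 2))
```

For these made-up counts the unweighted κ is 0.80 and the linearly weighted κ is 0.90; the same pattern, at lower levels, is what the reported 0.54 versus 0.72 indicates.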
Validity of maternal report, when compared with the clinical assessment as the gold standard, differed significantly by age (sensitivity and specificity) and BMI (specificity; Table 3). Sensitivity of maternal report of T2+ was 56.0% for girls <10 years and 82.7% for those ≥10 years of age; specificity was 96.4% and 79.0%, respectively. Specificity was lower for mothers of overweight girls (≥ 85th percentile) (73.7% vs 97.0%). When we examined discordances between maternal and clinical reports using polytomous logistic regression models, only daughter’s BMI ≥85th percentile was associated with maternal overestimation of breast onset (odds ratio [OR] = 6.0, 95% CI 1.5 to 23.1).
Clinical Versus Maternal and Self-Assessment of Breast Development in Girls Aged ≥10 Years
Compared with the clinical assessment, agreement with self-assessment was lower than agreement with maternal assessment (Table 3). Sensitivities were slightly higher for self-assessment than maternal assessment, but specificities were much lower for self-assessment than maternal assessment, suggesting that girls ages ≥10 years are less accurate than their mothers at determining true negatives (no breast budding based on breast Tanner stage). Specificity in girls improved with age, and by age 11 years, girls had perfect specificity (Table 3) compared with the mothers’ specificity at age 11 years, which was lower at 75%.
Maternal Versus Self-Assessment of Breast Development in the Overall Cohort Ages ≥10 Years
Agreement between maternal and self-staging was substantial (weighted κ = 0.68, 95% CI 0.64 to 0.72; κ for T2+ = 0.66, 95% CI 0.58 to 0.75; Table 4). Girls were more likely to report a higher breast Tanner stage compared with their mother (Table 5, and also Supplemental Table 8 for details). Agreement on breast onset differed substantially by BMI (≥85th percentile, κ = 0.38, 95% CI 0.05 to 0.72; <85th percentile, κ = 0.68, 95% CI 0.59 to 0.77); differences were smaller for family history and race/ethnicity except for the Asian subgroup, where agreement was lower.
Maternal Versus Self-Assessment of Pubic Hair Development in the Overall Cohort Ages ≥10 Years
Maternal and self-staging for girls aged ≥10 years showed slightly higher agreement for pubic hair Tanner stage (weighted κ = 0.72, 95% CI 0.68 to 0.76) and pubic hair onset (κ = 0.69, 95% CI 0.61 to 0.77) than for breast Tanner stage (Table 6). Agreement in pubic hair assessment did not differ by BMI and was similar by family history, race/ethnicity, and study site (Table 6). Girls were more likely to report a higher pubic Tanner stage compared with their mother (Table 5, and also Supplemental Table 9 for details).
Our study demonstrates that the validity of assessments of pubertal development milestones differs by source of information. Compared with clinical reports, both mothers and daughters were more likely to underreport breast Tanner stage. Compared with the gold standard of clinical assessment, maternal assessment had a higher sensitivity and higher specificity in girls aged 10 years. At age ≥11 years, self-assessment had a higher sensitivity and specificity for breast Tanner staging than maternal report. Our results suggest that maternal assessment of breast onset before age 11 years is more accurate than self-assessment. For girls aged ≥11 years, self-assessment is more accurate. We did not have a clinical assessment for pubic hair development. Maternal and self-assessment had substantial agreement, with a similar range for breast and pubic Tanner assessment. Girls were much more likely than their mothers to report a higher breast and pubic hair stage. Therefore, studies using only maternal assessment in older girls will result in a higher average age at these pubertal milestones.
Accuracy of Breast Development Versus Clinical Assessment as Gold Standard
Agreement between clinical raters was almost perfect in our study, with κ ranging from 0.85 to 1.00 for T1 to T5 and 0.94 to 1.00 for T2+ compared with T1. These κ values were slightly higher than those reported in other studies, where estimates have ranged from 0.67 to 0.90, indicating substantial agreement,20,21,32,33,39 with some exceptions.40 We found that almost three-quarters of mothers accurately assessed breast Tanner stage compared with the clinical assessment. Mothers are generally found to be more reliable reporters of breast development than daughters, compared with physician ratings.32,41
In contrast, the majority of the girls in our study did not correctly stage their own breast development, which is consistent with other small studies of girls with similar age ranges.40,42,43 In our study, girls tended to underreport their own breast development, perhaps as a result of embarrassment about bodily changes and breast development during puberty.44 The literature on bias in self-assessment has been inconsistent,32,33,43 but there is some evidence that suggests that age and stage of development influence the direction of bias, with younger, less developed girls more likely to overestimate breast development and older, more developed girls more likely to underestimate breast development.43,45,46 We observed that girls aged ≥11 years were more accurate compared with the clinical gold standard than their mothers.
Maternal Versus Self-Staging of Pubic Hair Development
We assessed agreement between maternal and self-assessment for pubic hair measurements but did not have clinical measurements for validity. Previous studies comparing clinical and self-assessment reported agreement ranging from 0.37 to 0.91.40,43,45–47 Previous studies have shown a wide range of accuracy for self-assessed pubic hair staging,40,45,47,48 although 2 studies that also examined mother report suggest that self-staging may be more reliable than maternal staging for pubic hair development.33,41 We found that girls were more likely to report a higher stage of pubic hair development than their mothers.
Age-Related Differences Between Maternal and Self-Assessments
Our study can help reconcile opposing conclusions between 2 recent reports on the reliability and validity of Tanner staging in contemporary cohorts. In a Danish study, the authors argued that although clinical measures are preferred, self-assessments could be used in large epidemiologic studies if the main purpose was to determine whether the onset of puberty occurred (breast Tanner 2+ vs T1).33 The Chilean study concluded that maternal reports could be used for cohorts without clinical measures and that these maternal measures did not differ by the daughters’ BMI.32 Our findings help explain the different conclusions from these studies because the Danish study was conducted in older girls (median age 10.9, range 6.2 to 14.7),33 compared with ours (median age 9.5, range 6.0 to 15.1). In our older girls, we also found that self-assessments are preferred for greater accuracy. We disagree with the conclusion of the Danish study33 that epidemiologic studies can use self-assessment for distinguishing between prepuberty (T1) and puberty (T2+). The higher sensitivity and specificity for pubertal onset in the Danish cohort, on which the conclusion that self-assessment is accurate rests, reflect an older age distribution and a much smaller percentage of the cohort still in prepuberty. Comparing the 3 studies in terms of percentage of girls still in prepuberty (T1) determined by clinical assessment, the Chilean study had 83.9% of girls still in T1, compared with 56.4% in our study and only 19.8% in the Danish study. Thus, self-assessment may be useful for older girls in terms of the feasibility of data collection and more accurate than maternal assessment for girls aged ≥11 years based on our results, but it may be less useful for determining the onset of puberty, which, for many girls, takes place at younger ages.
Other Factors Affecting Accuracy and Agreement
After considering the age differences discussed earlier, only BMI was related to the discordance in our study between maternal and clinical assessments. A previous study reported poor reliability between clinical and self-staging in overweight girls,42 but others did not.32,33 In overweight girls with more fat tissue, it may be especially difficult to distinguish glandular breast tissue from fat tissue using visual assessment only.42 We overcame this limitation through our clinical ratings, which used visual assessment with palpation when necessary.49 However, our maternal and self-assessments differed from the clinical assessments, particularly in overweight girls. Thus, we disagree with the conclusion by the Chilean study32 that mothers can be used when clinical assessments are not available without adjusting the maternal assessments for the level of sensitivity and specificity. Maternal and self-staging of pubic hair development did not differ by BMI, likely a result of body size not influencing the appearance of pubic hair.
We also assessed whether accuracy differed by family history of breast cancer, given that breast cancer worry is higher in families with a family history of breast cancer than in those without.50,51 The sensitivity of maternal assessment for breast Tanner stage was modestly higher in families with a breast cancer family history than in families without (80% vs 74%), but this difference was not statistically significant. Similar to an earlier study,52 we did not observe statistically significant differences in reliability and validity by race/ethnicity. Because most girls at the New York and Utah study sites were non-Hispanic white or Hispanic, we lacked sufficient statistical power to detect differences in maternal report of breast development for other racial/ethnic groups. Other studies in more diverse populations have found that black or Hispanic adolescents were less accurate in staging their breast and pubic hair development than were non-Hispanic white adolescents.40,48
Our results suggest that findings from studies that rely on maternal or self-staging of pubertal development may be biased. Validity studies such as ours can be used to adjust the estimates from epidemiologic studies because they can be used to determine the direction and the magnitude of the bias.53,54 We illustrate this by using the data reported from the Chilean study that stratified reliability measures by child’s BMI and observed a similar κ between maternal assessment and clinical assessment (by trained personnel) for overweight girls as for average weight girls (κ = 0.74 compared with κ = 0.71, respectively).32 Even though they reported similar reliability measures, the validity measures using the results from the trained personnel were different (sensitivity = 0.87 and 0.92 for average weight and overweight girls, specificity = 0.94 and 0.90, respectively). Thus, using maternal reports in this case would result in a higher estimate of the association between being overweight and breast onset (OR = 1.39, 95% CI 0.82 to 2.3) compared with the association using the results from the clinical assessment (OR = 1.18, 95% CI 0.66 to 2.12). Thus, validity studies conducted within a subcohort provide essential data to understand the impact of measurement error when clinical assessments are not available for the entire cohort. Our study did not have a clinical assessment for pubic hair development, and thus our validity findings were limited to breast only, whereas our reliability findings evaluated both breast and pubic hair development.
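One simple way to see how validity estimates can feed into such an adjustment is to back-correct an observed proportion with the Rogan-Gladen estimator and then recompute the odds ratio. This is a hedged sketch of the general idea, not the procedure used in this study or in the cited references. The sensitivity and specificity values are those quoted above for the Chilean study; the observed proportions of maternally reported breast onset are hypothetical, and the function names are illustrative.

```python
def corrected_proportion(p_observed, sensitivity, specificity):
    """Rogan-Gladen correction of an observed proportion for misclassification.

    Inverts p_observed = Se * p_true + (1 - Sp) * (1 - p_true).
    """
    youden = sensitivity + specificity - 1
    if youden <= 0:
        raise ValueError("report is no better than chance; cannot correct")
    p_true = (p_observed + specificity - 1) / youden
    return min(max(p_true, 0.0), 1.0)  # clamp to a valid proportion


def odds_ratio(p_exposed, p_unexposed):
    """Odds ratio comparing two proportions."""
    return (p_exposed / (1 - p_exposed)) / (p_unexposed / (1 - p_unexposed))


# Hypothetical observed proportions with maternally reported breast onset.
obs_overweight, obs_average = 0.30, 0.20

# Sensitivity and specificity by BMI group as quoted in the text.
true_overweight = corrected_proportion(obs_overweight, 0.92, 0.90)
true_average = corrected_proportion(obs_average, 0.87, 0.94)

print("OR from maternal report:", round(odds_ratio(obs_overweight, obs_average), 2))
print("OR after correction:    ", round(odds_ratio(true_overweight, true_average), 2))
```

Because validity differs between BMI groups, the misclassification is differential, so the corrected odds ratio can move in either direction relative to the uncorrected one. A full analysis would also propagate the uncertainty in the sensitivity and specificity estimates.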
Our findings have implications for the interpretation of pubertal development data across pubertal cohorts because many collect information on pubertal development only from a single source20,21,28–30 and/or change sources over time.31 Specifically, our results support that for breast development, maternal report is more accurate for girls younger than 11 years and that self-assessment alone should not be used in epidemiologic studies of pubertal onset. For girls aged ≥11 years, self-assessment is more accurate for breast development. In studies lacking clinical breast Tanner for the whole cohort, sensitivity analyses adjusting for the validity of maternal and self-assessments should be used to understand the impact measurement error may have on the overall study conclusions.
The authors thank the LEGACY girls and their family members for their continuing contributions to the study and our colleagues at the participating family genetics and oncology clinics.
- Accepted March 28, 2016.
- Address correspondence to Mary Beth Terry, PhD, Columbia University Mailman School of Public Health, Department of Epidemiology, 722 West 168th St, New York, NY 10032. E-mail:
FINANCIAL DISCLOSURE: The authors have indicated they have no financial relationships relevant to this article to disclose.
FUNDING: This work was supported by grants from the National Cancer Institute at the National Institutes of Health (R01 CA138638 to Dr John, R01 CA138819 to Dr Daly, R01 CA138822 to Dr Terry, and R01 CA138844 to Dr Andrulis) and the Canadian Breast Cancer Foundation (Dr Andrulis). Dr Andrulis holds the Anne and Max Tanenbaum Chair in Molecular Medicine at Mount Sinai Hospital and the University of Toronto. Funded by the National Institutes of Health (NIH).
POTENTIAL CONFLICT OF INTEREST: The authors have indicated they have no potential conflicts of interest to disclose.
- Garland M, Hunter DJ, Colditz GA, et al
- Collaborative Group on Hormonal Factors in Breast Cancer
- Jordan SJ, Webb PM, Green AC
- Currie C, Ahluwalia N, Godeau E, Nic Gabhainn S, Due P, Currie DB
- van Jaarsveld CH, Fidler JA, Simon AE, Wardle J
- Joinson C, Heron J, Lewis G, Croudace T, Araya R
- Jacobsen BK, Heuch I, Kvåle G
- Herman-Giddens ME, Slora EJ, Wasserman RC, et al
- de Muinich Keizer SM, Mul D
- Biro FM, Greenspan LC, Galvez MP, et al
- Marshall WA, Tanner JM
- Hui LL, Leung GM, Wong MY, Lam TH, Schooling CM
- Rasmussen AR, Wohlfahrt-Veje C, Tefre de Renzy-Martin K, et al
- Bandera EVWM, Marcella S, Donaldson A, et al
- Division of Nutrition, Physical Activity, and Obesity, National Center for Chronic Disease Prevention and Health Promotion. A SAS Program for the 2000 CDC Growth Charts (ages 0 to <20 y). Available at: www.cdc.gov/nccdphp/dnpao/growthcharts/resources/sas.htm. 2014. Accessed June 30, 2014
- Barlow SE; Expert Committee
- Bonat S, Pathomvanich A, Keil MF, Field AE, Yanovski JA
- Duke PM, Litt IF, Gross RT
- Gibbons A, Groarke A
- Neinstein LS
- Copyright © 2016 by the American Academy of Pediatrics