Published online March 1, 2007
PEDIATRICS Vol. 119 No. 3 March 2007, pp. 608-610 (doi:10.1542/peds.2006-3030)
P Less Than .05: What Does It Really Mean?
Zeev N. Kain, MD, MBA and Jill MacLaren, PhD
Center for the Advancement of Perioperative Health and Departments of Anesthesiology, Pediatrics, and Child Psychiatry, Yale University School of Medicine, New Haven, Connecticut
Abbreviations: CI, confidence interval
Although there is a growing body of literature criticizing the use of mere statistical significance as a measure of clinical impact, we submit that this concept has not been widely incorporated in the pediatric literature. This is especially problematic because an understanding of the limitations of using only statistical significance to evaluate treatments is necessary for readers of Pediatrics to draw accurate conclusions from data presented in this journal. Here we highlight some of the issues related to the complex problem of evaluating treatment effects and the importance of using clinical significance in addition to the traditional P value.
Currently, the magical boundary of P < .05 holds great importance in determining whether a manuscript is accepted for publication, a research application is funded, or a new drug is approved by the Food and Drug Administration. We submit that if a treatment is to be useful to our children, it is not enough for treatment effects to be statistically significant; they also need to be large enough to be clinically meaningful. Evaluating treatment outcomes on the basis of P value alone is problematic for several reasons. First, with a large sample, it is quite possible to find a statistically significant difference between groups despite a minimal effect of treatment (ie, small effect size). Second, pediatricians typically misinterpret study outcomes with lower P values as having stronger effects than those with higher P values. That is, most clinicians believe that a result with P = .002 reflects a much greater treatment effect than a result with P = .045. Although this is true if the sample size is the same in both studies, it need not be true if the sample size is larger in the study with the smaller P value. This confusion becomes particularly concerning when one realizes that most pharmaceutically funded studies have very large sample sizes.
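The interplay between sample size and P value described above can be illustrated with a minimal sketch. It assumes two equal-sized groups and a large-sample z approximation (a t test would give slightly different values for the smaller study); the function name `two_group_p_value` and both hypothetical studies are illustrative and do not come from any real trial.

```python
from math import sqrt
from statistics import NormalDist


def two_group_p_value(d: float, n_per_group: int) -> float:
    """Two-sided P value for a standardized mean difference d between
    two equal-sized groups (large-sample z approximation, an assumption)."""
    z = d * sqrt(n_per_group / 2)  # the test statistic grows with sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical studies; the numbers are illustrative only.
small_study = two_group_p_value(d=0.50, n_per_group=32)    # medium effect
large_study = two_group_p_value(d=0.05, n_per_group=8000)  # tiny effect

print(f"d = 0.50, n = 32 per group:   P = {small_study:.3f}")
print(f"d = 0.05, n = 8000 per group: P = {large_study:.3f}")
```

Under these assumptions, the study with the tenfold smaller effect yields the smaller P value simply because it enrolled far more participants.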
To combat overreliance on the P value, we recommend that pediatricians ask 3 basic questions when examining the report of a clinical trial:
- Could the findings of the clinical trial be solely a result of a chance occurrence? (ie, statistical significance)
- How large is the difference between the primary end points of the study groups? (ie, impact of treatment, effect size)
- Is the difference of primary end points between groups meaningful to a patient? (ie, clinical significance)
UNDERSTANDING STATISTICAL SIGNIFICANCE
As is familiar to most readers of Pediatrics, the P value is the most commonly used method of evaluating the statistical significance of any finding. The origin of the P value lies in 1925 with Sir Ronald A. Fisher, who first suggested the use of a boundary between significance and nonsignificance that was based on probability.1–3 He arbitrarily set this boundary at P = .05, where "P" stands for the probability of obtaining a finding at least as extreme as the one observed if chance alone were at work.1,2 Although Fisher's emphasis on significance testing and the arbitrary boundary of P < .05 are familiar and widely used, it is important for pediatricians to recognize that this approach has been widely criticized over the past 80 years. Specifically, it is criticized because it does not take into account the size and clinical significance of the observed effect. That is, a small effect in a study with a large sample size may have the same P value as a large effect in a study with a small sample size.
In an attempt to address some of the limitations of the P value, use of confidence intervals (CIs) has been advocated by some clinicians.3 It is important that readers realize, however, that these 2 definitions of statistical significance are essentially reciprocal.4 That is, a P value of <.05 for a difference between groups is essentially the same as a 95% CI for that difference that does not include 0. CIs do have some advantage, however, in that they can be used to estimate the size of the difference between groups.5 Unfortunately, this approach is not widely used in the pediatric literature, and CIs are mostly used today as surrogates for the hypothesis test rather than as a guide to the full range of likely effect sizes.
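The reciprocity between the P value and the CI can be sketched as follows. This is a minimal illustration that assumes a normally distributed estimate of the between-group difference with a known standard error; the function name `difference_ci_and_p` and the example values are hypothetical.

```python
from statistics import NormalDist


def difference_ci_and_p(mean_diff: float, std_error: float,
                        confidence: float = 0.95):
    """Return (lower, upper, p) for a between-group difference,
    assuming a normal sampling distribution (an assumption)."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    lower = mean_diff - z_crit * std_error
    upper = mean_diff + z_crit * std_error
    p = 2 * (1 - NormalDist().cdf(abs(mean_diff / std_error)))
    return lower, upper, p


# Hypothetical differences with the same standard error.
for diff in (4.0, 2.0):
    lower, upper, p = difference_ci_and_p(diff, std_error=1.5)
    excludes_zero = lower > 0 or upper < 0
    print(f"diff = {diff}: 95% CI ({lower:.2f}, {upper:.2f}), "
          f"P = {p:.3f}, CI excludes 0: {excludes_zero}")
```

In this sketch the interval excludes 0 exactly when P < .05, but the interval also shows how large or small the true difference could plausibly be, which the P value alone does not.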
BEYOND THE P VALUE: EFFECT SIZES
Providing more information than either P values or CIs, the statistics called "effect sizes" are measures of the magnitude of difference between groups, standardized by controlling for variation within groups. In other words, whereas a P value denotes whether the difference between 2 groups in a particular study is likely to occur solely by chance, the effect size quantifies the amount of difference between these 2 groups. Because effect sizes are based on standardized differences between groups and not on sample size, they better evaluate the strength of the intervention. Of particular relevance to pediatricians are effect sizes of the d type, because these are primarily used to compare 2 treatment groups. The d-type effect size is defined as the difference between 2 means divided by the SD [(mean of control group – mean of treatment group)/SD of the control group]. Thus, the d effect size depends on variation within the control group and the difference between the control and intervention groups. Conventionally, d-type effect sizes near .20 are interpreted as "small," effect sizes near .50 are considered "medium," and effect sizes around .80 are considered "large."6 Effect sizes of another type, the risk potency type, include likelihood ratios such as odds ratio, risk ratio, risk difference, and relative risk reduction. Clinicians are probably more familiar with these less abstract statistics, and it may be helpful to realize that likelihood statistics are a type of effect size. There are a number of other types of effect sizes; description of these types and their formulae is beyond the scope of this commentary, and we refer the interested reader to review articles that discuss these issues.7,8
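Both families of effect size mentioned above can be computed in a few lines. The sketch below follows the d-type definition given in the text (difference in means divided by the SD of the control group, a variant sometimes called Glass's delta) and the standard definitions of the risk-based measures. The function names, the sign convention for risk difference (control minus treated), and all data are hypothetical illustrations.

```python
from statistics import mean, stdev


def d_effect_size(control, treatment):
    """(mean of control - mean of treatment) / SD of the control group."""
    return (mean(control) - mean(treatment)) / stdev(control)


def risk_effect_sizes(events_treated, n_treated, events_control, n_control):
    """Risk-based effect sizes from the counts of a 2x2 table."""
    risk_t = events_treated / n_treated
    risk_c = events_control / n_control
    return {
        "risk_ratio": risk_t / risk_c,
        "risk_difference": risk_c - risk_t,  # convention assumed: control minus treated
        "relative_risk_reduction": 1 - risk_t / risk_c,
        "odds_ratio": (risk_t / (1 - risk_t)) / (risk_c / (1 - risk_c)),
    }


# Hypothetical pain scores (lower is better); not real patient data.
control_scores = [6, 7, 5, 8, 6, 7, 5, 6]
treatment_scores = [4, 5, 3, 6, 4, 5, 4, 5]
d = d_effect_size(control_scores, treatment_scores)
print(f"d = {d:.2f}")

# Hypothetical trial: 10 of 100 treated and 20 of 100 control patients have the event.
risks = risk_effect_sizes(10, 100, 20, 100)
for name, value in risks.items():
    print(f"{name}: {value:.3f}")
```

Note that none of these quantities depends directly on the number of patients enrolled, which is why they describe the strength of an intervention more faithfully than a P value does.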
FURTHER STILL: CLINICAL SIGNIFICANCE
At this point, we feel that it is important to caution Pediatrics readers that magnitude of change (effect size) should not be interpreted as an indication of clinical significance. The clinical significance of a treatment should instead be based on external standards provided by patients and clinicians. That is, a small effect size may still be clinically significant and, likewise, a large effect size may not be clinically significant. Indeed, there is a growing recognition that traditional methods, such as statistical significance tests and effect sizes, should be supplemented with methods for determining clinically significant change.
Although there is little consensus about the criteria for these efficacy standards, the most common definitions of clinically significant change include: (1) treated patients make a statistically reliable improvement in the change scores; (2) treated patients are empirically indistinguishable from a normal population after treatment; or (3) there are changes of at least 1 SD. The most frequently used method for evaluating the reliability of change scores is the Jacobson-Truax method in combination with clinical cutoff points.9 Using this method, change is considered unlikely to be the product of measurement error if the reliable change index is >1.96. That is, when the change score of a patient yields a reliable change index >1.96, one can reasonably assume that the score has indeed improved.
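The reliable change index can be sketched in code. This minimal example uses the standard Jacobson-Truax formulation, in which the change score is divided by the standard error of the difference, itself derived from the pretreatment SD and the reliability of the measure. The function name and the patient values are hypothetical, and the absolute value is taken on the assumption that improvement may run in either direction depending on the scale.

```python
from math import sqrt


def reliable_change_index(pre: float, post: float,
                          sd_pre: float, reliability: float) -> float:
    """Change score divided by the standard error of the difference."""
    se_measurement = sd_pre * sqrt(1 - reliability)  # standard error of measurement
    s_diff = sqrt(2 * se_measurement ** 2)           # standard error of the difference
    return (post - pre) / s_diff


# Hypothetical patient and scale properties; not real data.
rci = reliable_change_index(pre=32.5, post=47.75, sd_pre=7.5, reliability=0.80)
is_reliable = abs(rci) > 1.96
print(f"RCI = {rci:.2f}, reliable change: {is_reliable}")
```

A measure with lower reliability inflates the standard error of the difference, so the same raw change score is less likely to count as reliable change.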
The validity of each of the above-described methods can be improved further by establishing their external validity (eg, patient perspective). For example, Flor et al10 conducted a large meta-analysis and evaluated the effectiveness of multidisciplinary treatment for chronic pain. The investigators found that pain among the patients who received the intervention was reduced by 25% with an effect size of .7. Although this finding seems promising statistically, the meaning of the results changes in light of findings from Colvin et al, who reported that patients consider only a 50% pain improvement a "treatment success."11 Thus, in this example, a reduction of 25% in pain scores may be statistically but not clinically significant. Clearly, this is a developing area that warrants additional discussion.
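The chronic pain example can be restated as a small check that keeps effect size and clinical significance separate. This sketch uses only the figures quoted above (25% pain reduction, effect size of .7, 50% patient-defined threshold); the function names are hypothetical, and labeling .7 against the conventional benchmarks assumes that it is a d-type effect size, which the text does not state.

```python
# Conventional benchmarks for d-type effect sizes (small, medium, large).
COHEN_BENCHMARKS = {"small": 0.20, "medium": 0.50, "large": 0.80}


def nearest_cohen_label(d: float) -> str:
    """Label an effect size by the closest conventional benchmark."""
    return min(COHEN_BENCHMARKS,
               key=lambda label: abs(abs(d) - COHEN_BENCHMARKS[label]))


def meets_patient_standard(observed_improvement_pct: float,
                           required_improvement_pct: float) -> bool:
    """Compare observed improvement with an external, patient-defined standard."""
    return observed_improvement_pct >= required_improvement_pct


label = nearest_cohen_label(0.7)         # effect size quoted in the text
success = meets_patient_standard(25, 50)  # 25% observed vs 50% required
print(f"Effect size label: {label}; meets patient standard: {success}")
```

The two answers diverge here: an effect size closest to the "large" benchmark still falls short of what patients themselves regard as a treatment success.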
CONCLUSIONS
The issue of clinical significance is of utmost importance to both pediatric researchers and clinicians. On the research side, it is imperative that studies routinely evaluate both statistical and clinical significance to advance our understanding of treatment effects. As such, we encourage researchers to report effect sizes, at the very least, and incorporate external validations of clinical significance when possible. On the clinical side, pediatricians must understand the potential disconnect between statistical and clinical significance when making decisions about the adoption of new treatments. The interpretation of any research findings should occur in the context of the magnitude of change that occurred and the clinical significance of the findings.
ACKNOWLEDGMENTS
This work was supported in part by the National Institutes of Health through National Institute of Child Health and Human Development grant R01HD37007-02.
FOOTNOTES
Accepted Oct 20, 2006.
Address correspondence to Jill MacLaren, PhD, Department of Anesthesiology, Yale University School of Medicine, 333 Cedar St, New Haven, CT 06510. E-mail: zeev.kain{at}yale.edu
The authors have indicated they have no financial relationships relevant to this article to disclose.
Opinions expressed in these commentaries are those of the authors and not necessarily those of the American Academy of Pediatrics or its Committees.
REFERENCES
- Fisher RA. Statistical Methods for Research Workers. 1st ed. Edinburgh, Scotland: Oliver and Boyd; 1925
- Fisher RA. Design of Experiments. 1st ed. Edinburgh, Scotland: Oliver and Boyd; 1935
- Simon R. Confidence intervals for reporting results of clinical trials. Ann Intern Med. 1986;105:429–435
- Feinstein AR. P-values and confidence intervals: two sides of the same unsatisfactory coin. J Clin Epidemiol. 1998;51:355–360
- Gardner MG, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. BMJ. 1986;292:746–750
- Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Mahwah, NJ: Lawrence Erlbaum; 1988
- Kirk R. Practical significance: a concept whose time has come. Educ Psychol Meas. 1996;56:746–759
- Snyder P, Lawson S. Evaluating results using corrected and uncorrected effect size estimates. J Exp Educ. 1993;61:334–349
- Jacobson NS, Truax P. Clinical significance: a statistical approach to defining meaningful change in psychotherapy research. J Consult Clin Psychol. 1991;59:12–19
- Flor H, Fydrich T, Turk DC. Efficacy of multidisciplinary pain treatment centers: a meta-analytic review. Pain. 1992;49:221–230
- Colvin DF, Bettinger R, Knapp R, Pawlicki R, Zimmerman J. Characteristics of patients with chronic pain. South Med J. 1980;73:1020–1023
PEDIATRICS (ISSN 1098-4275). ©2007 by the American Academy of Pediatrics