P, 95CI, and Power
By Kenneth L. Noller, MD
For the vast majority of clinicians (those with MPHs and a biostatistical background excepted), exposure to biostatistics is usually limited to a short, incomplete (often optional) course during the first or second year of medical school. Thus, it is not surprising that most of us, when reading a paper, tend to focus on the one or two statistical elements that have some meaning for us. During the past two decades, the so-called "P" value has been the one statistic upon which most clinicians have focused. Unfortunately, many clinical journals have (seemingly) often decided to publish a paper merely because the study found a "statistically significant" result.
What exactly is a "P" value? Without boring you with the math used in the calculations, a P value represents the Probability of obtaining a result at least as extreme as the one observed if chance alone were at work, that is, if there were truly no difference or association. Thus, a P value of 0.05 means that a result this large would be expected by chance only one time in 20 when no true effect exists. With a P value of 0.01, such a result would be expected by chance only one time in 100. By convention, clinical studies usually use a P value of less than 0.05 as the level at which a study result is accepted as "statistically" significant.
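To make the "one time in 20" idea concrete, the short Python sketch below computes an exact two-sided P value for a made-up example. The numbers (20 patients, 15 responders, and a 50% response rate expected if the treatment does nothing) are illustrative assumptions, not data from any study, and the function name is hypothetical.

```python
from math import comb

def binomial_two_sided_p(successes, n, p_null=0.5):
    """Exact two-sided P value: the total probability, under the null
    response rate, of every outcome no more likely than the one observed."""
    def prob(k):
        return comb(n, k) * p_null**k * (1 - p_null)**(n - k)

    p_obs = prob(successes)
    # The small tolerance guards against floating-point ties.
    return sum(prob(k) for k in range(n + 1) if prob(k) <= p_obs * (1 + 1e-9))

# Hypothetical data: 15 of 20 patients respond; chance alone predicts 10.
p_value = binomial_two_sided_p(15, 20)
print(f"P = {p_value:.3f}")  # prints "P = 0.041"
```

In this invented example, a split as lopsided as 15 to 5 (in either direction) would turn up by chance about 4 times in 100, which is why it falls just under the conventional 0.05 level.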
There are several problems with using a single P value as a "cut-off" for determining whether a study should be read and believed, or rejected (and likely not published). Indeed, epidemiologists are using the P value less and less in their studies and literature. One of the most important problems is that both small and large numbers of subjects tend to produce P values that may be misleading. In many clinical studies, it is virtually impossible to obtain large sample sizes. There may be (for example) only 12 cases of a given disease in the last 10 years, or the FDA may allow a drug to be tested on only 20 subjects. While there are special statistical techniques to deal with small sample sizes, regardless of the technique used, an investigator with only a few subjects will find it difficult to demonstrate a "statistically significant" difference even if one almost certainly exists. Thus, it is wise for clinicians not only to look at P values but also to examine estimates of risk when they are given. For example, I was recently involved in a study in which the affected population had three times as many cases of a specific disease as the control population. I am convinced that the three-fold increase is real because of the rigorous study design, but statistical analysis did not yield a P value of less than 0.05. In fact, the P value indicated that a difference as large as the one we observed would occur by chance only one time in 16 (a P value of about 0.06). That, combined with the three-fold elevation in risk, has convinced me that we have identified a problem.
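The small-sample problem can be sketched with a few lines of Python. The counts below (9 cases among 30 exposed subjects versus 3 among 30 controls) are invented for illustration and are not the data from the study mentioned above; Fisher's exact test is used here simply because it is one common technique for small tables, and the function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]],
    summing the probabilities of all tables no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k cases in the first row.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))

# Hypothetical counts: rows are exposed / control, columns are cases / non-cases.
a, b, c, d = 9, 21, 3, 27
risk_ratio = (a * (c + d)) / (c * (a + b))
p_value = fisher_exact_two_sided(a, b, c, d)
print(f"risk ratio = {risk_ratio:.1f}, P = {p_value:.2f}")  # risk ratio = 3.0, P = 0.10
```

With only 30 subjects per group, even a three-fold difference in risk does not reach the 0.05 level in this sketch, which is why the risk estimate deserves attention alongside the P value.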
Large sample sizes also affect the ability of the P value to provide meaningful information. When dealing with a large study group, almost any difference will be "statistically significant," but often not "clinically significant." For example, many anti-hypertensive drugs have been tested on large enough groups of individuals to show that there is a "statistically significant" decrease in blood pressure while taking the medication, but the reduction in blood pressure is so small that the drug has no clinical usefulness. Another example might be the several new drugs that are being marketed for the treatment of the symptoms of the "common cold." Those have been shown to reduce the duration of symptoms (by a day or two), but the clinical significance may be nil. By the time someone has symptoms of a cold bad enough to contact a physician, be seen, and have a prescription filled, most of the five- or six-day course of the illness will already have passed, and the clinical usefulness of an expensive drug is questionable.
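The effect of sample size on the P value can be illustrated with a rough sketch. It assumes a hypothetical drug that lowers blood pressure by an average of only 1 mm Hg more than placebo, with a standard deviation of 15 mm Hg in both groups; these figures, and the simple normal approximation used, are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_p(mean_diff, sd, n_per_group):
    """Two-sided P value for a difference in means, using a normal
    approximation with equal SDs and equal group sizes (assumed)."""
    se = sd * sqrt(2 / n_per_group)   # standard error of the difference
    z = abs(mean_diff) / se
    return 2 * (1 - NormalDist().cdf(z))

# Same 1 mm Hg difference, two very different study sizes.
p_small = two_sample_p(1.0, 15.0, 100)      # 100 subjects per group
p_large = two_sample_p(1.0, 15.0, 10_000)   # 10,000 subjects per group
print(f"n=100: P = {p_small:.2f}; n=10,000: P = {p_large:.6f}")
```

The difference between the groups is identical in both runs; only the number of subjects changes. The large study produces a very small P value for a reduction that few clinicians would consider meaningful.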
The "95% Confidence Interval" (95CI) is another statistic that is commonly used in clinical papers, but is less well understood by many clinicians. When a study is performed, a "sample mean" is obtained. The 95CI is easily calculated from this mean and its standard error (the standard deviation of the sample divided by the square root of the sample size). It is always presented as a range of values and is perhaps best explained by using an example. Assume that a study was performed on a new cholesterol-lowering drug, that the drug was given to 100 individuals, and that the mean (average) decrease in blood cholesterol was 20 points. The reduction of 20 represents only the reduction observed in the study participants, not what would happen if the drug were given to the entire population. However, the 95CI uses the fact that (in a properly designed study) the sample mean will lie within about two standard errors of the true population mean 95% of the time. Thus, in this example, the 95CI might be 20 ± 6. This suggests that if the drug were given to the entire population, the average reduction in cholesterol would probably be somewhere between 14 and 26 points.
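The arithmetic behind the cholesterol example can be sketched as follows. The mean of 20 and the sample size of 100 come from the example above; the standard deviation of 30 is an assumption chosen so that the interval comes out close to 20 ± 6, and the function name is hypothetical. The interval is built from the standard error of the mean (the standard deviation divided by the square root of the sample size).

```python
from math import sqrt
from statistics import NormalDist

def mean_ci(sample_mean, sd, n, level=0.95):
    """Confidence interval for a mean using the normal approximation."""
    se = sd / sqrt(n)                           # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    return sample_mean - z * se, sample_mean + z * se

# Mean reduction of 20 in 100 subjects; the SD of 30 is assumed.
low, high = mean_ci(20.0, 30.0, 100)
print(f"95CI: {low:.1f} to {high:.1f}")  # prints "95CI: 14.1 to 25.9"
```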
There are a few other important aspects of the 95CI. First, just as for the P value, the sample size greatly affects the "width" of the confidence interval: large studies tend to have narrow confidence intervals whereas small studies have wide ones. Second, when the result is expressed as a ratio (such as a relative risk or an odds ratio), a confidence interval that crosses unity (that is, one that includes the number 1.00) means that the result might be a chance observation. However, as with P values, a confidence interval that "just barely" crosses unity, combined with an elevated risk estimate, suggests that there might truly be a problem.
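Both points can be seen in a brief sketch that computes a confidence interval for a relative risk. The counts are invented (9 of 30 exposed versus 3 of 30 controls, then the same proportions with ten times as many subjects), and the log-based normal approximation shown is one common method rather than the only one; the function name is hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def relative_risk_ci(cases_exp, n_exp, cases_ctl, n_ctl, level=0.95):
    """Relative risk with a confidence interval computed on the log scale."""
    rr = (cases_exp / n_exp) / (cases_ctl / n_ctl)
    se_log = sqrt(1 / cases_exp - 1 / n_exp + 1 / cases_ctl - 1 / n_ctl)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return rr, exp(log(rr) - z * se_log), exp(log(rr) + z * se_log)

# Hypothetical small study, then the same risks with ten times the subjects.
rr_small, low_small, high_small = relative_risk_ci(9, 30, 3, 30)
rr_large, low_large, high_large = relative_risk_ci(90, 300, 30, 300)
print(f"small: RR {rr_small:.1f} ({low_small:.2f} to {high_small:.2f})")
print(f"large: RR {rr_large:.1f} ({low_large:.2f} to {high_large:.2f})")
```

In this sketch the small study gives a wide interval that runs from just below 1.00 to roughly 10, while the larger study, with the same three-fold risk, gives a much narrower interval that lies entirely above 1.00.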
The final statistical concept is that of "Power." Since I have recently commented on this subject in a journal review, let me present it only briefly here.
Power is important in negative studies. That is, if a paper is published suggesting that there is no association between one thing and another, virtually all of the top-tier medical journals now require that the authors have performed a Power Analysis. Power is the probability that a study will detect an association of a given size if one is actually present. The chance of missing such an association is known as a beta (Type II) error, so power is equal to 1 minus beta. In some ways it is the counterpart of the P value, which guards against the opposite error of declaring an association that is not there. Not surprisingly, studies with small sample sizes are much more likely to miss an association than studies with larger sample sizes. Most good investigators will now perform a Power Analysis before beginning their study to determine the number of subjects they need in order to have a reasonable chance of detecting a clinically significant difference. Unfortunately, many of the "second tier" and "throw-away" journals publish negative results without requiring a proper Power Analysis. Such articles can lead to a false assumption that a given procedure or treatment has no benefit when in fact it does.
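A simple Power Analysis can be sketched in a few lines. The example assumes a hypothetical study comparing event rates of 30% and 10% in two equal groups, and it uses a textbook normal approximation; real studies often use more refined formulas, and the function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

_norm = NormalDist()

def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power to detect p1 versus p2 with a two-sided test."""
    se = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    return _norm.cdf(abs(p1 - p2) / se - z_alpha)

def n_per_group_for_power(p1, p2, power=0.80, alpha=0.05):
    """Approximate subjects needed per group to reach the requested power."""
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    z_beta = _norm.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Hypothetical event rates: 30% in one group, 10% in the other.
power_30 = power_two_proportions(0.30, 0.10, 30)
power_100 = power_two_proportions(0.30, 0.10, 100)
n_needed = n_per_group_for_power(0.30, 0.10)
print(f"power with 30 per group:  {power_30:.2f}")
print(f"power with 100 per group: {power_100:.2f}")
print(f"subjects per group for 80% power: {n_needed}")
```

Under these assumptions, a study with 30 subjects per group has only about an even chance of detecting a true three-fold difference, so a "negative" result from such a study says very little.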
Finally, it is important for clinicians to remember that no single clinical study (indeed, no statistical analysis of any type) can by itself prove a "cause and effect" association. Biostatistics are used to help us determine whether a procedure might cure a given condition, a drug might improve health, or a virus might be associated with a given illness. However, no statistic can tell us whether a given treatment or drug will help an individual patient. It is up to us to determine clinical significance.
1. O’Brien PC, Shampo MA. Mayo Clin Proc 1981;56:274-276.
2. O’Brien PC, Shampo MA. Mayo Clin Proc 1981;56:324-326.
3. Lilienfeld AM, Lilienfeld DE. Foundations of Epidemiology. 2nd ed. New York, NY: Oxford University Press; 1980.
4. Colton T. Statistics in Medicine. Boston, MA: Little, Brown and Company; 1974.
A researcher has performed a study on the effectiveness of a new drug that is used to prevent premature labor. The article reports that the study did not show a statistically significant benefit of the drug since the P value was 0.07. Which of the following statements regarding this study is most correct?
a. The new drug is not clinically useful for the prevention of premature labor.
b. If the investigators had used more study subjects, the results would have been significant.
c. The drug might be useful for preventing premature labor and should be studied further.
d. Only 7% of the study subjects showed a response to the drug.