From Data to Conclusions: How Research is Interpreted

*By Howell Sasser, PhD**, Scientific Review Coordinator, Manila Consulting Group; Adjunct Member of the Faculty, New York Medical College. Dr. Sasser reports no financial relationships relevant to this field of study.*

This is the final article in a three-part series about the design and conduct of clinical research. The first installment discussed how research begins with the formulation of research questions, and the second reviewed the strengths and limitations of some common study designs. This article will discuss a few key issues in the analysis and interpretation of research findings.

A basic assumption of the research process is that measuring a characteristic in many people will produce a more stable understanding of nature than will measuring that characteristic in only one person. This presents the problem of how best to report information collected from tens or hundreds or thousands of people. Reporting all individual values is impractical and uninformative, so summary measures are used.

The ability of a single value to stand in for many depends on how the characteristic it represents is conceptualized. Some things are less easily summarized than others. Opinions and thought processes (e.g. "How does someone choose a chiropractor?" "How do one's spiritual beliefs affect one's health practices?") can be difficult to boil down into one representative answer **** or even into a small number of answers. Research involving questions of this sort is commonly described as qualitative, and has its own set of analytical techniques which are beyond the scope of this article. More common in biomedicine are questions that produce counts (e.g. "How many nutritional supplements does the average person take each day?" "On a scale of one to ten, what are patients' pain levels before and after an acupuncture session?" "How many additional pain-free years can a recreational runner with osteoarthritis expect to gain by taking glucosamine?"). Research involving these questions is commonly described as quantitative, and uses numerical techniques and the "language" of statistical testing.

Within quantitative research, there are further distinctions among characteristics in the way they are conceived, analyzed, and reported. Some characteristics fall naturally into categories gender and race, for instance. Others can be grouped in a way that is analytically useful. For example, in their survey of the prevalence of CAM use and health status, Nguyen and colleagues collect and report participants' health status as Excellent, Very Good, Good, Fair, and Poor. They report age in ranges (18-29, 30-39, 40-49, 50-64, 65+) that are somewhat arbitrary, but which also have some analytical rationale.1 In other cases, these ranges may be determined by the data themselves. A common approach is to arrange the observed values from lowest to highest, and then divide them evenly into fourths (quartiles) or fifths (quintiles). This method may simplify the analytical process, but it can complicate the comparison of a study's results with other published findings. Whatever method is used to create them, these *categorical variables* are usually reported as counts and percents.

Other characteristics are not as easily grouped or are more appropriately analyzed without the imposition of (perhaps artificial) categories. In such cases, it is necessary to find a single value that is representative of the complete possible range. If the values in a population are normally distributed (that is, follow a bell-shaped curve), the mean or average will be the value that is both at the middle of the range and most commonly observed. The standard deviation is usually reported with the mean to give a sense of how closely grouped the other values are around the mean. Variables of this type are often referred to as being continuous. When the assumptions necessary for this approach are not met, categorization (as described above) may be used, or the median, the value at the middle of the range, may be reported instead. For example, in their report of a randomized trial of Echinacea for cold treatment, Barrett and colleagues found considerable skewness (uneven distribution) of cold severity ratings. To address this, they reported and tested median differences by treatment group as well as mean differences.^{2}

Continuous and categorical variables each have a number of associated statistical tests that are used in assessing the significance of differences in observed values by categories or levels of some other variable. It is important to note that many tests, especially those used with continuous variables, have underlying assumptions that must be satisfied for the results they produce to be interpreted correctly these are commonly called parametric tests. Often when one or more of these assumptions cannot be met, alternative tests are available these are commonly called non-parametric tests. It can be difficult (and at times impossible) for the reader of a journal article to know whether all necessary statistical steps have been done as they should be, so it is all the more important for researchers to have good statistical advice in the design and analysis of their work. It is also important to note that the "best" test for any analytical situation may not be the most powerful or sophisticated one, but rather is the one that fits the available data and the assumptions that can or cannot be made about them. (*See Table.*)

Statistical tests of comparisons in study data produce results that are not, by themselves, definitive. To provide the desired yes/no answer as to statistical significance, they must be judged according to a standard that is both uniform and decided upon in advance. Two common ways of doing this are with the use of p-values and confidence intervals.

The interpretation of statistical test results can be seen as the setting up of two opposing propositions:

- The test result we see is due to chance (the default option);
- The test result we see is not due to chance (i.e., the difference we observe in some characteristic across the levels of another characteristic really is as it appears).

Since either of these propositions may be true, our goal is not to show one or the other to be absolutely certain, but simply sufficiently more likely than the other to remove reasonable doubt. The definition of "sufficiently more likely" is straight-forward and mathematical for example, 1 chance in 1000, or 1 chance in 100, or 1 chance in 20. The last option, expressed as a decimal (0.05) is by far the most common standard of improbability. In much of published science, any test result with a probability of less than 0.05 (i.e., 1 in 20) is deemed to be sufficiently extreme to justify rejecting the default assumption of a chance event in favor of the alternate view that the difference in the associated characteristic is really as observed. This probability estimate is referred to as the *P* value.

A second method begins from the observation that most characteristics can take on a value from a finite number of possibilities. Once that range is defined, it becomes predictive. For example, if we measured the heights of 1000 men, and calculated the minimum and maximum values, we could be fairly certain that the height of any additional man would also fall within that range. For characteristics of interest in research, and also for the results of statistical tests, the range of values called the confidence interval can be calculated if we have some basis for estimating the likely variability in the thing being measured. The overlap of confidence intervals, and the presence within them of values defined as indicating "no difference," is interpreted in a way that is parallel to *P* values.

Discussion of statistical significance leads logically to the topic of study power. Since we usually cannot capture an entire population (i.e., every woman with breast cancer or every person who takes garlic supplements), we use a sample. The larger the sample, the greater the chance that we will measure accurately the characteristics of the population is represents. However, practical concerns compel us to keep study samples within limits that are manageable and affordable. Calculations for study power and sample size tell us how large a sample is needed to meet two needs: 1) to ensure that if our study hypothesis is in fact false, we will not mistakenly conclude that it is correct (also called avoidance of "Type I" error), and 2) to ensure that if our study hypothesis is in fact correct, we will show it to be so (also called avoidance of "Type II" error). It is common practice to choose a sample size that allows no more than a 5% chance of a Type I error (this is the origin of near-universal < 0.05 threshold for *P* values), and no more than a 20% chance of a Type II error. The methods and equations associated with estimating power and sample size are sophisticated enough to require expert advice.

Although these details may seem arcane, the underlying points that they support can and should be expressed simply. Statistics and statistical tests are based in and flow from the actual data collected in the study. Because they are interpreted in much the same way by everyone, they permit the comparison of findings across studies. Because they play a role in what data will be collected, how those data will be collected, and how the questions that drive a study are answered, analytical and statistical details are important elements of study design from the very beginning.

This discussion of quantitative methods is important for an understanding of the later phases of a research project, but it should not distract from the goal it is intended to serve the production of clinically and scientifically relevant information. Statistical significance should not be confused with clinical significance. It is possible to produce results that are statistically significant but practically irrelevant this sometimes happens in very large studies. It is equally possible to produce findings that are statistically non-significant but of great clinical importance this sometimes happens in very small studies. The object of the interpretation of study findings should always be to bring things full-circle, back to the questions that prompted the research, and then to serve as the basis for the next set of questions, as well as for the improvement of clinical care.

*References *

1. Nguyen LT, et al. Use of complementary and alternative medicine and self-rated health status: Results from a national survey. *J Gen Intern Med *2011;26:399-404.

2. Barrett B, et al. Echinacea for treating the common cold: A randomized trial. *Ann Intern Med *2010;153:769-777.