Basic Statistics for Clinicians

HYPOTHESIS TESTING

The Null hypothesis

The Null Hypothesis: “The true difference in the effects of the experimental and control treatments on the outcome of interest is zero”

The result of a single experiment will almost always show a difference between experimental and control groups. Is the difference due to chance, or is it large enough to reject the null hypothesis and conclude there is a true difference in treatment effects?

The p value

Statistical tests yield a p value: the probability that the experiment would show a difference as great as or greater than that observed if the null hypothesis were true. By statistical convention, the boundary or threshold that separates the plausible from the implausible is five times in 100 (p=0.05).
Statistical significance means that a result is “sufficiently unlikely to be due to chance that we are ready to reject the null hypothesis”.

However, the smaller the sample size, the greater the chance of erroneously concluding that the experimental treatment does not differ from the control - in statistical terms, the power of the test may be inadequate.

Type I Error

A Type I error is to conclude that there is a difference in outcomes between treatment and control when no such difference exists. The probability of making such an error is designated alpha.

Type II Error

This type of error occurs when we erroneously fail to reject the null hypothesis, thus falsely concluding an effective treatment is useless. The larger the sample, the lower the risk of Type II error and the greater the power.
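The link between sample size and power can be made concrete with a rough calculation. The sketch below is an illustration only, not a method taken from the source series: it assumes a two-sided z-test comparing two group means, equal group sizes and a known common standard deviation, and the function name approx_power and the numbers passed to it are hypothetical.

```python
from statistics import NormalDist

def approx_power(true_diff, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two group means.

    Assumptions (not from the source text): equal group sizes, a known
    common standard deviation, and the negligible chance of rejecting
    in the wrong direction is ignored.
    """
    z = NormalDist()
    se = sd * (2 / n_per_group) ** 0.5   # standard error of the difference
    z_crit = z.inv_cdf(1 - alpha / 2)    # about 1.96 when alpha = 0.05
    return z.cdf(abs(true_diff) / se - z_crit)

# Hypothetical true difference of 5 units with a standard deviation of 10.
for n in (10, 50, 200):
    power = approx_power(true_diff=5, sd=10, n_per_group=n)
    print(f"n per group = {n}: power = {power:.2f}, Type II risk = {1 - power:.2f}")
```

With the true difference held fixed, the printed power rises and the risk of a Type II error falls as the number of patients per group increases.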

The Chi2 test

The chi-squared (Chi2) test is a statistical test that compares proportions.
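As an illustration, the comparison of two proportions can be sketched for a 2x2 table. The code assumes the Pearson form of the test without a continuity correction; the function name chi2_2x2 and the trial counts are hypothetical. For one degree of freedom the upper-tail p value can be obtained from the complementary error function, so no statistical library is needed.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table.

    a, b = events / non-events in the experimental group
    c, d = events / non-events in the control group
    Returns (chi-squared statistic, p value) for 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))   # upper tail of chi-squared with 1 df
    return chi2, p

# Hypothetical trial: 15 of 100 treated patients and 30 of 100 controls have the event.
chi2, p = chi2_2x2(15, 85, 30, 70)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

In this made-up example the p value falls below 0.05, so by the usual convention the difference in proportions would be called statistically significant.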

Student's t-test

A statistical test for continuous variables. P values for Student's t-test and others like it are obtained from standard tables.
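A minimal sketch of the calculation behind the test, assuming the equal-variance (pooled) form for two independent groups. The function name pooled_t and the measurements are hypothetical. The code stops at the t statistic and its degrees of freedom, which are the two numbers one takes to a standard table.

```python
from statistics import mean, variance

def pooled_t(sample_a, sample_b):
    """Student's t statistic (equal-variance form) and its degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    df = na + nb - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_var = ((na - 1) * variance(sample_a) + (nb - 1) * variance(sample_b)) / df
    se = (pooled_var * (1 / na + 1 / nb)) ** 0.5   # standard error of the difference in means
    return (mean(sample_a) - mean(sample_b)) / se, df

# Hypothetical continuous measurements in two groups of six patients.
treated = [128, 131, 125, 122, 130, 127]
control = [135, 138, 129, 140, 133, 137]
t, df = pooled_t(treated, control)
print(f"t = {t:.2f} on {df} degrees of freedom")
```

For 10 degrees of freedom the tabulated two-sided critical value at p=0.05 is 2.228, so a t statistic larger than that in absolute size would be called statistically significant.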

Baseline differences

If factors that determine outcome are unequally distributed between groups despite random allocation, adjustments can be made (for several variables at once) to yield a p value that can be interpreted in the regular way.

Multiple tests

How likely is it that, in six independent tests on two similar groups, at least one test would cross the 0.05 threshold by chance alone? The probability that we would not cross the 0.05 threshold in testing a single hypothesis is 0.95; in testing two hypotheses, the probability that neither would cross the threshold is 0.95 multiplied by 0.95; and so on, so that for six hypotheses the probability is 0.95 to the sixth power, which is 0.735. Therefore, when six independent hypotheses are tested, the probability that at least one result is statistically significant by chance is 1 - 0.735 = 0.265, or approximately 1 in 4, not 1 in 20. If we wish to keep the overall boundary for statistical significance at 0.05, we must divide the threshold p value by six, so that each of the six tests uses a boundary value of p=0.008. That is, we would reject the null hypothesis that the groups do not differ on any of the characteristics only if at least one of the differences was significant at p<0.008.
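The arithmetic above can be checked in a few lines. This is a sketch that assumes the tests are independent, as the text does; the function names are hypothetical.

```python
def familywise_error(alpha, n_tests):
    """Probability that at least one of n independent tests crosses alpha by chance."""
    return 1 - (1 - alpha) ** n_tests

def divided_threshold(alpha, n_tests):
    """Per-test threshold obtained by dividing the overall alpha by the number of tests."""
    return alpha / n_tests

n = 6
print(round(familywise_error(0.05, n), 3))                        # 0.265
print(round(divided_threshold(0.05, n), 4))                       # 0.0083
print(round(familywise_error(divided_threshold(0.05, n), n), 3))  # 0.049
```

The last line shows that, with the divided threshold, the chance of at least one spurious significant result across the six tests is held just under 0.05.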

There are several statistical strategies for dealing with multiple hypothesis testing of the same data. Apart from dividing the p value by the number of tests, we can specify, before the study is undertaken, a single primary outcome on which the main conclusions will hinge; a third approach is to derive a global test statistic that combines the multiple outcomes in a single measure.

Independent

Independent means that the result of a test of one hypothesis does not in any way depend on the results of the tests of any of the other hypotheses.

CONFIDENCE INTERVALS

Whereas, in hypothesis testing, study results lead the reader to reject or accept a null hypothesis, in estimation the reader can assess whether a result is strong or weak, definitive or not. A confidence interval is calculated from the observed result and the size of the sample. It provides a range of values within which the true value would lie 95% or 90% of the time, depending on the level of confidence chosen.

It also provides a way of determining whether the sample is large enough to make the trial definitive. If the lower boundary of a confidence interval is above the threshold considered clinically significant, then the trial is positive and definitive; if the lower boundary is somewhat below the threshold, the trial is positive, but studies with larger samples are needed. Similarly, if the upper boundary of a confidence interval is below the threshold considered significant, the trial is negative and definitive. However, a negative result with a confidence interval that crosses the threshold means that trials with larger samples are needed to make a definitive determination of clinical importance.

In a positive trial - one that establishes that the effect of treatment is greater than zero - look at the lower boundary of the confidence interval to determine whether the size of the sample is adequate. The lower boundary represents the smallest plausible treatment effect compatible with the data. If it is greater than the smallest difference that is clinically important, the sample size is adequate and the trial definitive. However, if it is less than this smallest important difference, the trial is not definitive and further trials are required. In a negative trial - the results of which do not exclude the possibility that treatment has no effect - look at the upper boundary of the confidence interval to determine whether the size of the sample is adequate. If the upper boundary - the largest treatment effect compatible with the data - is less than the smallest difference that is clinically important, the size is adequate, and the trial definitively negative. If the upper boundary exceeds the smallest difference considered important, there may be an important positive treatment effect, the trial is not definitive, and further trials are required.
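The decision rules in the paragraph above can be written as a short function. This is a sketch under stated assumptions: the treatment effect is expressed so that larger values mean more benefit and zero means no effect, a boundary exactly equal to the threshold is treated as not definitive (the text does not say), and the function name and example numbers are hypothetical.

```python
def interpret_trial(ci_lower, ci_upper, smallest_important_effect):
    """Classify a trial from its confidence interval for the treatment effect."""
    if ci_lower > 0:
        # Positive trial: the interval excludes "no effect".
        if ci_lower > smallest_important_effect:
            return "positive, definitive"
        return "positive, not definitive"
    # Negative trial: the interval does not exclude "no effect".
    if ci_upper < smallest_important_effect:
        return "negative, definitive"
    return "negative, not definitive"

# Hypothetical absolute risk reductions, with 0.10 as the smallest important effect.
print(interpret_trial(0.12, 0.30, 0.10))    # positive, definitive
print(interpret_trial(0.02, 0.30, 0.10))    # positive, not definitive
print(interpret_trial(-0.05, 0.04, 0.10))   # negative, definitive
print(interpret_trial(-0.05, 0.20, 0.10))   # negative, not definitive
```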

The point estimate of probability is the value we have obtained (as in a coin toss) - but what is the plausible range within which the true value may lie? Hence the confidence interval. The coin toss example illustrates how the confidence interval tells us whether the sample is large enough - 100 tosses will give a confidence interval within 10% of the point estimate, but 1000 are needed for a confidence interval within 3% of the point estimate. To obtain greater precision, you need more measurements - in clinical research, enrol more subjects, or increase the number of measurements in each enrolled subject.
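The coin toss figures can be reproduced with the usual normal approximation for a proportion. This is a sketch: the approximation is an assumption that works well for a fair coin and samples of this size, and the function name proportion_ci is hypothetical.

```python
from statistics import NormalDist

def proportion_ci(successes, n, level=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(0.5 + level / 2)          # about 1.96 for a 95% interval
    half_width = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - half_width, p_hat + half_width

# Half of the tosses come up heads in each hypothetical experiment.
for n in (100, 1000):
    low, high = proportion_ci(n // 2, n)
    print(f"{n} tosses: 95% CI {low:.3f} to {high:.3f}")
```

The interval is roughly 0.40 to 0.60 for 100 tosses and roughly 0.47 to 0.53 for 1000, which matches the 10% and 3% margins quoted above.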

MEASURES OF ASSOCIATION

Relative risk or risk ratio (RR)

The relative risk is the risk of the event in patients who receive the experimental treatment, expressed as a proportion or percentage of the risk in patients who do not receive it. (Risk in the experimental group divided by risk in the control group.)

Relative risk reduction (RRR)

The relative risk reduction is an estimate of the percentage of the baseline risk (the risk of an event in the control patients) removed as a result of therapy. The simplest way to derive it is to subtract the RR from 1.

Absolute risk reduction (ARR)

The absolute risk reduction (ARR), also called the attributable risk reduction or the risk difference, is the difference in the risk of an event between patients who have undergone one therapy and those who have undergone another. (It is a simple subtraction.)

Odds ratio (OR)

The odds ratio, which is the measure of choice in case-control studies, gives the ratio of odds of an event in the experimental group to (divided by) those in the control group. The OR has certain optimal statistical properties that make it the fundamental measure of association in many types of studies. These statistical advantages may be particularly important when data from several studies are combined, as they are in a meta-analysis. Among such advantages, the comparison of risk represented by the OR does not depend on whether the investigator chose to determine the risk of an event occurring or not occurring. This is not true for RR. In some situations the OR and the RR will be close - e.g. in case-control studies of a rare disease.

Number needed to treat (NNT)

The number needed to treat tells the clinician how many patients need to be treated to prevent one event. One can arrive at this number by taking the reciprocal of the ARR (1/ARR). The NNT is directly related to the proportion of patients in the control group who suffer an adverse event. In general the NNT changes inversely in relation to the baseline risk - if the number of adverse events doubles, we need only treat half as many patients to prevent the same number of adverse events. In addition to calculating the NNT, one could also consider resources expended to prevent an event.

The OR and the RR provide limited information in reporting the results of prospective trials because they do not reflect changes in the baseline risk. The ARR and the NNT reflect both the baseline risk and the RRR. If the timing of events is important - to determine whether treatment extends life, for example - survival curves are used to show when events occur over time.
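The measures defined in this section can all be computed from the event counts in the two groups. The sketch below uses hypothetical counts and a hypothetical function name; it assumes the control risk is greater than the treated risk, so that the ARR is positive and the NNT is defined.

```python
def association_measures(events_treated, n_treated, events_control, n_control):
    """RR, RRR, ARR, NNT and OR from event counts in a two-group trial."""
    risk_t = events_treated / n_treated
    risk_c = events_control / n_control
    rr = risk_t / risk_c                    # relative risk
    arr = risk_c - risk_t                   # absolute risk reduction
    odds_t = risk_t / (1 - risk_t)
    odds_c = risk_c / (1 - risk_c)
    return {
        "RR": rr,
        "RRR": 1 - rr,                      # relative risk reduction
        "ARR": arr,
        "NNT": 1 / arr,                     # number needed to treat
        "OR": odds_t / odds_c,              # odds ratio
    }

# Hypothetical trials with the same RR but a doubled baseline (control) risk.
for counts in ((15, 100, 30, 100), (30, 100, 60, 100)):
    result = association_measures(*counts)
    print({name: round(value, 2) for name, value in result.items()})
```

Both examples have an RR of 0.5, but the ARR doubles and the NNT halves when the baseline risk doubles, which is why the RR alone gives limited information.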

Confidence intervals can be calculated for each of these measures of association.

Prospective randomized controlled trials

In such trials we start with an experimental group of patients who are subject to an intervention and a control group of patients who are not. The investigators follow the patients over time and record the incidence of events.

Prospective cohort studies

The process is similar, but patients are sampled according to whether they were or were not exposed to the treatment or risk factor, rather than being assigned to an intervention.

Case-control studies

Participants are sampled on the basis of whether they have experienced an event. Patients start the study with/without the event rather than with/without the exposure or intervention. Patients with the adverse outcome are compared with controls who have not suffered the outcome - a number chosen by the investigators. Therefore the RR is not available, because we do not know the population at risk.

The only measure of association that makes sense in a case-control study is the OR.

CORRELATION AND REGRESSION

Correlation and regression help us to understand the relation between variables and to predict patients' status in regard to a particular variable of interest. Correlation examines the strength of the relation between two variables or phenomena, neither of which is considered the variable one is trying to predict (the target variable). Regression analysis examines the ability of one or more factors, called independent variables, to predict a patient's status in regard to a target or dependent variable. (Regression is the name given to the statistical techniques for making a prediction or causal inference.) Independent and dependent variables may be continuous (taking a wide range of values) or binary (dichotomous, yielding yes-or-no results). Regression models can be used to construct clinical prediction rules that help to guide clinical decisions. In considering regression and correlation, clinicians should pay more attention to the magnitude of the correlation or the predictive power of the regression than to whether the relation is statistically significant.

Correlation

A relation is strong when patients who obtain high, intermediate or low scores on the first variable obtain correspondingly high, intermediate or low scores on the second variable. The strength of the relation can be summarized in a single number, the correlation coefficient (r). This can range from -1.0 (the strongest possible negative relation - the patient with the highest score on one test has the lowest on the other) to 1.0, the strongest possible positive relation. A correlation coefficient of 0 denotes no relation at all - patients with a high score on one test have the same range of scores on the second as those with a low score on the first test. The p value for r is determined from the null hypothesis that the true correlation between the two measures is 0. The smaller the p value, the less likely that chance explains the relation between the two measures.
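A minimal sketch of how r is computed, assuming the Pearson product-moment form of the coefficient. The function name and the test scores are hypothetical.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists of scores."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

first_test = [1, 2, 3, 4, 5]
print(pearson_r(first_test, [2, 4, 6, 8, 10]))   # scores rise together: r is 1
print(pearson_r(first_test, [10, 8, 6, 4, 2]))   # highest on one, lowest on the other: r is -1
print(pearson_r(first_test, [3, 1, 4, 1, 3]))    # no consistent pattern: r is 0
```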

Regression

As clinicians we are interested in prediction: which patients will get a disease, and which will not. Regression analysis is useful in addressing these sorts of issues. When regression analysis assumes a straight line fit between the independent and dependent variable, and the dependent variable is continuous, we refer to the analysis as linear regression. When the dependent variable is dichotomous the term logistic regression may be used to refer to such models because they are based on logarithmic equations.
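Both kinds of model can be sketched briefly. The first function fits a straight line by ordinary least squares with a single independent variable. The second shows the logistic function, which converts a straight-line predictor on the log-odds scale into a probability between 0 and 1; fitting a logistic model to data requires an iterative procedure that is not shown. The function names, data and coefficients are hypothetical.

```python
import math
from statistics import mean

def fit_line(xs, ys):
    """Least-squares slope and intercept for one independent variable."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def logistic(linear_predictor):
    """Convert a log-odds value into a probability."""
    return 1 / (1 + math.exp(-linear_predictor))

# Linear regression: a continuous dependent variable predicted from one factor.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)                      # 2.0 1.0
print(intercept + slope * 5)                 # predicted value at x = 5: 11.0

# Logistic regression: made-up coefficients give a predicted probability of disease.
print(round(logistic(-2.0 + 0.5 * 4), 2))    # log-odds of 0 corresponds to probability 0.5
```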

Acknowledgement

This article is based on the series by G. Guyatt, R. Jaeschke, N. Heddle, D. Cook, H. Shannon and S. Walter that was published in the Canadian Medical Association Journal 1995; 152: 27-32; 152: 169-173; 152: 351-357; 152: 497-504