Several weeks ago I wrote the first in a brief series of posts discussing the different types of evidence used in medicine. In that post I discussed the role of correlation in determining cause and effect.

In this post I will discuss the basic features of an experimental study, which can serve as a checklist in evaluating the quality of a clinical trial.

Medical studies can be divided into two main categories – pre-clinical or basic science studies, and clinical studies. Basic science studies involve looking at how parts of the biological system work and how they can be manipulated. They typically involve so-called in vitro studies (literally in glass) – using test tubes, petri dishes, genetic sequencers, etc. Or they can involve animal studies.

Clinical trials involve people. They are further divided into two main categories – observational studies and experimental studies. I will be discussing experimental studies in this post – studies in which an intervention is done to study subjects. Observational studies, on the other hand, look at what is happening or what has happened in the world, but do not involve any intervention.

Experimental Studies

The primary advantage of experimental studies is that they allow for the direct control of variables – in the hopes of isolating the variable of interest. Results are therefore capable of being highly reliable, although good clinical experiments are difficult to design and execute. When assessing a clinical trial, here are the features to examine.

Prospective vs Retrospective

A prospective trial is one in which the treatment and the outcomes are determined prior to any intervention being done. Experimental trials are almost by definition prospective. A retrospective trial is one in which the data is gathered after the fact – taking patient records, for example, and looking at treatments and outcomes.

In a retrospective study you can try to account for variables, but you cannot control for them. It is therefore much more likely that there are confounding factors and the results are not as reliable. Also, retrospective studies can be biased by the way information is obtained – there can be a bias in the way patients are identified, for example.

Prospective trials are therefore considered superior to retrospective trials, which are at best preliminary in their conclusions.


Not all prospective trials are placebo-controlled, however. A non-controlled trial might identify potential subjects, give them all a treatment, and then see how they do. Such open-label single arm trials cannot control for placebo effects or experimenter biases, and again results should be considered preliminary.

Open or uncontrolled trials are not useless, however. The outcome of subjects in such trials can be compared to historical controls, and if a significant result is apparent (along with safety), it can be used to justify a larger and more rigorous trial.

Controlled trials have one or more comparison groups in the trial itself – different groups of subjects receive different treatments or no treatment. All subjects can be followed in the same manner. Control groups allow the experimenter to make sure that all the subjects have the same disease or symptoms, that they receive known treatments, and many variables (such as other treatments they may be receiving, severity at inclusion, age, sex, race, etc.) can be accounted for.

Controlling for variables

With controlled trials the experimenter can start to control for variables. If the question is – does treatment A improve outcome in disease X, a controlled prospective trial can attempt to isolate treatment A from other factors that may affect outcome.

One method for controlling variables is stratification – the study protocol can place subjects in different treatment groups so that the groups end up with the same proportion of different sexes, ages, races, and other known variables that may be pertinent. Stratification can control for known or obvious confounding factors.

But of course there can always be unknown confounding factors. The only way to deal with these is through randomization and large study size. If a large number of subjects are randomly assigned (once stratified for age, sex, etc.) into the different treatment groups, then any unknown variables should average out. Of course, this requires sufficient numbers – small studies are always suspect because the groups may be significantly different by chance alone.
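As a rough illustration of stratifying first and then randomizing, here is a minimal Python sketch. The subject records, the `stratified_randomize` helper, and the choice of sex and age band as strata are all hypothetical, made up for this example; real trials use more careful schemes (such as permuted blocks) managed by a statistician.

```python
import random
from collections import defaultdict

def stratified_randomize(subjects, strata_key, arms=("treatment", "control"), seed=0):
    """Shuffle subjects within each stratum, then alternate arm assignment.

    Hypothetical helper: within every stratum the arms end up as equal in
    size as possible, while chance alone decides who lands in which arm.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for subject in subjects:
        by_stratum[strata_key(subject)].append(subject)

    assignment = {}
    for members in by_stratum.values():
        rng.shuffle(members)  # random order within the stratum
        for i, subject in enumerate(members):
            assignment[subject["id"]] = arms[i % len(arms)]
    return assignment

# Made-up subjects: 40 people spread evenly over four sex/age strata.
subjects = [
    {"id": i,
     "sex": "F" if i % 2 else "M",
     "age_band": "65+" if i % 4 < 2 else "<65"}
    for i in range(40)
]
assignment = stratified_randomize(subjects, lambda s: (s["sex"], s["age_band"]))
```

Each stratum is split evenly between the arms, so the known variables are balanced by design, while the shuffle is what lets unknown variables average out as the numbers grow.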

Randomization is important because when patients select their own treatments this opens the door for selection bias. For example, sicker patients may opt for more aggressive therapy. They will do worse because they were sicker to begin with, making the more aggressive therapy look less effective.
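The selection-bias scenario above can be sketched with a small simulation. Everything here is an assumption chosen for illustration: the severity scale, the size of the benefit, and the rule that sicker patients always choose aggressive therapy. It is not data from any real trial.

```python
import random
from statistics import mean

def arm_difference(randomized, n=20000, benefit=0.1, seed=3):
    """Return mean outcome (aggressive) minus mean outcome (standard).

    Assumed model: outcome = -severity + benefit (if aggressive) + noise,
    so the aggressive therapy truly helps by `benefit`.
    """
    rng = random.Random(seed)
    aggressive, standard = [], []
    for _ in range(n):
        severity = rng.random()  # 0 = mild, 1 = very sick
        if randomized:
            gets_aggressive = rng.random() < 0.5   # coin flip
        else:
            gets_aggressive = severity > 0.5       # sicker patients self-select
        outcome = -severity + (benefit if gets_aggressive else 0.0) + rng.gauss(0, 0.1)
        (aggressive if gets_aggressive else standard).append(outcome)
    return mean(aggressive) - mean(standard)

self_selected = arm_difference(randomized=False)
randomized = arm_difference(randomized=True)
print(f"self-selected: {self_selected:+.2f}   randomized: {randomized:+.2f}")
```

Under these assumptions the self-selected comparison makes the helpful therapy look harmful, while the randomized comparison recovers a difference close to the true benefit.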


A randomized prospective trial can control for many variables, but the only way to control for placebo effects and the bias of the experimenters is with blinding – meaning that participants don’t know who is getting the real treatment and who is getting a different treatment or a placebo.

A single-blind study is one in which subjects do not know which treatment they are getting. A double-blind study is one in which the experimenter does not know either – until the study is done and the “code is broken.”

When subjects are blinded, placebo effects should be the same. It is often difficult, however, to fully blind subjects. Medications may have obvious side effects, and subjects who experience the side effects know they are getting active medication.

Physical interventions, like acupuncture, surgery, massage, or physical therapy, are difficult to impossible to blind. A person knows if they have been massaged or not. For these studies creative blinding techniques may need to be used. Or, “sham” procedures can be used for placebos.

Studies may also assess how successful the blinding was – by asking subjects if they think they received the placebo or the treatment.

Experimenters also need to be blinded to eliminate placebo and biasing effects. This is easy for drug trials, but may be impossible for physical intervention trials. However, a study can be partially double-blinded if there is a blinded evaluator – an experimenter whose only involvement with the study is to assess the subjects, while carefully avoiding any information that would clue them in as to which treatment arm each subject was in.

But the best studies are ones in which everyone involved is completely blinded until the results are completely in.

Outcome measures

Deciding how to determine if an intervention “works” is not always trivial. Outcome measures need to be a good and reliable marker of the disease or syndrome you are following. For example, in a diabetes study, do you follow HbA1c, random glucose checks, glucose tolerance tests, end-organ damage, need for medication, or some other biological marker?

In addition to being a good marker for what you are studying, the outcome should be meaningful. Do we care if a cholesterol lowering drug lowers total cholesterol, or if it prevents heart attacks and strokes? And if it prevents heart events, does it prolong survival (or just reduce angina without affecting survival)?

Outcomes also need to be free of confounding. For example, early stroke trials looked at stroke incidence, which may seem reasonable. However, if more subjects on a treatment died of heart attacks, they would not be around to have a stroke, so the treatment reduces stroke but only by allowing more heart attacks. So stroke-free survival is a better outcome to follow.

Outcome measures also vary on how objective or subjective they are. Just asking patients how they feel is not a very reliable outcome measure. You can pseudo-quantify this by asking them to put a number on their pain or other symptoms, but it is still a subjective report. Measuring the volume of lesions in the brain, however, is an objective outcome measure, and is therefore more reliable.

Many studies will follow several outcomes – some subjective but important, and others objective and quantifiable, even if they are an indirect marker rather than a direct outcome we care about.

Statistical analysis

I won’t go into statistics in any detail, as that is a highly technical area and any reasonable treatment would be much longer than the rest of this post. Here even medical professionals rely upon statistical experts to make sure we get it right.

But it is good to understand the basics (as long as you don’t rely upon basic knowledge – then it is easy to be fooled by fancy statistical tricks).

The most basic concept of clinical trials is statistical significance – is there an effect or correlation that is probably greater than chance? Most studies rely upon the P-value, which is a measure of the chance of the result (or a more extreme one) occurring if the null hypothesis (no effect) is correct. A P-value of 0.05 means (roughly) that there is a 5% (or 1 in 20) probability of seeing such an outcome from chance alone when there is no real effect. A P-value of 0.05 is commonly used as a cutoff of statistical significance, but it is important to realize that with this cutoff 1 in 20 studies of worthless treatments will appear positive due to chance alone. Lower P-values, such as 0.01, are more significant.
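To make the "1 in 20" point concrete, here is a small simulation sketch. It assumes a deliberately simple setup that is not from any real study: both arms are drawn from the same normal distribution (so the treatment is worthless), and the P-value comes from a two-sided z-test with a known standard deviation of 1.

```python
import random
from statistics import NormalDist, mean

def two_sided_p(group_a, group_b, sigma=1.0):
    """Two-sided z-test P-value for a difference in means (sigma assumed known)."""
    se = sigma * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (mean(group_a) - mean(group_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(n_trials=2000, n_per_arm=50, alpha=0.05, seed=1):
    """Fraction of simulated trials of a worthless treatment with P < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        treated = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        control = [rng.gauss(0, 1) for _ in range(n_per_arm)]  # same distribution
        if two_sided_p(treated, control) < alpha:
            hits += 1
    return hits / n_trials

rate = false_positive_rate()
print(f"share of null trials reaching P < 0.05: {rate:.3f}")
```

The printed share should land near 0.05, which is the cutoff doing exactly what it was defined to do: letting about 1 in 20 null results through.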

But P-value isn’t everything. A poorly designed study can result in an impressive P-value. Also, the size of the effect must be considered. You can have a low P-value for a tiny effect (if there are large numbers of subjects in the trial) – the effect may be clinically insignificant, and small effects are more likely to be due to hidden biases or confounders.

Therefore, we generally are only impressed when a clinically large effect also has a low P-value.
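The interplay between effect size, study size, and P-value can be sketched with a back-of-the-envelope calculation. This assumes a simple two-arm comparison of means using a z-test, with the effect expressed in standard deviations; the specific numbers are illustrative only.

```python
from statistics import NormalDist

def p_value_for_effect(effect_in_sd, n_per_arm):
    """Two-sided z-test P-value if the observed difference equals `effect_in_sd`.

    Assumes two equal arms and outcomes with a standard deviation of 1.
    """
    standard_error = (2 / n_per_arm) ** 0.5
    z = effect_in_sd / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

tiny_effect = 0.02  # two hundredths of a standard deviation
small_study = p_value_for_effect(tiny_effect, n_per_arm=100)
huge_study = p_value_for_effect(tiny_effect, n_per_arm=100_000)
print(f"n=100 per arm: P={small_study:.2f}   n=100,000 per arm: P={huge_study:.1e}")
```

The same tiny difference is nowhere near significant in the small study and highly significant in the huge one, which is why the P-value alone says nothing about whether an effect matters clinically.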

In addition to P-value, the number of subjects in the trial is very important. Even though these are related, the larger the study the more impressive the results, as random fluctuations are less likely to play a role.

One common trick to look out for is multiple analyses. A study may, for example, look at 10 variables (or one variable at 10 different points in time), and find statistical significance for one, and present that as a positive study. However, this is equivalent to taking 10 chances at that 1 in 20 chance of hitting significance. Proper statistical analysis will account for multiple comparisons.
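The arithmetic behind "10 chances at 1 in 20" can be worked out directly. This sketch assumes the 10 comparisons are independent, which is a simplification (measurements of the same subjects usually are not). The Bonferroni correction shown is one common, conservative adjustment, named here as an example rather than as the method any particular study uses.

```python
def familywise_error_rate(n_comparisons, alpha=0.05):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

def bonferroni_threshold(n_comparisons, alpha=0.05):
    """Per-comparison cutoff that keeps the overall false positive rate near alpha."""
    return alpha / n_comparisons

chance_of_false_hit = familywise_error_rate(10)   # about 0.40
corrected_cutoff = bonferroni_threshold(10)       # 0.05 / 10
print(f"{chance_of_false_hit:.2f} chance of a spurious hit; "
      f"corrected cutoff P < {corrected_cutoff:.3f}")
```

So with 10 uncorrected looks at the data, a worthless treatment has roughly a 40% chance of producing at least one "significant" finding.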

Other factors to look out for

There are other features that are important to consider in evaluating a clinical trial. What was the dropout rate? If half of the subjects dropped out, that unrandomizes or biases the groups, because dropouts are not random. For example, subjects that do not respond to treatment may drop out, leaving only those who do well.
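A small simulation shows how non-random dropout can manufacture an effect. The setup is entirely hypothetical: the treatment does nothing, outcomes are drawn from a standard normal distribution, and every treated subject who fails to improve drops out.

```python
import random
from statistics import mean

def simulate_dropout(n_per_arm=1000, seed=2):
    """Compare all treated subjects, treated completers only, and controls."""
    rng = random.Random(seed)
    # Both arms come from the same distribution: the treatment is worthless.
    treated = [rng.gauss(0, 1) for _ in range(n_per_arm)]
    control = [rng.gauss(0, 1) for _ in range(n_per_arm)]
    # Assumed dropout rule: treated subjects who did not improve leave the study.
    completers = [change for change in treated if change > 0]
    return mean(treated), mean(completers), mean(control)

all_treated, completers_only, controls = simulate_dropout()
print(f"all treated: {all_treated:+.2f}  completers: {completers_only:+.2f}  "
      f"controls: {controls:+.2f}")
```

Analyzing only the completers makes the worthless treatment look clearly beneficial, while the full randomized group looks just like the controls.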

Not all controls are equal, either. Sometimes the control group is not an inactive placebo but standard care. What if the standard treatment is too effective, or what if it is not effective at all? You need to know what the study treatment is being compared to.


When a new clinical trial is being promoted in the news as evidence for or against a treatment – run down this list. Is it a randomized, controlled, double-blind trial? Is the blinding adequate? Are the outcome measures objective and relevant? Is the effect size robust? How large is the study? What variables are actually being isolated? And what was the dropout rate?

And of course, no one study is ever the definitive last word on a clinical question. Each study must be put in the context of the full scientific literature, which means considering plausibility or prior probability. That is the essence of science-based medicine.


Posted by Steven Novella

Founder and currently Executive Editor of Science-Based Medicine, Steven Novella, MD is an academic clinical neurologist at the Yale University School of Medicine. He is also the host and producer of the popular weekly science podcast, The Skeptics’ Guide to the Universe, and the author of the NeuroLogicaBlog, a daily blog that covers news and issues in neuroscience, but also general science, scientific skepticism, philosophy of science, critical thinking, and the intersection of science with the media and society. Dr. Novella also has produced two courses with The Great Courses, and published a book on critical thinking - also called The Skeptics’ Guide to the Universe.