P-Hacking and the Problem of Multiple Comparisons

The logic of the scientific method is straightforward…

1) Observe a phenomenon, generally some type of cause/effect relationship (e.g., X causes Y).

2) Generate a hypothesis describing the mechanism–the why–that connects X with Y.

3) Collect data on X and then on Y.

4) Analyze this data to either reject your hypothesis (X does not cause Y) or retain your hypothesis to investigate further.

As social scientists, we adhere–as best as possible–to this idealized notion of how scientific knowledge accumulates. Bounded by the double hermeneutic, however, studying the uniquely human condition presents challenges for how the ideal scientific process works in the real world. People, entrepreneurs, businesses, and society change constantly. To a social scientist–or to the data analyst in an organization–objective and incontrovertible truth is then fundamentally unknowable. So, we have to do the best that we can. Unfortunately, the degrees of freedom available to social scientists, including management/organizational researchers, often exacerbate the inherent challenges to our research.

While it is easy to lie with statistics, it is even easier to lie without them. ~ Frederick Mosteller

There are countless judgement calls in applied statistics that reasonable people may disagree over, and that may fundamentally change the results of a study. That’s not what I’m going to talk about here though. What I want to address is a class of questionable research practices that, whether employed by omission or, even worse, by commission, fundamentally bias study results and conclusions.

I’m going to focus on three of them–multiple comparisons, p-hacking, and hypothesizing after the results are known (HARKing)–although there are several others. I’m also not going to address data fabrication or results manipulation. These activities do occur, but to me they have crossed the line into outright ethical violations that necessitate immediate censure by the community.

The three I’m talking about happen all the time, seem like they wouldn’t be that big of a deal, but in reality, result in Type I and Type II errors, lead to erroneous conclusions about effect sizes and courses of action, and contribute to the proliferation of scientific studies that are not replicable. This last issue is particularly bad. The only way to show that a finding is robust–to show that it wasn’t just an artifact of the lab or of a single dataset–is to reproduce the finding using the same method but on a new sample. While there is no clear way to be certain, a substantial number of published studies in the social sciences (psychology and behavioral economics for example), and even in the medical sciences, fail this fundamental premise.

As others have noted, and I agree, the following three practices are highly likely to be the culprits behind the proliferation of one-off findings.

Oh, in the interest of full disclosure, I’ve done all three of these in the past. Now though I know better 🙂

1) Multiple comparisons

This one seems innocuous, it’s easy to do, and it doesn’t seem like it would do any real damage. The topic is often addressed in basic statistics textbooks, but then is often quickly forgotten about as analysts go about their work.

In null hypothesis testing, we’re interested in the probability of seeing a result at least as large as the one we observed if the null hypothesis were true, that is, if chance alone were at work. Usually, we set a p < .05 standard, and we interpret this value as a 1 in 20 probability that the result we observed, or a larger result, would have occurred by chance alone. A p-value of .001 would be a 1 in 1,000 probability, a p-value of .5 would be a 1 in 2 probability, and so forth.

I’m a fan of null hypothesis testing, although it certainly has its limitations. The problem with the p < .05 standard though is that it is frighteningly easy to meet this threshold by chance…

Imagine you are working with the data and you conduct, say, 20 different analyses on the same data. Applying the logic of our p < .05 standard, and treating the analyses as independent, the chance of rejecting the null hypothesis at least once when no real effect exists is about 64% (1 – .95^20). If you run a hundred models–something software and a pot of coffee make ridiculously simple–you should expect about 5 ‘statistically significant findings’ that are spurious; they are simply artifacts of the data.
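To make that arithmetic concrete, here is a minimal Python sketch. It assumes the tests are independent and that every null hypothesis is true, which real analyses on overlapping variables will only approximate. The function name is made up for illustration.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests
    when every null hypothesis is true (an idealized assumption)."""
    return 1.0 - (1.0 - alpha) ** num_tests


for m in (1, 5, 20, 100):
    rate = familywise_error_rate(m)
    expected_false_positives = m * 0.05
    print(f"{m:>3} tests: P(at least one false positive) = {rate:.2f}, "
          f"expected false positives = {expected_false_positives:.2f}")
```

With 20 tests the chance of at least one spurious asterisk is already close to two in three, and with 100 tests it is a near certainty.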

Now, this is actually a big topic with a lot of moving pieces, and we don’t have the space to do it justice. But the key takeaway is that the more models/analyses you run on the same data, and especially with the same basic collection of variables, the greater the odds that when you observe a p value less than .05 it is happening purely by chance.

I often get questions, though, about exploratory analysis, particularly when validating a measurement model, and the necessity of having to run dozens of similar models. Yes, you effectively have a multiple comparison problem. So, to me, the answer is to replicate yourself. You can report the initial study with the multiple comparisons, but call it exploratory and disclose what you did. Then, collect a new sample and test your results with a replication in the same paper. That shows the robustness of the findings and gives consumers of your analysis greater confidence that what you are reporting isn’t just a function of point and click statistics.

2) P-Hacking

Richard Bettis from the University of North Carolina best summed up p-hacking as “the hunt for asterisks.” It’s a problem that applies to both professional researchers and practitioners.

P-hacking is a close cousin of the multiple comparison problem, but here the motivation is a bit more sinister. With p-hacking, the analyst isn’t really looking to test a hypothesis, but is ‘letting the data speak’ by running a model and just looking for statistically significant relationships (e.g., p < .05). Usually, though, it’s not just one model but dozens and dozens, which we already learned is likely to result in a spurious finding.

The big problem with p-hacking is that we simply do not know if the strength of the relationship found is purely an artifact of the sample, the analytical method used, or legitimate judgment calls made by the researcher. We just don’t know. When p-hacking is combined with the multiple comparison problem, and if statistical power is low, the probability becomes high that the observed finding happened just by chance.

Just as with the multiple comparison problem, the best way to address p-hacking is to not do it, but in the case of an exploratory study, replicate yourself using a new sample and take care not to engage in exploratory analysis with the new sample. Notice a theme here? Because academic journals generally don’t favor replication studies–although this is [happily] changing–the answer to improve the confidence in our findings is to self-replicate and then to include that replication in the submission of the paper.
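A small simulation can illustrate why self-replication works as a check. The sketch below is an illustration under stated assumptions, not a description of any particular study: every variable is pure noise, each simulated "study" hunts through 30 candidate predictors for the smallest p-value, and any "winner" is then re-tested once on a fresh sample. The sample size, number of predictors, and function names are arbitrary choices, and the p-values use the Fisher z approximation rather than an exact test.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def p_value(r, n):
    """Two-sided p-value for a correlation (Fisher z approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def noise(rng, n):
    return [rng.gauss(0, 1) for _ in range(n)]


def simulate(num_studies=500, n=100, num_predictors=30, alpha=0.05, seed=1):
    rng = random.Random(seed)
    hits = 0        # studies where the hunt found at least one asterisk
    replicated = 0  # hits that were also significant in a fresh sample
    for _ in range(num_studies):
        y = noise(rng, n)
        # Exploratory phase: keep the smallest p-value across all predictors.
        best_p = min(p_value(pearson_r(noise(rng, n), y), n)
                     for _ in range(num_predictors))
        if best_p < alpha:
            hits += 1
            # Replication phase: one pre-specified test on new data. Because
            # everything is noise, the "winning" predictor in a new sample is
            # simply a fresh pair of independent draws.
            if p_value(pearson_r(noise(rng, n), noise(rng, n)), n) < alpha:
                replicated += 1
    return hits / num_studies, (replicated / hits if hits else 0.0)


hit_rate, replication_rate = simulate()
print(f"Studies with at least one 'significant' predictor: {hit_rate:.0%}")
print(f"Of those, share that replicate on a new sample:    {replication_rate:.0%}")
```

Under these assumptions, most of the simulated hunts turn up an asterisk, while only a small fraction of those findings survive a single pre-specified test on new data, roughly in line with the nominal 5% error rate.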

Just remember–fishing is fine for tuna, bad for data analysis.

3) Hypothesizing After Results Are Known (HARKing)

Of the three, this one is the worst. HARKing typically results from multiple comparisons and p-hacking, although technically it doesn’t have to (i.e., the analyst got lucky on the first shot and didn’t do anything else).

With HARKing, the analyst presents a hypothesis as if he/she set out to test that relationship right from the very beginning, but has already completed all of the analysis and knows how the data turn out. It is, effectively, the very opposite of the scientific method.

There are a number of discussions about HARKing and why it’s really, really bad, especially for academic research. I want to offer though another take on why it’s bad, particularly when it happens with studies of low statistical power.

We know that low power increases the likelihood of a Type II error–we just don’t have enough data to detect the effect we think is happening. Low power also has a more pernicious impact, though–if a Type I error occurs in a sample with low power, the observed effect size is highly likely to be substantially bigger than the true population value.

To put it another way, if I’ve p-hacked a small sample (let’s say, a sample of 200) with multiple comparisons and ended up finding a statistically significant result, the correlation (coefficient) I found is likely to be a lot bigger than its real value. I just made a problem substantially worse.

There is a good explanation of this effect here and here, but the basic idea is that in the presence of a Type I error, under-powered studies have a higher chance of showing a substantially inflated effect size.
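The inflation can be seen in a short simulation. The following is a minimal sketch under assumptions that are not in the original discussion: a true population correlation of .10, samples of 40, and significance judged with the Fisher z approximation. Only the "significant" samples are kept, which mimics reporting only the results that earned an asterisk. Setting `true_r=0.0` gives the pure Type I error case.

```python
import math
import random
from statistics import NormalDist, mean


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_inflation(true_r=0.10, n=40, num_studies=5000, alpha=0.05, seed=7):
    """Return (share of significant studies, mean |r| among them)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    noise_sd = math.sqrt(1 - true_r ** 2)
    significant = []
    for _ in range(num_studies):
        x = [rng.gauss(0, 1) for _ in range(n)]
        # y is built so that its population correlation with x equals true_r.
        y = [true_r * xi + noise_sd * rng.gauss(0, 1) for xi in x]
        r = pearson_r(x, y)
        # Significance filter: keep only results that clear p < alpha.
        if abs(math.atanh(r)) * math.sqrt(n - 3) > z_crit:
            significant.append(abs(r))
    share = len(significant) / num_studies
    return share, (mean(significant) if significant else float("nan"))


share_significant, mean_significant_r = simulate_inflation()
print(f"Share of samples reaching p < .05: {share_significant:.1%}")
print(f"Mean |r| among those samples:      {mean_significant_r:.2f} (true value: 0.10)")
```

With only 40 observations, a correlation has to be larger than about .31 in absolute value to reach p < .05 at all, so every "significant" estimate in this setup is at least three times the true value.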

So why does this matter?

Consider an applied case first. An analyst inside a business is playing around with data and finds that spending (per one hundred dollars) on employee morale programs increases employee satisfaction as measured on a scale from 1-5. After running dozens of regressions with various combinations of control variables, and so forth, the analyst finds a statistically significant coefficient of .25 (p = .035) with a sample of 40 respondents. The analyst tells his or her boss that for every one hundred dollars they spend on morale, employee satisfaction increases by .25, which seems like a big deal.

Now, a simple power calculation shows that this study is very under-powered (33% versus the accepted 80% standard) to detect an effect that strong. But the bigger concern is that the boss may now decide to boost morale spending expecting a relatively big payoff that may well not materialize at all. Is it possible that the analyst found the ‘true’ population value? Sure, but the odds are against it.
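The post does not say how the power figure was computed. One way to get a similar number is sketched below; it rests on the assumption that the .25 is treated as a correlation-sized effect and it uses the Fisher z normal approximation rather than an exact calculation. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def correlation_power(r: float, n: int, alpha: float = 0.05) -> float:
    """Approximate two-sided power to detect a population correlation r
    with n observations (Fisher z normal approximation)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = math.atanh(r) * math.sqrt(n - 3)
    return (1 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)


def required_n(r: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate sample size needed to reach the target power."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_power = nd.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)


print(f"Approximate power with r = .25, n = 40: {correlation_power(0.25, 40):.0%}")
print(f"Approximate n needed for 80% power:     {required_n(0.25)}")
```

Under this approximation, power with 40 respondents comes out at roughly one third, and reaching the 80% standard would take about three times as many respondents.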

On the academic side, inflated coefficient estimates from under-powered studies may find their way into meta-analyses, giving a distorted view of the population effect size and diminishing the value of the meta-analysis. They also likely make it difficult to report future studies with substantially larger samples but smaller effect sizes. Assuming consistency of the parameter estimates, the larger study’s reported effect size is far more likely to be close to the population effect size (and likely smaller), but it looks less ‘interesting’ because early, under-powered studies reported an inflated effect.

Further, HARKed papers often show effects–especially interaction effects–that seem counterintuitive and hence are more interesting, but are presented as if the researcher had set out to test that proposition to begin with. What happens later is the file drawer problem: future scholars attempt to replicate the original paper and fail to do so, but do not publish the failed replication. In addition to producing faulty/biased/inflated effect sizes, HARKing wastes the time and resources of other researchers.

Even worse, the often smaller reported effect sizes in appropriately powered studies often lead to the erroneous–and silly–criticism that a sample is too large, or (and yes, I actually received a comment like this from a reviewer) that with a large sample “effects are significant by default.”

For me, the answer to avoid the HARKing temptation is to not do it, and also not to do it if asked to by a reviewer/editor (this has happened to me several times, and often it means giving up the invitation to revise and resubmit). From a practical perspective though, assuming you have appropriate statistical power, if you do see an interesting or counterintuitive result in an initial round of analysis, as before, replicate yourself. Disclose that the initial study was exploratory and you identified an unexpected result, but then collect a new sample and see if the result holds (again avoiding the multiple comparisons/p-hacking problem).

Key Takeaway

Questionable research practices aren’t new, but the growing attention to their prevalence and the harm they do reinforces the obligation that we have as researchers to do the best science that we can do–whether as academics, or as analysts within organizations. Making corrections for multiple comparisons, avoiding p-hacking, and not HARKing ONLY makes the science we do better.
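For the first of those, correcting for multiple comparisons, two standard textbook adjustments are Bonferroni and Holm. The sketch below is a minimal illustration, not a recommendation of one method over another, and the p-values in it are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down procedure: test p-values from smallest to largest
    against progressively looser thresholds, stopping at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject


# Hypothetical p-values from five analyses run on the same data.
p_values = [0.001, 0.012, 0.035, 0.040, 0.200]
print("Uncorrected:", [p < 0.05 for p in p_values])
print("Bonferroni: ", bonferroni(p_values))
print("Holm:       ", holm(p_values))
```

In this made-up example, four of the five results clear p < .05 uncorrected, one survives Bonferroni, and two survive Holm.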

Dr. Brian Anderson's Blog
