Why Do So Many Analyses Rely On a P-Value of 0.05?


In the world of statistics, logical explanations exist for nearly every formula or decision point made during an analysis. From describing the characteristics of your measures to choosing your analytic methods, you draw on scientifically grounded approaches developed by statisticians. You will encounter one exception to this “rule”, however, in all but the most basic of analyses. Your interpretation of a result as statistically significant or not depends on your choice of p-value (or probability value) threshold. Most analysts’ standard choice of p-value threshold is 0.05…but why? In this article, I’ll explain why so many analyses rely on a p-value of 0.05 as that threshold.

What Is a P-Value?

“Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.” – Wasserstein & Lazar (2016)[1]

In plain, non-technical, mathless language, the p-value tells you how incompatible your data are with the statistical model you are using. More specifically, when you test a statistical hypothesis, the resulting p-value tells you how compatible your data are with that hypothesis.

In the simplest example, you might be performing a one-sample z-test. You use the one-sample z-test to test whether your sample mean is different from an estimate of a population mean. Like many statistical tests, the one-sample z-test asks how compatible the data are with a difference of zero (0). The resulting p-value tells you how probable a sample mean at least as far from the population estimate as yours would be if that hypothesis of no difference were true.
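To make that concrete, here is a minimal sketch of a one-sample z-test in Python. The sample values, the hypothesized population mean of 100, and the assumed-known population standard deviation are all invented for illustration:

```python
# A minimal one-sample z-test sketch, assuming the population standard
# deviation is known. All numbers here are made up for illustration.
import numpy as np
from scipy import stats

sample = np.array([102.0, 99.5, 101.2, 98.7, 103.4, 100.9, 97.8, 102.6])
pop_mean = 100.0  # hypothesized population mean (no difference)
pop_sd = 2.5      # assumed known population standard deviation

z = (sample.mean() - pop_mean) / (pop_sd / np.sqrt(len(sample)))

# Two-sided p-value: how often a z statistic at least this extreme
# would occur if the population mean really were 100.
p_value = 2 * stats.norm.sf(abs(z))
print(f"z = {z:.3f}, p-value = {p_value:.3f}")
```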

If you find a high p-value, then your data are more compatible with the idea that there is no statistical difference. In contrast, if your p-value is very small, then your data are less compatible with the idea of no difference.

Now, at this point, you are probably asking where the 0.05 comes into play. You use a p-value threshold to decide the point at which you call the result incompatible with the hypothesis, and analysts have traditionally used 0.05 as that threshold.

Thus, if the resulting p-value from your test is 0.05 or less, your result is statistically significant. At this level, your data are generally considered incompatible with the hypothesis of no difference. In contrast, a p-value result higher than the 0.05 threshold indicates the data are compatible with the hypothesis.
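As a tiny illustration, here is what that decision rule looks like in code. The interpret helper and the example p-values are hypothetical, and the 0.05 default is the convention rather than a mathematical necessity:

```python
# The conventional decision rule. The threshold (alpha) is a choice,
# not a mathematical law; 0.05 is simply the traditional default.
def interpret(p_value: float, alpha: float = 0.05) -> str:
    if p_value <= alpha:
        return "statistically significant"
    return "not statistically significant"

print(interpret(0.03))   # statistically significant
print(interpret(0.051))  # not statistically significant
```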

Why Use a P-Value of 0.05 or Less?

If you’re reading this, you’re a curious kind of person, like me. And you are probably asking yourself, why is 0.05 the magic number? Well, it’s not a magic number. In fact, statisticians have had significant debate (no statistical pun intended) about its use over the past century. But analysts’ use of 0.05 as a p-value threshold is based on one part mathematical convenience and, possibly, one part academic feud.

Ronald A. Fisher (1890–1962) is considered by many to be the father of modern statistics (although there are many other important contributors).[2],[3] Among Fisher’s many contributions was his 1925 book Statistical Methods for Research Workers (SMRW). In this text, he sought to make statistical techniques more accessible to applied researchers across many disciplines.

A Mathematical Convenience

In SMRW, Fisher explains that a p-value of 0.05 is a practical threshold because it represents results that are roughly two (2) standard deviations away from no difference. With a threshold of 0.05, if there truly were no difference, you would expect a falsely significant result only about once in 20 replications of the analysis. Fisher’s argument was a convenient mathematical justification for choosing a p-value of 0.05. His choice was not based on any mathematical proof that 0.05 was more important than other possible p-values.[4]
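You can check both parts of that convenience yourself with a quick sketch using scipy’s normal-distribution functions (the specific calls below are just one way to do it):

```python
# Checking Fisher's "two standard deviations" convenience numerically.
from scipy import stats

# Two-sided tail probability beyond +/- 2 standard deviations:
print(2 * stats.norm.sf(2.0))        # ~0.0455 -- roughly 0.05

# The exact z value matching a two-sided p of 0.05:
print(stats.norm.ppf(1 - 0.05 / 2))  # ~1.96 -- roughly 2

# A 0.05 threshold means one false alarm in 20 on average:
print(1 / 0.05)                      # 20.0
```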

An Academic Feud

In addition to the convenience of 0.05, an academic feud may have helped promote Fisher’s choice of p-value.[4] Karl Pearson was a contemporary of Fisher’s and the editor of the journal Biometrika. The journal held the copyrights to some of the detailed statistical tables available at the time.

Unfortunately, Pearson and Fisher did not see eye-to-eye academically, and Biometrika refused Fisher copyright permission to replicate the tables. This forced Fisher to create his own tables, which he made easier to navigate by pre-selecting p-value thresholds to include. Among these pre-selected values, Fisher included 0.05 and 0.01, a practice that continues to this day.

It is important to note that even Fisher didn’t believe a p-value of 0.05 was the best choice in all cases. As one biostatistician aptly noted, “…if 0.05 worked in every setting, there would have been only one column in each table.”[4] Regardless, a convenient mathematical justification and an academic feud conspired to make 0.05 a practical choice for researchers in many disciplines.

Conclusion

And there you have it. One mathematical convenience, and one academic feud. As a result, millions of statistics students today learn that the conventional threshold for determining statistical significance is p ≤ 0.05.

The next time you’re working on an analysis, however, consider the implications of the p ≤ 0.05 threshold. If p = 0.051, then the convention is to interpret the result as non-significant. In other words, under standard practice a p-value difference of 0.001 (or 0.1%) is enough to flip the interpretation of the result. That’s a pretty small margin for a probability that is a measure of compatibility, or uncertainty. And THIS, my friend, is why non-significance by itself is not a sufficient reason to conclude that no difference exists.

In my next post, we’ll explore this idea of uncertainty in statistical testing further. And if you are a business leader looking for a real data literacy assessment tool to use with your team, you’ll want to check out our free self-scoring assessment tool here.


[1] Wasserstein, Ronald L., and Nicole A. Lazar (2016). The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician 70(2): 129-133.

[2] Hald, Anders (1998). A History of Mathematical Statistics. New York: Wiley.

[3] Kennedy-Shaffer, Lee (2019). Before p < 0.05 to Beyond p < 0.05: Using History to Contextualize p-Values and Significance Testing. The American Statistician 73(Suppl. 1): 82-90.

[4] In fact, the size of your sample is one important reason you might want to use a p-value threshold different from 0.05. With a very small sample, you might set your p-value threshold to a larger value like 0.10 because detecting a signal in small samples of data is more difficult. In contrast, with a very large sample the opposite is true, and you might set a threshold lower than 0.05.
