The End of “Statistical Significance”

I’ve linked to a number of articles about the misuse of p-values. Recently, I read a nice article in the October/November 2019 issue of MAA Focus summarizing a conversation between the Executive Directors of the Mathematical Association of America and the American Statistical Association about the ASA’s call to eliminate the use of p-values. Per copyright, I can’t copy the entire article here, but let me quote the lead paragraph:

In March 2016, the American Statistical Association took the extraordinary step of issuing a Statement on p-Values and Statistical Significance. This spring, the association went even further, publishing a massive special issue of its journal The American Statistician entitled Statistical Inference in the 21st Century: A World Beyond p<0.05. The lead editorial in that special issue called for the end of the use of the concept of statistical significance.

It’s going to be a while before entrenched statistics textbooks catch up with this new standard of professional practice.

Here’s an NPR article on the issue:

Other articles cited in the MAA Focus article:

Statistics for People in a Hurry

The following article was recommended to me by a former student. Its synopsis is in the opening paragraph:

Ever wished someone would just tell you what the point of statistics is and what the jargon means in plain English? Let me try to grant that wish for you! I’ll zoom through all the biggest ideas in statistics in 8 minutes! Or just 1 minute, if you stick to the large font bits.

My Favorite One-Liners: Part 95

In this series, I’m compiling some of the quips and one-liners that I’ll use with my students to hopefully make my lessons more memorable for them.

Today’s quip is one that I’ll use in a statistics class when we find an extraordinarily small P-value. For example:

There is a social theory that states that people tend to postpone their deaths until after some meaningful event… birthdays, anniversaries, the World Series.

In 1978, social scientists investigated obituaries that appeared in a Salt Lake City newspaper. Among the 747 obituaries examined, 60 of the deaths occurred in the three-month period preceding their birth month. However, if the day of death is independent of birthday, we would expect that 25% of these deaths would occur in this three-month period.

Does this study provide statistically significant evidence to support this theory? Use \alpha=0.01.

It turns out, using a one-tailed hypothesis test for proportions, that the test statistic is z = -10.71 and the P-value is about 4.5 \times 10^{-27}. After the computations, I’ll then discuss what the numbers mean.
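For readers who want to check the arithmetic, here is a minimal sketch of that one-tailed test for a proportion in Python. It assumes the usual normal approximation with the standard error computed from the null proportion of 25%; the variable names are mine and are only for illustration.

```python
import math

# Data from the obituary study described above.
n = 747                # obituaries examined
deaths_before = 60     # deaths in the three months before the birth month
p0 = 0.25              # proportion expected under the null hypothesis

# Sample proportion and its standard error under the null hypothesis.
p_hat = deaths_before / n
standard_error = math.sqrt(p0 * (1 - p0) / n)

# Test statistic for a one-sample test of a proportion.
z = (p_hat - p0) / standard_error

# Left-tailed P-value: the standard normal CDF evaluated at z,
# written with erfc so that the tiny tail probability stays accurate.
p_value = 0.5 * math.erfc(-z / math.sqrt(2))

print(f"z = {z:.2f}")
print(f"P-value = {p_value:.1e}")
```

This should give a test statistic that rounds to -10.71 and a P-value on the order of 10^{-27}, in line with the numbers quoted above.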

I’ll begin by asking, “Is the null hypothesis [that the proportion of deaths really is 25%] possible?” The correct answer is, “Yes, it’s possible.” Even extraordinarily small P-values do not prove that the null hypothesis is impossible. To emphasize the point, I’ll say:

After all, I found a woman who agreed to marry me. So extremely unlikely events are still possible.

Once the laughter dies down, I’ll ask the second question, “Is the null hypothesis plausible?” Of course, the answer is no, and so we reject the null hypothesis in favor of the alternative.


Statisticians Found One Thing They Can Agree On: It’s Time To Stop Misusing P-Values

From the excellent article:

A common misconception among nonstatisticians is that p-values can tell you the probability that a result occurred by chance. This interpretation is dead wrong, but you see it again and again and again and again. The p-value only tells you something about the probability of seeing your results given a particular hypothetical explanation — it cannot tell you the probability that the results are true or whether they’re due to random chance…

Nor can a p-value tell you the size of an effect, the strength of the evidence or the importance of a result. Yet despite all these limitations, p-values are often used as a way to separate true findings from spurious ones, and that creates perverse incentives…

If there’s one takeaway from the ASA statement, it’s that p-values are not badges of truth and p < 0.05 is not a line that separates real results from false ones. They’re simply one piece of a puzzle that should be considered in the context of other evidence.

The article above links to the statement by the American Statistical Association as well as various commentaries by statisticians about the proper use of p-values.
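To see why a p-value is not the probability that a result is due to chance, here is a rough sketch with made-up numbers. Suppose only 10% of the hypotheses being tested are real effects, the tests have 80% power, and results are declared "significant" at p < 0.05. All three figures are assumptions chosen purely for illustration; none of them come from the article or the ASA statement.

```python
# Hypothetical inputs, chosen only for illustration.
prior_real = 0.10   # assumed share of tested hypotheses that are real effects
power = 0.80        # assumed chance a real effect yields p < alpha
alpha = 0.05        # significance threshold

# Share of all tests that come out "significant" for each kind of hypothesis.
true_positives = power * prior_real
false_positives = alpha * (1 - prior_real)

# Among the "significant" results, the share that are false alarms.
share_false = false_positives / (true_positives + false_positives)

print(f"Share of significant results that are false positives: {share_false:.0%}")
```

Under these assumptions, roughly a third of the "significant" results are false positives, even though every one of them has p < 0.05. The p-value alone cannot tell you this; the answer depends on the other inputs too.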


Not Even Scientists Can Easily Explain P-Values

FiveThirtyEight published a very interesting feature that asked some leading scientists at a statistics conference to explain a P-value in simple, nontechnical terms. While they all knew the technical definition of a P-value, they were at a loss as to how to explain this technical notion to a nontechnical audience.

I plan on showing this article (and the embedded video) to my future statistics classes.

Not Even Scientists Can Easily Explain P-values


io9: “I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here’s How.”

Peer-reviewed publication is the best way that we’ve figured out for vetting scientific experiments and disseminating scientific knowledge. But that doesn’t mean that the system can’t be abused, either consciously or unconsciously.

The eye-opening article describes how the author published flimsy data that any discerning statistician should have seen through and even managed to get his “results” spread in the popular press. Some money quotes:

Here’s a dirty little science secret: If you measure a large number of things about a small number of people, you are almost guaranteed to get a “statistically significant” result. Our study included 18 different measurements—weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, etc.—from 15 people. (One subject was dropped.) That study design is a recipe for false positives.

Think of the measurements as lottery tickets. Each one has a small chance of paying off in the form of a “significant” result that we can spin a story around and sell to the media. The more tickets you buy, the more likely you are to win. We didn’t know exactly what would pan out—the headline could have been that chocolate improves sleep or lowers blood pressure—but we knew our chances of getting at least one “statistically significant” result were pretty good.
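To put a rough number on the lottery-ticket analogy, here is a small sketch. It assumes each measurement is tested at the 0.05 level and that the tests are independent, neither of which is stated in the quote, so treat the result as a back-of-the-envelope figure. The function name is my own.

```python
def chance_of_any_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of num_tests independent tests is
    'significant' at level alpha when no real effect exists."""
    return 1 - (1 - alpha) ** num_tests

# One ticket, a handful of tickets, and the 18 measurements in the study.
for num_tests in (1, 5, 18):
    chance = chance_of_any_false_positive(num_tests)
    print(f"{num_tests:2d} measurements: {chance:.1%} chance of at least one 'hit'")
```

With 18 measurements, the chance of at least one spurious "significant" result comes out to about 60% under these assumptions.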


With the paper out, it was time to make some noise. I called a friend of a friend who works in scientific PR. She walked me through some of the dirty tricks for grabbing headlines. It was eerie to hear the other side of something I experience every day.

The key is to exploit journalists’ incredible laziness. If you lay out the information just right, you can shape the story that emerges in the media almost like you were writing those stories yourself. In fact, that’s literally what you’re doing, since many reporters just copied and pasted our text.


The only problem with the diet science beat is that it’s science. You have to know how to read a scientific paper—and actually bother to do it. For far too long, the people who cover this beat have treated it like gossip, echoing whatever they find in press releases. Hopefully our little experiment will make reporters and readers alike more skeptical.

If a study doesn’t even list how many people took part in it, or makes a bold diet claim that’s “statistically significant” but doesn’t say how big the effect size is, you should wonder why. But for the most part, we don’t. Which is a pity, because journalists are becoming the de facto peer review system. And when we fail, the world is awash in junk science.

The Extent and Consequences of P-Hacking in Science

A couple of months ago, the open-access journal PLOS Biology (a reputable one, unlike many others) published this very interesting article about the abuse of hypothesis testing in the scientific literature:

Here are some of my favorite quotes from near the end of the article:

The key to decreasing p-hacking is better education of researchers. Many practices that lead to p-hacking are still deemed acceptable. John et al. measured the prevalence of questionable research practices in psychology. They asked survey participants if they had ever engaged in a set of questionable research practices and, if so, whether they thought their actions were defensible on a scale of 0–2 (0 = no, 1 = possibly, 2 = yes). Over 50% of participants admitted to “failing to report all of a study’s dependent measures” and “deciding whether to collect more data after looking to see whether the results were significant,” and these practices received a mean defensibility rating greater than 1.5. This indicates that many researchers p-hack but do not appreciate the extent to which this is a form of scientific misconduct. Amazingly, some animal ethics boards even encourage or mandate the termination of research if a significant result is obtained during the study, which is a particularly egregious form of p-hacking (Anonymous reviewer, personal communication).

Eliminating p-hacking entirely is unlikely when career advancement is assessed by publication output, and publication decisions are affected by the p-value or other measures of statistical support for relationships. Even so, there are a number of steps that the research community and scientific publishers can take to decrease the occurrence of p-hacking.
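One of the practices mentioned above, deciding whether to collect more data after checking for significance, can be illustrated with a short simulation. This is only a sketch under assumptions of my own: the data are standard normal with no real effect, the analysis is a two-sided z-test with known standard deviation, and the hypothetical researcher peeks after every 10 observations and stops at the first p < 0.05.

```python
import math
import random

def two_sided_p(z):
    """Two-sided P-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(peek_every, max_n=100, alpha=0.05, trials=2000, seed=0):
    """Fraction of simulated null experiments declared 'significant'
    when the researcher tests after every peek_every observations
    and stops as soon as p < alpha."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)       # null is true: the mean is 0
            if n % peek_every == 0:
                z = total / math.sqrt(n)       # z-test with known sigma = 1
                if two_sided_p(z) < alpha:
                    false_positives += 1
                    break                      # stop collecting and "publish"
    return false_positives / trials

fixed_rate = false_positive_rate(peek_every=100)   # a single test at n = 100
peeking_rate = false_positive_rate(peek_every=10)  # ten looks at the data

print(f"Single test at the end: {fixed_rate:.3f}")
print(f"Peeking every 10 observations: {peeking_rate:.3f}")
```

The single test should land near the nominal 5% false-positive rate, while repeated peeking pushes the rate well above it, even though no individual test looks improper.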

Scientific research and false positives

I just read the following interesting article regarding the importance of replicating experiments to be sure that a result is scientifically valid. This strikes me as an engaging way to introduce the importance of P-values to a statistics class. Among the many salient quotes:

Psychologists are up in arms over, of all things, the editorial process that led to the recent publication of a special issue of the journal Social Psychology. This may seem like a classic case of ivory tower navel gazing, but its impact extends far beyond academia. The issue attempts to replicate 27 “important findings in social psychology.” Replication—repeating an experiment as closely as possible to see whether you get the same results—is a cornerstone of the scientific method. Replication of experiments is vital not only because it can detect the rare cases of outright fraud, but also because it guards against uncritical acceptance of findings that were actually inadvertent false positives, helps researchers refine experimental techniques, and affirms the existence of new facts that scientific theories must be able to explain…

Unfortunately, published replications have been distressingly rare in psychology. A 2012 survey of the top 100 psychology journals found that barely 1 percent of papers published since 1900 were purely attempts to reproduce previous findings…

Since journal publications are valuable academic currency, researchers—especially those early in their careers—have strong incentives to conduct original work rather than to replicate the findings of others. Replication efforts that do happen but fail to find the expected effect are usually filed away rather than published. That makes the scientific record look more robust and complete than it is—a phenomenon known as the “file drawer problem.”

The emphasis on positive findings may also partly explain the fact that when original studies are subjected to replication, so many turn out to be false positives. The near-universal preference for counterintuitive, positive findings gives researchers an incentive to manipulate their methods or poke around in their data until a positive finding crops up, a common practice known as “p-hacking” because it can result in p-values, or measures of statistical significance, that make the results look stronger, and therefore more believable, than they really are.

I encourage teachers of statistics to read the entire article.