My Favorite One-Liners: Part 95

In this series, I’m compiling some of the quips and one-liners that I’ll use with my students to hopefully make my lessons more memorable for them.

Today’s quip is one that I’ll use in a statistics class when we find an extraordinarily small P-value. For example:

There is a social theory that states that people tend to postpone their deaths until after some meaningful event… birthdays, anniversaries, the World Series.

In 1978, social scientists investigated obituaries that appeared in a Salt Lake City newspaper. Among the 747 obituaries examined, 60 of the deaths occurred in the three-month period preceding their birth month. However, if the day of death is independent of birthday, we would expect that 25% of these deaths would occur in this three-month period.

Does this study provide statistically significant evidence to support this theory? Use \alpha=0.01.

It turns out, using a one-tailed hypothesis test for proportions, that the test statistics is z = -10.71 and the P-value is about 4.5 \times 10^{-27}. After the computations, I’ll then discuss what the numbers mean.

I’ll begin by asking, “Is the null hypothesis [that the proportion of deaths really is 25%] possible?” The correct answer is, “Yes, it’s possible.” Even extraordinarily small P-values do not prove that the null hypothesis is impossible. To emphasize the point, I’ll say:

After all, I found a woman who agreed to marry me. So extremely unlikely events are still possible.

Once the laughter dies down, I’ll ask the second question, “Is the null hypothesis plausible?” Of course, the answer is no, and so we reject the null hypothesis in favor of the alternative.


My Favorite One-Liners: Part 36

In this series, I’m compiling some of the quips and one-liners that I’ll use with my students to hopefully make my lessons more memorable for them.

Not everything in mathematics works out the way we’d prefer it to. For example, in statistics, a Type I error, whose probability is denoted by \alpha, is rejecting the null hypothesis even though the null hypothesis is true. Conversely, a Type II error, whose probability is denoted by \beta, is retaining the null hypothesis even though the null hypothesis is false.

Ideally, we’d like \alpha = 0 and \beta = 0, so there’s no chance of making a mistake. I’ll tell my students:

There are actually two places in the country where this can happen. One’s in California, and the other is in Florida. And that place is called Fantasyland.

My Favorite One-Liners: Part 23

In this series, I’m compiling some of the quips and one-liners that I’ll use with my students to hopefully make my lessons more memorable for them.

Here are some sage words of wisdom that I give in my statistics class:

If the alternative hypothesis has the form p > p_0, then the rejection region lies to the right of p_0. On the other hand, if the alternative hypothesis has the form p < p_0, then the rejection region lies to the left of p_0.

On the other hand, if the alternative hypothesis has the form p \ne p_0, then the rejection region has two parts: one part to the left of p_0, and another part to the right. So it’s kind of like my single days. Back then, my rejection region had two parts: Friday night and Saturday night.

Statistics and percussion

I recently had a flash of insight when teaching statistics. I have completed my lectures of finding confidence intervals and conducting hypothesis testing for one-sample problems (both for averages and for proportions), and I was about to start my lectures on two-sample problems (liek the difference of two means or the difference of two proportions).

On the one hand, this section of the course is considerably more complicated because the formulas are considerably longer and hence harder to remember (and more conducive to careless mistakes when using a calculator). The formula for the standard error is longer, and (in the case of small samples) the Welch-Satterthwaite formula is especially cumbersome to use.

On the other hand, students who have mastered statistical techniques for one sample can easily extend this knowledge to the two-sample case. The test statistic (either z or t) can be found by using the formula (Observed – Expected)/(Standard Error), where the standard error formula has changed, and the critical values of the normal or t distribution is used as before.

I hadn’t prepared this ahead of time, but while I was lecturing to my students I remembered a story that I heard a music professor say about students learning how to play percussion instruments. As opposed to other musicians, the budding percussionist only has a few basic techniques to learn and master. The trick for the percussionist is not memorizing hundreds of different techniques but correctly applying a few techniques to dozens of different kinds of instruments (drums, xylophones, bells, cymbals, etc.)

It hit me that this was an apt analogy for the student of statistics. Once the techniques of the one-sample case are learned, these same techniques are applied, with slight modifications, to the two-sample case.

I’ve been using this analogy ever since, and it seems to resonate (pun intended) with my students as they learn and practice the avalanche of formulas for two-sample statistics problems.

Not Even Scientists Can Easily Explain P-Values published a very interesting feature: asking some leading scientists at a statistics conference to explain a P-value in simple, nontechnical terms. While they all knew the technical definition of a P-value, they were at a loss as to how to explain this technical notion to a nontechnical audience.

I plan on showing this article (and the embedded video) to my future statistics classes.


Interpreting statistical significance


io9: “I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here’s How.”

Peer-reviewed publications is the best way that we’ve figured out for vetting scientific experiments and disseminating scientific knowledge. But that doesn’t mean that the system can’t be abused, either consciously or unconsciously.

The eye-opening article describes how the author published flimsy data that any discerning statistician should have seen through and even managed to get his “results” spread in the popular press. Some money quotes:

Here’s a dirty little science secret: If you measure a large number of things about a small number of people, you are almost guaranteed to get a “statistically significant” result. Our study included 18 different measurements—weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, etc.—from 15 people. (One subject was dropped.) That study design is a recipe for false positives.

Think of the measurements as lottery tickets. Each one has a small chance of paying off in the form of a “significant” result that we can spin a story around and sell to the media. The more tickets you buy, the more likely you are to win. We didn’t know exactly what would pan out—the headline could have been that chocolate improves sleep or lowers blood pressure—but we knew our chances of getting at least one “statistically significant” result were pretty good.


With the paper out, it was time to make some noise. I called a friend of a friend who works in scientific PR. She walked me through some of the dirty tricks for grabbing headlines. It was eerie to hear the other side of something I experience every day.

The key is to exploit journalists’ incredible laziness. If you lay out the information just right, you can shape the story that emerges in the media almost like you were writing those stories yourself. In fact, that’s literally what you’re doing, since many reporters just copied and pasted our text.


The only problem with the diet science beat is that it’s science. You have to know how to read a scientific paper—and actually bother to do it. For far too long, the people who cover this beat have treated it like gossip, echoing whatever they find in press releases. Hopefully our little experiment will make reporters and readers alike more skeptical.

If a study doesn’t even list how many people took part in it, or makes a bold diet claim that’s “statistically significant” but doesn’t say how big the effect size is, you should wonder why. But for the most part, we don’t. Which is a pity, because journalists are becoming the de facto peer review system. And when we fail, the world is awash in junk science.

The Extent and Consequences of P-Hacking in Science

A couple months ago, the open-access journal PLOS Biology (which is a reputable open-access journal, unlike many others) published this very interesting article about the abuse of hypothesis testing in the scientific literature:

Here are some of my favorite quotes from near the end of the article:

The key to decreasing p-hacking is better education of researchers. Many practices that lead to p-hacking are still deemed acceptable. John et al. measured the prevalence of questionable research practices in psychology. They asked survey participants if they had ever engaged in a set of questionable research practices and, if so, whether they thought their actions were defensible on a scale of 0–2 (0 = no, 1 = possibly, 2 = yes). Over 50% of participants admitted to “failing to report all of a study’s dependent measures” and “deciding whether to collect more data after looking to see whether the results were significant,” and these practices received a mean defensibility rating greater than 1.5. This indicates that many researchers p-hack but do not appreciate the extent to which this is a form of scientific misconduct. Amazingly, some animal ethics boards even encourage or mandate the termination of research if a significant result is obtained during the study, which is a particularly egregious form of p-hacking (Anonymous reviewer, personal communication).

Eliminating p-hacking entirely is unlikely when career advancement is assessed by publication output, and publication decisions are affected by the p-value or other measures of statistical support for relationships. Even so, there are a number of steps that the research community and scientific publishers can take to decrease the occurrence of p-hacking.

Student t distribution

One of my favorite anecdotes that I share with my statistics students is why the Student t distribution is called the t distribution and not the Gosset distribution.

From Wikipedia:

In the English-language literature it takes its name from William Sealy Gosset’s 1908 paper in Biometrika under the pseudonym “Student”. Gosset worked at the Guinness Brewery in Dublin, Ireland, and was interested in the problems of small samples, for example the chemical properties of barley where sample sizes might be as low as 3. One version of the origin of the pseudonym is that Gosset’s employer preferred staff to use pen names when publishing scientific papers instead of their real name, therefore he used the name “Student” to hide his identity. Another version is that Guinness did not want their competitors to know that they were using the t-test to test the quality of raw material.

Gosset’s paper refers to the distribution as the “frequency distribution of standard deviations of samples drawn from a normal population”. It became well-known through the work of Ronald A. Fisher, who called the distribution “Student’s distribution” and referred to the value as t.

From the 1963 book Experimentation and Measurement (see pages 68-69 of the PDF, which are marked as pages 69-70 on the original):

The mathematical solution to this problem was first discovered by an Irish chemist who wrote under the pen name of “Student.” Student worked for a company that was unwilling to reveal its connection with him lest its competitors discover that Student’s work would also be advantageous to them. It now seems extraordinary that the author of this classic paper on measurements was not known for more than twenty years. Eventually it was learned that his real name was William Sealy Gosset (1876-1937).

A T-shirt describing hypothesis testing