Student t distribution

One of my favorite anecdotes that I share with my statistics students is why the Student t distribution is called the t distribution and not the Gosset distribution.

From Wikipedia:

In the English-language literature it takes its name from William Sealy Gosset’s 1908 paper in Biometrika under the pseudonym “Student”. Gosset worked at the Guinness Brewery in Dublin, Ireland, and was interested in the problems of small samples, for example the chemical properties of barley where sample sizes might be as low as 3. One version of the origin of the pseudonym is that Gosset’s employer preferred staff to use pen names when publishing scientific papers instead of their real name, therefore he used the name “Student” to hide his identity. Another version is that Guinness did not want their competitors to know that they were using the t-test to test the quality of raw material.

Gosset’s paper refers to the distribution as the “frequency distribution of standard deviations of samples drawn from a normal population”. It became well-known through the work of Ronald A. Fisher, who called the distribution “Student’s distribution” and referred to the value as t.

From the 1963 book Experimentation and Measurement (see pages 68-69 of the PDF, which are marked as pages 69-70 on the original):

The mathematical solution to this problem was first discovered by an Irish chemist who wrote under the pen name of “Student.” Student worked for a company that was unwilling to reveal its connection with him lest its competitors discover that Student’s work would also be advantageous to them. It now seems extraordinary that the author of this classic paper on measurements was not known for more than twenty years. Eventually it was learned that his real name was William Sealy Gosset (1876-1937).
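
As a quick aside for readers who like to see the numbers: the reason small samples demanded a new distribution is that the t distribution has much heavier tails than the normal distribution when the degrees of freedom are tiny. Here is a minimal sketch in Python (my own illustration, not part of the quoted history):

```python
# A minimal sketch (not from the quoted sources) comparing critical values.
# With only n = 3 observations, the 95% t critical value is more than twice
# the familiar 1.96 from the normal distribution.
from scipy import stats

n = 3                                   # a sample as small as Gosset's barley runs
t_crit = stats.t.ppf(0.975, df=n - 1)   # two-sided 95% critical value with 2 degrees of freedom
z_crit = stats.norm.ppf(0.975)          # the corresponding normal-theory value

print(f"t critical value (df = {n - 1}): {t_crit:.3f}")   # about 4.303
print(f"z critical value:               {z_crit:.3f}")    # about 1.960
```

In other words, a confidence interval built from three measurements needs to be more than twice as wide as normal theory would suggest, which is exactly the problem that Gosset's distribution solves.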

A nice news article on Bayesian statistics

The New York Times consistently provides the best coverage of mathematics and science by a traditional news outlet. Today, I’d like to feature their article The Odds, Updated Continually, which gives a nice synopsis of the growth of Bayesian statistics in recent years and how Bayesian statistics differs from the frequentist interpretation of statistics. For example:

Statistics may not sound like the most heroic of pursuits. But if not for statisticians, a Long Island fisherman might have died in the Atlantic Ocean after falling off his boat early one morning last summer.

The man owes his life to a once obscure field known as Bayesian statistics — a set of mathematical rules for using new data to continuously update beliefs or existing knowledge…

The essence of the frequentist technique is to apply probability to data. If you suspect your friend has a weighted coin, for example, and you observe that it came up heads nine times out of 10, a frequentist would calculate the probability of getting such a result with an unweighted coin. The answer (about 1 percent) is not a direct measure of the probability that the coin is weighted; it’s a measure of how improbable the nine-in-10 result is — a piece of information that can be useful in investigating your suspicion.

By contrast, Bayesian calculations go straight for the probability of the hypothesis, factoring in not just the data from the coin-toss experiment but any other relevant information — including whether you’ve previously seen your friend use a weighted coin.

Scientists who have learned Bayesian statistics often marvel that it propels them through a different kind of scientific reasoning than they’d experienced using classical methods.

“Statistics sounds like this dry, technical subject, but it draws on deep philosophical debates about the nature of reality,” said the Princeton University astrophysicist Edwin Turner, who has witnessed a widespread conversion to Bayesian thinking in his field over the last 15 years…

The Coast Guard has been using Bayesian analysis since the 1970s. The approach lends itself well to problems like searches, which involve a single incident and many different kinds of relevant data, said Lawrence Stone, a statistician for Metron, a scientific consulting firm in Reston, Va., that works with the Coast Guard.

At first, all the Coast Guard knew about the fisherman was that he fell off his boat sometime from 9 p.m. on July 24 to 6 the next morning. The sparse information went into a program called Sarops, for Search and Rescue Optimal Planning System. Over the next few hours, searchers added new information — on prevailing currents, places the search helicopters had already flown and some additional clues found by the boat’s captain.

The system couldn’t deduce exactly where Mr. Aldridge was drifting, but with more information, it continued to narrow down the most promising places to search.

Just before turning back to refuel, a searcher in a helicopter spotted a man clinging to two buoys he had tied together. He had been in the water for 12 hours; he was hypothermic and sunburned but alive.

Even in the jaded 21st century, it was considered something of a miracle.
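
To make the article's coin example concrete for students, here is a minimal sketch of both calculations. The 50/50 prior and the assumption that a weighted coin lands heads 80 percent of the time are invented for illustration; the article itself does not specify them.

```python
# Frequentist vs. Bayesian takes on the coin example above.
# The prior (0.5) and the heads probability of a weighted coin (0.8) are
# assumptions added for illustration; only the 9-heads-in-10 data come
# from the article.
from math import comb

n, k = 10, 9

# Frequentist: probability of a result at least this extreme (9 or more heads)
# if the coin is fair.
p_value = sum(comb(n, x) * 0.5**n for x in range(k, n + 1))
print(f"P(9 or more heads | fair coin) = {p_value:.4f}")    # about 0.0107, the article's ~1 percent

# Bayesian: posterior probability that the coin is weighted, given the data,
# a 50/50 prior, and an assumed heads probability of 0.8 for a weighted coin.
prior_weighted = 0.5
lik_weighted = comb(n, k) * 0.8**k * 0.2**(n - k)
lik_fair = comb(n, k) * 0.5**n
posterior = (lik_weighted * prior_weighted) / (
    lik_weighted * prior_weighted + lik_fair * (1 - prior_weighted)
)
print(f"P(coin is weighted | data)     = {posterior:.3f}")  # about 0.96 under these assumptions
```

The first number is a statement about the data assuming the coin is fair; the second is a direct, prior-dependent statement about the coin itself, which is exactly the contrast the article is drawing.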

Education is not Moneyball

I initially embraced value-added methods of teacher evaluation, figuring that they could revolutionize education in the same way that sabermetricians revolutionized professional baseball. Over time, however, I realized that this analogy was somewhat flawed. There are lots of ways to analyze data, and the owners of baseball teams have a real motivation — they want to win ball games and sell tickets — to use data appropriately to ensure their best chance of success. I’m not so sure that the “owners” of public education — the politicians and ultimately the voters — share this motivation.

An excellent editorial on the contrasting use of statistics in baseball and in education appeared in Education Week: http://www.edweek.org/tm/articles/2014/08/27/fp_eger_valueadded.html?cmp=ENL-TU-NEWS1. I appreciate the tack that this editorial takes: the author is not philosophically opposed to sabermetric-like analysis of education but argues forcefully that, pragmatically, we’re not there yet.

Both the Gates Foundation and the Education Department have been advocates of using value-added models to gauge teacher performance, but my sense is that they are increasingly nervous about accuracy and fairness of the new methodology, especially as schools transition to the Common Core State Standards.

There are definitely grounds for apprehensiveness. Oddly enough, many of the reasons that the similarly structured WAR [Wins Above Replacement] works in baseball point to reasons why teachers should be skeptical of value-added models.

WAR works because baseball is standardized. All major league baseball players play on the same field, against the same competition with the same rules, and with a sizable sample (162 games). Meanwhile, public schools aren’t playing a codified game. They’re playing Calvinball—the only permanent rule seems to be that you can’t play it the same way twice. Within the same school some teachers have SmartBoards while others use blackboards; some have spacious classrooms, while others are in overcrowded closets; some buy their own supplies while others are given all they need. The differences across schools and districts are even larger.

The American Statistical Association released a brief report on value-added assessment that was devastating to its advocates. The ASA set out some caveats on the use of value-added measurement (VAM) which should give education reformers pause. Some quotes:

VAMs are complicated statistical models, and they require high levels of statistical expertise. Sound statistical practices need to be used when developing and interpreting them, especially when they are part of a high-stakes accountability system. These practices include evaluating model assumptions, checking how well the model fits the data, investigating sensitivity of estimates to aspects of the model, reporting measures of estimated precision such as confidence intervals or standard errors, and assessing the usefulness of the models for answering the desired questions about teacher effectiveness and how to improve the educational system.

VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.

Under some conditions, VAM scores and rankings can change substantially when a different model or test is used, and a thorough analysis should be undertaken to evaluate the sensitivity of estimates to different models.

VAMs should be viewed within the context of quality improvement, which distinguishes aspects of quality that can be attributed to the system from those that can be attributed to individual teachers, teacher preparation programs, or schools. Most VAM studies find that teachers account for about 1% to 14% of the variability in test scores, and that the majority of opportunities for quality improvement are found in the system-level conditions. Ranking teachers by their VAM scores can have unintended consequences that reduce quality.
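
To give students a feel for two of these caveats, the wide uncertainty around individual teacher estimates and the small share of score variance that teachers explain, here is a toy simulation. Everything in it (effect sizes, sample sizes, the bare-bones regression form) is invented for illustration and is not the ASA's model.

```python
# A toy simulation (my own sketch, not the ASA's model) illustrating two caveats:
# individual value-added estimates carry wide uncertainty, and teacher effects
# explain only a small share of test-score variance.  Every number below is invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_teachers, students_per_teacher = 20, 25

teacher = np.repeat(np.arange(n_teachers), students_per_teacher)
teacher_effect = rng.normal(0, 2, n_teachers)            # modest true teacher effects
prior_score = rng.normal(70, 10, teacher.size)           # students' prior achievement
post_score = (5 + 0.9 * prior_score                      # prior achievement dominates,
              + teacher_effect[teacher]
              + rng.normal(0, 8, teacher.size))          # along with student-level noise

df = pd.DataFrame({"post": post_score, "pre": prior_score,
                   "teacher": pd.Categorical(teacher)})

# A bare-bones "value-added" regression: post-test on pre-test plus teacher indicators.
fit = smf.ols("post ~ pre + C(teacher)", data=df).fit()

# Confidence intervals for the teacher coefficients are wide relative to the
# effects themselves, which is the precision caveat in the ASA statement.
print(fit.conf_int().filter(like="C(teacher)", axis=0).head())

# Share of post-test variance attributable to teachers in this simulation:
# typically a few percent, consistent with the 1% to 14% range quoted above.
print(f"teacher share of variance: {teacher_effect[teacher].var() / post_score.var():.1%}")
```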

Correlation and Causation: Index

I’m using the Twelve Days of Christmas (and perhaps a few extra days besides) to do something that I should have done a long time ago: collect past series of posts into a single, easy-to-reference post. The following posts formed my series on data sets that (hopefully) persuade students that correlation is not the same as causation.

Part 1: Piracy and global warming. Also, usage of Internet Explorer and murder.

Part 2: An xkcd comic.

Part 3: STEM spending and suicide. Consumption of margarine and divorce. Consumption of mozzarella and earning a doctorate. Marriage rates and deaths by drowning.
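
For anyone who wants a quick in-class demonstration to go with these posts, the sketch below fabricates two declining series, loosely inspired by the margarine-and-divorce example in Part 3, and shows how strongly they correlate despite being unrelated. The numbers are entirely made up.

```python
# A quick classroom demo in the spirit of the series: two made-up series that
# merely trend over the same years correlate strongly even though neither has
# anything to do with the other.
import numpy as np

rng = np.random.default_rng(1)
years = np.arange(2000, 2015)

margarine = 8.0 - 0.3 * (years - 2000) + rng.normal(0, 0.2, years.size)    # invented consumption figures
divorces = 5.0 - 0.2 * (years - 2000) + rng.normal(0, 0.15, years.size)    # invented divorce rates

print(f"correlation: {np.corrcoef(margarine, divorces)[0, 1]:.2f}")   # well above 0.9
```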

Mathematical Christmas gifts

Now that Christmas is over, I can safely share the Christmas gifts that I gave to my family this year thanks to Nausicaa Distribution (https://www.etsy.com/shop/NausicaaDistribution):

Euler’s equation pencil pouch:

Box-and-whisker snowflakes to hang on our Christmas tree:

And, for me, a wonderfully and subtly punny “Confidence and Power” T-shirt.

Thanks to FiveThirtyEight (see http://fivethirtyeight.com/features/the-fivethirtyeight-2014-holiday-gift-guide/) for pointing me in this direction.

For the sake of completeness, here are the math-oriented gifts that I received for Christmas:

Scientific research and false positives

I just read the following interesting article regarding the importance of replicating experiments to be sure that a result is scientifically valid: http://www.slate.com/articles/health_and_science/science/2014/07/replication_controversy_in_psychology_bullying_file_drawer_effect_blog_posts.single.html. This strikes me as an engaging way to introduce the importance of P-values to a statistics class. Among the many salient quotes:

Psychologists are up in arms over, of all things, the editorial process that led to the recent publication of a special issue of the journal Social Psychology. This may seem like a classic case of ivory tower navel gazing, but its impact extends far beyond academia. The issue attempts to replicate 27 “important findings in social psychology.” Replication—repeating an experiment as closely as possible to see whether you get the same results—is a cornerstone of the scientific method. Replication of experiments is vital not only because it can detect the rare cases of outright fraud, but also because it guards against uncritical acceptance of findings that were actually inadvertent false positives, helps researchers refine experimental techniques, and affirms the existence of new facts that scientific theories must be able to explain…

Unfortunately, published replications have been distressingly rare in psychology. A 2012 survey of the top 100 psychology journals found that barely 1 percent of papers published since 1900 were purely attempts to reproduce previous findings…

Since journal publications are valuable academic currency, researchers—especially those early in their careers—have strong incentives to conduct original work rather than to replicate the findings of others. Replication efforts that do happen but fail to find the expected effect are usually filed away rather than published. That makes the scientific record look more robust and complete than it is—a phenomenon known as the “file drawer problem.”

The emphasis on positive findings may also partly explain the fact that when original studies are subjected to replication, so many turn out to be false positives. The near-universal preference for counterintuitive, positive findings gives researchers an incentive to manipulate their methods or poke around in their data until a positive finding crops up, a common practice known as “p-hacking” because it can result in p-values, or measures of statistical significance, that make the results look stronger, and therefore more believable, than they really are.

I encourage teachers of statistics to read the entire article.
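
As a classroom follow-up to the p-hacking passage, here is a small simulation of the multiple-comparisons version of the problem; the study design below (20 unrelated outcomes, 30 subjects per group) is invented purely for illustration.

```python
# If a study measures 20 unrelated outcomes and every null hypothesis is true,
# the chance that at least one comparison comes out "significant" at the 0.05
# level is about 1 - 0.95**20, or roughly 64 percent.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_studies, n_outcomes, n_subjects = 2000, 20, 30

studies_with_false_positive = 0
for _ in range(n_studies):
    # Two groups drawn from the same population: there is no real effect anywhere.
    group_a = rng.normal(0.0, 1.0, (n_outcomes, n_subjects))
    group_b = rng.normal(0.0, 1.0, (n_outcomes, n_subjects))
    p_values = stats.ttest_ind(group_a, group_b, axis=1).pvalue
    if (p_values < 0.05).any():
        studies_with_false_positive += 1

print(f"simulated studies with at least one 'significant' result: "
      f"{studies_with_false_positive / n_studies:.0%}")   # about 64%
```

Selectively reporting whichever outcome happened to cross the threshold is how an inadvertent false positive ends up in the literature, and why replications so often fail to find the effect.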

Issues when conducting political polls

The classic application of confidence intervals is political polling: the science of sampling relatively few people to predict the opinions of a large population. However, in the 2010s, the art of political polling — constructing representative samples from a large population — has become more and more difficult.
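
For reference, here is the textbook calculation that all of this machinery supports, with made-up numbers. Everything below assumes the sample really is a simple random sample of the population, which is exactly the assumption that modern response rates put under strain.

```python
# A 95% confidence interval for a candidate's support, assuming a simple
# random sample.  The sample size and observed proportion are hypothetical.
from math import sqrt

n = 1000          # hypothetical number of respondents
p_hat = 0.52      # hypothetical share supporting the candidate

margin = 1.96 * sqrt(p_hat * (1 - p_hat) / n)
print(f"95% CI: {p_hat - margin:.3f} to {p_hat + margin:.3f}")   # roughly 0.489 to 0.551
```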

FiveThirtyEight.com wrote a recent article, Is The Polling Industry In Stasis Or In Crisis?, about the nuts and bolts of conducting a survey that should provide valuable background information for anyone teaching a course in statistics. From the opening paragraphs:

There is no shortage of reasons to worry about the state of the polling industry. Response rates to political polls are dismal. Even polls that make every effort to contact a representative sample of voters now get no more than 10 percent to complete their surveys — down from about 35 percent in the 1990s.

And there are fewer high-quality polls than there used to be. The cost to commission one can run well into five figures, and it has increased as response rates have declined. Under budgetary pressure, many news organizations have understandably preferred to trim their polling budgets rather than lay off newsroom staff.

Cheaper polling alternatives exist, but they come with plenty of problems. “Robopolls,” which use automated scripts rather than live interviewers, often get response rates in the low to mid-single digits. Most are also prohibited by law from calling cell phones, which means huge numbers of people are excluded from their surveys.

How can a poll come close to the outcome when so few people respond to it?
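
Part of the answer is weighting: pollsters adjust the respondents they do manage to reach so that the sample matches known population shares. The numbers in the sketch below are entirely invented and are not from the article; they just show why the adjustment matters.

```python
# Toy post-stratification example with invented numbers: young voters make up
# 30% of the electorate but only 10% of respondents, so their answers are
# weighted up to their true population share.
population_share = {"young": 0.30, "old": 0.70}    # known from outside sources (e.g., census data)
respondent_share = {"young": 0.10, "old": 0.90}    # who actually answered the poll
support = {"young": 0.60, "old": 0.45}             # candidate support among respondents, by group

unweighted = sum(respondent_share[g] * support[g] for g in support)   # naive average of respondents
weighted = sum(population_share[g] * support[g] for g in support)     # reweighted to the electorate

print(f"unweighted estimate: {unweighted:.1%}")   # 46.5%, skewed toward the groups that answer
print(f"weighted estimate:   {weighted:.1%}")     # 49.5%, closer to the true electorate
```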

Nuts and Bolts of Political Polls

A standard topic in my statistics class is political polling, which is the canonical example of constructing a confidence interval with a relatively small sample to (hopefully) project the opinions of a large population. Of course, polling is only valid if the sample represents the population at large. This is a natural engagement activity in the fall semester preceding a presidential election.

A recent article on FiveThirtyEight.com, “Are Bad Pollsters Copying Good Pollsters,” does a nice job of explaining some of the nuts and bolts of political polling in an age when selected participants are increasingly unlikely to participate… and also raises the specter of pollsters using nontraditional methods consciously or subconsciously cheating. A sample (pun intended) from the article:

What’s a nontraditional poll? One that doesn’t abide by the industry’s best practices. So, a survey is nontraditional if it:

  • doesn’t follow probability sampling;
  • doesn’t use live interviewers;
  • is released by a campaign or campaign groups (because these only selectively release data);
  • doesn’t disclose (i.e. doesn’t release raw data to the Roper Archives, isn’t a member of the National Council on Public Polls, or hasn’t signed onto the American Association for Public Opinion Research transparency initiative).

Everything else is a gold-standard poll…

Princeton University graduate student Steven Rogers and Vanderbilt University professor of political science Joshua Clinton [studied] interactive voice response (IVR) surveys in the 2012 Republican presidential primary. (IVR pollsters are in our nontraditional group.) Rogers and Clinton found that IVR pollsters were about as accurate as live-interview pollsters in races where live-interview pollsters surveyed the electorate. IVR pollsters were considerably less accurate when no live-interview poll was conducted. This effect held true even when controlling for a slew of different variables. Rogers and Clinton suggested that the IVR pollsters were taking a “cue” from the live pollsters in order to appear more accurate.

My own analysis hints at the same possibility. The nontraditional pollsters did worse in races without a live pollster.