Statistical Inference for the General Education student

From the opening and closing paragraphs:

Many mathematics departments around the country offer an introductory statistics course for the general education student. Typically these students come to the mathematics classroom with minimal skills in arithmetic and algebra. In addition it is not unusual for these students to have very poor attitudes toward mathematics.

With this target population in mind one can design courses of study, called statistics, that will differ radically depending on what priorities are held. Many people choose to teach arithmetic through statistics and thereby build most of the course around descriptive statistics with some combinatorics. Others build most of the course around combinatorics and probabilities with some descriptive statistics. Few courses offered at this level spend much time or effort on statistical inference.

We believe that for the general education student the ideas of statistical inference and the resulting decision rules are of prime importance. This belief is based on the assumption that general education courses are included in the curriculum in order to help students to gain an understanding of their own essence, of their relationship to others, of the world around them, and of how man goes about knowing.

If you inspect most of the texts on the market today, you will find that they generally require that a student spend approximately a semester of study of descriptive statistics and probability theory before attempting statistical inference. This makes it very difficult to get to the general education portion of the subject in the time allotted most general education courses. If you agree with the analysis of the problem to this point the logical question is ‘Is there a way to teach statistical inference without the traditional work in descriptive statistics and probability?’. The remainder of this article describes an approach that allows one to answer this question with a yes…

It should be pointed out that there are some unusual difficulties in this approach to statistics [since] one trades traditional weakness in arithmetic and algebra for deficiencies in writing since the write-ups of the simulations demand clear and logical exposition on the part of the student. However, if you feel that the importance of ‘statistics for the general education student’ lies in the areas of inference and decision rules, then you should try this approach. You will like it.

This article won the 1978 George Polya award for expository excellence. Several techniques described in this article would probably be carried out with computer simulation today, but the article is still well worth reading.


Welch’s formula

When conducting a hypothesis test or computing a confidence interval for the difference \overline{X}_1 - \overline{X}_2 of two means, where at least one mean arises from a small sample, the Student t distribution must be employed. In particular, the number of degrees of freedom for the Student t distribution must be computed. Many textbooks suggest using Welch’s formula:

df = \frac{\displaystyle (SE_1^2 + SE_2^2)^2}{\displaystyle \frac{SE_1^4}{n_1-1} + \frac{SE_2^4}{n_2-1}},

rounded down to the nearest integer. In this formula, SE_1 = \displaystyle \frac{\sigma_1}{\sqrt{n_1}} is the standard error associated with the first average \overline{X}_1, where \sigma_1 (if known) is the standard deviation of the first population and n_1 is the number of observations that are averaged to find \overline{X}_1. In practice, \sigma_1 is not known, and so the sample standard deviation is used in its place: \sigma_1 \approx s_1.

The terms SE_2 and n_2 are similarly defined for the average \overline{X}_2.
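As a minimal sketch of the computation described above, here is Welch's formula in Python, using the sample standard deviations s_1 and s_2 in place of the unknown \sigma_1 and \sigma_2. The function name `welch_df` and the sample values are made up purely for illustration.

```python
import math


def welch_df(s1, n1, s2, n2):
    """Welch's degrees of freedom from sample standard deviations and sizes."""
    se1_sq = s1**2 / n1  # SE_1^2, with s1 standing in for sigma_1
    se2_sq = s2**2 / n2  # SE_2^2, with s2 standing in for sigma_2
    numerator = (se1_sq + se2_sq) ** 2
    denominator = se1_sq**2 / (n1 - 1) + se2_sq**2 / (n2 - 1)
    return numerator / denominator


# Hypothetical data: s1 = 4 from 10 observations, s2 = 6 from 15 observations.
df = welch_df(s1=4.0, n1=10, s2=6.0, n2=15)
df_rounded = math.floor(df)  # round down, as the textbooks suggest
print(df, df_rounded)
```

Writing the formula in terms of SE_1^2 and SE_2^2 keeps the code visually close to the displayed fraction, which makes it easier to check by eye.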

In Welch’s formula, the term SE_1^2 + SE_2^2 in the numerator is equal to \displaystyle \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}. This is the square of the standard error SE_D associated with the difference \overline{X}_1 - \overline{X}_2, since

SE_D = \displaystyle \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}.

This leads to the “Pythagorean” relationship

SE_1^2 + SE_2^2 = SE_D^2,

which (in my experience) is a reasonable aid to help students remember the formula for SE_D.


Naturally, a big problem that students encounter when using Welch’s formula is that the formula is really, really complicated, and it’s easy to make a mistake when entering information into their calculators. (Indeed, it might be that the pre-programmed calculator function simply gives the wrong answer.) Also, since the formula is complicated, students don’t have a lot of psychological reassurance that, when they come out the other end, their answer is actually correct. So, when teaching this topic, I tell my students the following rule of thumb so that they can at least check if their final answer is plausible:

\min(n_1,n_2)-1 \le df \le n_1 + n_2 -2.

To my surprise, I have never seen this inequality in a statistics textbook, even though it’s quite simple to state and not too difficult to prove using techniques from first-semester calculus.
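Before turning to the proof, the rule of thumb can be spot-checked numerically. The sketch below tries a small, arbitrarily chosen grid of standard errors and sample sizes; the helper names `welch_df` and `within_bounds` and the tolerance are assumptions for illustration, and a passing check is evidence rather than a proof.

```python
from itertools import product


def welch_df(se1, n1, se2, n2):
    """Welch's degrees of freedom, written in terms of the standard errors."""
    numerator = (se1**2 + se2**2) ** 2
    denominator = se1**4 / (n1 - 1) + se2**4 / (n2 - 1)
    return numerator / denominator


def within_bounds(se1, n1, se2, n2, tol=1e-9):
    """Check min(n1, n2) - 1 <= df <= n1 + n2 - 2, allowing for rounding error."""
    df = welch_df(se1, n1, se2, n2)
    return min(n1, n2) - 1 - tol <= df <= n1 + n2 - 2 + tol


# Arbitrary test values, including very unbalanced standard errors.
standard_errors = [0.01, 0.5, 1.0, 3.0, 100.0]
sample_sizes = [2, 5, 12, 40]

all_ok = all(
    within_bounds(se1, n1, se2, n2)
    for se1, se2 in product(standard_errors, repeat=2)
    for n1, n2 in product(sample_sizes, repeat=2)
)
print(all_ok)
```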

Let’s rewrite Welch’s formula as

df = \left( \displaystyle \frac{1}{n_1-1} \left[ \frac{SE_1^2}{SE_1^2 + SE_2^2}\right]^2 + \frac{1}{n_2-1} \left[ \frac{SE_2^2}{SE_1^2 + SE_2^2} \right]^2 \right)^{-1}

For the sake of simplicity, let m_1 = n_1 - 1 and m_2 = n_2 -1, so that

df = \left( \displaystyle \frac{1}{m_1} \left[ \frac{SE_1^2}{SE_1^2 + SE_2^2}\right]^2 + \frac{1}{m_2} \left[ \frac{SE_2^2}{SE_1^2 + SE_2^2} \right]^2 \right)^{-1}

Now let x = \displaystyle \frac{SE_1^2}{SE_1^2 + SE_2^2}. All of these terms are nonnegative (and, in practice, they’re all positive), so that x \ge 0. Also, the numerator is no larger than the denominator, so that x \le 1. Finally, we notice that

1-x = 1 - \displaystyle \frac{SE_1^2}{SE_1^2 + SE_2^2} = \frac{SE_2^2}{SE_1^2 + SE_2^2}.

Using these observations, Welch’s formula reduces to the function

f(x) = \left( \displaystyle \frac{x^2}{m_1} + \frac{(1-x)^2}{m_2} \right)^{-1},

and the central problem is to find the maximum and minimum values of f(x) on the interval 0 \le x \le 1. Since f(x) is differentiable on [0,1], the absolute extrema can be found by checking the endpoints and the critical point(s).

First, the endpoints. If x=0, then f(0) = \left( \displaystyle \frac{1}{m_2} \right)^{-1} = m_2. On the other hand, if x=1, then f(1) = \left( \displaystyle \frac{1}{m_1} \right)^{-1} = m_1.

Next, the critical point(s). These are found by solving the equation f'(x) = 0:

f'(x) = -\left( \displaystyle \frac{x^2}{m_1} + \frac{(1-x)^2}{m_2} \right)^{-2} \left[ \displaystyle \frac{2x}{m_1} - \frac{2(1-x)}{m_2} \right] = 0

\displaystyle \frac{2x}{m_1} - \frac{2(1-x)}{m_2} = 0

\displaystyle \frac{2x}{m_1} = \frac{2(1-x)}{m_2}

xm_2= (1-x)m_1

xm_2 = m_1 - xm_1

x(m_1 + m_2) = m_1

x = \displaystyle \frac{m_1}{m_1 + m_2}

Plugging this critical point back into f(x), we find the value of f there:

f \left( \displaystyle \frac{m_1}{m_1+m_2} \right) = \left( \displaystyle \frac{1}{m_1} \frac{m_1^2}{(m_1+m_2)^2} + \frac{1}{m_2} \left[1-\frac{m_1}{m_1+m_2}\right]^2 \right)^{-1}

f \left( \displaystyle \frac{m_1}{m_1+m_2} \right) = \left( \displaystyle \frac{1}{m_1} \frac{m_1^2}{(m_1+m_2)^2} + \frac{1}{m_2} \left[\frac{m_2}{m_1+m_2}\right]^2 \right)^{-1}

f \left( \displaystyle \frac{m_1}{m_1+m_2} \right) = \left( \displaystyle \frac{m_1}{(m_1+m_2)^2} + \frac{m_2}{(m_1+m_2)^2} \right)^{-1}

f \left( \displaystyle \frac{m_1}{m_1+m_2} \right) = \left( \displaystyle \frac{m_1+m_2}{(m_1+m_2)^2} \right)^{-1}

f \left( \displaystyle \frac{m_1}{m_1+m_2} \right) = \left( \displaystyle \frac{1}{m_1+m_2} \right)^{-1}

f \left( \displaystyle \frac{m_1}{m_1+m_2} \right) = m_1+m_2

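Comparing the three candidate values f(0) = m_2, f(1) = m_1, and f \left( \displaystyle \frac{m_1}{m_1+m_2} \right) = m_1 + m_2, and noting that m_1 + m_2 exceeds both m_1 and m_2 because both are positive, it’s clear that the absolute minimum of f(x) on [0,1] is the smaller of m_1 and m_2, while the absolute maximum is equal to m_1 + m_2. In terms of the sample sizes, this says \min(n_1,n_2)-1 \le df \le n_1 + n_2 -2.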

\hbox{QED}
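The calculus argument can be sanity-checked by evaluating f(x) on a fine grid. This is only a numerical sketch: the values m_1 = 9 and m_2 = 14 and the grid spacing are arbitrary choices for illustration.

```python
def f(x, m1, m2):
    """Welch's formula as a function of x = SE_1^2 / (SE_1^2 + SE_2^2)."""
    return 1.0 / (x**2 / m1 + (1 - x) ** 2 / m2)


m1, m2 = 9, 14  # hypothetical values of n_1 - 1 and n_2 - 1

# Evaluate f at 1001 equally spaced points of [0, 1].
grid = [i / 1000 for i in range(1001)]
values = [f(x, m1, m2) for x in grid]

x_star = m1 / (m1 + m2)  # the critical point found above

print(f(0, m1, m2), f(1, m1, m2), f(x_star, m1, m2))
print(min(values), max(values))
```

The grid minimum and maximum should agree with the endpoint and critical-point values up to the resolution of the grid.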

In conclusion, I suggest offering the following guidelines to students to encourage their intuition about the plausibility of their answers:

  • If SE_1 is much smaller than SE_2 (i.e., x \approx 0), then df will be close to m_2 = n_2 - 1.
  • If SE_1 is much larger than SE_2 (i.e., x \approx 1), then df will be close to m_1 = n_1 - 1.
  • Otherwise, df could be as large as m_1 + m_2 = n_1 + n_2 - 2, but no larger.
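The three guidelines above can be illustrated with a few made-up cases. In this sketch the sample sizes n_1 = 10 and n_2 = 15 and the standard errors are hypothetical; the last case picks SE_1^2 / SE_2^2 = m_1 / m_2, which is where the proof shows the upper bound is attained.

```python
import math


def welch_df(se1, n1, se2, n2):
    """Welch's degrees of freedom, written in terms of the standard errors."""
    numerator = (se1**2 + se2**2) ** 2
    denominator = se1**4 / (n1 - 1) + se2**4 / (n2 - 1)
    return numerator / denominator


n1, n2 = 10, 15  # so m_1 = 9 and m_2 = 14

# SE_1 much smaller than SE_2: df should be close to n_2 - 1 = 14.
df_small = welch_df(0.01, n1, 1.0, n2)

# SE_1 much larger than SE_2: df should be close to n_1 - 1 = 9.
df_large = welch_df(1.0, n1, 0.01, n2)

# SE_1^2 : SE_2^2 = m_1 : m_2, so df reaches n_1 + n_2 - 2 = 23.
df_balanced = welch_df(math.sqrt(n1 - 1), n1, math.sqrt(n2 - 1), n2)

print(df_small, df_large, df_balanced)
```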