Finding the Regression Line without Calculus

Last month, my latest professional article, Deriving the Regression Line with Algebra, was published in the April 2017 issue of Mathematics Teacher (Vol. 110, Issue 8, pages 594-598). Although linear regression is commonly taught in high school algebra, the usual derivation of the regression line requires multivariable calculus. Accordingly, algebra students are typically taught the keystrokes for finding the line of best fit on a graphing calculator, with little conceptual understanding of how the line can be found.

In my article, I present an alternative way that talented Algebra II students (or, in principle, Algebra I students) can derive the line of best fit for themselves using only techniques that they already know (in particular, without calculus).

For copyright reasons, I’m not allowed to provide the full text of my article here, though subscribers to Mathematics Teacher should be able to read the article by clicking the above link. (I imagine that my article can also be obtained via inter-library loan from a local library.) That said, I am allowed to share a macro-enabled Microsoft Excel spreadsheet that I wrote that allows students to experimentally discover the line of best fit:

http://www.math.unt.edu/~johnq/ExploringTheLineofBestFit.xlsm

I created this spreadsheet so that students can explore (which is, after all, the first E of the 5-E model) the properties of the line of best fit. In this spreadsheet, students can enter a data set with up to 10 points and then experiment with different slopes and y-intercepts. As they experiment, the spreadsheet keeps track of the current sum of the squares of the residuals as well as the best guess attempted so far. After some experimentation, the spreadsheet can also provide the correct answer so that students can see how close they got to the right answer.
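For readers who would rather tinker in code than in Excel, here is a minimal Python sketch of the same kind of exploration. It is not the spreadsheet's macro code; the data points, the guesses, and the function names are all made up for illustration. It scores each guessed line by the sum of the squares of the residuals and then compares the best guess with the standard closed-form least-squares line.

```python
# Hypothetical data set (the spreadsheet accepts up to 10 points).
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]


def sum_squared_residuals(slope, intercept, data):
    """Sum of squared vertical distances from the points to y = slope*x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in data)


def least_squares_line(data):
    """Slope and intercept of the line of best fit (standard closed form)."""
    n = len(data)
    x_bar = sum(x for x, _ in data) / n
    y_bar = sum(y for _, y in data) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in data)
    s_xx = sum((x - x_bar) ** 2 for x, _ in data)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


# Hypothetical student guesses as (slope, y-intercept) pairs.
guesses = [(1.5, 1.0), (2.0, 0.5), (2.0, 0.0)]
best_guess = min(guesses, key=lambda g: sum_squared_residuals(g[0], g[1], points))

best_slope, best_intercept = least_squares_line(points)
print("best guess so far:", best_guess,
      "SSE =", sum_squared_residuals(best_guess[0], best_guess[1], points))
print("line of best fit:", (best_slope, best_intercept),
      "SSE =", sum_squared_residuals(best_slope, best_intercept, points))
```

No guess can produce a smaller sum of squared residuals than the line of best fit, which is exactly what students discover by experimenting.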

My Favorite One-Liners: Part 52

In this series, I’m compiling some of the quips and one-liners that I’ll use with my students to hopefully make my lessons more memorable for them. Today’s story is a continuation of yesterday’s post.

When I teach regression, I typically use this example to illustrate the regression effect:

Suppose that the heights of fathers and their adult sons both have mean 69 inches and standard deviation 3 inches. Suppose also that the correlation between the heights of the fathers and sons is 0.5. Predict the height of a son whose father is 63 inches tall. Repeat if the father is 78 inches tall.

Using the formula for the regression line

y = \overline{y} + r \displaystyle \frac{s_y}{s_x} (x - \overline{x}),

we obtain the equation

y = 69 + 0.5(x-69) = 0.5x + 34.5,

so that the predicted height of the son is 66 inches if the father is 63 inches tall. However, the prediction would be 73.5 inches if the father is 78 inches tall. As expected, tall fathers tend to have tall sons, and short fathers tend to have short sons. Then, I’ll tell my class:

However, to the psychological comfort of us short people, tall fathers tend to have sons who are not quite as tall, and short fathers tend to have sons who are not quite as short.

This was first observed by Francis Galton (see the Wikipedia article for more details), a particularly brilliant but aristocratic (read: snobbish) mathematician who had high hopes for breeding a race of super-tall people with the proper use of genetics, only to discover that the laws of statistics naturally prevented this from occurring. Defeated, he called this phenomenon “regression toward the mean,” and so we’re stuck with calling fitting data to a straight line “regression” to this day.

My Favorite One-Liners: Part 51

In this series, I’m compiling some of the quips and one-liners that I’ll use with my students to hopefully make my lessons more memorable for them.

When I teach regression, I typically use this example to illustrate the regression effect:

Suppose that the heights of fathers and their adult sons both have mean 69 inches and standard deviation 3 inches. Suppose also that the correlation between the heights of the fathers and sons is 0.5. Predict the height of a son whose father is 63 inches tall. Repeat if the father is 78 inches tall.

Using the formula for the regression line

y = \overline{y} + r \displaystyle \frac{s_y}{s_x} (x - \overline{x}),

we obtain the equation

y = 69 + 0.5(x-69) = 0.5x + 34.5,

so that the predicted height of the son is 66 inches if the father is 63 inches tall. However, the prediction would be 73.5 inches if the father is 78 inches tall.
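As a quick check of the arithmetic, here is a minimal Python sketch that plugs the numbers from the problem into the regression-line formula. The function name `predict` and its parameter names are my own labels for illustration.

```python
def predict(x, x_bar, y_bar, s_x, s_y, r):
    """Regression prediction: y = y_bar + r * (s_y / s_x) * (x - x_bar)."""
    return y_bar + r * (s_y / s_x) * (x - x_bar)


# Fathers and sons: mean 69 inches, standard deviation 3 inches, correlation 0.5.
for father in (63, 78):
    son = predict(father, x_bar=69, y_bar=69, s_x=3, s_y=3, r=0.5)
    print(f"father {father} in. -> predicted son {son} in.")
# father 63 in. -> predicted son 66.0 in.
# father 78 in. -> predicted son 73.5 in.
```

Both predictions land closer to the mean of 69 inches than the father's height does, which is the regression effect.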

To make this more memorable for students, I’ll observe:

As expected, tall fathers tend to have tall sons, and short fathers tend to have short sons. For example, my uncle was 6’6″. His two sons, my cousins, were 6’4″ and 6’5″ and were high school basketball stars.

My father was 5’3″. I became a math nerd.

Deceiving with Statistics

I really enjoyed a recent Math With Bad Drawings post on how descriptive statistics can be used to deceive. For example:

See the rest of the post for similar pictures for the mean, median, mode, and variance (equivalent to standard deviation); I’ll be using these in my future statistics classes.

Regression

Source: http://www.xkcd.com/1725/

Engaging students: Fitting data to a quadratic function

In my capstone class for future secondary math teachers, I ask my students to come up with ideas for engaging their students with different topics in the secondary mathematics curriculum. In other words, the point of the assignment was not to devise a full-blown lesson plan on this topic. Instead, I asked my students to think about three different ways of getting their students interested in the topic in the first place.

I plan to share some of the best of these ideas on this blog (after asking my students’ permission, of course).

This student submission again comes from my former student Loc Nguyen. His topic, from Algebra: fitting data to a quadratic function.


A1. What interesting (i.e., uncontrived) word problems using this topic can your students do now?

To engage students on this topic, I will give them real-life word problems so they can see how useful quadratic regression is for making predictions. The problem asks for the estimated number of AIDS cases diagnosed in 2006, but the data only cover 1999 to 2003. It will be the students’ job to make the prediction. I will provide instructions for the task and walk them through the process of finding the curve that best fits the given data. The best-fit curve will give us the estimate. Here is what the instructions look like:

quadraticdata

In the end, students will obtain the parabola that fits the given data. By working through real-life problems, they will be able to understand why mathematics is important and see how this concept is useful in their lives.
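To make the process concrete, here is a minimal Python sketch of quadratic regression. It is an illustration under stated assumptions, not the procedure from the linked instructions: the data values below are made up (they are not the actual case counts), x is taken to be years since 1999, and `fit_quadratic` is a hypothetical helper that solves the least-squares normal equations directly.

```python
def fit_quadratic(data):
    """Least-squares coefficients (a, b, c) of y = a*x^2 + b*x + c."""
    # Power sums that appear in the 3x3 normal equations.
    s = [sum(x ** k for x, _ in data) for k in range(5)]      # sums of x^0 .. x^4
    t = [sum(y * x ** k for x, y in data) for k in range(3)]  # sums of y*x^0 .. y*x^2
    # Augmented matrix with the unknowns ordered (a, b, c).
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [v - factor * p for v, p in zip(m[row], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


# Made-up data for illustration only: (years since 1999, number of cases).
data = [(0, 41.0), (1, 42.2), (2, 44.1), (3, 46.9), (4, 50.2)]
a, b, c = fit_quadratic(data)

# Extrapolate to 2006, which is x = 7 on this scale.
x_2006 = 7
print("fitted parabola:", a, b, c)
print("estimate for 2006:", a * x_2006 ** 2 + b * x_2006 + c)
```

A graphing calculator's quadratic regression command does the same job; the sketch only shows that nothing more mysterious than solving three equations in three unknowns is happening.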


B2. How does this topic extend what your students should have learned in previous courses?

Before getting into this topic, students should already be familiar with the word “quadratic,” as in quadratic function and quadratic equation. They should have been taught when the curve is concave up or concave down. In previous courses, students would have been given quadratic functions and asked to find the maxima, minima, or intercepts, or asked to solve a quadratic equation and find its roots. The properties of quadratic functions never change, so when students encounter quadratic regression, they should not be overwhelmed by the topic. There are no new rules or properties; the process just runs backward. Instead of being given the function, students have to find the function from the given data so that the curve fits the data. Their prior knowledge is essential for this topic, and it will help them understand quadratic regression more easily.

C1. How has this topic appeared in pop culture (movies, TV, current music, video games, etc.)?

At the beginning of class, I would like to show students a short video of a football incident.

This incident was really interesting. The Titans’ punt went so high that it hit the scoreboard in the Cowboys’ stadium. Surprisingly, this was the Cowboys’ new stadium. There were many questions about what was going on when the architects designed this stadium. It was supposed to be great. This incident revealed errors in predicting how high the scoreboard needed to be. The data collected in past years may have been incorrect. I want to incorporate this incident into the concept of quadratic regression. I will pose several questions such as:

Was the Titans’ punter really that powerful? What was really wrong in this situation?

When the architects designed this stadium, did they ever think that the ball would reach the ceiling?

Why did the architects fail to measure the height of the ceiling properly? Did they just assume the stadium was tall enough?

What was the path of the ball?

Students will eagerly respond to these questions, and I will slowly bring in the importance of quadratic regression. I will then explain how quadratic regression helps us predict the height based on data collected in past years.

References:

https://www.youtube.com/watch?v=V4N3LEi5a1Q

http://www.algebralab.org/Word/Word.aspx?file=Algebra_QuadraticRegression.xml

Engaging students: Approximating data by a straight line

In my capstone class for future secondary math teachers, I ask my students to come up with ideas for engaging their students with different topics in the secondary mathematics curriculum. In other words, the point of the assignment was not to devise a full-blown lesson plan on this topic. Instead, I asked my students to think about three different ways of getting their students interested in the topic in the first place.

I plan to share some of the best of these ideas on this blog (after asking my students’ permission, of course).

This student submission again comes from my former student Esmerelda Sheran. Her topic, from Algebra: approximating data by a straight line.


A.2) How could you as a teacher create an activity or project that involves your topic?

If I created an activity for my class on approximating data by a straight line, I would make sure the data they use is relevant or interesting to the students’ lives. I would have the students work in pairs and choose their data from three data sets that I have selected. Examples of the choices would be the relationship between interceptions and wins for NFL teams, car accidents and age, or attendance and GPA (in colleges/universities). Using the data they chose, the students would first make an educated guess of what the graph would look like, draw the scatter plot associated with the data, and compare their guess to the actual graph. At that point the students would try to identify the parent function (xb+c, mx+b, ab, ln(x) etc.) that the data most resembles, or whether the data is even correlated. They would then draw what they believe the best-fit line looks like on the scatter plot, which they would compare to the linear regression line once they calculated it on a graphing calculator. I would hope that this activity would be interesting because the data is real and relatable, and because it connects parent functions to statistical data.


D.1) What interesting things can you say about the people who contributed to the discovery and/or the development of this topic?

Two of the main contributors to linear regression are Sir Francis Galton and Karl Pearson. Galton discovered linear regression, and Pearson further elaborated on Galton’s ideas. Linear regression actually came to be because of sweet peas: Galton was studying heredity in sweet peas and formulated linear regression to help him study the relationships he found. Galton was much more than a hereditist; he was a geologist, meteorologist, tropical explorer, founder of differential psychology, inventor of fingerprint identification, and an author. A few more interesting things about Galton are that he was knighted, he was accused of promoting eugenics, he was British, and he was a half cousin of Charles Darwin. If you were wondering what “eugenics” is, it is the idea of planned breeding of humans through selective breeding and sterilization. Galton once said, “… I object to pretensions of natural equality.” Since Galton studied heredity, it is no wonder that he felt that some physical/mental/emotional attributes were superior and that humans would benefit from having the “best” genes. Unfortunately for Galton, eugenics was frowned upon and he was attacked for promoting it. I think that students would find Galton extremely interesting because of his wide variety of interests.

Karl Pearson, although not as complex as Galton, had a few attributes that I feel would interest students. Pearson did not have a childhood that would be considered normal today. He was homeschooled until he turned nine, and then he went to London alone to study at University College School. After he received his degrees and studied physics, metaphysics, and Darwinism, Pearson developed his own views on social Darwinism. The social beliefs he developed led him to change his name from Carl to Karl.

E.1) How can technology be used to effectively engage students with this topic?

Technology in the classroom has been and always will be an effective way to engage students if used correctly. To engage my students in learning how to approximate data with a straight line, I would use Excel, a smartboard, or the Khan Academy website. Excel is a useful piece of technology that is underappreciated by the average Joe. With a set of data, you can record the relationships, use the tools to create a scatter plot, and then find the linear regression line on the graph.

Using a smartboard in the classroom is effective because it is new technology that is very special and kind of rare. Using a smartboard to graph the data points and then draw an approximate regression line is highly kinesthetic and gives hands-on experience, instead of just typing in numbers and getting a calculated result that required almost no brain power. Kinesthetically moving their arms up, down, or side to side helps students get a feel for the variation and relationships in the data, and drawing a best-fit line themselves helps them understand the data on a different level. The Khan Academy website is a great resource for being introduced to, and even mastering, the concept of linear regression because of the different activities available. For visual and auditory learners, there is a series of videos that explain approximating data by linear regression as well as how to be the most accurate when approximating. Similarly, there is an activity for kinesthetic learners in which they can move a line around to see which line seems most like the best-fit line. It is beneficial for an instructor to use this website to help students of all learning types.

 


What Happens if the Explanatory and Response Variables Are Sorted Independently?

From the category “I Can’t Believe What I Just Read,” the following question was posed to a question-and-answer statistics board last month:

Suppose we have data set (X_i,Y_i) with n points. We want to perform a linear regression, but first we sort the X_i values and the Y_i values independently of each other, forming data set (X_i,Y_j). Is there any meaningful interpretation of the regression on the new data set? Does this have a name?

I imagine this is a silly question so I apologize, I’m not formally trained in stats. In my mind this completely destroys our data and the regression is meaningless. But my manager says he gets “better regressions most of the time” when he does this (here “better” means more predictive). I have a feeling he is deceiving himself.

The answers were priceless:

Your intuition is correct: the independently sorted data have no reliable meaning because the inputs and outputs are being randomly mapped to one another rather than what the observed relationship was.

There is a (good) chance that the regression on the sorted data will look nice, but it is meaningless in context.

And:

If you want to convince your boss, you can show what is happening with simulated, random, independent x,y data. With R:
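(The R code from that answer isn't reproduced here. As a rough stand-in, and not the answerer's own code, here is a minimal Python sketch of the same kind of simulation. The sample size, the seed, and the helper name `pearson_r` are my own choices. It generates independent x and y values, so any apparent relationship after sorting is manufactured by the sorting itself.)

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (s_xx * s_yy) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [rng.gauss(0, 1) for _ in range(1000)]
ys = [rng.gauss(0, 1) for _ in range(1000)]  # generated independently of xs

r_original = pearson_r(xs, ys)
# Sorting each list on its own destroys the original (x, y) pairing.
r_sorted = pearson_r(sorted(xs), sorted(ys))

print("correlation of the original pairs:", r_original)
print("correlation after independent sorting:", r_sorted)
```

The original pairs show essentially no correlation, while the independently sorted lists are almost perfectly correlated, even though x and y have nothing to do with each other.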

And:

This technique is actually amazing. I’m finding all sorts of relationships that I never suspected. For instance, I would have not have suspected that the numbers that show up in Powerball lottery, which it is CLAIMED are random, actually are highly correlated with the opening price of Apple stock on the same day! Folks, I think we’re about to cash in big time. 🙂

The sad end of the story, from the original poster:

Thank you for all of your nice and patient examples. I showed him the examples by @RUser4512 and @gung and he remains staunch. He’s becoming irritated and I’m becoming exhausted. I feel crestfallen. I want my work to mean something. I will probably begin looking for other jobs soon.

Engaging students: Approximating data by a straight line

In my capstone class for future secondary math teachers, I ask my students to come up with ideas for engaging their students with different topics in the secondary mathematics curriculum. In other words, the point of the assignment was not to devise a full-blown lesson plan on this topic. Instead, I asked my students to think about three different ways of getting their students interested in the topic in the first place.

I plan to share some of the best of these ideas on this blog (after asking my students’ permission, of course).

This student submission again comes from my former student Delaina Bazaldua. Her topic, from Algebra: approximating data by a straight line.


How has this topic appeared in pop culture (movies, TV, current music, video games, etc.)?

One of my favorite shows to watch is How I Met Your Mother. I specifically chose this topic for this class because of how it relates to an episode of the show. A piece of the episode that I’m referring to is shown in the YouTube video:

Barney, one of the main characters, describes the graph as the Crazy/Hot Scale. According to him, a girl cannot be crazier than hot which means she has to be above the diagonal straight line. This relates to the topic because one can approximate data by the straight line that Barney gives the viewer. I think the students will be able to relate to this and also find it humorous. Because this video has both of these characteristics, they will be able to remember the concept for upcoming homework and tests which is ultimately the most important part of math: understanding it and being able to recall it.

 

 


How has this topic appeared in the news?

Most lines are drawn to see whether there is a relationship between the x and y variables and to figure out whether you can approximate the data with the straight line that is drawn. Graphs like this are found all over the news, and they often relate to natural disasters. For example, this linear regression, http://d32ogoqmya1dw8.cloudfront.net/images/quantskills/methods/quantlit/bestfit_line.v2.jpg, describes floods. The page where the picture is found, http://serc.carleton.edu/mathyouneed/graphing/bestfit.html, describes more activities in which data can be approximated by a straight line. These straight lines can be used to estimate values that aren’t shown by the plotted points. The examples the website gives are flood frequency curves, earthquake forecasting, meteorite impact prediction, earthquake frequency vs. magnitude, and climate change. This is beneficial for math because it allows students to realize that math isn’t abstract, as it is often perceived to be; rather, it is used for something very important that occurs several times a year, such as natural disasters and weather.

How can this topic be used in your students’ future courses in mathematics or science?

One of the purposes of teaching is for students to learn what they need for the following year so they can be successful. When it comes to approximating data by a straight line, the knowledge a student learns in algebra will carry them through statistics, physics, and other higher math and science classes. Linear regression appears in statistics, as one can see on this statistics website: http://onlinestatbook.com/2/regression/intro.html, and in physics, as on this physics website: http://dev.physicslab.org/Document.aspx?doctype=3&filename=IntroductoryMathematics_DataAnalysisMethods.xml. A lot can be predicted from these straight lines, which is why these graphs aren’t foreign to upper-level math and science classes. As I stated before, the line lets students predict values that aren’t among the plotted data points, so they can anticipate what would occur at a particular x or y value. A background in this area can help students through the rest of school and perhaps even the rest of their lives in some cases.

References:

https://www.youtube.com/watch?v=uN_sSXKbzHk

http://serc.carleton.edu/mathyouneed/graphing/bestfit.html

http://onlinestatbook.com/2/regression/intro.html

http://dev.physicslab.org/Document.aspx?doctype=3&filename=IntroductoryMathematics_DataAnalysisMethods.xml

 

Growth Rate of Calculus Textbooks

GrowthRateOfCalculusTextbooks

Source: https://www.facebook.com/photo.php?fbid=589563657759289&set=a.250425975006394.53155.241224542593204&type=1&theater