(A Note for my SJC MAT 151 class – Summer 2018)

What we would have talked about on Friday, had we met for class, was Linear Regression.

Consider this Body Fat Data Set I found, which compares patients’ body fat measurements to other measurements on their bodies.  The first sheet in the Excel spreadsheet is simply all of the data I was given.  I was curious whether any of these variables were correlated with body fat in a significant way…

In the second sheet you will notice that I isolated Body Fat and Ankle (“size”?).  I’m not really sure what measurement they were using on the ankle and it really isn’t important for our purposes.  Anyway, if you check out a video like this one you should see that it is fairly simple to create the scatterplot with corresponding trendline and R^2 coefficient.

Let’s take a moment to understand how the trendline is chosen.

Choosing the Line of Best Fit

Suppose you are given a collection of data points with two variables.  For our purposes, let’s consider the third sheet of the spreadsheet above where the x variable is the Body Fat and the y variable is the Age of the individual.  Imagine that we proposed two different candidates for the line of best fit:

[Image: the red line and the blue line, each drawn through the same 5 data points in its own panel]
Figure 1: Two proposed lines for the same 5 data points.

To be clear, the 5 data points that the red line is passing through are supposed to be the same 5 data points that the blue line is passing through.  I just thought that separating the two pictures was clearer than drawing this:

[Image: the red line and the blue line drawn together on one plot]
Figure 2: A presumably more confusing version of Figure 1?

At any rate, here is how we figure out which line, red or blue, is the better fit:

  1. For each of the five points in the data set, find the vertical distance between the data point and the red line.
  2. Square each of these vertical distances.
  3. Add up these squares of vertical distances.  Label this total sum, SSR, for “sum of squares on the red line”.
  4. Complete steps 1-3 for the blue line and label this total sum, SSB.
  5. Whichever line has a smaller sum of squares, SSR for the red line, or SSB for the blue line, is the better fit.
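
If you would like to see steps 1-5 carried out, here is a minimal Python sketch.  The five points and the two candidate lines below are made-up numbers for illustration only; they are not the points from the figures or the spreadsheet.

```python
# Hypothetical data: five made-up (x, y) points, NOT taken from the spreadsheet.
points = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0), (5.0, 6.0)]


def sum_of_squares(slope, intercept, data):
    """Steps 1-3: add up the squared vertical distances from each point to the line."""
    total = 0.0
    for x, y in data:
        vertical_distance = y - (slope * x + intercept)  # step 1
        total += vertical_distance ** 2                  # steps 2 and 3
    return total


# Two made-up candidate lines standing in for "red" and "blue".
ssr = sum_of_squares(1.0, 1.0, points)  # red:  y = 1.0x + 1.0
ssb = sum_of_squares(0.5, 2.0, points)  # blue: y = 0.5x + 2.0

# Step 5: the smaller sum of squares is the better fit.
better = "red" if ssr < ssb else "blue"
print(f"SSR = {ssr:.2f}, SSB = {ssb:.2f}, better fit: {better}")
```

With these made-up numbers the red line has the smaller sum of squares, so it would be the better fit of the two.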

The beautiful thing is that mathematicians have figured out how to imagine all such possible lines, figure out their hypothetical SSX (i.e. sum of squares for line X), and then find out which of these potential lines has the minimum value for SSX.  If you wanted to learn this procedure, you would have to know some Calculus or Linear Algebra.
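
For the curious, the result of that Calculus is a pair of standard textbook formulas for the slope and intercept of the minimizing line.  Here is a minimal Python sketch of them, again on five made-up points (the data and the function name least_squares_line are just for illustration):

```python
def least_squares_line(data):
    """Return (slope, intercept) of the line with the smallest sum of squares."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    # Standard closed-form least-squares formulas for one x variable.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: five made-up (x, y) points, NOT taken from the spreadsheet.
points = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0), (5.0, 6.0)]

slope, intercept = least_squares_line(points)
print(f"best-fit line: y = {slope:.2f}x + {intercept:.2f}")
```

No other line through these five points can have a smaller sum of squared vertical distances than the one this returns.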

Understanding the Correlation Coefficient

While calculating the sum of the squares of the vertical distances, SSX, for a proposed line of best fit, mathematicians are also able to calculate a very handy number: R^2, the coefficient of determination (it is the square of the correlation coefficient, r).  Consider the Body Fat Data Set from above, where in the second sheet I had Excel compute a trend line between Body Fat (x) and Ankle (y) measurement.  Notice that it states on the chart: R^2 = 0.071.  We will understand what this means for this example and then I will give you a general statement below.  Suppose I measured my ankle, and I found that it measured “25”.  Since the trend line has the equation

y = 0.0583 x + 22,

solving this equation after plugging in y = 25 gives x \approx 51.458.  The problem is, since R^2 = 0.071, the trend line only accounts for about 7.1\% of the variation in the Ankle measurements, so this way of estimating my Body Fat from my Ankle measurement is not one I should trust!!
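
The arithmetic for that step, as a tiny Python sketch (the slope and intercept are the ones from the trend line equation; the variable names are only for illustration):

```python
slope = 0.0583     # from the trend line equation
intercept = 22.0   # from the trend line equation
ankle = 25.0       # the y value being plugged in

# Solve y = slope * x + intercept for x.
body_fat = (ankle - intercept) / slope
print(round(body_fat, 3))
```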

That’s the point: R^2 tells us what fraction of the variation in the y values is accounted for by the trend line, and so how useful this whole process will be in linking the two variables.  Since we can’t ever expect the points to all lie perfectly in a line, we can’t really expect this model to ever work perfectly.  However, a higher R^2 value (closer to 1) tells us that this linear model will be more useful.
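
To make R^2 less mysterious, here is a minimal Python sketch of the usual textbook definition: compare the best-fit line’s sum of squares with the sum of squares around the flat line y = (average of the y values).  The five points are made up for illustration, and I am assuming this standard definition matches what Excel reports for a linear trend line.

```python
def r_squared(data):
    """R^2 = 1 - (sum of squares around the best-fit line) / (sum of squares around the mean)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    # Best-fit (least-squares) slope and intercept.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Squared vertical distances to the line, and to the flat line y = mean_y.
    ss_line = sum((y - (slope * x + intercept)) ** 2 for x, y in data)
    ss_mean = sum((y - mean_y) ** 2 for _, y in data)
    return 1.0 - ss_line / ss_mean


# Hypothetical data: five made-up (x, y) points, NOT taken from the spreadsheet.
points = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0), (5.0, 6.0)]

r2 = r_squared(points)
print(f"R^2 = {r2:.2f}")
```

If the points sat exactly on the line, the first sum of squares would be 0 and R^2 would be 1; if the line did no better than the flat average, R^2 would be 0.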

Does anything predict Body Fat?

I spent a few minutes messing around with the data and didn’t find anything which correlated with body fat very well: Ankle measurement has an R^2 of 7.1%, Age has an R^2 of 8.4%; however, taking the difference of Abdomen and Hip gave an R^2 of 56.56% with Body Fat!  That’s still a pretty underwhelming percentage, but at least it’s better.  Can you find any better percentages from comparing other variables I had not considered yet?  Or maybe we shouldn’t even focus on Body Fat!?  Why was I looking at that anyway?  All psychological analysis of my priorities aside, can you find any other two variables in that data set that have an R^2 higher than 0.9?  Note: I didn’t check that such a pair exists; this is an open-ended question!
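
If you would rather not make a chart for every pair by hand, here is one way the search could be sketched in Python.  The column names (col_a, col_b, col_c) and their values are made up for illustration; you would have to replace them with the real columns from the spreadsheet.  The sketch uses the fact that, for a straight-line fit with one x variable, R^2 equals the square of the correlation coefficient.

```python
from itertools import combinations


def r_squared(xs, ys):
    """R^2 of the best-fit line relating ys to xs (the squared correlation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


# Hypothetical columns with made-up values, NOT the real spreadsheet data.
columns = {
    "col_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "col_b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "col_c": [5.0, 3.0, 6.0, 2.0, 4.0],
}

# Compute R^2 for every pair of columns and list the pairs from highest to lowest.
ranked = sorted(
    ((pair, r_squared(columns[pair[0]], columns[pair[1]]))
     for pair in combinations(columns, 2)),
    key=lambda item: item[1],
    reverse=True,
)

for (name_1, name_2), value in ranked:
    print(f"{name_1} vs {name_2}: R^2 = {value:.2f}")
```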

Why using the word “predict” is problematic

You should have heard this phrase before:

Correlation vs Causation

The point is that even if the R^2 between Body Fat and, let’s say, Eyebrow Size were 99%, who is to say that the body fat doesn’t cause a change in eyebrow size, rather than the other way around?  That is to say, just because these two variables are very strongly related does not mean that the mathematics reveals which is the “chicken” and which is the “egg” (which is a bad example because if saying which were the chicken and which were the egg were helpful then there probably wouldn’t be much controversy over which came first!).
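
One way to see that the mathematics is blind to direction: for a straight-line fit, R^2 comes out the same whether you treat the first variable as x and the second as y, or swap them.  Here is a minimal Python sketch on made-up numbers (the data and the function name fit_r_squared are just for illustration):

```python
def fit_r_squared(xs, ys):
    """Fit the best line predicting ys from xs, then return its R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_line = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_mean = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_line / ss_mean


# Hypothetical data: made-up numbers, NOT taken from the spreadsheet.
first = [1.0, 2.0, 3.0, 4.0, 5.0]
second = [2.0, 3.0, 5.0, 4.0, 6.0]

forward = fit_r_squared(first, second)   # first as x, second as y
backward = fit_r_squared(second, first)  # roles swapped

print(f"forward R^2 = {forward:.2f}, backward R^2 = {backward:.2f}")
```

The two trend lines are different lines, but they report the same R^2, so the number alone cannot tell you which variable is driving which.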

 
