Building a linear model of data


Suppose that we have the heights and weights of 17 football players, as follows:

    Table 1

              Height (in.)   Weight (lb.)
                   61            140
                   64            141
                   64            144
                   66            158
                   67            156
                   67            174
                   68            160
                   68            164
                   68            170
                   69            172
                   70            170
                   71            175
                   72            170
                   72            174
                   73            176
                   74            180
                   75            192
    Mean           68.76         165.65
    Median         68            170
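The summary rows of Table 1 can be checked with a few lines of Python. This is only a sketch: the variable names (players, heights, weights, summary) are our own, and the data are typed in by hand from the table.

```python
from statistics import mean, median

# Data copied by hand from Table 1: (height in inches, weight in pounds).
players = [
    (61, 140), (64, 141), (64, 144), (66, 158), (67, 156), (67, 174),
    (68, 160), (68, 164), (68, 170), (69, 172), (70, 170), (71, 175),
    (72, 170), (72, 174), (73, 176), (74, 180), (75, 192),
]

heights = [h for h, _ in players]
weights = [w for _, w in players]

# Round the means to two decimals, as Table 1 does.
summary = {
    "mean height": round(mean(heights), 2),
    "mean weight": round(mean(weights), 2),
    "median height": median(heights),
    "median weight": median(weights),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

The printed values should agree with the Mean and Median rows of Table 1.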

Going through this table of data as it stands, it's hard to 'tell a story' about the data.

So we do what statisticians like to do when they have a set of data -- we draw a graph to see whether it reveals 'a story' in the data.

Here is a plot of the data. Each point in the plot represents one of the players from Table 1. So, for example, the point at the far left of the graph (which is also the lowest point on the plot) is the shortest player. From the table we see that his height is 61 inches and his weight is 140 lb. Make sure that you see that the point (61, 140), which represents this player, is in about the right place in the plot.

Such a plot of data is often called a 'scatter plot'. Do you agree that this is a good name for such a graph?


With the help of the scatter plot for the data, we see a general 'story line'. That is: taller players (moving from left to right on the graph) tend to weigh more (moving from bottom to top).

Just to make sure that we are reading the graph correctly, notice the data point at the far right of the plot. This player's height is 75 inches and his weight is 192 lb.
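If you would like to draw a rough version of the scatter plot yourself, here is a small sketch that uses text characters only, so no plotting library is needed. The function name ascii_scatter and the choice of 12 rows are our own assumptions, and the sketch relies on the heights being whole numbers of inches (one column per inch).

```python
# Data copied by hand from Table 1: (height in inches, weight in pounds).
players = [
    (61, 140), (64, 141), (64, 144), (66, 158), (67, 156), (67, 174),
    (68, 160), (68, 164), (68, 170), (69, 172), (70, 170), (71, 175),
    (72, 170), (72, 174), (73, 176), (74, 180), (75, 192),
]


def ascii_scatter(points, rows=12):
    """Return the lines of a rough text scatter plot.

    Height runs left to right (one column per inch) and weight runs
    bottom to top. Assumes the x values are whole numbers.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    cols = x_max - x_min + 1
    grid = [[" "] * cols for _ in range(rows)]
    for x, y in points:
        col = x - x_min
        # Scale the weight onto the rows; row 0 is the top of the plot.
        row = (rows - 1) - round((y - y_min) / (y_max - y_min) * (rows - 1))
        grid[row][col] = "*"
    return ["".join(line) for line in grid]


plot_lines = ascii_scatter(players)
for line in plot_lines:
    print("|" + line)
print("+" + "-" * len(plot_lines[0]))
```

The star in the bottom-left corner is the player at (61, 140) and the star in the top-right corner is the player at (75, 192). Players who land in the same cell share a single star, so this is a reading aid rather than an exact plot.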


We can now ask questions of these data:

Examples:

For a given height, what weight could I expect?

What is the relationship between height and weight?

As the height increases, what is the corresponding increase in weight?

To answer these kinds of questions, it would be very helpful to have a rule that relates a player's height to his expected weight.


We make some assumptions:

Assumption 1:

Assume that the relationship between the two quantities (or variables), height of the players (X) and weight (Y), can be described by a straight line.

Let's remember that the equation of a straight line is:

   Y = m X + c.

In terms of height and weight, we would have:

    Weight = m * Height + c

where:

m = the slope of the line

c = the Y intercept (where the line crosses the Y axis).

In words, the rule is something like this:

Take the player's height ( X ).

Multiply it by some number (m).

Add (or subtract) some other number (c).

This gives you the weight ( Y ), based on our set of data.

(Nice, right?)

So what we are looking for is a relationship that says, for a given player's height, what weight can we expect?
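The rule in words translates almost directly into code. In the sketch below the function name line_rule is our own, and the values of m and c are made up purely for illustration; they have not been fitted to the data in Table 1.

```python
def line_rule(height, m, c):
    """Take the height, multiply it by m, then add c."""
    return m * height + c


# Made-up values, for illustration only (NOT fitted to Table 1).
example_m = 3.0
example_c = -40.0

shortest = line_rule(61, example_m, example_c)  # 3.0 * 61 - 40.0
tallest = line_rule(75, example_m, example_c)   # 3.0 * 75 - 40.0

print(f"height 61 in. -> {shortest} lb.")
print(f"height 75 in. -> {tallest} lb.")
```

A different choice of m and c gives a different line, and the job ahead is to find the choice that suits the data.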

Assumption 2:

We do not expect that we will find a line that will exactly fit our data. The real world does not usually (if ever) conform to rules of mathematics!

Recognizing this fact, we will settle for a slightly different rule that relates the X variable and the Y variable. We use:

    Y' = m X + c

where Y' (call it 'Y prime') is the predicted (or estimated) value for Y.

For this new rule we ask: for a player with a given height X, what is his predicted (or estimated) weight Y'?

In words, our rule is:

   Weight' (estimated weight) = m * height + c.

So now our strategy is to find a set of predicted values, Y', and compare them with the actual values, Y. The results of this comparison will help us decide whether our prediction rule is working well (that is, whether the rule fits the data).
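Here is one way that comparison might look in code. This is a sketch only: the function names are our own, the trial values of m and c are made up (not fitted), and taking the difference as actual minus predicted is just one possible way to compare the two.

```python
# Data copied by hand from Table 1: (height in inches, weight in pounds).
players = [
    (61, 140), (64, 141), (64, 144), (66, 158), (67, 156), (67, 174),
    (68, 160), (68, 164), (68, 170), (69, 172), (70, 170), (71, 175),
    (72, 170), (72, 174), (73, 176), (74, 180), (75, 192),
]


def predicted_weight(height, m, c):
    """The prediction rule Y' = m X + c."""
    return m * height + c


def differences(points, m, c):
    """Return (height, actual, predicted, actual - predicted) per player."""
    rows = []
    for height, actual in points:
        predicted = predicted_weight(height, m, c)
        rows.append((height, actual, predicted, actual - predicted))
    return rows


# Trial values chosen only for illustration; they are NOT fitted.
trial_m, trial_c = 3.0, -40.0

table = differences(players, trial_m, trial_c)
for height, actual, predicted, diff in table:
    print(f"{height:>3} in.  actual {actual:>3} lb.  "
          f"predicted {predicted:6.1f} lb.  difference {diff:+6.1f}")
```

Small differences up and down the table suggest that the trial rule fits well; large ones suggest that m or c should be changed.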

Assumption 3:

We assume that whatever the rule is, its graph will pass either through the mean (average) values for the X variable and for the Y variable OR the median values for the X and Y variables.

Whether we use the mean or the median values will depend on the type of analysis that we use to find the rule.

We'll call the mean (average) of the X values Xbar, and the mean of the Y values Ybar.
We will also call the median of the X values Xmed, and the median of the Y values Ymed.

That is, the predicted value Y' at Xbar is Ybar (or, at Xmed, it is Ymed):
   Ybar = m Xbar + c
OR,
   Ymed = m Xmed + c.

Is this a reasonable assumption?

Let us interpret this rule in terms of our sample data: It means that a player of average height will have average weight. This seems to make sense! (Doesn't it?)

For our data, Xbar = 68.76 in. and Ybar = 165.65 lb. That is, a player who is 68.76 inches tall is estimated to weigh 165.65 lb.

Similarly, Xmed = 68 in. and Ymed = 170 lb. That is, a player who is 68 inches tall is estimated to weigh 170 lb.
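Assumption 3 is useful because, once a slope m is chosen, it fixes c: rearranging Ybar = m Xbar + c gives c = Ybar - m Xbar (and likewise for the medians). The sketch below works this out for the trial slope m = 1 that the Exploration suggests. The function name intercept_through is our own, and m = 1 is a trial value, not a fitted one.

```python
def intercept_through(m, x0, y0):
    """Solve y0 = m * x0 + c for c, so the line passes through (x0, y0)."""
    return y0 - m * x0


m = 1.0  # trial slope, not a fitted value

# Line through the median point (Xmed, Ymed) = (68, 170).
c_median = intercept_through(m, 68, 170)

# Line through the mean point (Xbar, Ybar) = (68.76, 165.65).
c_mean = intercept_through(m, 68.76, 165.65)

print(f"through the medians: Y' = {m} X + {c_median}")
print(f"through the means:   Y' = {m} X + {c_mean:.2f}")
```

With m = 1, the line through the median point has c = 170 - 68 = 102, and the line through the mean point has c = 165.65 - 68.76 = 96.89.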


Exploration:

Use the spreadsheet you downloaded to answer the following questions.

For the data set, use m = 1 and assume that the line passes through the median point (on the spreadsheet, look at columns A - F).


Continue to the next section: A more formal approach to fitting a line to data.