Content code
m1378
Slug (identifier)
the-regression-line
Grades
Secondary IV
Topic
Mathematics
Tags
extrapolation
interpolation
linear regression
scatter plot
Mayer line
median-median line
Content
Contenu
Corps

When the scatter plot of a two-variable data distribution shows a linear correlation, the distribution can be modelled by a regression line, often called the “line of best fit.”

Content
Corps

A regression line is the line drawn through a scatter plot that best represents the two-variable distribution being studied.

Image
Scatter plot on a Cartesian plane modelled by a regression line.
Corps

A regression line is used to predict the value of one variable from the value of the other variable through extrapolation or interpolation.

There are several methods for finding the equation of a regression line.

Title (level 2)
The Graphing Method (freehand drawing)
Title slug (identifier)
graph-method
Contenu
Corps

As the name suggests, simply draw a line through the scatter plot so that there are approximately an equal number of points on each side of the line and the slope of the line matches the orientation of the scatter plot as closely as possible.

The rule of the drawn line can then be found using 2 points located on the line itself.

Content
Corps

The graphs below show the same scatter plot. In the first graph, the 3 lines drawn |(y_1,| |y_2,| and |y_3)| are not suitable regression lines, whereas the 2 lines in the second graph, |(y_4| and |y_5)|, can both be considered valid regression lines.

Columns number
2 columns
Format
50% / 50%
First column
Image
Graph showing 3 regression lines that do not best represent the scatter plot.
Corps

Line |y_1| is not a valid regression line, because it is almost constant, while the scatter plot shows a positive correlation (increasing).

Line |y_2| is not a valid regression line either, because its slope is much steeper than that of the scatter plot.

Line |y_3| has a slope similar to that of the scatter plot, but it does not have an equal number of points on both sides.

Second column
Image
Graph showing 2 regression lines that best fit the scatter plot.
Corps

Lines |y_4| and |y_5| are 2 lines that represent the trend of the scatter plot more accurately and have the same number of points on each side. It is possible to use either of these lines to make predictions.

Content
Corps

Although the freehand method is acceptable in most situations, it is the least accurate method. As seen in the previous example, 2 lines seem to be appropriate for the same scatter plot, even if they have slightly different slopes. For more accuracy, it is better to use another method such as the Mayer line or the median-median line methods.

Corps

The steps to follow to predict data from a scatter plot using the graphing method are as follows:

Content
Corps
  1. Place the points on a Cartesian plane.

  2. Sketch a regression line.

  3. Find 2 points that are situated on the regression line.

  4. Find the rule of the line using these 2 points.

  5. Predict a value using the rule.

Content
Corps

Following a survey of |16| Quebec families, total spending on sports and recreation was examined in relation to their household income. The following table of values shows the data collected.

Corps
Sports and Recreation Spending in Relation to Household Income
Household income
($/year)
|125\ 000| |65\ 000| |35\ 000| |145\ 000| |130\ 000| |80\ 000| |50\ 000| |40\ 000|
Spending on sports and recreation
($/year)
|10\ 000| |8\ 000| |1\ 000| |9\ 000| |8\ 000| |6\ 000| |4\ 000| |2\ 000|
Household income
($/year)
|90\ 000| |20\ 000| |75\ 000| |105\ 000| |100\ 000| |140\ 000| |150\ 000| |65\ 000|
Spending on sports and recreation
($/year)
|10\ 000| |500| |4\ 000| |6\ 000| |8\ 000| |13\ 000| |5\ 000| |5\ 000|
Corps

a) A family has an annual household income of | \$250\ 000.| If this family follows the same trend as the other Quebec families surveyed, how much do they budget for sports and recreation?

b) A family spends an average of | \$7500| a year on sports and recreation. What is their annual household income if they are a typical Quebec family?


  1. Place the points on a Cartesian plane.

The annual household income is placed on the |x| axis and spending on sports and recreation on the |y| axis.

Image
Scatter plot representing a positive correlation.
Corps
  2. Sketch a regression line.

Since there are |16| points in the scatter plot, we aim to leave |8| points on each side of the line. We also make sure that the slope of the line follows the orientation of the scatter plot as closely as possible.

Image
Scatter plot representing a positive correlation with a regression line.
Corps

Note: The line could have been drawn a little higher or a little lower. In that case, both the rule and the predictions would vary slightly.

  3. Find 2 points situated on the regression line.

Image
Scatter plot representing a positive correlation with a regression line and 2 points situated on the regression line.
Corps

Let’s choose the points |(10\ 000, 0)| and |(170\ 000, 12\ 000).|

  4. Find the rule of the line using these 2 points.

Since the model is linear, the rule will be of the form |y=ax+b.|

We begin by calculating the slope |(a).| ||\begin{align}a&=\dfrac{y_2-y_1}{x_2-x_1}\\&=\dfrac{12\ 000-0}{170\ 000-10\ 000}\\&=\dfrac{12\ 000}{160\ 000}\\&=\dfrac{3}{40}\\ &=0.075\end{align}|| Next, we replace |a| with |0.075| and the |x| and |y| variables with the coordinates of one of the 2 points, and isolate |b.| ||\begin{align}y&=ax+b\\ y&=0.075x+b\\0&=0.075(10\ 000)+b\\0&=750+b\\-750&=b \end{align}||Therefore, the rule of the regression line that was sketched freehand is |y=0.075x-750.|
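The two-point calculation above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `line_through` is an assumption introduced here and is not part of the original lesson.

```python
def line_through(p1, p2):
    """Return (a, b) for the line y = a*x + b passing through points p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("A vertical line has no rule of the form y = ax + b.")
    a = (y2 - y1) / (x2 - x1)  # slope (rate of change)
    b = y1 - a * x1            # y-intercept, isolated from y1 = a*x1 + b
    return a, b


# The two points read off the freehand line in the example.
a, b = line_through((10_000, 0), (170_000, 12_000))
print(f"y = {a:.3f}x {b:+.0f}")
```

Choosing two other points on the same drawn line would give the same rule, up to the precision with which the coordinates are read off the graph.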

  5. Predict a value using the rule.

a) A family has an annual household income of |\boldsymbol{\$250\ 000.}| If this family follows the same trend as the other Quebec families surveyed, how much do they budget for sports and recreation?

Since the household income in question | (\$250\ 000)| is outside the range studied (| \$20\ 000| to | \$150\ 000|), this is a prediction by extrapolation.

We replace the |x| variable with |250\ 000| in the regression line rule and complete the calculation. ||\begin{align}y&=0.075x-750\\y&=0.075(250\ 000)-750\\y&=18\ 750-750\\y&=\$ 18\ 000\ \end{align}||

Answer: A household with an annual income of | \$250\ 000| would spend approximately | \$18\ 000| on sports and recreation if it followed the same trend as the other Quebec families surveyed.

b) A family spends an average of |\boldsymbol{\$7\ 500}| a year on sports and recreation. What is their annual household income if they are a typical Quebec family?

This prediction is an interpolation because the annual budget for sports and recreation |( \$7500)| is within the interval studied |( \$500| to | \$13\ 000).| We can therefore estimate the annual household income of this family using the regression line.

We replace |y| with |7500| and isolate |x.| ||\begin{align} y &= 0.075x-750 \\ 7500 &= 0.075x-750 \\ 7500\boldsymbol{\color{#ec0000}{+750}} &= 0.075x-750\boldsymbol{\color{#ec0000}{+750}} \\ \color{#ec0000}{\dfrac{\color{black}{8250}}{\boldsymbol{0.075}}} &= \color{#ec0000}{\dfrac{\color{black}{0.075x}}{\boldsymbol{0.075}}} \\ \$110\ 000 &= x \end{align}||

Answer: If a household spends on average | \$7500| per year on sports and recreation, we can predict that the household income is about | \$110\ 000.|
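Both predictions use the same rule, read in one direction or the other. The sketch below shows this in Python; the function names `predict_spending` and `predict_income` are illustrative assumptions, and the constants come from the freehand rule |y=0.075x-750| found in this example.

```python
A, B = 0.075, -750  # rule of the freehand line: y = 0.075x - 750


def predict_spending(income):
    """Estimate y (spending) from x (income) with y = A*x + B."""
    return A * income + B


def predict_income(spending):
    """Estimate x (income) from y (spending) by isolating x: x = (y - B) / A."""
    return (spending - B) / A


print(round(predict_spending(250_000)))  # question a), extrapolation
print(round(predict_income(7_500)))      # question b), interpolation
```

The results are rounded to the nearest dollar because a prediction read from a regression line is only an estimate.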

In the graph, we can see that the regression line drawn does indeed pass through the point |(110\ 000, 7500).|

Image
A regression line used to interpolate data from a scatter plot.
Corps

Note: The same problem was solved in the median-median line and Mayer line concept sheets. In each case, comparable results were obtained.
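Step 2 of this example asked for 8 points on each side of the freehand line. As a quick sanity check, the following Python sketch counts the survey points above and below the line |y=0.075x-750.| The helper name `count_sides` is an assumption made for illustration.

```python
# Survey data from the table of values (household income, spending), in $/year.
incomes = [125_000, 65_000, 35_000, 145_000, 130_000, 80_000, 50_000, 40_000,
           90_000, 20_000, 75_000, 105_000, 100_000, 140_000, 150_000, 65_000]
spending = [10_000, 8_000, 1_000, 9_000, 8_000, 6_000, 4_000, 2_000,
            10_000, 500, 4_000, 6_000, 8_000, 13_000, 5_000, 5_000]


def count_sides(xs, ys, a, b):
    """Count the points strictly above, strictly below and exactly on y = a*x + b."""
    above = sum(1 for x, y in zip(xs, ys) if y > a * x + b)
    below = sum(1 for x, y in zip(xs, ys) if y < a * x + b)
    on_line = len(xs) - above - below
    return above, below, on_line


above, below, on_line = count_sides(incomes, spending, 0.075, -750)
print(above, below, on_line)
```

An equal split does not by itself make a line a good fit (line |y_3| in the earlier figure fails on this count, while |y_1| fails on slope), so this check only covers one of the two criteria of the graphing method.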

Title (level 2)
Technological Tools (spreadsheets)
Title slug (identifier)
spreadsheets
Contenu
Corps

Spreadsheets are software programs used to make tables and graphs. They are programmed to calculate correlation coefficients and to calculate the equation of the line or curve that best fits a given scatter plot.

Spreadsheets are therefore the quickest, most versatile, and most efficient way to find the rule of a regression line and to predict results thereafter.

Content
Corps

Following a survey of |16| Quebec families, total spending on sports and recreation was examined in relation to their household income. The following table of values shows the data collected.

Corps
Sports and Recreation Spending in Relation to Household Income
Household income
($/year)
|125\ 000| |65\ 000| |35\ 000| |145\ 000| |130\ 000| |80\ 000| |50\ 000| |40\ 000|
Spending on sports and recreation
($/year)
|10\ 000| |8\ 000| |1\ 000| |9\ 000| |8\ 000| |6\ 000| |4\ 000| |2\ 000|
Household income
($/year)
|90\ 000| |20\ 000| |75\ 000| |105\ 000| |100\ 000| |140\ 000| |150\ 000| |65\ 000|
Spending on sports and recreation
($/year)
|10\ 000| |500| |4\ 000| |6\ 000| |8\ 000| |13\ 000| |5\ 000| |5\ 000|
Corps

a) A family has an annual household income of | \$250\ 000.| If this family follows the same trend as the other Quebec families surveyed, how much do they budget for sports and recreation?

b) A family spends an average of | \$7500| a year on sports and recreation. What is their annual household income if they are a typical Quebec family?


  1. Transcribe the table of values into the spreadsheet.

Image
Overview of a table of values made with spreadsheet software.
Corps
  2. Draw the scatter plot.

By selecting the cells A1 to Q2, we can ask the software to plot the scatter plot for us in a couple of clicks. The result should look like this:

Image
Overview of a scatter plot made with a spreadsheet program.
Corps
  3. Sketch the regression line.

Clicking on the graph gives access to a number of options. One of them is to draw a trend curve. If you choose the linear model, which is often the first model proposed, you get this:

Image
Overview of a scatter plot with a linear trend line made with a spreadsheet program.
Corps
  4. Find the rule of the line.

We can also decide whether to display the equation of the line or not. In the previous diagram, it appears at the top of the graph.

Therefore, the rule for the regression line found using a spreadsheet is |y=0.0637x+586.|

  5. Predict a value using the rule.

Generally, software can perform the following calculations automatically. Simply enter an |x| or |y| value and ask the software to predict the value of the other variable using the rule it just calculated.

a) A family has an annual household income of |\boldsymbol{\$250\ 000.}| If this family follows the same trend as the other Quebec families surveyed, how much do they budget for sports and recreation?

We replace the |x| variable with |250\ 000| in the regression line rule and complete the calculation. ||\begin{align}y&=0.0637x+586\\y&=0.0637(250\ 000)+586\\y&=15\ 925+586\\y&=\$16\ 511\ \end{align}||

Answer: A household with an annual income of | \$250\ 000| would spend approximately | \$16\ 511| on sports and recreation if it followed the same trend as the other Quebec families surveyed.

b) A family spends an average of |\boldsymbol{\$7\ 500}| a year on sports and recreation. What is their annual household income if they are a typical Quebec family?

We replace |y| with |7500| in the regression line and isolate |x.|
||\begin{align} y &= 0.0637x+586 \\ 7500 &= 0.0637x+586 \\ 7500\boldsymbol{\color{#ec0000}{-586}} &= 0.0637x+586\boldsymbol{\color{#ec0000}{-586}} \\ \color{#ec0000}{\dfrac{\color{black}{6\ 914}}{\boldsymbol{0.0637}}} &= \color{#ec0000}{\dfrac{\color{black}{0.0637x}}{\boldsymbol{0.0637}}} \\ \$108\ 540\ &\approx x \end{align}||

Answer: If a household spends on average |\$7500| per year on sports and recreation, we can predict that the household income is about |\$108\ 540.|

Note: The same problem was solved in the median-median line and Mayer line concept sheets. In each case, comparable results were obtained.

Corps

A prediction made from a scatter plot is never perfectly accurate. This is why, when presenting our predictions, it is best to use the conditional tense. This is also why, when correcting such problems, a margin of error is accepted.

Furthermore, different software programs will not always give exactly the same regression line rule, since they do not necessarily use the same algorithm to calculate it. However, the most common method used by software to calculate a regression line is the least squares method. To find out how this works in practice, please see the following section on this algebraic method.

Contenu
Title
The Least Squares Method (the algebraic method)
Content
Corps

The method of least squares is an algebraic method that consists of finding the values of the regression line parameters |a| and |b| that minimize the sum of the squares of the vertical distances between the points of the scatter plot and the line, that is, the quantity |\sum\big(y_i-(ax_i+b)\big)^2.| To do so, the following formulas must be applied.

Content
Columns number
2 columns
Format
50% / 50%
First column
Corps

Parameter |\boldsymbol{a}| ||a=\dfrac{n\big(\sum x_iy_i\big)-\big(\sum x_i\big)\big(\sum y_i\big)}{n\big(\sum{x_i}^2\big)-\big(\sum{x_i}\big)^2}||

Second column
Corps

Parameter |\boldsymbol{b}| ||b=\overline{y}-a\overline{x}||

Corps

where
|a:| rate of change
|b:| y-intercept
|n:| number of points observed
|\overline{x}:| mean of the x-values
|\overline{y}:| mean of the y-values

|\sum| indicates that the expression that follows must be calculated for each point and the results added together.
|x_i| represents the |i^{\text{th}}| x-value
|y_i| represents the |i^{\text{th}}| y-value

Content
Corps

The points on the following Cartesian plane represent the coordinates of 8 houses in a new housing development. The development contractor wants to run an underground fibre optic cable as close as possible to each of these houses. Find the equation of the line that represents the optimal position of the cable.

Columns number
2 columns
Format
50% / 50%
First column
Corps
The Cartesian coordinates of 8 houses
House |\boldsymbol{x}| Coordinate |\boldsymbol{y}| Coordinate
A |10| |30|
B |25| |20|
C |50| |70|
D |65| |60|
E |120| |90|
F |40| |45|
G |80| |90|
H |100| |70|
Second column
Image
The location of houses on a Cartesian plane.
Corps

To help us use the formulas of the algebraic method, we start by completing the following table:

Corps
House |\boldsymbol{x}| Coordinate |\boldsymbol{y}| Coordinate |\boldsymbol{xy}| |\boldsymbol{x^2}|
A |10| |30| |300| |100|
B |25| |20| |500| |625|
C |50| |70| |3\ 500| |2\ 500|
D |65| |60| |3\ 900| |4\ 225|
E |120| |90| |10\ 800| |14\ 400|
F |40| |45| |1\ 800| |1\ 600|
G |80| |90| |7\ 200| |6\ 400|
H |100| |70| |7\ 000| |10\ 000|
Sum |\sum x_i=490| |\sum y_i=475| |\sum x_iy_i=35\ 000| |\sum{x_i}^2=39\ 850|
Mean |\begin{align}\overline{x}&=\dfrac{490}{8}\\&=61.25 \end{align}| |\begin{align}\overline{y}&=\dfrac{475}{8}\\&=59.375 \end{align}|    
Corps

We now apply the formula to calculate the rate of change |(a)| of the regression line. ||\begin{align}a&=\dfrac{n\big(\sum x_iy_i\big)-\big(\sum x_i\big)\big(\sum y_i\big)}{n\big(\sum{x_i}^2\big)-\big(\sum{x_i}\big)^2}\\&=\dfrac{8(35\ 000)-(490)(475)}{8(39\ 850)-(490)^2}\\ &=\dfrac{47\ 250}{78\ 700}\\ &\approx 0.6\end{align}||

We calculate the y-intercept |(b)| using the 2nd formula. ||\begin{align}b&=\overline{y}-a\overline{x}\\&=59.375-0.6(61.25)\\&\approx 22.6\end{align}||
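The same calculation can be written as a short Python sketch that applies the two formulas directly. The function name `least_squares` is an assumption chosen for illustration, and the data are the coordinates of the 8 houses.

```python
def least_squares(points):
    """Return (a, b) of the least squares line y = a*x + b for a list of (x, y) points."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    # Rate of change: a = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # y-intercept: b = mean(y) - a * mean(x)
    b = sum_y / n - a * (sum_x / n)
    return a, b


houses = [(10, 30), (25, 20), (50, 70), (65, 60),
          (120, 90), (40, 45), (80, 90), (100, 70)]
a, b = least_squares(houses)
print(f"a = {a:.4f}, b = {b:.4f}")
```

Because the code keeps the unrounded slope when computing |b|, its intercept differs slightly from the hand calculation, which used |a\approx 0.6|. Both values round to |22.6|.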

Columns number
2 columns
Format
50% / 50%
First column
Corps

The rule of the regression line is therefore
|y=0.6x+22.6.|

Therefore, by running the main fibre optic cable along this straight line, the contractor keeps the sum of the squared vertical distances between the houses and the cable as small as possible.

Second column
Image
The location of a fibre optic cable in a housing development.
Title (level 2)
Using a Regression Line to Make Predictions
Title slug (identifier)
using
Contenu
Corps

This concept sheet, as well as those on the Mayer and median-median lines, shows how to first find the rule of a regression line and then how to use this rule to make predictions.

Content
Corps

Making a prediction means finding a new value from the data used to establish a mathematical model. A prediction can be made through interpolation or extrapolation.

  • Interpolation consists of estimating the value of one of the 2 variables under study from a known value of the other variable that lies within the range covered by the sample data.

  • Extrapolation consists of predicting the value of one of the 2 variables under study from a known value of the other variable that lies outside the range covered by the sample data.
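The distinction between the two kinds of prediction comes down to comparing the known value with the range of the sample. A minimal Python sketch, using the survey data from the earlier example and a hypothetical helper named `prediction_type`, could look like this:

```python
def prediction_type(value, sample_values):
    """Classify a prediction from where the known value lies relative to the sample range."""
    low, high = min(sample_values), max(sample_values)
    return "interpolation" if low <= value <= high else "extrapolation"


# Survey data (household income and spending on sports and recreation), in $/year.
incomes = [125_000, 65_000, 35_000, 145_000, 130_000, 80_000, 50_000, 40_000,
           90_000, 20_000, 75_000, 105_000, 100_000, 140_000, 150_000, 65_000]
spending = [10_000, 8_000, 1_000, 9_000, 8_000, 6_000, 4_000, 2_000,
            10_000, 500, 4_000, 6_000, 8_000, 13_000, 5_000, 5_000]

print(prediction_type(250_000, incomes))  # known income of $250 000
print(prediction_type(7_500, spending))   # known spending of $7500
```

The known value is compared with the sample values of the same variable: an income is checked against the incomes surveyed, and a spending amount against the spending amounts surveyed.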

Corps

In general, interpolation is considered more reliable than extrapolation. This is because a scatter plot may appear to follow a linear pattern over the range studied when the phenomenon actually follows a quadratic, exponential or other pattern.

Image
Graph of the extrapolation of the braking distance of a car in relation to its speed.
Corps

In the graph above, the braking distance of a car is shown as a function of its speed. This phenomenon follows a quadratic pattern. You can see that over the interval covered by the scatter plot, that is, from |50| to |130\ \text{km/h},| the regression line that models the scatter plot and the actual curve are very close to each other. So, if we use the regression line to interpolate the braking distance of a car travelling between |50| and |130\ \text{km/h},| we get a value that is very close to the actual value.

On the other hand, if the regression line is used to predict the braking distance of a car travelling at |180\ \text{km/h},| by extrapolation, that is, beyond the range studied, the obtained value would be far from the real value.

In other words, just because a scatter plot shows a linear trend does not mean that the same trend continues beyond the scatter plot.
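This effect can be reproduced numerically. The sketch below fits a least squares line to points generated from a quadratic rule and compares the error of the line inside and outside the range studied. The rule `0.01 * speed ** 2` is a purely hypothetical model chosen for illustration, not measured braking data and not the data behind the graph above. The function names are also assumptions.

```python
def fit_line(xs, ys):
    """Least squares line y = a*x + b through the points (xs[i], ys[i])."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b


def braking_distance(speed):
    """HYPOTHETICAL quadratic model: distance (m) for a speed in km/h."""
    return 0.01 * speed ** 2


speeds = list(range(50, 131, 10))                  # range studied: 50 to 130 km/h
distances = [braking_distance(v) for v in speeds]
a, b = fit_line(speeds, distances)


def line_error(speed):
    """Absolute gap between the regression line and the quadratic model."""
    return abs((a * speed + b) - braking_distance(speed))


print(f"error at  90 km/h (interpolation): {line_error(90):.1f} m")
print(f"error at 180 km/h (extrapolation): {line_error(180):.1f} m")
```

With this hypothetical model, the gap at |180\ \text{km/h}| is roughly ten times larger than the gap at |90\ \text{km/h},| which illustrates why a linear trend should not be assumed to continue beyond the data.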

Title (level 2)
Comparison of the Different Methods
Title slug (identifier)
comparison
Contenu
Corps

The problem about spending on sports and recreation in relation to household income has been solved using several methods. In the following summary table, we can compare the different results obtained.

Corps

 

Rule obtained
|x:| Household income |(\$/\text{year})|
|y:| Spending on sports and recreation |(\$/\text{year})|

Extrapolation
Estimated annual spending on sports and recreation for a household income of |\$250\ 000\ /\text{year}|

Interpolation
Estimated annual household income for a sports and recreation spending of |\$7500\ /\text{year}|

The Graphing Method (freehand drawing)

|y=0.075x-750|

| \$18\ 000|

| \$110\ 000|

The Least Squares Method (algebraic method)

|y=0.0637x+586|

| \$16\ 511|

| \$108\ 540|

The Mayer Line

|y=0.07x+6|

| \$17\ 506|

| \$107\ 057|

The Median-Median Line

|y=0.07x-183|

| \$17\ 317|

| \$109\ 757|

Corps

We see that the rules are similar. The rates of change are very close, but it is the initial values that vary the most. The following graph shows the scatter plot and the 4 regression lines obtained using the different methods.

Image
Graph showing a scatter plot and the regression lines obtained using the 4 different methods.
Corps

For extrapolation, the estimates range from | \$16\ 511| to | \$18\ 000.| For interpolation, they range from | \$107\ 057| to | \$110\ 000.| If the scatter plot had a stronger correlation, that is, with a correlation coefficient |(r)| closer to |1,| the predictions would have been even closer together.

Finally, since there are no outliers in the data distribution, the Mayer line method is no less reliable than the others.
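The figures in the summary table can be recomputed from the four rules. The following Python sketch does so. The dictionary keys and the function name `summarize` are illustrative choices, while the coefficients are the rounded rules listed in the table.

```python
# Rules y = a*x + b from the summary table, stored as (a, b).
rules = {
    "Graphing method": (0.075, -750),
    "Least squares method": (0.0637, 586),
    "Mayer line": (0.07, 6),
    "Median-median line": (0.07, -183),
}


def summarize(rules, income, spending):
    """For each rule, return (estimated spending for income, estimated income for spending)."""
    rows = {}
    for name, (a, b) in rules.items():
        est_spending = round(a * income + b)    # y from x
        est_income = round((spending - b) / a)  # x from y
        rows[name] = (est_spending, est_income)
    return rows


table = summarize(rules, 250_000, 7_500)
for name, (est_spending, est_income) in table.items():
    print(f"{name:22} {est_spending:>8} {est_income:>8}")
```

Since the rules are entered with rounded coefficients, the output matches the table to the nearest dollar.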
 
