Chapter 5: Regression and Prediction

AP Statistics · Exploring Two-Variable Data · 3 interactive graphs · 8 practice problems

Learning Objectives

Find and use the equation of the least-squares regression line (LSRL)
Interpret the slope and y-intercept in context
Calculate and interpret residuals
Construct and analyze residual plots to check linearity
Calculate and interpret the coefficient of determination $r^2$
Identify dangers of extrapolation and influential observations

5.1 The Least-Squares Regression Line

The least-squares regression line (LSRL) is the line that minimizes the sum of the squared vertical distances (residuals) from each data point to the line.

LSRL Equation and Formulas

The LSRL has the form: $\hat{y} = a + bx$, where

$$b = r \cdot \frac{s_y}{s_x} \qquad \text{and} \qquad a = \bar{y} - b\bar{x}$$

$b$ = slope; $a$ = y-intercept
$r$ = correlation coefficient; $s_x, s_y$ = standard deviations; $\bar{x}, \bar{y}$ = means
The LSRL always passes through the point $(\bar{x}, \bar{y})$
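
As a rough illustration of these formulas, the Python sketch below computes $b$ and $a$ from raw data and checks that the fitted line passes through $(\bar{x}, \bar{y})$. The five data points and the helper name `fit_lsrl` are made up for this example and are not the chapter's study hours data.

```python
import math
import statistics


def fit_lsrl(xs, ys):
    """Return (a, b) for y-hat = a + b*x using b = r * s_y / s_x."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample standard deviations
    # Correlation coefficient: sum of products of deviations over (n - 1) * s_x * s_y.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data, chosen only so the arithmetic is easy to follow.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = fit_lsrl(xs, ys)
print(f"y-hat = {a:.2f} + {b:.2f}x")  # y-hat = 2.20 + 0.60x

# The fitted line evaluated at x-bar should return y-bar.
y_at_mean = a + b * statistics.mean(xs)
print(math.isclose(y_at_mean, statistics.mean(ys)))  # True
```

On the AP exam these values usually come from a calculator or computer output, so the code is only meant to show where the numbers come from.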

Interpreting the Slope and Intercept

The AP exam requires careful contextual interpretation:

Slope $b$: "For each additional [unit of x], the predicted [y] increases/decreases by $|b|$ [units of y], on average."
Intercept $a$: "When [x] = 0, the predicted [y] is $a$ [units of y]." (Only interpret if $x = 0$ is meaningful.)

Example 5.1 — Finding and Interpreting the LSRL

Using the study hours data from Chapter 4: $\bar{x} = 8.5$, $\bar{y} = 76.3$, $s_x = 5.07$, $s_y = 13.5$, $r = 0.993$.

$$b = 0.993 \cdot \frac{13.5}{5.07} \approx 2.64 \qquad a = 76.3 - 2.64(8.5) \approx 53.9$$

LSRL: $\hat{y} = 53.9 + 2.64x$

Slope: For each additional hour of study per week, the predicted exam score increases by about 2.64 points, on average.

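Intercept: A student who studies 0 hours is predicted to score about 53.9. (Borderline meaningful: a score of about 54 with no studying is plausible, but $x = 0$ lies just below the observed range of 2–18 hours, so treat this as a mild extrapolation.)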
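
The arithmetic in Example 5.1 can be reproduced with a few lines of Python. This is only a sketch, and the function name `lsrl_from_summary` is an illustrative choice. Note that the example rounds $b$ to 2.64 before computing $a$; carrying the unrounded slope gives an intercept closer to 53.8, so a difference in the last digit is just rounding.

```python
def lsrl_from_summary(r, s_x, s_y, x_bar, y_bar):
    """Slope and intercept of the LSRL from summary statistics."""
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b


# Summary statistics quoted in Example 5.1.
a, b = lsrl_from_summary(r=0.993, s_x=5.07, s_y=13.5, x_bar=8.5, y_bar=76.3)
print(round(b, 2))  # 2.64

# Intercept computed the way the example does it, from the rounded slope,
# next to the intercept computed from the unrounded slope.
a_from_rounded_b = 76.3 - round(b, 2) * 8.5
print(round(a_from_rounded_b, 1), round(a, 1))  # 53.9 53.8
```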

TRY IT

A regression of height (inches) on shoe size gives: $\hat{\text{height}} = 50.2 + 1.8(\text{shoe size})$. Interpret the slope in context.

Answer

For each one-unit increase in shoe size, the predicted height increases by 1.8 inches, on average.

Figure 5.1 — Least-Squares Regression Line with Residuals

5.2 Residuals and Residual Plots

Definition: Residual

A residual is the difference between an observed y-value and the predicted y-value from the LSRL:

$$\text{residual} = y - \hat{y} = \text{observed} - \text{predicted}$$

A positive residual means the point is above the line (actual > predicted). A negative residual means the point is below the line (actual < predicted).

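Important property: The sum of all residuals from the LSRL is always exactly 0, because the LSRL balances positive and negative residuals perfectly. (When residuals are computed from rounded coefficients, the total may differ from 0 by a small rounding error.)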
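
The zero-sum property can be checked numerically. The sketch below fits the LSRL to a small made-up data set (not the chapter's data), lists each residual as observed minus predicted, and confirms that the residuals add to 0 apart from floating-point error.

```python
import math

# Hypothetical data, used only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept (algebraically equivalent to b = r * s_y / s_x).
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

# residual = observed - predicted
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
for x, e in zip(xs, residuals):
    side = "above" if e > 0 else "below" if e < 0 else "on"
    print(f"x = {x}: residual = {e:+.2f} ({side} the line)")

# The residuals cancel out, up to floating-point error.
print(math.isclose(sum(residuals), 0.0, abs_tol=1e-9))  # True
```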

Residual Plots

A residual plot graphs the residuals ($y - \hat{y}$) on the vertical axis versus the explanatory variable ($x$) on the horizontal axis. Use residual plots to check whether a linear model is appropriate:

Good (linear model appropriate): Residuals scattered randomly with no pattern; residuals roughly centered at 0
Bad (curved pattern): A systematic curved pattern in residuals indicates a linear model is NOT appropriate — use a different model
Bad (fan shape): Increasing spread indicates non-constant variance
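
To see what a "bad" curved pattern looks like in numbers, the sketch below fits a line to made-up data that follow $y = x^2$ and prints the residuals in order of $x$. The data are hypothetical and chosen so the pattern is obvious: the residuals are positive at both ends and negative in the middle, a systematic U-shape rather than random scatter.

```python
# Hypothetical curved data: y = x^2, so a straight line is the wrong model.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [x ** 2 for x in xs]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

# residual = observed - predicted
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# A crude text version of a residual plot: one row per point, ordered by x.
for x, e in zip(xs, residuals):
    print(f"x = {x:>4}: residual = {e:+6.1f}")
```

A real residual plot would show the same information graphically; the printed list is used here only to keep the example free of plotting libraries.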

Example 5.2 — Computing Residuals

Using $\hat{y} = 53.9 + 2.64x$, find the residual for the student who studies 10 hours and scores 79.

$\hat{y} = 53.9 + 2.64(10) = 80.3$

Residual $= 79 - 80.3 = -1.3$

Interpretation: This student scored 1.3 points below what the LSRL predicted for someone who studies 10 hours.

TRY IT

A student studies 12 hours and scores 84. Using $\hat{y} = 53.9 + 2.64x$, find the residual and interpret it.

Answer

$\hat{y} = 53.9 + 2.64(12) = 85.6$. Residual $= 84 - 85.6 = -1.6$. This student scored 1.6 points below what the model predicted for 12 hours of study.

Residual plot for the study hours data: the residuals are scattered randomly around 0 with no clear pattern, which supports using a linear model.

Figure 5.2 — Residual Plot: No Systematic Pattern (Linear Model Appropriate)

5.3 The Coefficient of Determination r²

Definition: r² (Coefficient of Determination)

$r^2$ is the fraction of the variation in $y$ that is explained by the linear relationship with $x$:

$$r^2 = 1 - \frac{\text{SSE}}{\text{SST}} = \frac{\text{variation explained by } x}{\text{total variation in } y}$$

where SSE = sum of squared residuals, SST = total sum of squares around $\bar{y}$.
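
The definition can be traced through a short computation. The sketch below uses a small made-up data set (not the chapter's data) to compute SSE and SST directly, and then checks that $1 - \text{SSE}/\text{SST}$ agrees with the square of the correlation coefficient $r$.

```python
import math

# Hypothetical data, used only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least-squares line.
b = s_xy / s_xx
a = y_bar - b * x_bar

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # sum of squared residuals
sst = s_yy                                                 # total variation around y-bar
r_squared = 1 - sse / sst

# Correlation coefficient, for comparison.
r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r_squared, 3), round(r ** 2, 3))  # 0.6 0.6
```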

Interpreting r²

The standard AP template: "About [r² × 100]% of the variation in [y] is explained by the linear relationship with [x]."

Example 5.3 — Interpreting r²

For the study hours data, $r = 0.993$, so $r^2 = (0.993)^2 \approx 0.986$.

Interpretation: About 98.6% of the variation in exam scores is explained by the linear relationship with study hours. The remaining 1.4% is not explained by the linear model and reflects other factors (individual ability, test-day conditions, etc.).

AP Exam Tip: Always interpret $r^2$ using the phrase "explained by the linear relationship with..." Never say "explained by" alone without specifying the linear relationship. Also note $r^2$ is always between 0 and 1, regardless of the sign of $r$.

5.4 Regression Cautions: Extrapolation and Influential Points

Extrapolation

Extrapolation means using the LSRL to predict $y$ for values of $x$ outside the range of the data. Extrapolated predictions are often unreliable because the linear relationship may not hold beyond the observed data range.

Example 5.4 — Danger of Extrapolation

Using $\hat{y} = 53.9 + 2.64x$ for study hours, predict the score for someone who studies 40 hours per week.

$\hat{y} = 53.9 + 2.64(40) = 159.5$

An exam score of 159.5 is impossible (scores max at 100). Extrapolating far beyond the data range of 2–18 hours produces absurd predictions. The linear model breaks down outside its range.
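
One way to guard against this mistake is to flag any prediction whose $x$-value falls outside the observed range. The sketch below uses the LSRL and the 2–18 hour range from this example; the function name `predict_score` and the wording of the warning are illustrative assumptions.

```python
def predict_score(hours, a=53.9, b=2.64, x_min=2.0, x_max=18.0):
    """Return (prediction, is_extrapolation) for the study-hours LSRL."""
    y_hat = a + b * hours
    # Any x outside the observed range counts as extrapolation.
    is_extrapolation = not (x_min <= hours <= x_max)
    return y_hat, is_extrapolation


for hours in (10, 40):
    y_hat, flagged = predict_score(hours)
    note = "extrapolation: treat with caution" if flagged else "within data range"
    print(f"{hours} hours -> predicted {y_hat:.1f} ({note})")
```

The flag does not make the prediction wrong by itself; it only signals that the data give no evidence the linear pattern continues that far.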

Influential Observations in Regression

An influential observation in regression is a point whose removal would substantially change the LSRL slope or intercept. Points far from $\bar{x}$ in the x-direction have the most leverage and tend to be most influential.

AP Exam Tip: A point that is an outlier in y (high residual) is not necessarily influential. An influential point is one that, if removed, would significantly change the regression line. Always check both the residual plot and the scatterplot for unusual points.
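
The effect of a high-leverage point can be seen by fitting the LSRL with and without it. In the sketch below, the five base points and the added point $(15, 2)$ are made up for illustration; the added point sits far from the other $x$-values, and including it changes the slope from positive to negative.

```python
def slope_intercept(points):
    """Least-squares (a, b) for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in points)
         / sum((x - x_bar) ** 2 for x, _ in points))
    return y_bar - b * x_bar, b


# Hypothetical data with a positive linear trend.
base = [(1.0, 2.0), (2.0, 4.0), (3.0, 5.0), (4.0, 4.0), (5.0, 5.0)]
far_point = (15.0, 2.0)  # far from the other x-values, so it has high leverage

a_without, b_without = slope_intercept(base)
a_with, b_with = slope_intercept(base + [far_point])

print(f"without the point: y-hat = {a_without:.2f} + {b_without:.2f}x")
print(f"with the point:    y-hat = {a_with:.2f} + ({b_with:.2f})x")
```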

LSRL with an influential point in red — observe how far the line moves when the influential point is added.

Figure 5.3 — Influential Point Pulling the LSRL

Practice Problems

A regression of weight (kg) on height (cm) gives $\hat{y} = -80 + 0.65x$. Interpret the slope and decide whether the intercept is meaningful.

Solution

Slope: For each additional cm of height, predicted weight increases by 0.65 kg, on average. Intercept: $x = 0$ cm (height = 0) is not meaningful — no person has zero height, so $-80$ kg has no practical interpretation.

Given $r = -0.75$, $s_x = 4$, $s_y = 6$, $\bar{x} = 10$, $\bar{y} = 30$. Find the LSRL equation.

Solution

$b = -0.75 \cdot (6/4) = -1.125$. $a = 30 - (-1.125)(10) = 30 + 11.25 = 41.25$. LSRL: $\hat{y} = 41.25 - 1.125x$

A residual plot shows a clear curved pattern (points go low, then high, then low as x increases). What does this indicate?

Solution

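The curved pattern indicates that a linear model is not appropriate for these data. A nonlinear model (e.g., quadratic or exponential) would be more suitable. The LSRL predictions are systematically off: where the residuals are negative (here, at low and high $x$) the line overpredicts, and where they are positive (in the middle) it underpredicts.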

For a dataset, $r^2 = 0.64$. A student says "64% of the data points fall on the regression line." Correct this statement.

Solution

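The correct interpretation: "About 64% of the variation in y is explained by the linear relationship with x." It does NOT mean 64% of points are on the line. $r^2$ measures how much of the variation the line accounts for, not how many points it passes through; a data set can have a high $r^2$ even when no point lies exactly on the LSRL.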

LSRL: $\hat{y} = 12 + 3.5x$. An observed point has $x = 8$, $y = 38$. Find the residual and state whether the point is above or below the line.

Solution

$\hat{y} = 12 + 3.5(8) = 40$. Residual $= 38 - 40 = -2$. The point is below the line by 2 units (actual < predicted).

A researcher uses a regression line fit to data from students ages 8–16 to predict reading level for a 5-year-old. What concern does this raise?

Solution

Extrapolation: Age 5 is outside the range 8–16 used to fit the model. The linear relationship may not hold at age 5, making the prediction unreliable. Extrapolated predictions should not be trusted.

AP FRQ: A linear regression of test score ($y$) on hours tutored ($x$) gives $\hat{y} = 58 + 6.2x$ with $r^2 = 0.81$. (a) Predict the score for 5 hours tutored. (b) Interpret $r^2$. (c) A student tutored 5 hours scored 72. Find and interpret the residual.

Solution

(a) $\hat{y} = 58 + 6.2(5) = 89$. (b) About 81% of the variation in test scores is explained by the linear relationship with tutoring hours. (c) Residual $= 72 - 89 = -17$. This student scored 17 points below what the model predicted for 5 hours of tutoring.

Two datasets both have LSRL $\hat{y} = 5 + 2x$. Dataset A has $r^2 = 0.95$; Dataset B has $r^2 = 0.30$. What does this tell you about the two datasets?

Solution

Both datasets have the same regression line, but the linear model fits Dataset A much better. In Dataset A, about 95% of the variation in y is explained by the linear relationship with x, so the points cluster tightly around the line. In Dataset B, only about 30% is explained, so the points are much more scattered around the same line.

📋 Chapter Summary

Least-Squares Regression Line

LSRL Equation

$\hat{y} = a + bx$ where $b = r\dfrac{s_y}{s_x}$ and $a = \bar{y} - b\bar{x}$. The line always passes through $(\bar{x}, \bar{y})$.

Slope Interpretation

For each 1-unit increase in $x$, the predicted value $\hat{y}$ changes by $b$ units on average (an increase if $b > 0$, a decrease if $b < 0$). Always interpret in context with units.

Residuals

$e = y - \hat{y}$: the signed vertical distance from the LSRL to a point. Positive: point above line. Negative: point below line. $\sum e = 0$.

$r^2$ (Coefficient of Determination)

The fraction of variation in $y$ explained by the LSRL. $r^2 = 0.85$ means about 85% of the variation in $y$ is explained by the linear relationship with $x$.

Checking Conditions

Linear — scatterplot shows a linear trend; residual plot shows no pattern
Constant variance — spread of residuals roughly equal across all $x$ values
No strong outliers — high-leverage or influential points distort the line
Check $r^2$ — higher values indicate stronger linear fit

📘 Key Terms

LSRL: Least-squares regression line, the line minimizing the sum of squared residuals $\sum(y-\hat{y})^2$.

Residual: $e = y - \hat{y}$, the difference between an observed value and the value predicted by the regression line.

$r^2$: Coefficient of determination, the proportion of variation in $y$ explained by the linear relationship with $x$.

Residual Plot: A graph of residuals vs. $x$ or $\hat{y}$. A random scatter indicates a good linear fit.

Influential Point: A point that, if removed, would markedly change the slope or intercept of the LSRL.

Extrapolation: Using the LSRL to predict $y$ for $x$ values far outside the data range; often unreliable.
