Chapter 5: Regression and Prediction
Learning Objectives
- Find and use the equation of the least-squares regression line (LSRL)
- Interpret the slope and y-intercept in context
- Calculate and interpret residuals
- Construct and analyze residual plots to check linearity
- Calculate and interpret the coefficient of determination $r^2$
- Identify dangers of extrapolation and influential observations
5.1 The Least-Squares Regression Line
The least-squares regression line (LSRL) is the line that minimizes the sum of the squared vertical distances (residuals) from each data point to the line.
LSRL Equation and Formulas
The LSRL has the form: $\hat{y} = a + bx$, where
$$b = r \cdot \frac{s_y}{s_x} \qquad \text{and} \qquad a = \bar{y} - b\bar{x}$$
- $b$ = slope; $a$ = y-intercept
- $r$ = correlation coefficient; $s_x, s_y$ = standard deviations; $\bar{x}, \bar{y}$ = means
- The LSRL always passes through the point $(\bar{x}, \bar{y})$
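The formulas above can be sketched in Python. This is a minimal illustration using only the standard library; the data values and the helper names `correlation` and `lsrl` are invented for the example and do not come from this chapter.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

def lsrl(xs, ys):
    """Return (a, b) for y-hat = a + b*x using b = r*s_y/s_x and a = y_bar - b*x_bar."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

# Hypothetical data, invented for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = lsrl(xs, ys)
print(f"y-hat = {a:.2f} + {b:.2f}x")

# The fitted line should pass through (x_bar, y_bar), up to floating-point error
print(abs((a + b * mean(xs)) - mean(ys)) < 1e-9)
```

The final check follows directly from $a = \bar{y} - b\bar{x}$: substituting $x = \bar{x}$ gives $\hat{y} = \bar{y} - b\bar{x} + b\bar{x} = \bar{y}$.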
Interpreting the Slope and Intercept
The AP exam requires careful contextual interpretation:
- Slope $b$: "For each additional [unit of x], the predicted [y] increases/decreases by $|b|$ [units of y], on average."
- Intercept $a$: "When [x] = 0, the predicted [y] is $a$ [units of y]." (Only interpret if $x = 0$ is meaningful.)
Example 5.1 — Finding and Interpreting the LSRL
Using the study hours data from Chapter 4: $\bar{x} = 8.5$, $\bar{y} = 76.3$, $s_x = 5.07$, $s_y = 13.5$, $r = 0.993$.
$$b = 0.993 \cdot \frac{13.5}{5.07} \approx 2.64 \qquad a = 76.3 - 2.64(8.5) \approx 53.9$$
LSRL: $\hat{y} = 53.9 + 2.64x$
Slope: For each additional hour of study per week, the predicted exam score increases by about 2.64 points, on average.
Intercept: A student who studies 0 hours is predicted to score about 53.9. (Borderline meaningful — a score of ~54 with no studying is plausible.)
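The arithmetic in Example 5.1 can be checked with a few lines of Python. This sketch uses only the summary statistics quoted in the example; the variable names are our own. It also shows a rounding detail: the example rounds the slope to 2.64 before computing the intercept, whereas carrying the unrounded slope gives a slightly different intercept.

```python
# Summary statistics quoted in Example 5.1
x_bar, y_bar = 8.5, 76.3
s_x, s_y = 5.07, 13.5
r = 0.993

b_full = r * s_y / s_x             # unrounded slope
b = round(b_full, 2)               # slope rounded to two decimals, as in the example
a = y_bar - b * x_bar              # intercept from the rounded slope
a_full = y_bar - b_full * x_bar    # intercept from the unrounded slope

print(f"b = {b:.2f}")
print(f"a (rounded slope)   = {a:.1f}")
print(f"a (unrounded slope) = {a_full:.1f}")
```

The two intercepts differ by less than 0.1 point, so either is acceptable for prediction here, but it is good practice to state which convention was used.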
A regression of height (inches) on shoe size gives: $\hat{\text{height}} = 50.2 + 1.8(\text{shoe size})$. Interpret the slope in context.
Figure 5.1 — Least-Squares Regression Line with Residuals
5.2 Residuals and Residual Plots
Definition: Residual
A residual is the difference between an observed y-value and the predicted y-value from the LSRL:
$$\text{residual} = y - \hat{y} = \text{observed} - \text{predicted}$$
A positive residual means the point is above the line (actual > predicted). A negative residual means the point is below the line (actual < predicted).
Residual Plots
A residual plot graphs the residuals ($y - \hat{y}$) on the vertical axis versus the explanatory variable ($x$) on the horizontal axis. Use residual plots to check whether a linear model is appropriate:
- Good (linear model appropriate): Residuals scattered randomly with no pattern; residuals roughly centered at 0
- Bad (curved pattern): A systematic curved pattern in residuals indicates a linear model is NOT appropriate — use a different model
- Bad (fan shape): Increasing spread indicates non-constant variance
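To see what a "curved pattern" looks like numerically, the sketch below fits a line to deliberately curved data and prints the sign of each residual. The data ($y = x^2$) and the helper `fit_line` are hypothetical, chosen only to make the pattern obvious.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Hypothetical, deliberately nonlinear data: y = x^2
xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# One character per point, in order of increasing x
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(signs)
```

For these values the signs come out as `+----+`: positive at both ends and negative in the middle. A run of same-signed residuals like this is the systematic curve that signals a linear model is not appropriate, even though the residuals still sum to zero.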
Example 5.2 — Computing Residuals
Using $\hat{y} = 53.9 + 2.64x$, find the residual for the student who studies 10 hours and scores 79.
$\hat{y} = 53.9 + 2.64(10) = 80.3$
Residual $= 79 - 80.3 = -1.3$
Interpretation: This student scored 1.3 points below what the LSRL predicted for someone who studies 10 hours.
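The same calculation can be written as a small Python sketch. The function names `predict` and `residual` are our own; the coefficients are those of the LSRL from Example 5.1.

```python
def predict(x, a=53.9, b=2.64):
    """Predicted exam score from the Example 5.1 line y-hat = 53.9 + 2.64x."""
    return a + b * x

def residual(x, y_observed):
    """Residual = observed - predicted."""
    return y_observed - predict(x)

e = residual(10, 79)
position = "above" if e > 0 else "below"
print(f"residual = {e:.1f} (point is {position} the line)")
```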
A student studies 12 hours and scores 84. Using $\hat{y} = 53.9 + 2.64x$, find the residual and interpret it.
Residual plot for the study hours data: the residuals are scattered randomly around 0, which supports using a linear model.
Figure 5.2 — Residual Plot: No Systematic Pattern (Linear Model Appropriate)
5.3 The Coefficient of Determination r²
Definition: r² (Coefficient of Determination)
$r^2$ is the fraction of the variation in $y$ that is explained by the linear relationship with $x$:
$$r^2 = 1 - \frac{\text{SSE}}{\text{SST}} = \frac{\text{variation explained by } x}{\text{total variation in } y}$$
where SSE = sum of squared residuals, SST = total sum of squares around $\bar{y}$.
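As a rough check of this definition, the sketch below computes $r^2$ as $1 - \text{SSE}/\text{SST}$ and compares it with the square of the correlation. The data are hypothetical and the variable names are our own.

```python
from statistics import mean

# Hypothetical data, invented for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least-squares line
b = sxy / sxx
a = y_bar - b * x_bar

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # sum of squared residuals
sst = syy                                                  # total variation around y-bar

r_squared = 1 - sse / sst
r = sxy / (sxx * syy) ** 0.5   # correlation coefficient

print(f"1 - SSE/SST = {r_squared:.3f}")
print(f"r squared   = {r ** 2:.3f}")
```

Both lines print the same value for the LSRL, which is why $r^2$ can be found either from the sums of squares or by squaring $r$.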
Interpreting r²
The standard AP template: "About [r² × 100]% of the variation in [y] is explained by the linear relationship with [x]."
Example 5.3 — Interpreting r²
For the study hours data, $r = 0.993$, so $r^2 = (0.993)^2 \approx 0.986$.
Interpretation: About 98.6% of the variation in exam scores is explained by the linear relationship with study hours. The remaining 1.4% is not accounted for by the linear model and reflects other factors (individual ability, test-day conditions, etc.).
AP Exam Tip: Always interpret $r^2$ using the phrase "explained by the linear relationship with..." Never say "explained by" alone without specifying the linear relationship. Also note $r^2$ is always between 0 and 1, regardless of the sign of $r$.
5.4 Regression Cautions: Extrapolation and Influential Points
Extrapolation
Extrapolation means using the LSRL to predict $y$ for values of $x$ outside the range of the data. Extrapolated predictions are often unreliable because the linear relationship may not hold beyond the observed data range.
Example 5.4 — Danger of Extrapolation
Using $\hat{y} = 53.9 + 2.64x$ for study hours, predict the score for someone who studies 40 hours per week.
$\hat{y} = 53.9 + 2.64(40) = 159.5$
An exam score of 159.5 is impossible (scores max at 100). Extrapolating far beyond the data range of 2–18 hours produces absurd predictions. The linear model breaks down outside its range.
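One practical safeguard is to have a prediction routine flag inputs outside the observed range. The sketch below is one possible way to do this, not a standard procedure; the function name `predict_score` and the constants are our own, with the range 2 to 18 hours taken from the example above.

```python
import warnings

# Observed range of study hours, as stated in Example 5.4
X_MIN, X_MAX = 2, 18

def predict_score(hours, a=53.9, b=2.64):
    """Predict a score from y-hat = 53.9 + 2.64x, warning on extrapolation."""
    if not X_MIN <= hours <= X_MAX:
        warnings.warn(
            f"x = {hours} is outside the observed range "
            f"{X_MIN} to {X_MAX}; this prediction is an extrapolation."
        )
    return a + b * hours

# Inside the data range: no warning is issued
print(round(predict_score(10), 1))
```

Calling `predict_score(40)` still returns 159.5, but it also emits a warning, which makes the extrapolation visible instead of silently reporting an impossible score.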
Influential Observations in Regression
An influential observation in regression is a point whose removal would substantially change the LSRL slope or intercept. Points far from $\bar{x}$ in the x-direction have the most leverage and tend to be most influential.
AP Exam Tip: A point that is an outlier in y (large residual) is not necessarily influential. An influential point is one that, if removed, would significantly change the regression line. Always check both the residual plot and the scatterplot for unusual points.
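The difference between a high-leverage point and a y-outlier can be illustrated with a short Python sketch. All data values here are hypothetical, and `fit_line` is a helper written for this example.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Hypothetical data lying close to y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

a_base, b_base = fit_line(xs, ys)

# Case 1: add a point far from x-bar in the x-direction (high leverage)
a_lev, b_lev = fit_line(xs + [15], ys + [5.0])

# Case 2: add a point at x = x-bar that sits far above the line
# (large residual, but no leverage)
a_out, b_out = fit_line(xs + [3], ys + [12.0])

print(f"original:        slope {b_base:.2f}, intercept {a_base:.2f}")
print(f"high leverage:   slope {b_lev:.2f}, intercept {a_lev:.2f}")
print(f"y-outlier at x̄:  slope {b_out:.2f}, intercept {a_out:.2f}")
```

With these made-up numbers, the high-leverage point drops the slope from about 1.97 to about 0.07. The y-outlier placed at $\bar{x}$ leaves the slope at about 1.97 and only raises the intercept by 1, even though its residual is large.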
LSRL with an influential point in red — observe how far the line moves when the influential point is added.
Figure 5.3 — Influential Point Pulling the LSRL
Practice Problems
A regression of weight (kg) on height (cm) gives $\hat{y} = -80 + 0.65x$. Interpret the slope and decide whether the intercept is meaningful.
Given $r = -0.75$, $s_x = 4$, $s_y = 6$, $\bar{x} = 10$, $\bar{y} = 30$. Find the LSRL equation.
A residual plot shows a clear curved pattern (points go low, then high, then low as x increases). What does this indicate?
For a dataset, $r^2 = 0.64$. A student says "64% of the data points fall on the regression line." Correct this statement.
LSRL: $\hat{y} = 12 + 3.5x$. An observed point has $x = 8$, $y = 38$. Find the residual and state whether the point is above or below the line.
A researcher uses a regression line fit to data from students ages 8–16 to predict reading level for a 5-year-old. What concern does this raise?
AP FRQ: A linear regression of test score ($y$) on hours tutored ($x$) gives $\hat{y} = 58 + 6.2x$ with $r^2 = 0.81$. (a) Predict the score for 5 hours tutored. (b) Interpret $r^2$. (c) A student tutored 5 hours scored 72. Find and interpret the residual.
Two datasets both have LSRL $\hat{y} = 5 + 2x$. Dataset A has $r^2 = 0.95$; Dataset B has $r^2 = 0.30$. What does this tell you about the two datasets?
Chapter Summary
Least-Squares Regression Line
$\hat{y} = a + bx$ where $b = r\dfrac{s_y}{s_x}$ and $a = \bar{y} - b\bar{x}$. The line always passes through $(\bar{x}, \bar{y})$.
Slope: For each 1-unit increase in $x$, the predicted value $\hat{y}$ changes by $b$ units (an increase if $b > 0$, a decrease if $b < 0$). Always interpret in context with units.
Residual: $e = y - \hat{y}$, the signed vertical distance from a point to the LSRL. Positive: point above line. Negative: point below line. For the LSRL, $\sum e = 0$.
$r^2$: The fraction of variation in $y$ explained by the LSRL. $r^2 = 0.85$ means about 85% of the variation in $y$ is explained by the linear relationship with $x$.
Checking Conditions
- Linear — scatterplot shows a linear trend; residual plot shows no pattern
- Constant variance — spread of residuals roughly equal across all $x$ values
- No strong outliers — high-leverage or influential points distort the line
- Check $r^2$ — higher values indicate stronger linear fit