Chapter 4: Scatterplots and Correlation
Learning Objectives
- Construct and interpret scatterplots for two quantitative variables
- Describe the direction, form, and strength of an association
- Calculate and interpret the correlation coefficient $r$
- Identify the properties and limitations of $r$
- Distinguish correlation from causation; identify lurking variables
- Identify influential points and outliers in scatterplots
4.1 Constructing and Interpreting Scatterplots
A scatterplot displays the relationship between two quantitative variables measured on the same individuals. One variable is plotted on the horizontal axis (explanatory variable) and the other on the vertical axis (response variable).
Definition: Explanatory and Response Variables
The explanatory variable (independent variable) may help explain or predict changes in the response variable (dependent variable). When examining a relationship, place the explanatory variable on the x-axis and the response variable on the y-axis.
When describing a scatterplot, always address four characteristics using the acronym DOFS:
- Direction — positive, negative, or no association
- Outliers — any points that deviate from the overall pattern
- Form — linear, curved, or no clear form
- Strength — how closely points follow the form (weak, moderate, strong)
Example 4.1 — Describing a Scatterplot
A study records study hours per week (x) and exam score (y) for 10 students:
| Hours (x) | 2 | 4 | 5 | 7 | 8 | 10 | 12 | 14 | 15 | 18 |
|---|---|---|---|---|---|---|---|---|---|---|
| Score (y) | 55 | 62 | 65 | 71 | 74 | 79 | 84 | 87 | 91 | 95 |
Description: There is a strong, positive, linear association between study hours and exam score. No outliers are apparent. As study hours increase, exam scores tend to increase as well.
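As a rough illustration of how such a plot is constructed, the sketch below places each (hours, score) pair on a character grid using only the Python standard library. The function name `text_scatter` and the grid size are assumptions made for this example; in practice a plotting library would normally be used instead.

```python
# Study-hours data from Example 4.1
hours = [2, 4, 5, 7, 8, 10, 12, 14, 15, 18]
scores = [55, 62, 65, 71, 74, 79, 84, 87, 91, 95]

def text_scatter(xs, ys, width=40, height=12):
    """Return a crude text scatterplot: x runs left to right, y bottom to top."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Rescale each value to a column (explanatory) or row (response) index
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    return ["".join(line) for line in grid]

plot_lines = text_scatter(hours, scores)
for line in plot_lines:
    print("|" + line)                      # vertical axis: exam score
print("+" + "-" * len(plot_lines[0]))      # horizontal axis: study hours
```

The marks rise from the lower left to the upper right, which is the positive direction described above.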
A dataset shows that as temperature (°F) increases from 60 to 100, ice cream sales increase. However, at 105°F, sales drop sharply. How would you describe the form and any outliers?
Scatterplot of study hours vs. exam score — observe direction, form, and strength.
Figure 4.1 — Study Hours vs. Exam Score
4.2 Measuring Correlation: The Coefficient r
The correlation coefficient $r$ measures the direction and strength of a linear association between two quantitative variables.
Formula: Pearson Correlation Coefficient
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
- $n$ = number of data pairs
- $\bar{x}, \bar{y}$ = sample means of $x$ and $y$
- $s_x, s_y$ = sample standard deviations of $x$ and $y$
- $r$ is always between $-1$ and $1$: $-1 \le r \le 1$
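The formula can be translated almost term by term into code. The following is a minimal sketch using Python's standard `statistics` module, applied to the study-hours data from Example 4.1; the function name `correlation` is an assumption chosen for this illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r: sum of products of standardized scores, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Study-hours data from Example 4.1
hours = [2, 4, 5, 7, 8, 10, 12, 14, 15, 18]
scores = [55, 62, 65, 71, 74, 79, 84, 87, 91, 95]

r = correlation(hours, scores)
print(round(r, 3))  # 0.993
```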
Interpreting r
- $r > 0$: positive association (as $x$ increases, $y$ tends to increase)
- $r < 0$: negative association (as $x$ increases, $y$ tends to decrease)
- $r = 0$: no linear association
- $|r|$ close to 1: strong linear association; close to 0: weak linear association
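AP Exam Tip: $r$ has no units and does not change when the units of either variable change: adding a constant to all values of a variable, multiplying them by a positive constant, or switching $x$ and $y$ leaves $r$ the same. Multiplying one variable by a negative constant reverses the sign of $r$ but leaves $|r|$ unchanged. Always state direction, strength, and form when interpreting $r$ in context.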
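These invariance properties can be checked numerically. The sketch below reuses the study-hours data from Example 4.1 and an illustrative `correlation` helper (an assumption of this example, not a library function) to compare $r$ before and after several transformations.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r: sum of products of standardized scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

hours = [2, 4, 5, 7, 8, 10, 12, 14, 15, 18]
scores = [55, 62, 65, 71, 74, 79, 84, 87, 91, 95]

r = correlation(hours, scores)
r_shifted = correlation(hours, [y + 10 for y in scores])  # add a constant to y
r_scaled = correlation([60 * x for x in hours], scores)   # hours -> minutes
r_swapped = correlation(scores, hours)                    # switch x and y
r_negated = correlation([-x for x in hours], scores)      # negative multiplier

print(round(r, 3), round(r_shifted, 3), round(r_scaled, 3), round(r_swapped, 3))
print(round(r_negated, 3))  # same magnitude as r, opposite sign
```

The first four values agree up to floating-point rounding; only the negative multiplier changes the sign.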
Example 4.2 — Calculating and Interpreting r
For the study hours data above, $r \approx 0.993$. Interpret this value.
Interpretation: There is a strong, positive, linear association between study hours and exam score ($r = 0.993$). Students who study more hours tend to score higher on exams.
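For readers who want to see where the value comes from, the arithmetic can be outlined as follows, using the data from Example 4.1 with $n = 10$ (intermediate values are rounded):

$$\bar{x} = 9.5, \quad \bar{y} = 76.3, \quad s_x = \sqrt{\frac{244.5}{9}} \approx 5.212, \quad s_y = \sqrt{\frac{1566.1}{9}} \approx 13.191$$

$$\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = 614.5$$

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) = \frac{614.5}{9 \times 5.212 \times 13.191} \approx \frac{614.5}{618.8} \approx 0.993$$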
A researcher finds $r = -0.82$ between hours of TV watched per day and GPA. Interpret this value in context.
Scatterplots for different values of the correlation coefficient $r$: strong vs. weak, positive vs. negative.
Figure 4.2 — Visualizing Correlation Strength
4.3 Properties and Cautions for r
While $r$ is a powerful measure, it has important limitations that appear frequently on the AP exam:
Key Properties of r
- Linear only: $r$ measures only linear association. A strong curved relationship may have $r \approx 0$.
- Not resistant: $r$ is strongly affected by outliers and influential points.
- No units: $r$ is a pure number, not expressed in any unit.
- Symmetric: Switching $x$ and $y$ does not change $r$.
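- Scale-free: Adding a constant to all values of a variable, or multiplying them by a positive constant, does not change $r$. A negative multiplier reverses the sign of $r$.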
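The "linear only" property is easy to see with a small made-up dataset. In the sketch below, $y$ is determined exactly by $x$ through $y = x^2$, yet $r$ comes out as essentially zero because the relationship is a symmetric curve rather than a line. The data and the `correlation` helper are assumptions made for this illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r: sum of products of standardized scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Hypothetical data: a perfect U-shaped (quadratic) relationship
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curved = correlation(xs, ys)
print(abs(r_curved) < 1e-9)  # r is zero up to rounding despite the exact relationship
```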
4.3.1 Influential Points and Outliers
An influential point is one whose removal would substantially change $r$ or the regression line. Points whose $x$-values lie far from the rest of the data tend to be highly influential. An outlier in a scatterplot is a point that falls far from the overall pattern of the rest of the data.
Example 4.3 — Effect of an Outlier on r
Suppose one student in the study hours data studies 2 hours but scores 98 (far above the pattern). Because this point pairs a low $x$ with a high $y$, it runs against the positive trend and lowers $r$ substantially. In general, an outlier can raise or lower $r$ depending on where it lies relative to the pattern, so always check the scatterplot visually before relying on $r$ alone.
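The size of the effect can be checked directly. The sketch below assumes the unusual student is added to the Example 4.1 data as an eleventh point (the example could also be read as replacing the existing 2-hour student, which would give a different value); the `correlation` helper is illustrative.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r: sum of products of standardized scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

hours = [2, 4, 5, 7, 8, 10, 12, 14, 15, 18]
scores = [55, 62, 65, 71, 74, 79, 84, 87, 91, 95]

r_without = correlation(hours, scores)
# Assumption: the outlier (2 hours, score 98) is an additional, eleventh student
r_with = correlation(hours + [2], scores + [98])

print(round(r_without, 3))  # 0.993
print(round(r_with, 2))     # about 0.61
```

Under this assumption, a single point is enough to move the association from very strong to moderate, which is why $r$ is described as not resistant.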
4.4 Correlation Does Not Imply Causation
A strong correlation between two variables does not prove that one causes the other. There are three possible explanations for an observed association:
Explanations for Association
- Direct causation: $x$ directly causes changes in $y$.
- Common response (lurking variable): A third variable $z$ causes both $x$ and $y$ to change, creating the appearance of an association.
- Confounding: The effect of $x$ on $y$ is mixed up with the effect of other variables.
A lurking variable is a variable that influences both $x$ and $y$ but is not included in the analysis.
Example 4.4 — Lurking Variables
Studies show a strong positive correlation between the number of firefighters at a fire and the amount of fire damage. Does sending more firefighters cause more damage?
No. The lurking variable is fire size. Larger fires attract more firefighters AND cause more damage. Firefighters don't cause damage — fire size drives both variables.
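A small simulation can make the common-response explanation concrete. In the sketch below, all numbers are hypothetical: fire size is generated at random, and both the firefighter count and the damage depend only on fire size plus independent noise. The coefficients, the noise levels, and the `correlation` helper are assumptions made for illustration.

```python
import random
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r: sum of products of standardized scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

random.seed(4)  # fixed seed so the sketch is reproducible
n = 2000
fire_size = [random.uniform(1, 10) for _ in range(n)]  # lurking variable z

# Neither variable depends on the other: each is fire size plus its own noise
firefighters = [2 * z + random.gauss(0, 1) for z in fire_size]
damage = [5 * z + random.gauss(0, 2) for z in fire_size]

r_overall = correlation(firefighters, damage)

# Subtract the part of each variable that fire size explains (known here
# because the data are simulated), then correlate what is left over
r_given_size = correlation(
    [f - 2 * z for f, z in zip(firefighters, fire_size)],
    [d - 5 * z for d, z in zip(damage, fire_size)],
)

print(round(r_overall, 2))     # strongly positive
print(round(r_given_size, 2))  # close to 0
```

Once fire size is accounted for, the association between firefighters and damage essentially disappears, even though the raw correlation is strong.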
There is a strong positive correlation between shoe size and reading ability in elementary school children. Is shoe size a cause of reading ability? Identify the lurking variable.
AP Exam Tip: When asked to explain why correlation does not imply causation, always describe a specific lurking variable in context. Saying "there could be other factors" is insufficient for full credit.
Scatterplot with an influential outlier — observe how a single point can dramatically shift $r$. The red point is the outlier.
Figure 4.3 — Effect of an Outlier on Correlation
Practice Problems
A scatterplot shows a curved, U-shaped pattern with $r \approx 0.02$. A student concludes there is no association. Is this correct? Explain.
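For a dataset: $\bar{x} = 5$, $\bar{y} = 20$, $s_x = 2$, $s_y = 4$. One data point has $x = 7$ and $y = 24$. What is the product of its standardized scores, that is, its contribution to the sum in the formula for $r$ (before dividing by $n - 1$)?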
Describe the expected direction and approximate strength of the association: hours of sleep vs. reaction time in a driving simulator.
Which of the following do NOT change the value of $r$? Select all that apply. (a) Adding 10 to all y-values; (b) Multiplying all x-values by 3; (c) Removing an outlier; (d) Switching x and y.
A researcher finds $r = 0.91$ between per-capita chocolate consumption and number of Nobel Prize winners per country. What explains this association without claiming causation?
A scatterplot has 9 points tightly clustered along a line with $r = 0.97$, plus one outlier that is far below the pattern. Will removing the outlier increase or decrease $r$?
AP Exam FRQ: A study of 30 cities finds $r = 0.87$ between the number of libraries per capita (x) and high school graduation rate (y). A city council concludes: "Building more libraries causes graduation rates to rise." Evaluate this conclusion.
Two variables have $r = -1$. Describe the scatterplot and what you know about the data.
Chapter Summary
Describing Scatterplots
- Direction: In a positive association, $y$ tends to increase as $x$ increases. In a negative association, $y$ tends to decrease as $x$ increases.
- Form: Linear (points roughly follow a line) vs. nonlinear (curved pattern). Identify the form before using correlation.
- Strength: How closely the points follow the form. Strong: points cluster tightly. Weak: points are widely scattered.
- Outliers: Points that deviate from the overall pattern. They can strongly influence $r$ and the regression line.
Correlation $r$
- Values: $-1 \leq r \leq 1$. $r = \pm 1$ means perfect linear association. $r = 0$ means no linear association. $r$ has no units.
- Formula: $r = \dfrac{1}{n-1}\displaystyle\sum\left(\dfrac{x_i - \bar{x}}{s_x}\right)\!\left(\dfrac{y_i - \bar{y}}{s_y}\right)$, the average product of standardized scores (with $n - 1$ as the divisor).
- Causation: A high $r$ does not mean $x$ causes $y$. A lurking variable may cause both, or the association may be coincidental.
- Appropriate use: Only valid for linear relationships between quantitative variables. Not appropriate for categorical data or clearly curved patterns.