
Chapter 4: Scatterplots and Correlation

AP Statistics · Exploring Two-Variable Data · 3 interactive graphs · 8 practice problems

Learning Objectives

4.1 Constructing and Interpreting Scatterplots

A scatterplot displays the relationship between two quantitative variables measured on the same individuals. One variable is plotted on the horizontal axis (explanatory variable) and the other on the vertical axis (response variable).

Definition: Explanatory and Response Variables

The explanatory variable (independent variable) may help explain or predict changes in the response variable (dependent variable). When examining a relationship, place the explanatory variable on the x-axis and the response variable on the y-axis.

When describing a scatterplot, always address four characteristics, remembered with the acronym DOFS: direction, outliers, form, and strength.

Example 4.1 — Describing a Scatterplot

A study records study hours per week (x) and exam score (y) for 10 students:

Hours (x)245781012141518
Score (y)55626571747984879195

Description: There is a strong, positive, linear association between study hours and exam score. No outliers are apparent. As study hours increase, exam scores tend to increase as well.

TRY IT

A dataset shows that as temperature (°F) increases from 60 to 100, ice cream sales increase. However, at 105°F, sales drop sharply. How would you describe the form and any outliers?

Show Answer
The association is positive and roughly linear from 60°F to 100°F. The point at 105°F does not follow that pattern: it is a potential outlier, and including it makes the overall form look curved rather than linear.

Scatterplot of study hours vs. exam score — observe direction, form, and strength.

Figure 4.1 — Study Hours vs. Exam Score

4.2 Measuring Correlation: The Coefficient r

The correlation coefficient $r$ measures the direction and strength of a linear association between two quantitative variables.

Formula: Pearson Correlation Coefficient

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

Interpreting r

AP Exam Tip: $r$ has no units. Adding a constant to every value of a variable, multiplying every value by a positive constant, or switching $x$ and $y$ does not change $r$. Multiplying one variable by a negative constant reverses the sign of $r$ but leaves $|r|$ unchanged. Always state direction, strength, and form when describing $r$ in context.
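The behavior of $r$ under these transformations can be checked numerically. The sketch below uses a small made-up dataset (not from this chapter) and a helper named `pearson_r`, which is a hypothetical name for a direct implementation of the formula above.

```python
import math
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sum of z-score products divided by n - 1 (the formula in this section)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Made-up illustrative data (an assumption, not taken from the chapter)
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

r = pearson_r(x, y)
r_shift = pearson_r(x, [v + 10 for v in y])   # add 10 to every y
r_scale = pearson_r([3 * v for v in x], y)    # multiply every x by 3
r_swap = pearson_r(y, x)                      # switch x and y
r_neg = pearson_r([-3 * v for v in x], y)     # multiply every x by -3

same_after_shift = math.isclose(r, r_shift)
same_after_scale = math.isclose(r, r_scale)
same_after_swap = math.isclose(r, r_swap)
sign_flipped = math.isclose(r_neg, -r)

print(same_after_shift, same_after_scale, same_after_swap, sign_flipped)
```

The first three comparisons show $r$ unchanged. The last shows that a negative multiplier flips the sign of $r$ while keeping its magnitude.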

Example 4.2 — Calculating and Interpreting r

For the study hours data above, $r \approx 0.993$. Interpret this value.

Interpretation: There is a strong, positive, linear association between study hours and exam score ($r = 0.993$). Students who study more hours tend to score higher on exams.
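As a check, the value of $r$ for the study hours data can be reproduced directly from the formula in this section. This is a minimal sketch using only the Python standard library. The function name `pearson_r` is a hypothetical helper, not part of any library.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sum of z-score products divided by n - 1 (the formula in this section)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Data from Example 4.1
hours = [2, 4, 5, 7, 8, 10, 12, 14, 15, 18]
scores = [55, 62, 65, 71, 74, 79, 84, 87, 91, 95]

r = pearson_r(hours, scores)
print(round(r, 3))  # 0.993
```

Each term in the sum is the product of one student's two z-scores. Students who are above (or below) average on both variables contribute positive products, which is why $r$ is close to 1 here.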

TRY IT

A researcher finds $r = -0.82$ between hours of TV watched per day and GPA. Interpret this value in context.

Show Answer
There is a strong, negative, linear association between TV hours and GPA ($r = -0.82$). Students who watch more TV per day tend to have lower GPAs.

Adjust the slider to see how the correlation coefficient $r$ changes the shape of a scatterplot. Strong vs. weak, positive vs. negative.

Figure 4.2 — Visualizing Correlation Strength

4.3 Properties and Cautions for r

While $r$ is a powerful measure, it has important limitations that appear frequently on the AP exam:

Key Properties of r

4.3.1 Influential Points and Outliers

An influential point is one whose removal would significantly change $r$ or the regression line. A point far from the overall pattern in the x-direction tends to be highly influential. An outlier in a scatterplot falls far from the linear pattern of the rest of the data.

Example 4.3 — Effect of an Outlier on r

Suppose one student in the study hours data studies 2 hours but scores 98 (far above the pattern). Because this point combines a low $x$-value with a very high $y$-value, it runs against the positive trend and would pull $r$ well below 0.993. In general, an outlier can raise or lower $r$ depending on where it sits relative to the pattern. Always check scatterplots visually before relying on $r$ alone.
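The effect in this example can be estimated numerically. The sketch below assumes the unusual student is added as an 11th observation to the Example 4.1 data (the example does not say whether the point is added or replaces an existing one). `pearson_r` is a hypothetical helper implementing the formula from Section 4.2.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Data from Example 4.1
hours = [2, 4, 5, 7, 8, 10, 12, 14, 15, 18]
scores = [55, 62, 65, 71, 74, 79, 84, 87, 91, 95]

r_before = pearson_r(hours, scores)

# Assumption: the outlier (2 hours, score 98) is an additional 11th student
r_after = pearson_r(hours + [2], scores + [98])

print(round(r_before, 3), round(r_after, 3))
```

Under this assumption a single point lowers $r$ from about 0.99 to roughly 0.61, which shows that $r$ is not resistant to outliers.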

4.4 Correlation Does Not Imply Causation

A strong correlation between two variables does not prove that one causes the other. There are three possible explanations for an observed association:

Explanations for Association

  1. Direct causation: $x$ directly causes changes in $y$.
  2. Common response (lurking variable): A third variable $z$ causes both $x$ and $y$ to change, creating the appearance of an association.
  3. Confounding: The effect of $x$ on $y$ is mixed up with the effect of other variables.

A lurking variable is a variable that influences both $x$ and $y$ but is not included in the analysis.

Example 4.4 — Lurking Variables

Studies show a strong positive correlation between the number of firefighters at a fire and the amount of fire damage. Does sending more firefighters cause more damage?

No. The lurking variable is fire size. Larger fires attract more firefighters AND cause more damage. Firefighters don't cause damage — fire size drives both variables.
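A small simulation can illustrate how a lurking variable produces a strong correlation with no direct causal link. Everything in this sketch is an assumption made for illustration: the numbers, the noise levels, and the linear dependence on fire size are invented, and `pearson_r` is a hypothetical helper.

```python
import random
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Invented model: fire size drives both variables; neither affects the other
fire_size = [rng.uniform(0, 10) for _ in range(500)]
firefighters = [2 * z + rng.gauss(0, 1) for z in fire_size]
damage = [5 * z + rng.gauss(0, 2) for z in fire_size]

r_raw = pearson_r(firefighters, damage)

# Subtract the part of each variable that comes from fire size
# (known exactly here only because the data were simulated)
ff_left = [f - 2 * z for f, z in zip(firefighters, fire_size)]
dmg_left = [d - 5 * z for d, z in zip(damage, fire_size)]
r_left = pearson_r(ff_left, dmg_left)

print(round(r_raw, 2), round(r_left, 2))
```

In this simulated setting the raw correlation is strong, but once the influence of fire size is removed, what remains is close to zero. With real data the lurking variable's effect is not known exactly, which is why an experiment is needed to establish causation.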

TRY IT

There is a strong positive correlation between shoe size and reading ability in elementary school children. Is shoe size a cause of reading ability? Identify the lurking variable.

Show Answer
No. The lurking variable is age. Older children have larger feet AND have had more time to develop reading skills. Age causes both shoe size and reading ability to increase together.

AP Exam Tip: When asked to explain why correlation does not imply causation, always describe a specific lurking variable in context. Saying "there could be other factors" is insufficient for full credit.

Scatterplot with an influential outlier — observe how a single point can dramatically shift $r$. The red point is the outlier.

Figure 4.3 — Effect of an Outlier on Correlation

Practice Problems

1

A scatterplot shows a curved, U-shaped pattern with $r \approx 0.02$. A student concludes there is no association. Is this correct? Explain.

Show Solution
The student is incorrect. $r \approx 0.02$ indicates nearly no linear association, but a strong nonlinear (curved) association exists. The correlation coefficient only measures linear relationships.
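An extreme version of this situation is easy to construct. In the sketch below, the made-up data lie exactly on a symmetric parabola, so $y$ is completely determined by $x$, yet the linear correlation is zero. `pearson_r` is a hypothetical helper implementing the chapter's formula.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Made-up U-shaped data: y = x^2 on a range symmetric about 0
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = pearson_r(x, y)
print(abs(r) < 1e-9)  # True: no linear association despite a perfect curved one
```

The positive products on the right half of the U cancel the negative products on the left half, so the sum in the formula is zero.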
2

For a dataset: $\bar{x} = 5$, $\bar{y} = 20$, $s_x = 2$, $s_y = 4$. One data point has $x = 7$ and $y = 24$. What is its contribution to $r$?

Show Solution
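The z-scores are $z_x = (7-5)/2 = 1$ and $z_y = (24-20)/4 = 1$. Their product is $z_x \cdot z_y = (1)(1) = 1$, so this point adds 1 to the sum in the formula, which becomes a contribution of $\frac{1}{n-1}$ to $r$ once the sum is divided by $n-1$. Since both z-scores are positive, the contribution is positive and pulls $r$ toward +1.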
3

Describe the expected direction and approximate strength of the association: hours of sleep vs. reaction time in a driving simulator.

Show Solution
Negative direction: more sleep → faster reaction (lower times). Likely moderate to strong linear association. $r$ would be negative, possibly around $-0.7$ to $-0.9$.
4

Which of the following does NOT change the value of $r$? (a) Adding 10 to all y-values; (b) Multiplying all x-values by 3; (c) Removing an outlier; (d) Switching x and y.

Show Solution
(a), (b), and (d) do NOT change $r$. Adding a constant, multiplying by a positive constant, and swapping the axes all preserve the correlation. (c) Removing an outlier DOES change $r$ because $r$ is not resistant to outliers.
5

A researcher finds $r = 0.91$ between per-capita chocolate consumption and number of Nobel Prize winners per country. What explains this association without claiming causation?

Show Solution
A likely lurking variable is national wealth (GDP per capita). Wealthier countries can afford more chocolate consumption AND fund more research leading to Nobel Prizes. Wealth drives both variables simultaneously.
6

A scatterplot has 9 points tightly clustered along a line with $r = 0.97$, plus one outlier that is far below the pattern. Will removing the outlier increase or decrease $r$?

Show Solution
If the outlier is far below the pattern (but at a typical x-value), it weakens the linear pattern, pulling $r$ down from what it would be without it. Removing it will likely increase $r$ closer to 1.
7

AP Exam FRQ: A study of 30 cities finds $r = 0.87$ between the number of libraries per capita (x) and high school graduation rate (y). A city council concludes: "Building more libraries causes graduation rates to rise." Evaluate this conclusion.

Show Solution
The conclusion is not justified. Correlation does not imply causation. A likely lurking variable is community wealth or education funding. Wealthier communities tend to build more libraries AND fund better schools, leading to higher graduation rates. An experiment would be needed to establish causation.
8

Two variables have $r = -1$. Describe the scatterplot and what you know about the data.

Show Solution
$r = -1$ means a perfect negative linear association. All points fall exactly on a straight line with negative slope. Knowing $x$ perfectly predicts $y$. In practice, $r = -1$ is extremely rare in real data.

📋 Chapter Summary

Describing Scatterplots

Direction

Positive association: as $x$ increases, $y$ tends to increase. Negative: as $x$ increases, $y$ tends to decrease.

Form

Linear (points roughly follow a line) vs. nonlinear (curved pattern). Identify before using correlation.

Strength

How closely points follow the form. Strong: points cluster tightly. Weak: points widely scattered.

Outliers

Points that deviate from the overall pattern. Can strongly influence $r$ and the regression line.

Correlation $r$

Properties of $r$

$-1 \leq r \leq 1$. $r = \pm 1$ means perfect linear association. $r = 0$ means no linear association. $r$ has no units.

Formula

$r = \dfrac{1}{n-1}\displaystyle\sum\left(\dfrac{x_i - \bar{x}}{s_x}\right)\!\left(\dfrac{y_i - \bar{y}}{s_y}\right)$ — average product of standardized scores.

Caution: Correlation ≠ Causation

A high $r$ does not mean $x$ causes $y$. A lurking variable may cause both, or the association may be coincidental.

Conditions for $r$

Only valid for linear relationships between quantitative variables. Not appropriate for categorical data or clearly curved patterns.

📘 Key Terms

Scatterplot: A graph of paired $(x, y)$ data showing the relationship between two quantitative variables.
Explanatory Variable: The $x$-variable, which may help explain or predict the response variable.
Response Variable: The $y$-variable, which is measured as an outcome. Also called the dependent variable.
Correlation ($r$): Measures the direction and strength of the linear relationship between two quantitative variables. $-1 \leq r \leq 1$.
Lurking Variable: A variable not in the analysis that is associated with both $x$ and $y$, potentially explaining their relationship.
Causation: A direct cause-and-effect relationship. Cannot be established by correlation alone; it requires a controlled experiment.