Chapter 1: Exploring Data — Distributions
Learning Objectives
- Distinguish between categorical and quantitative variables
- Display distributions using dotplots, stemplots, and histograms
- Describe distribution shape: symmetric, skewed left, skewed right
- Identify center and spread from a graphical display
- Construct and interpret boxplots; identify outliers using the IQR rule
- Compare distributions using appropriate graphical and numerical summaries
1.1 Types of Variables
Statistics begins with data — information collected about individuals. Before analyzing data, we must identify the type of variable being measured, since different variable types require different methods.
Definition: Types of Variables
A categorical variable (also called qualitative) places each individual into one of several groups or categories. Examples: eye color, gender, country of birth, letter grade (A/B/C/D/F).
A quantitative variable takes numerical values for which arithmetic makes sense. Examples: height in cm, SAT score, temperature, number of siblings.
Quantitative variables can be further divided:
- Discrete: takes countable values (e.g., number of pets: 0, 1, 2, 3, …)
- Continuous: can take any value in an interval (e.g., time, weight, length)
Example 1.1 — Identifying Variable Types
A survey of 30 AP Statistics students records the following information. Classify each variable.
| Variable | Type | Reason |
|---|---|---|
| Favorite subject | Categorical | Places student in a category (Math, English, …) |
| Hours studied per week | Quantitative (continuous) | Numerical, arithmetic makes sense |
| Number of AP exams taken | Quantitative (discrete) | Countable whole numbers |
| Grade in AP Stats (A/B/C) | Categorical | Letter grades are categories, not numbers |
A researcher records: (a) blood type of each patient, (b) systolic blood pressure, (c) number of hospitalizations. Classify each variable.
Answer:
(a) Blood type: Categorical (each patient is placed into a group such as A, B, AB, or O)
(b) Systolic blood pressure: Quantitative (continuous) — a measurement that can be any positive number
(c) Number of hospitalizations: Quantitative (discrete) — a countable whole number (0, 1, 2, …)
1.2 Displaying Distributions with Graphs
To understand a dataset, we start by making a graph. The graph reveals the distribution of a variable — what values occur and how often.
Dotplots
A dotplot places each data value as a dot above a number line. Dotplots work well for small datasets and show individual values clearly.
Example 1.2 — Reading a Dotplot
The number of text messages sent by 12 students in one hour: 3, 5, 5, 7, 8, 8, 8, 10, 12, 12, 15, 20
Each value gets one dot, and stacked dots indicate repeated values. We can see immediately that most students sent 5–12 messages, with one unusually high value at 20 that stands apart from the rest (a possible outlier).
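A dotplot is simple enough to sketch in plain text. The following is a minimal Python sketch using the data from Example 1.2. The function name `text_dotplot` is an illustrative choice and does not come from the text.

```python
from collections import Counter

def text_dotplot(values):
    """Return a text dotplot: one '.' per observation, stacked above each value."""
    counts = Counter(values)
    axis = list(range(min(values), max(values) + 1))
    width = len(str(max(values))) + 1  # column width so dots line up with labels
    rows = []
    # Build rows from the tallest stack down to the number line.
    for level in range(max(counts.values()), 0, -1):
        rows.append("".join(
            ("." if counts[v] >= level else " ").rjust(width) for v in axis
        ))
    rows.append("".join(str(v).rjust(width) for v in axis))
    return "\n".join(rows)

texts = [3, 5, 5, 7, 8, 8, 8, 10, 12, 12, 15, 20]
print(text_dotplot(texts))
```

Each column of dots sits above its value on the number line, so repeated values such as 8 show up as taller stacks.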
Histograms
A histogram divides the range of data into equal-width intervals (called bins) and displays the count or percent of observations in each bin. Histograms work well for large datasets.
How to Construct a Histogram
- Choose a convenient number of bins (typically 5–10)
- Make the bins equal in width, covering the full range
- Count the observations in each bin
- Draw bars of height = frequency (or relative frequency); bars touch each other
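The construction steps above can be sketched in a few lines of Python. This is a minimal illustration, not a full plotting routine. It reuses the text-message data from Example 1.2, and the bin choices (five bins of width 5 starting at 0) and the name `histogram_counts` are assumptions made for the example.

```python
def histogram_counts(values, start, width, n_bins):
    """Count observations in equal-width bins [start, start + width), and so on."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if not 0 <= index < n_bins:
            raise ValueError(f"{v} falls outside the bins")
        counts[index] += 1
    return counts

texts = [3, 5, 5, 7, 8, 8, 8, 10, 12, 12, 15, 20]
start, width, n_bins = 0, 5, 5  # assumed bins: 0 to <5, 5 to <10, ..., 20 to <25
counts = histogram_counts(texts, start, width, n_bins)

# Print a sideways histogram: one '#' per observation in the bin.
for i, c in enumerate(counts):
    left = start + i * width
    print(f"{left:>2} to <{left + width:<2} | {'#' * c} ({c})")
```

Each value lands in exactly one bin because a bin includes its left endpoint but not its right endpoint.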
Changing the bin width changes how the histogram looks, so it is worth trying a few widths before describing the shape.
Figure 1.1 — Histogram of Test Scores (n = 30)
Describing Shape
When you look at a histogram (or any distribution graph), describe its shape:
Distribution Shapes
- Symmetric: left and right sides are roughly mirror images; the mean ≈ median
- Skewed right (positively skewed): long tail extends to the right; mean > median
- Skewed left (negatively skewed): long tail extends to the left; mean < median
- Unimodal: one peak; Bimodal: two peaks; Uniform: roughly flat
Three distribution shapes: symmetric (blue), right-skewed (red), left-skewed (green)
Figure 1.2 — Symmetric vs. Skewed Distributions
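The mean-versus-median relationships listed above can be checked numerically. The sketch below is a rough heuristic only: the three datasets are invented for illustration, and the `tolerance` threshold is an arbitrary assumption rather than a rule from the text. A graph should always be the primary evidence for shape.

```python
from statistics import mean, median

def shape_hint(values, tolerance=0.05):
    """Rough hint only: compare mean and median relative to the data's range."""
    gap = mean(values) - median(values)
    spread = max(values) - min(values)
    if spread == 0 or abs(gap) <= tolerance * spread:
        return "roughly symmetric"
    return "skewed right" if gap > 0 else "skewed left"

# Hypothetical datasets, invented for illustration.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
right_tail = [1, 1, 2, 2, 3, 4, 6, 9, 15]
left_tail = [1, 7, 10, 12, 13, 14, 14, 15, 15]

for name, data in [("symmetric", symmetric),
                   ("right_tail", right_tail),
                   ("left_tail", left_tail)]:
    print(name, round(mean(data), 2), median(data), shape_hint(data))
```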
AP Exam Tip: When describing a distribution, always address Shape, Outliers, Center, and Spread (the "SOCS" checklist). Free-response graders look for all four components.
1.3 Boxplots and the Five-Number Summary
A boxplot (box-and-whisker plot) summarizes a distribution using five key values called the five-number summary: Minimum, Q1, Median (Q2), Q3, Maximum.
Five-Number Summary
Given a dataset sorted in order:
- Minimum: smallest value
- Q1 (first quartile): median of the lower half of the data
- Median (Q2): middle value (or average of two middle values)
- Q3 (third quartile): median of the upper half of the data
- Maximum: largest value
The Interquartile Range (IQR) $= Q_3 - Q_1$ measures the spread of the middle 50% of data.
Example 1.3 — Computing the Five-Number Summary
AP exam scores for 15 students (sorted):
1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5
Step 1 — Median: The 8th value = 4
Step 2 — Q1: Lower half = {1, 2, 2, 3, 3, 3, 4}; median = 3
Step 3 — Q3: Upper half = {4, 4, 4, 5, 5, 5, 5}; median = 5
Five-number summary: Min = 1, Q1 = 3, Median = 4, Q3 = 5, Max = 5
IQR = Q3 − Q1 = 5 − 3 = 2
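The steps in Example 1.3 can be written as a short Python function. This is a sketch of the median-of-halves rule described in this section, and `five_number_summary` is an illustrative name. Calculators and software packages may use slightly different quartile conventions, so their results can differ from a hand calculation.

```python
from statistics import median

def five_number_summary(values):
    """Five-number summary using the median-of-halves rule for quartiles.

    With an odd number of values, the overall median is left out of both halves.
    Assumes at least two values.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return min(data), median(lower), median(data), median(upper), max(data)

scores = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5]
minimum, q1, med, q3, maximum = five_number_summary(scores)
iqr = q3 - q1
print(minimum, q1, med, q3, maximum, iqr)
```

For the 15 scores, the function reproduces the values found by hand: Min = 1, Q1 = 3, Median = 4, Q3 = 5, Max = 5, and IQR = 2.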
Identifying Outliers
The 1.5 × IQR Rule for Outliers
An observation is a suspected outlier if it falls:
- Below $Q_1 - 1.5 \times \text{IQR}$, or
- Above $Q_3 + 1.5 \times \text{IQR}$
On a modified boxplot, outliers are plotted as individual points; whiskers extend only to the last non-outlier value.
Figure 1.3 — Modified Boxplot with Outlier Detection
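The 1.5 × IQR rule translates directly into code. The sketch below takes the quartiles as given and uses two datasets: the scores from Example 1.3 (Q1 = 3, Q3 = 5) and a second dataset that is hypothetical, invented to show values being flagged. The function names are illustrative assumptions.

```python
def iqr_fences(q1, q3):
    """Return the (lower, upper) fences for the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def suspected_outliers(values, q1, q3):
    """Values strictly below the lower fence or strictly above the upper fence."""
    lower, upper = iqr_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]

# Example 1.3: Q1 = 3, Q3 = 5, so the fences are 0 and 8.
scores = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5]
print(iqr_fences(3, 5), suspected_outliers(scores, 3, 5))

# Hypothetical data: Q1 = 10 and Q3 = 14 by the median-of-halves rule.
hypothetical = [2, 10, 11, 12, 13, 14, 30]
print(iqr_fences(10, 14), suspected_outliers(hypothetical, 10, 14))
```

No score in Example 1.3 falls outside its fences, while the hypothetical values 2 and 30 are both flagged.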
A dataset has Q1 = 12, Q3 = 20. Calculate the IQR and the outlier fences.
Answer:
IQR = Q3 − Q1 = 20 − 12 = 8
Lower fence: Q1 − 1.5(8) = 12 − 12 = 0
Upper fence: Q3 + 1.5(8) = 20 + 12 = 32
Any value below 0 or above 32 is a suspected outlier.
1.4 Comparing Distributions
A common AP Statistics task is to compare two or more distributions. Use side-by-side boxplots or back-to-back stemplots. Always compare shape, center, spread, and outliers in context.
Example 1.4 — Comparing Two Distributions
Two classes take the same quiz. Class A: min=52, Q1=68, median=74, Q3=82, max=96. Class B: min=60, Q1=72, median=80, Q3=85, max=92.
Center: Class B has a higher median (80 vs 74), so a typical student in Class B scored higher than a typical student in Class A.
Spread: Class A has a larger IQR (82−68=14) vs Class B (85−72=13), so Class A is slightly more variable.
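Shape: Both distributions appear roughly symmetric based on the summary values, though a five-number summary gives only a rough picture of shape.
Outliers: Neither class has outliers. For Class A the fences are 68 − 1.5(14) = 47 and 82 + 1.5(14) = 103; for Class B they are 72 − 1.5(13) = 52.5 and 85 + 1.5(13) = 104.5. Every minimum and maximum falls inside its fences.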
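The numerical side of this comparison can be organized in code. The sketch below stores each class's five-number summary from Example 1.4 and derives the IQR, range, and fences. The class name `FiveNumberSummary` and its method names are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class FiveNumberSummary:
    minimum: float
    q1: float
    median: float
    q3: float
    maximum: float

    @property
    def iqr(self):
        return self.q3 - self.q1

    @property
    def data_range(self):
        return self.maximum - self.minimum

    def fences(self):
        """Lower and upper 1.5 * IQR fences."""
        return self.q1 - 1.5 * self.iqr, self.q3 + 1.5 * self.iqr

    def extremes_within_fences(self):
        """True when neither the minimum nor the maximum is a suspected outlier."""
        lower, upper = self.fences()
        return lower <= self.minimum and self.maximum <= upper

class_a = FiveNumberSummary(52, 68, 74, 82, 96)
class_b = FiveNumberSummary(60, 72, 80, 85, 92)

for name, s in (("Class A", class_a), ("Class B", class_b)):
    print(name, "median:", s.median, "IQR:", s.iqr, "range:", s.data_range,
          "fences:", s.fences(), "no outliers:", s.extremes_within_fences())
```

The code only supplies the numbers. A complete answer still needs the comparisons written out in context.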
AP Exam Tip: When comparing distributions, always write comparisons in context and use comparative language ("Class B's median is higher than Class A's median"). Simply listing each distribution's statistics without comparing earns partial credit only.
Practice Problems
A sample of 10 students recorded how many hours they sleep per night:
6, 7, 7, 8, 8, 8, 9, 9, 10, 12
(a) Find the five-number summary.
(b) Calculate the IQR.
(c) Identify any outliers using the 1.5 × IQR rule.
Solution:
(a) Five-number summary: Min = 6, Q1 = 7, Median = 8, Q3 = 9, Max = 12
(b) IQR = Q3 − Q1 = 9 − 7 = 2
(c) Lower fence = 7 − 1.5(2) = 4; Upper fence = 9 + 1.5(2) = 12
The value 12 equals the upper fence but is not strictly beyond it, so no outliers by the strict rule. (Note: some texts use "≥ fence" rather than "> fence" — clarify which your teacher uses.)
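The boundary case in this solution is easy to see in code. The sketch below applies both conventions to the sleep data. The `inclusive` flag is a hypothetical parameter added for illustration, not standard terminology.

```python
def beyond_fences(values, lower, upper, inclusive=False):
    """Flag values beyond the fences; inclusive=True also flags values on a fence."""
    if inclusive:
        return [v for v in values if v <= lower or v >= upper]
    return [v for v in values if v < lower or v > upper]

sleep = [6, 7, 7, 8, 8, 8, 9, 9, 10, 12]
q1, q3 = 7, 9
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # 4.0 and 12.0

print(beyond_fences(sleep, lower, upper))                  # strict rule
print(beyond_fences(sleep, lower, upper, inclusive=True))  # "on the fence" counts
```

Under the strict rule nothing is flagged. Under the inclusive convention the value 12 is flagged because it sits exactly on the upper fence.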
A histogram shows that the distribution of household incomes in a city is strongly skewed right.
(a) What does the skewed-right shape tell us about most households vs. a few households?
(b) Would you expect the mean income to be greater than or less than the median income? Explain.
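Solution:
(a) Most households have relatively low or moderate incomes, while a few households have much higher incomes that form the long right tail.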
(b) The mean will be greater than the median. The few extremely high earners pull the mean toward the right tail, but the median (middle value) is not affected by extreme values. This is a classic pattern in income data.
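The reasoning in part (b) can be illustrated with a small numerical experiment. The incomes below are hypothetical values (in thousands of dollars) invented for this sketch. Adding one very high earner moves the mean a great deal and the median only slightly.

```python
from statistics import mean, median

# Hypothetical household incomes in thousands of dollars.
incomes = [32, 38, 41, 45, 47, 52, 58, 63, 70]
with_high_earner = incomes + [900]  # add one extreme value

print("without:", round(mean(incomes), 1), median(incomes))
print("with:   ", round(mean(with_high_earner), 1), median(with_high_earner))

mean_shift = mean(with_high_earner) - mean(incomes)
median_shift = median(with_high_earner) - median(incomes)
print("mean shift:", round(mean_shift, 1), "median shift:", median_shift)
```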
Classify each variable as categorical or quantitative:
(a) ZIP code (b) Annual rainfall in mm (c) Shirt size (S/M/L/XL) (d) Number of siblings
Solution:
(a) ZIP code: Categorical (the digits are labels for a location, so arithmetic on them is meaningless)
(b) Annual rainfall: Quantitative (continuous)
(c) Shirt size: Categorical — ordered categories, but not truly numerical
(d) Number of siblings: Quantitative (discrete)
Two competing tutoring programs (Program A and Program B) report the following SAT Math score gains for a sample of students:
Program A: Min=20, Q1=40, Median=60, Q3=80, Max=150
Program B: Min=30, Q1=50, Median=65, Q3=75, Max=100
Compare the distributions of score gains for the two programs. Write a complete response using the SOCS framework.
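Solution:
Shape: Program A's gains appear skewed right: the maximum (150) lies 70 points above Q3, while the minimum (20) lies only 20 points below Q1. Program B's gains appear roughly symmetric.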
Center: Program B has a slightly higher median score gain (65 points) compared to Program A (60 points), suggesting Program B typically produces marginally larger gains.
Spread: Program A has greater variability: IQR = 80 − 40 = 40, compared to Program B's IQR = 75 − 50 = 25. Program A's range is also larger (130 vs. 70). Program A's results are less consistent than Program B's.
Outliers: Program A's maximum of 150 is a likely outlier. Check: upper fence = 80 + 1.5(40) = 140; since 150 > 140, the value of 150 is a suspected outlier. No outliers are apparent in Program B.
A distribution has Q1 = 45 and Q3 = 65. Which of the following values would be classified as an outlier?
(A) 20 (B) 35 (C) 70 (D) 80
Solution:
IQR = Q3 − Q1 = 65 − 45 = 20
Lower fence = 45 − 1.5(20) = 45 − 30 = 15
Upper fence = 65 + 1.5(20) = 65 + 30 = 95
Values below 15 or above 95 are suspected outliers. Checking options:
(A) 20: between 15 and 95 → not an outlier
(B) 35: between 15 and 95 → not an outlier
(C) 70: between 15 and 95 → not an outlier
(D) 80: between 15 and 95 → not an outlier
None of the given options is an outlier. Answer: none of the above. (This tests whether students carefully apply the fence rule rather than guessing "the biggest number.")
A dataset of 20 values is described: the mean is 55 and the median is 42. What does this tell you about the shape of the distribution? Explain your reasoning.
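Solution:
The mean (55) is noticeably larger than the median (42), which suggests the distribution is skewed right. A few unusually large values pull the mean upward, while the median is resistant to extreme values.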
Identify an appropriate graph for each situation:
(a) Display the distribution of birth months (Jan–Dec) for 50 students
(b) Compare the heights of male and female students in a class of 60
(c) Show the distribution of 200 SAT scores
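Solution:
(a) Bar chart: birth month is a categorical variable, so a bar chart of the count for each month is appropriate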
(b) Side-by-side boxplots — best for comparing two groups on a quantitative variable; back-to-back stemplot also works for smaller datasets
(c) Histogram — 200 observations of a quantitative variable; individual values would be too crowded for a dotplot or stemplot
A dataset has the property that the mean, median, and mode are all equal.
(a) What shape does the distribution likely have?
(b) Give a specific example of such a dataset with 5 values.
Solution:
(a) The distribution is likely symmetric and unimodal, with its single peak at the center.
(b) Example: 2, 4, 4, 4, 6
Mean = (2+4+4+4+6)/5 = 20/5 = 4 ✓
Median = middle value = 4 ✓
Mode = most frequent = 4 ✓
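The three checks above can be confirmed with Python's standard `statistics` module. This is a quick verification of the example dataset, not a general test for symmetry.

```python
from statistics import mean, median, mode

data = [2, 4, 4, 4, 6]

# All three measures of center should agree for this dataset.
print(mean(data), median(data), mode(data))
```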
Chapter Summary
Types of Data
- Categorical variable: records which group or category an individual belongs to. Examples: gender, color, region. Summarized with frequency tables and bar charts.
- Quantitative variable: records numerical values where arithmetic makes sense. Examples: height, temperature, income. Summarized with histograms, dotplots, boxplots.
- Distribution: describes the pattern of values: shape (symmetric, skewed, bimodal), center (mean/median), spread (range/IQR/SD), and outliers.
- Comparing distributions: use parallel boxplots or back-to-back stemplots. Compare shape, center, spread, and outliers in context. Always use comparative language.
Graph Types
- Histogram: groups quantitative data into intervals (bins). Shows shape clearly. Use for large datasets.
- Boxplot: shows the five-number summary: min, Q1, median, Q3, max. Outliers are plotted individually when they lie more than 1.5 × IQR below Q1 or above Q3.
- Dotplot: shows every data value. Useful for small datasets to see exact values and identify gaps or clusters.
- Bar chart: for categorical data. Bars represent frequency or relative frequency for each category. Bars should NOT touch.
Shape Descriptions
- Symmetric — roughly mirror-image on both sides of center
- Skewed right — tail extends to the right; mean > median
- Skewed left — tail extends to the left; mean < median
- Unimodal / Bimodal — one or two distinct peaks