IGCSE Mathematics: Statistics
Statistics is the science of collecting, organising, summarising, and interpreting data. In this chapter you will learn to classify different types of data, calculate measures of average and spread, construct and interpret a wide range of statistical diagrams, and draw conclusions by comparing distributions. These skills are essential for analysing real-world information critically.
Specification Note
Content labelled Extended is required for Extended tier (Cambridge) or Higher tier (Edexcel) only.
1. Types of Data and Sampling
Classifying Data
Types of Data
- Qualitative (categorical): Data described by words or categories. Examples: colour of cars, favourite subject, nationality.
- Quantitative: Data described by numbers. This is further divided into:
- Discrete: Can only take specific values (usually whole numbers). Examples: number of children, shoe size, goals scored.
- Continuous: Can take any value within a range. Examples: height, weight, time, temperature.
- Primary data: Collected by the researcher themselves (e.g., via a questionnaire or experiment).
- Secondary data: Collected by someone else and used by the researcher (e.g., census data, published statistics).
Sampling Methods
It is rarely practical to collect data from every member of a population. Instead, we select a sample. The method of selection affects how representative the sample is.
- Random sampling: Every member of the population has an equal chance of being selected (e.g., names drawn from a hat, random number generator). Reduces the risk of bias in who is chosen.
- Systematic sampling: Every $n$th member of a list is selected (e.g., every 10th name on a register). Simple and structured.
- Stratified sampling: The population is divided into groups (strata) and a proportional sample is taken from each group. Ensures representation of sub-groups.
- Convenience (opportunity) sampling: The easiest people to reach are selected. Simple but prone to bias.
Example 1 — Stratified sampling
A school has 600 students: 240 in Year 10 and 360 in Year 11. A stratified sample of 50 students is to be selected. How many should come from each year group?
Step 1 Find the proportion from each year group:
Year 10: $\dfrac{240}{600} \times 50 = 20$ students
Year 11: $\dfrac{360}{600} \times 50 = 30$ students
Check: $20 + 30 = 50$ ✓
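The proportional calculation above can be sketched in Python. This is a minimal illustration, not part of the syllabus; the function name `stratified_allocation` is made up for this example, and it assumes the shares come out as whole numbers (otherwise they would need rounding so that the total still matches the sample size).

```python
def stratified_allocation(strata_sizes, sample_size):
    """Share a sample across strata in proportion to each stratum's size."""
    population = sum(strata_sizes.values())
    # Multiply before dividing so that exact shares stay exact.
    return {
        name: size * sample_size / population
        for name, size in strata_sizes.items()
    }


allocation = stratified_allocation({"Year 10": 240, "Year 11": 360}, 50)
print(allocation)  # {'Year 10': 20.0, 'Year 11': 30.0}
```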
Practice 1a
Classify each of the following as qualitative, discrete, or continuous: (i) the length of a leaf in cm, (ii) the number of pets owned, (iii) the brand of a mobile phone, (iv) the time taken to complete a race.
Solution
(i) Continuous — length can take any value in a range.
(ii) Discrete — whole numbers only.
(iii) Qualitative — described by a category (brand name).
(iv) Continuous — time can take any value.
2. Averages and Spread
Measures of Average
- Mean: Sum of all values divided by the number of values. $\bar{x} = \dfrac{\sum x}{n}$
- Median: The middle value when data is arranged in order. For $n$ values, the median is at position $\dfrac{n+1}{2}$; if this is not a whole number, take the mean of the two values either side of it.
- Mode: The value that occurs most frequently. A data set may have more than one mode, or no mode.
Measures of Spread
- Range: Largest value $-$ Smallest value.
- Interquartile range (IQR): Upper quartile $Q_3$ $-$ Lower quartile $Q_1$. The IQR is less affected by extreme values than the range.
Mean from a Frequency Table
$$\bar{x} = \frac{\sum f x}{\sum f}$$
where $f$ is the frequency and $x$ is the value.
Estimating the Mean from Grouped Data
Use the midpoint $m$ of each class interval as the representative value:
$$\bar{x} \approx \frac{\sum f m}{\sum f}$$
This is only an estimate, because the exact values within each class are unknown.
Example 2 — Mean from a frequency table
The table shows the number of books read by 20 students last month.
| Books read ($x$) | Frequency ($f$) | $fx$ |
|---|---|---|
| 0 | 3 | 0 |
| 1 | 5 | 5 |
| 2 | 7 | 14 |
| 3 | 4 | 12 |
| 4 | 1 | 4 |
| Total | 20 | 35 |
$\bar{x} = \dfrac{35}{20} = 1.75$ books
Median: $20$ values, so median is between the 10th and 11th values. Cumulative frequencies: 0→3, 1→8, 2→15. Both the 10th and 11th values lie in the $x=2$ group. Median $= 2$.
Mode: $x = 2$ (highest frequency = 7).
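As a check on Example 2, the three averages can be computed directly from the frequency table. The sketch below is illustrative only: the helper name `frequency_table_summary` is an assumption, and it reports a single mode without handling ties.

```python
def frequency_table_summary(table):
    """Return (mean, median, mode) for a dict mapping value -> frequency."""
    n = sum(table.values())
    mean = sum(x * f for x, f in table.items()) / n

    # Expand the table into an ordered list so the middle value(s) can be read off.
    ordered = sorted(x for x, f in table.items() for _ in range(f))
    mid = n // 2
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

    # The mode is the value with the largest frequency (ties are not handled here).
    mode = max(table, key=table.get)
    return mean, median, mode


books = {0: 3, 1: 5, 2: 7, 3: 4, 4: 1}
print(frequency_table_summary(books))  # (1.75, 2.0, 2)
```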
Example 3 — Estimating mean from grouped data
The heights (in cm) of 30 students are recorded in the grouped frequency table below.
| Height (cm) | Frequency ($f$) | Midpoint ($m$) | $fm$ |
|---|---|---|---|
| $150 \leq h < 160$ | 6 | 155 | 930 |
| $160 \leq h < 170$ | 11 | 165 | 1815 |
| $170 \leq h < 180$ | 9 | 175 | 1575 |
| $180 \leq h < 190$ | 4 | 185 | 740 |
| Total | 30 | | 5060 |
Estimated mean $= \dfrac{5060}{30} \approx 168.7$ cm
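The midpoint method can be written as a short Python sketch. The function name `estimated_mean` and the `(lower, upper, frequency)` layout are assumptions chosen for this illustration.

```python
def estimated_mean(classes):
    """Estimate the mean from (lower, upper, frequency) class intervals."""
    # Each class contributes frequency x midpoint to the running total.
    total_fm = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
    total_f = sum(f for _, _, f in classes)
    return total_fm / total_f


heights = [(150, 160, 6), (160, 170, 11), (170, 180, 9), (180, 190, 4)]
print(round(estimated_mean(heights), 1))  # 168.7
```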
Practice 2a
The ages of 8 people are: 14, 17, 15, 21, 14, 19, 16, 14. Find the mean, median, mode, and range.
Solution
Ordered: 14, 14, 14, 15, 16, 17, 19, 21
Mean $= \dfrac{14+14+14+15+16+17+19+21}{8} = \dfrac{130}{8} = 16.25$
Median: $n=8$, position $= 4.5$, so average of 4th and 5th values $= \dfrac{15+16}{2} = 15.5$
Mode $= 14$ (appears 3 times)
Range $= 21 - 14 = 7$
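Answers like these can be checked with Python's standard `statistics` module. This is offered only as a way to verify hand calculations; the variable name `ages` is just a label for the data in Practice 2a.

```python
import statistics

ages = [14, 17, 15, 21, 14, 19, 16, 14]

print(statistics.mean(ages))    # 16.25
print(statistics.median(ages))  # 15.5
print(statistics.mode(ages))    # 14
print(max(ages) - min(ages))    # 7 (the range)
```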
3. Charts and Diagrams
Statistical diagrams provide a visual summary of data. The choice of diagram depends on the type of data and what you want to communicate.
Bar Charts
Used for qualitative or discrete data. Bars are drawn with gaps between them. The height of each bar represents the frequency (or relative frequency). Compound bar charts place related data in stacked or side-by-side bars. Dual bar charts show two data sets alongside each other for comparison.
Pie Charts
A pie chart divides a circle into sectors, where each sector represents a category. The angle for each sector is calculated as:
$$\text{angle} = \frac{\text{frequency}}{\text{total frequency}} \times 360°$$
Pictograms
Use symbols or pictures to represent data. A key is always included to show what each symbol represents. Partial symbols represent fractions of the unit value.
Example 4 — Pie chart calculation
In a survey, 80 students chose their favourite sport: Football 32, Basketball 20, Tennis 16, Swimming 12. Calculate the angle for each sector.
| Sport | Frequency | Angle |
|---|---|---|
| Football | 32 | $\dfrac{32}{80} \times 360 = 144°$ |
| Basketball | 20 | $\dfrac{20}{80} \times 360 = 90°$ |
| Tennis | 16 | $\dfrac{16}{80} \times 360 = 72°$ |
| Swimming | 12 | $\dfrac{12}{80} \times 360 = 54°$ |
| Total | 80 | 360° |
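The sector-angle formula applied in the table can be sketched as follows. The function name `sector_angles` is illustrative, and the angles are left unrounded.

```python
def sector_angles(frequencies):
    """Convert a dict of category -> frequency into sector angles in degrees."""
    total = sum(frequencies.values())
    # Multiply by 360 before dividing so that exact angles stay exact.
    return {name: f * 360 / total for name, f in frequencies.items()}


sports = {"Football": 32, "Basketball": 20, "Tennis": 16, "Swimming": 12}
print(sector_angles(sports))
# {'Football': 144.0, 'Basketball': 90.0, 'Tennis': 72.0, 'Swimming': 54.0}
```

The angles should always add up to 360°, which is a useful check before drawing the chart.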
Practice 3a
A pie chart shows the transport used by 120 pupils to travel to school. The sector for "bus" has an angle of 150°. How many pupils travel by bus?
Solution
Number of pupils $= \dfrac{150}{360} \times 120 = \dfrac{5}{12} \times 120 = 50$ pupils.
4. Scatter Diagrams and Correlation
A scatter diagram (or scatter graph) plots pairs of values $(x, y)$ to investigate whether a relationship exists between two variables.
Types of Correlation
- Positive correlation: As $x$ increases, $y$ tends to increase. Points slope upwards from left to right.
- Negative correlation: As $x$ increases, $y$ tends to decrease. Points slope downwards from left to right.
- No correlation: No clear pattern. Points are scattered randomly.
- Correlation can also be described as strong (points close to a line) or weak (points more scattered).
Line of Best Fit
A line of best fit (or trend line) is drawn through the middle of the data points so that roughly equal numbers of points lie on each side. It should pass through the mean point $(\bar{x}, \bar{y})$.
- Interpolation: Using the line of best fit to estimate values within the range of the data. Generally reliable.
- Extrapolation: Using the line of best fit to estimate values outside the range of the data. Less reliable, as the relationship may not continue.
- Outliers: Data points that lie well away from the general trend. They may indicate measurement errors or genuinely unusual values.
Example 5 — Interpreting a scatter diagram
A scatter diagram shows the marks scored in Maths ($x$) and Science ($y$) by 10 students. The points show a strong positive correlation. The line of best fit passes through $(20, 25)$ and $(60, 65)$.
Step 1 Find the gradient: $m = \dfrac{65-25}{60-20} = \dfrac{40}{40} = 1$
Step 2 Equation of line: $y - 25 = 1(x - 20) \Rightarrow y = x + 5$
Step 3 Estimate Science mark for Maths mark of 45: $y = 45 + 5 = 50$
This is interpolation — reliable, as 45 lies within the data range.
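Steps 1 to 3 can be reproduced with a small Python sketch. It assumes, as in the example, that two points on the line of best fit are known; the name `line_through` is hypothetical. In practice a line of best fit is drawn by eye, so the two points are read from the graph.

```python
def line_through(p, q):
    """Return (gradient, intercept) of the straight line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    gradient = (y2 - y1) / (x2 - x1)
    intercept = y1 - gradient * x1
    return gradient, intercept


m, c = line_through((20, 25), (60, 65))
print(m, c)        # 1.0 5.0
print(m * 45 + c)  # 50.0, the estimated Science mark for a Maths mark of 45
```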
Important: Correlation and Causation
Correlation does not imply causation. Just because two variables are correlated does not mean that one causes the other. There may be a third variable influencing both, or the correlation may be coincidental.
Practice 4a
Describe the type of correlation you would expect between: (i) the temperature and the number of ice creams sold; (ii) the age of a car and its resale value; (iii) a person's height and their IQ score.
Solution
(i) Positive correlation — as temperature increases, more ice creams are sold.
(ii) Negative correlation — as a car ages, its resale value generally decreases.
(iii) No correlation — height and IQ are not related.
5. Stem-and-Leaf Diagrams
A stem-and-leaf diagram organises numerical data by splitting each value into a stem (the leading digit(s)) and a leaf (the final digit). The data retains its original values, making it easy to find the median and quartiles.
In a back-to-back stem-and-leaf diagram, two sets of data share a common stem, with one set's leaves going left and the other's going right. This allows direct comparison of two distributions.
Example 6 — Back-to-back stem-and-leaf diagram
The times (in minutes) taken by Group A and Group B to complete a puzzle are:
Group A: 12, 15, 18, 21, 23, 25, 27, 31, 34, 38
Group B: 14, 16, 19, 20, 24, 26, 29, 30, 32, 35
Step 1 Find the median of Group A: 10 values, median is between 5th and 6th: $\dfrac{23+25}{2} = 24$ minutes.
Step 2 $Q_1$ of Group A: median of lower 5 values (12, 15, 18, 21, 23) = 18.
Step 3 $Q_3$ of Group A: median of upper 5 values (25, 27, 31, 34, 38) = 31.
Step 4 IQR of Group A $= 31 - 18 = 13$ minutes.
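Group B median: $\dfrac{24+26}{2} = 25$ minutes. Group B $Q_1 = 19$ (median of 14, 16, 19, 20, 24), $Q_3 = 30$ (median of 26, 29, 30, 32, 35), IQR $= 30 - 19 = 11$ minutes.
Comparison: Group B has a slightly higher median (25 vs 24 minutes), so Group B was slightly slower on average. Group A has the larger IQR (13 vs 11 minutes), suggesting Group A's times are more spread out.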
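The quartile method used in Steps 1 to 4 (take the median of the lower half and of the upper half) can be sketched in Python. The function name `quartiles` is an assumption. Other quartile conventions exist, so library routines may give slightly different values for small data sets.

```python
import statistics


def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves method.

    When the number of values is odd, the middle value is left out of both halves.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return statistics.median(lower), statistics.median(ordered), statistics.median(upper)


group_a = [12, 15, 18, 21, 23, 25, 27, 31, 34, 38]
q1, median, q3 = quartiles(group_a)
print(q1, median, q3, q3 - q1)  # 18 24.0 31 13
```

The same function can be applied to any ordered or unordered list of values, including the second group in a back-to-back diagram.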
Practice 5a
The following data shows test scores for 9 students: 43, 51, 55, 57, 62, 68, 71, 74, 79. Draw a stem-and-leaf diagram and find the median and IQR.
Solution
Median: 9 values, 5th value = 62.
$Q_1$: median of lower 4 values (43, 51, 55, 57) $= \dfrac{51+55}{2} = 53$
$Q_3$: median of upper 4 values (68, 71, 74, 79) $= \dfrac{71+74}{2} = 72.5$
IQR $= 72.5 - 53 = 19.5$
6. Histograms and Frequency Density
A histogram is used to display grouped continuous data. Unlike bar charts, histograms have no gaps between bars because the data is continuous. In a histogram, the area of each bar (not the height) represents the frequency.
Frequency Density
$$\text{Frequency density} = \frac{\text{Frequency}}{\text{Class width}}$$
Rearranging: $\text{Frequency} = \text{Frequency density} \times \text{Class width}$
The $y$-axis of a histogram is always labelled "Frequency density", not "Frequency".
Example 7 — Drawing a histogram
The masses (in kg) of 63 parcels are recorded below.
| Mass (kg) | Frequency | Class width | Frequency density |
|---|---|---|---|
| $0 \leq m < 2$ | 8 | 2 | $8 \div 2 = 4$ |
| $2 \leq m < 5$ | 18 | 3 | $18 \div 3 = 6$ |
| $5 \leq m < 8$ | 21 | 3 | $21 \div 3 = 7$ |
| $8 \leq m < 12$ | 12 | 4 | $12 \div 4 = 3$ |
| $12 \leq m < 20$ | 4 | 8 | $4 \div 8 = 0.5$ |
| Total | 63 | | |
Draw a histogram with mass on the $x$-axis and frequency density on the $y$-axis. The width of each bar spans its class interval and the height equals its frequency density.
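The frequency density column can be computed with a short Python sketch. The function name `frequency_densities` and the `(lower, upper, frequency)` layout are assumptions made for this illustration.

```python
def frequency_densities(classes):
    """Return frequency / class width for each (lower, upper, frequency) class."""
    return [f / (upper - lower) for lower, upper, f in classes]


parcels = [(0, 2, 8), (2, 5, 18), (5, 8, 21), (8, 12, 12), (12, 20, 4)]
print(frequency_densities(parcels))  # [4.0, 6.0, 7.0, 3.0, 0.5]
```

These values are the bar heights. The class with the largest frequency density (here $5 \leq m < 8$) gives the tallest bar, even when another class is wider.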
Example 8 — Reading a histogram
A histogram shows the following frequency densities for age groups: $0 \leq a < 10$: fd = 3.5; $10 \leq a < 20$: fd = 5; $20 \leq a < 30$: fd = 4.2. Find the number of people in each group.
Each class width = 10.
$0$–$10$: $3.5 \times 10 = 35$ people
$10$–$20$: $5 \times 10 = 50$ people
$20$–$30$: $4.2 \times 10 = 42$ people
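Going the other way, a frequency is recovered as the area of a bar. A minimal sketch, with the hypothetical helper name `frequency`:

```python
def frequency(density, lower, upper):
    """Recover a class frequency as bar area: frequency density x class width."""
    return density * (upper - lower)


age_groups = [(0, 10, 3.5), (10, 20, 5), (20, 30, 4.2)]
for lower, upper, fd in age_groups:
    print(f"{lower}-{upper}: {frequency(fd, lower, upper):.0f} people")
# 0-10: 35 people
# 10-20: 50 people
# 20-30: 42 people
```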
Practice 6a
A histogram has a bar for the interval $15 \leq t < 25$ with frequency density 3.6. A bar for $25 \leq t < 30$ has frequency density 8. Find the frequency for each interval and the total number of data values in these two classes.
Solution
Interval $15 \leq t < 25$: class width $= 10$, frequency $= 3.6 \times 10 = 36$
Interval $25 \leq t < 30$: class width $= 5$, frequency $= 8 \times 5 = 40$
Total: $36 + 40 = 76$ data values
7. Cumulative Frequency and Box Plots
Cumulative frequency is the running total of frequencies up to and including each class. A cumulative frequency curve (ogive) is used to estimate the median, quartiles, and interquartile range from grouped data.
Reading from a Cumulative Frequency Curve
For $n$ data values:
- Median at cumulative frequency $\dfrac{n}{2}$
- Lower quartile $Q_1$ at cumulative frequency $\dfrac{n}{4}$
- Upper quartile $Q_3$ at cumulative frequency $\dfrac{3n}{4}$
- Interquartile range $= Q_3 - Q_1$
Figure 9.1 — A cumulative frequency step function (blue) and a smooth cumulative frequency curve (green) for grouped data. The median and quartiles are read off horizontally from the $\frac{n}{2}$, $\frac{n}{4}$, and $\frac{3n}{4}$ levels.
Figure 9.2 — A normal distribution bell curve, illustrating the symmetric, unimodal shape that naturally arises in many large data sets (e.g., heights, examination marks).
Example 9 — Cumulative frequency table and curve
The times (in seconds) taken by 80 swimmers to complete a length are given in the grouped frequency table.
| Time (s) | Frequency | Cumulative Frequency |
|---|---|---|
| $50 \leq t < 60$ | 5 | 5 |
| $60 \leq t < 70$ | 18 | 23 |
| $70 \leq t < 80$ | 28 | 51 |
| $80 \leq t < 90$ | 20 | 71 |
| $90 \leq t < 100$ | 9 | 80 |
Step 1 Plot cumulative frequency against the upper class boundary of each interval: $(60, 5)$, $(70, 23)$, $(80, 51)$, $(90, 71)$, $(100, 80)$. Also plot $(50, 0)$.
Step 2 Join with a smooth curve.
Step 3 Read off: Median at cf = 40 → $t \approx 77$ s; $Q_1$ at cf = 20 → $t \approx 68$ s; $Q_3$ at cf = 60 → $t \approx 83$ s.
IQR $\approx 83 - 68 = 15$ s
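The running totals and the points plotted in Step 1 can be generated with a few lines of Python. This sketch only builds the table and the coordinates; it does not draw the curve, and the variable names are illustrative.

```python
from itertools import accumulate

upper_bounds = [60, 70, 80, 90, 100]
frequencies = [5, 18, 28, 20, 9]

# Running total of the frequencies gives the cumulative frequency column.
cumulative = list(accumulate(frequencies))

# Plot against upper class boundaries, starting from zero at the lowest boundary.
points = [(50, 0)] + list(zip(upper_bounds, cumulative))

print(cumulative)  # [5, 23, 51, 71, 80]
print(points)      # [(50, 0), (60, 5), (70, 23), (80, 51), (90, 71), (100, 80)]
```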
Box-and-Whisker Plots
A box-and-whisker plot (or box plot) provides a visual summary of a distribution using five key values: minimum, $Q_1$, median, $Q_3$, maximum. The box spans the IQR; the whiskers extend to the minimum and maximum values (excluding outliers).
Example 10 — Comparing distributions using box plots
Two classes sit the same test. Their results:
- Class X: Min = 20, $Q_1$ = 40, Median = 55, $Q_3$ = 70, Max = 90
- Class Y: Min = 30, $Q_1$ = 50, Median = 65, $Q_3$ = 75, Max = 95
Comparison:
- Class Y has a higher median (65 vs 55), suggesting Class Y performed better on average.
- Class X has a larger IQR ($70 - 40 = 30$) compared to Class Y ($75 - 50 = 25$), suggesting Class X's results are more spread out.
- Both classes have a similar range (Class X: 70, Class Y: 65).
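The spread measures quoted in this comparison come straight from the five-number summaries. A minimal sketch, where the function name `spread` is an assumption:

```python
def spread(summary):
    """Return (IQR, range) from a five-number summary (min, Q1, median, Q3, max)."""
    minimum, q1, _median, q3, maximum = summary
    return q3 - q1, maximum - minimum


class_x = (20, 40, 55, 70, 90)
class_y = (30, 50, 65, 75, 95)
print(spread(class_x))  # (30, 70)
print(spread(class_y))  # (25, 65)
```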
Exam Tip — Comparing Distributions
When asked to compare two distributions, always comment on both a measure of average (mean or median) and a measure of spread (range or IQR), and interpret them in context.
Practice 7a
From the cumulative frequency curve in Example 9, estimate the number of swimmers who took more than 85 seconds.
Solution
Read the cumulative frequency at $t = 85$: approximately 63 swimmers took 85 seconds or less.
Number taking more than 85 s $= 80 - 63 = 17$ swimmers.
Practice 7b
Two data sets have the following box plot summaries:
Set A: Min = 5, $Q_1$ = 12, Median = 18, $Q_3$ = 26, Max = 40
Set B: Min = 8, $Q_1$ = 15, Median = 22, $Q_3$ = 28, Max = 35
Write two comparisons between Set A and Set B.
Solution
Average: Set B has a higher median (22 vs 18), so the values in Set B tend to be larger.
Spread: Set A has a larger IQR ($26 - 12 = 14$) compared to Set B ($28 - 15 = 13$), and a larger overall range ($40 - 5 = 35$ vs $35 - 8 = 27$), so Set A's values are more spread out.
8. Mixed Practice Problems
Question 1
The number of goals scored by a football team in each of 12 matches is: 0, 1, 2, 1, 3, 0, 2, 4, 1, 2, 1, 0. Find the mean, median, mode, and range.
Solution
Ordered: 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 4
Mean $= \dfrac{0+0+0+1+1+1+1+2+2+2+3+4}{12} = \dfrac{17}{12} \approx 1.42$
Median: 12 values, average of 6th and 7th: $\dfrac{1+1}{2} = 1$
Mode $= 1$ (appears 4 times)
Range $= 4 - 0 = 4$
Question 2
The table shows the masses of 50 apples. Estimate the mean mass.
| Mass (g) | Frequency |
|---|---|
| $80 \leq m < 100$ | 8 |
| $100 \leq m < 120$ | 17 |
| $120 \leq m < 140$ | 19 |
| $140 \leq m < 160$ | 6 |
Solution
Midpoints: 90, 110, 130, 150
$\sum fm = 8 \times 90 + 17 \times 110 + 19 \times 130 + 6 \times 150$
$= 720 + 1870 + 2470 + 900 = 5960$
Estimated mean $= \dfrac{5960}{50} = 119.2$ g
Question 3
In a school of 900 pupils, 360 are in Key Stage 3 and 540 are in Key Stage 4. A stratified sample of 75 pupils is needed. How many pupils should be selected from each Key Stage?
Solution
KS3: $\dfrac{360}{900} \times 75 = 0.4 \times 75 = 30$ pupils
KS4: $\dfrac{540}{900} \times 75 = 0.6 \times 75 = 45$ pupils
Question 4
A pie chart shows that 60 people chose "blue" as their favourite colour. The angle of the blue sector is 144°. How many people were surveyed in total?
Solution
$\dfrac{60}{\text{total}} = \dfrac{144}{360} = 0.4$
Total $= \dfrac{60}{0.4} = 150$ people
Question 5
A histogram bar for the interval $20 \leq x < 30$ has frequency density 4.5, and a bar for $30 \leq x < 50$ has frequency density 2.5. Find the total frequency for these two intervals.
Solution
$20 \leq x < 30$: width $= 10$, frequency $= 4.5 \times 10 = 45$
$30 \leq x < 50$: width $= 20$, frequency $= 2.5 \times 20 = 50$
Total frequency $= 45 + 50 = 95$
Question 6
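The stem-and-leaf diagram below shows the ages of people at a community event:
| Stem | Leaf |
|---|---|
| 1 | 5 8 |
| 2 | 2 4 6 9 |
| 3 | 1 3 5 5 7 |
| 4 | 0 2 8 |
| 5 | 3 6 |
Key: 1 | 5 represents 15 years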
Find: (a) the median, (b) the interquartile range.
Solution
Data in order: 15, 18, 22, 24, 26, 29, 31, 33, 35, 35, 37, 40, 42, 48, 53, 56
$n = 16$. Median at position 8.5: $\dfrac{33+35}{2} = 34$
$Q_1$: median of lower 8 values (15–33) at position 4.5: $\dfrac{24+26}{2} = 25$
$Q_3$: median of upper 8 values (35–56) at position 4.5 within that group: $\dfrac{40+42}{2} = 41$
IQR $= 41 - 25 = 16$
Question 7
The following data shows the heights (in cm) of 10 plants: 34, 38, 42, 45, 47, 50, 53, 58, 61, 65. Draw a box-and-whisker plot using this data.
Solution
Minimum $= 34$, Maximum $= 65$
Median: 10 values, average of 5th and 6th: $\dfrac{47+50}{2} = 48.5$
$Q_1$: median of lower 5 values (34, 38, 42, 45, 47) $= 42$
$Q_3$: median of upper 5 values (50, 53, 58, 61, 65) $= 58$
Box-and-whisker plot: draw a number line, mark 34, 42, 48.5, 58, 65. Draw a box from $Q_1 = 42$ to $Q_3 = 58$ with a line at the median $48.5$. Draw whiskers from $34$ to $42$ and from $58$ to $65$.
Question 8
A scatter diagram shows the temperature ($x$°C) and the number of visitors to a park ($y$). The line of best fit has equation $y = 12x + 40$. Estimate the number of visitors when the temperature is 18°C, and comment on the reliability of using the line of best fit for a temperature of $-5$°C.
Solution
At $x = 18$: $y = 12(18) + 40 = 216 + 40 = 256$ visitors.
At $x = -5$°C: This is extrapolation — using the line of best fit outside the range of observed temperatures. The result ($y = 12(-5) + 40 = -20$ visitors) is not meaningful. Predictions outside the data range are unreliable.
Question 9
The cumulative frequency for the times (in minutes) taken by 100 people to complete a puzzle is as follows: by 10 min: 8; by 15 min: 27; by 20 min: 56; by 25 min: 81; by 30 min: 100. Estimate the median and interquartile range.
Solution
$n = 100$. Median at cf $= 50$; from the data, cf goes from 27 at 15 min to 56 at 20 min.
By linear interpolation: $15 + \dfrac{50-27}{56-27} \times 5 = 15 + \dfrac{23}{29} \times 5 \approx 15 + 3.97 \approx 19.0$ min
$Q_1$ at cf $= 25$: between 10 min (cf = 8) and 15 min (cf = 27), so $Q_1$ is a little under 15 min. More precisely: $10 + \dfrac{25-8}{27-8} \times 5 = 10 + \dfrac{17}{19} \times 5 \approx 14.5$ min
$Q_3$ at cf $= 75$: between 20 min (cf=56) and 25 min (cf=81): $20 + \dfrac{75-56}{81-56} \times 5 = 20 + \dfrac{19}{25} \times 5 = 20 + 3.8 = 23.8$ min
IQR $\approx 23.8 - 14.5 = 9.3$ min
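The linear interpolation used in this solution can be sketched in Python. The function name `value_at` is hypothetical, and the sketch assumes the cumulative frequencies are strictly increasing. Straight-line interpolation may differ slightly from values read off a smooth hand-drawn curve.

```python
def value_at(cf_target, points):
    """Linearly interpolate the data value at a given cumulative frequency.

    points is a list of (upper boundary, cumulative frequency) pairs in increasing order.
    """
    for (x0, cf0), (x1, cf1) in zip(points, points[1:]):
        if cf0 <= cf_target <= cf1:
            # Move along the interval in proportion to how far the target is past cf0.
            return x0 + (cf_target - cf0) / (cf1 - cf0) * (x1 - x0)
    raise ValueError("target lies outside the cumulative frequency range")


times = [(10, 8), (15, 27), (20, 56), (25, 81), (30, 100)]
q1, median, q3 = (value_at(cf, times) for cf in (25, 50, 75))
print(round(q1, 1), round(median, 1), round(q3, 1))  # 14.5 19.0 23.8
```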
Question 10
Two classes each took a test out of 50 marks. Class A: median = 32, IQR = 18, range = 42. Class B: median = 36, IQR = 10, range = 30. Write a comparison of the two distributions, referring to both average and spread.
Solution
Average: Class B has a higher median mark (36 vs 32), so Class B performed better on average.
Spread: Class A has a larger IQR (18 vs 10) and a larger range (42 vs 30), so Class A's marks are more spread out and less consistent. Class B's results are more tightly clustered around the median.