IGCSE Mathematics: Statistics

Cambridge IGCSE 0580 & Edexcel 4MA1 · Updated March 2026

Statistics is the science of collecting, organising, summarising, and interpreting data. In this chapter you will learn to classify different types of data, calculate measures of average and spread, construct and interpret a wide range of statistical diagrams, and draw conclusions by comparing distributions. These skills are essential for analysing real-world information critically.

Specification Note

Content labelled Extended is required for Extended tier (Cambridge) or Higher tier (Edexcel) only.

1. Types of Data and Sampling

Classifying Data

Types of Data

Sampling Methods

It is rarely practical to collect data from every member of a population. Instead, we select a sample. The method of selection affects how representative the sample is.

Example 1 — Stratified sampling

A school has 600 students: 240 in Year 10 and 360 in Year 11. A stratified sample of 50 students is to be selected. How many should come from each year group?

Step 1 Find the proportion from each year group:

Year 10: $\dfrac{240}{600} \times 50 = 20$ students

Year 11: $\dfrac{360}{600} \times 50 = 30$ students

Check: $20 + 30 = 50$ ✓
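The proportional allocation in Example 1 can be checked with a short Python sketch. The function name `stratified_allocation` and the dictionary layout are illustrative assumptions, not part of any specification.

```python
from fractions import Fraction

def stratified_allocation(strata_sizes, sample_size):
    """Return how many to sample from each stratum, in proportion to its size."""
    population = sum(strata_sizes.values())
    allocation = {}
    for name, size in strata_sizes.items():
        # Exact fractions avoid floating-point rounding surprises.
        share = Fraction(size, population) * sample_size
        allocation[name] = round(share)
    return allocation

year_groups = {"Year 10": 240, "Year 11": 360}
sample = stratified_allocation(year_groups, 50)
print(sample)  # {'Year 10': 20, 'Year 11': 30}
```

When the shares are not whole numbers, the rounded values may not add up to the required sample size, so the total should always be checked as in the example.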

Practice 1a

Classify each of the following as qualitative, discrete, or continuous: (i) the length of a leaf in cm, (ii) the number of pets owned, (iii) the brand of a mobile phone, (iv) the time taken to complete a race.

Solution

(i) Continuous — length can take any value in a range.

(ii) Discrete — whole numbers only.

(iii) Qualitative — described by a category (brand name).

(iv) Continuous — time can take any value.

2. Averages and Spread

Measures of Average

Measures of Spread

Mean from a Frequency Table

$$\bar{x} = \frac{\sum f x}{\sum f}$$

where $f$ is the frequency and $x$ is the value.

Estimating the Mean from Grouped Data

Use the midpoint of each class interval as the representative value $x$:

$$\bar{x} \approx \frac{\sum f m}{\sum f}$$

where $m$ is the midpoint of each class. This is an estimate because we do not know the exact values within each class.

Example 2 — Mean from a frequency table

The table shows the number of books read by 20 students last month.

| Books read ($x$) | Frequency ($f$) | $fx$ |
|---|---|---|
| 0 | 3 | 0 |
| 1 | 5 | 5 |
| 2 | 7 | 14 |
| 3 | 4 | 12 |
| 4 | 1 | 4 |
| Total | 20 | 35 |

$\bar{x} = \dfrac{35}{20} = 1.75$ books

Median: $20$ values, so median is between the 10th and 11th values. Cumulative frequencies: 0→3, 1→8, 2→15. Both the 10th and 11th values lie in the $x=2$ group. Median $= 2$.

Mode: $x = 2$ (highest frequency = 7).
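As a check on Example 2, the three averages can be computed directly from the frequency table. This is a minimal sketch; the variable names are chosen here for illustration.

```python
books = [0, 1, 2, 3, 4]   # values x
freq = [3, 5, 7, 4, 1]    # frequencies f

n = sum(freq)
mean = sum(x * f for x, f in zip(books, freq)) / n  # sum of fx divided by sum of f

# Expand the table into an ordered list so the middle values can be located.
values = [x for x, f in zip(books, freq) for _ in range(f)]
mid = n // 2
if n % 2 == 0:
    median = (values[mid - 1] + values[mid]) / 2  # average of the two middle values
else:
    median = values[mid]

# The mode is the value with the highest frequency.
mode = books[freq.index(max(freq))]

print(mean, median, mode)  # 1.75 2.0 2
```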

Example 3 — Estimating mean from grouped data

The heights (in cm) of 30 students are recorded in the grouped frequency table below.

| Height (cm) | Frequency ($f$) | Midpoint ($m$) | $fm$ |
|---|---|---|---|
| $150 \leq h < 160$ | 6 | 155 | 930 |
| $160 \leq h < 170$ | 11 | 165 | 1815 |
| $170 \leq h < 180$ | 9 | 175 | 1575 |
| $180 \leq h < 190$ | 4 | 185 | 740 |
| Total | 30 | | 5060 |

Estimated mean $= \dfrac{5060}{30} \approx 168.7$ cm
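The midpoint method in Example 3 can be sketched in Python as follows. Each class is written as a tuple (lower bound, upper bound, frequency), a layout assumed here for convenience.

```python
# (lower bound, upper bound, frequency) for each height class in cm
classes = [(150, 160, 6), (160, 170, 11), (170, 180, 9), (180, 190, 4)]

total_f = sum(f for _, _, f in classes)
# Use the midpoint of each class as the representative value.
total_fm = sum(f * (lo + hi) / 2 for lo, hi, f in classes)

estimated_mean = total_fm / total_f
print(total_fm, round(estimated_mean, 1))  # 5060.0 168.7
```

The result is only an estimate because every value in a class is treated as if it sat at the midpoint.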

Practice 2a

The ages of 8 people are: 14, 17, 15, 21, 14, 19, 16, 14. Find the mean, median, mode, and range.

Solution

Ordered: 14, 14, 14, 15, 16, 17, 19, 21

Mean $= \dfrac{14+14+14+15+16+17+19+21}{8} = \dfrac{130}{8} = 16.25$

Median: $n=8$, position $= 4.5$, so average of 4th and 5th values $= \dfrac{15+16}{2} = 15.5$

Mode $= 14$ (appears 3 times)

Range $= 21 - 14 = 7$
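For raw (ungrouped) data like this, the answers can be confirmed with Python's standard `statistics` module. This is an optional check, not an exam method.

```python
import statistics

ages = [14, 17, 15, 21, 14, 19, 16, 14]

mean_age = statistics.mean(ages)      # 16.25
median_age = statistics.median(ages)  # 15.5 (average of the two middle values)
mode_age = statistics.mode(ages)      # 14
age_range = max(ages) - min(ages)     # 7

print(mean_age, median_age, mode_age, age_range)
```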

3. Charts and Diagrams

Statistical diagrams provide a visual summary of data. The choice of diagram depends on the type of data and what you want to communicate.

Bar Charts

Used for qualitative or discrete data. Bars are drawn with gaps between them. The height of each bar represents the frequency (or relative frequency). Compound bar charts place related data in stacked or side-by-side bars. Dual bar charts show two data sets alongside each other for comparison.

Pie Charts

A pie chart divides a circle into sectors, where each sector represents a category. The angle for each sector is calculated as:

$$\text{angle} = \frac{\text{frequency}}{\text{total frequency}} \times 360°$$

Pictograms

Use symbols or pictures to represent data. A key is always included to show what each symbol represents. Partial symbols represent fractions of the unit value.

Example 4 — Pie chart calculation

In a survey, 80 students chose their favourite sport: Football 32, Basketball 20, Tennis 16, Swimming 12. Calculate the angle for each sector.

| Sport | Frequency | Angle |
|---|---|---|
| Football | 32 | $\dfrac{32}{80} \times 360 = 144°$ |
| Basketball | 20 | $\dfrac{20}{80} \times 360 = 90°$ |
| Tennis | 16 | $\dfrac{16}{80} \times 360 = 72°$ |
| Swimming | 12 | $\dfrac{12}{80} \times 360 = 54°$ |
| Total | 80 | 360° |
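The sector-angle formula can be applied to the whole survey at once. The sketch below assumes the frequencies are held in a dictionary; `sector_angles` is an illustrative name.

```python
def sector_angles(frequencies):
    """Return the pie chart angle in degrees for each category."""
    total = sum(frequencies.values())
    # angle = frequency / total frequency * 360
    return {name: f * 360 / total for name, f in frequencies.items()}

sports = {"Football": 32, "Basketball": 20, "Tennis": 16, "Swimming": 12}
angles = sector_angles(sports)
print(angles)
# {'Football': 144.0, 'Basketball': 90.0, 'Tennis': 72.0, 'Swimming': 54.0}
```

The angles should always sum to 360°, which gives a quick check on the arithmetic.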

Practice 3a

A pie chart shows the transport used by 120 pupils to travel to school. The sector for "bus" has an angle of 150°. How many pupils travel by bus?

Solution

Number of pupils $= \dfrac{150}{360} \times 120 = \dfrac{5}{12} \times 120 = 50$ pupils.

4. Scatter Diagrams and Correlation

A scatter diagram (or scatter graph) plots pairs of values $(x, y)$ to investigate whether a relationship exists between two variables.

Types of Correlation

Line of Best Fit

A line of best fit (or trend line) is drawn through the middle of the data points so that roughly equal numbers of points lie on each side. It should pass through the mean point $(\bar{x}, \bar{y})$.

Example 5 — Interpreting a scatter diagram

A scatter diagram shows the marks scored in Maths ($x$) and Science ($y$) by 10 students. The points show a strong positive correlation. The line of best fit passes through $(20, 25)$ and $(60, 65)$.

Step 1 Find the gradient: $m = \dfrac{65-25}{60-20} = \dfrac{40}{40} = 1$

Step 2 Equation of line: $y - 25 = 1(x - 20) \Rightarrow y = x + 5$

Step 3 Estimate Science mark for Maths mark of 45: $y = 45 + 5 = 50$

This is interpolation — reliable, as 45 lies within the data range.
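Steps 1 to 3 can be reproduced with a small Python sketch. The helper `line_through` is a hypothetical name used here; it finds the gradient and intercept of the straight line through two given points.

```python
def line_through(p1, p2):
    """Return (gradient, intercept) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    gradient = (y2 - y1) / (x2 - x1)
    intercept = y1 - gradient * x1
    return gradient, intercept

m, c = line_through((20, 25), (60, 65))
science_estimate = m * 45 + c  # estimated Science mark for a Maths mark of 45

print(m, c, science_estimate)  # 1.0 5.0 50.0
```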

Important: Correlation and Causation

Correlation does not imply causation. Just because two variables are correlated does not mean that one causes the other. There may be a third variable influencing both, or the correlation may be coincidental.

Practice 4a

Describe the type of correlation you would expect between: (i) the temperature and the number of ice creams sold; (ii) the age of a car and its resale value; (iii) a person's height and their IQ score.

Solution

(i) Positive correlation — as temperature increases, more ice creams are sold.

(ii) Negative correlation — as a car ages, its resale value generally decreases.

(iii) No correlation — height and IQ are not related.

5. Stem-and-Leaf Diagrams

A stem-and-leaf diagram organises numerical data by splitting each value into a stem (the leading digit(s)) and a leaf (the final digit). The data retains its original values, making it easy to find the median and quartiles.

In a back-to-back stem-and-leaf diagram, two sets of data share a common stem, with one set's leaves going left and the other's going right. This allows direct comparison of two distributions.

Example 6 — Back-to-back stem-and-leaf diagram

The times (in minutes) taken by Group A and Group B to complete a puzzle are:

Group A: 12, 15, 18, 21, 23, 25, 27, 31, 34, 38

Group B: 14, 16, 19, 20, 24, 26, 29, 30, 32, 35

| Group A | Stem | Group B |
|---:|:---:|:---|
| 8 5 2 | 1 | 4 6 9 |
| 7 5 3 1 | 2 | 0 4 6 9 |
| 8 4 1 | 3 | 0 2 5 |

Key: 2|1|4 means 12 minutes (A) and 14 minutes (B)

Step 1 Find the median of Group A: 10 values, median is between 5th and 6th: $\dfrac{23+25}{2} = 24$ minutes.

Step 2 $Q_1$ of Group A: median of lower 5 values (12, 15, 18, 21, 23) = 18.

Step 3 $Q_3$ of Group A: median of upper 5 values (25, 27, 31, 34, 38) = 31.

Step 4 IQR of Group A $= 31 - 18 = 13$ minutes.

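Group B median: $\dfrac{24+26}{2} = 25$ minutes. Group B $Q_1 = 19$ (median of 14, 16, 19, 20, 24), $Q_3 = 30$ (median of 26, 29, 30, 32, 35), IQR $= 30 - 19 = 11$ minutes.

Comparison: Group B has a slightly higher median (25 vs 24 minutes), so Group B was slightly slower on average. Group A has the larger IQR (13 vs 11 minutes), so the middle half of Group A's times is slightly more spread out.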
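The quartile method used in Steps 1 to 4 (take the median of the lower half and of the upper half) can be sketched in Python. Other quartile conventions exist and can give slightly different values, so this sketch deliberately avoids library functions and follows the method shown here. It assumes at least two data values.

```python
def median(sorted_values):
    """Median of a list that is already in ascending order."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 0:
        return (sorted_values[mid - 1] + sorted_values[mid]) / 2
    return sorted_values[mid]

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]   # for an odd count, the middle value is left out
    upper = data[-half:]  # of both halves
    return median(lower), median(data), median(upper)

group_a = [12, 15, 18, 21, 23, 25, 27, 31, 34, 38]
q1, med, q3 = quartiles(group_a)
print(q1, med, q3, q3 - q1)  # 18 24.0 31 13
```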

Practice 5a

The following data shows test scores for 9 students: 43, 51, 55, 57, 62, 68, 71, 74, 79. Draw a stem-and-leaf diagram and find the median and IQR.

Solution

| Stem | Leaves |
|---|---|
| 4 | 3 |
| 5 | 1 5 7 |
| 6 | 2 8 |
| 7 | 1 4 9 |

Key: 4|3 = 43

Median: 9 values, 5th value = 62.

$Q_1$: median of lower 4 values (43, 51, 55, 57) $= \dfrac{51+55}{2} = 53$

$Q_3$: median of upper 4 values (68, 71, 74, 79) $= \dfrac{71+74}{2} = 72.5$

IQR $= 72.5 - 53 = 19.5$

6. Histograms and Frequency Density

A histogram is used to display grouped continuous data. Unlike bar charts, histograms have no gaps between bars because the data is continuous. In a histogram, the area of each bar (not the height) represents the frequency.

Frequency Density

$$\text{Frequency density} = \frac{\text{Frequency}}{\text{Class width}}$$

Rearranging: $\text{Frequency} = \text{Frequency density} \times \text{Class width}$

The $y$-axis of a histogram is always labelled "Frequency density", not "Frequency".

Example 7 — Drawing a histogram

The masses (in kg) of 63 parcels are recorded below.

| Mass (kg) | Frequency | Class width | Frequency density |
|---|---|---|---|
| $0 \leq m < 2$ | 8 | 2 | $8 \div 2 = 4$ |
| $2 \leq m < 5$ | 18 | 3 | $18 \div 3 = 6$ |
| $5 \leq m < 8$ | 21 | 3 | $21 \div 3 = 7$ |
| $8 \leq m < 12$ | 12 | 4 | $12 \div 4 = 3$ |
| $12 \leq m < 20$ | 4 | 8 | $4 \div 8 = 0.5$ |
| Total | 63 | | |

Draw a histogram with mass on the $x$-axis and frequency density on the $y$-axis. The width of each bar spans its class interval and the height equals its frequency density.
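The frequency densities in Example 7 can be checked with a few lines of Python. Each class is written as (lower bound, upper bound, frequency), a layout assumed here for illustration.

```python
# (lower bound, upper bound, frequency) for each mass class in kg
classes = [(0, 2, 8), (2, 5, 18), (5, 8, 21), (8, 12, 12), (12, 20, 4)]

# frequency density = frequency / class width
densities = [f / (hi - lo) for lo, hi, f in classes]
print(densities)  # [4.0, 6.0, 7.0, 3.0, 0.5]

# Bar area (height * width) gives back the frequency of each class.
areas = [d * (hi - lo) for d, (lo, hi, _) in zip(densities, classes)]
print(areas)  # [8.0, 18.0, 21.0, 12.0, 4.0]
```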

Example 8 — Reading a histogram

A histogram shows the following frequency densities for age groups: $0 \leq a < 10$: fd = 3.5; $10 \leq a < 20$: fd = 5; $20 \leq a < 30$: fd = 4.2. Find the number of people in each group.

Each class width = 10.

$0$–$10$: $3.5 \times 10 = 35$ people

$10$–$20$: $5 \times 10 = 50$ people

$20$–$30$: $4.2 \times 10 = 42$ people

Practice 6a

A histogram has a bar for the interval $15 \leq t < 25$ with frequency density 3.6. A bar for $25 \leq t < 30$ has frequency density 8. Find the frequency for each interval and the total number of data values in these two classes.

Solution

Interval $15 \leq t < 25$: class width $= 10$, frequency $= 3.6 \times 10 = 36$

Interval $25 \leq t < 30$: class width $= 5$, frequency $= 8 \times 5 = 40$

Total: $36 + 40 = 76$ data values

7. Cumulative Frequency and Box Plots

Cumulative frequency is the running total of frequencies up to and including each class. A cumulative frequency curve (ogive) is used to estimate the median, quartiles, and interquartile range from grouped data.

Reading from a Cumulative Frequency Curve

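For $n$ data values:

Median: read across from cumulative frequency $\frac{n}{2}$.

Lower quartile $Q_1$: read across from cumulative frequency $\frac{n}{4}$.

Upper quartile $Q_3$: read across from cumulative frequency $\frac{3n}{4}$.

IQR $= Q_3 - Q_1$.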

Figure 9.1 — A cumulative frequency step function (blue) and a smooth cumulative frequency curve (green) for grouped data. The median and quartiles are read off horizontally from the $\frac{n}{2}$ and $\frac{n}{4}$ levels.

Figure 9.2 — A normal distribution bell curve, illustrating the symmetric, unimodal shape that naturally arises in many large data sets (e.g., heights, examination marks).

Example 9 — Cumulative frequency table and curve

The times (in seconds) taken by 80 swimmers to complete a length are given in the grouped frequency table.

| Time (s) | Frequency | Cumulative Frequency |
|---|---|---|
| $50 \leq t < 60$ | 5 | 5 |
| $60 \leq t < 70$ | 18 | 23 |
| $70 \leq t < 80$ | 28 | 51 |
| $80 \leq t < 90$ | 20 | 71 |
| $90 \leq t < 100$ | 9 | 80 |

Step 1 Plot cumulative frequency against the upper class boundary of each interval: $(60, 5)$, $(70, 23)$, $(80, 51)$, $(90, 71)$, $(100, 80)$. Also plot $(50, 0)$.

Step 2 Join with a smooth curve.

Step 3 Read off: Median at cf = 40 → $t \approx 77$ s; $Q_1$ at cf = 20 → $t \approx 68$ s; $Q_3$ at cf = 60 → $t \approx 83$ s.

IQR $\approx 83 - 68 = 15$ s
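The running totals and plotting points for Example 9 can be generated with `itertools.accumulate` from the Python standard library. This sketch only builds the points; drawing the curve and reading values from it is still done by hand.

```python
from itertools import accumulate

upper_bounds = [60, 70, 80, 90, 100]  # upper class boundaries in seconds
frequencies = [5, 18, 28, 20, 9]

# Cumulative frequency is the running total of the frequencies.
cumulative = list(accumulate(frequencies))
print(cumulative)  # [5, 23, 51, 71, 80]

# Plot against upper class boundaries, starting from (50, 0).
points = [(50, 0)] + list(zip(upper_bounds, cumulative))
print(points)
# [(50, 0), (60, 5), (70, 23), (80, 51), (90, 71), (100, 80)]
```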

Box-and-Whisker Plots

A box-and-whisker plot (or box plot) provides a visual summary of a distribution using five key values: minimum, $Q_1$, median, $Q_3$, maximum. The box spans the IQR; the whiskers extend to the minimum and maximum values (excluding outliers).

Example 10 — Comparing distributions using box plots

Two classes sit the same test. Their results:

Comparison:

Exam Tip — Comparing Distributions

When asked to compare two distributions, always comment on both a measure of average (mean or median) and a measure of spread (range or IQR), and interpret them in context.

Practice 7a

From the cumulative frequency curve in Example 9, estimate the number of swimmers who took more than 85 seconds.

Solution

Read the cumulative frequency at $t = 85$: approximately 63 swimmers took 85 seconds or less.

Number taking more than 85 s $= 80 - 63 = 17$ swimmers.

Practice 7b

Two data sets have the following box plot summaries:

Set A: Min = 5, $Q_1$ = 12, Median = 18, $Q_3$ = 26, Max = 40

Set B: Min = 8, $Q_1$ = 15, Median = 22, $Q_3$ = 28, Max = 35

Write two comparisons between Set A and Set B.

Solution

Average: Set B has a higher median (22 vs 18), so the values in Set B tend to be larger.

Spread: Set A has a larger IQR ($26 - 12 = 14$) compared to Set B ($28 - 15 = 13$), and a larger overall range ($40 - 5 = 35$ vs $35 - 8 = 27$), so Set A's values are more spread out.

8. Mixed Practice Problems

Question 1

The number of goals scored by a football team in each of 12 matches is: 0, 1, 2, 1, 3, 0, 2, 4, 1, 2, 1, 0. Find the mean, median, mode, and range.

Solution

Ordered: 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 4

Mean $= \dfrac{0+0+0+1+1+1+1+2+2+2+3+4}{12} = \dfrac{17}{12} \approx 1.42$

Median: 12 values, average of 6th and 7th: $\dfrac{1+1}{2} = 1$

Mode $= 1$ (appears 4 times)

Range $= 4 - 0 = 4$

Question 2

The table shows the masses of 50 apples. Estimate the mean mass.

| Mass (g) | Frequency |
|---|---|
| $80 \leq m < 100$ | 8 |
| $100 \leq m < 120$ | 17 |
| $120 \leq m < 140$ | 19 |
| $140 \leq m < 160$ | 6 |

Solution

Midpoints: 90, 110, 130, 150

$\sum fm = 8 \times 90 + 17 \times 110 + 19 \times 130 + 6 \times 150$

$= 720 + 1870 + 2470 + 900 = 5960$

Estimated mean $= \dfrac{5960}{50} = 119.2$ g

Question 3

In a school of 900 pupils, 360 are in Key Stage 3 and 540 are in Key Stage 4. A stratified sample of 75 pupils is needed. How many pupils should be selected from each Key Stage?

Solution

KS3: $\dfrac{360}{900} \times 75 = 0.4 \times 75 = 30$ pupils

KS4: $\dfrac{540}{900} \times 75 = 0.6 \times 75 = 45$ pupils

Question 4

A pie chart shows that 60 people chose "blue" as their favourite colour. The angle of the blue sector is 144°. How many people were surveyed in total?

Solution

$\dfrac{60}{\text{total}} = \dfrac{144}{360} = 0.4$

Total $= \dfrac{60}{0.4} = 150$ people

Question 5

A histogram bar for the interval $20 \leq x < 30$ has frequency density 4.5, and a bar for $30 \leq x < 50$ has frequency density 2.5. Find the total frequency for these two intervals.

Solution

$20 \leq x < 30$: width $= 10$, frequency $= 4.5 \times 10 = 45$

$30 \leq x < 50$: width $= 20$, frequency $= 2.5 \times 20 = 50$

Total frequency $= 45 + 50 = 95$

Question 6

The stem-and-leaf diagram below shows the ages of people at a community event:

| Stem | Leaves |
|---|---|
| 1 | 5 8 |
| 2 | 2 4 6 9 |
| 3 | 1 3 5 5 7 |
| 4 | 0 2 8 |
| 5 | 3 6 |

Key: 1|5 = 15 years

Find: (a) the median, (b) the interquartile range.

Solution

Data in order: 15, 18, 22, 24, 26, 29, 31, 33, 35, 35, 37, 40, 42, 48, 53, 56

$n = 16$. Median at position 8.5: $\dfrac{33+35}{2} = 34$

$Q_1$: median of lower 8 values (15–33) at position 4.5: $\dfrac{24+26}{2} = 25$

$Q_3$: median of upper 8 values (35–56) at position 4.5 within that group: $\dfrac{40+42}{2} = 41$

IQR $= 41 - 25 = 16$

Question 7

The following data shows the heights (in cm) of 10 plants: 34, 38, 42, 45, 47, 50, 53, 58, 61, 65. Draw a box-and-whisker plot using this data.

Solution

Minimum $= 34$, Maximum $= 65$

Median: 10 values, average of 5th and 6th: $\dfrac{47+50}{2} = 48.5$

$Q_1$: median of lower 5 values (34, 38, 42, 45, 47) $= 42$

$Q_3$: median of upper 5 values (50, 53, 58, 61, 65) $= 58$

Box-and-whisker plot: draw a number line, mark 34, 42, 48.5, 58, 65. Draw a box from $Q_1 = 42$ to $Q_3 = 58$ with a line at the median $48.5$. Draw whiskers from $34$ to $42$ and from $58$ to $65$.

Question 8

A scatter diagram shows the temperature ($x$°C) and the number of visitors to a park ($y$). The line of best fit has equation $y = 12x + 40$. Estimate the number of visitors when the temperature is 18°C, and comment on the reliability of using the line of best fit for a temperature of $-5$°C.

Solution

At $x = 18$: $y = 12(18) + 40 = 216 + 40 = 256$ visitors.

At $x = -5$°C: This is extrapolation — using the line of best fit outside the range of observed temperatures. The result ($y = 12(-5) + 40 = -20$ visitors) is not meaningful. Predictions outside the data range are unreliable.

Question 9

The cumulative frequency for the times (in minutes) taken by 100 people to complete a puzzle is as follows: by 10 min: 8; by 15 min: 27; by 20 min: 56; by 25 min: 81; by 30 min: 100. Estimate the median and interquartile range.

Solution

$n = 100$. Median at cf $= 50$; from the data, cf goes from 27 at 15 min to 56 at 20 min.

By linear interpolation: $15 + \dfrac{50-27}{56-27} \times 5 = 15 + \dfrac{23}{29} \times 5 \approx 15 + 3.97 \approx 19.0$ min

$Q_1$ at cf $= 25$: between 10 min (cf = 8) and 15 min (cf = 27): $10 + \dfrac{25-8}{27-8} \times 5 = 10 + \dfrac{17}{19} \times 5 \approx 14.5$ min

$Q_3$ at cf $= 75$: between 20 min (cf=56) and 25 min (cf=81): $20 + \dfrac{75-56}{81-56} \times 5 = 20 + \dfrac{19}{25} \times 5 = 20 + 3.8 = 23.8$ min

IQR $\approx 23.8 - 14.5 = 9.3$ min
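The linear interpolation used in this solution can be sketched as a small Python function. It assumes the data points are joined by straight lines (a cumulative frequency polygon), so readings from a smooth hand-drawn curve may differ slightly. The name `interpolate` is illustrative.

```python
def interpolate(points, target_cf):
    """Estimate the time at which the cumulative frequency reaches target_cf.

    points: (time, cumulative frequency) pairs in increasing order.
    """
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= target_cf <= c1:
            # Move along the segment in proportion to the cf still needed.
            return x0 + (target_cf - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target cumulative frequency is outside the table")

points = [(10, 8), (15, 27), (20, 56), (25, 81), (30, 100)]
n = 100

q1 = interpolate(points, n / 4)
median = interpolate(points, n / 2)
q3 = interpolate(points, 3 * n / 4)

print(round(q1, 1), round(median, 1), round(q3, 1))  # 14.5 19.0 23.8
print(round(q3 - q1, 1))  # 9.3
```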

Question 10

Two classes each took a test out of 50 marks. Class A: median = 32, IQR = 18, range = 42. Class B: median = 36, IQR = 10, range = 30. Write a comparison of the two distributions, referring to both average and spread.

Solution

Average: Class B has a higher median mark (36 vs 32), so Class B performed better on average.

Spread: Class A has a larger IQR (18 vs 10) and a larger range (42 vs 30), so Class A's marks are more spread out and less consistent. Class B's results are more tightly clustered around the median.