1. Sampling Methods

Sampling Techniques

| Method | Description | Advantage |
| --- | --- | --- |
| Simple random | Every member has equal chance; use random number table/GDC | Unbiased, no prior knowledge needed |
| Systematic | Select every $k$th item from list (e.g., every 10th) | Simple to apply, spread across population |
| Stratified | Divide into groups (strata), sample proportionally from each | Ensures representation of all subgroups |
| Convenience | Use whoever is available | Easy, but biased |
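
The first three methods can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the `students` list and the two year-group strata are assumptions made up for the example, and the standard-library `random` module stands in for a random number table or GDC.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Every member has an equal chance of being chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, k, seed=None):
    """Choose a random start among the first k items, then take every k-th item."""
    rng = random.Random(seed)
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, n, seed=None):
    """Sample from each stratum in proportion to its size.

    `strata` maps a stratum name to the list of its members.
    Because of rounding, the overall size can differ slightly from n.
    """
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        share = round(n * len(members) / total)
        sample[name] = rng.sample(members, share)
    return sample

# Hypothetical population: 100 student ID numbers split into two year groups.
students = list(range(1, 101))
strata = {"Year 12": students[:60], "Year 13": students[60:]}

print(simple_random_sample(students, 10, seed=1))
print(systematic_sample(students, 10, seed=1))
print(stratified_sample(strata, 10, seed=1))
```

With 60 and 40 members in the two strata, a stratified sample of 10 takes 6 and 4 members respectively, which is the "sample proportionally" idea in the table.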

2. Bivariate Data and Correlation

Pearson's Correlation Coefficient

$r$ measures the strength and direction of a linear relationship: $-1 \le r \le 1$.

Regression line $y$ on $x$: $\hat{y} = ax + b$ (minimises sum of squared vertical residuals). Always passes through $(\bar{x}, \bar{y})$.
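
As a rough check of these definitions, the sketch below computes $r$ and the least-squares coefficients $a$ and $b$ from sums of squares, then confirms that the line passes through $(\bar{x}, \bar{y})$. The data set and the function name `pearson_and_regression` are invented for illustration; in an exam the GDC does this calculation.

```python
from statistics import mean

def pearson_and_regression(xs, ys):
    """Return (r, a, b) for the regression line y-hat = a*x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = sxy / (sxx * syy) ** 0.5   # Pearson's correlation coefficient
    a = sxy / sxx                  # gradient minimising squared vertical residuals
    b = y_bar - a * x_bar          # intercept chosen so the line passes through the mean point
    return r, a, b

# Illustrative data only (not taken from the notes).
hours = [1, 2, 3, 4, 5, 6]
scores = [42, 50, 55, 61, 72, 76]

r, a, b = pearson_and_regression(hours, scores)
print(f"r = {r:.3f}, a = {a:.3f}, b = {b:.3f}")

# The fitted value at x-bar equals y-bar (up to rounding error).
print(abs(a * mean(hours) + b - mean(scores)) < 1e-9)
```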

Spearman's rank correlation $r_s$: use when the data are ordinal (ranked) or when the relationship is monotonic but not linear.
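
One common way to obtain $r_s$ is to rank both variables and apply Pearson's formula to the ranks. The sketch below assumes that approach, with tied values sharing their mean rank; the helper names and the cubic data are made up to show a relationship that is perfectly monotonic but not linear.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5

def spearman(xs, ys):
    """Spearman's r_s computed as Pearson's r on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Illustrative data: y = x^3 is increasing but not linear.
xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 3 for x in xs]

print(f"Pearson r   = {pearson(xs, ys):.3f}")
print(f"Spearman rs = {spearman(xs, ys):.3f}")
```

Here the ranks agree exactly, so $r_s = 1$, while Pearson's $r$ falls below 1 because the points do not lie on a straight line.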

Worked Example 1

Interpreting a Regression Line

Study hours ($x$) and test scores ($y$) give regression line $\hat{y} = 7.2x + 34$, with $r = 0.91$.

1. Interpret gradient: For each additional hour studied, the predicted score increases by 7.2 marks.
2. Interpret $y$-intercept: A student who studies 0 hours is predicted to score 34. (May not be meaningful in context.)
3. Interpret $r = 0.91$: There is a strong positive linear correlation between hours studied and test score.
4. Predict: For $x = 5$ hours: $\hat{y} = 7.2(5)+34 = 70$ marks. Provided 5 hours lies within the range of the observed data, this is interpolation and the prediction is reliable.

3. Normal Distribution

Normal Distribution: $X \sim N(\mu, \sigma^2)$

The normal distribution is bell-shaped and symmetric about the mean $\mu$. The standard deviation $\sigma$ controls the spread.

Standardisation: $Z = \dfrac{X-\mu}{\sigma}$, where $Z\sim N(0,1)$.

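Key percentages (empirical rule): approximately 68% of values lie within $\mu \pm \sigma$, 95% within $\mu \pm 2\sigma$, and 99.7% within $\mu \pm 3\sigma$.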

Use GDC normalcdf($a$, $b$, $\mu$, $\sigma$) for $P(a \le X \le b)$. Use invNorm($p$, $\mu$, $\sigma$) for the value with $P(X \le x) = p$.
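
For readers without a GDC to hand, the two calculator functions can be imitated with Python's standard-library `statistics.NormalDist`. The wrappers `normalcdf` and `inv_norm` below are hypothetical helpers written to mirror the calculator syntax, and the values $\mu = 100$, $\sigma = 15$ are arbitrary. The loop reproduces the empirical-rule percentages, and the last lines show that standardising does not change the probability.

```python
from statistics import NormalDist

def normalcdf(a, b, mu, sigma):
    """P(a <= X <= b) for X ~ N(mu, sigma^2)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(b) - dist.cdf(a)

def inv_norm(p, mu, sigma):
    """The value x with P(X <= x) = p."""
    return NormalDist(mu, sigma).inv_cdf(p)

mu, sigma = 100, 15  # arbitrary illustrative parameters

# Probability of lying within 1, 2 and 3 standard deviations of the mean.
for k in (1, 2, 3):
    p = normalcdf(mu - k * sigma, mu + k * sigma, mu, sigma)
    print(f"within {k} sd: {p:.4f}")

# Standardisation: P(X <= x) equals P(Z <= z) with z = (x - mu) / sigma.
x = 120
z = (x - mu) / sigma
print(NormalDist(mu, sigma).cdf(x), NormalDist(0, 1).cdf(z))
```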

Worked Example 2

Normal Distribution: Test Scores

Test scores are $N(68, 12^2)$. Find: (a) $P(X > 80)$; (b) the score exceeded by only 10% of students.

1. (a) Standardise: $Z = \dfrac{80-68}{12} = 1$. $P(X > 80) = P(Z > 1) \approx 0.1587$ (15.87% of students).
2. (b) GDC: invNorm(0.90, 68, 12) $\approx 83.4$. So the top 10% scored above approximately 83.4.
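
Both parts can be checked without a GDC. The short sketch below uses `statistics.NormalDist` from the Python standard library as a stand-in for the calculator; the variable names are illustrative.

```python
from statistics import NormalDist

scores = NormalDist(mu=68, sigma=12)

# (a) Upper-tail probability P(X > 80).
p_above_80 = 1 - scores.cdf(80)

# (b) Score exceeded by only 10% of students, i.e. the 90th percentile.
cutoff = scores.inv_cdf(0.90)

print(f"P(X > 80) = {p_above_80:.4f}")
print(f"90th percentile = {cutoff:.1f}")
```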

4. Chi-Squared Test for Independence

The $\chi^2$ test determines whether two categorical variables are independent in a two-way contingency table.

Chi-Squared Test

Hypotheses: $H_0$: the two variables are independent; $H_1$: they are not independent.

Expected frequency: $E_{ij} = \dfrac{(\text{row }i\text{ total}) \times (\text{column }j\text{ total})}{\text{grand total}}$

Test statistic: $\chi^2_{\text{calc}} = \displaystyle\sum \dfrac{(O-E)^2}{E}$

Degrees of freedom: $\nu = (\text{rows}-1)(\text{columns}-1)$

Decision rule: Reject $H_0$ if $\chi^2_{\text{calc}} > \chi^2_{\text{crit}}$ (or if $p$-value $<$ significance level, typically 5%).
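
The four formulas above fit together as in the following sketch. It assumes the critical value is looked up separately (from tables or a GDC) and passed in; the function name and the 2×2 table are invented purely to show the mechanics.

```python
def chi_squared_independence(observed, critical_value):
    """Return (expected, statistic, df, reject_h0) for a two-way table.

    `observed` is a list of rows of observed frequencies.
    `critical_value` must be supplied for the chosen significance level and df.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # E_ij = (row i total) * (column j total) / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Sum of (O - E)^2 / E over every cell
    statistic = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    reject_h0 = statistic > critical_value
    return expected, statistic, df, reject_h0

# Made-up 2x2 table; 3.841 is the 5% critical value for 1 degree of freedom.
observed = [[10, 20], [20, 10]]
expected, statistic, df, reject_h0 = chi_squared_independence(observed, critical_value=3.841)

print("expected:", expected)
print(f"chi-squared = {statistic:.3f}, df = {df}, reject H0: {reject_h0}")
```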

Worked Example 3

Chi-Squared Test: Diet and Health

A 2×3 contingency table records diet type (vegetarian, vegan, omnivore) vs. health outcome (good, poor) for 200 people. Observed frequencies:

| | Vegetarian | Vegan | Omnivore | Total |
| --- | --- | --- | --- | --- |
| Good health | 30 | 25 | 55 | 110 |
| Poor health | 20 | 15 | 55 | 90 |
| Total | 50 | 40 | 110 | 200 |

1. Expected (good, veg): $E = \dfrac{110 \times 50}{200} = 27.5$. Compute all six expected values similarly: 27.5, 22 and 60.5 for good health; 22.5, 18 and 49.5 for poor health.
2. $\chi^2$: $\displaystyle\sum\dfrac{(O-E)^2}{E} = \dfrac{(30-27.5)^2}{27.5}+\cdots$ (use GDC for full calculation). Result $\approx 2.53$.
3. Degrees of freedom: $(2-1)(3-1) = 2$. Critical value at 5%: $\chi^2_{\text{crit}}(2) = 5.991$.
4. Conclusion: Since $2.53 < 5.991$, do not reject $H_0$. There is insufficient evidence to conclude that diet type and health outcome are associated at the 5% significance level.
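
To check the hand calculation, the sketch below recomputes every expected frequency and each cell's contribution to the statistic directly from the observed table. The $p$-value line relies on a closed form that holds only for 2 degrees of freedom, where the upper-tail probability of $\chi^2$ is $e^{-x/2}$; for other degrees of freedom a GDC or statistics package would be needed. Variable names are illustrative.

```python
import math

observed = {
    "Good health": {"Vegetarian": 30, "Vegan": 25, "Omnivore": 55},
    "Poor health": {"Vegetarian": 20, "Vegan": 15, "Omnivore": 55},
}
diets = ["Vegetarian", "Vegan", "Omnivore"]

row_totals = {h: sum(cells.values()) for h, cells in observed.items()}
col_totals = {d: sum(observed[h][d] for h in observed) for d in diets}
n = sum(row_totals.values())

chi2 = 0.0
for h in observed:
    for d in diets:
        e = row_totals[h] * col_totals[d] / n          # expected frequency
        contribution = (observed[h][d] - e) ** 2 / e   # (O - E)^2 / E
        chi2 += contribution
        print(f"{h:12s} {d:11s} E = {e:5.1f}  contribution = {contribution:.4f}")

df = (len(observed) - 1) * (len(diets) - 1)

# Valid for df = 2 only: P(chi-squared > x) = exp(-x / 2).
p_value = math.exp(-chi2 / 2)

print(f"chi-squared = {chi2:.3f}, df = {df}, p-value = {p_value:.3f}")
print("reject H0 at 5%:", p_value < 0.05)
```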

Practice Problems

Q1. Heights are normally distributed with mean 175 cm and standard deviation 8 cm. What percentage of people are between 163 cm and 191 cm tall?
Solution:

$163 = 175 - 1.5\times8$ and $191 = 175+2\times8$, so $P(163 \le X \le 191) = P(-1.5 \le Z \le 2) \approx 0.9772 - 0.0668 = 0.9104$. GDC check: normalcdf(163, 191, 175, 8) $\approx 0.910$, so about 91.0% of people are between 163 cm and 191 cm tall.

Q2. The Spearman rank correlation for two sets of ranked exam scores is $r_s = 0.85$. Comment on the relationship and explain when Spearman is preferred over Pearson.
Solution:

$r_s = 0.85$ indicates a strong positive monotonic relationship between the two sets of ranked scores. Spearman's rank correlation is preferred when data is ordinal (ranked), when the relationship may be monotonic but not linear, or when there are outliers that would distort the Pearson coefficient.

Q3. A survey tests whether gender (male, female) is independent of preferred subject (Science, Arts, Other). Observed: $\chi^2_{\text{calc}} = 8.34$, $df = 2$. At 5% significance, what is the conclusion?
Solution:

$\chi^2_{\text{crit}}(2)$ at 5% significance level $= 5.991$. Since $8.34 > 5.991$, reject $H_0$. There is sufficient evidence at the 5% significance level to conclude that gender and preferred subject are not independent.

5. Exam Tips