Statistics & Probability — IB Math AI

1. Sampling Methods

Sampling Techniques

Method	Description	Advantage / limitation
Simple random	Every member has equal chance; use random number table/GDC	Unbiased, no prior knowledge needed
Systematic	Select every $k$th item from list (e.g., every 10th)	Simple to apply, spread across population
Stratified	Divide into groups (strata), sample proportionally from each	Ensures representation of all subgroups
Convenience	Use whoever is available	Easy, but biased
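The first three methods in the table can be sketched in a few lines of Python. This is only an illustration: the population of 100 numbered students, the two year groups, and the function names are all hypothetical, and the standard `random` module stands in for a random number table or GDC.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Pick n members so that every member has an equal chance."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, k, start=0):
    """Take every k-th item, beginning at index `start` (0 <= start < k)."""
    return population[start::k]

def stratified_sample(strata, n, seed=None):
    """Sample from each stratum in proportion to its size.

    `strata` maps a stratum name to the list of its members.
    Rounding can make the total differ slightly from n.
    """
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        share = round(n * len(members) / total)
        sample[name] = rng.sample(members, share)
    return sample

# Hypothetical population: 100 students numbered 1..100 in two year groups.
students = list(range(1, 101))
year_groups = {"Year 12": students[:60], "Year 13": students[60:]}

srs = simple_random_sample(students, 10, seed=1)
systematic = systematic_sample(students, 10)       # every 10th student
stratified = stratified_sample(year_groups, 10, seed=1)
```

With a 60/40 split, the stratified sample of 10 takes 6 students from the larger group and 4 from the smaller, which is what "sample proportionally" means in the table.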

2. Bivariate Data and Correlation

Pearson's Correlation Coefficient

$r$ measures the strength and direction of a linear relationship: $-1 \le r \le 1$.

$r$ close to $+1$: strong positive linear correlation
$r$ close to $-1$: strong negative linear correlation
$r$ close to $0$: weak or no linear correlation

Regression line $y$ on $x$: $\hat{y} = ax + b$ (minimises sum of squared vertical residuals). Always passes through $(\bar{x}, \bar{y})$.
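A GDC returns $r$, $a$ and $b$ directly, but a short sketch of the underlying sums may help show where they come from. The data below is made up for illustration, and the function name is an assumption, not part of any syllabus or library.

```python
from math import sqrt

def pearson_and_regression(xs, ys):
    """Return (r, a, b) for the least-squares line y_hat = a*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = s_xy / sqrt(s_xx * s_yy)
    a = s_xy / s_xx            # gradient
    b = y_bar - a * x_bar      # intercept; forces the line through (x_bar, y_bar)
    return r, a, b

# Hypothetical data, invented for illustration only.
hours = [1, 2, 3, 4, 5]
scores = [42, 50, 55, 61, 72]
r, a, b = pearson_and_regression(hours, scores)
```

Because the intercept is defined as $b = \bar{y} - a\bar{x}$, substituting $x = \bar{x}$ into the line always returns $\bar{y}$.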

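Spearman's rank correlation $r_s$: use when the data is ordinal (ranked) or when the relationship is monotonic but not linear. It is Pearson's $r$ calculated on the ranks of the data.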
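One way to see the difference between the two coefficients is to compute both on data that is monotonic but curved. The sketch below assumes tied values share the mean of their ranks (a common convention) and uses hypothetical data; the helper names are illustrative.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson's r from sums of squared deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

def spearman(xs, ys):
    """Spearman's r_s: Pearson's r applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical data: y = x**3 is perfectly monotonic but not linear.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]
r = pearson(xs, ys)
r_s = spearman(xs, ys)
```

Here the ranks of $x$ and $y$ agree exactly, so $r_s = 1$, while $r$ falls below 1 because the points do not lie on a straight line.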

Worked Example 1

Interpreting a Regression Line

Study hours ($x$) and test scores ($y$) give regression line $\hat{y} = 7.2x + 34$, with $r = 0.91$.

Interpret gradient: For each additional hour studied, the predicted score increases by 7.2 marks.

Interpret $y$-intercept: A student who studies 0 hours is predicted to score 34. (May not be meaningful in context.)

Interpret $r = 0.91$: There is a strong positive linear correlation between hours studied and test score.

Predict: For $x = 5$ hours: $\hat{y} = 7.2(5)+34 = 70$ marks. Provided 5 hours lies within the range of the observed data, this is interpolation, and with $r = 0.91$ the prediction is reliable.
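The prediction step, together with the interpolation check, might be sketched as follows. The example does not state the range of the observed study hours, so the range of 1 to 8 hours used here is purely hypothetical, as is the function name.

```python
def predict_score(hours, x_min, x_max):
    """Predict a score from y_hat = 7.2x + 34 and label the prediction type."""
    y_hat = 7.2 * hours + 34
    kind = "interpolation" if x_min <= hours <= x_max else "extrapolation"
    return y_hat, kind

# Assumed data range of 1 to 8 hours (hypothetical; not given in the example).
inside = predict_score(5, x_min=1, x_max=8)
outside = predict_score(12, x_min=1, x_max=8)
```

A prediction labelled as extrapolation should be treated with caution, since the linear pattern may not continue outside the observed data.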

3. Normal Distribution

Normal Distribution: $X \sim N(\mu, \sigma^2)$

The normal distribution is bell-shaped and symmetric about the mean $\mu$. The standard deviation $\sigma$ controls the spread.

Standardisation: $Z = \dfrac{X-\mu}{\sigma}$, where $Z\sim N(0,1)$.

Key percentages (empirical rule):

$P(\mu-\sigma < X < \mu+\sigma) \approx 68\%$
$P(\mu-2\sigma < X < \mu+2\sigma) \approx 95\%$
$P(\mu-3\sigma < X < \mu+3\sigma) \approx 99.7\%$

Use GDC normalcdf($a$, $b$, $\mu$, $\sigma$) for $P(a \le X \le b)$. Use invNorm($p$, $\mu$, $\sigma$) for the value with $P(X \le x) = p$.
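For checking GDC answers away from the calculator, Python's standard library offers `statistics.NormalDist` (Python 3.8 or later). The wrappers below are a sketch that mimics the GDC argument order; the names `normalcdf` and `inv_norm` are chosen here for illustration and are not built-in Python functions.

```python
from statistics import NormalDist

def normalcdf(a, b, mu, sigma):
    """P(a <= X <= b) for X ~ N(mu, sigma^2)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(b) - dist.cdf(a)

def inv_norm(p, mu, sigma):
    """The value x with P(X <= x) = p."""
    return NormalDist(mu, sigma).inv_cdf(p)

# Check the empirical rule on the standard normal, N(0, 1).
within_1 = normalcdf(-1, 1, 0, 1)
within_2 = normalcdf(-2, 2, 0, 1)
within_3 = normalcdf(-3, 3, 0, 1)
```

Note that `NormalDist` takes the standard deviation $\sigma$, not the variance $\sigma^2$, matching the GDC convention rather than the $N(\mu, \sigma^2)$ notation.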

Worked Example 2

Normal Distribution: Test Scores

Test scores $X$ are normally distributed with $X \sim N(68, 12^2)$. Find: (a) $P(X > 80)$; (b) the score exceeded by only 10% of students.

(a) Standardise: $Z = \dfrac{80-68}{12} = 1$. $P(X > 80) = P(Z > 1) \approx 0.1587$ (15.87% of students).

(b) GDC: invNorm(0.90, 68, 12) $\approx 83.4$. So the top 10% scored above approximately 83.4.
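Part (b) can also be reached by reversing the standardisation formula. The working below assumes the standard table value $z \approx 1.2816$ for a cumulative probability of 0.90.

$P(Z \le z) = 0.90 \Rightarrow z \approx 1.2816$
$x = \mu + z\sigma \approx 68 + 1.2816 \times 12 \approx 83.4$

This agrees with the invNorm result, since invNorm carries out the same two steps internally.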

4. Chi-Squared Test for Independence

The $\chi^2$ test determines whether two categorical variables are independent in a two-way contingency table.

Chi-Squared Test

Hypotheses: $H_0$: the two variables are independent; $H_1$: they are not independent.

Expected frequency: $E_{ij} = \dfrac{(\text{row }i\text{ total}) \times (\text{column }j\text{ total})}{\text{grand total}}$

Test statistic: $\chi^2_{\text{calc}} = \displaystyle\sum \dfrac{(O-E)^2}{E}$

Degrees of freedom: $\nu = (\text{rows}-1)(\text{columns}-1)$

Decision rule: Reject $H_0$ if $\chi^2_{\text{calc}} > \chi^2_{\text{crit}}$ (or if $p$-value $<$ significance level, typically 5%).
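The four steps above (expected frequencies, test statistic, degrees of freedom, decision) can be sketched as one function. The 2×2 table is hypothetical, the function name is illustrative, and the critical value 3.841 is the standard 5% table value for one degree of freedom.

```python
def chi_squared_independence(observed):
    """Return (expected, statistic, df) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # E_ij = (row i total) * (column j total) / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return expected, statistic, df

# Hypothetical 2x2 table, invented for illustration.
table = [[30, 20],
         [20, 30]]
expected, statistic, df = chi_squared_independence(table)
critical_value = 3.841   # 5% critical value for df = 1, from tables
reject_h0 = statistic > critical_value
```

The function takes the table without its totals row and column; including them would distort both the expected frequencies and the degrees of freedom.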

Worked Example 3

Chi-Squared Test: Diet and Health

A 2×3 contingency table records diet type (vegetarian, vegan, omnivore) vs. health outcome (good, poor) for 200 people. Observed frequencies:

	Vegetarian	Vegan	Omnivore	Total
Good health	30	25	55	110
Poor health	20	15	55	90
Total	50	40	110	200

Expected (good, veg): $E = \dfrac{110 \times 50}{200} = 27.5$. Compute all six expected values similarly.
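Applying the same formula to every cell gives the full set of expected values. As a quick check, each row and column of expected values should add up to the corresponding observed total.

$E_{\text{good, veg}} = \dfrac{110 \times 50}{200} = 27.5$, $E_{\text{good, vegan}} = \dfrac{110 \times 40}{200} = 22$, $E_{\text{good, omni}} = \dfrac{110 \times 110}{200} = 60.5$
$E_{\text{poor, veg}} = \dfrac{90 \times 50}{200} = 22.5$, $E_{\text{poor, vegan}} = \dfrac{90 \times 40}{200} = 18$, $E_{\text{poor, omni}} = \dfrac{90 \times 110}{200} = 49.5$

All six expected values are at least 5, so no categories need to be combined.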

$\chi^2$: $\displaystyle\sum\dfrac{(O-E)^2}{E} = \dfrac{(30-27.5)^2}{27.5}+\cdots \approx 0.227 + 0.409 + 0.500 + 0.278 + 0.500 + 0.611$ (use GDC for full calculation). Result $\approx 2.53$.

Degrees of freedom: $(2-1)(3-1) = 2$. Critical value at 5%: $\chi^2_{\text{crit}}(2) = 5.991$.

Conclusion: Since $2.53 < 5.991$, do not reject $H_0$. There is insufficient evidence to conclude that diet type and health outcome are associated at the 5% significance level.

Practice Problems

Q1. Heights are normally distributed with mean 175 cm and standard deviation 8 cm. What percentage of people are between 163 cm and 191 cm tall?

Solution

$163 = 175 - 1.5\times8$ and $191 = 175+2\times8$. So $P(163 \le X \le 191) = P(-1.5 \le Z \le 2) \approx 0.9772 - 0.0668 = 0.9104$. Using GDC directly: normalcdf(163, 191, 175, 8) $\approx 0.910$, so about 91.0% of people are between 163 cm and 191 cm tall.

Q2. The Spearman rank correlation for two sets of ranked exam scores is $r_s = 0.85$. Comment on the relationship and explain when Spearman is preferred over Pearson.

Solution

$r_s = 0.85$ indicates a strong positive monotonic relationship between the two sets of ranked scores. Spearman's rank correlation is preferred when data is ordinal (ranked), when the relationship may be monotonic but not linear, or when there are outliers that would distort the Pearson coefficient.

Q3. A survey tests whether gender (male, female) is independent of preferred subject (Science, Arts, Other). The test gives $\chi^2_{\text{calc}} = 8.34$ with $\nu = 2$ degrees of freedom. At 5% significance, what is the conclusion?

Solution

$\chi^2_{\text{crit}}(2)$ at 5% significance level $= 5.991$. Since $8.34 > 5.991$, reject $H_0$. There is sufficient evidence at the 5% significance level to conclude that gender and preferred subject are not independent.

5. Exam Tips

Always state $H_0$ and $H_1$ clearly, including the context (not just "variables are independent").
The regression line $y$ on $x$ is used to predict $y$ from $x$; do not use $x$ on $y$ for this purpose.
For normal distribution: invNorm gives the value, normalcdf gives the probability. Know which to use.
In $\chi^2$ tests: expected frequencies should all be $\ge 5$ (IB requirement). If not, combine categories and note this.
Never extrapolate far beyond the data range; always comment on reliability when predicting.