đŸ§Ș 4.4 Statistical Testing — From Questions to Evidence

In this notebook you’ll practice hypothesis testing on vitamin_trial.csv. We’ll start with assumption checks, then run parametric and non-parametric tests, report effect sizes with confidence intervals, and handle multiple comparisons.


🎯 Objectives

  • Choose the right test for a research question (one-sample, two-sample, paired, one-way ANOVA).
  • Check assumptions (approx. normality, variance homogeneity) and use non-parametric alternatives when needed.
  • Report p-values and effect sizes (Cohen’s d, Hedges’ g) with 95% CI.
  • Handle multiple comparisons (Holm or FDR/Benjamini–Hochberg).
Quick refresher
  • One-sample: compare a mean to a reference value.
  • Two-sample (independent): compare means of two groups (Welch default for unequal variances).
  • Paired: compare means of before/after within the same ID.
  • One-way ANOVA: compare means across ≄3 groups (plus post-hoc with multiplicity control).
  • Non-parametric: Wilcoxon signed-rank (paired), Mann–Whitney U (independent), Kruskal–Wallis (≄3 groups).
# Setup for Google Colab: fetch dataset automatically or allow manual upload
import os
from google.colab import files

MODULE = '04_data_analysis'
DATASET = 'vitamin_trial.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

try:
    print('Attempting to clone repository...')
    if not os.path.exists(BASE_PATH):
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    print('Setting working directory...')
    os.chdir(MODULE_PATH)
    if os.path.exists(DATASET_PATH):
        print(f'Dataset found: {DATASET_PATH} ✅')
    else:
        raise FileNotFoundError('Dataset missing after clone.')
except Exception as e:
    print(f'Cloning failed: {e}')
    print('Falling back to manual upload...')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} ✅')
    else:
        raise FileNotFoundError(f'Upload failed. Please upload {DATASET}.')
%pip install -q pandas numpy scipy statsmodels seaborn
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests

pd.set_option('display.max_columns', 50)
sns.set_theme()
print('Statistical testing environment ready.')

1) Load & Inspect

We load the dataset and check what we have. Typical columns: ID, Group (Control/Treatment), Vitamin_D (numeric), Time (e.g., 0/1), and possibly Outcome.

df = pd.read_csv('data/vitamin_trial.csv')
print('Shape:', df.shape)
display(df.head())
display(df.describe(include='all'))
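
Later sections assume particular Group and Time levels; a quick structural check helps (a minimal sketch, guarded in case either column is absent):

# Sanity-check the categorical structure later sections rely on
for col in ['Group', 'Time']:
    if col in df.columns:
        print(f'{col} levels:', df[col].value_counts(dropna=False).to_dict())
print('Missing values per column:')
print(df.isna().sum())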

2) Assumption Checks (lightweight)

Parametric tests (t-tests/ANOVA) typically assume roughly normal residuals and equal variances (for some tests). We inspect normality with Shapiro–Wilk and equality of variances with Levene.

Good practice
  • Use visuals first (see 4.1/4.2 and the sketch below); formal tests can be overly sensitive with large n.
  • If assumptions look shaky, lean on Welch t-test (default below) or non-parametrics.
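A quick visual check before the formal tests (a minimal sketch, assuming Vitamin_D and Group are present):

# Visual normality check: histogram + Q–Q plot per group
if {'Vitamin_D','Group'}.issubset(df.columns):
    groups = list(df.groupby('Group', observed=True))
    fig, axes = plt.subplots(2, len(groups), figsize=(5*len(groups), 8), squeeze=False)
    for i, (g, sub) in enumerate(groups):
        y = pd.to_numeric(sub['Vitamin_D'], errors='coerce').dropna()
        sns.histplot(y, kde=True, ax=axes[0][i])
        axes[0][i].set_title(f'Vitamin_D, Group={g}')
        stats.probplot(y, dist='norm', plot=axes[1][i])
        axes[1][i].set_title(f'Q–Q plot, Group={g}')
    plt.tight_layout()
    plt.show()
else:
    print('Need Vitamin_D and Group for the visual check.')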
# Example: normality of Vitamin_D by Group (if present)
if {'Vitamin_D','Group'}.issubset(df.columns):
    for g, sub in df.groupby('Group', observed=True):
        y = pd.to_numeric(sub['Vitamin_D'], errors='coerce').dropna()
        if len(y) >= 3:
            stat, p = stats.shapiro(y)
            print(f'Shapiro–Wilk Vitamin_D in Group={g}: W={stat:.3f}, p={p:.3g}, n={len(y)}')
else:
    print('Columns Vitamin_D and/or Group not available for normality demo.')

# Example: equality of variances across groups with Levene (median-centered, robust-ish)
if {'Vitamin_D','Group'}.issubset(df.columns):
    groups = [pd.to_numeric(sub['Vitamin_D'], errors='coerce').dropna() for _, sub in df.groupby('Group', observed=True)]
    if len(groups) >= 2 and all(len(g) >= 2 for g in groups):
        stat, p = stats.levene(*groups, center='median')
        print(f'Levene (Vitamin_D ~ Group): W={stat:.3f}, p={p:.3g}')
else:
    print('Columns Vitamin_D and/or Group not available for variance demo.')

3) Helper functions — effect sizes & CIs

We’ll compute Cohen’s d / Hedges’ g, plus 95% CIs for mean differences using Welch’s t framework. For paired designs, we compute effect sizes on the difference scores.

def cohens_d_independent(x, y):
    """Cohen's d for independent samples (pooled SD)."""
    x = pd.Series(x).dropna().astype(float)
    y = pd.Series(y).dropna().astype(float)
    nx, ny = len(x), len(y)
    if nx < 2 or ny < 2:
        return np.nan
    sx2, sy2 = x.var(ddof=1), y.var(ddof=1)
    sp = np.sqrt(((nx-1)*sx2 + (ny-1)*sy2) / (nx+ny-2))
    if sp == 0:
        return np.nan
    d = (x.mean() - y.mean()) / sp
    return d

def hedges_g(d, nx, ny):
    """Small-sample corrected Cohen's d (Hedges' g)."""
    # J correction
    df = nx + ny - 2
    if df <= 0 or np.isnan(d):
        return np.nan
    J = 1 - (3 / (4*df - 1))
    return d * J

def mean_diff_ci_welch(x, y, alpha=0.05):
    """95% CI for mean difference (x - y) using Welch's t."""
    x = pd.Series(x).dropna().astype(float)
    y = pd.Series(y).dropna().astype(float)
    nx, ny = len(x), len(y)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(ddof=1), y.var(ddof=1)
    if nx == 0 or ny == 0:
        return (np.nan, np.nan, np.nan)
    se = np.sqrt(vx/nx + vy/ny)
    # Welch–Satterthwaite df
    df_w = (vx/nx + vy/ny)**2 / ((vx**2)/((nx**2)*(nx-1)) + (vy**2)/((ny**2)*(ny-1))) if nx>1 and ny>1 else np.nan
    if se == 0 or not np.isfinite(df_w):
        return (mx-my, np.nan, np.nan)
    tcrit = stats.t.ppf(1 - alpha/2, df=df_w)
    lo, hi = (mx-my) - tcrit*se, (mx-my) + tcrit*se
    return (mx-my, lo, hi)

def cohens_d_paired(x_pre, x_post):
    """Cohen's d for paired samples: mean(diff) / sd(diff)."""
    pre = pd.Series(x_pre).astype(float)
    post = pd.Series(x_post).astype(float)
    # align by index (IDs)
    m = pre.notna() & post.notna()
    d = post[m] - pre[m]
    if len(d) < 2:
        return np.nan
    return d.mean() / d.std(ddof=1) if d.std(ddof=1) != 0 else np.nan

def mean_diff_ci_paired(x_pre, x_post, alpha=0.05):
    pre = pd.Series(x_pre).astype(float)
    post = pd.Series(x_post).astype(float)
    m = pre.notna() & post.notna()
    d = post[m] - pre[m]
    if len(d) < 2:
        return (np.nan, np.nan, np.nan)
    md = d.mean()
    se = d.std(ddof=1) / np.sqrt(len(d))
    tcrit = stats.t.ppf(1 - alpha/2, df=len(d)-1)
    return (md, md - tcrit*se, md + tcrit*se)
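
A quick sanity check of the helpers on toy data (illustrative numbers only, not trial data):

# Toy check of the effect-size helpers
rng = np.random.default_rng(42)
a = rng.normal(10, 2, 40)  # 'control-like' sample
b = rng.normal(12, 2, 40)  # 'treatment-like' sample
d = cohens_d_independent(b, a)
print(f'Cohen d = {d:.2f}, Hedges g = {hedges_g(d, len(b), len(a)):.2f}')
md, lo, hi = mean_diff_ci_welch(b, a)
print(f'Mean diff (b−a) = {md:.2f} [95% CI {lo:.2f}, {hi:.2f}]')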

4) One-sample tests

Question: Is mean Vitamin D (overall) different from a reference value (e.g., 12.5 ”g)?

Notes
  • Parametric: one-sample t-test on the mean.
  • Non-parametric: sign test or Wilcoxon signed-rank against a hypothesised median (scipy’s wilcoxon accepts the deviations y − mu0 directly; see the sketch after the t-test).
if 'Vitamin_D' in df.columns:
    y = pd.to_numeric(df['Vitamin_D'], errors='coerce').dropna()
    mu0 = 12.5  # reference mean (change to your scientific target)
    tstat, pval = stats.ttest_1samp(y, popmean=mu0)
    print(f'One-sample t-test: H0: mean={mu0}')
    print(f'  t={tstat:.3f}, p={pval:.3g}, n={len(y)}; mean={y.mean():.3f}')
else:
    print('Vitamin_D not found for one-sample test.')
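
As noted above, the non-parametric analogue works on the deviations from the reference value (a minimal sketch; assumes the deviations are roughly symmetric):

# One-sample Wilcoxon signed-rank: test the median against mu0 via y − mu0
if 'Vitamin_D' in df.columns:
    y = pd.to_numeric(df['Vitamin_D'], errors='coerce').dropna()
    mu0 = 12.5  # same reference as above
    w, p_w = stats.wilcoxon(y - mu0)
    print(f'Wilcoxon signed-rank vs median {mu0}: W={w:.1f}, p={p_w:.3g}')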

5) Two-sample (independent) tests

Question: Do Control vs Treatment differ in mean Vitamin D?

  • Parametric: Welch’s t-test (robust to unequal variances; default here).
  • Non-parametric: Mann–Whitney U.
  • Report effect sizes + 95% CI for mean difference.
if {'Vitamin_D','Group'}.issubset(df.columns):
    g_levels = df['Group'].dropna().unique().tolist()
    if len(g_levels) == 2:
        g1, g2 = sorted(g_levels)
        x = pd.to_numeric(df.loc[df['Group']==g1, 'Vitamin_D'], errors='coerce').dropna()
        y = pd.to_numeric(df.loc[df['Group']==g2, 'Vitamin_D'], errors='coerce').dropna()
        # Welch t-test
        t, p = stats.ttest_ind(x, y, equal_var=False)
        d = cohens_d_independent(x, y)
        g = hedges_g(d, len(x), len(y))
        md, lo, hi = mean_diff_ci_welch(x, y)
        print(f'Welch t-test ({g1} vs {g2}): t={t:.3f}, p={p:.3g}')
        print(f'  mean_diff={md:.3f} [{lo:.3f}, {hi:.3f}] (x−y)')
        print(f'  Cohen d={d:.3f}, Hedges g={g:.3f}')
        # Mann–Whitney U
        u, p_u = stats.mannwhitneyu(x, y, alternative='two-sided')
        print(f'Mann–Whitney U: U={u:.1f}, p={p_u:.3g}')
    else:
        print('Two-sample demo expects exactly 2 groups in Group.')
else:
    print('Need Vitamin_D and Group columns for two-sample tests.')

6) Paired tests (within-ID before/after)

If your dataset has repeated measurements per ID (e.g., Time=0 and Time=1), use paired tests:
  • Parametric: paired t-test on the difference.
  • Non-parametric: Wilcoxon signed-rank on the difference.

We’ll try to form pairs from Time==0 vs Time==1 within each ID for Vitamin_D. If not available, we skip gracefully.

if {'ID','Time','Vitamin_D'}.issubset(df.columns):
    # Wide-style pairing: each row is an ID×Time; pivot to columns Time0, Time1
    tmp = df[['ID','Time','Vitamin_D']].dropna()
    # Keep IDs with ≄2 distinct time points; pairing on Time 0/1 is verified after the pivot
    have_both = tmp.groupby('ID')['Time'].nunique() >= 2
    ids = have_both[have_both].index
    sub = tmp[tmp['ID'].isin(ids)].copy()
    wide = sub.pivot_table(index='ID', columns='Time', values='Vitamin_D', aggfunc='mean')
    if 0 in wide.columns and 1 in wide.columns:
        pre = wide[0]; post = wide[1]
        # Paired t-test
        m = pre.notna() & post.notna()
        if m.sum() >= 2:
            t, p = stats.ttest_rel(pre[m], post[m])
            d_pair = cohens_d_paired(pre[m], post[m])
            md, lo, hi = mean_diff_ci_paired(pre[m], post[m])
            print(f'Paired t-test (Time 0 vs 1): t={t:.3f}, p={p:.3g}, n={m.sum()}')
            print(f'  mean_diff={md:.3f} [{lo:.3f}, {hi:.3f}] (post−pre)')
            print(f'  Cohen d (paired)={d_pair:.3f}')
            # Wilcoxon signed-rank
            w, p_w = stats.wilcoxon(post[m] - pre[m])
            print(f'Wilcoxon signed-rank: W={w:.1f}, p={p_w:.3g}')
        else:
            print('Not enough paired (non-missing) observations for paired tests.')
    else:
        print('Could not find both Time=0 and Time=1 columns after pivot.')
else:
    print('Need ID, Time, Vitamin_D for paired tests.')

7) ≄3 groups — ANOVA and Kruskal–Wallis

Compare means across three or more groups:
  • Parametric: one-way ANOVA (we’ll use OLS via statsmodels and a Type II ANOVA table).
  • Non-parametric: Kruskal–Wallis (rank-based).
For post-hoc comparisons, we’ll run pairwise tests with multiplicity control.

Post-hoc choices
  • Parametric: Tukey HSD is classic (assumes equal variances / balanced design). If the design is unbalanced/heteroscedastic, many prefer pairwise Welch t with Holm correction (sketched after the main cell below).
  • Non-parametric: pairwise Mann–Whitney U with Holm correction is a pragmatic alternative (Dunn’s test is another option but needs extra packages).
if {'Vitamin_D','Group'}.issubset(df.columns):
    # Require ≄ 3 groups with some data
    counts = df.groupby('Group', observed=True)['Vitamin_D'].apply(lambda s: pd.to_numeric(s, errors='coerce').notna().sum())
    valid_groups = counts[counts >= 2].index.tolist()
    if len(valid_groups) >= 3:
        dfa = df[df['Group'].isin(valid_groups)].copy()
        dfa['Vitamin_D'] = pd.to_numeric(dfa['Vitamin_D'], errors='coerce')
        dfa = dfa.dropna(subset=['Vitamin_D'])
        # OLS + ANOVA table
        model = smf.ols('Vitamin_D ~ C(Group)', data=dfa).fit()
        aov = sm.stats.anova_lm(model, typ=2)
        print('One-way ANOVA (Type II):')
        display(aov)
        # Tukey HSD (if desired)
        try:
            tuk = pairwise_tukeyhsd(endog=dfa['Vitamin_D'], groups=dfa['Group'], alpha=0.05)
            print('\nTukey HSD (parametric, equal-variance assumption):')
            display(pd.DataFrame(tuk.summary().data[1:], columns=tuk.summary().data[0]))
        except Exception as e:
            print('Tukey HSD not available:', e)
        # Kruskal–Wallis
        arrays = [pd.to_numeric(dfa.loc[dfa['Group']==g, 'Vitamin_D'], errors='coerce').dropna() for g in valid_groups]
        H, pH = stats.kruskal(*arrays)
        print(f'\nKruskal–Wallis: H={H:.3f}, p={pH:.3g}')
        # Pairwise Mann–Whitney with Holm correction
        pairs, pvals = [], []
        for i in range(len(valid_groups)):
            for j in range(i+1, len(valid_groups)):
                g1, g2 = valid_groups[i], valid_groups[j]
                x = pd.to_numeric(dfa.loc[dfa['Group']==g1, 'Vitamin_D'], errors='coerce').dropna()
                y = pd.to_numeric(dfa.loc[dfa['Group']==g2, 'Vitamin_D'], errors='coerce').dropna()
                if len(x) >= 2 and len(y) >= 2:
                    u, p = stats.mannwhitneyu(x, y, alternative='two-sided')
                    pairs.append((g1, g2)); pvals.append(p)
        if pvals:
            rej, p_adj, _, _ = multipletests(pvals, alpha=0.05, method='holm')
            out = pd.DataFrame({'pair': pairs, 'p_raw': pvals, 'p_holm': p_adj, 'reject_0.05': rej})
            print('\nPairwise Mann–Whitney U with Holm correction:')
            display(out.sort_values('p_holm'))
    else:
        print('Need ≄3 groups with data for ANOVA/Kruskal–Wallis.')
else:
    print('Need Vitamin_D and Group for ANOVA/Kruskal–Wallis.')
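
The pairwise Welch-t alternative mentioned in the post-hoc notes, with Holm correction (a minimal sketch; it reuses dfa and valid_groups, so run the cell above first):

# Pairwise Welch t-tests with Holm correction — an alternative to Tukey HSD
# when the design is unbalanced or variances differ
from itertools import combinations

if 'dfa' in globals() and len(valid_groups) >= 3:
    pairs, pvals = [], []
    for g1, g2 in combinations(valid_groups, 2):
        x = dfa.loc[dfa['Group'] == g1, 'Vitamin_D'].dropna()
        y = dfa.loc[dfa['Group'] == g2, 'Vitamin_D'].dropna()
        t, p = stats.ttest_ind(x, y, equal_var=False)  # Welch
        pairs.append(f'{g1} vs {g2}'); pvals.append(p)
    rej, p_adj, _, _ = multipletests(pvals, alpha=0.05, method='holm')
    out = pd.DataFrame({'pair': pairs, 'p_raw': pvals, 'p_holm': p_adj, 'reject_0.05': rej})
    display(out.sort_values('p_holm'))
else:
    print('Run the ANOVA cell above first (needs ≄3 valid groups).')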

8) Reporting results

When reporting, include:
  • The test used (and why),
  • The estimate (mean difference or effect size) with 95% CI,
  • The test statistic and p-value,
  • Any assumption checks or transformations,
  • How multiple testing was handled (if applicable).

Template sentences
  • “Mean Vitamin D differed between Treatment and Control (Welch’s t(≈df)=t, p=p). The mean difference was md [lo, hi] ”g; Cohen’s d = d (Hedges’ g = g).”
  • “Across the three groups, ANOVA indicated a difference in means (F(df1, df2)=F, p=p). Tukey HSD post-hoc found X>Y (p_adj=p_adj).”
  • “Assumptions appeared [reasonable/not met]; therefore we used [Welch/non-parametric] tests.”
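
Values can be dropped into the templates programmatically to keep rounding consistent (a sketch with illustrative numbers; the arguments mirror the Welch outputs from Section 5):

# Fill the Welch template from computed values (illustrative inputs below)
def report_welch(t, p, md, lo, hi, d, g):
    return (f"Mean Vitamin D differed between Treatment and Control "
            f"(Welch's t={t:.2f}, p={p:.3g}). The mean difference was "
            f"{md:.2f} [{lo:.2f}, {hi:.2f}] ”g; Cohen's d = {d:.2f} "
            f"(Hedges' g = {g:.2f}).")

print(report_welch(t=2.31, p=0.023, md=1.40, lo=0.20, hi=2.60, d=0.52, g=0.51))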

đŸ§Ș Exercises

  1. Set a one-sample target
    • Choose a clinically meaningful Vitamin_D target (e.g., 12.5 ”g).
    • Test whether the overall mean differs; report the mean ± 95% CI.
  2. Two-sample with assumptions
    • Compare groups (Control vs Treatment).
    • Show Welch t-test and Mann–Whitney; report d, g, mean difference CI.
    • State which you’d trust if variances differ or distributions are skewed.
  3. Paired change
    • If Time=0/1 exist per ID, test within-ID change (paired t and Wilcoxon).
    • Report mean change ± 95% CI and Cohen’s d (paired).
  4. ≄3 groups
    • Run one-way ANOVA and Kruskal–Wallis across Group levels.
    • Perform post-hoc comparisons with multiplicity control (Tukey or Holm-adjusted pairwise tests).
    • Which groups differ meaningfully? Back up with CIs.
  5. Multiplicity
    • If you compare several endpoints (or many pairs), apply Holm or FDR to control false positives (see the starter sketch after this list).
    • Explain your choice briefly.
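
For Exercise 5, a minimal sketch contrasting Holm (family-wise error control) with Benjamini–Hochberg (false discovery rate) on illustrative p-values:

# Holm vs FDR/Benjamini–Hochberg on made-up p-values
pvals = [0.001, 0.012, 0.030, 0.045, 0.200]  # illustrative only
for method in ['holm', 'fdr_bh']:
    rej, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f'{method}: adjusted={np.round(p_adj, 3)}, reject={rej}')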

✅ Conclusion

You selected appropriate tests, verified assumptions, used non-parametric alternatives when needed, and reported both p-values and effect sizes with CIs. These practices make your findings more robust and interpretable.

Further reading