📈 4.3 Correlation & Association

In this notebook we quantify associations in vitamin_trial.csv. You’ll compute classic correlations (Pearson, Spearman, Kendall), handle binary vs numeric with point-biserial, and add bootstrap confidence intervals, a permutation test, a simple partial correlation (controlling for a categorical factor like Group), and non-linear association metrics (Mutual Information, Distance Correlation).

🎯 Objectives

Compute Pearson (linear), Spearman (rank/monotonic), Kendall (rank/robust) correlations.
Use point-biserial for numeric ↔︎ binary associations.
Compare correlations with and without log transforms (when skewed).
Quantify uncertainty with bootstrap CIs and significance with a permutation test.
Estimate a partial correlation by controlling for a categorical factor.
Assess non-linear association via Mutual Information and Distance Correlation.

👉 Visual overviews (pair plots, heatmaps) live in 4.2 EDA. Normality/log-shape diagnostics live in 4.1 Distributions.

Friendly reminders

Correlation ≠ causation: confounding and directionality matter.
Linearity & outliers: Pearson assumes linearity and is outlier-sensitive; Spearman/Kendall are more robust to outliers and monotonic nonlinearity.
Missingness: correlations silently drop NA pairs; patterned missingness can bias results.

# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

MODULE = '04_data_analysis'
DATASET = 'vitamin_trial.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

try:
    print('Attempting to clone repository...')
    if not os.path.exists(BASE_PATH):
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    print('Setting working directory...')
    os.chdir(MODULE_PATH)
    if os.path.exists(DATASET_PATH):
        print(f'Dataset found: {DATASET_PATH} ✅')
    else:
        raise FileNotFoundError('Dataset missing after clone.')
except Exception as e:
    print(f'Cloning failed: {e}')
    print('Falling back to manual upload...')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} ✅')
    else:
        raise FileNotFoundError(f'Upload failed. Please upload {DATASET}.')

%pip install -q pandas numpy seaborn scipy statsmodels scikit-learn
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr, kendalltau, pointbiserialr
import statsmodels.api as sm
from sklearn.feature_selection import mutual_info_regression

pd.set_option('display.max_columns', 50)
sns.set_theme()
print('Correlation environment ready.')

1) Load & Select Variables

We’ll load the dataset, identify numeric columns, and prepare a clean subset for correlation calculations.

df = pd.read_csv('data/vitamin_trial.csv')
print('Shape:', df.shape)
display(df.head())

# Identify numeric columns (we will use these for correlation matrices if needed)
num = df.select_dtypes(include=[np.number]).copy()
print('Numeric columns:', num.columns.tolist())

# For demos below, we try Vitamin_D and Time if present
has_vit = 'Vitamin_D' in df.columns
has_time = 'Time' in df.columns and np.issubdtype(df['Time'].dtype, np.number)

2) Core correlations (pairwise)

We start with two variables (e.g., Vitamin_D and Time) and compute Pearson, Spearman, and Kendall.

def corr_all(x, y):
    x, y = pd.Series(x).astype(float), pd.Series(y).astype(float)
    m = x.notna() & y.notna()
    x, y = x[m], y[m]
    if len(x) < 2:
        return {'pearson_r': np.nan, 'pearson_p': np.nan,
                'spearman_r': np.nan, 'spearman_p': np.nan,
                'kendall_tau': np.nan, 'kendall_p': np.nan, 'n': len(x)}
    r_p, p_p = pearsonr(x, y)
    r_s, p_s = spearmanr(x, y)
    r_k, p_k = kendalltau(x, y)
    return {'pearson_r': r_p, 'pearson_p': p_p,
            'spearman_r': r_s, 'spearman_p': p_s,
            'kendall_tau': r_k, 'kendall_p': p_k, 'n': len(x)}

if has_vit and has_time:
    res = corr_all(df['Vitamin_D'], df['Time'])
    print('Vitamin_D vs Time:')
    for k, v in res.items():
        print(f'  {k}: {v:.4g}' if isinstance(v, (float, np.floating)) else f'  {k}: {v}')
else:
    print('Need a numeric Time column and Vitamin_D for the pairwise demo.')

2.1 Sensitivity to log-transform (if skewed)

If Vitamin_D is right-skewed (see 4.1), a log may clarify linear associations. We compare Pearson on raw vs log(Vitamin_D).

if has_vit and has_time:
    eps = 1e-6
    r_raw, p_raw = pearsonr(df['Vitamin_D'], df['Time'])
    r_log, p_log = pearsonr(np.log(df['Vitamin_D'] + eps), df['Time'])
    print(f'Pearson (raw): r={r_raw:.3f}, p={p_raw:.2e}')
    print(f'Pearson (log Vitamin_D): r={r_log:.3f}, p={p_log:.2e}')
else:
    print('Skipping log sensitivity: need Vitamin_D and numeric Time.')

3) Numeric ↔︎ Binary: Point-biserial correlation

If you have a binary categorical variable (e.g., Outcome with two levels, or a yes/no), use point-biserial to measure association with a numeric variable (e.g., Vitamin_D).

# Try Outcome (binary) vs Vitamin_D
if has_vit and 'Outcome' in df.columns:
    levels = pd.Series(df['Outcome']).dropna().unique().tolist()
    if len(levels) == 2:
        levels_sorted = sorted(levels)
        bin_map = {levels_sorted[0]:0, levels_sorted[1]:1}
        y_bin = pd.Series(df['Outcome']).map(bin_map)
        m = y_bin.notna() & pd.Series(df['Vitamin_D']).notna()
        r, p = pointbiserialr(y_bin[m].astype(int), df.loc[m, 'Vitamin_D'].astype(float))
        print(f'Point-biserial (Outcome vs Vitamin_D): r={r:.3f}, p={p:.2e}')
        print(f'Outcome mapping: {levels_sorted[0]}→0, {levels_sorted[1]}→1')
    else:
        print('Outcome is not binary here; point-biserial skipped.')
else:
    print('Need Vitamin_D and Outcome for point-biserial.')

4) Bootstrap confidence interval for a correlation

We’ll estimate a 95% CI for Pearson r via nonparametric bootstrap (resampling rows with replacement). This is useful when analytic CIs are awkward or assumptions are questionable.

What you get

A distribution of r-values across bootstrap resamples.
Percentile CI (e.g., 2.5th and 97.5th percentiles).

def bootstrap_corr(x, y, B=2000, seed=42):
    rng = np.random.default_rng(seed)
    x, y = pd.Series(x).astype(float), pd.Series(y).astype(float)
    m = x.notna() & y.notna()
    x, y = x[m].values, y[m].values
    n = len(x)
    if n < 2:
        return np.array([])
    out = np.empty(B, dtype=float)
    for b in range(B):
        idx = rng.integers(0, n, size=n)
        out[b] = pearsonr(x[idx], y[idx])[0]
    return out

if has_vit and has_time:
    boots = bootstrap_corr(df['Vitamin_D'], df['Time'], B=2000, seed=1)
    if boots.size:
        lo, hi = np.percentile(boots, [2.5, 97.5])
        print(f'Bootstrap Pearson r 95% CI: [{lo:.3f}, {hi:.3f}]')
        plt.figure(figsize=(6,3))
        sns.histplot(boots, bins=30, kde=True)
        plt.title('Bootstrap distribution of r (Vitamin_D vs Time)')
        plt.xlabel('Pearson r')
        plt.tight_layout(); plt.show()
else:
    print('Skipping bootstrap CI: need Vitamin_D and numeric Time.')

5) Permutation test for correlation (non-parametric p-value)

Under the null of no association, the correlation between x and randomly permuted y should look like what we observe. We compute a permutation p-value as the fraction of permuted |r| ≥ observed |r|.

When useful?

When parametric assumptions for Pearson p-values are doubtful.
Quick, assumption-light check (though computationally heavier).

def perm_test_corr(x, y, B=5000, seed=123):
    rng = np.random.default_rng(seed)
    x, y = pd.Series(x).astype(float), pd.Series(y).astype(float)
    m = x.notna() & y.notna()
    x, y = x[m].values, y[m].values
    if len(x) < 2:
        return np.nan, np.nan
    r_obs = pearsonr(x, y)[0]
    cnt = 0
    for b in range(B):
        y_perm = rng.permutation(y)
        r_b = pearsonr(x, y_perm)[0]
        if abs(r_b) >= abs(r_obs):
            cnt += 1
    pval = (cnt + 1) / (B + 1)  # add-one smoothing
    return r_obs, pval

if has_vit and has_time:
    r_obs, p_perm = perm_test_corr(df['Vitamin_D'], df['Time'], B=2000, seed=2)
    print(f'Permutation test: r_obs={r_obs:.3f}, p_perm={p_perm:.4f}')
else:
    print('Skipping permutation test: need Vitamin_D and numeric Time.')

6) Partial correlation (control for a categorical confounder)

We estimate the association between Vitamin_D and Time controlling for Group by residualising both variables on Group (dummy-coded), then correlating the residuals.

Notes

This is a simple, common approach; for more complex designs, consider regression or SEM.
If Time is not present or Group is missing, we skip.

def residualize(y, X):
    X_ = sm.add_constant(X, has_constant='add')
    model = sm.OLS(y, X_, missing='drop')
    res = model.fit()
    resid = pd.Series(np.nan, index=y.index)
    resid.loc[res.fittedvalues.index] = y.loc[res.fittedvalues.index] - res.fittedvalues
    return resid

if has_vit and has_time and 'Group' in df.columns:
    G = pd.get_dummies(df['Group'], drop_first=True)
    y1 = pd.to_numeric(df['Vitamin_D'], errors='coerce')
    y2 = pd.to_numeric(df['Time'], errors='coerce')
    r1 = residualize(y1, G)
    r2 = residualize(y2, G)
    m = r1.notna() & r2.notna()
    if m.sum() > 1:
        r_pc, p_pc = pearsonr(r1[m], r2[m])
        print(f'Partial correlation (Vitamin_D ~ Time | Group dummies): r={r_pc:.3f}, p={p_pc:.2e}, n={m.sum()}')
    else:
        print('Not enough paired residuals to compute partial correlation.')
else:
    print('Need Vitamin_D, Time, and Group for partial correlation.')

7) Non-linear Association

Linear correlations miss curved/heteroscedastic relationships. Two widely used measures that capture general dependence are: - Mutual Information (MI) — information shared between variables (0 = independent). - Distance Correlation (dCor) — 0 if and only if independent (for finite variance), sensitive to many forms of nonlinearity.

We compute MI with scikit-learn (k-NN estimator) and implement a small, self-contained dCor. We also add permutation p-values for both.

from sklearn.utils import check_random_state

def mutual_info_xy(x, y, n_neighbors=3, seed=42):
    """Mutual information (continuous-continuous) using k-NN estimator (scikit-learn)."""
    x = pd.Series(x).astype(float)
    y = pd.Series(y).astype(float)
    m = x.notna() & y.notna()
    X = x[m].to_numpy().reshape(-1,1)
    Y = y[m].to_numpy()
    if len(X) < 5:
        return np.nan
    mi = mutual_info_regression(X, Y, n_neighbors=n_neighbors, random_state=check_random_state(seed))
    return float(mi[0])

def _double_center(D):
    """Double-center a distance matrix."""
    n = D.shape[0]
    row_mean = D.mean(axis=1, keepdims=True)
    col_mean = D.mean(axis=0, keepdims=True)
    grand = D.mean()
    return D - row_mean - col_mean + grand

def distance_correlation(x, y):
    """Distance correlation (biased version, adequate for EDA)."""
    x = pd.Series(x).astype(float)
    y = pd.Series(y).astype(float)
    m = x.notna() & y.notna()
    x = x[m].to_numpy()
    y = y[m].to_numpy()
    if x.size < 2:
        return np.nan
    X = np.abs(x[:,None] - x[None,:])
    Y = np.abs(y[:,None] - y[None,:])
    A = _double_center(X)
    B = _double_center(Y)
    dcov2 = (A * B).mean()
    dvarx = (A * A).mean()
    dvary = (B * B).mean()
    if dvarx <= 0 or dvary <= 0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0)) / np.sqrt(np.sqrt(dvarx * dvary)))

def perm_pvalue_stat(x, y, stat_fn, B=2000, seed=123):
    rng = np.random.default_rng(seed)
    x = pd.Series(x).astype(float)
    y = pd.Series(y).astype(float)
    m = x.notna() & y.notna()
    x = x[m].to_numpy()
    y = y[m].to_numpy()
    if len(x) < 5:
        return np.nan, np.nan
    s_obs = stat_fn(x, y)
    cnt = 0
    for b in range(B):
        y_perm = rng.permutation(y)
        s_b = stat_fn(x, y_perm)
        if s_b >= s_obs:  # right-tailed (both MI and dCor are >= 0)
            cnt += 1
    pval = (cnt + 1) / (B + 1)
    return s_obs, pval

if has_vit and has_time:
    mi, p_mi = perm_pvalue_stat(df['Vitamin_D'], df['Time'],
                                 lambda a,b: mutual_info_xy(a,b,n_neighbors=5, seed=1),
                                 B=1000, seed=2)
    dcor, p_dc = perm_pvalue_stat(df['Vitamin_D'], df['Time'], distance_correlation, B=1000, seed=3)
    print(f'Mutual Information (kNN): MI={mi:.4f}, perm p={p_mi:.4f}')
    print(f'Distance Correlation:     dCor={dcor:.4f}, perm p={p_dc:.4f}')
else:
    print('Need Vitamin_D and numeric Time for non-linear measures.')

8) (Optional) Tiny correlation heatmap

Use 4.2 for full heatmaps/pairplots. Here’s a tiny demo with numeric columns (plus log(Vitamin_D) if present).

num2 = df.select_dtypes(include=[np.number]).copy()
if 'Vitamin_D' in num2.columns:
    num2['Vitamin_D_log'] = np.log(num2['Vitamin_D'] + 1e-6)
if num2.shape[1] >= 2:
    corr_pear = num2.corr(numeric_only=True, method='pearson')
    plt.figure(figsize=(5,4))
    sns.heatmap(corr_pear, annot=True, fmt='.2f', vmin=-1, vmax=1, cmap='coolwarm')
    plt.title('Pearson correlations (tiny demo)')
    plt.tight_layout(); plt.show()
else:
    print('Not enough numeric variables for a small heatmap. See 4.2 for richer visuals.')

🧪 Exercises

Pearson vs Spearman vs Kendall
- Pick two numeric variables (e.g., Vitamin_D and Time).
- Compute all three correlations. Which is largest? Are p-values consistent?
Point-biserial
- If you have a binary subgroup (e.g., a dichotomised Outcome), compute the point-biserial with Vitamin_D.
- Interpret the sign (which group tends to be higher?).
Bootstrap CI + Permutation test
- Build a bootstrap CI for Pearson r between two variables of your choice.
- Run a permutation test; compare p-values with pearsonr’s parametric p-value.
Partial correlation
- Control for Group when relating Vitamin_D and Time.
- Does controlling for Group change the strength noticeably?
Non-linear
- Compute MI and dCor for Vitamin_D vs Time (or another numeric pair).
- Use permutation p-values; compare to Pearson/Spearman conclusions.

🔗 Need distribution checks and log reasoning? See 4.1.
🔗 Need broad visual overviews (pair plots/heatmaps)? See 4.2.

✅ Conclusion

You quantified associations via Pearson/Spearman/Kendall, handled binary–numeric with point-biserial, and added uncertainty (bootstrap CI), a non-parametric significance check (permutation), a simple partial correlation, and nonlinear association metrics (MI, dCor). You’re ready to proceed to inferential comparisons and models.