📊 4.6 Logistic Regression and Survival Analysis

Classical tools for binary outcomes and time-to-event questions. We’ll model the probability of improvement with logistic regression and the time to improvement with survival methods.

🎯 Objectives

  • Fit, evaluate, and interpret logistic regression (with odds ratios + CIs)
  • Add interaction terms and regularisation
  • Plot ROC, PR, calibration; compute confusion matrix and classification report
  • Estimate Kaplan–Meier curves; run the log-rank test
  • Fit a Cox proportional hazards model and check PH assumptions
Fun fact: Hippos don’t do RCTs, but their vitamin D data powers our models 🦛
# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

MODULE = '04_data_analysis'
DATASET = 'vitamin_trial.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

try:
    if not os.path.exists(BASE_PATH):
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    os.chdir(MODULE_PATH)
    if os.path.exists(DATASET_PATH):
        print(f'Dataset found: {DATASET_PATH} 🦛')
    else:
        raise FileNotFoundError('Dataset missing after clone.')
except Exception as e:
    print(f'Cloning failed: {e}')
    print('Falling back to manual upload ...')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} 🦛')
    else:
        raise FileNotFoundError(f'Upload failed. Please upload {DATASET}.')

%pip install -q pandas numpy scikit-learn lifelines matplotlib seaborn statsmodels
import pandas as pd, numpy as np
import matplotlib.pyplot as plt, seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import (roc_auc_score, average_precision_score, roc_curve,
                             precision_recall_curve, confusion_matrix, classification_report,
                             brier_score_loss)
from sklearn.calibration import calibration_curve
import statsmodels.api as sm
import statsmodels.formula.api as smf
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test

sns.set_theme()
pd.set_option('display.max_columns', 50)
print('Environment ready.')

📥 Data Preparation

vitamin_trial.csv columns (simulated):

  • ID: participant identifier
  • Group: Control / Treatment
  • Vitamin_D: serum vitamin D level (units as per dataset)
  • Time: time to outcome (months)
  • Outcome: Normal / Improved

We’ll create:

  • Improved = 1 if Outcome == 'Improved', else 0
  • Train/test split (70/30)

df = pd.read_csv('data/vitamin_trial.csv')
df['Improved'] = (df['Outcome'] == 'Improved').astype(int)
print('Shape:', df.shape)
display(df.head())

1) Logistic Regression — Baseline

We’ll predict Improved from Vitamin_D and Group. Start with a simple, interpretable model and evaluate on a test set.

Why not only accuracy? Class imbalance and thresholding can make accuracy misleading. We report ROC-AUC and PR-AUC, and show a calibration curve.
# Design matrix
X = pd.get_dummies(df[['Vitamin_D','Group']], drop_first=True, dtype=int)  # int dummies (pandas >= 2.0 defaults to bool)
y = df['Improved']

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

scaler = StandardScaler(with_mean=False)  # with_mean=False keeps sparse-compatible; safe here too
X_trs = scaler.fit_transform(X_tr)
X_tes = scaler.transform(X_te)

clf = LogisticRegression(max_iter=200, solver='lbfgs')
clf.fit(X_trs, y_tr)

pred_prob = clf.predict_proba(X_tes)[:,1]
pred_lbl = (pred_prob >= 0.5).astype(int)

roc_auc = roc_auc_score(y_te, pred_prob)
pr_auc = average_precision_score(y_te, pred_prob)
cm = confusion_matrix(y_te, pred_lbl)

print('ROC-AUC:', round(roc_auc,3), '| PR-AUC:', round(pr_auc,3))
print('\nConfusion matrix (@0.5):\n', cm)
print('\nClassification report:\n', classification_report(y_te, pred_lbl))

Curves: ROC, Precision–Recall, Calibration

fpr, tpr, _ = roc_curve(y_te, pred_prob)
prec, rec, _ = precision_recall_curve(y_te, pred_prob)
prob_true, prob_pred = calibration_curve(y_te, pred_prob, n_bins=10)

fig, ax = plt.subplots(1,3, figsize=(15,4))
ax[0].plot(fpr, tpr); ax[0].plot([0,1],[0,1],'--')
ax[0].set_title('ROC'); ax[0].set_xlabel('FPR'); ax[0].set_ylabel('TPR')
ax[1].plot(rec, prec)
ax[1].set_title('Precision–Recall'); ax[1].set_xlabel('Recall'); ax[1].set_ylabel('Precision')
ax[2].plot(prob_pred, prob_true, marker='o'); ax[2].plot([0,1],[0,1],'--')
ax[2].set_title('Calibration'); ax[2].set_xlabel('Pred prob'); ax[2].set_ylabel('Observed freq')
plt.tight_layout(); plt.show()

print('Brier score (lower is better):', round(brier_score_loss(y_te, pred_prob), 4))

2) Odds Ratios (+ 95% CI) with statsmodels

Scikit-learn is great for prediction; statsmodels gives easy ORs, CIs, and p-values for interpretation. We also include an interaction term: Vitamin_D × Group_Treatment.

df_sm = df.copy()
df_sm = pd.get_dummies(df_sm, columns=['Group'], drop_first=True, dtype=int)  # int dummy keeps the formula's coefficient names simple
formula = 'Improved ~ Vitamin_D + Group_Treatment + Vitamin_D:Group_Treatment'
logit = smf.logit(formula, data=df_sm).fit(disp=False)
print(logit.summary())

params = logit.params
conf = logit.conf_int()
or_tbl = pd.DataFrame({
    'OR': np.exp(params),
    'low': np.exp(conf[0]),
    'high': np.exp(conf[1])
})
display(or_tbl)
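With the interaction term, the Vitamin_D odds ratio is group-specific: exp(β_Vitamin_D) in Control and exp(β_Vitamin_D + β_interaction) in Treatment. A minimal sketch of that calculation, assuming the integer-coded dummy above so the coefficient names match the formula:

# Group-specific odds ratios for Vitamin_D, derived from the fitted model
b_vd  = params['Vitamin_D']
b_int = params['Vitamin_D:Group_Treatment']
print('OR per unit Vitamin_D (Control):  ', round(np.exp(b_vd), 3))
print('OR per unit Vitamin_D (Treatment):', round(np.exp(b_vd + b_int), 3))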

3) Regularised Logistic (cross-validated)

Regularisation (penalty) controls overfitting and helps when predictors are correlated. Below we tune C via internal CV.

logit_cv = LogisticRegressionCV(
    Cs=10, cv=5, scoring='roc_auc', max_iter=500, solver='lbfgs')
logit_cv.fit(X_trs, y_tr)
pp = logit_cv.predict_proba(X_tes)[:,1]
print('CV-logistic ROC-AUC (test):', round(roc_auc_score(y_te, pp),3))
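To see which penalty strength the internal CV picked, inspect the fitted C_ attribute (a quick sanity check, not part of the evaluation):

# C_ holds the regularisation strength selected by cross-validation (one entry per class)
print('Selected C:', logit_cv.C_[0])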

4) Kaplan–Meier Survival Curves + Log-Rank Test

We consider event = improvement, so "survival" here means not yet improved. We compare Control vs Treatment with the log-rank test.

df['Event'] = (df['Outcome'] == 'Improved').astype(int)
km = KaplanMeierFitter()
plt.figure(figsize=(7,5))
for g in ['Control','Treatment']:
    mask = df['Group']==g
    km.fit(durations=df.loc[mask,'Time'], event_observed=df.loc[mask,'Event'], label=g)
    km.plot_survival_function()
plt.title('Kaplan–Meier: Time to Improvement')
plt.xlabel('Time (months)'); plt.ylabel('Survival (not improved)')
plt.grid(True, alpha=.3); plt.show()

lr = logrank_test(
    df.loc[df['Group']=='Control','Time'],
    df.loc[df['Group']=='Treatment','Time'],
    event_observed_A=df.loc[df['Group']=='Control','Event'],
    event_observed_B=df.loc[df['Group']=='Treatment','Event']
)
print('Log-rank p-value:', lr.p_value)
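A useful companion to the log-rank p-value is the median time to improvement in each group, i.e. the time at which each KM curve crosses 0.5. A short sketch using the same fitter:

# Median time-to-improvement per group (inf if the curve never drops below 0.5)
for g in ['Control', 'Treatment']:
    mask = df['Group'] == g
    kmf = KaplanMeierFitter().fit(df.loc[mask, 'Time'], df.loc[mask, 'Event'], label=g)
    print(f'{g}: median time to improvement = {kmf.median_survival_time_}')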

5) Cox Proportional Hazards Model

CoxPH estimates hazard ratios (HRs). Because the event here is improvement, HR > 1 means a higher instantaneous rate of improving, i.e. a shorter expected time to improvement. We include Vitamin_D and Group.

Assumption: under proportional hazards (PH), hazard ratios are constant over time. We’ll run basic checks.
cox_df = df[['Time','Event','Vitamin_D','Group']].copy()
cox_df = pd.get_dummies(cox_df, columns=['Group'], drop_first=True, dtype=int)
cph = CoxPHFitter()
cph.fit(cox_df, duration_col='Time', event_col='Event')
cph.print_summary()  # HR, CI, p

# PH assumption checks (prints advice and test tables; show_plots=False suppresses diagnostic plots)
try:
    cph.check_assumptions(cox_df, p_value_threshold=0.05, show_plots=False)
except Exception as e:
    print('PH check note:', e)
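As an optional visual complement to the numeric PH checks, lifelines can plot the model’s predicted survival curves at different Vitamin_D values (here the 25th vs 75th percentiles, chosen arbitrarily for illustration):

# Predicted survival at low vs high Vitamin_D, other covariates at baseline
vals = cox_df['Vitamin_D'].quantile([0.25, 0.75]).values
cph.plot_partial_effects_on_outcome(covariates='Vitamin_D', values=vals)
plt.title('Partial effect of Vitamin_D on time to improvement')
plt.show()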

🧪 Exercises

  1. Threshold tuning: Choose a decision threshold that maximises F1 (or Youden’s J) on the validation set; recompute the confusion matrix.
  2. Extended model: Add an interaction to the Cox model (Vitamin_D × Group_Treatment) and interpret the HR.
  3. Calibration: Use quantile bins (e.g., deciles) to compare predicted vs observed improvement rates; comment on miscalibration.
  4. Sensitivity analysis: Refit logistic with StandardScaler(with_mean=True) on dense features (drop one-hot first) and compare coefficients and AUC.
Hints
  • For 1): grid thresholds from 0.1 to 0.9 and pick the argmax of F1 (see the sketch below).
  • For 2): add Vitamin_D:Group_Treatment column in cox_df before fitting.
  • For 3): pd.qcut(pred_prob, 10) then groupby to get observed means.
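A minimal sketch for Exercise 1 (for brevity it reuses the test set; strictly, tune the threshold on a separate validation split):

from sklearn.metrics import f1_score

# Grid of candidate thresholds; keep the one maximising F1
thresholds = np.linspace(0.1, 0.9, 81)
f1s = [f1_score(y_te, (pred_prob >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]
print('Best threshold:', round(best_t, 2), '| F1:', round(max(f1s), 3))
print(confusion_matrix(y_te, (pred_prob >= best_t).astype(int)))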

✅ Conclusion

You built interpretable classifiers (logistic with ORs and CIs), assessed predictive performance (ROC/PR/calibration), and analysed time-to-event outcomes (KM/log-rank/Cox with PH checks).

👉 Next: 4.7 Clinical Trial Analysis — put everything together for effect sizes and reporting.

More reading
  • Scikit-learn: Classification metrics & calibration
  • statsmodels: Logit / GLM binomial
  • lifelines: Kaplan–Meier, CoxPH, log-rank tests