# Setup for Google Colab: clone repo and locate data
import os
from google.colab import files
= '04_data_analysis' # where vitamin_trial.csv lives in this course
MODULE = 'vitamin_trial.csv'
DATASET = '/content/data-analysis-projects'
BASE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
MODULE_PATH = os.path.join('data', DATASET)
DATASET_PATH
try:
print('Attempting to clone repository...')
if not os.path.exists(BASE_PATH):
!git clone https://github.com/ggkuhnle/data-analysis-projects.git
print('Setting working directory...')
os.chdir(MODULE_PATH)if os.path.exists(DATASET_PATH):
print(f'Dataset found: {DATASET_PATH} ✅')
else:
raise FileNotFoundError('Dataset missing after clone.')
except Exception as e:
print(f'Cloning failed: {e}')
print('Falling back to manual upload...')
'data', exist_ok=True)
os.makedirs(= files.upload()
uploaded if DATASET in uploaded:
with open(DATASET_PATH, 'wb') as f:
f.write(uploaded[DATASET])print(f'Successfully uploaded {DATASET} ✅')
else:
raise FileNotFoundError(f'Upload failed. Please upload {DATASET}.')
🧠 5.1 Bayesian Methods in Nutrition Research
This notebook introduces Bayesian methods for nutrition data analysis using PyMC and ArviZ. We’ll compare two groups (Control vs Treatment) in a vitamin D trial to estimate group means, their difference, and the uncertainty around them.
You will: - Fit a simple two-group Bayesian model to Vitamin_D
. - Examine diagnostics (trace plots, R-hat, ESS). - Quantify the posterior difference between groups. - Perform posterior predictive checks (PPC).
Why Bayesian?
Bayesian analysis combines prior information (priors) with data (likelihood) to produce a distribution of plausible parameter values (posterior). This is helpful in nutrition research where uncertainty and variation are the norm.%pip install -q pandas numpy pymc arviz matplotlib
import pandas as pd, numpy as np, pymc as pm, arviz as az, matplotlib.pyplot as plt
'plot.matplotlib.show'] = True
az.rcParams[print('Bayesian environment ready.')
📥 Load & Explore
We use vitamin_trial.csv
which (in this course) contains at least: - Vitamin_D
: numeric outcome (e.g., µg) - Group
: categorical (Control
/ Treatment
)
We’ll compare group means for Vitamin_D
.
= pd.read_csv('data/vitamin_trial.csv')
df print('Shape:', df.shape)
display(df.head())
assert {'Vitamin_D','Group'}.issubset(df.columns), 'Expected columns Vitamin_D and Group are missing.'
print('\nMeans by Group:')
'Group', observed=True)['Vitamin_D'].mean().round(2)) display(df.groupby(
🧩 Model Specification
We model Vitamin_D ~ Normal( μ[group], σ ) with group-specific means and a shared σ.
Priors (weakly-informative): - μ[group] ~ Normal(0, 10) - σ ~ HalfNormal(10)
These are intentionally broad; adapt to your domain knowledge as needed.
= df['Vitamin_D'].to_numpy()
y = df['Group'].map({'Control':0, 'Treatment':1}).to_numpy()
grp
with pm.Model() as model:
= pm.Normal('mu', mu=0, sigma=10, shape=2)
mu = pm.HalfNormal('sigma', sigma=10)
sigma = pm.Normal('y_obs', mu=mu[grp], sigma=sigma, observed=y)
y_obs
# Sample posterior
= pm.sample(1500, tune=1500, target_accept=0.9, random_seed=42)
idata
# Posterior predictive for PPC
= pm.sample_posterior_predictive(idata, random_seed=42) ppc
🔍 Diagnostics & Summaries
Check MCMC diagnostics (R-hat, ESS) and inspect trace plots to ensure good mixing.
print(az.summary(idata, var_names=['mu','sigma'], round_to=2))
=['mu','sigma']); plt.show()
az.plot_trace(idata, var_names
# Posterior for group means
=['mu']); plt.show() az.plot_posterior(idata, var_names
🔗 Posterior Difference (Treatment − Control)
We compute the posterior for Δ = μ[Treatment] − μ[Control] and the probability that Δ > 0.
= idata.posterior['mu'] # (chain, draw, 2)
mu_post = (mu_post.sel(mu_dim_0=1) - mu_post.sel(mu_dim_0=0)).values.flatten()
diff =0); plt.title('Posterior: Δ mean (Treatment - Control)'); plt.show()
az.plot_posterior(diff, ref_valprint(f"P(Δ>0) = {(diff>0).mean():.3f}")
📦 Posterior Predictive Checks (PPC)
Compare simulated data from the posterior with observed data to assess model fit.
# Robust conversion to InferenceData for PPC plot (handles PyMC return variations)
try:
# PyMC >= 5 often returns InferenceData directly
= ppc
idata_ppc # If not InferenceData, this will raise
= idata_ppc.posterior_predictive
_ except Exception:
# Dict-like fallback
= az.from_dict(posterior_predictive={'y_obs': ppc['y_obs']}, observed_data={'y_obs': y})
idata_ppc
; plt.show() az.plot_ppc(idata_ppc)
📝 Interpretation Tips
- Posterior means (μ): central tendency for each group.
- Δ posterior: if most mass > 0, Treatment likely increases Vitamin_D relative to Control.
- PPC: large misfit suggests revisiting the likelihood (e.g., heavier tails), adding predictors, or transforming the outcome.
Further reading
- PyMC: https://www.pymc.io/
- ArviZ: https://arviz-devs.github.io/arviz/
- Prior choice: think in outcome units (e.g., µg) and expected variability.
🧪 Exercises
Change the prior for
mu
toNormal(10, 5)
andsigma
toHalfNormal(5)
. Does the posterior difference meaningfully change?Add a predictor: include
Time
if available (create a simple linear model:Vitamin_D ~ Normal( α[group] + β·Time, σ )
).Heavier tails: re-fit with a Student-t likelihood (e.g.,
pm.StudentT
) and compare PPCs.
💡 Write a short markdown interpretation for each experiment.
✅ Wrap-up
You fitted a two-group Bayesian model, checked diagnostics, quantified the posterior difference, and validated fit with PPC. This workflow generalises to many nutrition questions: comparing interventions, stratifying by subgroups, or extending to multivariable models.