# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

MODULE = '04_data_analysis'
DATASET = 'vitamin_trial.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

try:
    print('Attempting to clone repository...')
    if not os.path.exists(BASE_PATH):
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    print('Setting working directory...')
    os.chdir(MODULE_PATH)
    if os.path.exists(DATASET_PATH):
        print(f'Dataset found: {DATASET_PATH} ✅')
    else:
        raise FileNotFoundError('Dataset missing after clone.')
except Exception as e:
    print(f'Cloning failed: {e}')
    print('Falling back to manual upload...')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} ✅')
    else:
        raise FileNotFoundError(f'Upload failed. Please upload {DATASET}.')
📊 4.1 Data Distributions and Visualisation
In this notebook we’ll explore the shape of our data and how to visualise it clearly. You’ll learn to create histograms, density plots, ECDFs, and box/violin plots—and see when a log transform helps. We’ll include just one light bivariate teaser (scatter with a regression line) and leave multivariate overviews for 4.2.
We’ll use vitamin_trial.csv (a simulated trial dataset with vitamin D measurements).
🎯 Objectives
- Visualise distributions (histogram, KDE/density, ECDF), and compare groups.
- Use box/violin plots to summarise variation and outliers.
- Check normality with Q–Q plots and Shapiro–Wilk.
- See how a log transform can clarify shape.
- (Teaser) A single scatter with regression; full multivariate EDA is in 4.2.
%pip install -q pandas numpy matplotlib seaborn statsmodels scipy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy.stats import shapiro

pd.set_option('display.max_columns', 50)
sns.set_theme()
print('Visualisation environment ready.')
🔧 Load the Data and Inspect
We’ll load vitamin_trial.csv and take a quick look at its structure and the first few rows.
df = pd.read_csv('data/vitamin_trial.csv')
print('Shape:', df.shape)
display(df.head())
display(df.describe(include='all'))
📦 Distribution Plots
We start with a single numeric variable, e.g. Vitamin_D.
Why distributions?
- They reveal skewness, heavy tails, and outliers.
- They inform transformations (e.g., log for right-skew).
- They guide the choice of statistical model.
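To make the first two points concrete, here is a minimal sketch on synthetic data (not the trial data; the sample sizes, parameters and the `summarise` helper are illustrative assumptions). It compares a roughly symmetric sample with a right-skewed one: in the skewed sample the mean sits above the median and the sample skewness is clearly positive.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# Synthetic stand-ins (assumed parameters), not the vitamin_trial.csv data
symmetric = rng.normal(loc=50, scale=10, size=2000)
right_skewed = rng.lognormal(mean=3.5, sigma=0.6, size=2000)

def summarise(name, values):
    """Print and return mean, median and sample skewness of a 1-D array."""
    summary = {
        'mean': float(np.mean(values)),
        'median': float(np.median(values)),
        'skew': float(skew(values)),
    }
    print(f"{name}: mean={summary['mean']:.1f}, "
          f"median={summary['median']:.1f}, skew={summary['skew']:.2f}")
    return summary

sym_summary = summarise('symmetric', symmetric)
skew_summary = summarise('right-skewed', right_skewed)
```

A mean well above the median together with positive skewness is a quick numeric hint that a log transform may be worth trying; the plots remain the main evidence.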
x = df['Vitamin_D'].dropna()

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1) Histogram
axes[0].hist(x, bins=20, edgecolor='black')
axes[0].set_title('Histogram: Vitamin_D')
axes[0].set_xlabel('Vitamin_D (µg)')

# 2) KDE / density
sns.kdeplot(x=x, ax=axes[1])
axes[1].set_title('KDE / Density: Vitamin_D')
axes[1].set_xlabel('Vitamin_D (µg)')

# 3) ECDF (empirical cumulative distribution function)
xs = np.sort(x.values)
ys = np.arange(1, len(xs)+1) / len(xs)
axes[2].plot(xs, ys, marker='.', linestyle='none')
axes[2].set_title('ECDF: Vitamin_D')
axes[2].set_xlabel('Vitamin_D (µg)')
axes[2].set_ylabel('Proportion ≤ x')

plt.tight_layout(); plt.show()
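The ECDF calculation used in the third panel can be wrapped in a small reusable function. This is a minimal sketch: the names `ecdf` and `proportion_at_or_below` and the four sample values are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
import numpy as np

def ecdf(values):
    """Return the sorted values and, for each, the proportion of observations <= it."""
    xs = np.sort(np.asarray(values, dtype=float))
    xs = xs[~np.isnan(xs)]                    # drop missing values
    ys = np.arange(1, len(xs) + 1) / len(xs)  # 1/n, 2/n, ..., 1
    return xs, ys

def proportion_at_or_below(values, threshold):
    """Read the ECDF at a single threshold."""
    xs, _ = ecdf(values)
    return np.searchsorted(xs, threshold, side='right') / len(xs)

sample = [12.0, 7.5, 9.0, 15.0]  # hypothetical measurements
xs, ys = ecdf(sample)
print(xs)  # [ 7.5  9.  12.  15. ]
print(ys)  # [0.25 0.5  0.75 1.  ]
print(proportion_at_or_below(sample, 9.0))  # 0.5
```

Unlike a histogram or KDE, the ECDF has no bin width or bandwidth to choose, so every observation appears exactly as measured.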
🧪 Compare Groups (Box/Violin)
Box and violin plots summarise spread and potential outliers; violin also shows the density shape. Let’s compare by Group (e.g. Control vs Treatment).
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.boxplot(data=df, x='Group', y='Vitamin_D', ax=axes[0])
axes[0].set_title('Boxplot by Group')
axes[0].set_ylabel('Vitamin_D (µg)')

sns.violinplot(data=df, x='Group', y='Vitamin_D', inner='quartile', ax=axes[1])
axes[1].set_title('Violin by Group (with quartiles)')
axes[1].set_ylabel('Vitamin_D (µg)')

plt.tight_layout(); plt.show()
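Points drawn beyond the whiskers of a boxplot are usually those outside 1.5 × IQR from the quartiles (the default `whis=1.5`). The sketch below reproduces that rule by hand so you can list the flagged values per group. It runs on a hypothetical toy table that only borrows the column names `Group` and `Vitamin_D`; the numbers and the helper `iqr_outliers` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the trial's Group and Vitamin_D columns
toy = pd.DataFrame({
    'Group': ['Control'] * 6 + ['Treatment'] * 6,
    'Vitamin_D': [8.0, 9.0, 10.0, 11.0, 12.0, 30.0,
                  12.0, 13.0, 14.0, 15.0, 16.0, 17.0],
})

def iqr_outliers(values, whis=1.5):
    """Return the values lying outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return values[(values < lower) | (values > upper)]

outliers_by_group = {
    group: iqr_outliers(sub['Vitamin_D']).tolist()
    for group, sub in toy.groupby('Group')
}
print(outliers_by_group)  # {'Control': [30.0], 'Treatment': []}
```

A flagged point is a prompt to check the measurement, not a reason to delete it automatically.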
🌀 Normality Checks
Some methods (e.g., classic t-tests, OLS regression) assume approximately normal residuals or variables. Always look at the data first:
- Q–Q plots: compare the quantiles of your data with those of a normal distribution.
- Shapiro–Wilk test: a formal test of normality (with large n it flags even small, practically unimportant departures).
y = df['Vitamin_D'].dropna()

# Q–Q plot (raw); pass ax explicitly, otherwise qqplot opens a separate figure
fig, ax = plt.subplots(figsize=(5, 5))
sm.qqplot(y, line='s', ax=ax)
ax.set_title('Q–Q Plot (raw Vitamin_D)')
plt.show()

# Shapiro–Wilk
stat, p = shapiro(y)
print(f'Shapiro–Wilk (raw): statistic={stat:.3f}, p={p:.3g}')
print('Interpretation: p<0.05 → reject normality; p≥0.05 → normality plausible.')
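The p-value depends heavily on sample size, so read it alongside the Q–Q plot. The sketch below draws two synthetic samples from the same mildly heavy-tailed distribution (Student's t with 5 degrees of freedom, an assumption chosen for illustration) and runs Shapiro–Wilk on each. With 5000 observations the test has enough power to reject; with 30 it may well not, even though the underlying shape is identical.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic data: bell-shaped, but with heavier tails than a normal distribution
small = rng.standard_t(df=5, size=30)
large = rng.standard_t(df=5, size=5000)

stat_small, p_small = shapiro(small)
stat_large, p_large = shapiro(large)

print(f'n=30:   statistic={stat_small:.3f}, p={p_small:.3g}')
print(f'n=5000: statistic={stat_large:.3f}, p={p_large:.3g}')
```

So a small p with large n does not by itself mean the departure matters in practice, and a large p with small n does not prove normality.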
Log Transform (and Q–Q side-by-side)
Right-skewed data often benefit from a log transform. If zeros are possible, add a small constant ε or shift the data before taking logs. Here we add a small ε so the log is defined even if a value is exactly zero.
eps = 1e-6
y_log = np.log(y + eps)

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
sm.qqplot(y, line='s', ax=axes[0])
axes[0].set_title('Q–Q: raw Vitamin_D')
sm.qqplot(y_log, line='s', ax=axes[1])
axes[1].set_title('Q–Q: log(Vitamin_D)')
plt.tight_layout(); plt.show()

stat_l, p_l = shapiro(y_log)
print(f'Shapiro–Wilk (log): statistic={stat_l:.3f}, p={p_l:.3g}')
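If you want numbers to go with the side-by-side Q–Q plots, two simple summaries are the sample skewness and the correlation between the ordered data and the theoretical normal quantiles (returned by `scipy.stats.probplot`). The sketch below uses synthetic log-normal data, not the trial data; the parameters and the helper `qq_correlation` are illustrative assumptions.

```python
import numpy as np
from scipy.stats import probplot, skew

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible

# Synthetic right-skewed data; strictly positive, so no epsilon is needed here
raw = rng.lognormal(mean=3.0, sigma=0.8, size=1000)
logged = np.log(raw)

def qq_correlation(values):
    """Correlation of the Q-Q points against a normal; closer to 1 means straighter."""
    (_, _), (_, _, r) = probplot(values, dist='norm')
    return r

skew_raw, skew_log = skew(raw), skew(logged)
r_raw, r_log = qq_correlation(raw), qq_correlation(logged)

print(f'raw:    skew={skew_raw:.2f}, Q-Q correlation={r_raw:.3f}')
print(f'logged: skew={skew_log:.2f}, Q-Q correlation={r_log:.3f}')
```

One caution about the ε approach: a value of exactly zero shifted by ε = 1e-6 maps to about −13.8 on the log scale, which can itself look like an extreme outlier. If the data contain real zeros, a larger shift or `np.log1p` may be a gentler choice.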
🧪 Exercises
- Plot histograms (or KDEs) of Vitamin_D by Group (use hue='Group'). Describe differences.
- Produce Q–Q plots (raw vs log) for Vitamin_D. Which is closer to normal?
- Use box/violin plots to compare Group distributions and comment on outliers.
👉 Next: 4.2 EDA — full workflow (profile, missingness, grouped summaries, pair plots, heatmaps, outliers).
✅ Conclusion
You characterised distributions, compared groups, inspected normality, and saw how log transforms clarify structure. Multivariate relationships and correlation overviews live in 4.2.
More reading
- Seaborn: https://seaborn.pydata.org/
- Matplotlib: https://matplotlib.org/
- statsmodels (Q–Q): https://www.statsmodels.org/stable/graphics.html
- Scipy stats: https://docs.scipy.org/doc/scipy/reference/stats.html