📊 4.1 Data Distributions and Visualisation

In this notebook we’ll explore the shape of our data and how to visualise it clearly. You’ll learn to create histograms, density plots, ECDFs, and box/violin plots, and see when a log transform helps. We include just one light bivariate teaser (a scatter plot with a regression line) and leave multivariate overviews for 4.2.

We’ll use vitamin_trial.csv (a simulated trial dataset with vitamin D measurements).

🎯 Objectives

Visualise distributions (histogram, KDE/density, ECDF), and compare groups.
Use box/violin plots to summarise variation and outliers.
Check normality with Q–Q plots and Shapiro–Wilk.
See how a log transform can clarify shape.
(Teaser) A single scatter with regression; full multivariate EDA is in 4.2.

# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

MODULE = '04_data_analysis'
DATASET = 'vitamin_trial.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

try:
    print('Attempting to clone repository...')
    if not os.path.exists(BASE_PATH):
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    print('Setting working directory...')
    os.chdir(MODULE_PATH)
    if os.path.exists(DATASET_PATH):
        print(f'Dataset found: {DATASET_PATH} ✅')
    else:
        raise FileNotFoundError('Dataset missing after clone.')
except Exception as e:
    print(f'Cloning failed: {e}')
    print('Falling back to manual upload...')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} ✅')
    else:
        raise FileNotFoundError(f'Upload failed. Please upload {DATASET}.')

%pip install -q pandas numpy matplotlib seaborn statsmodels scipy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy.stats import shapiro

pd.set_option('display.max_columns', 50)
sns.set_theme()
print('Visualisation environment ready.')

🔧 Load the Data and Inspect

We’ll load vitamin_trial.csv and take a quick look at structure and the first few rows.

df = pd.read_csv('data/vitamin_trial.csv')
print('Shape:', df.shape)
display(df.head())
display(df.describe(include='all'))

📦 Distribution Plots

We start with a single numeric variable, e.g. Vitamin_D.

Why distributions?

They reveal skewness, heavy tails, and outliers.
They inform transformations (e.g., log for right-skew).
They guide the choice of statistical model.
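Before plotting, it can help to put a number on "skewness". The sketch below is illustrative only: it uses synthetic log-normal data (an assumption, not the trial dataset) and `scipy.stats.skew` to show that a right-skewed sample has positive skewness and a mean above its median, and that taking logs brings the skewness close to zero.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# Synthetic, illustrative data only (not the trial dataset):
# a log-normal sample is right-skewed; its logarithm is normal by construction.
skewed = rng.lognormal(mean=3.0, sigma=0.6, size=2000)
logged = np.log(skewed)

skew_raw = skew(skewed)   # clearly positive for right-skewed data
skew_log = skew(logged)   # close to 0 for roughly symmetric data

print(f"Skewness (raw): {skew_raw:.2f}")
print(f"Skewness (log): {skew_log:.2f}")
print(f"Mean vs median (raw): {skewed.mean():.1f} vs {np.median(skewed):.1f}")
```

The same two checks (skewness, and mean versus median) could be applied to `df['Vitamin_D']` once the data is loaded; the plots that follow show the same information visually.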

x = df['Vitamin_D'].dropna()

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1) Histogram
axes[0].hist(x, bins=20, edgecolor='black')
axes[0].set_title('Histogram: Vitamin_D')
axes[0].set_xlabel('Vitamin_D (µg)')

# 2) KDE / density
sns.kdeplot(x=x, ax=axes[1])
axes[1].set_title('KDE / Density: Vitamin_D')
axes[1].set_xlabel('Vitamin_D (µg)')

# 3) ECDF (empirical CDF: cumulative distribution function)
xs = np.sort(x.values)
ys = np.arange(1, len(xs)+1) / len(xs)
axes[2].plot(xs, ys, marker='.', linestyle='none')
axes[2].set_title('ECDF: Vitamin_D')
axes[2].set_xlabel('Vitamin_D (µg)')
axes[2].set_ylabel('Proportion ≤ x')

plt.tight_layout(); plt.show()
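The ECDF construction used in the third panel can be wrapped in a small helper. This is a minimal sketch: the function name `ecdf` and the toy values are hypothetical, chosen only to make each step easy to check by hand. It also shows how a quantile such as the median can be read off the curve.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the proportion of observations <= each one."""
    xs = np.sort(np.asarray(values, dtype=float))
    ys = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ys

# Hypothetical toy values (not from the trial dataset)
toy = [12.0, 7.5, 9.0, 15.0, 9.5]
xs, ys = ecdf(toy)

# Smallest observed value with at least half of the data at or below it
median_from_ecdf = xs[np.searchsorted(ys, 0.5)]

print(xs)                # sorted values
print(ys)                # cumulative proportions, ending at 1.0
print(median_from_ecdf)  # 9.5, which matches np.median(toy) here
```

Unlike a histogram, the ECDF needs no bin width, so its shape does not depend on a tuning choice. Recent seaborn versions (0.11 and later) also provide `sns.ecdfplot`, which draws the same curve as a step function.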

🧪 Compare Groups (Box/Violin)

Box and violin plots summarise spread and flag potential outliers; a violin plot also shows the shape of the density. Let’s compare by Group (e.g. Control vs Treatment).

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.boxplot(data=df, x='Group', y='Vitamin_D', ax=axes[0])
axes[0].set_title('Boxplot by Group')
axes[0].set_ylabel('Vitamin_D (µg)')

sns.violinplot(data=df, x='Group', y='Vitamin_D', inner='quartile', ax=axes[1])
axes[1].set_title('Violin by Group (with quartiles)')
axes[1].set_ylabel('Vitamin_D (µg)')

plt.tight_layout(); plt.show()
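The individual points a box plot draws beyond its whiskers follow, by default, the 1.5 × IQR rule (`whis=1.5` in matplotlib and seaborn). The sketch below reproduces that rule by hand. The helper `iqr_outliers` and the toy data frame are hypothetical and only illustrate the calculation; they are not part of the trial dataset.

```python
import pandas as pd

def iqr_outliers(values, whis=1.5):
    """Return the values outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    s = pd.Series(values).dropna()
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return s[(s < lower) | (s > upper)]

# Hypothetical toy data with one unusually high value in the Control group
toy = pd.DataFrame({
    "Group": ["Control"] * 6 + ["Treatment"] * 6,
    "Vitamin_D": [10, 11, 12, 12, 13, 30, 14, 15, 15, 16, 17, 18],
})

# Apply the rule within each group, as the grouped box plot does
flagged = {
    name: iqr_outliers(group).tolist()
    for name, group in toy.groupby("Group")["Vitamin_D"]
}
print(flagged)  # {'Control': [30], 'Treatment': []}
```

A point flagged this way is only unusual relative to its own group's spread; whether it is an error or a genuine extreme value is a separate question for the analyst.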

🌀 Normality Checks

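Some methods (e.g., classic t-tests, OLS regression) assume approximately normal residuals (or, for simple tests, approximately normal data within each group). Always look at the data first:

- Q–Q plots: compare the quantiles of your data with those of a normal distribution; points close to the line suggest approximate normality.
- Shapiro–Wilk test: a formal test of normality. With large n it becomes sensitive enough to flag even small, practically unimportant departures.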

y = df['Vitamin_D'].dropna()

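# Q–Q plot (raw)
# Pass an explicit Axes: sm.qqplot opens its own figure when ax is omitted,
# which would leave the figure created beforehand empty.
fig, ax = plt.subplots(figsize=(5, 5))
sm.qqplot(y, line='s', ax=ax)
ax.set_title('Q–Q Plot (raw Vitamin_D)')
plt.show()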

# Shapiro–Wilk
stat, p = shapiro(y)
print(f'Shapiro–Wilk (raw): statistic={stat:.3f}, p={p:.3g}')
print('Interpretation: p<0.05 → evidence against normality; p≥0.05 → no evidence against normality (not proof of it).')
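The caveat that Shapiro–Wilk is sensitive with large n can be seen in a small simulation. This is a sketch on synthetic data (an assumption, not the trial dataset): both samples come from the same Student-t distribution with 5 degrees of freedom, which is symmetric and bell-shaped but has somewhat heavier tails than a normal distribution.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Same mildly non-normal distribution, two very different sample sizes
small = rng.standard_t(df=5, size=30)
large = rng.standard_t(df=5, size=4000)

stat_small, p_small = shapiro(small)
stat_large, p_large = shapiro(large)

print(f"n=30:   W={stat_small:.3f}, p={p_small:.3g}")
print(f"n=4000: W={stat_large:.3f}, p={p_large:.3g}")
```

With thousands of observations the test has enough power to reject normality for departures that may not matter in practice, while a small sample from the same distribution will often pass. This is why the Q–Q plot should be read alongside the p-value instead of being replaced by it.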

Log Transform (and Q–Q side-by-side)

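Right-skewed data often benefit from a log transform. The log is undefined at zero, so if zeros are possible, add a small constant ε or shift the data first. Here we add ε = 1e-6, which leaves positive values essentially unchanged; note that any exact zero would map to log(1e-6) ≈ -13.8 and appear as an extreme low value, so check for zeros before relying on the transformed plot.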

eps = 1e-6
y_log = np.log(y + eps)

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
sm.qqplot(y, line='s', ax=axes[0])
axes[0].set_title('Q–Q: raw Vitamin_D')
sm.qqplot(y_log, line='s', ax=axes[1])
axes[1].set_title('Q–Q: log(Vitamin_D)')
plt.tight_layout(); plt.show()

stat_l, p_l = shapiro(y_log)
print(f'Shapiro–Wilk (log): statistic={stat_l:.3f}, p={p_l:.3g}')
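How zeros are handled changes the transformed values considerably. The sketch below uses hypothetical values (not the trial dataset) to compare the tiny-ε approach used above with `np.log1p`, i.e. log(1 + x). The latter is a swapped-in alternative, not the method used in this notebook, and it changes the interpretation of the transformed scale.

```python
import numpy as np

# Hypothetical values including an exact zero
values = np.array([0.0, 0.5, 5.0, 50.0])

eps = 1e-6
log_eps = np.log(values + eps)  # tiny shift: zero maps far into the left tail
log_1p = np.log1p(values)       # log(1 + x): zero maps to exactly 0

print(np.round(log_eps, 2))
print(np.round(log_1p, 2))
```

With ε = 1e-6, a single zero lands near -13.8 and can dominate a Q–Q plot or a Shapiro–Wilk result, while the positive values are almost unaffected. With `log1p`, zeros stay at 0 but small positive values are compressed differently, so the choice should be stated alongside any results.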

🔗 (Teaser) A Single Scatter with Regression

Full multivariate exploration (pair plots, heatmaps) is in 4.2 EDA. Here we just preview a scatter with a regression line. If Time is numeric, compare raw vs log(Vitamin_D).

if 'Time' in df.columns and np.issubdtype(df['Time'].dtype, np.number):
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    sns.regplot(data=df, x='Time', y='Vitamin_D', ax=axes[0])
    axes[0].set_title('Scatter (raw) with regression')
    axes[0].set_ylabel('Vitamin_D (µg)')

    sns.regplot(data=df.assign(Vitamin_D_log=np.log(df['Vitamin_D'] + 1e-6)),
                x='Time', y='Vitamin_D_log', ax=axes[1])
    axes[1].set_title('Scatter (log Vitamin_D) with regression')
    axes[1].set_ylabel('log Vitamin_D')
    plt.tight_layout(); plt.show()
else:
    print('No numeric Time variable found for scatter comparison. See 4.2 for broader EDA.')

🧪 Exercises

Plot histograms (or KDEs) of Vitamin_D by Group (use hue='Group'). Describe differences.
Produce Q–Q plots (raw vs log) for Vitamin_D. Which is closer to normal?
Use box/violin plots to compare Group distributions and comment on outliers.

👉 Next: 4.2 EDA — full workflow (profile, missingness, grouped summaries, pair plots, heatmaps, outliers).

✅ Conclusion

You characterised distributions, compared groups, inspected normality, and saw how log transforms clarify structure. Multivariate relationships and correlation overviews live in 4.2.