🔄 3.4 Data Transformation

In this notebook you’ll learn how to transform data so it’s easy to analyse and model. We’ll work with hippo_nutrients.csv and cover filtering, grouping, reshaping (wide/long), pivoting, and numeric transformations such as log and z-score standardisation. We’ll also add density plots to visualise how transformations change the distribution.

🎯 Objectives

  • Filter rows and select columns with expressive, reproducible code.
  • Aggregate within groups using .groupby(...).agg(...), understanding what aggregation means and when to use it.
  • Reshape between wide and long formats (melt / pivot).
  • Build pivot tables of values for quick comparisons.
  • Create derived variables and apply log, z-score, min–max scaling.
  • (Bonus) Group-wise standardisation and rolling transforms over time.
  • Use density plots to see how transformations change distributions.

📌 Context

Transformation is the ladder from raw data to insight. In nutrition datasets, you’ll often transform by group (e.g., nutrient, sex, year) and reshape between wide (one row per ID with many columns) and long (one row per measurement).

Wide vs Long

  • Wide: each variable has its own column (e.g., Iron, Calcium, Vitamin_D). Good for humans and some models.
  • Long (tidy): one row per observation; variables live in columns, not headers (e.g., a Nutrient column and a Value column). Preferred for plotting, grouping, and modelling.
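A minimal sketch of the two shapes, using a made-up two-hippo frame rather than hippo_nutrients.csv:

import pandas as pd

# Toy wide table: one row per hippo, one column per nutrient (values are invented)
wide_demo = pd.DataFrame({
    'ID': ['H1', 'H2'],
    'Iron': [8.2, 7.9],
    'Calcium': [1200, 1150],
})

# Wide → long: nutrient names move from the headers into a 'Nutrient' column
long_demo = wide_demo.melt(id_vars='ID', var_name='Nutrient', value_name='Value')
print(long_demo)
#    ID Nutrient   Value
# 0  H1     Iron     8.2
# 1  H2     Iron     7.9
# 2  H1  Calcium  1200.0
# 3  H2  Calcium  1150.0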

Fun Fact: Transforming data is like a hippo rearranging its snacks — same nutrients, better view! 🦛

# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

MODULE = '03_data_handling'
DATASET = 'hippo_nutrients.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

try:
    if not os.path.exists(BASE_PATH):
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    os.chdir(MODULE_PATH)
    assert os.path.exists(DATASET_PATH)
    print(f'Dataset found: {DATASET_PATH} ✅')
except Exception as e:
    print(f'Automatic clone failed: {e}')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} ✅')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')

print('Python environment ready.')

Install and import the required libraries (pandas, numpy, and matplotlib for the density plots).

%pip install -q pandas numpy matplotlib
import pandas as pd, numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 30)
print('Libraries ready.')

1) Load and Inspect

Let’s load hippo_nutrients.csv and inspect the first rows and data types.

df = pd.read_csv('data/hippo_nutrients.csv')
print('Shape:', df.shape)
print('Dtypes:\n', df.dtypes)
display(df.head(5))

2) Filtering Rows and Selecting Columns

Use boolean masks to keep the rows you want, and .loc with a column list to select columns. This is bread-and-butter transformation.

# Example: female hippos, Iron only, years ≥ 2024
mask = (df['Sex'] == 'F') & (df['Nutrient'] == 'Iron') & (df['Year'] >= 2024)
cols = ['ID', 'Year', 'Nutrient', 'Value', 'Age', 'Sex']
df_female_iron = df.loc[mask, cols].sort_values(['ID', 'Year']).reset_index(drop=True)
display(df_female_iron.head(5))
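If you prefer, .query() expresses the same filter as a single string, which some find easier to read; a sketch of the equivalent subset (reusing cols from above):

# Same subset via .query(): string expression instead of boolean masks
df_female_iron_q = (
    df.query("Sex == 'F' and Nutrient == 'Iron' and Year >= 2024")
      .loc[:, cols]
      .sort_values(['ID', 'Year'])
      .reset_index(drop=True)
)
display(df_female_iron_q.head(5))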

3) Aggregation — Concept and .agg Syntax

Aggregation means reducing many rows in a group to summary values (e.g., mean iron intake by sex). In pandas, you usually do this with:

df.groupby(<group-cols>)[<value-col>].agg(<how>)

What can <how> be?

  1. A single function: 'mean', 'median', 'count', 'std', or a custom function:
df.groupby('Nutrient')['Value'].agg('mean')
  2. A list of functions, which returns one column per statistic:
df.groupby('Nutrient')['Value'].agg(['mean', 'median', 'count'])
  3. A dictionary mapping column → function(s) (see the note after this list):
df.groupby(['Nutrient', 'Sex']).agg({
    'Value': ['mean', 'median', 'count'],
    'Age': 'mean'  # optional extra summaries
})
  4. Named aggregations (cleaner column names):
df.groupby(['Nutrient', 'Sex']).agg(
    Mean_Value=('Value', 'mean'),
    Median_Value=('Value', 'median'),
    N=('Value', 'size')
)
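One practical wrinkle with the list and dictionary forms: they return a MultiIndex on the columns (e.g., ('Value', 'mean')), which is awkward to export or plot. A minimal sketch of one common way to flatten it (the 'Value_mean'-style names are just a convention, not anything pandas requires):

# The dictionary form yields MultiIndex columns such as ('Value', 'mean')
multi = df.groupby(['Nutrient', 'Sex']).agg({'Value': ['mean', 'median', 'count']})

# Flatten to single-level names like 'Value_mean' before exporting or plotting
multi.columns = ['_'.join(col) for col in multi.columns]
display(multi.reset_index().head())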

Why aggregate?

  • To summarise groups (e.g., mean intake per nutrient, per sex).
  • To compare across categories (e.g., male vs female, 2024 vs 2025).
  • To reduce noise and prepare for plotting or reporting.

3.1 Basic aggregations (mean, median, count)

# Mean Value by Nutrient
mean_by_nutrient = df.groupby('Nutrient')['Value'].mean().sort_values(ascending=False)
display(mean_by_nutrient)

# Multiple aggregations by Nutrient and Sex (named aggregations for tidy column names)
agg_ns = (
    df.groupby(['Nutrient','Sex'])
      .agg(Mean_Value=('Value','mean'),
           Median_Value=('Value','median'),
           N=('Value','size'))
      .reset_index()
)
display(agg_ns.head(8))

4) Reshaping Between Wide and Long

Many real datasets arrive in wide form; most analyses and plotting libraries prefer long (tidy) form.

Long (tidy) → one row per observation

  • Columns: ID, Year, Age, Sex, Nutrient, Value

Wide → columns are variables (e.g., one column per nutrient)

  • Columns: ID, Year, Age, Sex, Iron, Calcium, Vitamin_D, …

4.1 Long → Wide (pivot wider)

Use pivot_table with index = identifier columns and columns = the variable you want as new columns (Nutrient). Values come from Value.

id_cols = ['ID','Year','Age','Sex']
wide = df.pivot_table(index=id_cols, columns='Nutrient', values='Value', aggfunc='mean')
wide = wide.reset_index()
display(wide.head(5))

4.2 Wide → Long (melt)

Use melt to gather nutrient columns back into two columns: Nutrient (variable name) and Value (measurement). This is the canonical tidy structure for plotting, modelling, and faceting.

value_vars = [c for c in wide.columns if c not in id_cols]
long_again = wide.melt(id_vars=id_cols, value_vars=value_vars,
                      var_name='Nutrient', value_name='Value')
display(long_again.head(6))

5) Pivot Tables for Comparisons

Cross-tabulate means by two dimensions (e.g., Nutrient × Year).

pt = df.pivot_table(values='Value', index='Nutrient', columns='Year', aggfunc='mean')
display(pt)
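pd.crosstab can build the same table; which spelling you use is mostly a matter of taste. A sketch of the equivalent call:

# Equivalent cross-tabulation via pd.crosstab
pt2 = pd.crosstab(index=df['Nutrient'], columns=df['Year'],
                  values=df['Value'], aggfunc='mean')
display(pt2)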

6) Derived Variables and Numeric Transformations

Real analyses often need transformed variables to meet model assumptions or to compare across scales.

6.1 Log transform

Use when data are right-skewed (e.g., highly variable intakes). Add a small ε to avoid log(0).

6.2 z-score standardisation

Convert values to standard units: (x - mean) / std. Helpful when combining variables on different scales.

6.3 Min–max scaling (0–1)

Maps the minimum to 0 and maximum to 1. Useful for comparability in dashboards.

6.4 Group-wise transforms

Standardise within groups (e.g., within each Nutrient) so each nutrient’s distribution is centred and scaled separately.

df_trans = df.copy()

# 6.1 Log transform: add epsilon to avoid log(0) if present
eps = 1e-6
df_trans['Value_log'] = np.log(df_trans['Value'] + eps)

# 6.2 Global z-score
mu = df_trans['Value'].mean()
sd = df_trans['Value'].std(ddof=0)
df_trans['Value_z'] = (df_trans['Value'] - mu) / (sd if sd else 1.0)

# 6.3 Min–max scaling
vmin, vmax = df_trans['Value'].min(), df_trans['Value'].max()
rng = vmax - vmin if vmax > vmin else 1.0
df_trans['Value_minmax'] = (df_trans['Value'] - vmin) / rng

# 6.4 Group-wise z-score by Nutrient
df_trans['Value_z_by_nutrient'] = (
    df_trans.groupby('Nutrient')['Value']
           .transform(lambda s: (s - s.mean()) / (s.std(ddof=0) if s.std(ddof=0) else 1.0))
)

display(df_trans.head(8))
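As an optional sanity check, the group-wise z-scores should now have mean ≈ 0 and standard deviation ≈ 1 within each nutrient; a sketch using ddof=0 to match the standardisation above:

# Within each nutrient, Value_z_by_nutrient should have mean ~0 and SD (ddof=0) ~1
check = df_trans.groupby('Nutrient')['Value_z_by_nutrient'].agg(
    Mean='mean',
    SD=lambda s: s.std(ddof=0),
)
display(check.round(3))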

7) Density Plots — Visualise Transformations

Let’s visualise how the distribution changes after transformations. We’ll use histograms with density=True (normalised so the bars integrate to 1, a rough stand-in for a density curve) for one nutrient (e.g., Iron). You should see that the log transform often makes a right-skewed distribution more symmetric, and that z-scores centre the data at 0 with unit variance.

# Choose one nutrient to illustrate (fallback to first if Iron absent)
nutrient_to_plot = 'Iron' if 'Iron' in df['Nutrient'].unique() else df['Nutrient'].unique()[0]
sub = df_trans[df_trans['Nutrient'] == nutrient_to_plot].copy()

print(f'Plotting distributions for nutrient: {nutrient_to_plot}')

# Original scale density (histogram)
plt.figure()
sub['Value'].plot(kind='hist', bins=20, density=True, edgecolor='black')
plt.title(f'{nutrient_to_plot}: Original Value (density)')
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()

# Log-transformed density
plt.figure()
sub['Value_log'].plot(kind='hist', bins=20, density=True, edgecolor='black')
plt.title(f'{nutrient_to_plot}: Log(Value) (density)')
plt.xlabel('log(Value + ε)')
plt.ylabel('Density')
plt.show()

# Z-score density
plt.figure()
sub['Value_z'].plot(kind='hist', bins=20, density=True, edgecolor='black')
plt.title(f'{nutrient_to_plot}: Z-score (density)')
plt.xlabel('Z-score')
plt.ylabel('Density')
plt.show()
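If SciPy is available (Colab ships with it), pandas can also draw a smooth kernel density estimate rather than a binned histogram; a minimal sketch on the original scale:

# Smooth KDE alternative to the histograms above (requires scipy)
plt.figure()
sub['Value'].plot(kind='kde')
plt.title(f'{nutrient_to_plot}: Original Value (KDE)')
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()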

8) (Bonus) Time-based and Rolling Transforms

For longitudinal data (e.g., the same ID measured across Year), compute rolling means or deltas. This is most informative when your dataset has multiple years per ID.

df_roll = df.sort_values(['ID', 'Nutrient', 'Year']).copy()
grouped = df_roll.groupby(['ID', 'Nutrient'])['Value']

# Rolling mean over the last two observations within each ID/Nutrient series
df_roll['Value_roll_mean'] = grouped.transform(
    lambda s: s.rolling(window=2, min_periods=1).mean()
)

# Year-on-year change within each ID/Nutrient series
df_roll['Value_delta'] = grouped.diff()
display(df_roll.head(10))

🧪 Exercise 1 — Grouped Summary

Filter to Nutrient == 'Vitamin_D', then compute median Value by Sex and Year. Present the result sorted by Year and then Sex.

Hints:

  • Filter: df[df['Nutrient'] == 'Vitamin_D']
  • Group and aggregate: .groupby(['Sex','Year'])['Value'].median().reset_index()
  • Sort: .sort_values(['Year','Sex'])

# Your code here

🧪 Exercise 2 — Wide ↔︎ Long

  1. Create a wide table with index [ID, Year] and columns as nutrients (values = mean Value).
  2. Convert that wide table back to long using melt with Nutrient and Value columns.

Hints:

  • pivot_table(index=['ID','Year'], columns='Nutrient', values='Value', aggfunc='mean')
  • melt(id_vars=['ID','Year'], var_name='Nutrient', value_name='Value')

# Your code here

🧪 Exercise 3 — Transformations for Modelling

Create a dataframe with these new columns on the Iron subset only:

  • Value_log_iron — log-transformed value with epsilon.
  • Value_z_iron_by_sex — z-score within Sex.
  • Value_minmax_iron_by_year — min–max scaling within Year.

Hints:

  • Filter: df['Nutrient'] == 'Iron'
  • Group-wise transforms with .groupby(...).transform(...)

# Your code here

✅ Conclusion & Next Steps

You’ve practised the core transformation tools:

  • Filtering and selecting for clear subsets.
  • Aggregation with .groupby(...).agg(...), including named aggregations and why aggregation matters.
  • Reshaping between wide and long for tidy analysis.
  • Pivot tables for quick comparisons.
  • Numeric transformations: log, z-score, min–max, and group-wise standardisation — plus density plots to see the changes.

👉 Next: 3.5 Data Aggregation — robust summaries, grouped statistics, and combining pipelines for analysis-ready datasets.

Resources:

  • Pandas GroupBy: https://pandas.pydata.org/docs/user_guide/groupby.html
  • Reshaping (melt/pivot): https://pandas.pydata.org/docs/user_guide/reshaping.html
  • Working with dtypes: https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes