```python
# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

MODULE = '03_data_handling'
DATASET = 'hippo_nutrients.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

try:
    # Clone the course repository if it isn't already present
    if not os.path.exists(BASE_PATH):
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    os.chdir(MODULE_PATH)
    assert os.path.exists(DATASET_PATH)
    print(f'Dataset found: {DATASET_PATH} ✅')
except Exception as e:
    # Fall back to a manual upload if the clone (or path check) fails
    print(f'Automatic clone failed: {e}')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} ✅')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')

print('Python environment ready.')
```
🔄 3.4 Data Transformation
In this notebook you’ll learn how to transform data so it’s easy to analyse and model. We’ll work with `hippo_nutrients.csv` and cover filtering, grouping, reshaping (wide/long), pivoting, and numeric transformations such as log and z-score standardisation. We’ll also add density plots to visualise how transformations change the distribution.
🎯 Objectives
- Filter rows and select columns with expressive, reproducible code.
- Aggregate groups (what it means, when to use it) using `.groupby(...).agg(...)` with clear syntax.
- Reshape between wide and long formats (melt / pivot).
- Pivot-tabulate values for quick comparisons.
- Create derived variables and apply log, z-score, min–max scaling.
- (Bonus) Group-wise standardisation and rolling transforms over time.
- Use density plots to see how transformations change distributions.
📌 Context
Transformation is the ladder from raw data to insight. In nutrition datasets, you’ll often transform by group (e.g., nutrient, sex, year) and reshape between wide (one row per ID with many columns) and long (one row per measurement).
Wide vs Long

- Wide: each variable has its own column (e.g., `Iron`, `Calcium`, `Vitamin_D`). Good for humans and some models.
- Long (tidy): one row per observation; variables live in columns, not headers (e.g., a `Nutrient` column and a `Value` column). Preferred for plotting, grouping, and modelling.
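To make the distinction concrete, here is a tiny toy example (hypothetical values, not taken from `hippo_nutrients.csv`) showing the same measurements in both shapes:

```python
import pandas as pd

# Long (tidy): one row per observation
long_df = pd.DataFrame({
    'ID': ['H1', 'H1', 'H2', 'H2'],
    'Nutrient': ['Iron', 'Calcium', 'Iron', 'Calcium'],
    'Value': [8.2, 1200.0, 7.9, 1150.0],
})

# Wide: one row per ID, one column per nutrient
wide_df = long_df.pivot(index='ID', columns='Nutrient', values='Value').reset_index()
print(wide_df)
```

`pivot` works here because each (ID, Nutrient) pair occurs exactly once; with duplicates you’d use `pivot_table` with an `aggfunc`, as in section 4.1 below.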
Fun Fact
Transforming data is like a hippo rearranging its snacks: same nutrients, better view! 🦛

Add the required libraries (pandas, numpy, matplotlib for density plots).

```python
%pip install -q pandas numpy matplotlib
import pandas as pd, numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 30)
print('Libraries ready.')
```
1) Load and Inspect
Let’s load `hippo_nutrients.csv` and inspect the first rows and data types.
```python
df = pd.read_csv('data/hippo_nutrients.csv')
print('Shape:', df.shape)
print('Dtypes:\n', df.dtypes)
display(df.head(5))
```
2) Filtering Rows and Selecting Columns
Use boolean masks to keep the rows you want, and bracket notation to select columns. This is bread-and-butter transformation.
```python
# Example: female hippos, Iron only, years ≥ 2024
mask = (df['Sex'] == 'F') & (df['Nutrient'] == 'Iron') & (df['Year'] >= 2024)
cols = ['ID', 'Year', 'Nutrient', 'Value', 'Age', 'Sex']
df_female_iron = df.loc[mask, cols].sort_values(['ID', 'Year']).reset_index(drop=True)
display(df_female_iron.head(5))
```
3) Aggregation — Concept and `.agg` Syntax
Aggregation means reducing many rows in a group to summary values (e.g., mean iron intake by sex). In pandas, you usually do this with:
```python
df.groupby(<group-cols>)[<value-col>].agg(<how>)
```
What can `<how>` be?

- A single function: `'mean'`, `'median'`, `'count'`, `'std'`, or a custom function:

```python
df.groupby('Nutrient')['Value'].agg('mean')
```

- A list of functions, which returns one column per statistic:

```python
df.groupby('Nutrient')['Value'].agg(['mean', 'median', 'count'])
```

- A dictionary mapping column → function(s):

```python
df.groupby(['Nutrient', 'Sex']).agg({
    'Value': ['mean', 'median', 'count'],
    'Age': 'mean'  # optional extra summaries
})
```

- Named aggregations (cleaner column names):

```python
df.groupby(['Nutrient', 'Sex']).agg(
    Mean_Value=('Value', 'mean'),
    Median_Value=('Value', 'median'),
    N=('Value', 'size')
)
```
Why aggregate?
- To summarise groups (e.g., mean intake per nutrient, per sex).
- To compare across categories (e.g., male vs female, 2024 vs 2025).
- To reduce noise and prepare for plotting or reporting.
3.1 Basic aggregations (mean, median, count)
```python
# Mean Value by Nutrient
mean_by_nutrient = df.groupby('Nutrient')['Value'].mean().sort_values(ascending=False)
display(mean_by_nutrient)

# Multiple aggregations by Nutrient and Sex (named aggregations for tidy column names)
agg_ns = (
    df.groupby(['Nutrient', 'Sex'])
      .agg(Mean_Value=('Value', 'mean'),
           Median_Value=('Value', 'median'),
           N=('Value', 'size'))
      .reset_index()
)
display(agg_ns.head(8))
```
4) Reshaping Between Wide and Long
Many real datasets arrive in wide form; most analyses and plots like long (tidy) form.
Long (tidy) → one row per observation

- Columns: `ID`, `Year`, `Age`, `Sex`, `Nutrient`, `Value`

Wide → columns are variables (e.g., one column per nutrient)

- Columns: `ID`, `Year`, `Age`, `Sex`, `Iron`, `Calcium`, `Vitamin_D`, …
4.1 Long → Wide (pivot wider)
Use `pivot_table` with `index` = identifier columns and `columns` = the variable you want spread into new columns (`Nutrient`). Values come from `Value`.
```python
id_cols = ['ID', 'Year', 'Age', 'Sex']
wide = df.pivot_table(index=id_cols, columns='Nutrient', values='Value', aggfunc='mean')
wide = wide.reset_index()
display(wide.head(5))
```
4.2 Wide → Long (melt)
Use `melt` to gather nutrient columns back into two columns: `Nutrient` (variable name) and `Value` (measurement). This is the canonical tidy structure for plotting, modelling, and faceting.
```python
value_vars = [c for c in wide.columns if c not in id_cols]
long_again = wide.melt(id_vars=id_cols, value_vars=value_vars,
                       var_name='Nutrient', value_name='Value')
display(long_again.head(6))
```
5) Pivot Tables for Comparisons
Cross-tabulate means by two dimensions (e.g., `Nutrient` × `Year`).
```python
pt = df.pivot_table(values='Value', index='Nutrient', columns='Year', aggfunc='mean')
display(pt)
```
6) Derived Variables and Numeric Transformations
Real analyses often need transformed variables to meet model assumptions or to compare across scales.
6.1 Log transform
Use when data are right-skewed (e.g., highly variable intakes). Add a small `ε` to avoid `log(0)`.
6.2 z-score standardisation
Convert values to standard units: `(x - mean) / std`. Helpful when combining variables on different scales.
6.3 Min–max scaling (0–1)
Maps the minimum to 0 and maximum to 1. Useful for comparability in dashboards.
6.4 Group-wise transforms
Standardise within groups (e.g., within each `Nutrient`) so each nutrient’s distribution is centred and scaled separately.
```python
df_trans = df.copy()

# 6.1 Log transform: add epsilon to avoid log(0) if present
eps = 1e-6
df_trans['Value_log'] = np.log(df_trans['Value'] + eps)

# 6.2 Global z-score
mu = df_trans['Value'].mean()
sd = df_trans['Value'].std(ddof=0)
df_trans['Value_z'] = (df_trans['Value'] - mu) / (sd if sd else 1.0)

# 6.3 Min–max scaling
vmin, vmax = df_trans['Value'].min(), df_trans['Value'].max()
rng = vmax - vmin if vmax > vmin else 1.0
df_trans['Value_minmax'] = (df_trans['Value'] - vmin) / rng

# 6.4 Group-wise z-score by Nutrient
df_trans['Value_z_by_nutrient'] = (
    df_trans.groupby('Nutrient')['Value']
    .transform(lambda s: (s - s.mean()) / (s.std(ddof=0) if s.std(ddof=0) else 1.0))
)

display(df_trans.head(8))
```
7) Density Plots — Visualise Transformations
Let’s visualise how the distribution changes after transformations. We’ll use histograms with density=True (a simple density estimate) for one nutrient (e.g., Iron). You should see that the log transform often makes a right-skewed distribution more symmetric, and z-scores centre the data at 0 with unit variance.
```python
# Choose one nutrient to illustrate (fallback to first if Iron absent)
nutrient_to_plot = 'Iron' if 'Iron' in df['Nutrient'].unique() else df['Nutrient'].unique()[0]
sub = df_trans[df_trans['Nutrient'] == nutrient_to_plot].copy()
print(f'Plotting distributions for nutrient: {nutrient_to_plot}')

# Original scale density (histogram)
plt.figure()
sub['Value'].plot(kind='hist', bins=20, density=True, edgecolor='black')
plt.title(f'{nutrient_to_plot}: Original Value (density)')
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()

# Log-transformed density
plt.figure()
sub['Value_log'].plot(kind='hist', bins=20, density=True, edgecolor='black')
plt.title(f'{nutrient_to_plot}: Log(Value) (density)')
plt.xlabel('log(Value + ε)')
plt.ylabel('Density')
plt.show()

# Z-score density
plt.figure()
sub['Value_z'].plot(kind='hist', bins=20, density=True, edgecolor='black')
plt.title(f'{nutrient_to_plot}: Z-score (density)')
plt.xlabel('Z-score')
plt.ylabel('Density')
plt.show()
```
8) (Bonus) Time-based and Rolling Transforms
For longitudinal data (e.g., the same `ID` measured across `Year`), compute rolling means or deltas. This is illustrative if your dataset has multiple years per ID.
```python
df_roll = (
    df.sort_values(['ID', 'Nutrient', 'Year'])
      .groupby(['ID', 'Nutrient'], as_index=False)
      .apply(lambda g: g.assign(
          Value_roll_mean=g['Value'].rolling(window=2, min_periods=1).mean(),
          Value_delta=g['Value'].diff()
      ))
)
display(df_roll.head(10))
```
🧪 Exercise 1 — Grouped Summary
Filter to `Nutrient == 'Vitamin_D'`, then compute median `Value` by `Sex` and `Year`. Present the result sorted by `Year` and then `Sex`.

Hints:
- Filter: `df[df['Nutrient'] == 'Vitamin_D']`
- Group and aggregate: `.groupby(['Sex','Year'])['Value'].median().reset_index()`
- Sort: `.sort_values(['Year','Sex'])`
```python
# Your code here
```
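If you want to check your approach afterwards, here is one possible solution, sketched on a small synthetic frame (hypothetical values standing in for the notebook’s `df`):

```python
import pandas as pd

# Synthetic stand-in for df (hypothetical values, same columns as hippo_nutrients.csv)
df = pd.DataFrame({
    'Nutrient': ['Vitamin_D'] * 4 + ['Iron'] * 2,
    'Sex': ['F', 'M', 'F', 'M', 'F', 'M'],
    'Year': [2024, 2024, 2025, 2025, 2024, 2025],
    'Value': [10.0, 12.0, 11.0, 13.0, 8.0, 9.0],
})

# Filter, aggregate, sort — exactly the three hinted steps
vit_d = df[df['Nutrient'] == 'Vitamin_D']
result = (
    vit_d.groupby(['Sex', 'Year'])['Value'].median()
         .reset_index()
         .sort_values(['Year', 'Sex'])
         .reset_index(drop=True)
)
print(result)
```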
🧪 Exercise 2 — Wide ↔︎ Long
- Create a wide table with index `[ID, Year]` and columns as nutrients (values = mean `Value`).
- Convert that wide table back to long using `melt` with `Nutrient` and `Value` columns.

Hints:
- `pivot_table(index=['ID','Year'], columns='Nutrient', values='Value', aggfunc='mean')`
- `melt(id_vars=['ID','Year'], var_name='Nutrient', value_name='Value')`
```python
# Your code here
```
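One possible solution, again sketched on a small synthetic frame (hypothetical values standing in for `df`):

```python
import pandas as pd

# Synthetic stand-in for df (hypothetical values)
df = pd.DataFrame({
    'ID': ['H1', 'H1', 'H2', 'H2'],
    'Year': [2024, 2024, 2024, 2024],
    'Nutrient': ['Iron', 'Calcium', 'Iron', 'Calcium'],
    'Value': [8.0, 1200.0, 9.0, 1100.0],
})

# Long → wide: one column per nutrient, mean Value per (ID, Year)
wide = df.pivot_table(index=['ID', 'Year'], columns='Nutrient',
                      values='Value', aggfunc='mean').reset_index()

# Wide → long: gather the nutrient columns back into Nutrient/Value
long_back = wide.melt(id_vars=['ID', 'Year'], var_name='Nutrient', value_name='Value')
print(long_back)
```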
🧪 Exercise 3 — Transformations for Modelling
Create a dataframe with these new columns on the Iron subset only:
- `Value_log_iron` — log-transformed value with epsilon.
- `Value_z_iron_by_sex` — z-score within Sex.
- `Value_minmax_iron_by_year` — min–max scaling within Year.

Hints:
- Filter: `df['Nutrient'] == 'Iron'`
- Group-wise transforms with `.groupby(...).transform(...)`
```python
# Your code here
```
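One possible solution, sketched on a small synthetic frame (hypothetical values standing in for `df`), reusing the group-wise `transform` pattern from section 6:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df (hypothetical values)
df = pd.DataFrame({
    'Nutrient': ['Iron'] * 4 + ['Calcium'] * 2,
    'Sex': ['F', 'M', 'F', 'M', 'F', 'M'],
    'Year': [2024, 2024, 2025, 2025, 2024, 2025],
    'Value': [8.0, 10.0, 9.0, 12.0, 1200.0, 1100.0],
})

eps = 1e-6
iron = df[df['Nutrient'] == 'Iron'].copy()

# Log transform with epsilon
iron['Value_log_iron'] = np.log(iron['Value'] + eps)

# Z-score within Sex (ddof=0, guarding against zero spread)
iron['Value_z_iron_by_sex'] = iron.groupby('Sex')['Value'].transform(
    lambda s: (s - s.mean()) / (s.std(ddof=0) if s.std(ddof=0) else 1.0)
)

# Min–max scaling within Year (guarding against zero range)
iron['Value_minmax_iron_by_year'] = iron.groupby('Year')['Value'].transform(
    lambda s: (s - s.min()) / ((s.max() - s.min()) if s.max() > s.min() else 1.0)
)
print(iron)
```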
✅ Conclusion & Next Steps
You’ve practised the core transformation tools:
- Filtering and selecting for clear subsets.
- Aggregation with `.groupby(...).agg(...)`, including named aggregations and why aggregation matters.
- Reshaping between wide and long for tidy analysis.
- Pivot tables for quick comparisons.
- Numeric transformations: log, z-score, min–max, and group-wise standardisation — plus density plots to see the changes.
👉 Next: 3.5 Data Aggregation — robust summaries, grouped statistics, and combining pipelines for analysis-ready datasets.
Resources:
- Pandas GroupBy: https://pandas.pydata.org/docs/user_guide/groupby.html
- Reshaping (melt/pivot): https://pandas.pydata.org/docs/user_guide/reshaping.html
- Working with dtypes: https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes