```python
# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

MODULE = '03_data_handling'  # module folder
DATASET = 'hippo_nutrients.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

try:
    if not os.path.exists(BASE_PATH):
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    os.chdir(MODULE_PATH)
    if not os.path.exists(DATASET_PATH):
        raise FileNotFoundError(DATASET_PATH)
    print(f'Dataset found: {DATASET_PATH} 🦛')
except Exception as e:
    print(f'Automatic clone failed: {e}')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} 🦛')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')
```
📈 3.5 Data Aggregation
Aggregation turns rows of measurements into concise summaries. In nutrition science, that might be mean iron intake by sex, median calcium by age group, or the percentage of hippos meeting a reference intake.

This notebook gives you a practical tour of pandas aggregation: `groupby`, custom aggregations, weighted summaries, joining reference data, and avoiding common pitfalls.
🎯 Objectives
By the end you can:

- Use `groupby(...).agg(...)` for multi-metric summaries (mean, median, count, SD, percentiles).
- Compute weighted summaries (e.g., weight-adjusted averages).
- Join tables (left / inner / right) safely and diagnose join issues.
- Concatenate multiple datasets and add group labels.
- Produce clean, analysis-ready summary tables.
We’ll work with `hippo_nutrients.csv` and a small reference-intake table.
Install libraries.
```python
%pip install pandas numpy -q
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 40)
print('Environment ready.')
```
1) Load and Inspect
We’ll start by loading the data and checking the structure. The columns should include `ID`, `Nutrient`, `Year`, `Value`, `Age`, `Sex`.
```python
df = pd.read_csv('data/hippo_nutrients.csv')
print(df.shape)
display(df.head())
display(df.dtypes)
```
2) GroupBy Basics — The Split–Apply–Combine Pattern
`groupby` splits the data into groups, applies a function within each group, and combines the results.
Example: mean `Value` by `Nutrient` and `Sex`
```python
mean_by_ns = df.groupby(['Nutrient','Sex'], as_index=True)['Value'].mean()
display(mean_by_ns)
```
Multiple metrics at once
You can compute several statistics in one go using `.agg` with named aggregations (a dict or list of functions also works). We’ll compute the count, mean, median, SD, and 90th percentile (0.9 quantile).
```python
summary_multi = (
    df.groupby(['Nutrient','Sex'])['Value']
    .agg(count='count',
         mean='mean',
         median='median',
         sd='std',
         p90=lambda s: s.quantile(0.9))
)
display(summary_multi)
```
Optional: Reset index and tidy column names
When presenting results, it’s often clearer to have a flat index and simple column names.
```python
summary_tidy = summary_multi.reset_index()
summary_tidy.columns = [c.replace(' ', '_') for c in summary_tidy.columns]
display(summary_tidy.head())
```
3) Weighted Aggregation
Sometimes a plain mean is misleading. Suppose we have body weights per hippo and want a weight-adjusted average nutrient value (purely illustrative). We’ll create a tiny weight table and compute a weighted mean by `Nutrient` and `Sex`.
Pedagogical note: In real analyses, choose weights that are defensible (e.g., sampling weights, exposure weights).
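For reference, the weighted mean that the helper below computes is the standard formula

$$\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i},$$

where any $(x_i, w_i)$ pair containing a non-finite value is dropped before summing.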
```python
# Toy weight data (kg) — in practice this would come from another table
weights = pd.DataFrame({
    'ID': df['ID'].drop_duplicates().head(6).tolist(),
    'BodyWeight_kg': [2000, 2100, 1950, 2050, 1980, 2020][:len(df['ID'].drop_duplicates().head(6))]
})
display(weights.head())

# Left-join weights into df
df_w = df.merge(weights, on='ID', how='left')
display(df_w.head())
```
```python
def weighted_mean(x, w):
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    m = np.isfinite(x) & np.isfinite(w)
    if m.sum() == 0 or w[m].sum() == 0:
        return np.nan
    return np.sum(x[m] * w[m]) / np.sum(w[m])

wm = (
    df_w.groupby(['Nutrient','Sex'])
    .apply(lambda g: weighted_mean(g['Value'], g['BodyWeight_kg']),
           include_groups=False)
    .rename('weighted_mean')
    .reset_index()
)
display(wm.head())
```
4) Joining Reference Data (Left/Inner/Right)
Let’s define a small reference intake table (illustrative), join it to our nutrient data, and compute % of reference achieved.
We’ll use left joins to keep our rows, and demonstrate inner joins (keep matching rows only) for comparison.
Good practice when joining
- Inspect key uniqueness (e.g., each Nutrient appears once in the reference table).
- Check row counts before/after joins (a minimal pattern is sketched below).
- Consider anti-joins to find non-matching rows (rows with no partner).
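Here is a minimal sketch of those first two checks, assuming the `df` loaded earlier and the `ref` table defined just below; `validate='m:1'` asks `merge` to raise a `MergeError` if the join is not many-to-one:

```python
# Sketch: sanity checks around a left join (uses df and the ref table below)
rows_before = len(df)

# validate='m:1' raises MergeError if 'Nutrient' is duplicated in ref
checked = df.merge(ref, on='Nutrient', how='left', validate='m:1')

rows_after = len(checked)
print(f'Rows before: {rows_before}, after: {rows_after}')
assert rows_before == rows_after, 'Left join changed the row count; inspect the keys!'
```

A many-to-one left join should never change the row count, so an assertion like this catches accidental duplication early.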
```python
# Reference intakes (illustrative values)
ref = pd.DataFrame({
    'Nutrient': ['Iron','Calcium','Vitamin_D'],
    'ReferenceValue': [15.0, 1200.0, 10.0],  # units aligned with df['Value']
    'Units': ['mg/day','mg/day','µg/day']
})
display(ref)

# Check key uniqueness in ref
assert ref['Nutrient'].is_unique, 'Reference Nutrient keys should be unique'
```
```python
# Aggregate df first (e.g., mean by Nutrient and Sex), then join reference
mean_by_ns = df.groupby(['Nutrient','Sex'], as_index=False)['Value'].mean()
left_joined = mean_by_ns.merge(ref, on='Nutrient', how='left')
left_joined['Pct_of_Ref'] = 100 * left_joined['Value'] / left_joined['ReferenceValue']
display(left_joined)
```
```python
# Inner join (only rows with a matching Nutrient in ref)
inner_joined = mean_by_ns.merge(ref, on='Nutrient', how='inner')
display(inner_joined.head())
```
```python
# Anti-join: which Nutrients in our data are missing from ref?
missing_in_ref = mean_by_ns[~mean_by_ns['Nutrient'].isin(ref['Nutrient'])]
display(missing_in_ref.drop_duplicates(subset=['Nutrient']))
```
5) Concatenation (Stacking Datasets)
If you receive multiple similar files (e.g., two cohorts or two years), you can stack them with `pd.concat`.

- Ensure columns align (same names and types).
- Optionally add a column to track the dataset (`source`, `cohort`, `year`).
```python
# Small synthetic split: first half vs second half
half = len(df)//2
df_a = df.iloc[:half].copy(); df_a['source'] = 'A'
df_b = df.iloc[half:].copy(); df_b['source'] = 'B'

stacked = pd.concat([df_a, df_b], axis=0, ignore_index=True)
display(stacked[['source','Nutrient','Value']].head())
```
```python
# Aggregate by source to compare
by_source = stacked.groupby(['source','Nutrient'])['Value'].agg(mean='mean', n='count').reset_index()
display(by_source.head())
```
6) Binning and Aggregating (Age Bands)
Sometimes it’s useful to summarise by bands (e.g., age groups). Use `pd.cut` to create categories and aggregate by them.
```python
if 'Age' in df.columns:
    bins = [0, 20, 30, 40, 200]
    labels = ['<20','20–29','30–39','40+']
    df['AgeBand'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
    by_band = (
        df.groupby(['Nutrient','AgeBand'], observed=True)['Value']
        .agg(n='count', mean='mean', median='median')
        .reset_index()
    )
    display(by_band.head())
```
7) Presenting Results: Pivot and Confidence Intervals (quick)
You can pivot summaries for a compact view, and optionally add simple standard errors and approximate 95% CIs (for teaching purposes).
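For the record, the quantities computed below are the standard error and a normal-approximation interval,

$$\mathrm{SE} = \frac{s}{\sqrt{n}}, \qquad \bar{x} \pm 1.96\,\mathrm{SE},$$

where $s$ is the sample SD and $n$ the group size; with small $n$ this is a rough teaching approximation rather than a proper confidence interval.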
```python
g = df.groupby(['Nutrient','Sex'])['Value']

ci_tbl = (
    g.agg(
        mean='mean',
        n='count',
        sd=lambda s: s.std(ddof=1)
    )
    .assign(
        se=lambda d: d['sd'] / np.sqrt(d['n']).where(d['n'] > 0, np.nan),
        low=lambda d: d['mean'] - 1.96 * d['se'],
        high=lambda d: d['mean'] + 1.96 * d['se']
    )
    .drop(columns='sd')
    .reset_index()
)
display(ci_tbl.head())

ci_pivot = ci_tbl.pivot_table(index='Nutrient', columns='Sex', values='mean')
display(ci_pivot)
```
8) Common Pitfalls (and fixes)
- Missing values: by default, many aggregations ignore NaNs; check `count` against the expected sample size.
- Unequal groups: when comparing means across groups with very different `n`, consider medians, robust measures, or report CIs.
- Joins that multiply rows: many-to-many joins can inflate data; inspect key uniqueness before joining (see the sketch below).
- Units: ensure units match before comparing or aggregating across tables (e.g., mg vs µg).
- MultiIndex outputs: flatten columns or `reset_index()` for presentation.
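To make the join pitfall concrete, here is a minimal sketch with tiny made-up tables (the `left`/`right` names and values are illustrative, not from the hippo data):

```python
import pandas as pd

# Two toy tables that share a duplicated key
left = pd.DataFrame({'Nutrient': ['Iron', 'Iron'], 'Value': [10, 12]})
right = pd.DataFrame({'Nutrient': ['Iron', 'Iron'], 'Ref': [15.0, 14.0]})

# Many-to-many: each 'Iron' row on the left pairs with each on the right,
# so a 2-row table becomes 2 x 2 = 4 rows
m2m = left.merge(right, on='Nutrient', how='inner')
print(len(left), '->', len(m2m))  # 2 -> 4

# Passing validate='1:1' (or 'm:1') would raise a MergeError instead,
# flagging the duplicated key before it silently inflates the data
```

Catching this with `validate` at merge time is usually cheaper than debugging inflated summary statistics later.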
🧪 Exercises
- Multi-metric summary
  - For each `(Nutrient, Year)`, compute: `n`, `mean`, `median`, `sd`, `min`, `max`.
  - Present as a tidy table (no MultiIndex).
- Weighted summary
  - Create a new weighting scheme (e.g., `Age` as a simple weight proxy).
  - Compute a weighted mean `Value` by `(Nutrient, Sex)` and compare to the unweighted mean.
- Join with reference
  - Add a column `% of reference` for each `(Nutrient, Sex)` at Year 2025 only.
  - Which nutrient and sex combination is furthest below reference?
- Concatenate and label
  - Split the dataset into two random halves, label each half, `concat`, and compute the mean `Value` by `(source, Nutrient)`.
  - Are the differences meaningful or just sampling noise?
- Age bands
  - Define your own custom age bands and summarise `Value` by `(Nutrient, AgeBand, Sex)`.
  - Report the top 3 `(Nutrient, AgeBand)` pairs by mean `Value`.
✅ Conclusion
You’ve practised core aggregation patterns:

- `groupby` with multiple metrics and tidy outputs.
- Weighted means where appropriate.
- Safe joins against reference intake tables, with diagnostics.
- Concatenation and labelling of multiple datasets.
- Clear presentation (pivot, CIs) for reporting.
👉 Next: move into 04 Data Analysis for visualisation, correlation, and modelling.
Resources
- Pandas GroupBy: https://pandas.pydata.org/docs/user_guide/groupby.html
- Aggregation & Transform: https://pandas.pydata.org/docs/user_guide/groupby.html#aggregation
- Merge/Join: https://pandas.pydata.org/docs/user_guide/merging.html
- Reshaping (pivot/melt): https://pandas.pydata.org/docs/user_guide/reshaping.html