```python
# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

MODULE = '03_data_handling'  # module folder
DATASET = 'hippo_nutrients.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

try:
    if not os.path.exists(BASE_PATH):
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    os.chdir(MODULE_PATH)
    if not os.path.exists(DATASET_PATH):
        raise FileNotFoundError(DATASET_PATH)
    print(f'Dataset found: {DATASET_PATH} 🦛')
except Exception as e:
    print(f'Automatic clone failed: {e}')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} 🦛')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')
```
📈 3.5 Data Aggregation
Aggregation turns rows of measurements into concise summaries. In nutrition science, that might be mean iron intake by sex, median calcium by age group, or the percentage of hippos meeting a reference intake.

This notebook gives you a practical tour of pandas aggregation: `groupby`, custom aggregations, weighted summaries, joining reference data, and avoiding common pitfalls.
🎯 Objectives
By the end you can:

- Use `groupby(...).agg(...)` for multi-metric summaries (mean, median, count, SD, percentiles).
- Compute weighted summaries (e.g., weight-adjusted averages).
- Join tables (left / inner / right) safely and diagnose join issues.
- Concatenate multiple datasets and add group labels.
- Produce clean, analysis-ready summary tables.
We’ll work with `hippo_nutrients.csv` and a small reference-intake table.
Install libraries.
```python
%pip install pandas numpy -q
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 40)
print('Environment ready.')
```
1) Load and Inspect
We’ll start by loading the data and checking the structure. The columns should include `ID`, `Nutrient`, `Year`, `Value`, `Age`, `Sex`.
```python
df = pd.read_csv('data/hippo_nutrients.csv')
print(df.shape)
display(df.head())
display(df.dtypes)
```
2) GroupBy Basics — The Split–Apply–Combine Pattern
`groupby` splits the data into groups, applies a function within each group, and combines the results.
Example: mean `Value` by `Nutrient` and `Sex`
```python
mean_by_ns = df.groupby(['Nutrient','Sex'], as_index=True)['Value'].mean()
display(mean_by_ns)
```
Multiple metrics at once
You can compute several statistics in one go using `.agg` with named aggregations (a dict or list of functions also works). We’ll compute the count, mean, median, SD, and 90th percentile (0.9 quantile).
```python
summary_multi = (
    df.groupby(['Nutrient','Sex'])['Value']
    .agg(count='count',
         mean='mean',
         median='median',
         sd='std',
         p90=lambda s: s.quantile(0.9))
)
display(summary_multi)
```
Optional: Reset index and tidy column names
When presenting results, it’s often clearer to have a flat index and simple column names.
```python
summary_tidy = summary_multi.reset_index()
summary_tidy.columns = [c.replace(' ', '_') for c in summary_tidy.columns]
display(summary_tidy.head())
```
3) Weighted Aggregation
Sometimes a plain mean is misleading. Suppose we have body weights per hippo and want a weight-adjusted average nutrient value (purely illustrative). We’ll create a tiny weight table and compute a weighted mean by `Nutrient` and `Sex`.
Pedagogical note: In real analyses, choose weights that are defensible (e.g., sampling weights, exposure weights).
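For reference, the weighted mean that the helper below computes is the standard formula

$$\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i},$$

where any $(x_i, w_i)$ pair containing a non-finite value is dropped before summing.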
```python
# Toy weight data (kg) — in practice this would come from another table
weights = pd.DataFrame({
    'ID': df['ID'].drop_duplicates().head(6).tolist(),
    'BodyWeight_kg': [2000, 2100, 1950, 2050, 1980, 2020][:len(df['ID'].drop_duplicates().head(6))]
})
display(weights.head())

# Left-join weights into df
df_w = df.merge(weights, on='ID', how='left')
display(df_w.head())
```
```python
def weighted_mean(x, w):
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    m = np.isfinite(x) & np.isfinite(w)
    if m.sum() == 0 or w[m].sum() == 0:
        return np.nan
    return np.sum(x[m] * w[m]) / np.sum(w[m])

wm = (
    df_w.groupby(['Nutrient','Sex'])
    .apply(lambda g: weighted_mean(g['Value'], g['BodyWeight_kg']),
           include_groups=False)
    .rename('weighted_mean')
    .reset_index()
)
display(wm.head())
```
4) Joining Reference Data (Left/Inner/Right)
Let’s define a small reference intake table (illustrative), join it to our nutrient data, and compute % of reference achieved.
We’ll use left joins to keep our rows, and demonstrate inner joins (keep matching rows only) for comparison.
Good practice when joining
- Inspect key uniqueness (e.g., each Nutrient appears once in the reference table).
- Check row counts before/after joins (a minimal pattern is sketched below).
- Consider anti-joins to find non-matching rows (rows with no partner).
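Here is a minimal sketch of those first two checks, assuming the `df` loaded earlier and the `ref` table defined just below; `validate='m:1'` asks `merge` to raise a `MergeError` if the join is not many-to-one:

```python
# Sketch: sanity checks around a left join (uses df and the ref table below)
rows_before = len(df)

# validate='m:1' raises MergeError if 'Nutrient' is duplicated in ref
checked = df.merge(ref, on='Nutrient', how='left', validate='m:1')

rows_after = len(checked)
print(f'Rows before: {rows_before}, after: {rows_after}')
assert rows_before == rows_after, 'Left join changed the row count; inspect the keys!'
```

A many-to-one left join should never change the row count, so an assertion like this catches accidental duplication early.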
```python
# Reference intakes (illustrative values)
ref = pd.DataFrame({
    'Nutrient': ['Iron','Calcium','Vitamin_D'],
    'ReferenceValue': [15.0, 1200.0, 10.0],  # units aligned with df['Value']
    'Units': ['mg/day','mg/day','µg/day']
})
display(ref)

# Check key uniqueness in ref
assert ref['Nutrient'].is_unique, 'Reference Nutrient keys should be unique'
```
```python
# Aggregate df first (e.g., mean by Nutrient and Sex), then join reference
mean_by_ns = df.groupby(['Nutrient','Sex'], as_index=False)['Value'].mean()
left_joined = mean_by_ns.merge(ref, on='Nutrient', how='left')
left_joined['Pct_of_Ref'] = 100 * left_joined['Value'] / left_joined['ReferenceValue']
display(left_joined)
```
```python
# Inner join (only rows with a matching Nutrient in ref)
inner_joined = mean_by_ns.merge(ref, on='Nutrient', how='inner')
display(inner_joined.head())
```
```python
# Anti-join: which Nutrients in our data are missing from ref?
missing_in_ref = mean_by_ns[~mean_by_ns['Nutrient'].isin(ref['Nutrient'])]
display(missing_in_ref.drop_duplicates(subset=['Nutrient']))
```
5) Concatenation (Stacking Datasets)
If you receive multiple similar files (e.g., two cohorts or two years), you can stack them with `pd.concat`.

- Ensure columns align (same names and types).
- Optionally add a column to track the dataset (`source`, `cohort`, `year`).
```python
# Small synthetic split: first half vs second half
half = len(df)//2
df_a = df.iloc[:half].copy(); df_a['source'] = 'A'
df_b = df.iloc[half:].copy(); df_b['source'] = 'B'

stacked = pd.concat([df_a, df_b], axis=0, ignore_index=True)
display(stacked[['source','Nutrient','Value']].head())
```
```python
# Aggregate by source to compare
by_source = stacked.groupby(['source','Nutrient'])['Value'].agg(mean='mean', n='count').reset_index()
display(by_source.head())
```
6) Binning and Aggregating (Age Bands)
Sometimes it’s useful to summarise by bands (e.g., age groups). Use `pd.cut` to create categories and aggregate by them.
```python
if 'Age' in df.columns:
    bins = [0, 20, 30, 40, 200]
    labels = ['<20','20–29','30–39','40+']
    df['AgeBand'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
    by_band = (
        df.groupby(['Nutrient','AgeBand'], observed=True)['Value']
        .agg(n='count', mean='mean', median='median')
        .reset_index()
    )
    display(by_band.head())
```
7) Presenting Results: Pivot and Confidence Intervals (quick)
You can pivot summaries for a compact view, and optionally add simple standard errors and approximate 95% CIs (for teaching purposes).
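For the record, the quantities computed below are the standard error and a normal-approximation interval,

$$\mathrm{SE} = \frac{s}{\sqrt{n}}, \qquad \bar{x} \pm 1.96\,\mathrm{SE},$$

where $s$ is the sample SD and $n$ the group size; with small $n$ this is a rough teaching approximation rather than a proper confidence interval.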
```python
g = df.groupby(['Nutrient','Sex'])['Value']

ci_tbl = (
    g.agg(
        mean='mean',
        n='count',
        sd=lambda s: s.std(ddof=1)
    )
    .assign(
        se=lambda d: d['sd'] / np.sqrt(d['n']).where(d['n'] > 0, np.nan),
        low=lambda d: d['mean'] - 1.96 * d['se'],
        high=lambda d: d['mean'] + 1.96 * d['se']
    )
    .drop(columns='sd')
    .reset_index()
)
display(ci_tbl.head())

ci_pivot = ci_tbl.pivot_table(index='Nutrient', columns='Sex', values='mean')
display(ci_pivot)
```
8) Common Pitfalls (and fixes)
- Missing values: by default, many aggregations ignore NaNs; check `count` against the expected sample size.
- Unequal groups: when comparing means across groups with very different `n`, consider medians, robust measures, or report CIs.
- Joins that multiply rows: many-to-many joins can inflate data; inspect key uniqueness before joining (see the sketch below).
- Units: ensure units match before comparing or aggregating across tables (e.g., mg vs µg).
- MultiIndex outputs: flatten columns or `reset_index()` for presentation.
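To make the join pitfall concrete, here is a minimal sketch with tiny made-up tables (the `left`/`right` names and values are illustrative, not from the hippo data):

```python
import pandas as pd

# Two toy tables that share a duplicated key
left = pd.DataFrame({'Nutrient': ['Iron', 'Iron'], 'Value': [10, 12]})
right = pd.DataFrame({'Nutrient': ['Iron', 'Iron'], 'Ref': [15.0, 14.0]})

# Many-to-many: each 'Iron' row on the left pairs with each on the right,
# so a 2-row table becomes 2 x 2 = 4 rows
m2m = left.merge(right, on='Nutrient', how='inner')
print(len(left), '->', len(m2m))  # 2 -> 4

# Passing validate='1:1' (or 'm:1') would raise a MergeError instead,
# flagging the duplicated key before it silently inflates the data
```

Catching this with `validate` at merge time is usually cheaper than debugging inflated summary statistics later.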
🧪 Exercises
- Multi-metric summary
  - For each `(Nutrient, Year)`, compute: `n`, `mean`, `median`, `sd`, `min`, `max`.
  - Present as a tidy table (no MultiIndex).
- Weighted summary
  - Create a new weighting scheme (e.g., `Age` as a simple weight proxy).
  - Compute a weighted mean `Value` by `(Nutrient, Sex)` and compare to the unweighted mean.
- Join with reference
  - Add a column `% of reference` for each `(Nutrient, Sex)` at Year 2025 only.
  - Which nutrient and sex combination is furthest below reference?
- Concatenate and label
  - Split the dataset into two random halves, label each half, `concat`, and compute the mean `Value` by `(source, Nutrient)`.
  - Are the differences meaningful or just sampling noise?
- Age bands
  - Define your own custom age bands and summarise `Value` by `(Nutrient, AgeBand, Sex)`.
  - Report the top 3 `(Nutrient, AgeBand)` pairs by mean `Value`.
✅ Conclusion
You’ve practised core aggregation patterns:

- `groupby` with multiple metrics and tidy outputs.
- Weighted means where appropriate.
- Safe joins against reference intake tables, with diagnostics.
- Concatenation and labelling of multiple datasets.
- Clear presentation (pivot, CIs) for reporting.
👉 Next: move into 04 Data Analysis for visualisation, correlation, and modelling.
Resources
- Pandas GroupBy: https://pandas.pydata.org/docs/user_guide/groupby.html
- Aggregation & Transform: https://pandas.pydata.org/docs/user_guide/groupby.html#aggregation
- Merge/Join: https://pandas.pydata.org/docs/user_guide/merging.html
- Reshaping (pivot/melt): https://pandas.pydata.org/docs/user_guide/reshaping.html