📈 3.5 Data Aggregation

Aggregation turns rows of measurements into concise summaries. In nutrition science, that might be the mean iron intake by sex, the median calcium intake by age group, or the percentage of hippos meeting a reference intake.

This notebook gives you a practical tour of pandas aggregation: groupby, custom aggregations, weighted summaries, joining reference data, and avoiding common pitfalls.


🎯 Objectives

By the end you can:

  • Use groupby(...).agg(...) for multi-metric summaries (mean, median, count, SD, percentiles).
  • Compute weighted summaries (e.g., weight-adjusted averages).
  • Join tables (left/inner/right) safely and diagnose join issues.
  • Concatenate multiple datasets and add group labels.
  • Produce clean, analysis-ready summary tables.

We’ll work with hippo_nutrients.csv and a small reference-intake table.

# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

MODULE = '03_data_handling'  # module folder
DATASET = 'hippo_nutrients.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

try:
    if not os.path.exists(BASE_PATH):
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    os.chdir(MODULE_PATH)
    if not os.path.exists(DATASET_PATH):
        raise FileNotFoundError(DATASET_PATH)
    print(f'Dataset found: {DATASET_PATH} 🦛')
except Exception as e:
    print(f'Automatic clone failed: {e}')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} 🦛')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')

Install and import the libraries.

%pip install pandas numpy -q
import pandas as pd, numpy as np
pd.set_option('display.max_columns', 40)
print('Environment ready.')

1) Load and Inspect

We’ll start by loading the data and checking the structure. The columns should include ID, Nutrient, Year, Value, Age, Sex.

df = pd.read_csv(DATASET_PATH)  # i.e. 'data/hippo_nutrients.csv'
print(df.shape)
display(df.head())
display(df.dtypes)
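
Optionally, a quick sanity check that the expected columns are actually there (a minimal guard; adjust the set if your file differs):

# Optional guard: confirm the expected columns are present
expected = {'ID', 'Nutrient', 'Year', 'Value', 'Age', 'Sex'}
missing = expected - set(df.columns)
assert not missing, f'Missing columns: {missing}'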

2) GroupBy Basics: The Split–Apply–Combine Pattern

groupby splits the data into groups, applies a function within each group, and combines the results.

Example: Mean Value by Nutrient and Sex

mean_by_ns = df.groupby(['Nutrient','Sex'], as_index=True)['Value'].mean()
display(mean_by_ns)
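
The result is a Series with a (Nutrient, Sex) MultiIndex. If you prefer one column per sex, .unstack() pivots the inner index level into columns (an optional preview of the reshaping we use in section 7):

# Optional: pivot the Sex level into columns for a compact view
display(mean_by_ns.unstack('Sex'))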

Multiple metrics at once

You can compute several statistics in one go by passing named aggregations to .agg. We’ll compute count, mean, median, std, and the 90th percentile (the 0.9 quantile).

summary_multi = (
    df.groupby(['Nutrient','Sex'])['Value']
      .agg(count='count',
           mean='mean',
           median='median',
           sd='std',
           p90=lambda s: s.quantile(0.9))
)
display(summary_multi)

Optional: Reset index and tidy column names

When presenting results, it’s often clearer to have a flat index and simple column names.

summary_tidy = summary_multi.reset_index()
# Replace any spaces in column names (a no-op here, but useful in general)
summary_tidy.columns = [c.replace(' ', '_') for c in summary_tidy.columns]
display(summary_tidy.head())

3) Weighted Aggregation

Sometimes a plain mean is misleading. Suppose we have body weights per hippo and want a weight-adjusted average nutrient value (purely illustrative). We’ll create a tiny weight table and compute a weighted mean by Nutrient and Sex.

Pedagogical note: In real analyses, choose weights that are defensible (e.g., sampling weights, exposure weights).

# Toy weight data (kg) — in practice this would come from another table
ids = df['ID'].drop_duplicates().head(6).tolist()
weights = pd.DataFrame({
    'ID': ids,
    'BodyWeight_kg': [2000, 2100, 1950, 2050, 1980, 2020][:len(ids)]
})
display(weights.head())

# Left-join weights into df
df_w = df.merge(weights, on='ID', how='left')
display(df_w.head())
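
Only six hippos received a weight above, so the left join leaves NaN in BodyWeight_kg for everyone else. It is worth checking that coverage before averaging (a small diagnostic):

# How many rows lack a weight after the left join?
print(f"Rows without a weight: {df_w['BodyWeight_kg'].isna().sum()} of {len(df_w)}")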

def weighted_mean(x, w):
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    m = np.isfinite(x) & np.isfinite(w)
    if m.sum() == 0 or w[m].sum() == 0:
        return np.nan
    return np.sum(x[m] * w[m]) / np.sum(w[m])
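
A quick sanity check: on complete data, the helper should agree with numpy's built-in np.average (toy numbers for illustration):

# weighted_mean should match np.average when nothing is missing
x_demo = np.array([10.0, 20.0, 30.0])
w_demo = np.array([1.0, 2.0, 3.0])
assert np.isclose(weighted_mean(x_demo, w_demo), np.average(x_demo, weights=w_demo))
print('weighted_mean matches np.average on complete data.')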

wm = (
    df_w.groupby(['Nutrient','Sex'])
        # include_groups=False (pandas >= 2.2) excludes the group keys from
        # the frame passed to apply, silencing the deprecation warning
        .apply(lambda g: weighted_mean(g['Value'], g['BodyWeight_kg']),
               include_groups=False)
        .rename('weighted_mean')
        .reset_index()
)
display(wm.head())

4) Joining Reference Data (Left/Inner/Right)

Let’s define a small reference intake table (illustrative), join it to our nutrient data, and compute % of reference achieved.

We’ll use left joins to keep our rows, and demonstrate inner joins (keep matching rows only) for comparison.

Good practice when joining

  • Inspect key uniqueness (e.g., each Nutrient appears once in the reference table).
  • Check row counts before/after joins (see the validated merge sketch below).
  • Consider anti-joins to find non-matching rows (rows with no partner).
# Reference intakes (illustrative values)
ref = pd.DataFrame({
    'Nutrient': ['Iron','Calcium','Vitamin_D'],
    'ReferenceValue': [15.0, 1200.0, 10.0],  # units aligned with df['Value']
    'Units': ['mg/day','mg/day','µg/day']
})
display(ref)

# Check key uniqueness in ref
assert ref['Nutrient'].is_unique, 'Reference Nutrient keys should be unique'

# Aggregate df first (e.g., mean by Nutrient and Sex), then join reference
mean_by_ns = df.groupby(['Nutrient','Sex'], as_index=False)['Value'].mean()
left_joined = mean_by_ns.merge(ref, on='Nutrient', how='left')
left_joined['Pct_of_Ref'] = 100 * left_joined['Value'] / left_joined['ReferenceValue']
display(left_joined)
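
As promised in the good-practice list: merge's validate argument raises immediately if the join cardinality is not what you claimed, and comparing row counts before and after catches silent duplication (a minimal sketch):

# Guard against row inflation: validate cardinality, compare row counts
before = len(mean_by_ns)
checked = mean_by_ns.merge(ref, on='Nutrient', how='left', validate='many_to_one')
assert len(checked) == before, 'Left join changed the row count; inspect the keys!'
print('Join validated: many-to-one, row count unchanged.')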

# Inner join (only rows with a matching Nutrient in ref)
inner_joined = mean_by_ns.merge(ref, on='Nutrient', how='inner')
display(inner_joined.head())

# Anti-join: which Nutrients in our data are missing from ref?
missing_in_ref = mean_by_ns[~mean_by_ns['Nutrient'].isin(ref['Nutrient'])]
display(missing_in_ref.drop_duplicates(subset=['Nutrient']))
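
An equivalent anti-join uses merge's indicator flag, which labels each row by where its key was found; left_only rows have no partner in ref:

# Anti-join via indicator: rows whose Nutrient has no match in ref
flagged = mean_by_ns.merge(ref[['Nutrient']], on='Nutrient', how='left', indicator=True)
display(flagged.loc[flagged['_merge'] == 'left_only', mean_by_ns.columns])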

5) Concatenation (Stacking Datasets)

If you receive multiple similar files (e.g., two cohorts or two years), you can stack them with pd.concat.

  • Ensure columns align (same names and types).
  • Optionally add a column to track the dataset (source, cohort, year).
# Small synthetic split: first half vs second half
half = len(df)//2
df_a = df.iloc[:half].copy(); df_a['source'] = 'A'
df_b = df.iloc[half:].copy(); df_b['source'] = 'B'

stacked = pd.concat([df_a, df_b], axis=0, ignore_index=True)
display(stacked[['source','Nutrient','Value']].head())
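
An alternative to adding the column by hand: pd.concat's keys argument labels each input frame in the index, and reset_index turns that label back into a column (a small sketch of the same result):

# Alternative: let concat label the sources via keys=
stacked_k = pd.concat([df.iloc[:half], df.iloc[half:]], keys=['A', 'B'], names=['source', None])
display(stacked_k.reset_index(level='source')[['source', 'Nutrient', 'Value']].head())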

# Aggregate by source to compare
by_source = stacked.groupby(['source','Nutrient'])['Value'].agg(mean='mean', n='count').reset_index()
display(by_source.head())

6) Binning and Aggregating (Age Bands)

Sometimes it’s useful to summarise by bands (e.g., age groups). Use pd.cut to create categories and aggregate by them.

if 'Age' in df.columns:
    bins = [0, 20, 30, 40, 200]
    labels = ['<20','20–29','30–39','40+']
    df['AgeBand'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
    by_band = (
        df.groupby(['Nutrient','AgeBand'], observed=True)['Value']
          .agg(n='count', mean='mean', median='median')
          .reset_index()
    )
    display(by_band.head())
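
If you would rather have bands with roughly equal numbers of hippos than fixed cut points, pd.qcut bins by quantiles instead (a sketch; duplicates='drop' merges bins if ages repeat heavily):

# Optional: quantile-based age bands (roughly equal group sizes)
if 'Age' in df.columns:
    df['AgeQuartile'] = pd.qcut(df['Age'], q=4, duplicates='drop')
    display(df.groupby('AgeQuartile', observed=True)['Value'].agg(n='count', mean='mean'))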

7) Presenting Results: Pivot and Confidence Intervals (quick)

You can pivot summaries for a compact view, and optionally add simple standard errors and approximate 95% CIs (for teaching purposes).

g = df.groupby(['Nutrient','Sex'])['Value']

ci_tbl = (
    g.agg(
        mean = 'mean',
        n    = 'count',
        sd   = lambda s: s.std(ddof=1)
    )
    .assign(
        se  = lambda d: (d['sd'] / np.sqrt(d['n'])).where(d['n'] > 0),
        low = lambda d: d['mean'] - 1.96 * d['se'],
        high= lambda d: d['mean'] + 1.96 * d['se']
    )
    .drop(columns='sd')
    .reset_index()
)

display(ci_tbl.head())

# Pivot the group means (one value column) into a Nutrient × Sex table
ci_pivot = ci_tbl.pivot_table(index='Nutrient', columns='Sex', values='mean')
display(ci_pivot)
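
pivot_table also accepts several value columns at once, which keeps the CI bounds alongside the means (the columns become a small MultiIndex):

# Compact view with the CI bounds included
ci_wide = ci_tbl.pivot_table(index='Nutrient', columns='Sex', values=['mean', 'low', 'high'])
display(ci_wide.round(1))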

8) Common Pitfalls (and fixes)

  • Missing values: By default, many aggregations ignore NaNs; check count against the expected sample size (see the sketch after this list).
  • Unequal groups: When comparing means across groups with very different n, consider medians, robust measures, or report CIs.
  • Joins that multiply rows: Many-to-many joins can inflate data. Inspect key uniqueness before joining.
  • Units: Ensure units match before comparing or aggregating across tables (e.g., mg vs µg).
  • MultiIndex outputs: Flatten columns or reset_index() for presentation.
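
To illustrate the first point: 'count' skips NaNs while 'size' counts all rows, so comparing the two flags groups with missing values (a small sketch):

# count (non-missing) vs size (all rows): any gap means NaNs in that group
na_check = df.groupby(['Nutrient', 'Sex'])['Value'].agg(n='count', n_rows='size')
na_check['n_missing'] = na_check['n_rows'] - na_check['n']
display(na_check[na_check['n_missing'] > 0])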

🧪 Exercises

  1. Multi-metric summary
    • For each (Nutrient, Year), compute: n, mean, median, sd, min, max.
    • Present as a tidy table (no MultiIndex).
  2. Weighted summary
    • Create a new weighting scheme (e.g., Age as a simple weight proxy).
    • Compute a weighted mean Value by (Nutrient, Sex) and compare to the unweighted mean.
  3. Join with reference
    • Add a column % of reference for each (Nutrient, Sex) at Year 2025 only.
    • Which nutrient and sex combination is furthest below reference?
  4. Concatenate and label
    • Split the dataset into two random halves, label each half, concat, and compute the mean Value by (source, Nutrient).
    • Are the differences meaningful or just sampling noise?
  5. Age bands
    • Define your own custom age bands and summarise Value by (Nutrient, AgeBand, Sex).
    • Report the top 3 (Nutrient, AgeBand) pairs by mean Value.

✅ Conclusion

You’ve practised core aggregation patterns:

  • groupby with multiple metrics and tidy outputs.
  • Weighted means where appropriate.
  • Safe joins against reference intake tables, with diagnostics.
  • Concatenation and labelling of multiple datasets.
  • Clear presentation (pivot, CIs) for reporting.

👉 Next: move into 04 Data Analysis for visualisation, correlation, and modelling.


Resources
- Pandas GroupBy: https://pandas.pydata.org/docs/user_guide/groupby.html
- Aggregation & Transform: https://pandas.pydata.org/docs/user_guide/groupby.html#aggregation
- Merge/Join: https://pandas.pydata.org/docs/user_guide/merging.html
- Reshaping (pivot/melt): https://pandas.pydata.org/docs/user_guide/reshaping.html