🧹 3.3 Data Cleaning – Principles and Practice

Clean data is the foundation of valid inference. In nutrition research (NDNS-style surveys, trials, food logs), data often arrives messy: missing values, inconsistent formats, duplicates, and subtle logic errors. This notebook gives you a practical, principled workflow for cleaning data in pandas.

🧭 Principles of Data Cleaning (Field-tested)

  1. Profile first: Look before you leap. Inspect shapes, column types, unique values, and obvious anomalies.
  2. Make minimal, reversible changes: Keep raw data intact; work on a copy. Store transformations in code, not by hand.
  3. Be explicit about rules: Write why you drop/impute/convert. Use comments and validation checks.
  4. Prefer tidy structure: One variable per column, one observation per row, one unit per table.
  5. Treat missingness as information: The pattern of missing values is a signal. Don't blindly fill everything.
  6. Standardise representations: Dates, decimals, units, categories (e.g., sex codes) should be uniform.
  7. Idempotence: Running the cleaning script twice should not change results further (see the sketch after this list).
  8. Document decisions: Note assumptions (e.g., iron in mg/day), thresholds (plausibility bounds), and imputations used.
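
Idempotence is easy to test: apply the cleaning function twice and confirm the second pass changes nothing. A conceptual sketch, assuming a hypothetical clean() function that bundles your cleaning steps (it is not defined in this notebook):

import pandas as pd

def check_idempotent(clean, raw: pd.DataFrame) -> bool:
    """Return True if applying `clean` twice gives the same result as applying it once."""
    once = clean(raw.copy())
    twice = clean(once.copy())
    return once.equals(twice)

# Usage (hypothetical): assert check_idempotent(clean, raw)
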
# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

MODULE = '03_data_handling'
DATASET = 'hippo_nutrients.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

try:
    if not os.path.exists(BASE_PATH):
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    os.chdir(MODULE_PATH)
    assert os.path.exists(DATASET_PATH)
    print(f'Dataset found: {DATASET_PATH} ✅')
except Exception as e:
    print(f'Automatic clone failed: {e}')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} ✅')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')

%pip install pandas openpyxl sqlalchemy -q
import pandas as pd
import numpy as np
print('Environment ready.')

1) Load and Profile

We start by loading the data and building a quick profile: shapes, dtypes, head/tail, and simple summaries. Profiling guides all subsequent cleaning steps.

raw = pd.read_csv('data/hippo_nutrients.csv')
df = raw.copy()  # work on a copy; keep raw intact

print('Shape:', df.shape)
print('\nDtypes:')
print(df.dtypes)

print('\nHead:')
display(df.head(5))

print('\nBasic summary:')
display(df.describe(include='all'))
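
Principle 1 also mentions unique values: a quick scan of per-column cardinality helps spot ID-like columns and suspicious category labels. A small addition to the profile:

print('\nUnique values per column:')
print(df.nunique().sort_values())

# Peek at the labels of low-cardinality (likely categorical) columns
for col in df.columns:
    if df[col].nunique() <= 10:
        print(f'{col}:', df[col].dropna().unique())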

2) Missing Values – Inspect before you impute

Missing data can be MCAR (missing completely at random), MAR (missing at random, i.e. explainable by observed variables), or MNAR (missing not at random). Your strategy depends on the context.

Typical steps:

  • Identify missingness overall and by variable.
  • Check missingness patterns by groups (e.g., by Nutrient, Year, Sex).
  • Decide: drop, impute (mean/median/group-wise), or model-based methods (later modules).

print('Missing counts per column:')
print(df.isna().sum().sort_values(ascending=False))

# Missingness by nutrient (for Value only)
if 'Value' in df.columns and 'Nutrient' in df.columns:
    miss_by_nutrient = (
        df.groupby('Nutrient')['Value']
          .apply(lambda s: s.isna().mean())
          .sort_values(ascending=False)
    )
    print('\nProportion missing in Value by Nutrient:')
    display(miss_by_nutrient.head(10))
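
Beyond counts, a quick heuristic for distinguishing MCAR from MAR is to compare observed variables between rows with and without a missing Value; clear differences suggest the missingness is related to those variables. A minimal sketch (assuming the Year and Sex columns exist in this dataset):

if 'Value' in df.columns:
    miss = df['Value'].isna()
    if 'Year' in df.columns:
        print('Mean Year by missingness of Value:')
        print(df.groupby(miss)['Year'].mean())
    if 'Sex' in df.columns:
        print('\nSex distribution by missingness of Value:')
        print(pd.crosstab(miss, df['Sex'], normalize='index').round(2))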

Sensible Imputation Patterns (for teaching datasets)

  • Numeric measurements (e.g., Value): often impute by group (by Nutrient, sometimes by Nutrient×Sex), using median to reduce outlier influence.
  • Categoricals (e.g., Sex): leave missing or impute with a new category like 'Unknown' (document this!).
  • IDs/keys: missing IDs are usually fatal – investigate or drop those rows with a note.

Below we demonstrate group-wise median imputation for Value and keep everything else untouched (for transparency).

if 'Value' in df.columns and 'Nutrient' in df.columns:
    df['Value_imputed'] = df['Value']  # keep original and add an imputed version
    df['Value_imputed_flag'] = df['Value'].isna()  # track which were imputed
    df['Value_imputed'] = (
        df.groupby('Nutrient')['Value_imputed']
          .transform(lambda s: s.fillna(s.median()))
    )

    print('Imputed Value (group-median by Nutrient). Rows imputed:', int(df['Value_imputed_flag'].sum()))
    display(df.loc[df['Value_imputed_flag']].head())
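
The other two patterns from the list above can be handled just as transparently: give missing categoricals an explicit 'Unknown' label, and flag or drop rows with missing IDs. A minimal sketch (the Sex_filled column name is illustrative):

if 'Sex' in df.columns:
    df['Sex_filled'] = df['Sex'].fillna('Unknown')  # explicit category; document this choice

if 'ID' in df.columns:
    missing_id = df['ID'].isna()
    print('Rows with missing ID:', int(missing_id.sum()))
    # Missing keys are usually fatal: investigate first, then drop with a note
    df = df.loc[~missing_id].copy()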

3) Poorly Formatted Values – Standardise representations

Common issues in nutrition datasets:

  • Numbers stored as strings: e.g., '8.5', ' 8.5 ', '8,5' (decimal comma), '8.5 mg'.
  • Percentages: e.g., '12%' instead of 0.12 or 12.
  • Thousands separators: '1,234.5' or, in some locales, '1.234,5'.
  • Category typos/casing: 'female', 'FEMALE', 'F' representing the same concept.

Goal: convert to consistent numeric types, consistent units, and canonical categories. Always document unit assumptions.

Example: Decimal commas (German-style) → decimal points

If values arrive as strings with commas (e.g., '8,5'), convert them safely using str.replace + pd.to_numeric.

Note: Do not blanket-replace commas if the column also uses commas as thousands separators. Decide the rule from the data context.

# Demo column with mixed formats (create if not present)
if 'Value_raw' not in df.columns:
    df['Value_raw'] = df['Value'].astype(str)
    # inject a few messy examples for teaching (won't affect original Value)
    ex_idx = df.sample(min(3, len(df)), random_state=1).index
    df.loc[ex_idx, 'Value_raw'] = ['8,5', ' 12,0 ', '7,9 mg'][:len(ex_idx)]

def to_numeric_decomma(s: pd.Series) -> pd.Series:
    s = s.astype(str).str.strip()
    # remove unit strings like 'mg' or 'kcal' (simple demo)
    s = s.str.replace(r'[^0-9,.-]', '', regex=True)
    # convert decimal comma to point (assumes comma == decimal separator here)
    s = s.str.replace(',', '.', regex=False)
    return pd.to_numeric(s, errors='coerce')

df['Value_clean_decimal'] = to_numeric_decomma(df['Value_raw'])
display(df[['Value_raw','Value_clean_decimal']].head(6))

Example: Percent strings → numeric fractions or percents

Decide on a convention:

  • Store fractions (0–1), or
  • Store percentage points (0–100).

Below, we convert '12%' → 12.0 (percentage points). Adjust as needed.

demo = pd.Series(['12%', ' 7.5 %', '0.5%', None, 'n/a'])
pct = (demo.astype(str).str.strip()
              .str.replace('%','', regex=False)
              .str.replace(',','.', regex=False))
pct_num = pd.to_numeric(pct, errors='coerce')  # percentage points
display(pd.DataFrame({'raw': demo, 'pct_points': pct_num}))

Example: Thousands separators

Remove separators explicitly. Beware of locale mixes.

  • UK/US style: 1,234.5 → remove commas → 1234.5.
  • DE style: 1.234,5 → remove dots, replace comma with dot → 1234.5.

Whenever possible, confirm the intended locale and units from metadata.

us = pd.Series(['1,234.5', '2,000', '10,050.75'])
us_clean = pd.to_numeric(us.str.replace(',', '', regex=False), errors='coerce')

de = pd.Series(['1.234,5', '2.000', '10.050,75'])
de_clean = pd.to_numeric(de.str.replace('.', '', regex=False).str.replace(',', '.', regex=False), errors='coerce')

display(pd.DataFrame({'us_raw': us, 'us_clean': us_clean}))
display(pd.DataFrame({'de_raw': de, 'de_clean': de_clean}))

Example: Canonical categories (e.g., Sex)

Standardise category labels and document the mapping. Keep original in case you need to audit later.

if 'Sex' in df.columns:
    df['Sex_raw'] = df['Sex'].astype(str)
    # Keys are lower-case only, because labels are lower-cased before mapping
    sex_map = {
        'f': 'F', 'female': 'F',
        'm': 'M', 'male': 'M',
    }
    df['Sex_std'] = df['Sex_raw'].str.strip().str.lower().map(sex_map).fillna('Unknown')
    display(df[['Sex_raw','Sex_std']].drop_duplicates().head(10))
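
To audit the mapping, cross-tabulate the raw and standardised labels; every raw spelling should land in exactly one canonical category:

if {'Sex_raw', 'Sex_std'}.issubset(df.columns):
    display(pd.crosstab(df['Sex_raw'], df['Sex_std'], dropna=False))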

4) Duplicates – Exact, key-based, and near-duplicates

Duplicates can arise from repeated data entry, merges, or concatenations.

Types

  • Exact duplicates: Entire rows identical.
  • Key-based duplicates: Same (ID,Nutrient,Year) but different values.
  • Near-duplicates: Slightly different strings or values (e.g., whitespace, casing, rounding). Often require domain judgment.

Strategy:

  1. Count exact duplicates.
  2. Define a key and check for multiple rows per key.
  3. Decide how to resolve (e.g., prefer non-missing, average within key, keep the latest record, or flag for manual review).

# 4.1 Exact duplicates (full-row)
exact_dupes = df.duplicated(keep=False)
print('Exact duplicate rows (including originals):', int(exact_dupes.sum()))
display(df[exact_dupes].head(6))

# If you decide to drop exact duplicates:
df_no_exact_dupes = df.drop_duplicates().copy()

# 4.2 Key-based duplicates
key = ['ID','Nutrient','Year']
if all(k in df.columns for k in key):
    counts = df.groupby(key).size().reset_index(name='n')
    multi = counts[counts['n'] > 1]
    print('Keys with >1 row:', len(multi))
    display(multi.head(10))

    # Inspect conflicts for a sample key (if present)
    if not multi.empty:
        sample_key = multi.iloc[0][key].to_dict()
        print('Example conflicting key:', sample_key)
        q = (df['ID'].eq(sample_key['ID']) &
             df['Nutrient'].eq(sample_key['Nutrient']) &
             df['Year'].eq(sample_key['Year']))
        display(df.loc[q])

    # Example resolution: within each key, keep the row with the least missingness in Value/Value_imputed
    # (Alternatively: take mean, prefer non-missing, or most recent record if you have timestamps.)
    def resolve_group(g):
        # Prefer rows with a non-missing Value; among ties, prefer rows that were not imputed
        g = g.copy()
        score = g['Value'].notna().astype(int)
        if 'Value_imputed_flag' in g.columns:
            score = score + (~g['Value_imputed_flag']).astype(int)
        # keep the first row with the highest score
        return g.loc[score.sort_values(ascending=False).index[0:1]]

    df_resolved = (
        df.groupby(key, as_index=False, group_keys=False)
          .apply(resolve_group)
    )
    print('After key-based resolution, shape:', df_resolved.shape)
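
Near-duplicates (the third type above) usually only surface after light normalisation, such as trimming whitespace and unifying case, before re-checking for duplicates. A minimal sketch that normalises a temporary copy so the original values stay untouched:

# 4.3 Near-duplicates: normalise text columns on a copy, then re-check
norm = df.copy()
for col in norm.select_dtypes(include='object').columns:
    norm[col] = norm[col].astype(str).str.strip().str.lower()

near_dupes = norm.duplicated(keep=False) & ~df.duplicated(keep=False)
print('Rows that become duplicates only after normalisation:', int(near_dupes.sum()))
display(df.loc[near_dupes].head(6))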

5) Validate with Simple Rules

Codify assumptions as checks, and fail loudly when rules are broken:

  • Type checks: numeric columns should be numeric.
  • Range checks: plausible ranges (e.g., iron intake in mg/day is rarely negative or above 10,000).
  • Membership checks: categories come from known sets.

These checks turn silent data issues into visible test failures.

problems = []

# Type checks
if 'Value_imputed' in df.columns:
    if not np.issubdtype(df['Value_imputed'].dtype, np.number):
        problems.append('Value_imputed is not numeric.')

# Range checks (example plausible bounds, adjust to your domain)
if 'Value_imputed' in df.columns:
    out_of_range = ~df['Value_imputed'].between(0, 10000)  # generous upper bound
    if out_of_range.any():
        problems.append(f"{int(out_of_range.sum())} rows have Value_imputed outside [0, 10000].")

# Membership checks
if 'Sex_std' in df.columns:
    allowed = {'M','F','Unknown'}
    bad = ~df['Sex_std'].isin(allowed)
    if bad.any():
        problems.append(f"{int(bad.sum())} rows have unexpected Sex_std labels.")

print('Validation summary:')
if problems:
    for p in problems:
        print(' -', p)
else:
    print('All checks passed ✅')
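
To make the notebook fail loudly rather than merely report, raise once the summary has been printed. A minimal sketch with a zero-violation tolerance (set STRICT = True to enforce it):

STRICT = False  # flip to True to stop execution on any validation failure
if STRICT and problems:
    raise ValueError(f'{len(problems)} validation rule(s) violated: {problems}')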

6) Save the Cleaned Dataset (Optional)

Export a clean, analysis-ready table. Keep both raw and clean versions under version control, and record cleaning decisions in README or the notebook header.

# Choose the resolved DataFrame if you performed key-based resolution; otherwise use df
clean = df.copy()
clean_path = 'data/hippo_nutrients_clean.csv'
clean.to_csv(clean_path, index=False)
print('Saved:', clean_path)
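
Alongside the CSV, it helps to write the key cleaning decisions to a small plain-text note so the assumptions travel with the data. A minimal sketch (file name and wording are illustrative):

notes = [
    'Cleaning decisions for hippo_nutrients_clean.csv',
    '- Value imputed with group-median by Nutrient (imputed rows flagged in Value_imputed_flag).',
    '- Sex standardised to M/F/Unknown (original kept in Sex_raw).',
    '- Key-based duplicate conflicts resolved by preferring rows with non-missing Value.',
]
with open('data/CLEANING_NOTES.txt', 'w') as f:
    f.write('\n'.join(notes) + '\n')
print('Saved: data/CLEANING_NOTES.txt')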

🧪 Exercises (Guided)

  1. Missingness by subgroup
    • Compute the proportion of missing Value by (Nutrient, Sex).
    • Impute Value using median by Nutrient×Sex and track imputed rows with a boolean flag.
  2. Locale & units clean-up
    • Create a column Value_text by copying Value as strings and inject a few messy entries (with 'mg', spaces, decimal commas).
    • Write a clean_numeric() function that removes units, trims, converts decimal commas, and returns numeric.
    • Compare distributions before/after cleaning.
  3. Duplicates resolution policy
    • Identify keys with multiple rows.
    • Implement two resolution policies: (a) prefer non-missing Value, (b) take the mean within key.
    • Compare results: do any policies change downstream summaries?
  4. Validation rules
    • Add a rule: Year must be between 1990 and 2035.
    • Add a rule: Nutrient must belong to a specific allowed set found in df['Nutrient'].unique() (or a curated list).
    • Make the notebook fail loudly (raise an Exception) if violations exceed a small tolerance.

✅ Conclusion & Next Steps

You've implemented a principled cleaning pipeline:

  • Profiled the data.
  • Handled missingness with transparent, group-wise imputation.
  • Standardised messy numeric and categorical fields (decimal commas, units, percentages, Sex labels).
  • Detected exact and key-based duplicates and demonstrated a resolution policy.
  • Encoded assumptions as validation rules.

👉 Next: 3.4 Data Transformation – reshaping, deriving variables, and feature engineering for analysis.


Resources
- pandas: Missing data – https://pandas.pydata.org/docs/user_guide/missing_data.html
- pandas: Working with text data – https://pandas.pydata.org/docs/user_guide/text.html
- pandas: Duplicate labels/rows – https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
- On assumptions & validation in data analysis: keep a crisp README alongside your cleaned outputs.