# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files
MODULE = '03_data_handling'
DATASET = 'hippo_nutrients.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

try:
    if not os.path.exists(BASE_PATH):
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    os.chdir(MODULE_PATH)
    assert os.path.exists(DATASET_PATH)
    print(f'Dataset found: {DATASET_PATH} ✅')
except Exception as e:
    print(f'Automatic clone failed: {e}')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} ✅')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')
%pip install pandas openpyxl sqlalchemy -q
import pandas as pd, numpy as np
print('Environment ready.')
🧹 3.3 Data Cleaning: Principles and Practice
Clean data is the foundation of valid inference. In nutrition research (NDNS-style surveys, trials, food logs), data often arrives messy: missing values, inconsistent formats, duplicates, and subtle logic errors. This notebook gives you a practical, principled workflow for cleaning data in pandas.
🧠 Principles of Data Cleaning (Field-tested)
- Profile first: Look before you leap. Inspect shapes, column types, unique values, and obvious anomalies.
- Make minimal, reversible changes: Keep raw data intact; work on a copy. Store transformations in code, not by hand.
- Be explicit about rules: Write why you drop/impute/convert. Use comments and validation checks.
- Prefer tidy structure: One variable per column, one observation per row, one unit per table.
- Treat missingness as information: The pattern of missing values is a signal. Don't blindly fill everything.
- Standardise representations: Dates, decimals, units, categories (e.g., sex codes) should be uniform.
- Idempotence: Running the cleaning script twice should not change results further (see the sketch after this list).
- Document decisions: Note assumptions (e.g., iron in mg/day), thresholds (plausibility bounds), and imputations used.
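Idempotence is easy to test mechanically: run the cleaning step twice and assert that the second pass is a no-op. A minimal sketch with a hypothetical `clean()` helper and toy data (not this notebook's dataset):

```python
import pandas as pd

def clean(frame: pd.DataFrame) -> pd.DataFrame:
    out = frame.copy()                                # never mutate the input
    out['Sex'] = out['Sex'].str.strip().str.upper()   # normalisation like this is naturally idempotent
    return out

toy = pd.DataFrame({'Sex': [' f', 'F ', 'm']})
once = clean(toy)
twice = clean(once)
pd.testing.assert_frame_equal(once, twice)            # raises AssertionError if the second pass changed anything
print('Cleaning step is idempotent.')
```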
1) Load and Profile
We start by loading the data and building a quick profile: shapes, dtypes, head/tail, and simple summaries. Profiling guides all subsequent cleaning steps.
raw = pd.read_csv('data/hippo_nutrients.csv')
df = raw.copy()  # work on a copy; keep raw intact

print('Shape:', df.shape)
print('\nDtypes:')
print(df.dtypes)
print('\nHead:')
display(df.head(5))

print('\nBasic summary:')
display(df.describe(include='all'))
2) Missing Values: Inspect before you impute
Missing data can be MCAR (completely at random), MAR (at random given observed variables), or MNAR (not at random). Your strategy depends on the context.
Typical steps:
- Identify missingness overall and by variable.
- Check missingness patterns by groups (e.g., by Nutrient, Year, Sex).
- Decide: drop, impute (mean/median/group-wise), or use model-based methods (later modules).
print('Missing counts per column:')
print(df.isna().sum().sort_values(ascending=False))

# Missingness by nutrient (for Value only)
if 'Value' in df.columns and 'Nutrient' in df.columns:
    miss_by_nutrient = (
        df.groupby('Nutrient')['Value']
          .apply(lambda s: s.isna().mean())
          .sort_values(ascending=False)
    )
    print('\nProportion missing in Value by Nutrient:')
    display(miss_by_nutrient.head(10))
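The check above groups by `Nutrient` alone; the pattern step also mentions `Year` and `Sex`. A sketch extending it to two grouping variables at once, assuming those columns exist in this dataset:

```python
# Proportion of missing Value by Nutrient and Sex (sketch; assumes these columns exist)
if {'Value', 'Nutrient', 'Sex'}.issubset(df.columns):
    miss_pattern = (
        df.assign(value_missing=df['Value'].isna().astype(float))
          .pivot_table(index='Nutrient', columns='Sex', values='value_missing', aggfunc='mean')
    )
    print('Proportion missing in Value by Nutrient and Sex:')
    display(miss_pattern.head(10))
```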
Sensible Imputation Patterns (for teaching datasets)
- Numeric measurements (e.g., `Value`): often impute by group (by `Nutrient`, sometimes by `Nutrient` × `Sex`), using the median to reduce outlier influence (a sketch of the `Nutrient` × `Sex` variant follows the code below).
- Categoricals (e.g., `Sex`): leave missing or impute with a new category like `'Unknown'` (document this!).
- IDs/keys: missing IDs are usually fatal; investigate or drop those rows with a note.

Below we demonstrate group-wise median imputation for `Value` and keep everything else untouched (for transparency).
if 'Value' in df.columns and 'Nutrient' in df.columns:
    df['Value_imputed'] = df['Value']              # keep original and add an imputed version
    df['Value_imputed_flag'] = df['Value'].isna()  # track which rows were imputed
    df['Value_imputed'] = (
        df.groupby('Nutrient')['Value_imputed']
          .transform(lambda s: s.fillna(s.median()))
    )
    print('Imputed Value (group-median by Nutrient). Rows imputed:', int(df['Value_imputed_flag'].sum()))
    display(df.loc[df['Value_imputed_flag']].head())
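As promised above, the same `transform` pattern extends to a `Nutrient` × `Sex` key. A sketch, assuming a `Sex` column is present (the `Value_imputed_ns` column name is ours):

```python
# Group-median imputation by Nutrient x Sex (sketch; assumes a Sex column exists)
if {'Value', 'Nutrient', 'Sex'}.issubset(df.columns):
    df['Value_imputed_ns'] = (
        df.groupby(['Nutrient', 'Sex'])['Value']
          .transform(lambda s: s.fillna(s.median()))
    )
    if 'Value_imputed' in df.columns:
        # fall back to the Nutrient-only median where an entire Nutrient x Sex cell is missing
        df['Value_imputed_ns'] = df['Value_imputed_ns'].fillna(df['Value_imputed'])
    print('Still missing after Nutrient x Sex imputation:', int(df['Value_imputed_ns'].isna().sum()))
```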
3) Poorly Formatted Values: Standardise representations
Common issues in nutrition datasets:
- Numbers stored as strings: e.g., `'8.5'`, `' 8.5 '`, `'8,5'` (decimal comma), `'8.5 mg'`.
- Percentages: e.g., `'12%'` instead of `0.12` or `12`.
- Thousands separators: `'1,234.5'` or, in some locales, `'1.234,5'`.
- Category typos/casing: `'female'`, `'FEMALE'`, `'F'` representing the same concept.

Goal: convert to consistent numeric types, consistent units, and canonical categories. Always document unit assumptions.
Example: Decimal commas (German-style) → decimal points
If values arrive as strings with commas (e.g., `'8,5'`), convert them safely using `str.replace` + `pd.to_numeric`.
Note: Do not blanket-replace commas if the column also uses commas as thousands separators. Decide the rule from the data context (a heuristic sketch follows the demo below).
# Demo column with mixed formats (create if not present)
if 'Value_raw' not in df.columns:
    df['Value_raw'] = df['Value'].astype(str)
    # inject a few messy examples for teaching (won't affect original Value)
    ex_idx = df.sample(min(3, len(df)), random_state=1).index
    df.loc[ex_idx, 'Value_raw'] = ['8,5', ' 12,0 ', '7,9 mg'][:len(ex_idx)]

def to_numeric_decomma(s: pd.Series) -> pd.Series:
    s = s.astype(str).str.strip()
    # remove unit strings like 'mg' or 'kcal' (simple demo)
    s = s.str.replace(r'[^0-9,.-]', '', regex=True)
    # convert decimal comma to point (assumes comma == decimal separator here)
    s = s.str.replace(',', '.', regex=False)
    return pd.to_numeric(s, errors='coerce')

df['Value_clean_decimal'] = to_numeric_decomma(df['Value_raw'])
display(df[['Value_raw', 'Value_clean_decimal']].head(6))
Example: Percent strings → numeric fractions or percents
Decide on a convention:
- Store fractions (0–1), or
- Store percentage points (0–100).

Below, we convert `'12%'` → `12.0` (percentage points). Adjust as needed.
demo = pd.Series(['12%', ' 7.5 %', '0.5%', None, 'n/a'])
pct = (demo.astype(str).str.strip()
           .str.replace('%', '', regex=False)
           .str.replace(',', '.', regex=False))
pct_num = pd.to_numeric(pct, errors='coerce')  # percentage points
display(pd.DataFrame({'raw': demo, 'pct_points': pct_num}))
Example: Thousands separators
Remove separators explicitly. Beware of locale mixes.
- UK/US style: `1,234.5` → remove commas → `1234.5`.
- DE style: `1.234,5` → remove dots, replace comma with dot → `1234.5`.

Whenever possible, confirm the intended locale and units from metadata.
us = pd.Series(['1,234.5', '2,000', '10,050.75'])
us_clean = pd.to_numeric(us.str.replace(',', '', regex=False), errors='coerce')

de = pd.Series(['1.234,5', '2.000', '10.050,75'])
de_clean = pd.to_numeric(de.str.replace('.', '', regex=False).str.replace(',', '.', regex=False), errors='coerce')

display(pd.DataFrame({'us_raw': us, 'us_clean': us_clean}))
display(pd.DataFrame({'de_raw': de, 'de_clean': de_clean}))
Example: Canonical categories (e.g., Sex)
Standardise category labels and document the mapping. Keep the original in case you need to audit later.
if 'Sex' in df.columns:
    df['Sex_raw'] = df['Sex'].astype(str)
    sex_map = {
        'f': 'F', 'female': 'F', 'FEMALE': 'F', 'F': 'F',
        'm': 'M', 'male': 'M', 'MALE': 'M', 'M': 'M'
    }
    df['Sex_std'] = df['Sex_raw'].str.strip().str.lower().map(sex_map).fillna('Unknown')
    display(df[['Sex_raw', 'Sex_std']].drop_duplicates().head(10))
4) Duplicates: Exact, key-based, and near-duplicates
Duplicates can arise from repeated data entry, merges, or concatenations.

Types
- Exact duplicates: entire rows identical.
- Key-based duplicates: same (`ID`, `Nutrient`, `Year`) but different values.
- Near-duplicates: slightly different strings or values (e.g., whitespace, casing, rounding); these often require domain judgement (a normalisation sketch follows the resolution code below).

Strategy:
1) Count exact duplicates.
2) Define a key and check for multiple rows per key.
3) Decide how to resolve (e.g., prefer non-missing, average within key, keep the latest record, or flag for manual review).
# 4.1 Exact duplicates (full-row)
exact_dupes = df.duplicated(keep=False)
print('Exact duplicate rows (including originals):', int(exact_dupes.sum()))
display(df[exact_dupes].head(6))

# If you decide to drop exact duplicates:
df_no_exact_dupes = df.drop_duplicates().copy()
# 4.2 Key-based duplicates
key = ['ID', 'Nutrient', 'Year']
if all(k in df.columns for k in key):
    counts = df.groupby(key).size().reset_index(name='n')
    multi = counts[counts['n'] > 1]
    print('Keys with >1 row:', len(multi))
    display(multi.head(10))

    # Inspect conflicts for a sample key (if present)
    if not multi.empty:
        sample_key = multi.iloc[0][key].to_dict()
        print('Example conflicting key:', sample_key)
        q = (df['ID'].eq(sample_key['ID']) &
             df['Nutrient'].eq(sample_key['Nutrient']) &
             df['Year'].eq(sample_key['Year']))
        display(df.loc[q])
# Example resolution: within each key, keep the first row with a non-missing Value
# (Alternatively: take the mean within key, or keep the most recent record if you have timestamps.)
def resolve_group(g):
    # score rows by whether Value is present; idxmax() returns the first row with the highest score
    score = g['Value'].notna().astype(int)
    return g.loc[[score.idxmax()]]

df_resolved = (
    df.groupby(key, as_index=False, group_keys=False)
      .apply(resolve_group)
)
print('After key-based resolution, shape:', df_resolved.shape)
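Near-duplicates (the third type above) are not caught by either check, because the rows differ superficially. A common first pass is to normalise the text columns (trim whitespace, casefold) on a copy and re-run the duplicate check; rows that only become duplicates after normalisation are candidates for manual review. A sketch:

```python
# Near-duplicate screen: normalise text columns on a copy, then re-check for duplicates
norm = df.copy()
for col in norm.select_dtypes(include='object').columns:
    norm[col] = norm[col].astype(str).str.strip().str.casefold()

near_dupes = norm.duplicated(keep=False) & ~df.duplicated(keep=False)
print('Rows that only duplicate after normalisation:', int(near_dupes.sum()))
display(df[near_dupes].head(6))
```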
5) Validate with Simple Rules
Codify assumptions as checks, and fail loudly when rules are broken:
- Type checks: numeric columns should be numeric.
- Range checks: plausible ranges (e.g., iron intake in mg/day is rarely negative or above 10,000).
- Membership checks: categories must come from known sets.
These checks turn silent data issues into visible test failures.
problems = []

# Type checks
if 'Value_imputed' in df.columns:
    if not np.issubdtype(df['Value_imputed'].dtype, np.number):
        problems.append('Value_imputed is not numeric.')

# Range checks (example plausible bounds, adjust to your domain)
if 'Value_imputed' in df.columns:
    out_of_range = ~df['Value_imputed'].between(0, 10000)  # generous upper bound
    if out_of_range.any():
        problems.append(f"{int(out_of_range.sum())} rows have Value_imputed outside [0, 10000].")

# Membership checks
if 'Sex_std' in df.columns:
    allowed = {'M', 'F', 'Unknown'}
    bad = ~df['Sex_std'].isin(allowed)
    if bad.any():
        problems.append(f"{int(bad.sum())} rows have unexpected Sex_std labels.")

print('Validation summary:')
if problems:
    for p in problems:
        print(' -', p)
else:
    print('All checks passed ✅')
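The cell above only prints a summary. To fail loudly in the sense described earlier, you can raise once the summary has been printed; a minimal sketch (zero tolerance is an assumption, adjust as needed):

```python
# Stop the notebook if any validation rule was violated (tolerance of zero violations is an assumption)
if problems:
    raise ValueError(f'{len(problems)} validation rule(s) failed: ' + '; '.join(problems))
```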
6) Save the Cleaned Dataset (Optional)
Export a clean, analysis-ready table. Keep both raw and clean versions under version control, and record cleaning decisions in a `README` or the notebook header.
# Choose the resolved DataFrame if you performed key-based resolution; otherwise use df
clean = df.copy()
clean_path = 'data/hippo_nutrients_clean.csv'
clean.to_csv(clean_path, index=False)
print('Saved:', clean_path)
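One lightweight way to record those decisions is a small plain-text log written next to the exported file; a sketch (the file name and wording are placeholders, not part of the original workflow):

```python
# Write a brief cleaning log alongside the exported CSV (file name is a placeholder)
n_imputed = int(clean['Value_imputed_flag'].sum()) if 'Value_imputed_flag' in clean.columns else 0
log_lines = [
    'Source file: data/hippo_nutrients.csv',
    f'Rows in cleaned output: {len(clean)}',
    f'Value imputed with group median by Nutrient: {n_imputed} rows',
    'Sex standardised to {M, F, Unknown}; decimal commas converted to decimal points.',
]
with open('data/CLEANING_NOTES.txt', 'w') as f:
    f.write('\n'.join(log_lines) + '\n')
print('Saved: data/CLEANING_NOTES.txt')
```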
🧪 Exercises (Guided)
- Missingness by subgroup
  - Compute the proportion of missing `Value` by `(Nutrient, Sex)`.
  - Impute `Value` using the median by `Nutrient` × `Sex` and track imputed rows with a boolean flag.
- Locale & units clean-up
  - Create a column `Value_text` by copying `Value` as strings and inject a few messy entries (with `'mg'`, spaces, decimal commas).
  - Write a `clean_numeric()` function that removes units, trims whitespace, converts decimal commas, and returns numeric values.
  - Compare distributions before and after cleaning.
- Duplicates resolution policy
  - Identify keys with multiple rows.
  - Implement two resolution policies: (a) prefer non-missing `Value`, (b) take the mean within key.
  - Compare results: do the two policies change downstream summaries?
- Validation rules
  - Add a rule: `Year` must be between 1990 and 2035.
  - Add a rule: `Nutrient` must belong to a specific allowed set found in `df['Nutrient'].unique()` (or a curated list).
  - Make the notebook fail loudly (raise an Exception) if violations exceed a small tolerance.
✅ Conclusion & Next Steps
You've implemented a principled cleaning pipeline:
- Profiled the data.
- Handled missingness with transparent, group-wise imputation.
- Standardised messy numeric and categorical fields (decimal commas, units, percentages, Sex labels).
- Detected exact and key-based duplicates and demonstrated a resolution policy.
- Encoded assumptions as validation rules.

Next: 3.4 Data Transformation – reshaping, deriving variables, and feature engineering for analysis.
Resources
- pandas: Missing data – https://pandas.pydata.org/docs/user_guide/missing_data.html
- pandas: Working with text data – https://pandas.pydata.org/docs/user_guide/text.html
- pandas: Duplicate labels/rows – https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
- On assumptions & validation in data analysis: keep a crisp `README` alongside your cleaned outputs.