🦛 3.1 Data Types and Structures — Wide vs Long (Tidy)

In this notebook we explore how data can be structured in Python using pandas, with a special focus on the concepts of wide and long (tidy) formats.

Good data organisation is the foundation of effective analysis. Just like a hippo’s lunch tray 🦛 needs everything neatly arranged to avoid chaos, your dataset should also be organised so that analyses are clear, reproducible, and easy to extend.

📖 Why This Matters

Most real-world datasets do not arrive perfectly formatted. Some are wide because they come from spreadsheets, others are long because they come from experiments or databases. You need to: - Reshape them for analysis and plotting. - Ensure consistency across different datasets. - Follow the principle of tidy data: each variable is a column, each observation is a row, and each type of observational unit is a table.

# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

MODULE = '03_data_handling'
DATASET = 'hippo_nutrients.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

try:
    if not os.path.exists(BASE_PATH):
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    os.chdir(MODULE_PATH)
    assert os.path.exists(DATASET_PATH)
    print(f'Dataset found: {DATASET_PATH} ✅')
except Exception as e:
    print(f'Automatic clone failed: {e}')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} ✅')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')

%pip install pandas numpy -q
import pandas as pd, numpy as np
print('Environment ready.')

Wide vs Long: Two Ways of Storing the Same Data

Imagine you want to record the iron intake of hippos over two years. You could write it in two different ways:

Wide format (spreadsheet-style):

ID	Age	Sex	Iron_2024	Iron_2025
H1	25	F	8.2	8.5

Here, each column holds a different measurement year. This is common in Excel files, but difficult to analyse with pandas.

Long (tidy) format:

ID	Age	Sex	Nutrient	Year	Value
H1	25	F	Iron	2024	8.2
H1	26	F	Iron	2025	8.5

Here, each row is one observation: a hippo’s intake of a nutrient in a specific year. This is tidy data and works seamlessly with pandas and most plotting libraries.

👉 Summary: - Wide data: easy for humans to read, harder for computers. - Long/tidy data: the standard for analysis, modelling, and plotting.

df = pd.read_csv('data/hippo_nutrients.csv')
df.head(6)

Our dataset is already long/tidy: it has Nutrient and Value columns, so each row represents a single measurement.

Let’s practise converting it into wide form and then back to long form, to see the difference.

🔄 Long → Wide using `pivot`

We can make the dataset wider by spreading nutrients into columns. This is useful when we want a table with one row per hippo and year.

df_wide = df.pivot(
    index=['ID','Year','Age','Sex'],
    columns='Nutrient',
    values='Value'
).reset_index()

This produces a DataFrame where nutrients like Iron, Calcium, etc. are each their own column.

df_wide = df.pivot(index=['ID','Year','Age','Sex'], columns='Nutrient', values='Value').reset_index()
df_wide.head(6)

🔄 Wide → Long using `melt`

If we receive a dataset in wide format (many nutrient columns), we can convert it back to tidy format with melt.

df_long = df_wide.melt(
    id_vars=['ID','Year','Age','Sex'],
    value_vars=[c for c in df_wide.columns if c not in ['ID','Year','Age','Sex']],
    var_name='Nutrient',
    value_name='Value'
)

This stacks all nutrient columns into two columns: Nutrient (the name) and Value (the measurement).

df_long = df_wide.melt(
    id_vars=['ID','Year','Age','Sex'],
    var_name='Nutrient',
    value_name='Value'
)
df_long.head(8)

🧪 Exercise 1: Explore Wide Data

Starting from the tidy df, create a wide DataFrame with pivot.
Inspect how many columns of nutrients you now have.
Which format (wide or long) would be easier if you wanted to plot Iron intake over time?

🧪 Exercise 2: Work with Long Data

Take df_long and filter it to only include rows where Nutrient == 'Iron'.
Sort by Year and describe the pattern.
Which format (wide or long) would be easier if you wanted to calculate the mean intake for each nutrient?

✅ Conclusion

In this notebook you: - Learned the difference between wide and long (tidy) data. - Practised reshaping with pivot and melt. - Saw why tidy data is powerful: it makes filtering, grouping, and plotting much simpler.

👉 In the next notebook (3.2), you will learn how to import data from different sources (CSV, Excel, databases) and start applying these tidy principles to new datasets.