πŸ¦› 3.1 Data Types and Structures β€” Wide vs Long (Tidy)

In this notebook we explore how data can be structured in Python using pandas, with a special focus on the concepts of wide and long (tidy) formats.

Good data organisation is the foundation of effective analysis. Just like a hippo’s lunch tray πŸ¦› needs everything neatly arranged to avoid chaos, your dataset should also be organised so that analyses are clear, reproducible, and easy to extend.

πŸ“– Why This Matters

Most real-world datasets do not arrive perfectly formatted. Some are wide because they come from spreadsheets, others are long because they come from experiments or databases. You need to: - Reshape them for analysis and plotting. - Ensure consistency across different datasets. - Follow the principle of tidy data: each variable is a column, each observation is a row, and each type of observational unit is a table.

# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

MODULE = '03_data_handling'
DATASET = 'hippo_nutrients.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

try:
    if not os.path.exists(BASE_PATH):
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    os.chdir(MODULE_PATH)
    assert os.path.exists(DATASET_PATH)
    print(f'Dataset found: {DATASET_PATH} βœ…')
except Exception as e:
    print(f'Automatic clone failed: {e}')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} βœ…')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')

%pip install pandas numpy -q
import pandas as pd, numpy as np
print('Environment ready.')

Wide vs Long: Two Ways of Storing the Same Data

Imagine you want to record the iron intake of hippos over two years. You could write it in two different ways:

Wide format (spreadsheet-style):

ID Age Sex Iron_2024 Iron_2025
H1 25 F 8.2 8.5

Here, each column holds a different measurement year. This is common in Excel files, but difficult to analyse with pandas.

Long (tidy) format:

ID Age Sex Nutrient Year Value
H1 25 F Iron 2024 8.2
H1 26 F Iron 2025 8.5

Here, each row is one observation: a hippo’s intake of a nutrient in a specific year. This is tidy data and works seamlessly with pandas and most plotting libraries.

πŸ‘‰ Summary: - Wide data: easy for humans to read, harder for computers. - Long/tidy data: the standard for analysis, modelling, and plotting.

df = pd.read_csv('data/hippo_nutrients.csv')
df.head(6)

Our dataset is already long/tidy: it has Nutrient and Value columns, so each row represents a single measurement.

Let’s practise converting it into wide form and then back to long form, to see the difference.

πŸ”„ Long β†’ Wide using pivot

We can make the dataset wider by spreading nutrients into columns. This is useful when we want a table with one row per hippo and year.

df_wide = df.pivot(
    index=['ID','Year','Age','Sex'],
    columns='Nutrient',
    values='Value'
).reset_index()

This produces a DataFrame where nutrients like Iron, Calcium, etc. are each their own column.

df_wide = df.pivot(index=['ID','Year','Age','Sex'], columns='Nutrient', values='Value').reset_index()
df_wide.head(6)

πŸ”„ Wide β†’ Long using melt

If we receive a dataset in wide format (many nutrient columns), we can convert it back to tidy format with melt.

df_long = df_wide.melt(
    id_vars=['ID','Year','Age','Sex'],
    value_vars=[c for c in df_wide.columns if c not in ['ID','Year','Age','Sex']],
    var_name='Nutrient',
    value_name='Value'
)

This stacks all nutrient columns into two columns: Nutrient (the name) and Value (the measurement).

df_long = df_wide.melt(
    id_vars=['ID','Year','Age','Sex'],
    var_name='Nutrient',
    value_name='Value'
)
df_long.head(8)

πŸ§ͺ Exercise 1: Explore Wide Data

  1. Starting from the tidy df, create a wide DataFrame with pivot.
  2. Inspect how many columns of nutrients you now have.
  3. Which format (wide or long) would be easier if you wanted to plot Iron intake over time?

πŸ§ͺ Exercise 2: Work with Long Data

  1. Take df_long and filter it to only include rows where Nutrient == 'Iron'.
  2. Sort by Year and describe the pattern.
  3. Which format (wide or long) would be easier if you wanted to calculate the mean intake for each nutrient?

βœ… Conclusion

In this notebook you: - Learned the difference between wide and long (tidy) data. - Practised reshaping with pivot and melt. - Saw why tidy data is powerful: it makes filtering, grouping, and plotting much simpler.

πŸ‘‰ In the next notebook (3.2), you will learn how to import data from different sources (CSV, Excel, databases) and start applying these tidy principles to new datasets.