# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files
= '03_data_handling'
MODULE = 'hippo_nutrients.csv'
DATASET = '/content/data-analysis-projects'
BASE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
MODULE_PATH = os.path.join('data', DATASET)
DATASET_PATH
try:
if not os.path.exists(BASE_PATH):
!git clone https://github.com/ggkuhnle/data-analysis-projects.git
os.chdir(MODULE_PATH)assert os.path.exists(DATASET_PATH)
print(f'Dataset found: {DATASET_PATH} β
')
except Exception as e:
print(f'Automatic clone failed: {e}')
'data', exist_ok=True)
os.makedirs(= files.upload()
uploaded if DATASET in uploaded:
with open(DATASET_PATH, 'wb') as f:
f.write(uploaded[DATASET])print(f'Successfully uploaded {DATASET} β
')
else:
raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')
%pip install pandas numpy -q
import pandas as pd, numpy as np
print('Environment ready.')
π¦ 3.1 Data Types and Structures β Wide vs Long (Tidy)
In this notebook we explore how data can be structured in Python using pandas, with a special focus on the concepts of wide and long (tidy) formats.
Good data organisation is the foundation of effective analysis. Just like a hippoβs lunch tray π¦ needs everything neatly arranged to avoid chaos, your dataset should also be organised so that analyses are clear, reproducible, and easy to extend.
π Why This Matters
Most real-world datasets do not arrive perfectly formatted. Some are wide because they come from spreadsheets, others are long because they come from experiments or databases. You need to: - Reshape them for analysis and plotting. - Ensure consistency across different datasets. - Follow the principle of tidy data: each variable is a column, each observation is a row, and each type of observational unit is a table.
Wide vs Long: Two Ways of Storing the Same Data
Imagine you want to record the iron intake of hippos over two years. You could write it in two different ways:
Wide format (spreadsheet-style):
ID | Age | Sex | Iron_2024 | Iron_2025 |
---|---|---|---|---|
H1 | 25 | F | 8.2 | 8.5 |
Here, each column holds a different measurement year. This is common in Excel files, but difficult to analyse with pandas.
Long (tidy) format:
ID | Age | Sex | Nutrient | Year | Value |
---|---|---|---|---|---|
H1 | 25 | F | Iron | 2024 | 8.2 |
H1 | 26 | F | Iron | 2025 | 8.5 |
Here, each row is one observation: a hippoβs intake of a nutrient in a specific year. This is tidy data and works seamlessly with pandas and most plotting libraries.
π Summary: - Wide data: easy for humans to read, harder for computers. - Long/tidy data: the standard for analysis, modelling, and plotting.
= pd.read_csv('data/hippo_nutrients.csv')
df 6) df.head(
Our dataset is already long/tidy: it has Nutrient
and Value
columns, so each row represents a single measurement.
Letβs practise converting it into wide form and then back to long form, to see the difference.
π Long β Wide using pivot
We can make the dataset wider by spreading nutrients into columns. This is useful when we want a table with one row per hippo and year.
= df.pivot(
df_wide =['ID','Year','Age','Sex'],
index='Nutrient',
columns='Value'
values ).reset_index()
This produces a DataFrame where nutrients like Iron, Calcium, etc. are each their own column.
= df.pivot(index=['ID','Year','Age','Sex'], columns='Nutrient', values='Value').reset_index()
df_wide 6) df_wide.head(
π Wide β Long using melt
If we receive a dataset in wide format (many nutrient columns), we can convert it back to tidy format with melt
.
= df_wide.melt(
df_long =['ID','Year','Age','Sex'],
id_vars=[c for c in df_wide.columns if c not in ['ID','Year','Age','Sex']],
value_vars='Nutrient',
var_name='Value'
value_name )
This stacks all nutrient columns into two columns: Nutrient
(the name) and Value
(the measurement).
= df_wide.melt(
df_long =['ID','Year','Age','Sex'],
id_vars='Nutrient',
var_name='Value'
value_name
)8) df_long.head(
π§ͺ Exercise 1: Explore Wide Data
- Starting from the tidy
df
, create a wide DataFrame withpivot
. - Inspect how many columns of nutrients you now have.
- Which format (wide or long) would be easier if you wanted to plot Iron intake over time?
π§ͺ Exercise 2: Work with Long Data
- Take
df_long
and filter it to only include rows whereNutrient == 'Iron'
. - Sort by
Year
and describe the pattern. - Which format (wide or long) would be easier if you wanted to calculate the mean intake for each nutrient?
β Conclusion
In this notebook you: - Learned the difference between wide and long (tidy) data. - Practised reshaping with pivot
and melt
. - Saw why tidy data is powerful: it makes filtering, grouping, and plotting much simpler.
π In the next notebook (3.2), you will learn how to import data from different sources (CSV, Excel, databases) and start applying these tidy principles to new datasets.