📊 2.4 Data Structures

This notebook introduces Python data structures (lists, dictionaries, NumPy arrays, and pandas DataFrames) for organising data.

Objectives:
- Use lists and dictionaries to store nutrient data.
- Introduce pandas DataFrames for tabular data.
- Manipulate data structures for analysis.

Context: Data structures are the backbone of nutrition datasets, like NDNS or hippo diet logs.

Fun Fact

A DataFrame is like a hippo’s pantry—everything neatly organised for quick access! 🦛

📦 What Are Data Structures?

When we work with data (nutrient intakes, participant IDs, or sensory scores), we need a way to store, organise, and manipulate that information in our program. This is where data structures come in.

A data structure is a container that lets you group and organise multiple pieces of data, often in a way that reflects how they’re related. Just like you might sort food in a fridge by type (dairy, vegetables, snacks), you use different data structures in Python to group related values together and work with them efficiently.

Different structures are better suited for different tasks:

A list is like a shopping list—just an ordered series of items.
A dictionary is like a nutrition label—each value is labelled.
A NumPy array is like a spreadsheet with numbers—fast, regular, and good for maths.
A pandas DataFrame is like an Excel sheet or R data.frame—columns with headers, rows with values, and powerful tools for analysis.
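
As a rough illustration of the four analogies above, here are the same three iron intakes held in each structure. The variable names and values are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical iron intakes (mg/day) for three hippos
iron_list = [8.2, 7.9, 8.5]                      # list: ordered values
iron_dict = {"H1": 8.2, "H2": 7.9, "H3": 8.5}    # dictionary: labelled values
iron_array = np.array(iron_list)                 # NumPy array: numbers, ready for maths
iron_df = pd.DataFrame({"ID": ["H1", "H2", "H3"], "Iron": iron_list})  # DataFrame: a table

print(iron_list[0])        # access by position
print(iron_dict["H2"])     # access by label
print(iron_array.mean())   # built-in numerical methods
print(iron_df)             # rows and named columns
```

Each structure is covered in more detail in its own section of this notebook.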

Understanding data structures helps you:

Organise your code clearly
Avoid repeating yourself
Analyse large datasets effectively (e.g. NDNS or food diaries)
Prepare your data for visualisation, statistics, or modelling

In this notebook, we’ll explore the main Python data structures used in nutrition research and learn how to apply them in practice.

# Setup for Google Colab: Ensure environment is ready
# Note: This module (Programming Basics) does not require datasets
print('No dataset required for this notebook 🦛')
# Install required packages for this notebook
%pip install pandas
%pip install numpy

# ✅ Import necessary libraries
import pandas as pd
import numpy as np

print('Python environment ready.')

🧺 Lists in Python

A list is a built-in Python data structure used to store a collection of items.

Think of it like a hippo’s lunchbox 🦛—it might contain apples, carrots, and sweet potatoes all in one tidy container.

✅ Why Lists Matter

Store multiple related values in a single variable
Access, modify, or remove individual items
Iterate over them using loops
Useful for nutrition data, e.g. daily intakes, meal components

🛠️ Creating and Accessing Lists

# A list of daily calcium intakes (mg) for different hippos
calcium_intakes = [1200, 1150, 1250]

# Accessing elements (index starts at 0)
print(calcium_intakes[0])   # First element
print(calcium_intakes[-1])  # Last element
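
Beyond single-index access, lists can be sliced and looped over. The sketch below redefines the same list so it runs on its own; the "Day" label is only an illustrative choice.

```python
calcium_intakes = [1200, 1150, 1250]

# A slice returns a new list; [:2] takes the items at index 0 and 1
first_two = calcium_intakes[:2]
print(first_two)

# enumerate() yields (count, value) pairs; start=1 makes the count start at 1
for day, intake in enumerate(calcium_intakes, start=1):
    print(f"Day {day}: {intake} mg")
```

Slicing does not change the original list, so calcium_intakes still holds all three values afterwards.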

🧮 List Operations

# Built-in operations
print(len(calcium_intakes))       # Number of items
print(sum(calcium_intakes))       # Total calcium intake
print(max(calcium_intakes))       # Highest value

# Add and update
calcium_intakes.append(1180)      # Add new intake to the end
calcium_intakes[1] = 1190         # Update second value

# Remove
calcium_intakes.remove(1250)      # Remove specific value
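
One detail worth knowing about remove(): it deletes only the first matching value and raises a ValueError if the value is not in the list. A small sketch with a hypothetical list called intakes:

```python
intakes = [1200, 1250, 1180, 1250]

intakes.remove(1250)      # removes only the first 1250
print(intakes)            # [1200, 1180, 1250]

# Check membership first to avoid a ValueError for a missing value
if 999 in intakes:
    intakes.remove(999)

last = intakes.pop()      # pop() removes and returns the last item
print(last, intakes)
```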

💡 Advanced Tip: Lists Can Store Anything

Lists can hold different types of values—even other lists!

mixed_list = ['apple', 3.5, True, [1, 2, 3]]
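
To reach a value inside a nested list, chain the indices. A short sketch using the same mixed_list:

```python
mixed_list = ['apple', 3.5, True, [1, 2, 3]]

inner = mixed_list[3]        # the nested list [1, 2, 3]
print(inner[0])              # first item of the nested list
print(mixed_list[3][2])      # same idea in one step: last item of the nested list

# Inspect the type of each item
type_names = [type(item).__name__ for item in mixed_list]
print(type_names)
```

In practice, lists holding a single type are easier to analyse, so mixed lists are best kept for small, ad hoc cases.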

🧪 Exercise: List Practice

Create a list of iron intakes for three hippos (e.g. 8.2, 7.9, 8.5). Then:

Print the full list
Access the second value and print it
Add one more value to the list
Calculate and print the average

💬 Add comments to explain what each line does.

# Your code here

✏️ Hint for the average:
average = sum(your_list) / len(your_list)

💡 Show Solution

Solution code below:

# Create a list of iron intakes
iron_intakes = [8.2, 7.9, 8.5]  # Three values in mg

# Print the full list
print("All iron intakes:", iron_intakes)

# Access and print the second value
print("Second value:", iron_intakes[1])

# Add another value
iron_intakes.append(8.1)

# Calculate the average
average = sum(iron_intakes) / len(iron_intakes)
print("Average:", round(average, 2), "mg")

📖 Python Dictionaries

Dictionaries are another fundamental data structure in Python. Unlike lists which store elements by position, dictionaries store key–value pairs. They’re very useful for structured data, like storing the nutrients of a single hippo.

🔧 Creating a Dictionary

# Create a dictionary with nutrient values
hippo_nutrients = {
    "Iron": 8.2,       # mg/day
    "Calcium": 1200,   # mg/day
    "Protein": 80.5    # g/day
}

📦 Accessing and Modifying Values

# Access a value by key
print(hippo_nutrients["Calcium"])  # Output: 1200

# Add or update a value
hippo_nutrients["Vitamin C"] = 90  # Add a new nutrient
hippo_nutrients["Iron"] = 8.5      # Update the iron value

# Remove a key
del hippo_nutrients["Protein"]

🔁 Looping Through a Dictionary

for nutrient, value in hippo_nutrients.items():
    print(f"{nutrient}: {value}")

⚠️ Notes

Keys must be unique and hashable (in practice, immutable types such as strings, numbers, or tuples).
Before Python 3.7, dictionaries did not guarantee any order; from 3.7 onward, they preserve insertion order.
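
Two practical consequences of these notes, sketched with a small dictionary redefined here so the code runs on its own: looking up a missing key raises a KeyError unless you use get(), and repeating a key overwrites the earlier value. The "Zinc" key and the duplicate dictionary are illustrative only.

```python
hippo_nutrients = {"Iron": 8.5, "Calcium": 1200, "Vitamin C": 90}

# hippo_nutrients["Zinc"] would raise a KeyError, so use get() for optional keys
zinc = hippo_nutrients.get("Zinc")               # returns None when the key is missing
zinc_default = hippo_nutrients.get("Zinc", 0.0)  # or a default value you choose
print(zinc, zinc_default)

# Test for a key with "in" before using it
if "Calcium" in hippo_nutrients:
    print("Calcium:", hippo_nutrients["Calcium"])

# Keys are unique: a repeated key keeps only the last value
duplicate = {"Iron": 8.2, "Iron": 9.0}
print(duplicate)
```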

🧪 Exercise: Build a Dictionary

Create a dictionary called hippo_H5 with the following values:

"Calories": 2500
"Protein": 82.0
"Water": 3500

Print the dictionary and then update "Water" to 3600.

# Your code here

✏️ Hint for updating a value:
your_dict["Water"] = 3600

💡 Show Solution

Solution code below:

# Create the dictionary for hippo H5
hippo_H5 = {
    "Calories": 2500,
    "Protein": 82.0,
    "Water": 3500
}

# Print the dictionary
print("Original:", hippo_H5)

# Update the water value
hippo_H5["Water"] = 3600

# Print the updated dictionary to check the change
print("Updated:", hippo_H5)

🧮 NumPy Arrays

NumPy (short for Numerical Python) is a powerful library for numerical computing in Python. It introduces the array, a structure similar to a list, but designed for mathematical operations on large datasets. Arrays are:

Faster and more memory-efficient than Python lists for large numerical datasets
Ideal for numerical calculations
The basis for many operations in pandas, scikit-learn, and other libraries
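
A quick sketch of why arrays suit numerical work: the same operator behaves differently on a list and on an array. The variable names are illustrative.

```python
import numpy as np

iron_list = [8.2, 7.9, 8.5]
iron_array = np.array(iron_list)

doubled_list = iron_list * 2     # repeats the list: six items, values unchanged
doubled_array = iron_array * 2   # multiplies every element by 2

print(doubled_list)
print(doubled_array)

# With a plain list, element-wise maths needs a loop or comprehension
doubled_via_loop = [value * 2 for value in iron_list]
print(doubled_via_loop)
```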

📐 What is a NumPy Array?

A NumPy array is like a supercharged list. It can be 1D (like a list), 2D (like a table or matrix), or even higher dimensions.

import numpy as np

# Create a 1D array
iron = np.array([8.2, 7.9, 8.5])
print("1D array:", iron)

# Create a 2D array (like a table of values)
nutrients = np.array([[8.2, 1200], [7.9, 1150], [8.5, 1250]])
print("2D array:\n", nutrients)

🧾 Key Features

Arrays support element-wise operations:

print(iron + 1)  # Adds 1 to each element
print(iron * 2)  # Multiplies each element by 2

You can access elements using indices, starting from 0:

print(iron[1])         # 2nd value (7.9)
print(nutrients[0, 1]) # Row 0, column 1 (1200)
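
For 2D arrays, it often helps to check the shape and to work on whole columns at once. This sketch assumes, as in the example above, that column 0 holds iron and column 1 holds calcium.

```python
import numpy as np

nutrients = np.array([[8.2, 1200], [7.9, 1150], [8.5, 1250]])

print(nutrients.shape)                 # (rows, columns)

iron_column = nutrients[:, 0]          # all rows, column 0
print(iron_column)

column_means = nutrients.mean(axis=0)  # axis=0 averages down each column
print(column_means)
```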

You can easily calculate statistics:

print(np.mean(iron))     # Average
print(np.max(iron))      # Max
print(np.std(iron))      # Standard deviation (population SD; NumPy's default is ddof=0)
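
A note on standard deviation: np.std divides by n by default (the population formula), whereas the pandas std() method divides by n - 1 (the sample formula). A minimal sketch of how to request the sample version in NumPy:

```python
import numpy as np

iron = np.array([8.2, 7.9, 8.5])

population_sd = np.std(iron)        # ddof=0: divides by n
sample_sd = np.std(iron, ddof=1)    # ddof=1: divides by n - 1

print(round(population_sd, 3))
print(round(sample_sd, 3))
```

For small samples the two values differ noticeably, so it is worth stating which one you report.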

🧪 Exercise: NumPy Practice

Create a NumPy array of calcium intakes: [1200, 1150, 1250]
Print the array
Calculate and print the average
Multiply all values by 0.001 to convert to grams
Access the last value

💬 Add comments explaining each step.

# Your code here

💡 Show Answer

Solution code below:

import numpy as np

# Create the NumPy array of calcium intakes (mg)
calcium = np.array([1200, 1150, 1250])

# Print the array
print("Array:", calcium)

# Calculate and print the average
mean_calcium = np.mean(calcium)
print("Mean calcium:", mean_calcium, "mg")

# Multiply all values by 0.001 to convert mg to grams
calcium_g = calcium * 0.001
print("Calcium in grams:", calcium_g)

# Access the last value
print("Last value:", calcium[-1])

🐼 Pandas DataFrames in Python

pandas is a powerful library for data analysis in Python. It allows you to create and manipulate DataFrames, which are like spreadsheets or Excel tables. Each column can have a different type (e.g. numbers, strings, dates), making them perfect for real-world datasets.

🧠 What is a DataFrame?

Think of a DataFrame as a table of rows and columns:
- Rows represent individual entries (e.g. each hippo)
- Columns represent different variables (e.g. Calories, Protein, Water)

You can create a DataFrame from a dictionary, a list of dictionaries, or by reading from a CSV or Excel file.
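
Here is a minimal sketch of two of these routes. To keep it self-contained, the CSV is read from a string with io.StringIO rather than from a real file; with an actual file you would pass its path to pd.read_csv instead. The records and CSV text are made up for the example.

```python
import io
import pandas as pd

# From a list of dictionaries: each dictionary becomes one row
records = [
    {"ID": "H1", "Calories": 2500},
    {"ID": "H2", "Calories": 2450},
]
df_from_records = pd.DataFrame(records)
print(df_from_records)

# From CSV text: the first line provides the column names
csv_text = "ID,Calories\nH1,2500\nH2,2450\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))
print(df_from_csv)
```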

📦 Creating a DataFrame from a Dictionary

import pandas as pd

# Dictionary of hippo data
hippo_data = {
    "ID": ["H1", "H2", "H3"],
    "Calories": [2500, 2450, 2600],
    "Protein": [80.5, 78.0, 85.2],
    "Water": [3400, 3300, 3600]
}

# Create a DataFrame
df = pd.DataFrame(hippo_data)

# Display the DataFrame
print(df)

This will display:

   ID  Calories  Protein  Water
0  H1      2500     80.5   3400
1  H2      2450     78.0   3300
2  H3      2600     85.2   3600

You can access columns with df["Calories"], filter with df[df["Calories"] > 2500], and more!
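
A short sketch of these operations, rebuilding the same hippo table so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["H1", "H2", "H3"],
    "Calories": [2500, 2450, 2600],
    "Protein": [80.5, 78.0, 85.2],
    "Water": [3400, 3300, 3600],
})

calories = df["Calories"]              # one column (a pandas Series)
subset = df[["ID", "Protein"]]         # several columns (a smaller DataFrame)
high_cal = df[df["Calories"] > 2500]   # rows where the condition is True
mean_protein = df["Protein"].mean()    # summary statistic for one column

print(high_cal)
print(round(mean_protein, 2))
```

Note that the filter uses a strict "greater than", so a hippo with exactly 2500 kcal is not included.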

🧪 Exercise: Create a Hippo DataFrame

Create a DataFrame with two hippos and the following information:

ID: H4, H5
Calories: 2400, 2550
Protein: 78.0, 81.0
Water: 3450, 3550

Then print the DataFrame.

💬 Hint: Use pd.DataFrame() with a dictionary like in the example above.

# Your code here

🧪 Exercise: Create a DataFrame

Create a DataFrame with nutrient information for three hippos:

Hippo IDs: H1, H2, H3
Calories: 2500, 2400, 2550
Protein: 80.5, 77.0, 83.2

Then:
- Print the full DataFrame
- Access the protein value of the second hippo
- Add a new column called “Water” with values: 3200, 3300, 3100

💬 Use pd.DataFrame() and df.loc or df['column'] syntax.

💡 Show Answer

Solution code below:

import pandas as pd

# Create the DataFrame
data = {
    "ID": ["H1", "H2", "H3"],
    "Calories": [2500, 2400, 2550],
    "Protein": [80.5, 77.0, 83.2]
}
df = pd.DataFrame(data)

# Print the full DataFrame
print("Original DataFrame:")
print(df)

# Access protein value of second hippo (index 1)
print("Protein for H2:", df.loc[1, "Protein"])

# Add Water column
df["Water"] = [3200, 3300, 3100]
print("Updated DataFrame:")
print(df)

📦 Summary: Choosing and Using Data Structures in Python

In this section, you’ve learned how to use different data structures in Python for handling and analysing nutrition data. Let’s summarise the main structures and when to use them.

🧰 Overview of Structures

Structure	Use For	Syntax / Package
List	Ordered collection of values	`[1, 2, 3]`
Dictionary	Key–value pairs	`{'Iron': 8.2}`
NumPy Array	Numerical arrays, efficient computation	`np.array([1,2,3])`
DataFrame	Tabular data (like Excel or R data.frame)	`pd.DataFrame(...)`

🧭 Which One Should I Use?

Use a list when:
- You need to store a simple, ordered set of values.
- Indexing is important (e.g. mylist[0]).
Use a dictionary when:
- You want to label values clearly with keys.
- You need to retrieve items by name, not position.
Use a NumPy array when:
- You’re doing numerical calculations (means, sums, matrix operations).
- Performance and speed are priorities.
Use a pandas DataFrame when:
- You are working with tabular data (like a CSV file or Excel sheet).
- You want to manipulate rows/columns, filter data, or summarise results.

🧪 Example Scenarios

Scenario	Recommended Structure
Store iron intakes for 3 hippos	`list`
Store nutrient info for 1 hippo	`dict`
Efficiently analyse 1000 hippos’ iron levels	`NumPy array`
Work with NDNS-style tables with many columns/rows	`DataFrame`

🧠 Advanced Tip: Converting Between Structures

list → array: np.array(mylist)
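dict → DataFrame: pd.DataFrame([mydict]) for a dictionary of single values (one row), or pd.DataFrame(mydict) for a dictionary of lists (one column per key)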
DataFrame column → list: df['Calories'].tolist()
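
The conversions above can be sketched as follows. The names mylist, mydict, and columns are placeholders, and the reverse conversions (tolist, to_numpy, to_dict) are included as optional extras.

```python
import numpy as np
import pandas as pd

# list -> array, and back again
mylist = [8.2, 7.9, 8.5]
arr = np.array(mylist)
back_to_list = arr.tolist()

# dict of single values -> one-row DataFrame (wrap the dict in a list)
mydict = {"Iron": 8.2, "Calcium": 1200}
one_row = pd.DataFrame([mydict])

# dict of lists -> DataFrame with one column per key (no wrapping needed)
columns = {"ID": ["H1", "H2"], "Calories": [2500, 2450]}
df = pd.DataFrame(columns)

# DataFrame column -> list or NumPy array
calories_list = df["Calories"].tolist()
calories_array = df["Calories"].to_numpy()

# DataFrame -> list of dictionaries (one per row)
rows = df.to_dict(orient="records")

print(one_row)
print(calories_list, calories_array)
print(rows)
```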

✅ Conclusion

Python gives you powerful tools for structuring and analysing data. By choosing the right structure, you make your code faster, easier to read, and more reproducible. 🎯

In future sections, you’ll work more with DataFrames for real datasets—but the foundations you’ve learned here apply throughout your coding journey!