# Setup for Google Colab: Ensure environment is ready
# Note: This module (Programming Basics) does not require datasets
print('No dataset required for this notebook 🦛')
# Install required packages for this notebook
%pip install pandas
%pip install numpy
# ✅ Import necessary libraries
import pandas as pd
import numpy as np
print('Python environment ready.')
📊 2.4 Data Structures
This notebook introduces Python data structures (lists, dictionaries, DataFrames) for organizing data.
Objectives:
- Use lists and dictionaries to store nutrient data.
- Introduce pandas DataFrames for tabular data.
- Manipulate data structures for analysis.
Context: Data structures are the backbone of nutrition datasets, like NDNS or hippo diet logs.
Fun Fact
A DataFrame is like a hippo’s pantry: everything neatly organised for quick access! 🦛
📦 What Are Data Structures?
When we work with data (nutrient intakes, participant IDs, or sensory scores), we need a way to store, organise, and manipulate that information in our program. This is where data structures come in.
A data structure is a container that lets you group and organise multiple pieces of data, often in a way that reflects how they’re related. Just like you might sort food in a fridge by type (dairy, vegetables, snacks), you use different data structures in Python to group related values together and work with them efficiently.
Different structures are better suited for different tasks:
- A list is like a shopping list—just an ordered series of items.
- A dictionary is like a nutrition label—each value is labelled.
- A NumPy array is like a spreadsheet with numbers—fast, regular, and good for maths.
- A pandas DataFrame is like an Excel sheet or R data.frame—columns with headers, rows with values, and powerful tools for analysis.
Understanding data structures helps you:
- Organise your code clearly
- Avoid repeating yourself
- Analyse large datasets effectively (e.g. NDNS or food diaries)
- Prepare your data for visualisation, statistics, or modelling
In this notebook, we’ll explore the main Python data structures used in nutrition research and learn how to apply them in practice.
🧺 Lists in Python
A list is a built-in Python data structure used to store a collection of items.
Think of it like a hippo’s lunchbox 🦛—it might contain apples, carrots, and sweet potatoes all in one tidy container.
✅ Why Lists Matter
- Store multiple related values in a single variable
- Access, modify, or remove individual items
- Iterate over them using loops
- Useful for nutrition data, e.g. daily intakes, meal components
🛠️ Creating and Accessing Lists
# A list of daily calcium intakes (mg) for different hippos
calcium_intakes = [1200, 1150, 1250]
# Accessing elements (index starts at 0)
print(calcium_intakes[0]) # First element
print(calcium_intakes[-1]) # Last element
🧮 List Operations
# Built-in operations
print(len(calcium_intakes)) # Number of items
print(sum(calcium_intakes)) # Total calcium intake
print(max(calcium_intakes)) # Highest value
# Add and update
calcium_intakes.append(1180)  # Add new intake to the end
calcium_intakes[1] = 1190  # Update second value
# Remove
calcium_intakes.remove(1250)  # Remove specific value
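The loop-based iteration listed under "Why Lists Matter" can be sketched as follows. This is only a minimal illustration: the 1200 mg cut-off and the name `threshold_mg` are assumptions made for the example, not values from this course.

```python
# Daily calcium intakes (mg), redefined here so the example runs on its own
calcium_intakes = [1200, 1150, 1250]

# Hypothetical cut-off used only for this illustration
threshold_mg = 1200

# enumerate() gives both the position and the value on each pass
for position, intake in enumerate(calcium_intakes):
    status = "meets" if intake >= threshold_mg else "is below"
    print(f"Hippo {position + 1}: {intake} mg {status} the threshold")

# A list comprehension builds a new list from an existing one
intakes_in_grams = [intake / 1000 for intake in calcium_intakes]
print(intakes_in_grams)
```

The comprehension leaves `calcium_intakes` unchanged and returns a new list, which is usually safer than editing values in place while looping.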
💡 Advanced Tip: Lists Can Store Anything
Lists can hold different types of values—even other lists!
mixed_list = ['apple', 3.5, True, [1, 2, 3]]
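As a short sketch of how such a mixed list might be inspected, the example below prints the type of each item and then indexes twice to reach inside the nested list. The variable name `nested` is introduced only for illustration.

```python
# The mixed list from the text: a string, a float, a boolean, and a nested list
mixed_list = ['apple', 3.5, True, [1, 2, 3]]

# Show the type of each item alongside its value
for item in mixed_list:
    print(type(item).__name__, item)

# The fourth item (index 3) is itself a list
nested = mixed_list[3]
print(nested[0])         # First element of the inner list

# Two indices in a row reach the same inner values in one step
print(mixed_list[3][2])  # Third element of the inner list
```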
🧪 Exercise: List Practice
Create a list of iron intakes for three hippos (e.g. 8.2, 7.9, 8.5). Then:
- Print the full list
- Access the second value and print it
- Add one more value to the list
- Calculate and print the average
💬 Add comments to explain what each line does.
# Your code here
✏️ Hint for the average:
average = sum(your_list) / len(your_list)
💡 Show Solution
Solution code below:
# Create a list of iron intakes
iron_intakes = [8.2, 7.9, 8.5]  # Three values in mg
# Print the full list
print("All iron intakes:", iron_intakes)
# Access and print the second value
print("Second value:", iron_intakes[1])
# Add another value
iron_intakes.append(8.1)
# Calculate the average
average = sum(iron_intakes) / len(iron_intakes)
print("Average:", round(average, 2), "mg")
📖 Python Dictionaries
Dictionaries are another fundamental data structure in Python. Unlike lists, which store elements by position, dictionaries store key–value pairs. They’re very useful for structured data, like storing the nutrients of a single hippo.
🔧 Creating a Dictionary
# Create a dictionary with nutrient values
hippo_nutrients = {
    "Iron": 8.2,      # mg/day
    "Calcium": 1200,  # mg/day
    "Protein": 80.5   # g/day
}
📦 Accessing and Modifying Values
# Access a value by key
print(hippo_nutrients["Calcium"]) # Output: 1200
# Add or update a value
hippo_nutrients["Vitamin C"] = 90  # Add a new nutrient
hippo_nutrients["Iron"] = 8.5  # Update the iron value
# Remove a key
del hippo_nutrients["Protein"]
🔁 Looping Through a Dictionary
for nutrient, value in hippo_nutrients.items():
    print(f"{nutrient}: {value}")
⚠️ Notes
- Keys must be unique and immutable (typically strings or numbers).
- Dictionaries are unordered before Python 3.7 (from 3.7 onward, they preserve insertion order).
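The notes above can be tried out with a small sketch. The dictionary is redefined so the example runs on its own, and the tuple key `("H1", "Monday")` and the flag `list_key_rejected` are hypothetical names used only for illustration.

```python
# Redefined here so the example is self-contained
hippo_nutrients = {"Iron": 8.5, "Calcium": 1200, "Vitamin C": 90}

# Check whether a key exists before using it
print("Protein" in hippo_nutrients)

# .get() returns a default instead of raising KeyError for a missing key
protein = hippo_nutrients.get("Protein", None)
print(protein)

# Keys are unique: assigning to an existing key overwrites its value
hippo_nutrients["Iron"] = 9.0
print(len(hippo_nutrients))  # Still three entries

# A tuple is immutable, so it can be a key; a list is mutable, so it cannot
intake_by_hippo_and_day = {("H1", "Monday"): 8.2}  # hypothetical keys
list_key_rejected = False
try:
    invalid = {["H1", "Monday"]: 8.2}
except TypeError as error:
    list_key_rejected = True
    print("Lists cannot be keys:", error)

# Keys come back in insertion order (Python 3.7 onward)
print(list(hippo_nutrients.keys()))
```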
🧪 Exercise: Build a Dictionary
Create a dictionary called hippo_H5 with the following values:
- "Calories": 2500
- "Protein": 82.0
- "Water": 3500
Print the dictionary and then update "Water" to 3600.
# Your code here
💡 Show Solution
Solution code below:
# Create the dictionary for hippo H5
hippo_H5 = {
    "Calories": 2500,
    "Protein": 82.0,
    "Water": 3500
}
# Print the dictionary
print("Original:", hippo_H5)
# Update the water value
hippo_H5["Water"] = 3600
# Print again to confirm the change
print("Updated:", hippo_H5)
🧮 NumPy Arrays
NumPy (short for Numerical Python) is a powerful library for numerical computing in Python. It introduces the array, a structure similar to a list, but designed for mathematical operations on large datasets. Arrays are:
- Faster and more memory-efficient than Python lists
- Ideal for numerical calculations
- The basis for many operations in pandas, scikit-learn, and other libraries
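One way to see why arrays suit numerical work is to compare how the `*` operator behaves on a list and on an array. The sketch below uses illustrative variable names and does not measure speed; it only shows the difference in behaviour.

```python
import numpy as np

iron_list = [8.2, 7.9, 8.5]
iron_array = np.array(iron_list)

# Multiplying a list by 2 repeats the list; no arithmetic happens
doubled_list = iron_list * 2
print(doubled_list)

# Multiplying an array by 2 multiplies every element
doubled_array = iron_array * 2
print(doubled_array)
```

With a plain list, the same arithmetic needs a loop or a list comprehension.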
📐 What is a NumPy Array?
A NumPy array is like a supercharged list. It can be 1D (like a list), 2D (like a table or matrix), or even higher dimensions.
import numpy as np
# Create a 1D array
iron = np.array([8.2, 7.9, 8.5])
print("1D array:", iron)
# Create a 2D array (like a table of values)
nutrients = np.array([[8.2, 1200], [7.9, 1150], [8.5, 1250]])
print("2D array:\n", nutrients)
🧾 Key Features
- Arrays support element-wise operations:
print(iron + 1) # Adds 1 to each element
print(iron * 2) # Multiplies each element by 2
- You can access elements using indices, starting from 0:
print(iron[1]) # 2nd value (7.9)
print(nutrients[0, 1]) # Row 0, column 1 (1200)
- You can easily calculate statistics:
print(np.mean(iron)) # Average
print(np.max(iron)) # Max
print(np.std(iron)) # Standard deviation (population form, ddof=0 by default)
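The same ideas extend to 2D arrays. The sketch below assumes the two columns of `nutrients` hold iron and calcium (in mg), which the original example suggests but does not state. The 1150 mg cut-off is arbitrary and chosen only to demonstrate filtering.

```python
import numpy as np

# Assumed layout: one row per hippo; columns are iron (mg) and calcium (mg)
nutrients = np.array([[8.2, 1200], [7.9, 1150], [8.5, 1250]])

# shape reports (number of rows, number of columns)
print(nutrients.shape)

# axis=0 collapses the rows, giving one mean per column
column_means = nutrients.mean(axis=0)
print(column_means)

# A colon selects every row; 1 selects the second column
calcium = nutrients[:, 1]
print(calcium)

# A boolean mask keeps only the rows where the condition is True
high_calcium_rows = nutrients[calcium > 1150]
print(high_calcium_rows)
```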
🧪 Exercise: NumPy Practice
- Create a NumPy array of calcium intakes: [1200, 1150, 1250]
- Print the array
- Calculate and print the average
- Multiply all values by 0.001 to convert to grams
- Access the last value
💬 Add comments explaining each step.
# Your code here
💡 Show Answer
Solution code below:
import numpy as np
# Create the NumPy array
calcium = np.array([1200, 1150, 1250])
# Print the array
print("Array:", calcium)
# Check the type
print("Type:", type(calcium))
# Calculate and print the mean
mean_calcium = np.mean(calcium)
print("Mean calcium:", mean_calcium)
# Multiply all values by 0.001 to convert mg to grams
calcium_g = calcium * 0.001
print("Calcium in grams:", calcium_g)
# Access the last value
print("Last value:", calcium[-1])
🐼 Pandas DataFrames in Python
pandas is a powerful library for data analysis in Python. It allows you to create and manipulate DataFrames, which are like spreadsheets or Excel tables. Each column can have a different type (e.g. numbers, strings, dates), making them perfect for real-world datasets.
🧠 What is a DataFrame?
Think of a DataFrame as a table of rows and columns:
- Rows represent individual entries (e.g. each hippo)
- Columns represent different variables (e.g. Calories, Protein, Water)
You can create a DataFrame from a dictionary, a list of dictionaries, or by reading from a CSV or Excel file.
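Two of the construction routes mentioned above, a list of dictionaries and a CSV source, might look like the sketch below. To keep it runnable without a file on disk, the CSV text is held in memory with `io.StringIO`. The values and the names `records` and `csv_text` are illustrative assumptions.

```python
import io
import pandas as pd

# Route 1: a list of dictionaries, one dictionary per row
records = [
    {"ID": "H1", "Calories": 2500, "Protein": 80.5},
    {"ID": "H2", "Calories": 2450, "Protein": 78.0},
]
df_from_records = pd.DataFrame(records)
print(df_from_records)

# Route 2: CSV text. With a real file you would pass its path
# to pd.read_csv instead of a StringIO object.
csv_text = "ID,Calories,Protein\nH1,2500,80.5\nH2,2450,78.0\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))
print(df_from_csv)
```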
📦 Creating a DataFrame from a Dictionary
import pandas as pd
# Dictionary of hippo data
hippo_data = {
    "ID": ["H1", "H2", "H3"],
    "Calories": [2500, 2450, 2600],
    "Protein": [80.5, 78.0, 85.2],
    "Water": [3400, 3300, 3600]
}
# Create a DataFrame
df = pd.DataFrame(hippo_data)
# Display the DataFrame
print(df)
This will display:
   ID  Calories  Protein  Water
0  H1      2500     80.5   3400
1  H2      2450     78.0   3300
2  H3      2600     85.2   3600
You can access columns with df["Calories"], filter with df[df["Calories"] > 2500], and more!
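A brief sketch of these operations, using the same hippo data redefined so the example runs on its own. The names `calories` and `high_calorie` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["H1", "H2", "H3"],
    "Calories": [2500, 2450, 2600],
    "Protein": [80.5, 78.0, 85.2],
    "Water": [3400, 3300, 3600],
})

# Selecting one column returns a pandas Series
calories = df["Calories"]
print(calories.mean())

# A boolean condition inside the brackets keeps only the matching rows
high_calorie = df[df["Calories"] > 2500]
print(high_calorie)

# .loc looks up a single cell by row label and column name
print(df.loc[1, "Protein"])

# describe() summarises the numeric columns
print(df.describe())
```

Note that the filter uses a strict `>`, so a hippo with exactly 2500 calories is excluded.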
🧪 Exercise: Create a Hippo DataFrame
Create a DataFrame with two hippos and the following information:
- ID: H4, H5
- Calories: 2400, 2550
- Protein: 78.0, 81.0
- Water: 3450, 3550
Then print the DataFrame.
💬 Hint: Use pd.DataFrame() with a dictionary like in the example above.
# Your code here
🧪 Exercise: Create a DataFrame
Create a DataFrame with nutrient information for three hippos:
- Hippo IDs: H1, H2, H3
- Calories: 2500, 2400, 2550
- Protein: 80.5, 77.0, 83.2
Then:
- Print the full DataFrame
- Access the protein value of the second hippo
- Add a new column called “Water” with values: 3200, 3300, 3100
💬 Use pd.DataFrame() and df.loc or df['column'] syntax.
💡 Show Answer
Solution code below:
import pandas as pd
# Create the DataFrame
data = {
    "ID": ["H1", "H2", "H3"],
    "Calories": [2500, 2400, 2550],
    "Protein": [80.5, 77.0, 83.2]
}
df = pd.DataFrame(data)
# Print the full DataFrame
print("Original DataFrame:")
print(df)
# Access protein value of second hippo (index 1)
print("Protein for H2:", df.loc[1, "Protein"])
# Add Water column
df["Water"] = [3200, 3300, 3100]
print("Updated DataFrame:")
print(df)
📦 Summary: Choosing and Using Data Structures in Python
In this section, you’ve learned how to use different data structures in Python for handling and analysing nutrition data. Let’s summarise the main structures and when to use them.
🧰 Overview of Structures
Structure | Use For | Syntax / Package |
---|---|---|
List | Ordered collection of values | [1, 2, 3] |
Dictionary | Key–value pairs | {'Iron': 8.2} |
NumPy Array | Numerical arrays, efficient computation | np.array([1,2,3]) |
DataFrame | Tabular data (like Excel or R data.frame) | pd.DataFrame(...) |
🧭 Which One Should I Use?
- Use a list when:
  - You need to store a simple, ordered set of values.
  - Indexing is important (e.g. mylist[0]).
- Use a dictionary when:
  - You want to label values clearly with keys.
  - You need to retrieve items by name, not position.
- Use a NumPy array when:
  - You’re doing numerical calculations (means, sums, matrix operations).
  - Performance and speed are priorities.
- Use a pandas DataFrame when:
  - You are working with tabular data (like a CSV file or Excel sheet).
  - You want to manipulate rows/columns, filter data, or summarise results.
🧪 Example Scenarios
Scenario | Recommended Structure |
---|---|
Store iron intakes for 3 hippos | list |
Store nutrient info for 1 hippo | dict |
Efficiently analyse 1000 hippos’ iron levels | NumPy array |
Work with NDNS-style tables with many columns/rows | DataFrame |
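The four scenarios in the table could be set up roughly as follows. This is a sketch under stated assumptions: the 1000 iron values are simulated from a normal distribution with an assumed mean (8.2) and spread (0.3), not real measurements, and all variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Scenario 1: iron intakes for 3 hippos -> list
iron_intakes = [8.2, 7.9, 8.5]

# Scenario 2: nutrient info for 1 hippo -> dict
hippo_nutrients = {"Iron": 8.2, "Calcium": 1200, "Protein": 80.5}

# Scenario 3: iron levels for 1000 hippos -> NumPy array
# Simulated data only; the mean and spread are assumptions for this example
rng = np.random.default_rng(seed=0)
simulated_iron = rng.normal(loc=8.2, scale=0.3, size=1000)
print(simulated_iron.mean(), simulated_iron.std())

# Scenario 4: a table with several columns -> DataFrame
df = pd.DataFrame({
    "ID": ["H1", "H2", "H3"],
    "Iron": iron_intakes,
    "Calcium": [1200, 1150, 1250],
})
print(df)
```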
🧠 Advanced Tip: Converting Between Structures
- list → array: np.array(mylist)
- dict → DataFrame: pd.DataFrame([mydict])
- DataFrame column → list: df['Calories'].tolist()
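The conversions above can be combined into a short round trip. The contents of `mylist` and `mydict` are illustrative assumptions, and the reverse conversions (`arr.tolist()` and `df.to_dict(orient="records")`) are extras that go beyond the three listed.

```python
import numpy as np
import pandas as pd

mylist = [8.2, 7.9, 8.5]
mydict = {"ID": "H1", "Calories": 2500}

# list -> array, and back again
arr = np.array(mylist)
back_to_list = arr.tolist()

# dict -> one-row DataFrame. The dict is wrapped in a list because
# pd.DataFrame(mydict) with only scalar values raises a ValueError.
df = pd.DataFrame([mydict])
print(df)

# DataFrame column -> list
calories_list = df["Calories"].tolist()

# DataFrame -> list of dicts, one dict per row
rows = df.to_dict(orient="records")
print(rows[0])
```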
✅ Conclusion
Python gives you powerful tools for structuring and analysing data. By choosing the right structure, you make your code faster, easier to read, and more reproducible. 🎯
In future sections, you’ll work more with DataFrames for real datasets, but the foundations you’ve learned here apply throughout your coding journey!