```python
# Setup for Google Colab: fetch the dataset automatically or upload it manually
import os
from google.colab import files

MODULE = '03_data_handling'
DATASET = 'hippo_nutrients.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

try:
    # Clone the course repository if it is not already present
    if not os.path.exists(BASE_PATH):
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    os.chdir(MODULE_PATH)
    assert os.path.exists(DATASET_PATH)
    print(f'Dataset found: {DATASET_PATH} ✅')
except Exception as e:
    # Fall back to a manual upload if the clone (or the path check) fails
    print(f'Automatic clone failed: {e}')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} ✅')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')
print('Environment ready.')
```
# 📥 3.2 Importing Data
Before you can analyse data, you first need to import it into Python. This notebook introduces the most common ways of bringing nutrition data into your workflow.
Think of it like shopping for ingredients 🛒: you can't cook (analyse) until you've brought the food (data) into your kitchen (Python).
## 🎯 Objectives

By the end of this notebook you should be able to:

- Import data from CSV and Excel files using pandas.
- Recognise other common formats: TSV, JSON, databases, APIs.
- Verify data after import (shape, column names, head of table).
- Apply these skills to `hippo_nutrients.csv`.
Install additional libraries.
```python
%pip install pandas openpyxl sqlalchemy -q

import pandas as pd
import numpy as np
```
## 📄 Importing CSV Files
CSV (Comma Separated Values) is the most common format in nutrition research. It is plain text, easy to share, and can be opened in Excel or any text editor.
Let's load `hippo_nutrients.csv`.

```python
df_csv = pd.read_csv('data/hippo_nutrients.csv')
print(f'Shape: {df_csv.shape}')
print(f'Columns: {df_csv.columns.tolist()}')
display(df_csv.head())
```
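`pd.read_csv` accepts many options beyond the defaults, which often save a cleaning step later. A minimal sketch using an in-memory CSV so it runs anywhere; the column names and the `na_values`/`dtype` choices below are illustrative, not taken from the course dataset:

```python
import io
import pandas as pd

# A tiny made-up CSV held in memory (any file path works the same way)
raw = """animal,nutrient,amount
H1,Iron,8.2
H2,Calcium,NA
H3,Iron,7.9
"""

df_opts = pd.read_csv(
    io.StringIO(raw),
    na_values=['NA'],             # treat the string 'NA' as a missing value
    dtype={'animal': 'string'},   # force a column's dtype at import time
)

print(df_opts.shape)                   # three rows, three columns
print(df_opts['amount'].isna().sum())  # one missing value, from the 'NA' cell
```

Other frequently useful options include `sep`, `usecols`, `skiprows`, and `parse_dates`.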
## 📊 Importing Excel Files
Excel is common in labs and public health datasets. You may encounter files with multiple sheets, formatting quirks, or missing values.
Example (requires `openpyxl`):

```python
# %pip install openpyxl  # uncomment if needed in Colab
# df_excel = pd.read_excel('data/hippo_nutrients.xlsx', sheet_name='Sheet1')
```
For this demo we'll reuse the CSV DataFrame but pretend it came from Excel.

```python
df_excel = df_csv.copy()
print(f'Excel shape: {df_excel.shape}')
display(df_excel.head())
```
## 🌐 Other Common Data Sources
### TSV (Tab-Separated Values)

```python
df_tsv = pd.read_csv('data/example.tsv', sep='\t')
```
### JSON (e.g. from APIs)

```python
df_json = pd.read_json('data/example.json')
```
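If you want to try JSON import without a file on disk, you can read from an in-memory string via `io.StringIO`. A hedged sketch; the field names below are invented for the demo:

```python
import io
import pandas as pd

# A JSON array of records, as an API might return it
records = '[{"nutrient": "Iron", "amount": 8.2}, {"nutrient": "Zinc", "amount": 1.1}]'

df_json_demo = pd.read_json(io.StringIO(records))
print(df_json_demo.columns.tolist())  # one column per JSON key
print(len(df_json_demo))              # one row per record
```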
### SQL Databases (clinical or survey data)

```python
from sqlalchemy import create_engine

engine = create_engine('sqlite:///data/nutrition.db')
df_sql = pd.read_sql('SELECT * FROM hippo_nutrients', engine)
```
### Web APIs (advanced)

Many nutrition datasets are accessible via APIs. Example:

```python
import requests

url = 'https://api.example.com/nutrients'
data = requests.get(url).json()
df_api = pd.DataFrame(data)
```
📌 You don't need to master all these at once, but it's important to know they exist!
## 🧪 Exercise 1: CSV Practice

- Load `hippo_nutrients.csv` into a DataFrame.
- Print the first 5 rows.
- How many unique nutrients are in the dataset?
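For the last question, `Series.nunique()` counts distinct values. A tiny sketch with made-up data; that the real file has a column called `nutrient` is an assumption here:

```python
import pandas as pd

# Made-up stand-in for the real dataset
demo = pd.DataFrame({'nutrient': ['Iron', 'Iron', 'Calcium', 'Zinc']})

print(demo['nutrient'].nunique())  # counts distinct nutrient names
```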
## 🧪 Exercise 2: Explore Other Formats

Imagine you receive data from collaborators:

- NDNS data in a TSV file.
- A JSON file with nutrient metadata.
How would you adapt the code above to import these files?
💡 You don't need the real files now; just sketch out the pandas command you'd use.
## ✅ Conclusion

In this notebook you:

- Imported CSV and Excel data.
- Learned about other formats (TSV, JSON, SQL, APIs).
- Practised verifying shapes, columns, and contents after import.
👉 Next up: 3.3 Data Cleaning, because imported data is rarely perfect, and cleaning is where the real fun begins!
Resources:

- Pandas I/O Documentation
- OpenPyXL Documentation
- SQLAlchemy
- Requests (Python HTTP for APIs)
- Repository: github.com/ggkuhnle/data-analysis-projects