๐Ÿ“Š 3.2 Importing Data

Before you can analyse data, you first need to import it into Python. This notebook introduces the most common ways of bringing nutrition data into your workflow.

Think of it like shopping for ingredients ๐Ÿ›’: you canโ€™t cook (analyse) until youโ€™ve brought the food (data) into your kitchen (Python).


๐ŸŽฏ Objectives

By the end of this notebook you should be able to: - Import data from CSV and Excel files using pandas. - Recognise other common formats: TSV, JSON, databases, APIs. - Verify the data after import (shapes, column names, head of table). - Apply these skills to hippo_nutrients.csv.

# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files


MODULE = '03_data_handling'
DATASET = 'hippo_nutrients.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

try:
    if not os.path.exists(BASE_PATH):
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    os.chdir(MODULE_PATH)
    assert os.path.exists(DATASET_PATH)
    print(f'Dataset found: {DATASET_PATH} โœ…')
except Exception as e:
    print(f'Automatic clone failed: {e}')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} โœ…')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')


print('Environment ready.')

Install additional libraries.

%pip install pandas openpyxl sqlalchemy -q
import pandas as pd, numpy as np

๐Ÿ“‚ Importing CSV Files

CSV (Comma Separated Values) is the most common format in nutrition research. It is plain text, easy to share, and can be opened in Excel or any text editor.

Letโ€™s load hippo_nutrients.csv.

df_csv = pd.read_csv('data/hippo_nutrients.csv')
print(f'Shape: {df_csv.shape}')
print(f'Columns: {df_csv.columns.tolist()}')
display(df_csv.head())

๐Ÿ“‘ Importing Excel Files

Excel is common in labs and public health datasets. You may encounter files with multiple sheets, formatting quirks, or missing values.

Example (requires openpyxl):

# %pip install openpyxl  # uncomment if needed in Colab
# df_excel = pd.read_excel('data/hippo_nutrients.xlsx', sheet_name='Sheet1')

For this demo weโ€™ll reuse the CSV DataFrame but pretend it came from Excel.

df_excel = df_csv.copy()
print(f'Excel shape: {df_excel.shape}')
display(df_excel.head())

๐Ÿ”— Other Common Data Sources

TSV (Tab-Separated Values)

df_tsv = pd.read_csv('data/example.tsv', sep='\t')

JSON (e.g. from APIs)

df_json = pd.read_json('data/example.json')

SQL Databases (clinical or survey data)

from sqlalchemy import create_engine
engine = create_engine('sqlite:///data/nutrition.db')
df_sql = pd.read_sql('SELECT * FROM hippo_nutrients', engine)

Web APIs (advanced)

Many nutrition datasets are accessible via APIs. Example:

import requests
url = 'https://api.example.com/nutrients'
data = requests.get(url).json()
df_api = pd.DataFrame(data)

๐Ÿ‘‰ You donโ€™t need to master all these at once, but itโ€™s important to know they exist!

๐Ÿงช Exercise 1: CSV Practice

  1. Load hippo_nutrients.csv into a DataFrame.
  2. Print the first 5 rows.
  3. How many unique nutrients are in the dataset?

๐Ÿงช Exercise 2: Explore Other Formats

Imagine you receive data from collaborators: - NDNS data in a TSV file. - A JSON file with nutrient metadata.

How would you adapt the code above to import these files?

๐Ÿ’ก You donโ€™t need the real files nowโ€”just sketch out the pandas command youโ€™d use.

โœ… Conclusion

In this notebook you: - Imported CSV and Excel data. - Learned about other formats (TSV, JSON, SQL, APIs). - Practised verifying shapes, columns, and contents after import.

๐Ÿ‘‰ Next up: 3.3 Data Cleaning โ€” because imported data is rarely perfect, and cleaning is where the real fun begins!


Resources: - Pandas I/O Documentation - OpenPyXL Documentation - SQLAlchemy - Requests (Python HTTP for APIs) - Repository: github.com/ggkuhnle/data-analysis-projects