```python
# Setup for Google Colab: fetch the dataset automatically or upload it manually
import os
from google.colab import files

MODULE = '03_data_handling'
DATASET = 'hippo_nutrients.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

try:
    # Clone the course repository if it is not already present
    if not os.path.exists(BASE_PATH):
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    os.chdir(MODULE_PATH)
    assert os.path.exists(DATASET_PATH)
    print(f'Dataset found: {DATASET_PATH} ✅')
except Exception as e:
    # Fall back to a manual upload if the clone (or the path check) fails
    print(f'Automatic clone failed: {e}')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} ✅')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')
print('Environment ready.')
```
# 📥 3.2 Importing Data
Before you can analyse data, you first need to import it into Python. This notebook introduces the most common ways of bringing nutrition data into your workflow.
Think of it like shopping for ingredients 🛒: you can't cook (analyse) until you've brought the food (data) into your kitchen (Python).
## 🎯 Objectives

By the end of this notebook you should be able to:

- Import data from CSV and Excel files using pandas.
- Recognise other common formats: TSV, JSON, databases, APIs.
- Verify data after import (shape, column names, head of table).
- Apply these skills to `hippo_nutrients.csv`.
Install additional libraries.
```python
%pip install pandas openpyxl sqlalchemy -q

import pandas as pd
import numpy as np
```
## 📄 Importing CSV Files
CSV (Comma Separated Values) is the most common format in nutrition research. It is plain text, easy to share, and can be opened in Excel or any text editor.
Let's load `hippo_nutrients.csv`.

```python
df_csv = pd.read_csv('data/hippo_nutrients.csv')
print(f'Shape: {df_csv.shape}')
print(f'Columns: {df_csv.columns.tolist()}')
display(df_csv.head())
```
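`pd.read_csv` accepts many options beyond the defaults, which often save a cleaning step later. A minimal sketch using an in-memory CSV so it runs anywhere; the column names and the `na_values`/`dtype` choices below are illustrative, not taken from the course dataset:

```python
import io
import pandas as pd

# A tiny made-up CSV held in memory (any file path works the same way)
raw = """animal,nutrient,amount
H1,Iron,8.2
H2,Calcium,NA
H3,Iron,7.9
"""

df_opts = pd.read_csv(
    io.StringIO(raw),
    na_values=['NA'],             # treat the string 'NA' as a missing value
    dtype={'animal': 'string'},   # force a column's dtype at import time
)

print(df_opts.shape)                   # three rows, three columns
print(df_opts['amount'].isna().sum())  # one missing value, from the 'NA' cell
```

Other frequently useful options include `sep`, `usecols`, `skiprows`, and `parse_dates`.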
## 📊 Importing Excel Files
Excel is common in labs and public health datasets. You may encounter files with multiple sheets, formatting quirks, or missing values.
Example (requires `openpyxl`):

```python
# %pip install openpyxl  # uncomment if needed in Colab
# df_excel = pd.read_excel('data/hippo_nutrients.xlsx', sheet_name='Sheet1')
```
For this demo we'll reuse the CSV DataFrame but pretend it came from Excel.

```python
df_excel = df_csv.copy()
print(f'Excel shape: {df_excel.shape}')
display(df_excel.head())
```
## 🌐 Other Common Data Sources
### TSV (Tab-Separated Values)

```python
df_tsv = pd.read_csv('data/example.tsv', sep='\t')
```
### JSON (e.g. from APIs)

```python
df_json = pd.read_json('data/example.json')
```
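If you want to try JSON import without a file on disk, you can read from an in-memory string via `io.StringIO`. A hedged sketch; the field names below are invented for the demo:

```python
import io
import pandas as pd

# A JSON array of records, as an API might return it
records = '[{"nutrient": "Iron", "amount": 8.2}, {"nutrient": "Zinc", "amount": 1.1}]'

df_json_demo = pd.read_json(io.StringIO(records))
print(df_json_demo.columns.tolist())  # one column per JSON key
print(len(df_json_demo))              # one row per record
```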
### SQL Databases (clinical or survey data)

```python
from sqlalchemy import create_engine

engine = create_engine('sqlite:///data/nutrition.db')
df_sql = pd.read_sql('SELECT * FROM hippo_nutrients', engine)
```
### Web APIs (advanced)

Many nutrition datasets are accessible via APIs. Example:

```python
import requests

url = 'https://api.example.com/nutrients'
data = requests.get(url).json()
df_api = pd.DataFrame(data)
```
📌 You don't need to master all these at once, but it's important to know they exist!
## 🧪 Exercise 1: CSV Practice

- Load `hippo_nutrients.csv` into a DataFrame.
- Print the first 5 rows.
- How many unique nutrients are in the dataset?
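For the last question, `Series.nunique()` counts distinct values. A tiny sketch with made-up data; that the real file has a column called `nutrient` is an assumption here:

```python
import pandas as pd

# Made-up stand-in for the real dataset
demo = pd.DataFrame({'nutrient': ['Iron', 'Iron', 'Calcium', 'Zinc']})

print(demo['nutrient'].nunique())  # counts distinct nutrient names
```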
## 🧪 Exercise 2: Explore Other Formats

Imagine you receive data from collaborators:

- NDNS data in a TSV file.
- A JSON file with nutrient metadata.
How would you adapt the code above to import these files?
💡 You don't need the real files now; just sketch out the pandas command you'd use.
## ✅ Conclusion

In this notebook you:

- Imported CSV and Excel data.
- Learned about other formats (TSV, JSON, SQL, APIs).
- Practised verifying shapes, columns, and contents after import.
👉 Next up: 3.3 Data Cleaning, because imported data is rarely perfect, and cleaning is where the real fun begins!
Resources:

- Pandas I/O Documentation
- OpenPyXL Documentation
- SQLAlchemy
- Requests (Python HTTP for APIs)
- Repository: github.com/ggkuhnle/data-analysis-projects