📝 6.2 Text Analysis for Qualitative Research

We’ll prepare text for human-led coding (cleaning, tokenisation, light structure) and add small helper summaries (frequencies, n-grams) that support—not replace—interpretation.

This notebook keeps the qualitative lens front-and-centre while giving you just enough NLP to work efficiently.

🎯 Objectives

  • Clean and tokenise open-ended responses with NLTK.
  • Lemmatise, remove stopwords/punctuation, handle case.
  • Build n-grams (bigrams, trigrams) to surface phrases.
  • Optional: POS tags and (careful) sentiment as exploratory aids.
  • Export a tidy table ready for manual coding or 6.3.
import os
from google.colab import files

MODULE = '06_qualitative'
DATASET = 'food_preferences.txt'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join(MODULE_PATH, 'data', DATASET)

try:
    if not os.path.exists(BASE_PATH):
        print('Cloning repository...')
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    os.chdir(MODULE_PATH)
    if not os.path.exists(DATASET_PATH):
        raise FileNotFoundError('Dataset missing after clone.')
    print('Dataset ready ✅')
except Exception as e:
    print('Setup fallback: upload file...')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(os.path.join('data', DATASET), 'wb') as f:
            f.write(uploaded[DATASET])
        print('Uploaded dataset ✅')
    else:
        raise FileNotFoundError('Upload food_preferences.txt to continue.')
%pip install -q pandas nltk matplotlib seaborn scikit-learn wordcloud

import pandas as pd

import nltk
# Download NLTK resources; some names changed between versions, so both old and new ids are tried
for pkg in [
    "punkt", "punkt_tab",
    "stopwords", "wordnet", "omw-1.4",
    "averaged_perceptron_tagger_eng", "averaged_perceptron_tagger",
    "vader_lexicon"
]:
    try:
        nltk.download(pkg, quiet=True)
    except Exception:
        pass

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from collections import Counter 
from wordcloud import WordCloud 
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()
from pathlib import Path
txt = Path('data')/'food_preferences.txt'
responses = [r.strip() for r in txt.read_text(encoding='utf-8').splitlines() if r.strip()]
df = pd.DataFrame({'response_id': range(1, len(responses)+1), 'text': responses})
print('N responses:', len(df))
df.head(5)

print("Text/NLP environment ready.")

🧼 Preprocessing pipeline

We’ll lowercase, tokenise, remove stopwords/punctuation, and lemmatise (carrots→carrot). This supports coding by removing noise.

stop = set(stopwords.words('english')).union({'hippo', 'h1','h2','h3'})
lem = WordNetLemmatizer()

def clean_tokens(text: str):
    words = word_tokenize(text.lower())
    words = [w for w in words if w.isalpha()]  # drop punctuation/numbers
    words = [w for w in words if w not in stop]
    words = [lem.lemmatize(w) for w in words]
    return words

df['tokens'] = df['text'].apply(clean_tokens)
df[['response_id','text','tokens']].head(6)
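A quick sanity check after any cleaning step: responses left with no tokens at all (for instance, one-word answers swallowed by the stoplist) should still be read and coded from the raw text. A small sketch, assuming df['tokens'] from the cell above:

empty = df[df['tokens'].apply(len) == 0]   # responses with nothing left after cleaning
print(f'{len(empty)} responses have no tokens after cleaning')
if len(empty):
    display(empty[['response_id', 'text']])   # review these against the raw text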

📊 Frequencies & word cloud (orientation only)

all_tokens = [t for row in df['tokens'] for t in row]
freq = Counter(all_tokens).most_common(15)
pd.DataFrame(freq, columns=['word','count'])
wc = WordCloud(width=800, height=400, background_color='white').generate(' '.join(all_tokens))
plt.figure(figsize=(10,4)); plt.imshow(wc); plt.axis('off'); plt.title('Word Cloud'); plt.show()
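Word clouds are useful for a first glance but poor for comparison, so a bar chart of the same counts keeps the frequencies readable. A small sketch using the freq list from above:

freq_df = pd.DataFrame(freq, columns=['word', 'count'])
plt.figure(figsize=(8, 4))
sns.barplot(data=freq_df, x='count', y='word', color='steelblue')   # horizontal bars, most frequent first
plt.title('Top 15 words (after cleaning)')
plt.tight_layout(); plt.show()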

🔗 N-grams (bigrams & trigrams)

Short phrases can reveal food pairings (e.g., fresh fruit, crunchy carrot).

from nltk.util import ngrams

def ngram_counts(tokens_list, n=2, top=15):
    ng = Counter()
    for toks in tokens_list:
        ng.update(ngrams(toks, n))
    return pd.DataFrame(ng.most_common(top), columns=[f'{n}-gram','count'])

bigrams = ngram_counts(df['tokens'], n=2, top=15)
trigrams = ngram_counts(df['tokens'], n=3, top=10)
display(bigrams); display(trigrams)

🧪 Optional aids: POS tags & VADER sentiment

Use these sparingly; they are exploratory hints, not findings, and sentiment scores can be noisy on domain-specific language.

# Check whether a perceptron tagger is available (the resource name differs across NLTK versions)
ok = False
for res in ("taggers/averaged_perceptron_tagger_eng",
            "taggers/averaged_perceptron_tagger"):
    try:
        nltk.data.find(res)
        ok = True
        break
    except LookupError:
        pass

try:
    from nltk import pos_tag
    df['pos'] = df['tokens'].apply(pos_tag)  # expects list[str]
    out = df[['response_id','pos']].head(4)
    display(out)  # <-- force rendering
except Exception as e:
    print("POS tagging unavailable:", repr(e), "| tagger_found:", ok)
try:
    from nltk.sentiment import SentimentIntensityAnalyzer
    sia = SentimentIntensityAnalyzer()
    df['sentiment'] = df['text'].apply(lambda x: sia.polarity_scores(x)['compound'])
    sns.histplot(df['sentiment']); plt.title('Sentiment (VADER) — exploratory only'); plt.show()
except Exception as e:
    print('VADER unavailable:', e)
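If you do look at sentiment, tie the scores straight back to the quotes rather than reporting the distribution on its own; the extremes are usually where the score disagrees most with a human reading. A minimal sketch, assuming the sentiment column from the cell above was created:

if 'sentiment' in df.columns:
    extremes = pd.concat([df.nsmallest(3, 'sentiment'), df.nlargest(3, 'sentiment')])
    display(extremes[['response_id', 'text', 'sentiment']])   # read these alongside their scores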

📤 Export a coding-ready table

We create a simple structure that supports manual coding (e.g., in Excel/Sheets or in 6.3).

out = df[['response_id','text','tokens']].copy()
out['initial_code'] = ''  # analyst will fill codes
out['notes'] = ''         # memo/comments
out_path = 'qual_coding_sheet.csv'
out.to_csv(out_path, index=False)
print('Wrote:', out_path)
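If you are working in Colab, you can pull the sheet down to your machine straight away (files was imported in the setup cell); otherwise it stays in the notebook's working directory.

try:
    files.download(out_path)   # triggers a browser download in Colab
except Exception as e:
    print('Download skipped (not running in Colab?):', e)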

🧩 Exercises

  1. Stoplist tuning: add domain-specific stopwords (e.g., like, really, very)—how do the top words change? (See the sketch after this list.)
  2. Phrase mining: examine bigrams containing fruit or carrot; collect example quotes.
  3. Coding sheet: add 2–4 provisional initial codes per 10 responses (keep them short & action-oriented).
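The sketch below covers exercises 1 and 2; the extra stopwords and the keyword are illustrative, so adjust them to your data.

# Exercise 1: extend the stoplist and re-run the cleaning step
extra_stop = {'like', 'really', 'very'}              # illustrative additions
stop_tuned = stop.union(extra_stop)

def clean_tokens_tuned(text):
    words = word_tokenize(text.lower())
    return [lem.lemmatize(w) for w in words if w.isalpha() and w not in stop_tuned]

tokens_tuned = df['text'].apply(clean_tokens_tuned)
print(Counter(t for row in tokens_tuned for t in row).most_common(15))

# Exercise 2: bigrams that mention a keyword, plus example quotes to read in context
keyword = 'fruit'                                    # try 'carrot' as well
hits = [bg for row in df['tokens'] for bg in ngrams(row, 2) if keyword in bg]
print(Counter(hits).most_common(10))
display(df[df['text'].str.contains(keyword, case=False)][['response_id', 'text']].head(5))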

✅ Conclusion

You prepared text for analysis and produced a coding-ready table. Next: formal coding & thematic analysis with reliability checks (6.3).

More
  • NLTK docs (tokenisation, POS, stopwords)
  • Practical theming workflows