1 Provenance (v2) — fb2nep synthetic cohort
Generator: scripts/generate_dataset.py
Seed: 11088
Sample size: N≈25,000 adults
Baseline age: ≥40 years
Period: Baseline dates fall between 2010‑01‑01 and 2015‑12‑31; follow‑up length is drawn uniformly from 5–10 years.
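A minimal sketch of how baseline dates and follow-up lengths matching the period above could be drawn. It assumes baseline dates are uniform over the window and that a year is 365.25 days; neither detail is stated here, and the actual logic in scripts/generate_dataset.py may differ. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(11088)  # seed taken from the header; the real script may use it differently
n = 1_000                           # small illustrative size, not the cohort's N

start = np.datetime64("2010-01-01")
end = np.datetime64("2015-12-31")
span_days = int((end - start) / np.timedelta64(1, "D"))

# Baseline date: uniform over the inclusive window (assumed distribution).
baseline = start + rng.integers(0, span_days + 1, size=n).astype("timedelta64[D]")

# Follow-up length: uniform on 5-10 years, converted to whole days (365.25 d/year assumed).
followup_days = np.round(rng.uniform(5.0, 10.0, size=n) * 365.25).astype("int64")
end_of_followup = baseline + followup_days.astype("timedelta64[D]")
```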
1.1 Design intents
- Teach realistic cohort structure with two endpoints: CVD and Cancer.
- Provide both incident indicators and event dates.
- Encode SES (ABC1/C2DE) and IMD gradients.
- Add clinically relevant covariates: SBP, family history, menopausal status.
- Include selected non‑linear associations (BMI U‑shape, alcohol J‑shape, SBP quadratic in age, vitamin C saturation).
- Add red meat intake as a positive risk factor for Cancer above ~50 g/d.
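The non-linear shapes named above can be sketched as additive terms on a log-odds scale. All nadirs and coefficients below are hypothetical, chosen only to make the shapes visible; only the ~50 g/d red meat threshold comes from this document.

```python
import numpy as np

def bmi_u_shape(bmi, nadir=24.0, coef=0.004):
    """Quadratic penalty around a nadir: risk rises on both sides (hypothetical values)."""
    return coef * (np.asarray(bmi, dtype=float) - nadir) ** 2

def alcohol_j_shape(intake, dip=-0.03, curve=0.002):
    """J-shape: a small dip at low intake, then a steeper rise (hypothetical values)."""
    u = np.asarray(intake, dtype=float)
    return dip * u + curve * u ** 2

def red_meat_hinge(grams_per_day, threshold=50.0, coef=0.006):
    """No contribution below the threshold, linear increase above it."""
    g = np.asarray(grams_per_day, dtype=float)
    return coef * np.maximum(g - threshold, 0.0)

# Evaluate each term on a few illustrative exposure levels.
print(bmi_u_shape([18.0, 24.0, 35.0]))
print(alcohol_j_shape([0.0, 7.5, 30.0]))
print(red_meat_hinge([20.0, 50.0, 100.0]))
```

A hinge term keeps the red meat effect at exactly zero below the threshold, which makes the "above ~50 g/d" behaviour easy to demonstrate in teaching.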
1.2 Variable generation (summary)
- Age ~ Normal(μ=58, σ=10), truncated to 40–90 years. Sex: 52% F, 48% M.
- IMD_quintile: skewed towards the middle quintiles (2–4).
- SES_class depends on IMD.
- Menopausal status age‑patterned.
- Smoking ~15% current.
- PA: depends on IMD.
- BMI: Normal(27, 4.5), rises with age/smoking.
- Energy: log‑normal around 1900 kcal, scaled by PA/sex.
- Diet: FV, red meat, SSB, fibre, salt depend on IMD, SES, energy.
- Biomarkers: plasma vitC saturates with FV; urinary Na tracks salt.
- SBP: non‑linear in age plus salt, BMI, smoking, PA.
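A few of the generation steps summarised above, sketched under stated assumptions. The age and sex parameters come from this list; the FV distribution, the saturation constants, the SBP coefficients and the noise levels are hypothetical placeholders, and the SBP model omits salt, BMI, smoking and PA for brevity.

```python
import numpy as np

rng = np.random.default_rng(11088)
n = 5_000  # illustrative size

def truncated_normal(rng, mean, sd, low, high, size):
    """Rejection sampling: redraw any values that fall outside [low, high]."""
    out = rng.normal(mean, sd, size)
    bad = (out < low) | (out > high)
    while bad.any():
        out[bad] = rng.normal(mean, sd, bad.sum())
        bad = (out < low) | (out > high)
    return out

age = truncated_normal(rng, 58.0, 10.0, 40.0, 90.0, n)
sex = rng.choice(["F", "M"], size=n, p=[0.52, 0.48])

# Hypothetical FV intake (g/d) with a saturating plasma vitamin C response:
# the curve flattens as intake rises, so extra FV adds progressively less.
fv = rng.gamma(shape=4.0, scale=80.0, size=n)
vitc = 80.0 * fv / (200.0 + fv) + rng.normal(0.0, 6.0, n)

# Hypothetical SBP: quadratic in centred age plus noise (other drivers omitted).
a = age - 58.0
sbp = 128.0 + 0.8 * a + 0.01 * a ** 2 + rng.normal(0.0, 12.0, n)
```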
1.3 Outcomes and dates
- CVD: driven by age, BMI (U‑shape), smoking, sex, IMD, SES, salt, SBP, alcohol (J‑shape), and FHx CVD. Target incidence 10–15%.
- Cancer: driven by age, BMI, smoking, SES, sex, FHx cancer, and red meat >50 g/d. Target incidence 8–12%.
- Event dates: baseline date plus a simulated time‑to‑event, for incident cases only.
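One way to hit a target incidence is to fix the covariate effects and solve for the intercept of a logistic model by bisection. The sketch below assumes that approach, a two-covariate linear predictor with made-up coefficients, and event times uniform within follow-up; none of these are confirmed as what the generator does.

```python
import numpy as np

rng = np.random.default_rng(11088)
n = 20_000  # illustrative size

def calibrate_intercept(linear_predictor, target, lo=-20.0, hi=20.0, iters=60):
    """Bisection on the intercept so that mean predicted risk equals the target."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mean_risk = np.mean(1.0 / (1.0 + np.exp(-(mid + linear_predictor))))
        if mean_risk < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical covariates and coefficients, for illustration only.
age = rng.uniform(40.0, 90.0, n)
smoker = rng.random(n) < 0.15
lp = 0.05 * (age - 58.0) + 0.6 * smoker

b0 = calibrate_intercept(lp, target=0.12)  # 12% sits inside the 10-15% CVD target
risk = 1.0 / (1.0 + np.exp(-(b0 + lp)))
cvd_incident = rng.random(n) < risk

# Baseline dates and follow-up lengths (assumed uniform, as in the period line).
baseline = np.datetime64("2010-01-01") + rng.integers(0, 2191, size=n).astype("timedelta64[D]")
followup_days = rng.integers(1826, 3653, size=n)

# Event date: baseline + simulated time, filled only for incident cases; NaT otherwise.
event_offset = np.floor(rng.random(n) * followup_days).astype("int64") + 1
event_date = np.full(n, np.datetime64("NaT"), dtype="datetime64[D]")
event_date[cvd_incident] = (
    baseline[cvd_incident] + event_offset[cvd_incident].astype("timedelta64[D]")
)
```

Calibrating the intercept rather than hand-tuning it keeps the incidence on target when covariate effects are changed later.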
1.4 Missingness
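- MCAR: ~2–3% of values missing completely at random.
- MAR: a further 5–8% missing, with probability tied to IMD and age.
- MNAR (tiny): missingness depends on the value itself: alcohol intake → alcohol missing; high BMI → BMI missing.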
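The three mechanisms differ only in what the missingness probability is allowed to depend on, which the sketch below makes explicit. The variable `fibre`, the BMI cut-off of 35 and all probabilities are hypothetical choices that land inside the ranges listed above.

```python
import numpy as np

rng = np.random.default_rng(11088)
n = 20_000  # illustrative size

imd = rng.integers(1, 6, size=n)        # quintiles 1-5 (uniform here for simplicity)
age = rng.uniform(40.0, 90.0, size=n)
bmi = rng.normal(27.0, 4.5, size=n)
fibre = rng.normal(18.0, 5.0, size=n)   # hypothetical variable to blank out

# MCAR: every record has the same chance of being missing.
mcar = rng.random(n) < 0.025

# MAR: probability depends only on observed covariates (IMD, age).
p_mar = 0.02 + 0.01 * imd + 0.0005 * (age - 40.0)
mar = rng.random(n) < p_mar
fibre_obs = np.where(mcar | mar, np.nan, fibre)

# MNAR: probability depends on the value that goes missing (high BMI here).
p_mnar = np.where(bmi > 35.0, 0.05, 0.0)
bmi_obs = np.where(rng.random(n) < p_mnar, np.nan, bmi)

print(f"MCAR {mcar.mean():.3f}, MAR {mar.mean():.3f}, MNAR {np.isnan(bmi_obs).mean():.4f}")
```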
1.5 Validation targets
- Corr(FV, vitC) > 0.45.
- Corr(salt, urinary Na) > 0.55.
- SBP vs age Spearman ρ > 0.35.
- SES/IMD diet gradients as expected.
- CVD incidence 10–15%; Cancer 8–12%.
- Red meat intake higher in cancer cases than in non‑cases.
- Event dates present only if incident=1.
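The quantitative targets above can be checked mechanically. The sketch below assumes the data is available as a dict of equal-length NumPy arrays with hypothetical key names (`fv`, `vitc`, `cvd_date`, and so on), compares mean red meat intake between cases and non-cases, and skips the SES/IMD gradient check because no threshold is given for it. The toy data exists only to exercise the function.

```python
import numpy as np

def spearman(x, y):
    """Spearman rho as the Pearson correlation of ranks (assumes no ties)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

def check_targets(d):
    """Return {check name: passed} for a dict of arrays with hypothetical keys."""
    cvd, cancer = d["cvd_incident"], d["cancer_incident"]
    return {
        "fv_vitc": bool(np.corrcoef(d["fv"], d["vitc"])[0, 1] > 0.45),
        "salt_na": bool(np.corrcoef(d["salt"], d["urinary_na"])[0, 1] > 0.55),
        "sbp_age": bool(spearman(d["age"], d["sbp"]) > 0.35),
        "cvd_incidence": bool(0.10 <= cvd.mean() <= 0.15),
        "cancer_incidence": bool(0.08 <= cancer.mean() <= 0.12),
        "red_meat_cases": bool(d["red_meat"][cancer].mean() > d["red_meat"][~cancer].mean()),
        "dates_only_if_incident": bool(np.array_equal(~np.isnat(d["cvd_date"]), cvd)),
    }

# Toy data with made-up parameters, built so that every check should pass.
rng = np.random.default_rng(0)
n = 20_000
fv = rng.gamma(4.0, 80.0, n)
salt = rng.normal(8.0, 2.0, n)
age = rng.uniform(40.0, 90.0, n)
red_meat = rng.gamma(2.0, 30.0, n)
cvd = rng.random(n) < 0.12
cvd_date = np.full(n, np.datetime64("NaT"), dtype="datetime64[D]")
cvd_date[cvd] = np.datetime64("2016-01-01")
toy = {
    "fv": fv,
    "vitc": 80.0 * fv / (200.0 + fv) + rng.normal(0.0, 6.0, n),
    "salt": salt,
    "urinary_na": 17.0 * salt + rng.normal(0.0, 25.0, n),
    "age": age,
    "sbp": 128.0 + 0.8 * (age - 58.0) + rng.normal(0.0, 12.0, n),
    "red_meat": red_meat,
    "cvd_incident": cvd,
    "cancer_incident": rng.random(n) < 0.08 + 0.001 * np.maximum(red_meat - 50.0, 0.0),
    "cvd_date": cvd_date,
}
results = check_targets(toy)
for name, ok in results.items():
    print(f"{name}: {'ok' if ok else 'FAIL'}")
```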
1.6 Notes
This is teaching data: tuned for pedagogy, not inference.