Fabian Schreuder · Data Science Projects
Uncovering Dietary Patterns to Predict Colorectal Cancer Recurrence: A Data-Driven Approach
A case study on using Latent Profile Analysis and Cox Proportional Hazards modeling to identify pre-diagnosis dietary patterns associated with colorectal cancer recurrence. We navigated complex, high-dimensional health data to deliver clinically relevant insights.

Colorectal Cancer (CRC) is the third most common cancer worldwide, and understanding the drivers of its recurrence is critical for improving patient outcomes. This need presents a classic data science challenge: can we find a signal in noisy, high-dimensional data that identifies habitual diets associated with cancer recurrence?
In this project, my team and I developed a multi-stage analysis pipeline to tackle this question, moving from complex raw data to a validated predictive model with actionable insights.
The Challenge: Navigating Complex Health Data
Our data came from the COLON study, a prospective cohort study that tracks CRC patients over time.
This rich dataset came with three major hurdles:
- Significant Missingness: Overall, 42.7% of the values in the dataset were missing.
- High Dimensionality: With 237 separate food intake variables, building a model directly was impractical. This “curse of dimensionality” can obscure meaningful patterns and lead to overfitting.
- Measurement Error: Dietary intake was self-reported through Food Frequency Questionnaires (FFQs), which rely on patient memory and are prone to recall and social desirability biases, introducing noise into the exposure measurements.
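To make the missingness figure concrete, here is a minimal sketch of how an overall and a per-variable missingness rate can be computed. It is written in plain Python for illustration and is not the project's R code; the column names and values are hypothetical.

```python
def missing_fraction(rows):
    """Share of all cells (rows x columns) that are missing (None)."""
    total_cells = sum(len(row) for row in rows)
    missing_cells = sum(1 for row in rows for value in row.values() if value is None)
    return missing_cells / total_cells


def missing_by_column(rows):
    """Share of missing values per column, assuming all rows share the same keys."""
    columns = list(rows[0].keys())
    return {col: sum(row[col] is None for row in rows) / len(rows) for col in columns}


# Hypothetical toy data: intake in grams per day, None marks a missing answer.
rows = [
    {"apple_g": 120.0, "salmon_g": None, "yogurt_g": 150.0},
    {"apple_g": None, "salmon_g": None, "yogurt_g": 200.0},
    {"apple_g": 80.0, "salmon_g": 30.0, "yogurt_g": None},
    {"apple_g": 60.0, "salmon_g": 45.0, "yogurt_g": 100.0},
]

overall = missing_fraction(rows)      # 4 of 12 cells are missing
by_column = missing_by_column(rows)   # salmon_g is missing in half of the rows
```

Looking at the per-column rates alongside the overall rate helps show whether missingness is spread evenly or concentrated in a few variables.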
Our Approach: A Multi-Stage Analysis Pipeline
To overcome these challenges, we implemented a robust data preparation and modeling strategy using R and the tidyverse, mice, and mclust packages.
1. Intelligent Data Imputation
We addressed the missing data using Multiple Imputation by Chained Equations (MICE).
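The core idea of chained equations can be sketched in a few lines. The example below is a deliberately simplified illustration in plain Python, not the mice implementation used in the project: it handles only two variables, uses simple linear regression, and adds residual noise without drawing the regression parameters from their posterior. All data and function names are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual standard deviation)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    return a, b, statistics.pstdev(residuals)


def chained_imputation(rows, n_iter=10, seed=0):
    """Return one completed copy of two-column rows, where None marks a missing value."""
    rng = random.Random(seed)
    missing = [[value is None for value in row] for row in rows]
    filled = [list(row) for row in rows]

    # Step 1: start from column means as crude placeholders.
    for col in range(2):
        col_mean = statistics.fmean(r[col] for r in rows if r[col] is not None)
        for row in filled:
            if row[col] is None:
                row[col] = col_mean

    # Step 2: cycle over the variables, re-imputing each from the other one.
    for _ in range(n_iter):
        for target in range(2):
            predictor = 1 - target
            # Fit only on rows where the target was actually observed.
            train = [(r[predictor], r[target])
                     for r, m in zip(filled, missing) if not m[target]]
            a, b, sd = fit_line([p for p, _ in train], [t for _, t in train])
            for row, m in zip(filled, missing):
                if m[target]:
                    # Prediction plus noise, so imputations carry uncertainty.
                    row[target] = a + b * row[predictor] + rng.gauss(0.0, sd)
    return filled


# Hypothetical data: two correlated intake variables with gaps.
rows = [[1.0, 2.1], [2.0, None], [3.0, 6.2], [None, 8.1], [5.0, 9.9], [6.0, 12.2]]

# "Multiple" imputation: several completed datasets from different random seeds.
imputations = [chained_imputation(rows, seed=s) for s in range(5)]
```

In a full multiple-imputation workflow, the analysis model is fitted on each completed dataset and the estimates are pooled afterwards.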
2. Dimensionality Reduction & Feature Engineering
To make the 237 food variables manageable, we aggregated them into eight clinically relevant food categories: fruits, vegetables, meat, seafood, whole grains, high-fat dairy, low-fat dairy, and ultra-processed products.
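The aggregation step amounts to a lookup from individual food items to one of the eight groups, followed by a sum. The sketch below shows that idea in plain Python rather than the project's R code; the item names and the item-to-group mapping are hypothetical and are not the mapping used in the study.

```python
FOOD_GROUPS = [
    "fruits", "vegetables", "meat", "seafood",
    "whole grains", "high-fat dairy", "low-fat dairy", "ultra-processed products",
]

# Hypothetical mapping from questionnaire items to food groups.
ITEM_TO_GROUP = {
    "apple": "fruits",
    "banana": "fruits",
    "broccoli": "vegetables",
    "beef": "meat",
    "salmon": "seafood",
    "whole_wheat_bread": "whole grains",
    "full_fat_cheese": "high-fat dairy",
    "skimmed_milk": "low-fat dairy",
    "crisps": "ultra-processed products",
}


def aggregate_intake(item_grams, item_to_group=ITEM_TO_GROUP):
    """Sum item-level intake (grams per day) into food-group totals."""
    totals = {group: 0.0 for group in FOOD_GROUPS}
    for item, grams in item_grams.items():
        group = item_to_group.get(item)
        if group is None:
            raise KeyError(f"No food group defined for item: {item}")
        totals[group] += grams
    return totals


# Hypothetical participant.
participant = {"apple": 100.0, "banana": 50.0, "salmon": 30.0, "crisps": 25.0}
group_totals = aggregate_intake(participant)
# group_totals["fruits"] is 150.0; groups with no items stay at 0.0
```

Raising an error for unmapped items is one possible design choice: it prevents variables from silently dropping out of the totals.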
3. Uncovering Hidden Patterns with Latent Profile Analysis (LPA)
With our engineered food groups, the next step was to identify distinct dietary patterns. We used Latent Profile Analysis (LPA), a type of Gaussian Mixture Model, to perform data-driven clustering.
This analysis successfully identified five distinct dietary clusters, each with a unique consumption profile. For example, Cluster 3 was characterized by high intake of seafood and ultra-processed foods, while another cluster showed high fruit consumption. These clusters became the primary exposure variable for our predictive model.
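To illustrate the mechanics behind a Gaussian Mixture Model, the sketch below fits a two-component mixture to a single variable with the EM algorithm. It is a toy example in plain Python under strong simplifications: the project fitted a multivariate model on eight food groups in R with mclust, whereas this uses one hypothetical intake variable, made-up data, and starting means chosen by hand.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(values, init_means, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture by EM; returns (weights, means, variances, labels)."""
    k = len(init_means)
    means = list(init_means)
    overall_mean = sum(values) / len(values)
    overall_var = sum((v - overall_mean) ** 2 for v in values) / len(values)
    variances = [max(overall_var, min_var)] * k
    weights = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])

        # M-step: update weights, means and variances from those probabilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var_j, min_var)

    # Assign each observation to its most probable component (its "profile").
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels


# Hypothetical fruit intake (grams per day) with a low and a high group.
fruit_g = [48.0, 50.0, 52.0, 55.0, 148.0, 150.0, 152.0, 155.0]
weights, means, variances, labels = fit_gmm_1d(fruit_g, init_means=[48.0, 155.0])
```

In practice, the number of profiles is usually chosen by comparing models with different numbers of components on a fit criterion such as BIC, rather than being fixed in advance as it is here.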
Modeling for Impact: Predicting Recurrence with Survival Analysis
To assess whether these dietary patterns were associated with cancer recurrence, we used Cox Proportional Hazards modeling, a standard method for analyzing time-to-event data.
We built an adjusted model that controlled for key confounders identified from our Directed Acyclic Graph (DAG), including age, education, physical activity, and tumor stage.
The results were revealing:
- As expected, tumor stage was the strongest predictor of recurrence, with a hazard ratio of approximately 3.1. This confirmed our model was capturing a critical clinical reality.
- While no dietary cluster reached statistical significance, Cluster 3 (high in ultra-processed foods and meat) showed a trend toward increased risk.
- Conversely, Cluster 5 was associated with a potentially reduced risk (Adjusted HR = 0.85).
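For readers less familiar with hazard ratios: in a Cox model, the hazard ratio for a covariate is the exponential of its coefficient, and it describes the relative change in the instantaneous risk of the event compared with the reference group. The small Python sketch below converts the two hazard ratios quoted above into a percentage change in hazard and back into a log-scale coefficient. The helper names are hypothetical and nothing here is refitted from the study data.

```python
import math


def hr_to_percent_change(hazard_ratio):
    """Relative change in hazard versus the reference group, in percent."""
    return (hazard_ratio - 1.0) * 100.0


def hr_to_coefficient(hazard_ratio):
    """Cox model coefficient (log hazard ratio) corresponding to a hazard ratio."""
    return math.log(hazard_ratio)


stage_change = hr_to_percent_change(3.1)       # about +210%: roughly triple the hazard
cluster5_change = hr_to_percent_change(0.85)   # about -15%: a lower hazard
cluster5_beta = hr_to_coefficient(0.85)        # negative coefficient on the log scale
```

A hazard ratio below 1 corresponds to a negative coefficient and a lower hazard than the reference group, and a ratio above 1 to a higher hazard. Whether such a difference is distinguishable from chance depends on its confidence interval, which is why a value such as 0.85 can still be non-significant.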
Validation and Key Takeaways
A model is only as good as its ability to generalize. We validated our final adjusted Cox model using 5-fold cross-validation, achieving a Concordance Index (C-index) of 0.765.
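The C-index measures how often the model ranks two patients correctly: among all comparable pairs, it is the fraction in which the patient with the earlier event also has the higher predicted risk. The sketch below is a minimal Harrell-style implementation in plain Python with made-up data. It ignores tied event times for simplicity and is not the routine used in the project; in cross-validation, a function like this would be applied to each held-out fold.

```python
def concordance_index(times, events, risk_scores):
    """Harrell-style C-index; events[i] is 1 for recurrence and 0 for censoring."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i had the event strictly before j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable


# Hypothetical follow-up times (months), event indicators and predicted risks.
times = [5, 8, 12, 20, 25]
events = [1, 0, 1, 1, 0]
risk_scores = [2.0, 1.5, 1.8, 0.4, 0.9]

c_index = concordance_index(times, events, risk_scores)  # 6 of 7 comparable pairs
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.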
Although the dietary pattern associations were not statistically significant, this project provides a robust framework and several key insights:
- Clinical Context is King: The analysis reinforces that major clinical factors like tumor stage are the primary drivers of prognosis.
- Diet as a Complementary Factor: The trends, particularly for the high-risk Cluster 3, suggest that pre-diagnosis diet may play a complementary role. This aligns with other studies that have linked processed meat consumption to poorer CRC outcomes.
- A Path for Future Research: We identified a key limitation: our clusters were based on absolute intake (grams) rather than relative intake (e.g., % of total calories).
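The difference between absolute and relative intake can be shown with a short sketch. The Python example below converts grams per food group into a percentage of total energy, using hypothetical energy densities (kcal per gram) and hypothetical participants; it illustrates the idea raised in the limitation and is not an analysis performed in the project.

```python
# Hypothetical energy densities in kcal per gram, for illustration only.
KCAL_PER_GRAM = {"vegetables": 0.5, "meat": 2.0}


def energy_shares(grams_by_group, kcal_per_gram=KCAL_PER_GRAM):
    """Convert absolute intake (grams) into each group's share of total energy (%)."""
    energy = {group: grams * kcal_per_gram[group]
              for group, grams in grams_by_group.items()}
    total = sum(energy.values())
    return {group: 100.0 * kcal / total for group, kcal in energy.items()}


# Two hypothetical participants who eat the same absolute amount of vegetables.
small_eater = {"vegetables": 200.0, "meat": 100.0}
large_eater = {"vegetables": 200.0, "meat": 400.0}

shares_small = energy_shares(small_eater)   # vegetables: one third of energy
shares_large = energy_shares(large_eater)   # vegetables: one ninth of energy
```

Both participants would look identical on absolute vegetable intake, yet vegetables make up a much smaller part of the second participant's diet. Clustering on relative intake would separate such cases.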
This project demonstrates a complete, end-to-end data science workflow, from wrangling messy, real-world data to building and validating a survival model that offers clinically relevant insights.