This repository provides a comprehensive pipeline for survival analysis using the METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) dataset. Breast cancer is the most prevalent cancer in women globally, and survival analysis is crucial for understanding patient prognosis, evaluating treatment efficacy, and identifying key risk factors.
The primary goals of this analysis are:
- Prognostic Modeling: Develop robust models to predict overall survival (OS) and relapse-free survival (RFS) in breast cancer patients.
- Risk Stratification: Identify distinct patient subgroups based on their survival patterns to enable personalized risk assessment.
- Feature Importance: Determine which clinical and molecular features most significantly influence survival outcomes.
- Treatment Insights: Explore associations between different treatment modalities (chemotherapy, hormone therapy, radiotherapy) and patient survival.
The insights from this project can help clinicians make more informed treatment decisions, support patient counseling, and provide a basis for future clinical trial design.
- BC_SURVIVAL.ipynb: The main Jupyter notebook containing the complete end-to-end survival analysis pipeline, from data preprocessing to model evaluation.
- data/: Directory containing the dataset.
  - Breast Cancer METABRIC.csv: The raw dataset used for the analysis.
- environment.yml: Conda environment file listing all necessary dependencies to ensure reproducibility.
- README.md: This file, providing an overview of the project.
To ensure a reproducible environment, all dependencies are specified in the environment.yml file.
- Create the Conda Environment: Open a terminal and run the following command to create the bc-survival conda environment:

      conda env create --name bc-survival --file environment.yml

- Activate the Environment: Once the environment is created, activate it using:

      conda activate bc-survival

- Launch Jupyter: With the environment activated, launch Jupyter Notebook or JupyterLab to run the analysis:

      jupyter notebook
- Ensure the installation and environment setup steps above are complete.
- Open the BC_SURVIVAL.ipynb notebook and run it from top to bottom. The notebook is structured to execute the entire pipeline.
The METABRIC dataset contains clinical and molecular data for 2,509 breast cancer patients. After cleaning and removing subjects with missing survival outcomes, the final cohort for analysis consists of 1,980 patients.
- Target Variables:
  - Overall Survival (OS): The primary outcome, defined by duration_os (time in months from diagnosis to death or last follow-up) and event_os (1 if deceased, 0 if censored/living).
  - Relapse-Free Survival (RFS): A secondary outcome, defined by duration_rfs (time in months to recurrence or last follow-up) and event_rfs (1 if recurred, 0 if not).
- Features: The dataset includes a set of clinical and molecular features, such as:
  - Patient Demographics: Age at diagnosis, menopausal state.
  - Tumor Characteristics: Tumor size, stage, histologic grade, cellularity, and molecular subtypes (PAM50, 3-Gene classifier).
  - Biomarkers: Estrogen Receptor (ER), Progesterone Receptor (PR), and HER2 status.
  - Treatment History: Information on whether the patient received chemotherapy, hormone therapy, or radiotherapy.
Model performance is rigorously assessed using metrics that capture discrimination, calibration, and clinical utility.
- Survival (Concordance Index (C-index)): Measures how well predicted risk ranks patients by actual survival time, i.e. the fraction of comparable patient pairs that the model orders correctly (used for the Cox model).
- Fixed-Horizon (AUROC (Area Under the ROC Curve)): Measures model discrimination at a fixed time horizon (60 months).
- Fixed-Horizon (AUPRC (Area Under the Precision-Recall Curve)): Useful for imbalanced datasets; often more informative than AUROC when the event rate (prevalence) is low.
- Calibration (Brier Score Loss): Measures the mean squared difference between predicted risk and the observed binary outcome; lower is better.
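To make the definitions above concrete, here is a minimal, dependency-free sketch of Harrell's C-index and the Brier score. It is not the notebook's implementation: the function names and the toy values are illustrative assumptions, and tied event times are ignored for brevity. AUROC and AUPRC are typically computed with scikit-learn's `roc_auc_score` and `average_precision_score` and are not re-implemented here.

```python
def concordance_index(durations, events, risk):
    """Harrell's C: share of comparable pairs in which the higher-risk patient fails first."""
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        if not events[i]:
            continue  # a pair is only comparable if the earlier time is an observed event
        for j in range(n):
            if durations[j] > durations[i]:  # patient j outlived patient i
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied risk scores count as half
    return concordant / comparable


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


# Toy values, made up for illustration only.
durations = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.6, 0.1]
c_index = concordance_index(durations, events, risk)

brier = brier_score([1, 0, 1, 0], [0.8, 0.2, 0.6, 0.4])
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking; for the Brier score, 0 is perfect.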
The models (Cox proportional hazards, CPH; Decision Tree, DT; Random Forest, RF) are trained to predict risk, and their raw risk scores are then passed through Isotonic Regression for calibration. The resulting calibrated risks (_cal artifacts) are used for the final metric calculations and are visualized with a Calibration Curve (60-month horizon) to check that the predicted probabilities match the observed event rates.
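The idea behind isotonic calibration can be sketched with the pool-adjacent-violators algorithm in plain Python. This is a simplified stand-in for a library implementation such as scikit-learn's `IsotonicRegression`, not the notebook's code: the function name and toy values are assumptions, tied scores are not averaged, and the sketch only returns fitted values for the data it was fitted on (in practice the mapping would be fitted on one split and applied to another).

```python
def isotonic_fit(scores, outcomes):
    """Map raw scores to non-decreasing event rates via pool-adjacent-violators."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block holds [sum of outcomes, number of points]
    for i in order:
        blocks.append([float(outcomes[i]), 1])
        # Merge backwards while the previous block's mean exceeds the newest block's mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    # Put the fitted values back in the original patient order.
    calibrated = [0.0] * len(scores)
    for rank, i in enumerate(order):
        calibrated[i] = fitted[rank]
    return calibrated


# Toy values, made up for illustration only.
raw_scores = [0.2, 1.5, 0.9, 3.1, 2.4]   # uncalibrated risk scores
observed = [0, 0, 1, 1, 1]               # event by the horizon (1) or not (0)
calibrated = isotonic_fit(raw_scores, observed)
```

The output is a probability for each patient that never decreases as the raw score increases, which is the property a calibration curve then checks against observed event rates.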
The notebook implements a multi-stage survival analysis workflow.
- Data Cleaning: Redundant columns are removed, and patients with missing survival outcomes are excluded.
- Leakage Prevention: A strict separation is maintained between features (X) and outcomes (y). Outcome-related columns are removed from the feature set before any modeling or preprocessing to prevent data leakage.
- Data Splitting: The data is split into stratified training (60%), validation (20%), and test (20%) sets. Stratification is performed on the event status to ensure a balanced distribution of outcomes across all sets.
- Preprocessing Pipeline: A scikit-learn pipeline is constructed to handle missing values (imputation), encode categorical features (one-hot encoding), and scale numeric features (standardization). This pipeline is fitted only on the training data and then applied to the validation and test sets to prevent leakage.
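A minimal sketch of the 60/20/20 stratified split and a leakage-safe preprocessing pipeline, assuming scikit-learn, pandas, and NumPy are installed. The column names (`age`, `tumor_size`, `er_status`), the synthetic data, the imputation strategies, and the random seeds are illustrative assumptions, not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the feature matrix (hypothetical columns).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.normal(60, 10, n),
    "tumor_size": rng.normal(25, 8, n),
    "er_status": rng.choice(["Positive", "Negative"], n),
})
X.loc[::17, "tumor_size"] = np.nan      # inject some missing values
event = rng.integers(0, 2, n)           # 1 = event, 0 = censored

# 60% train, then split the remaining 40% evenly into validation and test.
X_train, X_tmp, e_train, e_tmp = train_test_split(
    X, event, test_size=0.4, stratify=event, random_state=42
)
X_val, X_test, e_val, e_test = train_test_split(
    X_tmp, e_tmp, test_size=0.5, stratify=e_tmp, random_state=42
)

numeric_cols = ["age", "tumor_size"]
categorical_cols = ["er_status"]
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Fit on the training split only, then reuse the fitted transformer.
Xt_train = preprocess.fit_transform(X_train)
Xt_val = preprocess.transform(X_val)
Xt_test = preprocess.transform(X_test)
```

Because `fit_transform` is called only on the training split, the imputation values, category levels, and scaling statistics never see validation or test rows.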
- Feature Distributions: Univariate analysis of numeric (histograms, box plots) and categorical (bar charts) features to understand their distributions and identify skewness or imbalances.
- Bivariate Analysis: The relationship between each feature and the survival outcome is explored. Statistical tests (Mann-Whitney U for numeric, Chi-square for categorical) are used to identify features that differ significantly between patients who experienced an event versus those who were censored.
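The two bivariate tests named above can be run with SciPy as in the sketch below. The ages and the contingency counts are made-up illustrative values, not results from METABRIC, and the variable names are assumptions.

```python
from scipy.stats import chi2_contingency, mannwhitneyu

# Numeric feature: compare its distribution between event and censored patients.
age_event = [71, 68, 75, 80, 66, 73]       # hypothetical ages, event group
age_censored = [52, 60, 58, 49, 63, 55]    # hypothetical ages, censored group
u_stat, p_numeric = mannwhitneyu(age_event, age_censored, alternative="two-sided")

# Categorical feature: rows are feature levels, columns are event / censored counts.
table = [[30, 10],
         [20, 40]]
chi2, p_categorical, dof, expected = chi2_contingency(table)
```

Both tests ignore follow-up time, so they are screening tools; a log-rank test or a Cox model is the usual follow-up when time-to-event matters.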
- Overall Survival (OS) Curve: A KM curve is generated for the entire cohort to estimate the overall survival probability over time.
- Relapse-Free Survival (RFS) Curve: A similar analysis is performed for RFS to estimate the probability of remaining recurrence-free.
- Median Survival & RMST: For both OS and RFS, the median survival time and Restricted Mean Survival Time (RMST) are calculated. RMST is the area under the survival curve up to a chosen time point, so it summarizes average survival over that window and remains defined even when the median is not reached.
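The Kaplan-Meier estimate, median survival, and RMST described above can be sketched without any survival library. This is an illustration of the calculations under assumed function names and toy data, not the notebook's code, which may rely on a dedicated package.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each distinct event time."""
    event_times = sorted({t for t, e in zip(durations, events) if e})
    curve, surv = [], 1.0
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve


def median_survival(curve):
    """First time at which the survival estimate drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached during follow-up


def rmst(curve, tau):
    """Area under the Kaplan-Meier step function from 0 to tau."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in curve:
        if t >= tau:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)


# Toy follow-up in months, made up for illustration only.
durations = [6, 12, 12, 24, 36, 48]
events = [1, 1, 0, 1, 0, 1]
km_curve = kaplan_meier(durations, events)
median_months = median_survival(km_curve)
rmst_36 = rmst(km_curve, tau=36)
```

In this toy cohort the median is 24 months, and the RMST at 36 months reads as the average number of months survived within the first three years.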
This section builds interpretable models to understand how different clinical factors contribute to patient risk over time.
Practical Use:
- Use the Hazard Ratios from the Cox model to explain the magnitude and direction of risk associated with factors like tumor size or lymph node involvement.
- The model can inform clinical pathway design and feature prioritization for future research.
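As a small worked example of reading Cox output: a hazard ratio is the exponential of the fitted coefficient, and an approximate 95% confidence interval follows from the coefficient's standard error. The coefficient and standard error below are hypothetical numbers chosen for illustration, not results from this analysis.

```python
import math


def hazard_ratio(coef, se, z=1.96):
    """Convert a Cox log-hazard coefficient into (HR, lower, upper) for a ~95% CI."""
    return math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se)


# Hypothetical coefficient for a binary risk factor (illustrative values only).
hr, ci_low, ci_high = hazard_ratio(coef=0.405, se=0.10)
```

A hazard ratio above 1 indicates higher risk for patients with the factor, below 1 indicates lower risk, and an interval that excludes 1 suggests the direction is statistically distinguishable at that level.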
This phase focuses on building high-performance machine learning models to predict individual patient outcomes at fixed, clinically relevant time points (for instance, 5-year survival).
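One common way to turn censored survival data into a fixed-horizon classification target is sketched below. How the notebook handles patients censored before the horizon is not stated here, so the exclusion rule in this sketch is an assumption (inverse-probability-of-censoring weighting is another option), and the function name and toy values are hypothetical.

```python
def horizon_labels(durations, events, horizon=60):
    """Binary labels for 'event within `horizon` months'; None when status is unknown."""
    labels = []
    for t, e in zip(durations, events):
        if e and t <= horizon:
            labels.append(1)       # event observed within the horizon
        elif t >= horizon:
            labels.append(0)       # followed to the horizon without an event before it
        else:
            labels.append(None)    # censored before the horizon: outcome unknown
    return labels


# Toy follow-up in months, made up for illustration only.
durations = [30, 80, 45, 60, 100]
events = [1, 1, 0, 0, 0]
labels_60 = horizon_labels(durations, events, horizon=60)
```

Rows labeled `None` would be dropped (or reweighted) before computing AUROC, AUPRC, or Brier scores at that horizon.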
Practical Use:
- Use Random Forest models when predictive accuracy is the main priority, for example when building risk stratification tools.
- Use Decision Trees when simple, transparent if-then rules are preferred for clinical decision support.
- Decision curve analysis helps select risk thresholds that align with clinical priorities, balancing the trade-off between true positives and false positives.
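The quantity plotted in a decision curve is net benefit, which weighs true positives against false positives at a chosen risk threshold. The sketch below uses the standard formula; the function names and the toy outcomes and risks are illustrative assumptions.

```python
def net_benefit(y_true, risk, threshold):
    """Net benefit of treating patients whose predicted risk is at or above `threshold`."""
    n = len(y_true)
    tp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 1)
    fp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 0)
    # False positives are weighted by the odds implied by the threshold.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: treat every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Toy outcomes and calibrated risks, made up for illustration only.
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
risk = [0.9, 0.7, 0.6, 0.2, 0.4, 0.1, 0.3, 0.05]
nb_model = net_benefit(y_true, risk, threshold=0.5)
nb_all = net_benefit_treat_all(y_true, threshold=0.5)
```

A model is useful at a given threshold when its net benefit exceeds both the treat-all reference and zero, which is the net benefit of treating no one.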
This phase consists of post-processing model outputs to ensure reliability and translating statistical risks into actionable clinical decisions.
Practical Use:
- The calibrated models and their Brier scores provide the most trustworthy risk estimates for patient communication.
- Decision curve analysis (DCA) allows clinicians to select a risk threshold that balances the harm of unnecessary treatment (false positives) against the benefit of early intervention (true positives). This ensures the model's output directly supports a utility-driven clinical policy.
- KM plots with survival estimates and subgroup comparisons.
- Cox model summary table with hazard ratios, p-values, and concordance index.
- Calibration plots and Brier scores for all models.
- Performance tables for all models (Cox, Decision Tree, Random Forest), summarizing calibrated AUROC, AUPRC, and Brier scores across time horizons (60, 120, and 180 months).
- Decision curve analysis plots to assess clinical utility.
- Grouped Bar Charts comparing model performance (AUROC and AUPRC) across all horizons.
- External Validation: Swap in a different breast cancer cohort to validate the models' generalizability.
- Feature Engineering: Incorporate genomic data (gene expression, etc.) or treatment information to help improve predictive performance.
- Advanced Models: Experiment with other survival models like Gradient Boosted Trees (XGBoost) or deep learning approaches.
- Ensure scikit-survival is installed from conda-forge to avoid compilation issues.
- If running into memory issues with large datasets, use pandas.read_csv with the usecols parameter to load only necessary columns.
- These models are trained on historical data and are intended for research and decision support, not as a substitute for clinical judgment.
- Model performance may vary on different populations. Local validation and calibration are essential before any clinical application.
- It is crucial to review model performance across relevant demographic and clinical subgroups to ensure fairness and equity.
- The code in this repository is available under the MIT License.
- Please cite the original source of the dataset used in the analysis.