A comprehensive churn prediction analysis comparing XGBoost and Random Forest models using temporal train/test splits and Optuna hyperparameter optimization.
This project analyzes customer churn in online sports betting using transactional data from March 1, 2019 to February 29, 2020. Two tree-ensemble models, XGBoost and Random Forest, are trained on the same engineered features and compared on ROC-AUC, PR-AUC, MCC, Youden's J, and Balanced Accuracy.
Data Source: doi:10.17632/9j5gcygnwg.1
- Non-overlapping temporal splits to prevent data leakage
- Optuna hyperparameter tuning for both XGBoost and Random Forest
- 18 engineered features with correlation validation (no multicollinearity)
- Comprehensive evaluation: ROC-AUC, PR-AUC, MCC, Youden's J, Balanced Accuracy
- Feature importance analysis from both models
xgboost_churn/
├── README.md                                  # This file
├── requirements.txt                           # Python dependencies
│
├── notebooks/
│   ├── 01_descriptive_statistics_rfm.ipynb    # EDA & Feature Engineering
│   ├── 02_xgboost_churn_model.ipynb           # XGBoost with Optuna
│   └── 03_random_forest_comparison.ipynb      # Random Forest with Optuna
│
├── data/
│   ├── raw/
│   │   └── Online_sports_DIB.csv              # Raw sports betting transactions
│   └── processed/
│       ├── rfm_features_sports_with_churn.csv
│       └── model_comparison_data.pkl
│
└── reports/
    ├── model_performance_report.txt
    ├── model_performance_report_rf.txt
    └── figures/ (14 generated PNG visualizations)
pip install -r requirements.txt
# Step 1: Exploratory Data Analysis & Feature Engineering (~2-3 min)
jupyter notebook notebooks/01_descriptive_statistics_rfm.ipynb
# Step 2: XGBoost Model Training with Optuna (~5-10 min, 50 trials)
jupyter notebook notebooks/02_xgboost_churn_model.ipynb
# Step 3: Random Forest Comparison (~5-10 min, 50 trials)
jupyter notebook notebooks/03_random_forest_comparison.ipynb

| Category | Features | Description |
|---|---|---|
| Recency (1) | recency_days | Days since last transaction |
| Frequency (2) | total_transactions, frequency_per_day | Transaction count & rate |
| Monetary (4) | total_monetary_value, avg_transaction_amount, std_transaction_amount, amount_volatility | Spending metrics |
| Behavior (5) | net_flow, net_loss_ratio, deposit_ratio, avg_deposit_amount, avg_withdrawal_amount | Deposit/withdrawal patterns |
| Temporal (3) | avg_inter_play_hours, cv_inter_play_hours, max_inter_play_hours | Time between transactions |
| Trends (2) | recent_30d_amount_ratio, recency_to_lifespan_ratio | Recent activity decline |
| Loyalty (1) | lifespan_days | Customer tenure |
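
As a rough illustration of how a few of the features in the table could be derived for one customer, here is a minimal sketch. The function name `customer_features` and every formula in it (for example, `frequency_per_day` as transaction count divided by lifespan) are assumptions for illustration; the notebooks may define these features differently.

```python
from datetime import datetime, timezone
from statistics import mean, pstdev

SECONDS_PER_DAY = 86400.0


def customer_features(timestamps, amounts, cutoff):
    """Compute a few recency/frequency/temporal features for one customer.

    timestamps: timezone-aware datetimes of the customer's transactions
    amounts:    transaction amounts, aligned with timestamps
    cutoff:     timezone-aware datetime; only earlier transactions are used
    All formulas below are illustrative assumptions, not the project's code.
    """
    # Keep only transactions strictly before the cutoff, in time order.
    history = sorted((t, a) for t, a in zip(timestamps, amounts) if t < cutoff)
    if not history:
        raise ValueError("customer has no transactions before the cutoff")
    times = [t for t, _ in history]
    values = [a for _, a in history]

    recency_days = (cutoff - times[-1]).total_seconds() / SECONDS_PER_DAY
    lifespan_days = (cutoff - times[0]).total_seconds() / SECONDS_PER_DAY

    # Hours between consecutive transactions (empty for a single transaction).
    gaps = [
        (later - earlier).total_seconds() / 3600.0
        for earlier, later in zip(times, times[1:])
    ]
    avg_gap = mean(gaps) if gaps else 0.0
    cv_gap = pstdev(gaps) / avg_gap if avg_gap > 0 else 0.0

    return {
        "recency_days": recency_days,
        "total_transactions": len(values),
        "frequency_per_day": len(values) / max(lifespan_days, 1.0),
        "total_monetary_value": sum(values),
        "avg_transaction_amount": mean(values),
        "avg_inter_play_hours": avg_gap,
        "cv_inter_play_hours": cv_gap,
        "max_inter_play_hours": max(gaps) if gaps else 0.0,
        "lifespan_days": lifespan_days,
        "recency_to_lifespan_ratio": (
            recency_days / lifespan_days if lifespan_days > 0 else 0.0
        ),
    }
```

Passing the same `cutoff` that defines the label window keeps every feature restricted to the period before the labels are observed.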
- recent_30d_amount_ratio (31%): Recent spending intensity vs lifetime
- recent_30d_trans_ratio (29%): Recent transaction frequency trend
- recency_days (18%): Days since last transaction
Within each split, the 60-day label window starts the day after the feature cutoff, so features never see the period used to define the label:
| Set | Features Cutoff | Labels Determined By | Purpose |
|---|---|---|---|
| Train | Oct 31, 2019 | Nov 1 - Dec 30, 2019 | Model training |
| Tuning | Nov 30, 2019 | Dec 1 - Jan 29, 2020 | Hyperparameter optimization |
| Test | Dec 31, 2019 | Jan 1 - Feb 29, 2020 | Final evaluation |
Churn Definition: Zero transactions in the 60-day label window after feature cutoff.
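
The split table and churn definition above can be sketched as follows. This is a minimal illustration, assuming transactions are available as a mapping from customer ID to timezone-aware timestamps; the names `LABEL_STARTS` and `label_churn` are hypothetical and not taken from the notebooks. Each label window is expressed by its first day, so the feature cutoff is everything strictly before that instant.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

# First instant of each label window (the day after the feature cutoff).
LABEL_STARTS = {
    "train": datetime(2019, 11, 1, tzinfo=UTC),   # features through Oct 31, 2019
    "tuning": datetime(2019, 12, 1, tzinfo=UTC),  # features through Nov 30, 2019
    "test": datetime(2020, 1, 1, tzinfo=UTC),     # features through Dec 31, 2019
}


def label_churn(transactions, label_start, window_days=60):
    """Return {customer_id: 1 if churned else 0}.

    transactions: dict mapping customer_id -> list of aware datetimes
    A customer is labeled only if they have at least one transaction before
    label_start; churn means zero transactions in the label window
    [label_start, label_start + window_days).
    """
    window_end = label_start + timedelta(days=window_days)
    labels = {}
    for customer_id, times in transactions.items():
        if not any(t < label_start for t in times):
            continue  # no history in the feature window, so nothing to predict
        active = any(label_start <= t < window_end for t in times)
        labels[customer_id] = 0 if active else 1
    return labels
```

For example, `label_churn(transactions, LABEL_STARTS["train"])` labels customers using activity from Nov 1 through Dec 30, 2019, while their features use only data through Oct 31, 2019.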
- Framework: Gradient Boosted Trees
- Loss Function: Binary Logistic
- Tuning Method: Optuna (50 trials, TPE sampler)
- Hyperparameters: max_depth, learning_rate, n_estimators, subsample, colsample_bytree, min_child_weight, gamma, reg_alpha, reg_lambda
- Framework: Bootstrap Aggregating (Parallel Ensemble)
- Tuning Method: Optuna (50 trials, TPE sampler)
- Hyperparameters: n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features, bootstrap, criterion
Primary Metrics:
- Accuracy, Precision, Recall, Specificity, F1-Score
Error Analysis:
- FPR (False Positive Rate), FNR (False Negative Rate)
AUC Metrics:
- ROC-AUC, PR-AUC
Derived Metrics:
- MCC (Matthews Correlation Coefficient)
- Balanced Accuracy
- Youden's J Statistic
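
All of the threshold-based metrics listed above follow from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name `classification_metrics` is illustrative, and returning 0.0 when a denominator is zero is a convention assumed here rather than something the reports specify. ROC-AUC and PR-AUC are not included because they need predicted scores, not just counts.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Derive threshold-based metrics from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Assumed convention: undefined ratios are reported as 0.0.
        return numerator / denominator if denominator else 0.0

    recall = ratio(tp, tp + fn)        # sensitivity, true positive rate
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "fpr": ratio(fp, fp + tn),
        "fnr": ratio(fn, fn + tp),
        "balanced_accuracy": (recall + specificity) / 2,
        "youden_j": recall + specificity - 1,
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }
```

Note that Youden's J equals `2 * balanced_accuracy - 1`, so the two always rank models identically; MCC additionally accounts for precision on both classes, which matters when churners are a minority.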
Due to high multicollinearity (|r| > 0.7):
- ✅ Kept: recent_30d_amount_ratio (removed recent_30d_trans_ratio, r=0.969)
- ✅ Kept: lifespan_days (removed days_since_first_trans, r=0.809)
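
One way to apply a |r| > 0.7 rule is a greedy pass over the features, sketched below with a plain Pearson correlation. The helper names `pearson` and `drop_correlated` are hypothetical, and the tie-breaking rule (keep whichever feature comes first in the input order) is an assumption; the notebooks may choose which member of a correlated pair to keep on other grounds.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.7):
    """Greedily drop features that correlate too strongly with a kept one.

    features: dict mapping feature name -> list of values, in priority order
              (earlier names win when a pair exceeds the threshold).
    Returns (kept_names, dropped) where dropped holds (name, kept_partner, r).
    """
    kept, dropped = [], []
    for name, values in features.items():
        for partner in kept:
            r = pearson(values, features[partner])
            if abs(r) > threshold:
                dropped.append((name, partner, r))
                break
        else:
            # No kept feature exceeded the threshold, so this one stays.
            kept.append(name)
    return kept, dropped
```

Listing `recent_30d_amount_ratio` before `recent_30d_trans_ratio` in the input would reproduce the choice recorded above, with the dropped entry reporting the pair's correlation.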
- ✅ No multicollinearity (|r| < 0.7)
- ✅ Timezone-aware UTC timestamps
- ✅ APPROVED transactions only
- ✅ Complete data validation
Generated after notebook execution:
reports/
├── model_performance_report.txt       # XGBoost metrics
├── model_performance_report_rf.txt    # Random Forest metrics
└── figures/
    ├── 01-06: EDA visualizations
    ├── 08: Churn comparison
    ├── 10-13: XGBoost evaluation (ROC, confusion, features)
    └── 14-16: Random Forest evaluation (ROC, confusion, features)
| Type | Direction | Purpose |
|---|---|---|
| LOYALTYCARDDEBIT | Digital Wallet → Wagering Account | Level 2 deposit (funding play) |
| LOYALTYCARDCREDIT | Wagering Account → Digital Wallet | Level 2 withdrawal (cashing out) |
| LOYALTYCARDCREDITCL | Personal Account → Digital Wallet | Level 1 deposit via card |
| LOYALTYCARDCREDITACH | Personal Account → Digital Wallet | Level 1 deposit via ACH |
- Primary Churn Signal: Declining spending in recent 30 days
- Complementary Signals: Absolute recency + relative inactivity
- Data-Driven Thresholds: Determined from inter-playtime distribution
- Model Comparison: Both models achieve >82% accuracy; see reports for detailed comparison
- Random Seed: 42 (XGBoost, Random Forest, Optuna)
- Sklearn Version: 1.3.2
- XGBoost Version: 2.0.3
- Data Processing: UTC timezone-aware, APPROVED transactions only
- No Data Leakage: Features and labels use non-overlapping windows
- Dataset: https://doi.org/10.17632/9j5gcygnwg.1
- XGBoost: https://xgboost.readthedocs.io/
- Optuna: https://optuna.org/
- Scikit-learn: https://scikit-learn.org/
Status: ✅ Complete Analysis | Last Updated: January 2026
- Level 2 Withdrawals: Wagering Account → Digital Wallet
python -m venv venv
.\venv\Scripts\Activate.ps1
If you encounter execution policy errors, run:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
pip install --upgrade pip
pip install -r requirements.txt
python -c "import pandas, xgboost, sklearn; print('All packages installed successfully!')"
jupyter notebook
- Load and inspect raw data
- Descriptive statistics
- Distribution analysis (histograms, boxplots)
- Temporal patterns
- Correlation analysis
- Missing data assessment
- Handle missing values
- Parse datetime features
- Engineer transaction-based features
- Create churn labels
- Handle imbalanced data
- Customer-level aggregations
- Behavioral patterns
- Transaction velocity
- Temporal features
- L1/L2 transaction ratios
- Train/test split with temporal awareness
- XGBoost model training
- Hyperparameter tuning
- Cross-validation
- Model evaluation
- Feature importance analysis
- SHAP values
- Performance metrics
- Comparative analysis (sports vs casino)
- Generate publication-ready figures
- Run EDA notebooks to understand the data
- Define churn criteria based on domain knowledge
- Engineer relevant features
- Train and evaluate models
- Generate insights for article