HermannReuterData/xgboost_churn

XGBoost vs Random Forest: Churn Prediction Analysis

A comprehensive churn prediction analysis comparing XGBoost and Random Forest models using temporal train/test splits and Optuna hyperparameter optimization.

📊 Project Overview

This project analyzes customer churn in online sports betting using transactional data from March 1, 2019 to February 29, 2020. Two tree-ensemble models, XGBoost and Random Forest, are trained on the same features and compared on the same evaluation metrics.

Data Source: doi:10.17632/9j5gcygnwg.1

Key Features

  • Non-overlapping temporal splits to prevent data leakage
  • Optuna hyperparameter tuning for both XGBoost and Random Forest
  • 18 engineered features with correlation validation (no multicollinearity)
  • Comprehensive evaluation: ROC-AUC, PR-AUC, MCC, Youden's J, Balanced Accuracy
  • Feature importance analysis from both models

📁 Project Structure

xgboost_churn/
├── README.md                          # This file
├── requirements.txt                   # Python dependencies
│
├── notebooks/
│   ├── 01_descriptive_statistics_rfm.ipynb    # EDA & Feature Engineering
│   ├── 02_xgboost_churn_model.ipynb          # XGBoost with Optuna
│   └── 03_random_forest_comparison.ipynb     # Random Forest with Optuna
│
├── data/
│   ├── raw/
│   │   └── Online_sports_DIB.csv             # Raw sports betting transactions
│   └── processed/
│       ├── rfm_features_sports_with_churn.csv
│       └── model_comparison_data.pkl
│
└── reports/
    ├── model_performance_report.txt
    ├── model_performance_report_rf.txt
    └── figures/ (14 generated PNG visualizations)

🚀 Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Run Analysis (Sequential)

# Step 1: Exploratory Data Analysis & Feature Engineering (~2-3 min)
jupyter notebook notebooks/01_descriptive_statistics_rfm.ipynb

# Step 2: XGBoost Model Training with Optuna (~5-10 min, 50 trials)
jupyter notebook notebooks/02_xgboost_churn_model.ipynb

# Step 3: Random Forest Comparison (~5-10 min, 50 trials)
jupyter notebook notebooks/03_random_forest_comparison.ipynb

📊 Feature Engineering

18 Final Features (After Multicollinearity Removal)

| Category | Features | Description |
|---|---|---|
| Recency (1) | recency_days | Days since last transaction |
| Frequency (2) | total_transactions, frequency_per_day | Transaction count & rate |
| Monetary (4) | total_monetary_value, avg_transaction_amount, std_transaction_amount, amount_volatility | Spending metrics |
| Behavior (5) | net_flow, net_loss_ratio, deposit_ratio, avg_deposit_amount, avg_withdrawal_amount | Deposit/withdrawal patterns |
| Temporal (3) | avg_inter_play_hours, cv_inter_play_hours, max_inter_play_hours | Time between transactions |
| Trends (2) | recent_30d_amount_ratio, recency_to_lifespan_ratio | Recent activity decline |
| Loyalty (1) | lifespan_days | Customer tenure |
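As a rough illustration of how a few of these features could be derived for a single customer, the sketch below computes recency_days, lifespan_days, total_transactions, frequency_per_day, and recent_30d_amount_ratio from a list of (timestamp, amount) pairs. The exact formulas used in the notebooks are not stated in this README, so the definitions here (in particular the 30-day ratio as recent amount divided by lifetime amount) and the function name `customer_features` are assumptions.

```python
from datetime import datetime, timedelta, timezone


def customer_features(transactions, cutoff):
    """Compute a few RFM-style features for one customer.

    transactions: list of (timestamp, amount) tuples with timezone-aware UTC timestamps.
    cutoff: feature cutoff; transactions after it are ignored to avoid leakage.
    The formulas are illustrative assumptions, not the notebooks' definitions.
    """
    history = sorted((ts, amt) for ts, amt in transactions if ts <= cutoff)
    if not history:
        return None  # customer has no activity before the cutoff

    first_ts, last_ts = history[0][0], history[-1][0]
    lifespan_days = (last_ts - first_ts).days
    total_amount = sum(abs(amt) for _, amt in history)

    # Amount moved in the 30 days immediately before the cutoff
    recent_start = cutoff - timedelta(days=30)
    recent_amount = sum(abs(amt) for ts, amt in history if ts > recent_start)

    return {
        "recency_days": (cutoff - last_ts).days,
        "lifespan_days": lifespan_days,
        "total_transactions": len(history),
        # max(..., 1) avoids division by zero for single-day customers
        "frequency_per_day": len(history) / max(lifespan_days, 1),
        "recent_30d_amount_ratio": recent_amount / total_amount if total_amount else 0.0,
    }
```

Filtering on `ts <= cutoff` before any aggregation is the part that matters for the temporal validation described in this README: no feature can see activity from the label window.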

Top 3 Predictors (XGBoost Importance)

  1. recent_30d_amount_ratio (31%): Recent spending intensity vs lifetime
  2. recent_30d_trans_ratio (29%): Recent transaction frequency trend
  3. recency_days (18%): Days since last transaction

Note: recent_30d_trans_ratio appears in this ranking but is listed under Removed Redundant Features (r=0.969 with recent_30d_amount_ratio), so it is not among the 18 final features.

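📈 Temporal Validation Strategy

Within each set, features use data up to a cutoff date and labels come from the 60-day window that follows, so feature and label windows never overlap:

| Set | Features Cutoff | Labels Determined By | Purpose |
|---|---|---|---|
| Train | Oct 31, 2019 | Nov 1 - Dec 30, 2019 | Model training |
| Tuning | Nov 30, 2019 | Dec 1, 2019 - Jan 29, 2020 | Hyperparameter optimization |
| Test | Dec 31, 2019 | Jan 1 - Feb 29, 2020 | Final evaluation |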

Churn Definition: Zero transactions in the 60-day label window after feature cutoff.
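
The window arithmetic and the churn rule can be sketched in a few lines. The cutoff dates come from the table above; the function names and the representation of a customer as a list of transaction dates are assumptions made for illustration.

```python
from datetime import date, timedelta

LABEL_WINDOW_DAYS = 60

# Feature cutoffs for each set, taken from the table above
SPLIT_CUTOFFS = {
    "train": date(2019, 10, 31),
    "tuning": date(2019, 11, 30),
    "test": date(2019, 12, 31),
}


def label_window(cutoff):
    """Return (start, end), inclusive, of the 60-day label window after a cutoff."""
    start = cutoff + timedelta(days=1)
    end = start + timedelta(days=LABEL_WINDOW_DAYS - 1)
    return start, end


def is_churned(transaction_dates, cutoff):
    """True if the customer has zero transactions inside the label window."""
    start, end = label_window(cutoff)
    return not any(start <= d <= end for d in transaction_dates)
```

Calling `label_window` on each cutoff reproduces the three label ranges in the table, including the February 29, 2020 end date for the test set.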

🤖 Model Configurations

XGBoost

  • Framework: Gradient Boosted Trees
  • Loss Function: Binary Logistic
  • Tuning Method: Optuna (50 trials, TPE sampler)
  • Hyperparameters: max_depth, learning_rate, n_estimators, subsample, colsample_bytree, min_child_weight, gamma, reg_alpha, reg_lambda

Random Forest

  • Framework: Bootstrap Aggregating (Parallel Ensemble)
  • Tuning Method: Optuna (50 trials, TPE sampler)
  • Hyperparameters: n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features, bootstrap, criterion
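
A minimal sketch of how the two search spaces and the 50-trial TPE study might be wired up with Optuna follows. The hyperparameter names are those listed above; the numeric ranges, the choice of ROC-AUC on the tuning set as the objective, and the helper names (`suggest_xgb_params`, `suggest_rf_params`, `tune`) are assumptions, since this README does not state them. Optuna and scikit-learn are imported inside `tune` so the search-space functions can be read and exercised on their own.

```python
def suggest_xgb_params(trial):
    """Sample the nine XGBoost hyperparameters (ranges are assumed)."""
    return {
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
        "gamma": trial.suggest_float("gamma", 0.0, 5.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-3, 10.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-3, 10.0, log=True),
    }


def suggest_rf_params(trial):
    """Sample the seven Random Forest hyperparameters (ranges are assumed)."""
    return {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 3, 30),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 20),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
        "max_features": trial.suggest_categorical("max_features", ["sqrt", "log2", None]),
        "bootstrap": trial.suggest_categorical("bootstrap", [True, False]),
        "criterion": trial.suggest_categorical("criterion", ["gini", "entropy"]),
    }


def tune(make_model, suggest_params, X_train, y_train, X_tune, y_tune,
         n_trials=50, seed=42):
    """Run a TPE study that fits on the train set and scores on the tuning set."""
    import optuna
    from sklearn.metrics import roc_auc_score

    def objective(trial):
        model = make_model(**suggest_params(trial))
        model.fit(X_train, y_train)
        # Assumed objective: ROC-AUC of the churn probability on the tuning set
        return roc_auc_score(y_tune, model.predict_proba(X_tune)[:, 1])

    sampler = optuna.samplers.TPESampler(seed=seed)
    study = optuna.create_study(direction="maximize", sampler=sampler)
    study.optimize(objective, n_trials=n_trials)
    return study
```

Here `make_model` would be a small factory such as `lambda **p: XGBClassifier(objective="binary:logistic", random_state=42, **p)` or `lambda **p: RandomForestClassifier(random_state=42, **p)`, and `study.best_params` then holds the selected configuration.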

📊 Evaluation Metrics

Primary Metrics:

  • Accuracy, Precision, Recall, Specificity, F1-Score

Error Analysis:

  • FPR (False Positive Rate), FNR (False Negative Rate)

AUC Metrics:

  • ROC-AUC, PR-AUC

Derived Metrics:

  • MCC (Matthews Correlation Coefficient)
  • Balanced Accuracy
  • Youden's J Statistic
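
The error rates and derived metrics all follow from the four confusion-matrix counts. The sketch below shows the standard formulas; the function name is hypothetical, and it assumes both classes are present (otherwise sensitivity or specificity is undefined).

```python
import math


def derived_metrics(tp, fp, tn, fn):
    """Compute error rates and derived metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # recall, true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "youden_j": sensitivity + specificity - 1,
        # MCC is conventionally set to 0 when any marginal total is zero
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }
```

Youden's J and Balanced Accuracy carry the same information (J = 2 × Balanced Accuracy − 1), while MCC also accounts for the predicted-class totals, which makes it more informative when churners and non-churners are imbalanced.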

🔍 Data Quality & Feature Validation

Removed Redundant Features

Due to high multicollinearity (|r| > 0.7):

  • ✓ Kept: recent_30d_amount_ratio (removed recent_30d_trans_ratio, r=0.969)
  • ✓ Kept: lifespan_days (removed days_since_first_trans, r=0.809)
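
One way to apply such a threshold is a greedy filter: walk the features in priority order and drop any feature whose absolute Pearson correlation with an already-kept feature exceeds 0.7. The sketch below is an assumed procedure, not necessarily the one used in the notebooks, and it expects non-constant columns of equal length.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def drop_correlated(columns, threshold=0.7):
    """Greedy multicollinearity filter.

    columns: dict mapping feature name to a list of values, in priority order
             (earlier features win when two are highly correlated).
    Returns (kept_names, dropped) where dropped maps name -> (kept_partner, r).
    """
    kept, dropped = [], {}
    for name, values in columns.items():
        for other in kept:
            r = pearson(values, columns[other])
            if abs(r) > threshold:
                dropped[name] = (other, r)
                break
        else:
            kept.append(name)
    return kept, dropped
```

Because earlier columns win, listing recent_30d_amount_ratio before recent_30d_trans_ratio reproduces the "kept / removed" choice shown above.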

All Remaining Features

  • ✓ No multicollinearity (|r| < 0.7)
  • ✓ Timezone-aware UTC timestamps
  • ✓ APPROVED transactions only
  • ✓ Complete data validation

📁 Output Files

Generated after notebook execution:

reports/
├── model_performance_report.txt       # XGBoost metrics
├── model_performance_report_rf.txt    # Random Forest metrics
└── figures/
    ├── 01-06: EDA visualizations
    ├── 08: Churn comparison
    ├── 10-13: XGBoost evaluation (ROC, confusion, features)
    └── 14-16: Random Forest evaluation (ROC, confusion, features)

📋 Transaction Type Dictionary

| Type | Direction | Purpose |
|---|---|---|
| LOYALTYCARDDEBIT | Digital Wallet → Wagering Account | Level 2 deposit (funding play) |
| LOYALTYCARDCREDIT | Wagering Account → Digital Wallet | Level 2 withdrawal (cashing out) |
| LOYALTYCARDCREDITCL | Personal Account → Digital Wallet | Level 1 deposit via card |
| LOYALTYCARDCREDITACH | Personal Account → Digital Wallet | Level 1 deposit via ACH |
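
For feature engineering it can help to encode this dictionary as a lookup table. The sketch below does so and counts transactions per (level, kind); the tuple layout and the function name are assumptions made for illustration.

```python
from collections import Counter

# type code -> (level, kind, source account, target account)
TRANSACTION_TYPES = {
    "LOYALTYCARDDEBIT": (2, "deposit", "Digital Wallet", "Wagering Account"),
    "LOYALTYCARDCREDIT": (2, "withdrawal", "Wagering Account", "Digital Wallet"),
    "LOYALTYCARDCREDITCL": (1, "deposit", "Personal Account", "Digital Wallet"),
    "LOYALTYCARDCREDITACH": (1, "deposit", "Personal Account", "Digital Wallet"),
}


def summarize_flows(type_codes):
    """Count transactions per (level, kind); unknown codes raise KeyError."""
    return Counter(TRANSACTION_TYPES[code][:2] for code in type_codes)
```

Raising on unknown codes, rather than silently skipping them, makes it obvious when the raw data contains a transaction type that the dictionary does not cover.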

🎯 Key Business Insights

  1. Primary Churn Signal: Declining spending in recent 30 days
  2. Complementary Signals: Absolute recency + relative inactivity
  3. Data-Driven Thresholds: Determined from inter-playtime distribution
  4. Model Comparison: Both models achieve >82% accuracy; see reports for detailed comparison

✅ Reproducibility

  • Random Seed: 42 (XGBoost, Random Forest, Optuna)
  • scikit-learn Version: 1.3.2
  • XGBoost Version: 2.0.3
  • Data Processing: UTC timezone-aware, APPROVED transactions only
  • No Data Leakage: Features and labels use non-overlapping windows
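
Before re-running the notebooks, it may be worth checking that the installed library versions match the ones listed above. The sketch below uses only the standard library; the `PINNED` mapping and the function name are illustrative, with the version numbers taken from this section.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in the Reproducibility section
PINNED = {"scikit-learn": "1.3.2", "xgboost": "2.0.3"}


def version_mismatches(pinned=PINNED, get_version=version):
    """Return {package: (expected, installed)} for every package that differs.

    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for package, expected in pinned.items():
        try:
            installed = get_version(package)
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches
```

An empty result from `version_mismatches()` means the environment matches the pinned versions; fixing the seed to 42 then covers the remaining source of run-to-run variation mentioned above.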

📚 References


Status: ✅ Complete Analysis | Last Updated: January 2026


Setup Instructions

1. Create Virtual Environment

python -m venv venv

2. Activate Virtual Environment

On macOS or Linux:

source venv/bin/activate

On Windows (PowerShell):

.\venv\Scripts\Activate.ps1

If you encounter execution policy errors, run:

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

3. Install Dependencies

pip install --upgrade pip
pip install -r requirements.txt

4. Verify Installation

python -c "import pandas, xgboost, sklearn; print('All packages installed successfully!')"

5. Launch Jupyter Notebook

jupyter notebook

Workflow

Phase 1: Exploratory Data Analysis

  • Load and inspect raw data
  • Descriptive statistics
  • Distribution analysis (histograms, boxplots)
  • Temporal patterns
  • Correlation analysis
  • Missing data assessment

Phase 2: Data Processing

  • Handle missing values
  • Parse datetime features
  • Engineer transaction-based features
  • Create churn labels
  • Handle imbalanced data
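
This README does not say how class imbalance is handled. One common option for XGBoost is to set its `scale_pos_weight` parameter to the ratio of negative to positive examples in the training set; the helper below is a sketch of that calculation under the assumption that churners are labeled 1.

```python
def scale_pos_weight(labels):
    """Ratio of negative to positive labels, for XGBoost's scale_pos_weight."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive (churn) examples in labels")
    return negatives / positives
```

For Random Forest, scikit-learn's `class_weight="balanced"` option plays a comparable role.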

Phase 3: Feature Engineering

  • Customer-level aggregations
  • Behavioral patterns
  • Transaction velocity
  • Temporal features
  • L1/L2 transaction ratios

Phase 4: Modeling

  • Train/test split with temporal awareness
  • XGBoost model training
  • Hyperparameter tuning
  • Cross-validation
  • Model evaluation

Phase 5: Results & Visualization

  • Feature importance analysis
  • SHAP values
  • Performance metrics
  • Comparative analysis (sports vs casino)
  • Generate publication-ready figures

Next Steps

  1. Run EDA notebooks to understand the data
  2. Define churn criteria based on domain knowledge
  3. Engineer relevant features
  4. Train and evaluate models
  5. Generate insights for article
