Customer Churn Predictor

An end-to-end ML pipeline that predicts telecom customer churn with cost-sensitive threshold tuning, feature explainability, and an interactive Streamlit dashboard.


Project Structure

churn-prediction/
├── data/
│   ├── raw/
│   │   ├── generate_dataset.py
│   │   └── telco_churn.csv
│   └── processed/
│       └── features.csv
│
├── src/
│   ├── features.py
│   ├── train.py
│   ├── evaluate.py
│   └── explain.py
│
├── models/
│   ├── best_model.pkl
│   ├── metadata.json
│   ├── evaluation_report.json
│   └── threshold_sweep.csv
│
├── tests/
│   └── test_pipeline.py
│
├── app.py
├── requirements.txt
└── README.md

Prerequisites

  • Python 3.10+
  • pip

Install dependencies:

pip install -r requirements.txt

Quickstart

Step 1 — Generate dataset

python data/raw/generate_dataset.py

Generates data/raw/telco_churn.csv, a synthetic dataset of 7,043 customers with a 25.3% churn rate.

Alternatively, download the real IBM Telco dataset from Kaggle (blastchar/telco-customer-churn) and place it at data/raw/telco_churn.csv. The column names are identical, so no code changes are required.

Step 2 — Build features

python src/features.py

Step 3 — Train models

python src/train.py

Step 4 — Evaluate

python src/evaluate.py

Step 5 — Launch dashboard

streamlit run app.py

The dashboard opens at http://localhost:8501

Run tests

pytest tests/ -v

Results

Dataset

Property                      Value
Total customers               7,043
Features after engineering    28
Churn rate                    25.30%
Train / test split            80% / 20% (stratified)

Cross-Validation (5-Fold ROC-AUC)

Model                  Mean AUC   Std
Logistic Regression    0.7604     0.0109
Voting Ensemble        0.7555     0.0111
Random Forest          0.7516     0.0123
Gradient Boosting      0.7308     0.0112
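
As a rough illustration of how a 5-fold ROC-AUC mean and standard deviation like those above can be produced, here is a minimal sketch using scikit-learn. It is not the code in src/train.py: the synthetic data from make_classification, the random seeds, and max_iter are assumptions made so the snippet runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the engineered feature matrix
# (28 features, roughly 25% positive class).
X, y = make_classification(
    n_samples=1000, n_features=28, weights=[0.75, 0.25], random_state=42
)

# Scaling lives inside the pipeline so each fold is scaled on its own training part.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.5, class_weight="balanced", max_iter=1000),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"Mean AUC: {scores.mean():.4f}  Std: {scores.std():.4f}")
```

The printed numbers will not match the table, since the data here is synthetic.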

Validation Set Performance

Model                  AUC
Logistic Regression    0.7461
Voting Ensemble        0.7375
Random Forest          0.7310
Gradient Boosting      0.7134

Best model selected: Logistic Regression (AUC = 0.7461)

Classification Report — Best Model (threshold = 0.50)

Class          Precision   Recall   F1
No Churn       0.87        0.63     0.73
Churn          0.40        0.73     0.52
Weighted Avg   0.75        0.65     0.68

Overall accuracy: 65% at the default threshold of 0.50.

Confusion Matrix (threshold = 0.50)

              Pred: Stay   Pred: Churn
Actual: Stay     661          391
Actual: Churn     96          261

Cost-Sensitive Threshold Analysis

Parameter                            Value
FN cost per missed churner           $500
FP cost per false retention offer    $50
Optimal threshold                    0.23
Precision at optimal                 0.323
Recall at optimal                    0.958
Estimated revenue saved              $135,100

The pipeline sweeps thresholds from 0.10 to 0.90 and selects the one that maximises expected revenue saved. At the default threshold of 0.50 the model is more conservative and catches 73% of actual churners. Lowering the threshold to 0.23 captures 95.8% of actual churners at the cost of more false positives (precision falls from 0.40 to 0.323), but the net revenue outcome is better because a missed churner ($500) costs ten times as much as a wasted retention offer ($50).
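
The sweep can be sketched in a few lines of plain Python. This is not the code in src/evaluate.py. The objective below (true positives times the FN cost, minus false positives times the FP cost) is an assumption, chosen because it reproduces the reported $135,100 when fed counts back-calculated from the table (342 true positives and 718 false positives, inferred from recall 0.958 over 357 churners and precision 0.323). The function names, the 0.01 step, and the toy data are hypothetical.

```python
COST_FN = 500  # cost of a missed churner (from the table above)
COST_FP = 50   # cost of an unnecessary retention offer (from the table above)


def revenue_saved(tp, fp, cost_fn=COST_FN, cost_fp=COST_FP):
    """Assumed objective: value of churners caught minus cost of wasted offers."""
    return tp * cost_fn - fp * cost_fp


def sweep_thresholds(y_true, y_prob, lo=0.10, hi=0.90, step=0.01):
    """Return (threshold, revenue_saved) for the best threshold in [lo, hi]."""
    best = None
    n_steps = round((hi - lo) / step)
    for i in range(n_steps + 1):
        t = round(lo + i * step, 2)
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= t)
        saved = revenue_saved(tp, fp)
        # Strict ">" keeps the lowest threshold when several tie.
        if best is None or saved > best[1]:
            best = (t, saved)
    return best


# Toy labels and predicted churn probabilities (hypothetical).
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_prob = [0.80, 0.30, 0.25, 0.15, 0.60, 0.45, 0.05, 0.20]

best_threshold, best_saved = sweep_thresholds(y_true, y_prob)
print(best_threshold, best_saved)

# Back-calculated counts at threshold 0.23 (inferred, not reported directly).
print(revenue_saved(tp=342, fp=718))
```

Under the same assumed objective, the counts from the 0.50 confusion matrix (261 true positives, 391 false positives) give $110,950, which is consistent with the lower threshold being preferred.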


Features Engineered

Feature              Description
tenure               Months as a customer
MonthlyCharges       Current monthly bill
Contract             Ordinal: 0 = monthly, 1 = annual, 2 = two-year
service_count        Total add-on services subscribed
is_new_customer      1 if tenure < 6 months
has_no_support       No tech support and no online security
charges_per_month    TotalCharges / tenure
monthly_to_total     MonthlyCharges / TotalCharges ratio
contract_tenure      Contract x tenure interaction term
OHE columns          Gender, InternetService, PaymentMethod
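
For orientation, the table above could be implemented roughly as follows with pandas. This is a sketch, not the contents of src/features.py. It assumes the raw column names and category labels of the IBM Telco dataset. The list of add-on columns, the exact rule for has_no_support, and returning 0 when a denominator is zero are assumptions, and build_features is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Assumption: which raw columns count as "add-on services".
ADDON_COLS = ["OnlineSecurity", "OnlineBackup", "DeviceProtection",
              "TechSupport", "StreamingTV", "StreamingMovies"]
CONTRACT_ORDER = {"Month-to-month": 0, "One year": 1, "Two year": 2}


def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the feature table."""
    out = pd.DataFrame(index=df.index)
    # TotalCharges may arrive as text; unparseable values become 0 (assumption).
    total = pd.to_numeric(df["TotalCharges"], errors="coerce").fillna(0.0)

    out["tenure"] = df["tenure"]
    out["MonthlyCharges"] = df["MonthlyCharges"]
    out["Contract"] = df["Contract"].map(CONTRACT_ORDER)
    out["service_count"] = (df[ADDON_COLS] == "Yes").sum(axis=1)
    out["is_new_customer"] = (df["tenure"] < 6).astype(int)
    out["has_no_support"] = (
        (df["TechSupport"] == "No") & (df["OnlineSecurity"] == "No")
    ).astype(int)
    # Guard against division by zero for brand-new customers (assumption).
    out["charges_per_month"] = (total / df["tenure"].replace(0, np.nan)).fillna(0.0)
    out["monthly_to_total"] = (df["MonthlyCharges"] / total.replace(0, np.nan)).fillna(0.0)
    out["contract_tenure"] = out["Contract"] * out["tenure"]

    ohe = pd.get_dummies(df[["gender", "InternetService", "PaymentMethod"]], dtype=int)
    return pd.concat([out, ohe], axis=1)


# Two made-up customers for illustration.
raw = pd.DataFrame({
    "tenure": [2, 24],
    "MonthlyCharges": [70.0, 50.0],
    "TotalCharges": ["140.0", "1200.0"],
    "Contract": ["Month-to-month", "Two year"],
    "OnlineSecurity": ["No", "Yes"],
    "OnlineBackup": ["No", "Yes"],
    "DeviceProtection": ["No", "No"],
    "TechSupport": ["No", "Yes"],
    "StreamingTV": ["Yes", "No"],
    "StreamingMovies": ["No", "No"],
    "gender": ["Female", "Male"],
    "InternetService": ["Fiber optic", "DSL"],
    "PaymentMethod": ["Electronic check", "Mailed check"],
})
features = build_features(raw)
print(features.T)
```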

Models

Model                  Configuration
Random Forest          300 trees, max_depth=12, class_weight=balanced
Gradient Boosting      HistGradientBoosting, 300 iterations, lr=0.05
Logistic Regression    L2, C=0.5, class_weight=balanced
Voting Ensemble        Soft vote across all three

All models use a StandardScaler preprocessing step inside a sklearn Pipeline. The best model is selected by ROC-AUC on the held-out validation set and saved to models/best_model.pkl.


Explainability

The dashboard provides feature-level explanations for every prediction using the model's built-in feature importances. For each customer, it shows the top 15 features driving the prediction, coloured by whether the customer is above or below the dataset average on that feature. A radar chart compares the customer profile against the average churner and average non-churner across six key dimensions.
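
The ranking and above/below labelling described here can be sketched as follows. The function name, the importance scores, and the dataset averages are hypothetical values for illustration, not output from the project.

```python
def top_feature_rows(importances, customer, dataset_mean, k=15):
    """Return the k most important features with an above/below-average label."""
    ranked = sorted(importances.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]
    rows = []
    for name, score in ranked:
        # Values equal to the average are labelled "below" in this sketch.
        side = "above" if customer[name] > dataset_mean[name] else "below"
        rows.append((name, score, side))
    return rows


# Hypothetical importances, customer values, and dataset averages.
importances = {"tenure": 0.30, "MonthlyCharges": 0.25, "service_count": 0.10}
customer = {"tenure": 3, "MonthlyCharges": 85.0, "service_count": 1}
dataset_mean = {"tenure": 32.0, "MonthlyCharges": 65.0, "service_count": 2.0}

rows = top_feature_rows(importances, customer, dataset_mean, k=2)
for name, score, side in rows:
    print(f"{name:<16} {score:.2f}  {side} average")
```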

Rule-based counterfactual suggestions are generated based on the customer's current feature values — for example, recommending a contract upgrade if the customer is on a monthly plan.
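
A rule set of this kind might look like the sketch below. Only the contract-upgrade rule comes from the description above. The other two rules, the wording of the suggestions, and the function name suggest_actions are hypothetical.

```python
def suggest_actions(customer):
    """Map a customer's engineered feature values to retention suggestions."""
    suggestions = []
    # Rule described above: monthly plan (Contract == 0) -> suggest an upgrade.
    if customer.get("Contract") == 0:
        suggestions.append("Offer an upgrade from the monthly plan to an annual contract.")
    # Hypothetical rule: no tech support and no online security.
    if customer.get("has_no_support") == 1:
        suggestions.append("Offer a tech support or online security add-on.")
    # Hypothetical rule: customer with tenure under 6 months.
    if customer.get("is_new_customer") == 1:
        suggestions.append("Schedule an onboarding check-in.")
    return suggestions


monthly_customer = {"Contract": 0, "has_no_support": 0, "is_new_customer": 0}
print(suggest_actions(monthly_customer))
```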

If shap is installed, the dashboard upgrades to full SHAP TreeExplainer values automatically.


Dashboard Pages

Page        Content
Predict     Customer profile form, churn probability gauge, feature importance chart, radar comparison, recommended actions
Evaluate    ROC-AUC, PR-AUC, F1, recall, precision, confusion matrix, CV scores by model, per-class metrics, threshold sweep with revenue curve, global feature importance
Explore     Churn distribution, tenure histogram, churn rate by contract / internet / payment method, monthly charges boxplot, tenure vs charges scatter, senior vs non-senior breakdown

Customisation

  • Swap in the real IBM Telco dataset — same column names, zero code changes required
  • Adjust COST_FN and COST_FP in evaluate.py to reflect actual business unit economics
  • Add MLflow experiment tracking by wrapping train.py with mlflow.start_run()
  • Deploy to Streamlit Cloud for free: push to GitHub, connect at share.streamlit.io
