This project focuses on detecting fraudulent credit card transactions in a highly imbalanced dataset using machine learning techniques. The goal is not just high accuracy, but effective fraud detection while minimizing false positives in a real-world alert system.
- Source: Kaggle – Credit Card Fraud Detection Dataset
- Transactions: 284,807
- Fraud Rate: ~0.17%
- Features: PCA-transformed features (`V1`–`V28`) + `Amount`
- Target: `Class` (1 = Fraud, 0 = Legitimate)
The dataset is not included in this repository due to size and licensing constraints.
- Python
- NumPy, Pandas
- Scikit-learn
- XGBoost
- Matplotlib, Seaborn
- Data loading and inspection
- Train–test split with stratification
- Handling class imbalance using SMOTE
- Baseline modeling with Random Forest
- Main modeling using XGBoost
- Model evaluation using ROC-AUC
- Precision–Recall analysis
- Threshold tuning to reduce false positives
- Feature importance analysis
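
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. It runs on synthetic data from `make_classification` instead of the Kaggle file, and it uses a Random Forest with `class_weight="balanced"` as a stand-in for the SMOTE + XGBoost stages so that scikit-learn is the only dependency. The `min_precision` target and all other parameter values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption): ~1% positives.
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.99, 0.01], flip_y=0.0, random_state=0,
)

# Stratified split keeps the positive rate equal in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Baseline model. class_weight="balanced" replaces SMOTE here (assumption);
# any resampling must be fitted on the training partition only.
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
model.fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

roc_auc = roc_auc_score(y_test, scores)

# Precision-recall analysis: thresholds are returned in increasing order,
# and precision/recall each have one more entry than thresholds.
precision, recall, thresholds = precision_recall_curve(y_test, scores)

# Threshold tuning: take the lowest threshold that meets a precision target,
# which keeps recall as high as possible under that constraint.
min_precision = 0.8  # hypothetical business target
meets_target = precision[:-1] >= min_precision
threshold = float(thresholds[meets_target][0]) if meets_target.any() else 0.5


def false_positives(y_true, y_score, t):
    """Count legitimate transactions that would be flagged at threshold t."""
    return int(np.sum((y_score >= t) & (y_true == 0)))


fp_default = false_positives(y_test, scores, 0.5)
fp_tuned = false_positives(y_test, scores, threshold)
print(f"ROC-AUC={roc_auc:.3f}  threshold={threshold:.2f}  "
      f"FP at 0.5={fp_default}  FP tuned={fp_tuned}")
```

The numbers this prints describe the synthetic data only and are unrelated to the results reported for the real dataset.
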
- ROC-AUC: ~0.95
- Key Outcome: ~32% reduction in false positives via threshold tuning
- Focus: Alert optimization rather than raw accuracy
- Accuracy is misleading for highly imbalanced datasets
- Threshold selection is a business decision, not a model default
- XGBoost scales better than Random Forest for large imbalanced data
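
The first two points can be made concrete with a little arithmetic. The sketch below uses the transaction count and approximate fraud rate quoted earlier to show what a classifier that never flags anything would score, then shows how the cheapest threshold moves when the costs of a false alert and a missed fraud change. The labels, scores, candidate thresholds, and cost values are all made up for illustration.

```python
n_transactions = 284_807   # from the dataset description
fraud_rate = 0.0017        # ~0.17%, approximate

n_fraud = round(n_transactions * fraud_rate)
n_legit = n_transactions - n_fraud

# A model that labels everything "legitimate" gets every legitimate
# transaction right and every fraud wrong.
accuracy = n_legit / n_transactions
recall = 0 / n_fraud
print(f"accuracy={accuracy:.2%}, fraud recall={recall:.0%}")


def alert_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Total cost of false alerts plus missed frauds at a given threshold."""
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    return fp * cost_fp + fn * cost_fn


# Toy labels and model scores (hypothetical).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.30, 0.55, 0.60, 0.70, 0.80, 0.95]
candidates = [0.2, 0.5, 0.75]

# If false alerts are expensive, a high threshold is cheapest.
best_when_fp_costly = min(
    candidates, key=lambda t: alert_cost(y_true, scores, t, cost_fp=5, cost_fn=1)
)
# If missed fraud is expensive, a lower threshold is cheapest.
best_when_fn_costly = min(
    candidates, key=lambda t: alert_cost(y_true, scores, t, cost_fp=1, cost_fn=50)
)
print(best_when_fp_costly, best_when_fn_costly)
```

The same scores yield different "best" thresholds under different cost assumptions, which is why the cutoff is better treated as a business input than as a model default.
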
- No temporal validation (the train–test split is stratified, not time-ordered)
- PCA features limit interpretability
- SMOTE generates synthetic fraud samples, which may bias the model toward patterns absent from real data
- Cost-sensitive learning
- Time-based validation
- Concept drift handling
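
As one possible starting point for time-based validation, the sketch below trains on the earliest transactions and holds out the most recent ones, so the model is always evaluated on data from after its training period. The `time_based_split` helper, the `time` key, and the sample rows are hypothetical; the feature list above does not include a timestamp column, so one would need to be available in the data.

```python
def time_based_split(rows, time_key, test_fraction=0.2):
    """Split rows chronologically: earliest for training, latest for testing."""
    ordered = sorted(rows, key=lambda r: r[time_key])
    n_test = round(len(ordered) * test_fraction)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Hypothetical transactions in arbitrary order.
rows = [
    {"time": t, "amount": 10.0 * t, "label": int(t == 7)}
    for t in [5, 1, 9, 3, 7, 2, 8, 4, 6, 0]
]

train, test = time_based_split(rows, time_key="time", test_fraction=0.2)

# Every training row precedes every test row, so nothing from the future
# leaks into training.
print(max(r["time"] for r in train), min(r["time"] for r in test))
```

Unlike a stratified random split, this does not guarantee the same fraud rate in both partitions, so the class balance of each side should be checked after splitting.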