This project performs customer segmentation using unsupervised learning on large-scale retail transaction data. The objective is to group customers based on purchasing behavior to support targeted marketing, retention strategies, and business decision-making.
- Source: UCI Machine Learning Repository
- Dataset: Online Retail
- Records: 500,000+ transaction records
- Customers: ~4,000 unique customers
- CustomerID
- InvoiceNo
- InvoiceDate
- Quantity
- UnitPrice
- Country The raw transactional data was cleaned and aggregated to customer-level features before clustering.
- Python
- Pandas, NumPy
- Matplotlib, Seaborn
- Scikit-learn
- Problem Framing Segmentation of customers using unsupervised learning (no labels available).
- Data Cleaning
- Removed missing CustomerID
- Removed cancelled/returned transactions
- Filtered invalid quantities and prices
- Feature Engineering (RFM Analysis)
- Recency: Days since last purchase
- Frequency: Number of transactions
- Monetary: Total spending
- Exploratory Data Analysis (EDA)
- Analyzed feature distributions and skewness
- Identified need for feature scaling
- Clustering
- Applied K-Means clustering
- Optimized number of clusters using Elbow Method and Silhouette Score
- Dimensionality Reduction
- Used PCA for 2D visualization of customer segments
- Successfully identified distinct customer segments
- Clear separation between:
- High-value loyal customers
- Frequent low-spend customers
- At-risk or churn-prone customers
- Occasional buyers
- Enables targeted marketing campaigns
- Helps prioritize high-value customers
- Identifies churn-risk customers for retention strategies
- Clusters are sensitive to outliers
- No ground-truth labels for validation
- Segmentation quality depends on feature engineering
- Try alternative clustering methods (DBSCAN, Hierarchical)
- Add temporal features
- Evaluate cluster stability over time