This project focuses on predicting whether a customer has health insurance using demographic, financial, and lifestyle data.
The goal is to support data-driven marketing strategies, allowing companies to target high-potential customers, reduce costs, and improve customer acquisition.
Marketing campaigns for health insurance are often broad and inefficient.
This project answers the key question:
How can we predict which customers are most likely to subscribe to health insurance in order to optimize marketing efforts and improve inclusion?
- Observations: 72,458 customers
- Features: 15 variables
- Target: `health_ins` (1 = has insurance, 0 = no insurance)
- Demographic → age, sex, marital status
- Financial → income
- Lifestyle → housing, vehicles, gas usage
- Geographic → state of residence
- Income is the strongest predictor of insurance ownership
- Older individuals are more likely to have insurance
- More vehicles and better housing → higher probability of insurance
- Gas usage showed low predictive power
- Strong class imbalance (majority already insured)
- Missing values handled (median / mode imputation)
- Outliers treated (capping + log transformation)
- Feature scaling (standardization)
- Categorical encoding (one-hot encoding)
- Feature engineering:
  - `income_per_vehicle`
  - `income_age_interaction`
  - `financial_status`
  - `mobility_indicator`
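
The preprocessing steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration, not the project's actual code: the column names (`num_vehicles`, `housing_type`), the 1st/99th percentile caps, and the formulas for `income_per_vehicle` and `income_age_interaction` are all assumptions, since the README only names the features. `financial_status` and `mobility_indicator` are left out because their definitions are not given.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class CapAndLog(TransformerMixin, BaseEstimator):
    """Clip each column to percentiles learned in fit, then apply log1p."""

    def __init__(self, lower=1.0, upper=99.0):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Caps are learned on the training data only (assumed 1st/99th percentiles).
        self.lo_, self.hi_ = np.percentile(X, [self.lower, self.upper], axis=0)
        return self

    def transform(self, X):
        X = np.clip(np.asarray(X, dtype=float), self.lo_, self.hi_)
        return np.log1p(np.clip(X, 0, None))  # floor at 0 so log1p is defined


def add_features(df):
    """Add two engineered features; both formulas are assumed, not sourced."""
    out = df.copy()
    out["income_per_vehicle"] = out["income"] / (out["num_vehicles"] + 1)
    out["income_age_interaction"] = out["income"] * out["age"]
    return out


skewed_cols = ["income", "income_per_vehicle", "income_age_interaction"]
numeric_cols = ["age", "num_vehicles"]
categorical_cols = ["sex", "housing_type"]

preprocess = ColumnTransformer(
    [
        ("skewed", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("cap_log", CapAndLog()),
            ("scale", StandardScaler()),
        ]), skewed_cols),
        ("numeric", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        ("categorical", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Tiny made-up sample, only to show the pipeline running end to end.
df = pd.DataFrame({
    "age": [25, 40, np.nan, 63],
    "income": [20000, 55000, 1_000_000, np.nan],
    "num_vehicles": [0, 2, 1, 3],
    "sex": ["F", "M", "F", np.nan],
    "housing_type": ["rented", "owned", "owned", "rented"],
})
X_ready = preprocess.fit_transform(add_features(df))
print(X_ready.shape)
```

Fitting the imputers, caps, and scaler inside one pipeline means they are learned from the training split only and then reused unchanged on test data, which avoids leaking test-set statistics into training.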
- Logistic Regression
- Random Forest
- XGBoost
- Gradient Boosting
- Neural Network (MLP)
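
A comparison of these model families could be sketched as below. It is an illustration under stated assumptions: the data is synthetic (`make_classification` with an assumed 80/20 class split), the hyperparameters are library defaults or arbitrary choices, and XGBoost is omitted to keep the sketch dependent on scikit-learn only. The project's real training setup is not described in this README.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the customer data (assumed 80% positives).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.2, 0.8], random_state=0
)

models = {
    # Linear and neural models get scaled inputs; tree ensembles do not need it.
    "Logistic Regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Neural Network (MLP)": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    ),
}

# Stratified folds keep the class ratio the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in models.items()
}

for name, auc in sorted(auc_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22} mean AUC = {auc:.3f}")
```

Using the same stratified folds for every model keeps the comparison fair, and scoring by AUC avoids rewarding a model that simply predicts the majority class.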
| Model | Accuracy | F1-Score | AUC |
|---|---|---|---|
| Logistic Regression | 0.68 | 0.79 | 0.76 |
| Random Forest | 0.84 | 0.91 | 0.73 |
| XGBoost | 0.69 | 0.80 | 0.79 |
| Gradient Boosting | 0.90 | 0.95 | 0.80 |
| Neural Network | 0.23 | 0.27 | 0.47 |
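
Because the data is strongly imbalanced, accuracy and F1 are easiest to interpret next to a majority-class baseline. The sketch below uses a hypothetical 90% insured rate (the README does not state the actual class ratio) to show what a model that always predicts "insured" would score on the three metrics in the table.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Hypothetical labels: 90 insured (1) and 10 uninsured (0).
y_true = np.array([1] * 90 + [0] * 10)
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by the baseline

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)
y_score = baseline.predict_proba(X_dummy)[:, 1]

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)

print(f"accuracy = {accuracy:.2f}")  # 0.90
print(f"F1       = {f1:.3f}")        # 0.947
print(f"AUC      = {auc:.2f}")       # 0.50
```

Under this assumed ratio, the baseline reaches high accuracy and F1 without learning anything, while its AUC stays at 0.50. AUC is therefore the most informative column when comparing models on imbalanced data.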
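Gradient Boosting achieved the best overall performance:
- Highest accuracy (0.90)
- Highest F1-score (0.95), reflecting a strong balance between precision and recall
- Highest AUC (0.80), a threshold-independent metric that is less sensitive to class imbalance than accuracy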
This model enables:
- Targeted marketing campaigns
- Reduction in marketing costs
- Higher conversion rates
- Better understanding of customer behavior
- Improved inclusion of underserved groups
- Final model saved using `joblib`
- Predictions generated on unseen test data
- Output file: `final_predictions.csv`
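
The save-and-predict step could look roughly like the sketch below. Only `joblib` and the output name `final_predictions.csv` come from this README. The model file name `final_model.joblib`, the CSV column names, and the synthetic data are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared customer features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Persist the fitted model (file name is hypothetical).
out_dir = Path(tempfile.mkdtemp())
model_path = out_dir / "final_model.joblib"
joblib.dump(model, model_path)

# Reload it, as a separate scoring script would, and predict on held-out data.
reloaded = joblib.load(model_path)
predictions = pd.DataFrame({
    "predicted_health_ins": reloaded.predict(X_test),
    "probability_insured": reloaded.predict_proba(X_test)[:, 1],
})

csv_path = out_dir / "final_predictions.csv"
predictions.to_csv(csv_path, index=False)
print(f"Wrote {len(predictions)} predictions to {csv_path}")
```

Writing the predicted probability alongside the class label lets a marketing team rank customers by likelihood instead of relying on a fixed 0.5 cutoff.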
- Feature engineering has a major impact on performance
- Handling class imbalance is critical
- Simpler models can be strong baselines
- Among the models tested, Gradient Boosting (a boosting ensemble) performed best on this structured dataset
- Python
- Pandas, NumPy
- Scikit-learn
- XGBoost
- Matplotlib, Seaborn
This project demonstrates how machine learning can be applied to solve a real-world business problem by combining data analysis, feature engineering, and predictive modeling.
The final model provides both accurate predictions and actionable insights, making it valuable for strategic decision-making.