thepedromendes/Health_Insurance

Health Insurance Prediction – Machine Learning Project

Overview

This project focuses on predicting whether a customer has health insurance using demographic, financial, and lifestyle data.

The goal is to support data-driven marketing strategies, allowing companies to target high-potential customers, reduce costs, and improve customer acquisition.


Business Problem

Marketing campaigns for health insurance are often broad and inefficient.

This project answers the key question:

How can we predict which customers are most likely to subscribe to health insurance in order to optimize marketing efforts and improve inclusion?


Dataset

  • Observations: 72,458 customers
  • Features: 15 variables
  • Target: health_ins (1 = Has insurance, 0 = No insurance)

Feature Types:

  • Demographic → age, sex, marital status
  • Financial → income
  • Lifestyle → housing, vehicles, gas usage
  • Geographic → state of residence

Key Insights from Analysis

  • Income is the strongest predictor of insurance ownership
  • Older individuals are more likely to have insurance
  • More vehicles and better housing → higher probability of insurance
  • Gas usage showed low predictive power
  • Strong class imbalance (majority already insured)
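
Because most customers are already insured, it helps to measure the class split before reading any accuracy figure. The snippet below is a minimal sketch on hypothetical toy labels (not the project's data). It shows how the share of each class in health_ins gives the accuracy of a model that always predicts the majority class.

```python
import pandas as pd

# Hypothetical toy labels; the real target column is health_ins
# (1 = has insurance, 0 = no insurance).
df = pd.DataFrame({"health_ins": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]})

# Share of each class in the target.
class_share = df["health_ins"].value_counts(normalize=True)

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = class_share.max()

print(class_share)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

Any model accuracy is best compared against this baseline rather than against 0.50.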

Data Preparation

  • Missing values handled (median / mode imputation)

  • Outliers treated (capping + log transformation)

  • Feature scaling (Standardization)

  • Categorical encoding (One-Hot Encoding)

  • Feature engineering:

    • income_per_vehicle
    • income_age_interaction
    • financial_status
    • mobility_indicator
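
The preparation steps above could be chained in a single scikit-learn pipeline. The following is a rough sketch, not the project's actual code: the column names, the toy rows, the 1st/99th percentile caps, and the formulas for income_per_vehicle and income_age_interaction are all assumptions made for illustration. financial_status and mobility_indicator are left out because their definitions are not given here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical toy rows; the real column names may differ.
raw = pd.DataFrame({
    "age": [25, 40, np.nan, 63, 31],
    "income": [22000, 54000, 1_200_000, np.nan, 38000],
    "num_vehicles": [0, 2, 3, 1, np.nan],
    "sex": ["F", "M", "F", np.nan, "M"],
    "state": ["Texas", "Ohio", "Texas", "Maine", "Ohio"],
})

def add_features(df):
    """Add two engineered features (formulas are assumptions)."""
    out = df.copy()
    # +1 avoids division by zero for customers without a vehicle.
    out["income_per_vehicle"] = out["income"] / (out["num_vehicles"] + 1)
    out["income_age_interaction"] = out["income"] * out["age"]
    return out

def cap_and_log(X):
    """Cap each column at its 1st/99th percentile, then apply log1p.

    Stateless sketch: caps are recomputed on whatever data is passed in.
    Assumes non-negative inputs.
    """
    X = np.asarray(X, dtype=float)
    lo, hi = np.percentile(X, [1, 99], axis=0)
    return np.log1p(np.clip(X, lo, hi))

numeric_cols = ["age", "income", "num_vehicles",
                "income_per_vehicle", "income_age_interaction"]
categorical_cols = ["sex", "state"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),    # median imputation
    ("cap_log", FunctionTransformer(cap_and_log)),   # outlier treatment
    ("scale", StandardScaler()),                     # standardization
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),  # mode imputation
    ("onehot", OneHotEncoder(handle_unknown="ignore")),   # one-hot encoding
])
preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

X = preprocess.fit_transform(add_features(raw))
print(X.shape)
```

Keeping the steps inside one pipeline means the imputation values and scaling statistics are learned from the training data only and then reused on new data.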

Models Implemented

  • Logistic Regression
  • Random Forest
  • XGBoost
  • Gradient Boosting
  • Neural Network (MLP)
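
As an illustration of how such a model lineup can be trained side by side, here is a minimal sketch using scikit-learn estimators on synthetic, imbalanced stand-in data. The hyperparameters and the data are assumptions, not the project's settings. XGBoost is omitted to keep the example dependency-free; its XGBClassifier follows the same fit/predict interface and could be added to the dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data with roughly 85% positives; not the project's dataset.
X, y = make_classification(
    n_samples=1000, n_features=15, weights=[0.15, 0.85], random_state=0
)
# Stratify so train and test keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Illustrative hyperparameters only.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Neural Network (MLP)": MLPClassifier(
        hidden_layer_sizes=(32,), max_iter=500, random_state=0
    ),
}

fitted = {}
for name, model in models.items():
    fitted[name] = model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.2f}")
```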

Model Performance

Model                 Accuracy   F1-Score   AUC
Logistic Regression   0.68       0.79       0.76
Random Forest         0.84       0.91       0.73
XGBoost               0.69       0.80       0.79
Gradient Boosting     0.90       0.95       0.80
Neural Network        0.23       0.27       0.47
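
For reference, the three metrics in the table can be computed with scikit-learn as sketched below. The data is synthetic and the model settings are assumptions, so the numbers it prints will not match the table. Accuracy and F1 are computed from hard class predictions, while AUC needs the predicted probability of the positive class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 85% positives; not the project's dataset.
X, y = make_classification(
    n_samples=1000, n_features=15, weights=[0.15, 0.85], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard labels (0 or 1)
y_score = model.predict_proba(X_test)[:, 1]  # probability of class 1

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),        # F1 for the positive class (1)
    "auc": roc_auc_score(y_test, y_score), # threshold-independent ranking quality
}
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

When the positive class dominates, F1 for that class can be high even for a weak model, which is one reason to read it alongside AUC.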

Best Model

Gradient Boosting achieved the best overall performance:

  • Highest accuracy (0.90)
  • Highest F1-score (0.95), indicating a strong balance between precision and recall
  • Highest AUC (0.80), despite the class imbalance

Business Impact

This model enables:

  • Targeted marketing campaigns
  • Reduction in marketing costs
  • Higher conversion rates
  • Better understanding of customer behavior
  • Improved inclusion of underserved groups

Model Deployment

  • Final model saved using joblib
  • Predictions generated on unseen test data
  • Output file: final_predictions.csv
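
The save-and-predict step might look like the following sketch. The model file name, the row_id column, and the synthetic data are hypothetical; only joblib and the final_predictions.csv output name come from the description above.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data; the last 50 rows play the role of unseen test data.
X, y = make_classification(n_samples=300, n_features=15, random_state=0)
X_train, y_train, X_new = X[:250], y[:250], X[250:]

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Save the fitted model to disk (hypothetical file name, temporary folder).
out_dir = Path(tempfile.mkdtemp())
model_path = out_dir / "final_model.joblib"
joblib.dump(model, model_path)

# Reload the model and predict on the unseen rows.
loaded = joblib.load(model_path)
predictions = pd.DataFrame({
    "row_id": range(len(X_new)),           # hypothetical identifier column
    "health_ins": loaded.predict(X_new),   # 1 = has insurance, 0 = no insurance
})

csv_path = out_dir / "final_predictions.csv"
predictions.to_csv(csv_path, index=False)
print(f"Wrote {len(predictions)} predictions to {csv_path}")
```

If preprocessing is part of a scikit-learn pipeline, saving the whole pipeline rather than the bare estimator keeps the transformations and the model in one file.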

Key Learnings

  • Feature engineering has a major impact on performance
  • Handling class imbalance is critical
  • Simpler models can be strong baselines
  • Of the models tested, a boosting ensemble (Gradient Boosting) performed best on this structured dataset

Tech Stack

  • Python
  • Pandas, NumPy
  • Scikit-learn
  • XGBoost
  • Matplotlib, Seaborn

Conclusion

This project demonstrates how machine learning can be applied to solve a real-world business problem by combining data analysis, feature engineering, and predictive modeling.

The final model provides both accurate predictions and actionable insights, making it valuable for strategic decision-making.

About

Predicting whether a customer has health insurance is an essential task for the company: it identifies potential customers and lets marketing resources be focused on the people most likely to subscribe.
