Project Title: Supervised Machine Learning Fundamentals - Explanatory and Predictive Modeling of Travel Insurance Data
This project analyzes travel insurance data to understand the factors influencing the purchase of a Travel Insurance Package. It involves data preprocessing, exploratory data analysis (EDA), statistical modeling, and machine learning techniques to develop predictive models for this classification task.
The primary objective is to identify business opportunities for the travel insurance package, specifically targeting customers who are likely to purchase the product.
Moreover, the project aims to:
- Explore the relationships between customer features.
- Perform statistical inference on the customer dataset.
- Identify the most significant predictors of Travel Insurance Package purchase.
- Build multiple machine learning models to provide the best prediction for Travel Insurance Package purchase.
Key findings:
- The strongest predictors of Travel Insurance Package purchase are Income and whether a customer has ever travelled abroad.
- Age, number of family members, and frequent-flyer status are also important features, as found in the correlation analysis and the logistic regression analysis.
- Among the tested and validated machine learning models, the Random Forest with hyperparameter tuning performed best.
- On a hold-out test set, this model reached an accuracy of 0.844 and a precision of 0.988.
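A precision close to 1 combined with a noticeably lower accuracy usually means that the model rarely flags a non-buyer as a buyer, while still missing some actual buyers. The plain-Python sketch below shows how accuracy, precision, and recall follow from confusion-matrix counts. The counts are hypothetical and chosen only to illustrate this pattern; they are not the project's actual confusion matrix.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp: int, fp: int) -> float:
    """Share of predicted buyers who actually bought."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of actual buyers that the model found."""
    return tp / (tp + fn)


# Hypothetical counts for illustration only (NOT the project's results):
# tp = buyers predicted as buyers, fp = non-buyers predicted as buyers,
# tn = non-buyers predicted as non-buyers, fn = buyers predicted as non-buyers.
tp, fp, tn, fn = 90, 2, 240, 68

print(f"accuracy:  {accuracy(tp, fp, tn, fn):.3f}")
print(f"precision: {precision(tp, fp):.3f}")
print(f"recall:    {recall(tp, fn):.3f}")
```

With counts like these, nearly every customer the model targets is a real buyer, which suits a campaign where contacting uninterested customers is costly. The lower recall shows the trade-off: some likely buyers are not targeted.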
To set up this project locally:
- Clone the repository:
git clone https://github.com/razzf/insurance-data-machine-learning
- Navigate to the project directory:
cd insurance-data-machine-learning
- Install required packages:
Ensure Python is installed and use the following command:
pip install -r requirements.txt
Open the notebook in Jupyter or JupyterLab to explore the analysis. Execute the cells sequentially to understand the workflow, from data exploration to model building and evaluation.
The dataset is located in the /data directory and was originally obtained from Kaggle. It contains customer data from a Tour & Travels Company that offers a Travel Insurance Package. The insurance was offered to some of the customers in 2019, and the data was extracted from the performance/sales of the package during that period. It covers 1,987 customers with 8 features (e.g. Age, Employment Type, Income, whether the customer has ever travelled abroad or is a frequent flyer) and one target variable indicating whether the Travel Insurance Package was bought. Beyond the Kaggle listing, the origin of the data is unknown.
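As a quick sanity check before opening the notebook, the dataset can be inspected with the standard library alone. The sketch below is not taken from the notebook. It assumes the script is run from the project root, and the target column name `TravelInsurance` is an assumption that should be checked against the printed column list.

```python
import csv
from collections import Counter
from pathlib import Path

# Relative path to the dataset; assumes the working directory is the project root.
DATA_PATH = Path("data") / "TravelInsurancePrediction.csv"


def load_rows(path):
    """Read the CSV into a list of dicts, one dict per customer row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def value_counts(rows, column):
    """Count how often each value occurs in the given column."""
    return Counter(row[column] for row in rows)


if __name__ == "__main__" and DATA_PATH.exists():
    rows = load_rows(DATA_PATH)
    if rows:
        print(f"{len(rows)} rows, columns: {list(rows[0].keys())}")
        # "TravelInsurance" is an ASSUMED name for the target column;
        # adjust it if the printed column list shows a different name.
        target = "TravelInsurance"
        if target in rows[0]:
            print(value_counts(rows, target))
```

The row count should match the number of customers stated above, and the class counts give a first impression of how balanced the purchase variable is.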
project-root/
├── custom_modules/
│ ├── plotting.py # Module for plotting
│ └── stat_calculations.py # Module for calculations
├── data/
│ └── TravelInsurancePrediction.csv # Dataset for analysis
├── notebook/
│ └── data_analysis_and_modeling.ipynb # Jupyter notebook for analysis
├── requirements.txt # Package dependencies
└── README.md # Project documentation
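Because the notebook lives in notebook/ while the helper modules live in custom_modules/, imports such as `from custom_modules import plotting` only resolve if the project root is on Python's import path. How the notebook handles this is not stated here. One possible approach, shown as a sketch, is to walk up from the current directory until `requirements.txt` is found and add that directory to `sys.path`. The helper name `add_project_root` is hypothetical and not part of the repository.

```python
import sys
from pathlib import Path
from typing import Optional


def add_project_root(start: Optional[Path] = None,
                     marker: str = "requirements.txt") -> Path:
    """Find the first directory at or above `start` that contains `marker`
    and put it at the front of sys.path (hypothetical helper)."""
    current = (start or Path.cwd()).resolve()
    for candidate in (current, *current.parents):
        if (candidate / marker).is_file():
            root = str(candidate)
            if root not in sys.path:
                sys.path.insert(0, root)
            return candidate
    raise FileNotFoundError(
        f"No directory containing {marker!r} found at or above {current}"
    )


# Possible usage from a notebook cell inside notebook/ (commented out so the
# sketch runs anywhere):
# add_project_root()
# from custom_modules import plotting, stat_calculations
```

Launching Jupyter from the project root and setting the working directory accordingly would be an alternative that avoids touching `sys.path`.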
The requirements.txt file lists all Python dependencies. Install them using the command provided above.
The notebook includes the following sections:
- Introduction
- Project Discovery
- Objectives
- Problem Definition
- Data acquisition and preparation
- Data cleaning
- Exploratory Data Analysis
- Statistical Inference, modeling and evaluation
- Machine Learning modeling and evaluation
- Suggestions for improvement