This project aims to help Assur'Aimant, a French insurer, estimate insurance premiums for its expansion in the United States. Currently, manual premium estimation is costly and time-consuming. This project uses machine learning to predict premiums based on customer demographics.
Assur'Aimant wants to modernize its insurance premium estimation process for the US market. We were commissioned to develop an AI solution capable of accurately predicting premiums based on customer characteristics. This project includes exploratory data analysis (EDA) and the construction of a predictive model.
The data collected from Assur'Aimant in Houston includes the following information:
- BMI: Body Mass Index (18.5 - 24.9 ideally).
- Sex: Gender of the subscriber (male or female).
- Age: Age of the primary beneficiary.
- Children: Number of dependent children covered by insurance.
- Smoker: Smoking status (smoker or non-smoker).
- Region: Region of residence in the United States (Northeast, Southeast, Southwest, Northwest).
- Charges: Insurance premium billed (target variable).
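As a quick sanity check on the fields listed above, the sketch below verifies that a DataFrame exposes the expected columns and casts the text fields to a categorical dtype. The lowercase column names and the two sample rows are assumptions made for illustration; they are not taken from the Assur'Aimant data.

```python
import pandas as pd

# Assumed column names: lowercase versions of the fields listed above.
EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]
CATEGORICAL = ["sex", "smoker", "region"]


def check_schema(df: pd.DataFrame) -> list[str]:
    """Return the expected columns that are missing from df."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Two made-up rows, for illustration only (not real customer data).
sample = pd.DataFrame(
    {
        "age": [31, 52],
        "sex": ["female", "male"],
        "bmi": [22.4, 30.1],
        "children": [0, 2],
        "smoker": ["no", "yes"],
        "region": ["southwest", "northeast"],
        "charges": [4200.0, 27500.0],
    }
)

missing = check_schema(sample)
print("Missing columns:", missing)

# Cast the text columns to pandas' categorical dtype ahead of encoding.
sample = sample.astype({col: "category" for col in CATEGORICAL})
print(sample.dtypes)
```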
- Exploratory Data Analysis (EDA): Understanding data, identifying trends, outliers and relationships between variables. This includes:
  - Missing and duplicate values check (with `missingno`).
  - Outlier detection.
  - Univariate and bivariate analysis.
  - Correlation analysis.
  - Hypothesis validation with statistical tests.
  - Visualizations with `seaborn` (box plots, violin plots, etc.).
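The EDA checks listed above can be sketched with plain `pandas` on synthetic data. This is a minimal illustration and not the notebooks' actual code: the data, the charge formula, the lowercase column names, the 1.5 × IQR outlier rule, and the choice of Welch's t-test (through `scipy`, which is not in the project's listed dependencies) are all assumptions. Plots with `seaborn` and `missingno` are left out so the block stays focused on the numeric checks.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (illustrative only, not the Assur'Aimant dataset).
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame(
    {
        "age": rng.integers(18, 65, size=n),
        "bmi": rng.normal(30, 6, size=n).round(1),
        "children": rng.integers(0, 5, size=n),
        "smoker": rng.choice(["yes", "no"], size=n, p=[0.2, 0.8]),
    }
)
# Made-up charge formula in which smokers pay more.
df["charges"] = (
    250 * df["age"]
    + 300 * df["bmi"]
    + 20000 * (df["smoker"] == "yes")
    + rng.normal(0, 2000, size=n)
)
# Inject one duplicate row so the duplicate check has something to find.
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)

# 1. Missing and duplicate values.
missing_per_column = df.isna().sum()
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)


# 2. Outlier detection with the 1.5 * IQR rule.
def iqr_outliers(series: pd.Series) -> pd.Series:
    """Boolean mask of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)


bmi_outliers = iqr_outliers(df["bmi"])

# 3. Correlation between the numeric variables.
corr = df.select_dtypes("number").corr()

# 4. Bivariate analysis: mean charges by smoking status.
mean_by_smoker = df.groupby("smoker")["charges"].mean()

# 5. Hypothesis test: do smokers and non-smokers have different mean charges?
smokers = df.loc[df["smoker"] == "yes", "charges"]
non_smokers = df.loc[df["smoker"] == "no", "charges"]
t_stat, p_value = stats.ttest_ind(smokers, non_smokers, equal_var=False)

print("Missing values per column:", missing_per_column.to_dict())
print("Duplicate rows removed:", n_duplicates)
print("BMI outliers flagged:", int(bmi_outliers.sum()))
print(corr["charges"].sort_values(ascending=False))
print(mean_by_smoker)
print("Welch t-test p-value:", p_value)
```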
- Predictive Modeling: Building a machine learning model to predict insurance premiums. This includes:
  - Creation of a baseline model (Dummy Model).
  - Data separation (80% training, 20% test).
  - Data preparation (logarithmic transformation if necessary, management of `random_state` and `seed`).
  - Model selection (`sklearn`: Linear Regression, Lasso, Ridge, ElasticNet, or whichever model performs best).
  - Model evaluation (R², RMSE).
  - Pre-processing (standardization, encoding of categorical variables with `sklearn.pipeline.Pipeline`).
  - Optimization (`PolynomialFeatures`, `GridSearchCV`, `RandomizedSearchCV`).
  - Analysis and interpretation of results (importance of variables).
- Streamlit Application: Develop an interactive application allowing:
  - User data entry.
  - Real-time insurance charge prediction.
  - Use of a pre-trained model exported in `.pkl`.
  - Integration of pre-processing pipelines.
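The export and reload step can be sketched with the standard `pickle` module. The tiny training set below is made up, the feature subset and column names are assumptions, and the file is written to a temporary directory; the repository's real `model/model.pkl` may have been produced differently. Because the whole `Pipeline` is pickled, the pre-processing travels with the model and raw user input can be passed straight to `predict`.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up training set (illustrative values, not real premiums).
train = pd.DataFrame(
    {
        "age": [19, 28, 33, 45, 52, 61],
        "bmi": [27.9, 33.0, 22.7, 30.2, 25.8, 29.1],
        "smoker": ["yes", "no", "no", "yes", "no", "yes"],
        "charges": [16900.0, 4400.0, 5200.0, 39000.0, 11000.0, 47000.0],
    }
)
features = ["age", "bmi", "smoker"]

# Pre-processing and model bundled in a single pipeline.
pipeline = Pipeline(
    [
        (
            "preprocess",
            ColumnTransformer(
                [
                    ("num", StandardScaler(), ["age", "bmi"]),
                    ("cat", OneHotEncoder(handle_unknown="ignore"), ["smoker"]),
                ]
            ),
        ),
        ("model", LinearRegression()),
    ]
)
pipeline.fit(train[features], train["charges"])

# Export the fitted pipeline to a .pkl file, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    with model_path.open("wb") as fh:
        pickle.dump(pipeline, fh)
    with model_path.open("rb") as fh:
        loaded = pickle.load(fh)

# One row shaped like the form input an app would collect from the user.
user_input = pd.DataFrame([{"age": 40, "bmi": 26.5, "smoker": "no"}])
prediction = float(loaded.predict(user_input)[0])
print(f"Predicted charge: {prediction:.2f}")
```

In a Streamlit app, the values in `user_input` would typically come from widgets such as `st.number_input` and `st.selectbox`. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.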
- Python
- `pandas`, `numpy`
- `scikit-learn`
- `seaborn`, `missingno`
- `streamlit`
- `app.py` - Streamlit application
- `notebooks/data_cleaning.ipynb` - Data cleaning
- `notebooks/data_analysis.ipynb` - Exploratory Data Analysis (EDA)
- `notebooks/data_model.ipynb` - Model development and testing (model used for the Streamlit app)
- `notebooks/analysis_vk.ipynb` - Cleaning / EDA / model building and testing
- `model/model.pkl` - Exported trained model
- `README.md` - This file
- `requirements.txt` - Dependencies and packages
- `asset` - Folder containing some figures from the analysis
Follow these steps to execute the project:
- Ensure Python is installed on your system.
- Clone this repository to your local machine: `git clone https://github.com/MichAdebayo/simplon_insurance_price_prediction.git`
- Navigate to the project directory: `cd simplon_insurance_price_prediction`
- Install the required dependencies: `pip install -r requirements.txt`
- Run the Streamlit application: `streamlit run app.py`

If you only want to test the app without cloning the repo, you can do so using this link. This is possible because the application has been deployed on Streamlit Cloud.
graph TD
A[Load data] --> B(Exploratory Data Analysis);
B --> C{Train Model};
C -- Linear Regression --> D[Evaluation];
C -- Lasso --> D;
C -- Linear SVR --> D;
C -- ElasticNet --> D;
D --> E[Optimization];
E --> F[Streamlit Application];
F --> G[Deployment on Cloud];