🫀 CardioAI – Cardiovascular Disease Predictor

Advanced Machine Learning Pipeline for CVD Prediction

An end-to-end ML pipeline combining predictive analytics with a modern web interface for cardiovascular disease risk assessment.

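🌍 Live Demo: https://cardioai.vercel.app/ • 📂 GitHub Repository: https://github.com/jaypatel342005/Cardiovascular-Disease-Predictor • 💬 Contact: pateljay97378@gmail.com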

📋 Quick Navigation

| Section | Link |
|---|---|
| Overview | 🌟 Jump to Overview |
| Features | ✨ Jump to Features |
| Tech Stack | 🛠️ Jump to Tech Stack |
| Getting Started | 🚀 Jump to Getting Started |
| Installation | 📦 Jump to Installation |

🌟 Overview

CardioAI is an end-to-end cardiovascular disease prediction system that combines:

🧬 Rigorous ML Pipeline – Data preprocessing, feature engineering, model training, and optimization
🤖 Ensemble Learning – 10 classification algorithms compared, with hyperparameter tuning of the top candidates
📊 Performance – ~73.6% cross-validated accuracy (73.1% on the held-out test set) with balanced precision and recall
🌐 Modern Web Interface – Clean, responsive Next.js UI for instant predictions
☁️ Cloud Deployment – Hosted on Vercel for global accessibility

⚠️ Disclaimer: This project is for educational and research purposes only. It is not a medical device and must not be used for clinical decision-making. Always consult qualified healthcare professionals.

🎯 Features at a Glance

🧮 Machine Learning

✅ End-to-end preprocessing pipeline
✅ 10 baseline classification models
✅ XGBoost ensemble optimization
✅ 5-fold cross-validation tuning
✅ Confusion matrix & F1 analysis
✅ Model serialization (.pkl)

🌐 Web Application

✅ Interactive patient data input form
✅ Real-time risk predictions
✅ Probability scoring system
✅ Responsive design (mobile-friendly)
✅ Vercel cloud deployment
✅ Fast inference (< 100ms)

🧠 How It Works

┌─────────────────────────────────────────────────────────────┐
│  1. DATA PREPARATION                                        │
│  • Load: 68,641 patient records                             │
│  • Features: 13 clinical & lifestyle metrics                │
│  • Train-Test Split: 80% / 20%                              │
└─────────────────┬───────────────────────────────────────────┘
                  ↓
┌─────────────────────────────────────────────────────────────┐
│  2. PREPROCESSING & SCALING                                 │
│  • StandardScaler normalization                             │
│  • Feature consistency across train & test                  │
└─────────────────┬───────────────────────────────────────────┘
                  ↓
┌─────────────────────────────────────────────────────────────┐
│  3. MODEL EXPERIMENTATION                                   │
│  • 10 baseline classifiers evaluated                        │
│  • XGBoost emerges as top performer (73.25%)                │
└─────────────────┬───────────────────────────────────────────┘
                  ↓
┌─────────────────────────────────────────────────────────────┐
│  4. HYPERPARAMETER TUNING                                   │
│  • GridSearchCV with 5-fold CV                              │
│  • Final XGBoost: 73.63% CV accuracy                        │
└─────────────────┬───────────────────────────────────────────┘
                  ↓
┌─────────────────────────────────────────────────────────────┐
│  5. DEPLOYMENT & INFERENCE                                  │
│  • Model exported as .pkl                                   │
│  • Integrated into Next.js web app                          │
│  • Live at https://cardioai.vercel.app/                     │
└─────────────────────────────────────────────────────────────┘
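
Steps 1 and 2 above can be sketched as follows. This is a minimal illustration rather than the project's notebook code: it uses random synthetic data with the dataset's shape instead of `cardio_cleaned_week2.csv`, and it assumes scikit-learn's `train_test_split` with `random_state=42` (the seed listed in the workflow later in this README). The variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the dataset's shape: 68,641 rows, 13 features
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(68_641, 13))
y = rng.integers(0, 2, size=68_641)

# 80% / 20% split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then reuse it for the test rows
# so that no test-set statistics leak into preprocessing
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)  # (54912, 13) (13729, 13)
```

Fitting the scaler after the split is what keeps "feature consistency across train & test" honest: the test rows are transformed with the training mean and standard deviation, exactly as new patient data would be at inference time.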

🧪 Dataset & Features

📊 Dataset Overview

| Property | Value |
|---|---|
| Dataset Name | `cardio_cleaned_week2.csv` |
| Total Records | 68,641 patients |
| Features | 13 clinical & lifestyle metrics |
| Target | Binary (0 = No CVD, 1 = CVD Present) |
| Train Set | 54,912 samples (80%) |
| Test Set | 13,729 samples (20%) |

🔍 Features Used

| Feature | Type | Description |
|---|---|---|
| `gender` | Categorical | Biological sex (encoded) |
| `age_years` | Numerical | Age in years |
| `height` | Numerical | Height in cm |
| `weight` | Numerical | Weight in kg |
| `bmi` | Numerical | Body Mass Index |
| `ap_hi` | Numerical | Systolic blood pressure (mmHg) |
| `ap_lo` | Numerical | Diastolic blood pressure (mmHg) |
| `MAP` | Numerical | Mean Arterial Pressure |
| `cholesterol` | Categorical | Cholesterol level (1-3) |
| `gluc` | Categorical | Glucose level (1-3) |
| `smoke` | Binary | Smoking status (0/1) |
| `alco` | Binary | Alcohol consumption (0/1) |
| `active` | Binary | Physical activity (0/1) |
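
The two derived features, `bmi` and `MAP`, can be computed from the raw measurements. The sketch below assumes the standard BMI formula and the common approximation of mean arterial pressure (diastolic plus one third of the pulse pressure). The README does not state which formulas the project uses, but these reproduce the sample patient values shown in the inference example (34.93 and 106.67). The function names are hypothetical.

```python
def body_mass_index(weight_kg: float, height_cm: float) -> float:
    """BMI: weight in kilograms divided by the square of height in metres."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def mean_arterial_pressure(ap_hi: float, ap_lo: float) -> float:
    """Approximate MAP: diastolic pressure plus one third of the pulse pressure."""
    return ap_lo + (ap_hi - ap_lo) / 3.0


# Sample patient from this README: 156 cm, 85 kg, 140/90 mmHg
bmi = body_mass_index(weight_kg=85.0, height_cm=156.0)
map_value = mean_arterial_pressure(ap_hi=140.0, ap_lo=90.0)

print(f"bmi={bmi:.2f}, MAP={map_value:.2f}")  # bmi=34.93, MAP=106.67
```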

📊 Model Performance & Results

🏆 Baseline Model Comparison

| Rank | Algorithm | Test Accuracy | Macro F1 | Status |
|---|---|---|---|---|
| 🥇 | XGBoost | 73.25% | 0.73 | Selected |
| 🥈 | Random Forest | 73.19% | 0.73 | Alternative |
| 🥉 | Stacking | 73.18% | 0.73 | Backup |
| 4 | Gradient Boosting | 73.14% | 0.73 | - |
| 5 | Decision Tree | 72.60% | 0.72 | - |
| 6 | Calibrated Classifier | 72.44% | 0.72 | - |
| 7 | Logistic Regression | 72.34% | 0.72 | - |
| 8 | Linear SVM | 72.30% | 0.72 | - |
| 9 | Gaussian Naïve Bayes | 71.48% | 0.71 | - |
| 10 | KNN | 70.70% | 0.71 | - |

🎯 Final XGBoost Performance

CONFUSION MATRIX (Test Set)
─────────────────────────────
                  Predicted
                No CVD | CVD
               ────────────────
Actual No CVD |  5,352 | 1,633  → 76.6% True Negative Rate
Actual CVD    |  2,039 | 4,705  → 69.8% True Positive Rate
               ────────────────

📈 KEY METRICS
├─ Accuracy: 73.12%
├─ Macro F1: 0.73
├─ Precision (CVD class): 0.74
└─ Recall (CVD class): 0.70
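
As a cross-check, the metrics can be recomputed directly from the four counts printed in the confusion matrix. This is plain arithmetic with no project code involved. Recomputing gives a true negative rate of about 76.6% and an accuracy of about 73.25%. The latter matches the baseline XGBoost figure rather than the 73.12% listed for the tuned model, so the printed matrix may come from the baseline run.

```python
# Counts as printed in the confusion matrix (test set)
tn, fp = 5352, 1633  # actual No CVD: predicted No CVD, predicted CVD
fn, tp = 2039, 4705  # actual CVD:    predicted No CVD, predicted CVD

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
specificity = tn / (tn + fp)   # true negative rate
recall = tp / (tp + fn)        # true positive rate (sensitivity)
precision = tp / (tp + fp)     # precision for the CVD class

# Per-class F1 scores, then their unweighted mean (macro F1)
f1_cvd = 2 * tp / (2 * tp + fp + fn)
f1_no_cvd = 2 * tn / (2 * tn + fn + fp)
macro_f1 = (f1_cvd + f1_no_cvd) / 2

print(f"samples:     {total}")
print(f"accuracy:    {accuracy:.2%}")
print(f"specificity: {specificity:.1%}")
print(f"recall:      {recall:.1%}")
print(f"precision:   {precision:.2f}")
print(f"macro F1:    {macro_f1:.2f}")
```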

🔬 Hyperparameter Tuning Results

XGBoost (Final Champion)

Best Parameters:
  • learning_rate: 0.05
  • max_depth: 3
  • n_estimators: 200
  • min_child_weight: 5
  • gamma: 0
  • reg_alpha: 0

Performance:
  • CV Accuracy: 73.63% ✓
  • Test Accuracy: 73.12%
  • CV-to-Test Gap: 0.51 percentage points (small, suggesting little overfitting)
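
A grid search of this kind might look like the sketch below. It is an illustration under stated assumptions, not the project's tuning code: it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost (so the example needs only scikit-learn), a small synthetic dataset, and a deliberately reduced grid so it runs in seconds. The `n_estimators` value of 50 is chosen for speed and is not one of the values searched in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic binary dataset with 13 features (stand-in for the real data)
X, y = make_classification(
    n_samples=400, n_features=13, n_informative=6, random_state=42
)

# Reduced grid; the project's grid covers more values and more parameters
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
    "n_estimators": [50],
}

# 5-fold cross-validation, scored by accuracy
search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.4f}")
```

`best_score_` is the mean accuracy across the five validation folds for the best parameter combination, which is the kind of number reported above as "CV Accuracy". The test set plays no part in the search.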

🛠️ Tech Stack

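Machine Learning & Data Science

Python • scikit-learn • XGBoost • pandas • NumPy • Jupyter Notebook

Frontend & Web

Next.js • React

Deployment & DevOps

Vercel

Development Tools

Git • pip / venv • Node.js / npm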

📦 Installation

✅ Prerequisites

Before you begin, ensure you have:

Python 3.9+
pip or conda package manager
Node.js 16+ (for frontend modifications)
Git for version control

Step 1️⃣ – Clone Repository

git clone https://github.com/jaypatel342005/Cardiovascular-Disease-Predictor.git
cd Cardiovascular-Disease-Predictor

Step 2️⃣ – Setup Python Environment

# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows:
venv\Scripts\activate

# macOS / Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Step 3️⃣ – Setup Frontend (Optional)

If you want to run or modify the Next.js frontend locally:

cd frontend
npm install

Step 4️⃣ – Verify Installation

# Check Python packages (pip show works on Windows, macOS, and Linux)
pip show scikit-learn xgboost pandas

# Check Node (if frontend setup)
node --version
npm --version

🚀 Getting Started

🧪 Option A: Run ML Experiments Locally

Perfect for understanding the ML pipeline!

# Navigate to notebooks
cd notebooks

# Start Jupyter
jupyter notebook

# Open these notebooks in order:
# 1. 01_data_preprocessing.ipynb
# 2. 02_model_baselines.ipynb
# 3. 03_hyperparameter_tuning.ipynb
# 4. 04_evaluation_and_export.ipynb

Each notebook includes:

📝 Detailed comments explaining each step
📊 Data visualizations
🧮 Model training & evaluation
💾 Model export to .pkl files

🔮 Option B: Local Inference Example

Use pre-trained models for predictions:

import pickle
import numpy as np

# Load trained model + scaler (scikit-learn and xgboost must be installed
# for unpickling; only load .pkl files from sources you trust)
with open("models/best_xgboost_cvd_model.pkl", "rb") as f:
    model, scaler = pickle.load(f)

# Sample patient data (13 features, in the same column order the scaler was fitted on)
patient_data = np.array([[
    1.0,      # gender: 1 (female)
    156.0,    # height: 156 cm
    85.0,     # weight: 85 kg
    140.0,    # ap_hi: 140 mmHg (systolic)
    90.0,     # ap_lo: 90 mmHg (diastolic)
    3.0,      # cholesterol: 3
    1.0,      # gluc: 1
    0.0,      # smoke: 0
    0.0,      # alco: 0
    1.0,      # active: 1
    55.38,    # age_years: 55.38 years
    34.93,    # bmi: 34.93
    106.67    # MAP: 106.67
]])

# Preprocess & predict
patient_scaled = scaler.transform(patient_data)
prediction = model.predict(patient_scaled)[0]
probability = model.predict_proba(patient_scaled)[0]

print(f"CVD Risk: {'HIGH RISK ⚠️' if prediction == 1 else 'LOW RISK ✓'}")
print(f"Probability of CVD: {probability[1]:.2%}")
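
Because the model receives a bare array, the column order matters. One way to reduce ordering mistakes is to build the row from a dictionary, as in the sketch below. The `FEATURE_ORDER` list is an assumption inferred from the comments in the sample array above, and the helper name is hypothetical. Verify the order against the training notebooks before relying on it.

```python
# Assumed column order, inferred from the sample array in this README
FEATURE_ORDER = [
    "gender", "height", "weight", "ap_hi", "ap_lo", "cholesterol", "gluc",
    "smoke", "alco", "active", "age_years", "bmi", "MAP",
]


def to_feature_row(patient: dict) -> list[list[float]]:
    """Return a single-row, 2-D list of floats in FEATURE_ORDER."""
    missing = [name for name in FEATURE_ORDER if name not in patient]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [[float(patient[name]) for name in FEATURE_ORDER]]


patient = {
    "gender": 1, "age_years": 55.38, "height": 156, "weight": 85,
    "ap_hi": 140, "ap_lo": 90, "cholesterol": 3, "gluc": 1,
    "smoke": 0, "alco": 0, "active": 1, "bmi": 34.93, "MAP": 106.67,
}
row = to_feature_row(patient)
print(row)
```

The resulting `row` can be passed to `scaler.transform(...)` in place of the hand-written array.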

🌐 Option C: Run Web App Locally

cd frontend
npm run dev

# Open browser: http://localhost:3000

Features:

📋 Interactive form to enter patient metrics
⚡ Real-time predictions via API
📊 Risk score visualization
📱 Fully responsive design

☁️ Option D: Use Live Demo

No setup needed!

👉 Open the CardioAI web app: https://cardioai.vercel.app/

📂 Project Structure

Cardiovascular-Disease-Predictor/
│
├── 📊 data/
│   └── cardio_cleaned_week2.csv              # 68,641 patient records
│
├── 📓 notebooks/                             # Jupyter notebooks (ML pipeline)
│   ├── 01_data_preprocessing.ipynb
│   ├── 02_model_baselines.ipynb
│   ├── 03_hyperparameter_tuning.ipynb
│   └── 04_evaluation_and_export.ipynb
│
├── 🤖 models/                                # Trained model artifacts
│   ├── best_xgboost_cvd_model.pkl           # ⭐ Final XGBoost + scaler
│   └── cardio_model_week3.pkl               # Alternative RandomForest + scaler
│
├── 🐍 src/                                   # Python utilities
│   ├── preprocessing.py
│   ├── train.py
│   ├── evaluate.py
│   └── __init__.py
│
├── 🌐 frontend/                              # Next.js React application
│   ├── pages/
│   │   ├── index.js                         # Main prediction page
│   │   └── api/
│   │       └── predict.js                   # ML inference endpoint
│   ├── components/
│   │   ├── PredictionForm.jsx
│   │   ├── ResultCard.jsx
│   │   └── RiskGauge.jsx
│   ├── styles/
│   ├── public/
│   ├── package.json
│   └── next.config.js
│
├── 📋 requirements.txt                       # Python dependencies
├── ✨ README.md                              # This file
├── 📄 LICENSE                                # MIT License
└── 🔗 .gitignore                             # Git ignore rules

🧠 Technical Deep Dive

ML Workflow Pipeline

DATA INGESTION
    ↓
    ├─ Load: cardio_cleaned_week2.csv
    ├─ Shape: (68,641, 13)
    └─ Target distribution checked
    ↓
EXPLORATORY DATA ANALYSIS (EDA)
    ↓
    ├─ Missing value detection
    ├─ Feature statistics (mean, std, range)
    ├─ Class imbalance analysis
    └─ Correlation heatmap
    ↓
PREPROCESSING & FEATURE ENGINEERING
    ↓
    ├─ Drop: id, age, bmi_cat columns
    ├─ Standardize: StandardScaler()
    │   └─ Fitted on training data only (after the split below)
    ├─ Handle outliers if necessary
    └─ Feature consistency validation
    ↓
DATA SPLITTING
    ↓
    ├─ Train: 80% (54,912 samples)
    ├─ Test: 20% (13,729 samples)
    └─ Random seed: 42 (reproducibility)
    ↓
MODEL EXPERIMENTATION (Baseline Phase)
    ↓
    ├─ LogisticRegression ........................ 72.34%
    ├─ SVM (LinearSVC) ........................... 72.30%
    ├─ K-Nearest Neighbors ....................... 70.70%
    ├─ Decision Tree ............................. 72.60%
    ├─ Gaussian Naïve Bayes ...................... 71.48%
    ├─ Calibrated Classifier ..................... 72.44%
    ├─ Random Forest ............................. 73.19% ← Good
    ├─ Gradient Boosting ......................... 73.14% ← Good
    ├─ Stacking ................................. 73.18% ← Good
    └─ XGBoost .................................. 73.25% ← BEST ⭐
    ↓
HYPERPARAMETER TUNING (Optimization Phase)
    ↓
    ├─ Selected: XGBoost & RandomForest
    ├─ Method: GridSearchCV
    ├─ CV Strategy: 5-Fold cross-validation
    ├─ Metric: Accuracy + Macro F1
    ├─ Search Space:
    │   ├─ learning_rate: [0.01, 0.05, 0.1]
    │   ├─ max_depth: [3, 5, 7]
    │   ├─ n_estimators: [100, 200, 300]
    │   └─ ... (other params)
    └─ Results:
        ├─ XGBoost CV Accuracy: 73.63% ✓ SELECTED
        └─ RandomForest CV Accuracy: 73.50%
    ↓
FINAL EVALUATION
    ↓
    ├─ Confusion Matrix
    ├─ Classification Report
    ├─ ROC-AUC Analysis
    ├─ Feature Importance Plot
    └─ Cross-validation curves
    ↓
MODEL EXPORT & DEPLOYMENT
    ↓
    ├─ Serialize: best_xgboost_cvd_model.pkl
    ├─ Include: Model + fitted StandardScaler
    ├─ Integrate: Next.js API route
    └─ Deploy: Vercel cloud
    ↓
PRODUCTION INFERENCE
    ↓
    └─ User submits patient data → 
       Predictions served in <100ms

Why XGBoost Won

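| Criterion | XGBoost | RandomForest | Winner |
|---|---|---|---|
| Baseline Accuracy | 73.25% | 73.19% | 🎯 XGBoost |
| CV Accuracy (Tuned) | 73.63% | 73.50% | 🎯 XGBoost |
| CV-to-Test Gap (lower is better) | 0.51 pts | 0.31 pts | RandomForest |
| Training Speed | ⚡ Fast | ⚡⚡ Faster | RandomForest |
| Interpretability | 📊 Good | 📊📊 Better | RandomForest |

Decision: XGBoost selected for its higher baseline and cross-validated accuracy. RandomForest shows a slightly smaller CV-to-test gap, but both gaps are well under one percentage point.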
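
The generalization figures in the comparison can be reproduced by reading them as tuned CV accuracy minus test accuracy. The sketch below uses the accuracies reported in this README. Using 73.19% as the RandomForest test accuracy is an assumption (it is the baseline figure, and it is the value that yields the 0.31 shown). Under this reading, a smaller gap means less drop from cross-validation to the test set.

```python
# Accuracies in percent, as reported in this README
results = {
    "XGBoost": {"cv": 73.63, "test": 73.12},
    "RandomForest": {"cv": 73.50, "test": 73.19},  # test value is an assumption
}

# Gap between cross-validated and test accuracy, in percentage points
gaps = {name: round(r["cv"] - r["test"], 2) for name, r in results.items()}

print(gaps)  # {'XGBoost': 0.51, 'RandomForest': 0.31}
```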

🔄 Model Persistence & Inference

How Models Are Saved

import pickle

# During training:
model_package = (trained_model, fitted_scaler)
with open("best_xgboost_cvd_model.pkl", "wb") as f:
    pickle.dump(model_package, f)

How Models Are Loaded

import pickle

# During inference:
with open("best_xgboost_cvd_model.pkl", "rb") as f:
    model, scaler = pickle.load(f)

# Predict
features_scaled = scaler.transform(new_data)
prediction = model.predict(features_scaled)
probability = model.predict_proba(features_scaled)

📈 Project Insights & Learnings

✅ What Worked Well

Ensemble Methods Dominate: The top-performing models (XGBoost, RandomForest, Stacking) all use ensemble learning, suggesting this dataset benefits from combining many learners.
Broad Exploration Pays Off: Testing 10 baseline models quickly established that ensembles performed best, so tuning effort could be focused on the top candidates (XGBoost, RandomForest).
CV-Driven Tuning: GridSearchCV with 5-fold CV selected parameters on validation folds rather than on the test set, keeping the test set as an unbiased final check.
Minimal Overfitting: XGBoost's train accuracy (75.19%) vs. test accuracy (73.25%) leaves a gap of about 1.9 percentage points, which indicates good generalization.

🎯 Performance Plateau (~73%)

All top models converged around 73% accuracy, suggesting:

✅ Current feature set captures key CVD patterns
⚠️ Performance ceiling may be limited by available features
🔬 Next step: Feature engineering, not algorithm switching

🚀 Future Optimization Opportunities

Advanced Feature Engineering
- Polynomial features (age², weight/height ratio)
- Interaction terms (age × blood pressure)
- Domain-driven scores (Framingham Risk Score)
Data Enrichment
- Additional lab metrics (LDL, HDL, triglycerides)
- Family history, medication data
- Time-series / longitudinal records
Explainability Integration
- SHAP values for feature importance
- LIME for local explanations
- Risk factor breakdown in UI
Cost-Sensitive Learning
- Misclassifying CVD patients (FN) is more costly than raising false alarms (FP)
- Adjust class weights in model training
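
To illustrate the cost-sensitive idea, the sketch below picks a decision threshold that minimizes total misclassification cost when false negatives are weighted more heavily than false positives. This is threshold selection on predicted probabilities, a simpler alternative to the class-weight adjustment listed above (for XGBoost, the `scale_pos_weight` parameter would be the training-time counterpart). The labels, probabilities, and cost values are made up for illustration.

```python
def total_cost(y_true, probs, threshold, fn_cost=5.0, fp_cost=1.0):
    """Sum misclassification costs at a given decision threshold."""
    cost = 0.0
    for label, p in zip(y_true, probs):
        predicted = 1 if p >= threshold else 0
        if label == 1 and predicted == 0:
            cost += fn_cost  # missed CVD case
        elif label == 0 and predicted == 1:
            cost += fp_cost  # false alarm
    return cost


def best_threshold(y_true, probs, candidates, fn_cost=5.0, fp_cost=1.0):
    """Return the candidate threshold with the lowest total cost."""
    return min(
        candidates,
        key=lambda t: total_cost(y_true, probs, t, fn_cost, fp_cost),
    )


# Toy data: true labels and predicted P(CVD)
y_true = [0, 0, 0, 1, 1, 1]
probs = [0.10, 0.35, 0.45, 0.40, 0.60, 0.80]
candidates = [0.3, 0.5]

equal = best_threshold(y_true, probs, candidates, fn_cost=1.0, fp_cost=1.0)
weighted = best_threshold(y_true, probs, candidates, fn_cost=5.0, fp_cost=1.0)
print(equal, weighted)  # 0.5 0.3
```

With equal costs the usual 0.5 threshold wins. Once a missed CVD case costs five times a false alarm, the lower threshold becomes preferable because it catches the patient scored at 0.40.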

🤝 Contributing

We welcome contributions! Here's how:

1. Fork & Clone

git clone https://github.com/YOUR_USERNAME/Cardiovascular-Disease-Predictor.git
cd Cardiovascular-Disease-Predictor

2. Create Feature Branch

git checkout -b feature/your-feature-name

3. Make Changes & Commit

git add .
git commit -m "Add: Your clear commit message"

4. Push & Create Pull Request

git push origin feature/your-feature-name

Then open a PR on GitHub with:

Clear description of changes
Why the change is needed
Any related issues

Contribution Ideas

🧠 ML: Better feature engineering, new algorithms, ensemble strategies
🌐 Frontend: UI improvements, accessibility, new visualizations
📊 Data: Additional datasets, preprocessing improvements
📚 Docs: Tutorial videos, architecture guides, deployment guides

📜 License

This project is licensed under the MIT License – see the LICENSE file for details.

MIT License means:

✅ Use commercially
✅ Modify & distribute
✅ Use privately
⚠️ Include license & copyright notice

🤖 API Documentation

Prediction Endpoint

POST /api/predict

Request Body:

{
  "gender": 1,
  "age_years": 55.38,
  "height": 156,
  "weight": 85,
  "ap_hi": 140,
  "ap_lo": 90,
  "cholesterol": 3,
  "gluc": 1,
  "smoke": 0,
  "alco": 0,
  "active": 1,
  "bmi": 34.93,
  "map": 106.67
}

Response:

{
  "prediction": 1,
  "probability": [0.42, 0.58],
  "risk_level": "HIGH",
  "confidence": 0.58
}
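
A client for this endpoint could be sketched with the Python standard library as shown below. The base URL assumes the local dev server from Option C, and the helper names are hypothetical. The example only builds the request and parses a sample response, so it runs without a server. Sending it would require `urllib.request.urlopen(request)` with the app running.

```python
import json
import urllib.request

# Assumed base URL: the local dev server described in Option C
BASE_URL = "http://localhost:3000"


def build_predict_request(payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for /api/predict."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize_response(raw_json: str) -> str:
    """Turn a response body into a one-line summary."""
    result = json.loads(raw_json)
    p_cvd = result["probability"][1]  # second entry is P(CVD)
    return f"{result['risk_level']} risk, P(CVD)={p_cvd:.0%}"


payload = {
    "gender": 1, "age_years": 55.38, "height": 156, "weight": 85,
    "ap_hi": 140, "ap_lo": 90, "cholesterol": 3, "gluc": 1,
    "smoke": 0, "alco": 0, "active": 1, "bmi": 34.93, "map": 106.67,
}
request = build_predict_request(payload)

# Sample response body copied from the documentation above
sample = '{"prediction": 1, "probability": [0.42, 0.58], "risk_level": "HIGH", "confidence": 0.58}'
print(summarize_response(sample))  # HIGH risk, P(CVD)=58%
```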

📞 Contact & Support

Get in Touch

Jay Patel
Machine Learning Engineer | Full-Stack Developer

📧 Email: pateljay97378@gmail.com
💼 GitHub: @jaypatel342005
🔗 LinkedIn: Connect

Have questions? Open an Issue on GitHub!

⭐ Show Your Support

If this project helped you, please:

⭐ Star this repository on GitHub
🔗 Share with your network
📢 Contribute with PRs or ideas
💬 Give feedback via issues

Made with ❤️ to Advance Healthcare AI
