This project implements a complete Machine Learning pipeline for binary classification (income prediction).
It demonstrates how to:
- preprocess data
- train a model
- evaluate performance
- visualize results
- save and reuse a model
- expose predictions via a REST API

Technologies used:
- Python
- Pandas
- Scikit-learn
- NumPy
- Matplotlib
- Seaborn
- Flask
This project follows a standard Machine Learning workflow:
- Load dataset
- Clean and preprocess data
- Train a model (Random Forest)
- Evaluate performance
- Visualize results
- Save trained model (model.pkl)
- Serve predictions via API
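
The workflow above can be sketched end to end as follows. This is a minimal illustration, not the code in `src/`: the toy dataset, the column names `age` and `hours`, the helper names, the 50,000 income cut-off, and the seed 42 are all assumptions made for the example. Only the `PINCP` target, the Random Forest model, the 80/20 split, the fixed random state, and the `model.pkl` file name come from this README.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

INCOME_THRESHOLD = 50_000  # assumption: the README does not state the cut-off
RANDOM_STATE = 42          # assumption: any fixed seed gives reproducibility


def make_toy_dataset(n_rows: int = 500) -> pd.DataFrame:
    """Build a synthetic stand-in for data/dataset.csv (hypothetical columns)."""
    rng = np.random.default_rng(RANDOM_STATE)
    age = rng.integers(18, 70, size=n_rows)
    hours = rng.integers(10, 60, size=n_rows)
    # Income loosely depends on age and hours so the model has signal to learn.
    income = 600 * age + 700 * hours + rng.normal(0, 8_000, size=n_rows)
    return pd.DataFrame({"age": age, "hours": hours, "PINCP": income})


def run_pipeline(df: pd.DataFrame, model_path: Path) -> float:
    """Clean, split, train, evaluate, and save; returns test accuracy."""
    # Clean: drop rows with missing values.
    df = df.dropna()
    # Preprocess: turn PINCP into a binary target (1 = above the threshold).
    y = (df["PINCP"] > INCOME_THRESHOLD).astype(int)
    X = df.drop(columns=["PINCP"])
    # 80/20 train/test split with a fixed random state.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=RANDOM_STATE
    )
    model = RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    # Persist the trained model so it can be reloaded later.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    return accuracy


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "model.pkl"
        print(f"accuracy: {run_pipeline(make_toy_dataset(), path):.3f}")
        # Reuse: reload the saved model and predict on one new row.
        with open(path, "rb") as f:
            reloaded = pickle.load(f)
        print(reloaded.predict(pd.DataFrame({"age": [40], "hours": [45]})))
```

Keeping the whole flow in one function makes the order of the steps easy to follow; in the real project each step would presumably live in its own module under `src/`.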

Project structure:
training/
│
├── data/
│   └── dataset.csv
│
├── src/
│   ├── train.py
│   ├── predict.py
│   ├── preprocess.py
│   ├── metrics.py
│   └── visualization.py
│
├── models/
│   └── model.pkl
│
├── flask/
│   └── api.py
│
├── app.py
├── requirements.txt
└── README.md

Clone the repository and install dependencies:
pip install -r requirements.txt

Run the training pipeline:
python app.py

This will:
- load and clean data
- train the model
- evaluate performance
- generate visualizations
- save the model in /models/model.pkl

Start the API:
python flask/api.py

The API will run on:
http://127.0.0.1:5000

Available endpoints:
GET /
GET /health
GET /visualization
POST /predict
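
As an illustration of calling `POST /predict` from Python, here is a small client sketch that uses only the standard library. The helper names `build_predict_request` and `predict` are hypothetical; the base URL and the `features` payload key are the ones given in this README.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5000"  # address given in this README


def build_predict_request(features, base_url=BASE_URL):
    """Build a POST /predict request with a JSON body (hypothetical helper)."""
    body = json.dumps({"features": list(features)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(features, base_url=BASE_URL, timeout=5.0):
    """Send the request and decode the JSON reply; the API must be running."""
    request = build_predict_request(features, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API running in another terminal, `predict([1, 50, 35, 50000, 0, 1, 3, 2, 12, 100, 5])` should return the decoded JSON response as a dictionary.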

Example request body:
{
  "features": [1, 50, 35, 50000, 0, 1, 3, 2, 12, 100, 5]
}

Example response:
{
  "prediction": 0,
  "probabilities": {
    "0": 0.7916445304104879,
    "1": 0.20835546958951212
  }
}

Error response:
{
  "error": "Missing features"
}

During training, the model prints:
- Accuracy score
- Confusion matrix
- Classification report
It also displays:
- Confusion matrix visualization
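
The evaluation outputs listed above can be reproduced with scikit-learn's metric functions. The sketch below uses made-up labels and hypothetical helper names, and it draws the confusion matrix with scikit-learn's `ConfusionMatrixDisplay` rather than Seaborn, so it may differ from what `src/metrics.py` and `src/visualization.py` actually do.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    classification_report,
    confusion_matrix,
)

# Made-up labels for illustration: 1 = above the income threshold, 0 = below.
Y_TRUE = [0, 0, 0, 0, 1, 1, 1, 1]
Y_PRED = [0, 0, 0, 1, 1, 1, 0, 1]


def evaluate(y_true, y_pred):
    """Return accuracy, confusion matrix, and the text classification report."""
    accuracy = accuracy_score(y_true, y_pred)
    # Rows are true classes, columns are predicted classes.
    matrix = confusion_matrix(y_true, y_pred)
    report = classification_report(y_true, y_pred)
    return accuracy, matrix, report


def save_confusion_matrix_plot(matrix, path):
    """Draw the confusion matrix and write it to an image file."""
    fig, ax = plt.subplots(figsize=(4, 4))
    ConfusionMatrixDisplay(confusion_matrix=matrix).plot(ax=ax)
    ax.set_title("Confusion matrix")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)


if __name__ == "__main__":
    accuracy, matrix, report = evaluate(Y_TRUE, Y_PRED)
    print(f"Accuracy: {accuracy:.2f}")
    print(matrix)
    print(report)
```
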

Model details:
- Algorithm: Random Forest Classifier
- Task: Binary classification
- Target: PINCP (income threshold classification)
- Train/Test split: 80/20
- Fixed random state for reproducibility

Key features:
- Clean and modular code structure
- Reusable and scalable ML pipeline
- Model persistence with .pkl
- API for real-time predictions
- Data visualization included

Possible improvements:
- Add cross-validation
- Improve feature engineering
- Add model comparison (Logistic Regression, XGBoost)
- Deploy API (Render, AWS, etc.)
- Build a frontend interface

Notes:
- Ensure your dataset contains the PINCP column
- Input features must match training data format
- API expects numerical input only
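
The input constraints above could be enforced before a request reaches the model. The following sketch is one possible check, not the validation in `flask/api.py`: the function name, the expected feature count of 11 (taken from the length of the example request), and every error message other than "Missing features" are assumptions.

```python
import math
from numbers import Real

N_FEATURES = 11  # assumption: length of the example request in this README


def validate_features(payload):
    """Return (features, None) on success or (None, error_body) on failure."""
    if not isinstance(payload, dict) or "features" not in payload:
        return None, {"error": "Missing features"}
    features = payload["features"]
    if not isinstance(features, list) or len(features) != N_FEATURES:
        # Hypothetical message; the README only documents "Missing features".
        return None, {"error": f"Expected a list of {N_FEATURES} features"}
    for value in features:
        # bool is a subclass of int, so it is rejected explicitly.
        if (
            isinstance(value, bool)
            or not isinstance(value, Real)
            or not math.isfinite(value)
        ):
            return None, {"error": "Features must be finite numbers"}
    return [float(v) for v in features], None
```

In a Flask route, the error body would typically be returned with a 400 status code, and the validated list passed to the model's `predict` method as a single row.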

This is a beginner Machine Learning project focused on building a complete, structured ML pipeline.