Ranakghosh7/AutoML-Benchmark

# 📈 Machine Learning Regression Benchmark Pipeline

## 💡 Overview

This repository hosts a pipeline for the systematic benchmarking of supervised regression models, together with a basic Exploratory Data Analysis (EDA) step. It is built on scikit-learn's `Pipeline` and `cross_val_score` and is designed for rapid, reproducible comparison of standard and ensemble-based regression techniques across datasets.

Features are standardized inside the pipeline and every model is evaluated with $K$-fold cross-validation, so the reported metrics are less sensitive to any single train/test partition. The result is a starting point for selecting a model for a given predictive task based on performance (mean $R^2$) and stability (standard deviation of $R^2$ across folds).

## 🎯 Methodology and Framework

The core of this project is the `benchmark_models` function. The pipeline performs the following sequence:

### 1. Exploratory Data Analysis (EDA)

The `create_eda_report` function generates a basic report detailing:

- Dataset dimensions and data types (`df.info()`).
- Descriptive statistics (`df.describe()`).
- Count of missing values (`df.isna().sum()`).

### 2. Model Evaluation Protocol

| Component | Description | Benefit |
| --- | --- | --- |
| Data Scaling | `StandardScaler` rescales each feature to zero mean and unit variance. | Important for scale-sensitive models such as Ridge, whose L2 penalty depends on feature magnitudes, and keeps large-valued features from dominating. Tree-based ensembles are largely unaffected by scaling. |
| Scikit-learn Pipeline | Preprocessing (scaling) is coupled with the estimator (model) in a single `Pipeline`. | Prevents data leakage by ensuring scaling parameters are fitted only on the training folds of the cross-validation. |
| Cross-Validation | 5-fold cross-validation (`cv=5`) is used to train and evaluate each model. | Provides an estimate of generalization performance (mean $R^2$) and of its variability across folds (standard deviation of $R^2$). |

### 3. Benchmarked Estimators

This pipeline compares the following regression algorithms:

- Parametric: `LinearRegression`, `Ridge` (L2 regularization)
- Ensemble: `RandomForestRegressor`, `GradientBoostingRegressor`

## ⚙️ Repository Structure and Usage

The structure is minimal for ease of integration:

```
Regression-Benchmark/
├── dataset.csv          # Placeholder for the input data file
├── run_benchmark.py     # The core Python script
└── requirements.txt
```

### Prerequisites

- Python 3.x

The required packages can be installed using pip:

```bash
pip install pandas numpy scikit-learn seaborn matplotlib
```

Or, if you create a `requirements.txt`:

```bash
pip install -r requirements.txt
```

### Running the Pipeline

The pipeline is executed via the `if __name__ == "__main__":` block, which calls the `run_pipeline` function.

1. Prepare your data: ensure your data is saved as a CSV file (e.g., `data.csv`).
2. Update the script arguments: modify the `run_pipeline` call in the script to match your file name and target variable:

   ```python
   if __name__ == "__main__":
       # Update "your_data.csv" and "your_target_column"
       run_pipeline("your_data.csv", "your_target_column")
   ```

3. Execute the script:

   ```bash
   python run_benchmark.py
   ```

### Output Artifacts

Upon completion, the pipeline generates two outputs for review:

- `eda_report.txt`: a text file containing the basic data statistics, types, and missing values.
- `model_benchmark_results.csv`: a CSV file containing the final performance table:

| Model | Mean_R2 | Std_R2 |
| --- | --- | --- |
| LinearRegression | 0.7583 | 0.0211 |
| ... | ... | ... |
| GradientBoosting | 0.9125 | 0.0098 |

## 📈 Metric Rationale

The primary evaluation metric is the coefficient of determination ($R^2$), defined as:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $\bar{y}$ is the mean of the actual values. $R^2$ measures the proportion of the variance in the dependent variable that is predictable from the independent variables. Because it is scale-free, it is convenient for comparing models on the same dataset.

Stability analysis: by reporting the standard deviation of $R^2$ across the 5 folds, the pipeline gives insight into how consistently each model performs. A low standard deviation paired with a high mean $R^2$ suggests a model that generalizes well across different partitions of the data.
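
As a small worked example of the $R^2$ definition above, the following sketch computes the score by hand and then summarizes a set of per-fold scores as a mean and a standard deviation. The helper name `r2_score_manual`, the toy values, and the fold scores are all hypothetical and are not taken from the repository. The use of the population standard deviation (NumPy's default for `std`) is also an assumption about how `Std_R2` is computed.

```python
from statistics import mean, pstdev


def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    # Residual sum of squares: squared errors of the predictions.
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    # Total sum of squares: squared deviations from the mean of y.
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data (hypothetical): SS_res = 0.5, SS_tot = 20, so R^2 = 0.975.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]
r2 = r2_score_manual(y_true, y_pred)

# Hypothetical R^2 scores from five cross-validation folds.
fold_scores = [0.90, 0.92, 0.91, 0.93, 0.89]
mean_r2 = mean(fold_scores)
std_r2 = pstdev(fold_scores)  # population std (ddof=0), an assumed convention

print(f"R2 = {r2:.3f}, Mean_R2 = {mean_r2:.3f}, Std_R2 = {std_r2:.4f}")
```

A model whose fold scores are tightly clustered, as here, would show a small `Std_R2` next to its `Mean_R2` in the results table.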
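
The evaluation protocol described in this README (scaling inside a `Pipeline`, 5-fold cross-validation, mean and standard deviation of $R^2$) could look roughly like the sketch below. This is not the repository's actual `benchmark_models` implementation: the function signature, the hyperparameters, the short model labels, and the synthetic dataset from `make_regression` are assumptions made so the example is self-contained.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def benchmark_models(X, y, cv=5):
    """Return a table of mean and std of cross-validated R^2 per model.

    Hypothetical signature; the real function may differ.
    """
    models = {
        "LinearRegression": LinearRegression(),
        "Ridge": Ridge(),
        "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
        "GradientBoosting": GradientBoostingRegressor(random_state=0),
    }
    rows = []
    for name, model in models.items():
        # The scaler is refit on the training part of every fold,
        # so no statistics leak from the held-out fold.
        pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
        scores = cross_val_score(pipe, X, y, cv=cv, scoring="r2")
        rows.append(
            {"Model": name, "Mean_R2": scores.mean(), "Std_R2": scores.std()}
        )
    return pd.DataFrame(rows)


# Synthetic data stands in for a real CSV (assumption for illustration).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
results = benchmark_models(X, y)
print(results.sort_values("Mean_R2", ascending=False).to_string(index=False))
```

Placing the scaler inside the pipeline, instead of scaling the full dataset once up front, is what keeps the cross-validation estimate free of leakage.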
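
The EDA report described above (dimensions and types, descriptive statistics, missing-value counts) might be assembled along these lines. This is only a sketch: the signature of `create_eda_report`, the section headings, and the output path are assumptions, not the repository's code. One detail worth noting is that `df.info()` prints to standard output by default, so a text buffer is passed via its `buf` parameter to capture the result.

```python
import io
import os
import tempfile

import pandas as pd


def create_eda_report(df, path="eda_report.txt"):
    """Write a basic text report and return its contents.

    Hypothetical signature; the real function may differ.
    """
    # df.info() writes to stdout unless a buffer is supplied.
    info_buf = io.StringIO()
    df.info(buf=info_buf)

    sections = [
        "== Shape ==",
        str(df.shape),
        "== Info ==",
        info_buf.getvalue(),
        "== Describe ==",
        df.describe().to_string(),
        "== Missing values ==",
        df.isna().sum().to_string(),
    ]
    report = "\n".join(sections)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(report)
    return report


# Tiny illustrative frame with one missing value in column "x".
df = pd.DataFrame({"x": [1.0, 2.0, None, 4.0], "y": [10, 20, 30, 40]})
report_path = os.path.join(tempfile.gettempdir(), "eda_report_example.txt")
report = create_eda_report(df, path=report_path)
print(report)
```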
