GitHub - Ranakghosh7/AutoML-Benchmark

This Python code implements a structured machine learning pipeline for regression benchmarking and Exploratory Data Analysis (EDA). It uses scikit-learn's `Pipeline` and `cross_val_score` and compares several common regression models.

# 📈 Machine Learning Regression Benchmark Pipeline

## 💡 Overview

This repository hosts a pipeline for the systematic benchmarking of supervised regression models. The framework is designed to facilitate rapid, reproducible comparative analysis of standard and ensemble-based regression techniques across various datasets.

It follows established machine learning practice, including standardization within a pipeline and $K$-fold cross-validation, so that reported performance metrics are less susceptible to data partitioning bias. It serves as a foundational tool for selecting a model for a given predictive task based on performance (mean $R^2$) and stability (standard deviation of $R^2$).

## 🎯 Methodology and Framework

The core of this project is the `benchmark_models` function. The overall pipeline follows this sequence:

### 1. Exploratory Data Analysis (EDA)

The `create_eda_report` function generates a basic report detailing:

- Dataset dimensions and data types (`df.info()`).
- Descriptive statistics (`df.describe()`).
- Count of missing values (`df.isna().sum()`).

### 2. Model Evaluation Protocol

| Component | Description | Benefit |
|---|---|---|
| Data Scaling | `StandardScaler` is applied to the features, rescaling each to zero mean and unit variance. | Essential for scale-sensitive models (e.g., Ridge, whose L2 penalty depends on coefficient magnitudes) and for preventing features with large ranges from dominating. |
| Scikit-learn Pipeline | Preprocessing (scaling) is strictly coupled with the estimator (model). | Prevents data leakage by ensuring scaling parameters are fitted only on the training folds of the cross-validation. |
| Cross-Validation | 5-fold cross-validation (`cv=5`) is used to train and evaluate each model. | Provides an estimate of generalization performance and stability (mean $R^2$ and standard deviation of $R^2$). |

### 3. Benchmarked Estimators

This pipeline compares a diverse set of regression algorithms:

- **Parametric:** `LinearRegression`, `Ridge` (L2 regularization)
- **Ensemble:** `RandomForestRegressor`, `GradientBoostingRegressor`

## ⚙️ Repository Structure and Usage

The structure is minimal for ease of integration:

    Regression-Benchmark/
    ├── dataset.csv        # Placeholder for the input data file
    ├── run_benchmark.py   # The core Python script
    └── requirements.txt

### Prerequisites

Python 3.x

The required packages can be installed using pip:

    pip install pandas numpy scikit-learn seaborn matplotlib

Or, if you create a `requirements.txt`:

    pip install -r requirements.txt

### Running the Pipeline

The pipeline is executed via the `if __name__ == "__main__":` block, which calls the `run_pipeline` function.

1. **Prepare your data:** Ensure your data is saved as a CSV file (e.g., `data.csv`).
2. **Update script arguments:** Modify the `run_pipeline` call in the script to match your file name and target variable:

       if __name__ == "__main__":
           # Update "your_data.csv" and "your_target_column"
           run_pipeline("your_data.csv", "your_target_column")

3. **Execute the script:**

       python run_benchmark.py

### Output Artifacts

Upon completion, the pipeline generates two outputs for analytical review:

- `eda_report.txt`: A text file containing the basic data statistics, types, and missing values.
- `model_benchmark_results.csv`: A CSV file containing the final performance table:

| Model | Mean_R2 | Std_R2 |
|---|---|---|
| LinearRegression | 0.7583 | 0.0211 |
| ... | ... | ... |
| GradientBoosting | 0.9125 | 0.0098 |

## 📈 Metric Rationale

The primary evaluation metric is the coefficient of determination ($R^2$), defined as:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $\bar{y}$ is the mean of the actual values. This metric gives an interpretable measure of the proportion of the variance in the dependent variable that is predictable from the independent variables, which makes it well suited to a comparative benchmark.

**Stability Analysis:** By reporting the standard deviation of $R^2$ across the 5 folds, the pipeline gives insight into how consistently each model performs. A low standard deviation paired with a high mean $R^2$ suggests a model that generalizes well and is not overly sensitive to how the data is partitioned.
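The EDA step and the evaluation protocol described above can be sketched as follows. This is a minimal illustration, not the repository's actual script: the function names `create_eda_report` and `benchmark_models` come from this README, but their signatures and bodies, the `random_state` parameter, the short model labels, and the synthetic data from `make_regression` are all assumptions made for the example.

```python
import io

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def create_eda_report(df):
    """Hypothetical sketch: return the basic EDA report as a string."""
    buf = io.StringIO()
    df.info(buf=buf)  # info() prints to stdout unless a buffer is given
    sections = [
        "== Info ==", buf.getvalue(),
        "== Describe ==", df.describe().to_string(),
        "== Missing values ==", df.isna().sum().to_string(),
    ]
    return "\n".join(sections)


def benchmark_models(X, y, cv=5, random_state=0):
    """Hypothetical sketch: k-fold R^2 for each model, scaled inside a Pipeline."""
    models = {
        "LinearRegression": LinearRegression(),
        "Ridge": Ridge(),
        "RandomForest": RandomForestRegressor(random_state=random_state),
        "GradientBoosting": GradientBoostingRegressor(random_state=random_state),
    }
    rows = []
    for name, model in models.items():
        # The scaler is refit on the training part of every fold,
        # so no statistics leak from the held-out fold.
        pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
        scores = cross_val_score(pipe, X, y, cv=cv, scoring="r2")
        rows.append({"Model": name, "Mean_R2": scores.mean(), "Std_R2": scores.std()})
    table = pd.DataFrame(rows).sort_values("Mean_R2", ascending=False)
    return table.reset_index(drop=True)


if __name__ == "__main__":
    # Synthetic data stands in for a real CSV file (assumption for the demo).
    X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
    print(create_eda_report(pd.DataFrame(X)))
    print(benchmark_models(X, y))
```

Wrapping the scaler and the estimator in one `Pipeline` before passing it to `cross_val_score` is what keeps the scaling parameters restricted to the training folds. Scaling the full dataset first and then cross-validating would let information from the held-out fold leak into the preprocessing step.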
