jasonjiang8866/tabularML


🧀 TabularML - Advanced ML Pipeline with Streamlit UI

A comprehensive machine learning pipeline for tabular data with a beautiful Streamlit web interface and automated UV environment setup.

✨ Features

🚀 Machine Learning Pipeline

  • Automated Data Processing: Handles numeric and categorical features automatically
  • Smart Feature Selection: Uses Random Forest for intelligent feature selection
  • LightGBM Integration: Fast and efficient gradient boosting algorithm
  • Hyperparameter Tuning: Automated model optimization with GridSearchCV
  • Comprehensive Evaluation: Detailed performance metrics and visualizations
  • Model Persistence: Saves trained models for deployment

🎨 Streamlit Web Interface

  • Interactive Dashboard: Beautiful, responsive web interface
  • Data Exploration: Comprehensive data analysis and visualizations
  • Real-time Training: Live progress tracking during model training
  • Model Evaluation: Detailed performance metrics with interactive charts
  • Prediction Interface: Make predictions on new data with confidence intervals
  • Batch Processing: Upload CSV files for batch predictions

⚡ UV Package Management

  • Fast Environment Setup: Automated dependency management with UV
  • Cross-platform Scripts: Works on Windows, macOS, and Linux
  • Reproducible Builds: Locked dependencies for consistent environments

🛠️ Quick Start

Option 1: Automatic Setup (Recommended)

Linux/macOS:

# Make setup script executable and run
chmod +x setup.sh
./setup.sh

Windows:

# Run the setup batch file
setup.bat

Option 2: Manual Setup

  1. Install UV (if not already installed):
pip install uv
  2. Initialize the environment:
uv sync
  3. Run the application:
# Run the Streamlit UI
uv run streamlit run ui.py

# Or run the pipeline directly
uv run python pipeline.py

🎮 Using the Application

1. Launch the Web Interface

uv run streamlit run ui.py

Then open your browser to http://localhost:8501

2. User Interface Overview

The TabularML web interface provides an intuitive, step-by-step workflow for machine learning:

[Screenshot: TabularML UI - Home Page]

The interface features:

  • 📊 Interactive Dashboard: Clean, modern design with real-time status updates
  • 🎛️ Navigation Panel: Easy access to all pipeline stages
  • 📈 Data Visualization: Rich charts and graphs for data exploration
  • ⚡ Quick Actions: One-click initialization and data loading

[Screenshot: TabularML UI - Data Exploration]

3. Navigate Through the Pipeline

🏠 Home Page

  • Initialize the pipeline
  • Load sample data
  • View system status

📊 Data Exploration

  • View dataset statistics and metrics
  • Explore data distributions and correlations
  • Analyze feature relationships with interactive plots

🔧 Model Training

  • Configure training parameters
  • Start model training with live progress tracking
  • View training logs and results

📈 Model Evaluation

  • Detailed performance metrics (R², RMSE, MAE, MSE)
  • Predictions vs Actual scatter plots
  • Residuals distribution analysis
  • Feature importance charts
  • Model parameter inspection
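
The four metrics listed above can be computed directly from the true and predicted targets. The following is a minimal sketch using only the standard library; the function name `regression_metrics` is hypothetical and is not taken from the repository, which may rely on scikit-learn's metric functions instead.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², RMSE, MAE and MSE for two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n      # mean squared error
    mae = sum(abs(e) for e in errors) / n     # mean absolute error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    ss_res = mse * n                                    # residual sum of squares
    # R² is undefined when the target is constant.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"r2": r2, "rmse": math.sqrt(mse), "mae": mae, "mse": mse}


metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
print(metrics)
```

RMSE and MAE are expressed in the units of the target, so they are only comparable across datasets after the target has been put on a common scale.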

🔮 Predictions

  • Single Predictions: Enter feature values for individual predictions
  • Batch Predictions: Upload CSV files for bulk processing
  • Confidence Intervals: Get prediction uncertainty estimates
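
This README does not say how the uncertainty estimates are produced. One simple way to attach an interval to a point prediction is to reuse the absolute residuals observed on held-out data, in the style of split conformal prediction. The sketch below is an illustration under that assumption, not the repository's method; `residual_interval` and its parameters are hypothetical names.

```python
import math


def residual_interval(calib_true, calib_pred, new_pred, coverage=0.9):
    """Symmetric interval around new_pred built from held-out absolute residuals."""
    if not 0.0 < coverage < 1.0:
        raise ValueError("coverage must lie strictly between 0 and 1")
    residuals = sorted(abs(t - p) for t, p in zip(calib_true, calib_pred))
    n = len(residuals)
    if n == 0:
        raise ValueError("at least one calibration point is required")
    # Conformal-style rank ceil((n + 1) * coverage), capped at n.
    # The cap weakens the coverage guarantee on very small calibration sets.
    rank = min(n, math.ceil((n + 1) * coverage))
    half_width = residuals[rank - 1]
    return new_pred - half_width, new_pred + half_width


# Held-out targets and the model's predictions for them (made-up numbers).
calib_true = [10, 12, 14, 16, 18, 20, 22, 24, 26]
calib_pred = [11, 12, 13, 18, 18, 19, 22, 27, 26]
low, high = residual_interval(calib_true, calib_pred, new_pred=15.0, coverage=0.8)
print(low, high)
```

The interval has the same width for every input, which is the main limitation of this approach; quantile regression is a common alternative when the error varies across the feature space.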

⚙️ Settings

  • Configure model parameters
  • Adjust preprocessing options
  • System information and controls

📊 Pipeline Architecture

The ML pipeline follows these steps:

  1. Data Loading: Loads dataset (with fallback to synthetic data)
  2. Data Preprocessing: Handles missing values, scaling, and encoding
  3. Train-Test Split: Divides data into training and testing sets
  4. Feature Selection: Identifies top features using Random Forest
  5. Model Building: Trains LightGBM with hyperparameter tuning
  6. Model Evaluation: Comprehensive performance assessment
  7. Deployment: Saves model for production use
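
Steps 2 to 4 above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions rather than the code in pipeline.py: the toy data, the column names, the imputation and scaling choices, and the `median` importance threshold are all made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the column names are invented for this sketch.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "income": rng.normal(5.0, 1.5, n),
    "rooms": rng.integers(1, 8, n).astype(float),
    "noise": rng.normal(0.0, 1.0, n),
    "property_type": rng.choice(["house", "flat", "condo"], n),
})
df["label"] = 3.0 * df["income"] + 0.5 * df["rooms"] + rng.normal(0.0, 0.1, n)
df.loc[df.index % 25 == 0, "income"] = np.nan  # simulate missing values

X, y = df.drop(columns="label"), df["label"]
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

# Step 2: impute and scale numeric columns, one-hot encode categorical ones.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Step 3: hold out a test set before fitting anything.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 4: keep features whose Random Forest importance is at least the median.
selector = SelectFromModel(
    RandomForestRegressor(n_estimators=100, random_state=42),
    threshold="median",
)
features = Pipeline([("preprocess", preprocess), ("select", selector)])
X_train_sel = features.fit_transform(X_train, y_train)
X_test_sel = features.transform(X_test)
print(X_train_sel.shape, X_test_sel.shape)
```

Fitting the preprocessing and the selector on the training split only, then applying them to the test split, keeps information from the test set out of the fitted transformers.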

🔧 Configuration

Dependencies (pyproject.toml)

  • Core ML: pandas, scikit-learn, lightgbm, numpy
  • Visualization: matplotlib, plotly, seaborn
  • Web Interface: streamlit
  • Utilities: joblib for model persistence

UV Scripts

The setup includes pre-configured UV scripts for common tasks:

  • Environment initialization
  • Dependency installation
  • Application launching

🎯 Sample Dataset

The application includes a synthetic housing dataset with:

  • 1000 samples with 12 features
  • Numeric features: Income, house age, rooms, location, etc.
  • Categorical features: Property type, year built
  • Target: House price prediction

📈 Performance

On the synthetic sample dataset, the pipeline reports approximately:

  • R² Score: ~0.97 (97% of target variance explained)
  • RMSE: ~0.99 (in the units of the target)
  • Training Time: ~15 seconds for the full pipeline

🔍 Advanced Features

Interactive Visualizations

  • Target distribution histograms
  • Feature correlation heatmaps
  • Scatter plot matrices
  • Predictions vs actual charts
  • Residuals analysis

Model Insights

  • Feature importance rankings
  • Model parameter inspection
  • Training progress tracking
  • Comprehensive evaluation metrics

Production Ready

  • Model serialization with joblib
  • Batch prediction capabilities
  • Error handling and validation
  • Scalable architecture
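
Model serialization with joblib and batch prediction from a CSV can be combined as in the sketch below. It is a hedged illustration, not the repository's code: a small `LinearRegression` stands in for the tuned LightGBM model, and the file name `model.joblib`, the helper `predict_batch`, and the column names are hypothetical.

```python
import io
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in model; the real pipeline would persist its tuned LightGBM model.
train = pd.DataFrame({
    "rooms": [1.0, 2.0, 3.0, 4.0],
    "age": [10.0, 5.0, 30.0, 20.0],
})
model = LinearRegression().fit(train, [100.0, 200.0, 300.0, 400.0])

# Round-trip the fitted model through a file on disk.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.joblib"  # hypothetical file name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)


def predict_batch(fitted_model, batch, feature_names):
    """Validate the uploaded columns, then append a 'prediction' column."""
    missing = [c for c in feature_names if c not in batch.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    out = batch.copy()
    # Select and order the columns exactly as they were seen during training.
    out["prediction"] = fitted_model.predict(batch[feature_names])
    return out


# An in-memory CSV stands in for an uploaded file.
csv_text = "age,rooms,extra\n50,5,ignored\n"
batch = pd.read_csv(io.StringIO(csv_text))
result = predict_batch(restored, batch, list(train.columns))
print(result)
```

Checking the column names before predicting turns a malformed upload into a readable error instead of a failure deep inside the model.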

🚀 Extending the Pipeline

Adding New Datasets

  1. Modify the fetch_data() method in pipeline.py
  2. Ensure your data has a 'label' column for the target
  3. The pipeline automatically handles numeric/categorical features
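
A replacement loader along the lines of these steps might look like the sketch below. The real fetch_data() is a method in pipeline.py whose signature is not shown in this README, so the free function, its `csv_path` and `target_column` parameters, and the synthetic columns are all assumptions made for illustration.

```python
from pathlib import Path

import numpy as np
import pandas as pd


def make_synthetic(n_samples=1000, seed=0):
    """Small synthetic regression table used when no file is available."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "income": rng.normal(5.0, 1.5, n_samples),
        "house_age": rng.uniform(0.0, 50.0, n_samples),
        "property_type": rng.choice(["house", "flat"], n_samples),
    })
    noise = rng.normal(0.0, 0.2, n_samples)
    df["label"] = 2.0 * df["income"] - 0.05 * df["house_age"] + noise
    return df


def fetch_data(csv_path=None, target_column="label"):
    """Load a CSV if it exists, otherwise fall back to synthetic data."""
    if csv_path is not None and Path(csv_path).is_file():
        df = pd.read_csv(csv_path)
    else:
        df = make_synthetic()
    # The pipeline expects the target in a column named 'label'.
    if target_column != "label" and target_column in df.columns:
        df = df.rename(columns={target_column: "label"})
    if "label" not in df.columns:
        raise ValueError("the dataset needs a 'label' column for the target")
    return df


data = fetch_data("does_not_exist.csv")  # missing file triggers the fallback
print(data.shape)
```

Renaming the target on load means the rest of the pipeline can keep assuming a 'label' column regardless of how the source file names it.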

Customizing Models

  1. Update the model_building() method
  2. Modify hyperparameter grids in the training configuration
  3. Add new evaluation metrics as needed
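
The hyperparameter grid mentioned in step 2 could take a form like the one below. This is a sketch only: the grid values, the scoring choice, and the toy data are assumptions, not the settings used in model_building(). The fallback to scikit-learn's `GradientBoostingRegressor` is there solely so the example runs where LightGBM is not installed.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV

try:
    from lightgbm import LGBMRegressor

    estimator = LGBMRegressor(random_state=42, verbose=-1)
    param_grid = {
        "n_estimators": [50, 100],
        "learning_rate": [0.05, 0.1],
        "num_leaves": [15, 31],
    }
except (ImportError, OSError):
    # Fallback for environments without a working LightGBM install.
    from sklearn.ensemble import GradientBoostingRegressor

    estimator = GradientBoostingRegressor(random_state=42)
    param_grid = {
        "n_estimators": [50, 100],
        "learning_rate": [0.05, 0.1],
        "max_depth": [2, 3],
    }

# Toy regression data standing in for the selected training features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(0.0, 0.1, 300)

# Exhaustive search over the grid with 3-fold cross-validation.
search = GridSearchCV(
    estimator,
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

The cost of a grid search grows multiplicatively with each added parameter, so a small grid like this one (8 combinations, 24 fits) is a reasonable starting point before widening the ranges.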

UI Customization

  1. Modify ui.py to add new pages or features
  2. Update the navigation and styling
  3. Add new visualization types

📚 Dependencies

Core Libraries

  • pandas: Data manipulation and analysis
  • scikit-learn: Machine learning toolkit
  • lightgbm: Gradient boosting framework
  • streamlit: Web application framework
  • plotly: Interactive visualizations

Development Tools

  • uv: Fast Python package manager
  • pytest: Testing framework (optional)
  • black: Code formatting (optional)

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

📄 License

This project is open source and available under the MIT License.

🎉 Acknowledgments

  • Built with modern Python ML stack
  • Inspired by best practices in MLOps
  • Designed for both beginners and experts
  • Emphasis on user experience and visualization

Ready to explore your data? Start with ./setup.sh and launch the Streamlit interface! 🚀
