A comprehensive machine learning pipeline for tabular data with a beautiful Streamlit web interface and automated UV environment setup.
- Automated Data Processing: Handles numeric and categorical features automatically
- Smart Feature Selection: Uses Random Forest for intelligent feature selection
- LightGBM Integration: Fast and efficient gradient boosting algorithm
- Hyperparameter Tuning: Automated model optimization with GridSearchCV
- Comprehensive Evaluation: Detailed performance metrics and visualizations
- Model Persistence: Saves trained models for deployment
- Interactive Dashboard: Beautiful, responsive web interface
- Data Exploration: Comprehensive data analysis and visualizations
- Real-time Training: Live progress tracking during model training
- Model Evaluation: Detailed performance metrics with interactive charts
- Prediction Interface: Make predictions on new data with confidence intervals
- Batch Processing: Upload CSV files for batch predictions
- Fast Environment Setup: Automated dependency management with UV
- Cross-platform Scripts: Works on Windows, macOS, and Linux
- Reproducible Builds: Locked dependencies for consistent environments
# Make setup script executable and run
chmod +x setup.sh
./setup.sh
# Run the setup batch file
setup.bat
- Install UV (if not already installed):
pip install uv
- Initialize the environment:
uv sync
- Run the application:
# Run the Streamlit UI
uv run streamlit run ui.py
# Or run the pipeline directly
uv run python pipeline.py
uv run streamlit run ui.py
Then open your browser to http://localhost:8501
The TabularML web interface provides an intuitive, step-by-step workflow for machine learning.
The interface features:
- Interactive Dashboard: Clean, modern design with real-time status updates
- Navigation Panel: Easy access to all pipeline stages
- Data Visualization: Rich charts and graphs for data exploration
- Quick Actions: One-click initialization and data loading
- Initialize the pipeline
- Load sample data
- View system status
- View dataset statistics and metrics
- Explore data distributions and correlations
- Analyze feature relationships with interactive plots
- Configure training parameters
- Start model training with live progress tracking
- View training logs and results
- Detailed performance metrics (R², RMSE, MAE, MSE)
- Predictions vs Actual scatter plots
- Residuals distribution analysis
- Feature importance charts
- Model parameter inspection
- Single Predictions: Enter feature values for individual predictions
- Batch Predictions: Upload CSV files for bulk processing
- Confidence Intervals: Get prediction uncertainty estimates
- Configure model parameters
- Adjust preprocessing options
- System information and controls
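The batch prediction flow (a saved model applied to an uploaded CSV) can be sketched roughly as follows. This is a minimal illustration, not the code in ui.py: the column names, the tiny training frame, and the use of scikit-learn's LinearRegression as a stand-in model are all assumptions made so the example runs on its own.

```python
import io
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in model (assumption): the real pipeline trains LightGBM instead.
train = pd.DataFrame({"rooms": [1.0, 2.0, 3.0, 4.0], "income": [2.0, 1.0, 4.0, 3.0]})
train["label"] = 10.0 * train["rooms"] + 5.0 * train["income"]
feature_cols = ["rooms", "income"]
model = LinearRegression().fit(train[feature_cols], train["label"])

# Persist the model with joblib, then reload it as a deployment step would.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.joblib"  # hypothetical file name
    joblib.dump(model, model_path)
    loaded = joblib.load(model_path)

# An in-memory buffer plays the role of the uploaded CSV file.
uploaded = io.StringIO("rooms,income\n2,3\n5,1\n")
batch = pd.read_csv(uploaded)
batch["prediction"] = loaded.predict(batch[feature_cols])
print(batch)
```

Selecting columns by an explicit feature list before calling predict() guards against uploads whose columns arrive in a different order than the training data.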
The ML pipeline follows these steps:
- Data Loading: Loads dataset (with fallback to synthetic data)
- Data Preprocessing: Handles missing values, scaling, and encoding
- Train-Test Split: Divides data into training and testing sets
- Feature Selection: Identifies top features using Random Forest
- Model Building: Trains LightGBM with hyperparameter tuning
- Model Evaluation: Comprehensive performance assessment
- Deployment: Saves model for production use
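The preprocessing, split, feature selection, training, and evaluation steps above could be wired together along these lines. This is a sketch under stated assumptions rather than the contents of pipeline.py: the toy columns, the median importance threshold, and the fallback to scikit-learn's GradientBoostingRegressor when LightGBM is not installed are all illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

try:
    from lightgbm import LGBMRegressor as Booster
    booster_kwargs = {"n_estimators": 100, "random_state": 0, "verbose": -1}
except ImportError:  # fallback so the sketch runs without LightGBM
    from sklearn.ensemble import GradientBoostingRegressor as Booster
    booster_kwargs = {"n_estimators": 100, "random_state": 0}

# Toy data (assumption): three numeric columns, one categorical, target 'label'.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "income": rng.normal(size=n),
    "rooms": rng.normal(size=n),
    "noise": rng.normal(size=n),
    "property_type": rng.choice(["house", "condo"], size=n),
})
df["label"] = (3.0 * df["income"] + df["rooms"]
               + (df["property_type"] == "house") * 2.0
               + rng.normal(scale=0.1, size=n))
df.loc[df.index % 25 == 0, "rooms"] = np.nan  # inject some missing values

X, y = df.drop(columns="label"), df["label"]
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in X.columns if c not in numeric_cols]

# Impute and scale numeric columns; one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    # Keep features whose Random Forest importance is at or above the median.
    ("select", SelectFromModel(
        RandomForestRegressor(n_estimators=50, random_state=0), threshold="median")),
    ("regressor", Booster(**booster_kwargs)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))
print(f"held-out R^2: {r2:.3f}")
```

Keeping every step inside one Pipeline object means the same transformations are applied at prediction time, and the whole object can be saved as a single artifact.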
- Core ML: pandas, scikit-learn, lightgbm, numpy
- Visualization: matplotlib, plotly, seaborn
- Web Interface: streamlit
- Utilities: joblib for model persistence
The setup includes pre-configured UV scripts for common tasks:
- Environment initialization
- Dependency installation
- Application launching
The application includes a synthetic housing dataset with:
- 1000 samples with 12 features
- Numeric features: Income, house age, rooms, location, etc.
- Categorical features: Property type, year built
- Target: House price prediction
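A dataset with this shape could be generated as in the sketch below. The column names, category values, distributions, and the formula for the target are hypothetical; only the overall layout (1000 rows, 12 features split into numeric and categorical, plus a 'label' target) follows the description above.

```python
import numpy as np
import pandas as pd


def make_synthetic_housing(n_samples: int = 1000, seed: int = 0) -> pd.DataFrame:
    """Hypothetical generator: 10 numeric + 2 categorical features and a 'label' target."""
    rng = np.random.default_rng(seed)
    numeric_names = [
        "income", "house_age", "rooms", "latitude", "longitude",
        "population", "bedrooms", "lot_size", "distance_to_center", "crime_rate",
    ]
    # Standard-normal numeric features keep the example short.
    df = pd.DataFrame(rng.normal(size=(n_samples, len(numeric_names))),
                      columns=numeric_names)
    df["property_type"] = rng.choice(["house", "apartment", "condo"], size=n_samples)
    df["year_built"] = rng.choice(["pre-1980", "1980-2000", "post-2000"], size=n_samples)

    # Made-up price formula: a few linear effects plus Gaussian noise.
    type_effect = df["property_type"].map({"house": 1.0, "apartment": -0.5, "condo": 0.0})
    df["label"] = (2.0 * df["income"] + 0.5 * df["rooms"] - 0.3 * df["house_age"]
                   + type_effect + rng.normal(scale=0.5, size=n_samples))
    return df


housing = make_synthetic_housing()
print(housing.shape)
```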
On the sample dataset, the pipeline achieves roughly the following:
- R² Score: ~0.97 (97% variance explained)
- RMSE: ~0.99 (low prediction error)
- Training Time: ~15 seconds for full pipeline
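For reference, R², RMSE, MAE, and MSE have standard definitions that can be computed directly from predictions. The helper below is a sketch with a hypothetical name; it is not taken from the project, and it assumes the true values are not all identical (otherwise R² is undefined).

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Compute R^2, RMSE, MAE, and MSE from two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mse = float(np.mean(residuals ** 2))        # mean squared error
    rmse = float(np.sqrt(mse))                  # same units as the target
    mae = float(np.mean(np.abs(residuals)))     # mean absolute error

    ss_res = float(np.sum(residuals ** 2))                  # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"r2": r2, "rmse": rmse, "mae": mae, "mse": mse}


metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
print(metrics)
```

Because RMSE is expressed in the units of the target, whether a given value counts as low depends on the scale of the prices being predicted.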
- Target distribution histograms
- Feature correlation heatmaps
- Scatter plot matrices
- Predictions vs actual charts
- Residuals analysis
- Feature importance rankings
- Model parameter inspection
- Training progress tracking
- Comprehensive evaluation metrics
- Model serialization with joblib
- Batch prediction capabilities
- Error handling and validation
- Scalable architecture
- Modify the fetch_data() method in pipeline.py
- Ensure your data has a 'label' column for the target
- The pipeline automatically handles numeric/categorical features
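As a rough guide, the body of a customized fetch_data() might look like the loader below. The function name, its signature, and the CSV layout are hypothetical, since the actual method in pipeline.py is not shown here; the only requirement taken from the notes above is the 'label' target column.

```python
import io

import pandas as pd


def load_custom_data(source, target_column: str = "label"):
    """Hypothetical loader: return (features, target) from a CSV path or buffer."""
    df = pd.read_csv(source)
    if target_column not in df.columns:
        raise ValueError(
            f"expected a '{target_column}' column, found: {list(df.columns)}")
    # Rows without a target value cannot be used for supervised training.
    df = df.dropna(subset=[target_column])
    return df.drop(columns=target_column), df[target_column]


# An in-memory CSV stands in for a real file; the last row has no label.
demo_csv = io.StringIO(
    "rooms,property_type,label\n3,house,250.0\n2,condo,180.0\n4,house,\n")
X, y = load_custom_data(demo_csv)

# Split columns by dtype, mirroring the numeric/categorical handling.
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in X.columns if c not in numeric_cols]
print(numeric_cols, categorical_cols)
```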
- Update the model_building() method
- Modify hyperparameter grids in the training configuration
- Add new evaluation metrics as needed
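A hyperparameter grid for GridSearchCV is a dictionary mapping parameter names to candidate values. The sketch below shows the general shape; the specific values, the scoring choice, and the random toy data are assumptions, and the project's own grid in model_building() may differ. It falls back to scikit-learn's GradientBoostingRegressor if LightGBM is not installed.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV

try:
    from lightgbm import LGBMRegressor
    estimator = LGBMRegressor(random_state=0, verbose=-1)
except ImportError:  # fallback so the sketch runs without LightGBM
    from sklearn.ensemble import GradientBoostingRegressor
    estimator = GradientBoostingRegressor(random_state=0)

# Illustrative grid: these three parameters exist on both estimators.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
}

# Toy regression problem (assumption) so the search has something to fit.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

# Every combination is evaluated with 3-fold cross-validation.
search = GridSearchCV(estimator, param_grid,
                      scoring="neg_root_mean_squared_error", cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The number of fits grows multiplicatively with each added parameter (here 2 x 2 x 2 combinations times 3 folds), so widening the grid directly increases training time.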
- Modify ui.py to add new pages or features
- Update the navigation and styling
- Add new visualization types
- pandas: Data manipulation and analysis
- scikit-learn: Machine learning toolkit
- lightgbm: Gradient boosting framework
- streamlit: Web application framework
- plotly: Interactive visualizations
- uv: Fast Python package manager
- pytest: Testing framework (optional)
- black: Code formatting (optional)
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
This project is open source and available under the MIT License.
- Built with modern Python ML stack
- Inspired by best practices in MLOps
- Designed for both beginners and experts
- Emphasis on user experience and visualization
Ready to explore your data? Start with ./setup.sh and launch the Streamlit interface!

