This project focuses on predicting sales for Rossmann, one of the largest drugstore chains in Europe. With over 4,000 stores across multiple countries, accurate sales forecasting is crucial for inventory planning, promotions, and resource allocation.
Through data exploration, feature engineering, and machine learning modeling, this project aims to build a robust sales prediction model using historical store data.
- Analyze historical sales and customer trends across stores.
- Engineer meaningful time-based and promotional features.
- Handle outliers to improve model performance.
- Build and compare machine learning models for accurate forecasting.
- Identify key drivers of sales using feature importance.
Data Cleaning & Preprocessing
- Treated missing values and corrected data types.
- Removed extreme outliers using the IQR method.
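
The IQR rule mentioned above can be sketched roughly as follows. This is a minimal illustration on synthetic data, assuming a `Sales` column and the conventional 1.5 × IQR fences; the notebook's actual multiplier and columns may differ, and `remove_iqr_outliers` is a hypothetical helper name.

```python
import pandas as pd


def remove_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep only rows whose `column` value lies inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]


# Synthetic example values (not taken from the Rossmann dataset).
sales = pd.DataFrame({"Sales": [4000, 5000, 5500, 6000, 6500, 7000, 30000]})
cleaned = remove_iqr_outliers(sales, "Sales")
print(cleaned["Sales"].tolist())
```

Here the single extreme value falls above the upper fence and is dropped, while the remaining rows are kept unchanged.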
Feature Engineering
- Created Recency feature (days since last record).
- Extracted Day of Week, Promo, and Holiday indicators.
- Encoded categorical features.
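
A rough sketch of these feature engineering steps with pandas is shown below. The column names (`Date`, `StoreType`, `StateHoliday`) follow the public Rossmann dataset but are assumptions here, as is the definition of Recency as the number of days between each record and the most recent date in the data. The rows are synthetic.

```python
import pandas as pd

# Synthetic rows; column names are assumed, not confirmed by the notebook.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2015-07-27", "2015-07-29", "2015-07-31"]),
    "StoreType": ["a", "c", "a"],
    "StateHoliday": ["0", "a", "0"],
})

# Recency: days between each record and the latest date in the data (assumed definition).
df["Recency"] = (df["Date"].max() - df["Date"]).dt.days

# Day of week as 1 (Monday) to 7 (Sunday).
df["DayOfWeek"] = df["Date"].dt.dayofweek + 1

# Binary holiday indicator: any value other than "0" counts as a state holiday.
df["IsStateHoliday"] = (df["StateHoliday"] != "0").astype(int)

# One-hot encode the categorical store type.
df = pd.get_dummies(df, columns=["StoreType"], prefix="StoreType")
print(df)
```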
Exploratory Data Analysis (EDA)
- Sales & customer distribution analysis.
- Promotion and holiday impact assessments.
- Correlation and feature significance study.
Machine Learning Models
- Decision Tree Regressor
- Random Forest Regressor
- AdaBoost Regressor
- Stacking Regressor
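
As a hedged sketch, the four models listed above could be compared with scikit-learn along these lines. The data is a synthetic stand-in for the engineered features, and the hyperparameters, the choice of base learners, and the linear final estimator in the stack are all assumptions rather than the notebook's actual settings.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the feature matrix and target (not Rossmann data).
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(400, 3))
y = 5000 * X[:, 0] + 1000 * X[:, 1] + rng.normal(0, 100, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = {
    "Decision Tree": DecisionTreeRegressor(max_depth=6, random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "AdaBoost": AdaBoostRegressor(n_estimators=50, random_state=42),
    # Stacking combines base learners through a final estimator (assumed setup).
    "Stacking": StackingRegressor(
        estimators=[
            ("dt", DecisionTreeRegressor(max_depth=6, random_state=42)),
            ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
        ],
        final_estimator=LinearRegression(),
    ),
}

# Fit each model and record its R^2 on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: R^2 = {score:.3f}")
```

Scores printed by this sketch describe the synthetic data only and say nothing about the project's reported results.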
Sales Distribution
- Daily sales per store mostly range from 2,000 to 10,000 (in the dataset's sales units).
- High-value outliers (above 15,000) can skew model predictions.
Customer Behavior
- Most stores serve fewer than 1,000 customers per day.
- Sudden spikes above 1,500 customers per day act as noise and were treated as outliers.
Promotion Impact
- Promo = 1 days show a clear increase in median sales.
- Promotional campaigns are strong revenue boosters.
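
One simple way to check the promotion effect described above is to compare median sales by the `Promo` flag. The sketch below uses made-up numbers purely to show the calculation; the column names are assumed from the public dataset.

```python
import pandas as pd

# Made-up values for illustration only (not Rossmann results).
df = pd.DataFrame({
    "Promo": [0, 0, 0, 1, 1, 1],
    "Sales": [4000, 5000, 6000, 7000, 8000, 9000],
})

# Median sales on non-promo (0) and promo (1) days.
median_by_promo = df.groupby("Promo")["Sales"].median()

# Relative uplift of the promo-day median over the non-promo median.
uplift = median_by_promo.loc[1] / median_by_promo.loc[0] - 1
print(median_by_promo)
print(f"Median uplift on promo days: {uplift:.0%}")
```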
Day of the Week Trends
- Sales drop significantly on Sundays due to partial/complete store closures.
- Weekdays show more stable and higher sales.
Holiday Effects
- State Holidays result in zero or near-zero sales, indicating closed stores.
- These dates are essential for forecasting accuracy.
Feature Importance
- Customers is the top predictor of sales.
- Promo and DayOfWeek also strongly influence revenue.
- Recency has a minor negative correlation with sales.
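
Feature importance of this kind can be read from a fitted tree ensemble. The following is a minimal sketch using a Random Forest's `feature_importances_` on synthetic data; the target is deliberately constructed so that `Customers` dominates, which is an assumption made for the demo and not a reproduction of the project's finding.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Synthetic features with names assumed from the public dataset.
X = pd.DataFrame({
    "Customers": rng.integers(200, 1200, size=n),
    "Promo": rng.integers(0, 2, size=n),
    "DayOfWeek": rng.integers(1, 8, size=n),
})

# Synthetic target in which Customers carries most of the signal (by construction).
y = 8 * X["Customers"] + 800 * X["Promo"] + rng.normal(0, 200, size=n)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, normalised by scikit-learn to sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```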
Model Performance
- Random Forest and Stacking Regressor performed best.
- Achieved an R² of about 0.85 to 0.86 on test data, meaning the models explain roughly 85% to 86% of the variance in sales.
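
R² measures the share of variance in the target that a model explains, not a classification-style accuracy. A small sketch of the calculation on made-up values (`r_squared` is a hypothetical helper; scikit-learn's `r2_score` computes the same quantity):

```python
import numpy as np


def r_squared(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)


# Made-up actual and predicted sales, for illustration only.
y_true = [4000, 5000, 6000, 7000]
y_pred = [4200, 4900, 6100, 6800]
score = r_squared(y_true, y_pred)
print(f"R^2 = {score:.2f}")
```
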
- Python
- pandas, numpy
- matplotlib, seaborn
- scikit-learn
- Jupyter Notebook
1. Clone the Repository:
git clone https://github.com/indu-explores-data/Rossmann-Sales-Prediction.git
2. Navigate to the Project Directory:
cd Rossmann-Sales-Prediction
3. Create and Activate a Virtual Environment (Recommended):
python -m venv venv
Windows:
venv\Scripts\activate
Mac/Linux:
source venv/bin/activate
4. Install Required Libraries:
pip install -r requirements.txt
5. Launch Jupyter Notebook:
jupyter notebook
6. Open Rossmann Sales Prediction.ipynb and run all cells to reproduce the analysis.
- Open Rossmann_Sales_Prediction.ipynb in Jupyter Notebook
- Run all cells sequentially
- Explore visualizations and model comparisons
- Final forecasts available in model output cells
Let’s connect on LinkedIn for project discussions or data-driven collaborations:
If you found this project helpful, please ⭐ star the repository and share your thoughts. Suggestions and contributions are always welcome!









