StockSmart is an intelligent inventory management system that combines transformer-based demand forecasting with deep reinforcement learning to optimize replenishment decisions. The system learns optimal ordering policies by processing historical sales data, stockout events, and exogenous factors, dynamically balancing holding costs against stockout penalties.
The primary approach trains per-category DQN agents: one shared Deep Q-Network per product category learns a generalized ordering policy across all products in that category. An optional Genetic Algorithm (GA) pretraining step can seed the DQN replay buffer with evolved (s, S) policy trajectories, which is intended to speed up convergence.
FreshRetailNet-50K Dataset
│
▼
┌─────────────────────┐
│ Data Processing │ Demand reconstruction, feature engineering,
│ (data_processing) │ temporal/lag/rolling/interaction features
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ Demand Forecasting │ LSTM baseline + Temporal Fusion Transformer
│ (data_processing) │ Quantile forecasts → state features for RL
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ RL Optimization │ Per-category DQN agents (Stable-Baselines3)
│ (rl_optimization) │ Optional GA pretraining (DEAP) / Gymnasium
└─────────────────────┘
The dataset is FreshRetailNet-50K from Hugging Face, which provides hourly sales, stockout indicators, and rich contextual features for ~50K product-store combinations.
| Split | Rows | Description |
|---|---|---|
| Train | 4.5M | Historical sales with covariates |
| Eval | 350K | Held-out evaluation partition |
Key fields: hours_sale, hours_stock_status, discount, holiday_flag, activity_flag, weather variables, and full category hierarchy.
| Component | Technology |
|---|---|
| Language | Python 3.9+ |
| RL Framework | Stable-Baselines3 (DQN) |
| Environment API | Gymnasium |
| Deep Learning | PyTorch |
| Transformer Models | PyTorch Lightning / Hugging Face Transformers |
| Genetic Algorithm | DEAP (optional) |
| Data Processing | Pandas, NumPy |
| Forecasting | NeuralForecast (LSTM, TFT) |
| MILP Baseline | Pyomo + GLPK |
| Visualization | Matplotlib, Plotly |
| Experiment Tracking | Weights & Biases |
RL/
├── README.md
├── requirements.txt
├── artifacts/ # Generated models and features
│ ├── hourly_features.parquet
│ ├── rl_forecast_features.parquet
│ ├── feature_scaler.pkl
│ ├── nf_lstm/
│ ├── nf_tft/
│ └── dqn_category_<id>/ # Per-category DQN models
├── data/
│ └── data_processing.ipynb # Data processing + demand forecasting
└── rl_optimization.ipynb # Per-category RL training + evaluation
# Clone the repo
git clone <repo-url> && cd RL
# Create a virtual environment (recommended)
python -m venv venv && source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Launch Jupyter
jupyter notebook

- Load FreshRetailNet-50K from Hugging Face
- Reconstruct latent demand during stockout periods
- Reshape daily rows with nested hourly sequences into long-format hourly table
- Engineer temporal, lag, rolling, interaction, and hierarchical features
- Standardize features and perform chronological train/val/test split
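
The reshaping step above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: `hours_sale` and `hours_stock_status` are dataset fields, while the `dt` key, the sample values, and the helper name `explode_daily_row` are hypothetical.

```python
from datetime import datetime, timedelta


def explode_daily_row(row):
    """Turn one daily record with nested 24-hour lists into 24 hourly records."""
    day_start = datetime.strptime(row["dt"], "%Y-%m-%d")
    hourly = []
    for hour, (sale, status) in enumerate(
        zip(row["hours_sale"], row["hours_stock_status"])
    ):
        hourly.append(
            {
                "timestamp": day_start + timedelta(hours=hour),
                "sale": sale,
                "stock_status": status,
            }
        )
    return hourly


# Hypothetical daily row; real rows carry many more covariates.
daily = {
    "dt": "2024-03-01",
    "hours_sale": [0.0] * 8 + [1.5] * 16,
    "hours_stock_status": [0] * 20 + [1] * 4,
}
long_rows = explode_daily_row(daily)
```

With pandas, calling `DataFrame.explode` on the list-valued columns yields the same long format for a whole table at once.
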
- LSTM baseline: 168-hour lookback, multi-step 24-hour forecast
- Temporal Fusion Transformer (TFT): probabilistic quantile forecasts with attention-based interpretability
- Evaluate on MAE, RMSE, Quantile Loss, and bias metrics
- Export point forecasts + prediction intervals as RL state features
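
For reference, the quantile (pinball) loss and the bias metric listed above can be written in a few lines. This is a generic sketch of the standard definitions, not the notebook's evaluation code; the function names and sample numbers are illustrative, and the sign convention for bias (forecast minus actual) is an assumption.

```python
def pinball_loss(y_true, y_pred, quantile):
    """Average quantile (pinball) loss for one quantile level in (0, 1)."""
    if not 0.0 < quantile < 1.0:
        raise ValueError("quantile must lie strictly between 0 and 1")
    total = 0.0
    for actual, forecast in zip(y_true, y_pred):
        error = actual - forecast
        # Under-forecasts are weighted by q, over-forecasts by (1 - q).
        total += max(quantile * error, (quantile - 1.0) * error)
    return total / len(y_true)


def mean_bias(y_true, y_pred):
    """Mean signed error; positive values mean the forecast runs high."""
    return sum(f - a for a, f in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only.
actual = [10.0, 12.0, 8.0]
p90_forecast = [12.0, 12.0, 7.0]
loss_p90 = pinball_loss(actual, p90_forecast, 0.9)
bias = mean_bias(actual, p90_forecast)
```

A high quantile such as 0.9 penalizes under-forecasting nine times more than over-forecasting, which is why upper quantiles are useful as conservative demand estimates in the RL state.
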
Two environment types:
- InventoryEnv: single product-store environment for per-product evaluation
- CategoryInventoryEnv: multi-product environment that randomly samples a product-store from the category each episode, training the agent on diverse demand patterns
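
The per-episode sampling idea behind the category environment can be illustrated with a toy class. This sketch deliberately avoids Gymnasium and is not the notebook's `CategoryInventoryEnv`: the class name, the zero lead time, the two-element observation, and the return signature of `step` are all simplifying assumptions.

```python
import random


class CategorySamplerSketch:
    """Toy stand-in: one randomly chosen product-store per episode."""

    def __init__(self, demand_by_series, seed=None):
        # demand_by_series maps a product-store key to its demand sequence.
        self.demand_by_series = demand_by_series
        self.keys = sorted(demand_by_series)
        self.rng = random.Random(seed)
        self.current_key = None
        self.t = 0
        self.inventory = 0.0

    def reset(self):
        """Start a new episode on a freshly sampled product-store."""
        self.current_key = self.rng.choice(self.keys)
        self.t = 0
        self.inventory = 0.0
        return self._observation()

    def step(self, order_qty):
        """Receive the order at once, serve demand, and advance one period."""
        demand = self.demand_by_series[self.current_key][self.t]
        self.inventory += order_qty
        sold = min(self.inventory, demand)
        self.inventory -= sold
        self.t += 1
        done = self.t >= len(self.demand_by_series[self.current_key])
        # Returns (observation, unmet demand, episode finished).
        return self._observation(), demand - sold, done

    def _observation(self):
        # Normalized product index plus on-hand inventory.
        index = self.keys.index(self.current_key) / max(len(self.keys) - 1, 1)
        return [index, self.inventory]
```

Because each `reset` may land on a different product-store, a single agent sees many demand patterns over training, which is the motivation for sharing one network per category.
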
Per-category DQN training loop:
- Select the top 5 product categories by number of product-store combinations
- For each category, build a CategoryInventoryEnv with all its product-stores
- Train a DQN agent (MlpPolicy, [256, 256], epsilon-greedy) on the category environment
- Optionally seed the replay buffer with GA-evolved (s, S) policy trajectories (USE_GA_PRETRAINING = True)
- Evaluate against (s, S), EOQ, and random baselines on the test split
- Visualize inventory trajectories and cumulative cost per category
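
The two classical baselines named above follow textbook rules, sketched here for orientation. The notebook's exact baseline definitions are not shown in this README, so the function names, the "at or below s" trigger convention, and the example numbers are assumptions.

```python
import math


def s_S_order(inventory_position, reorder_point, order_up_to):
    """(s, S) policy: when the position is at or below s, order up to S."""
    if inventory_position <= reorder_point:
        return order_up_to - inventory_position
    return 0.0


def eoq(demand_per_period, fixed_order_cost, holding_cost_per_unit):
    """Classic economic order quantity: sqrt(2 * D * K / h)."""
    return math.sqrt(
        2.0 * demand_per_period * fixed_order_cost / holding_cost_per_unit
    )


# Illustrative numbers only.
order_when_low = s_S_order(3.0, reorder_point=5.0, order_up_to=20.0)
order_when_ok = s_S_order(8.0, reorder_point=5.0, order_up_to=20.0)
batch_size = eoq(1000.0, fixed_order_cost=10.0, holding_cost_per_unit=0.5)
```

In the EOQ formula, demand and holding cost must refer to the same time unit (for example, both per year), otherwise the batch size is off by a constant factor.
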
State vector for CategoryInventoryEnv:
[product_index_normalized, on_hand_inventory, incoming_shipments...,
demand_forecast..., price, discount, stockout_history...]
Reward: negative total cost = -(holding + stockout penalty + ordering costs)
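
A minimal sketch of this reward for a single step is shown below. The cost coefficients and the split of ordering cost into a fixed and a per-unit part are hypothetical; the README only states that the reward is the negative sum of holding, stockout, and ordering costs.

```python
def step_reward(
    on_hand,
    unmet_demand,
    order_qty,
    holding_cost=0.1,       # hypothetical cost per unit held at period end
    stockout_penalty=2.0,   # hypothetical cost per unit of unmet demand
    fixed_order_cost=1.0,   # hypothetical cost charged whenever an order is placed
    unit_order_cost=0.0,    # hypothetical cost per unit ordered
):
    """Return the negative total cost for one period."""
    holding = holding_cost * max(on_hand, 0.0)
    stockout = stockout_penalty * max(unmet_demand, 0.0)
    if order_qty > 0:
        ordering = fixed_order_cost + unit_order_cost * order_qty
    else:
        ordering = 0.0
    return -(holding + stockout + ordering)


# Holding 10 units and placing an order: -(1.0 + 0.0 + 1.0)
reward_with_order = step_reward(on_hand=10.0, unmet_demand=0.0, order_qty=5.0)
# Empty shelf with 3 units of unmet demand and no order: -(0.0 + 6.0 + 0.0)
reward_stockout = step_reward(on_hand=0.0, unmet_demand=3.0, order_qty=0.0)
```

The ratio between the stockout penalty and the holding cost largely determines how much safety stock a learned policy will carry.
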
Key flags in rl_optimization.ipynb:
| Parameter | Default | Description |
|---|---|---|
| USE_GA_PRETRAINING | False | Enable GA (s,S) evolution + replay seeding |
| TOTAL_TIMESTEPS | 20,000 | DQN training steps per category |
| N_CATEGORIES | 5 | Number of product categories to train |
| EPISODE_LENGTH | 365 | Days per episode |
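
To see what the defaults imply, the flags can be collected in a small config object. The dataclass and its helper methods are hypothetical (the notebook defines these as plain variables), and the episode count assumes one environment step per day.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Defaults copied from the table above; the class itself is illustrative."""

    use_ga_pretraining: bool = False
    total_timesteps: int = 20_000
    n_categories: int = 5
    episode_length: int = 365

    def full_episodes_per_category(self):
        # Assumes each environment step corresponds to one day.
        return self.total_timesteps // self.episode_length

    def total_training_steps(self):
        return self.total_timesteps * self.n_categories


config = TrainingConfig()
episodes = config.full_episodes_per_category()
all_steps = config.total_training_steps()
```

Under these assumptions, each category agent sees 54 full episodes and the whole run covers 100,000 environment steps, which is a small budget for DQN and a reason to consider GA seeding or a larger `TOTAL_TIMESTEPS`.
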
All experiments can be logged with Weights & Biases:
- Per-category cost comparison (DQN vs baselines)
- Service level (% periods without stockout)
- Inventory trajectory comparison plots
- Hyperparameter sweeps and ablation studies (GA vs no-GA)
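
The first two metrics can be computed from per-period logs before they are sent to Weights & Biases. This is a sketch with hypothetical function names and sample values; it treats a period as stockout-free when its unmet demand is zero.

```python
def service_level(unmet_demand_per_period):
    """Percentage of periods with no unmet demand."""
    periods = len(unmet_demand_per_period)
    if periods == 0:
        raise ValueError("at least one period is required")
    stockout_free = sum(1 for unmet in unmet_demand_per_period if unmet <= 0)
    return 100.0 * stockout_free / periods


def cost_reduction_pct(policy_cost, baseline_cost):
    """Percentage by which a policy's total cost undercuts a baseline's."""
    if baseline_cost == 0:
        raise ValueError("baseline cost must be non-zero")
    return 100.0 * (baseline_cost - policy_cost) / baseline_cost


# Illustrative values only.
level = service_level([0.0, 0.0, 2.0, 0.0])
saving = cost_reduction_pct(policy_cost=80.0, baseline_cost=100.0)
```
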
MIT