RetailSynth is a sophisticated simulation environment designed to generate synthetic retail datasets. Its purpose is the evaluation of AI systems in retail, focusing on personalized pricing promotions, product recommendation algorithms, and more. RetailSynth leverages a multi-stage model to simulate customer shopping behavior, incorporating price sensitivity and customer history. The environment is calibrated with publicly available grocery data, allowing for the creation of realistic shopping transactions.
DISCLAIMER: RetailSynth is a research grade library and not production ready. The goal of open sourcing this library is to reproduce the analysis and results shown in the paper. Please cite the following paper when you use (parts of) this code:
Xia Y, Arian A, Narayanamoorthy S and Mabry J (2023). RetailSynth: Synthetic Data Generation for Retail AI Systems Evaluation
- Multi-Stage Simulation Model: Incorporates econometric models covering the full customer lifecycle, including store visits, category choices, product choices, and purchase quantities.
- Interpretability: Based on utility theory and can incorporate customer-specific features and treatments.
- Price Policy Setting: Employs a Markov model for generating realistic pricing strategies.
- Python 3.10.10
- Poetry for dependency management
Clone the repository and set up the environment:
git clone git@github.com:RetailMarketingAI/retailsynth.git
cd retailsynth
poetry install # install deps
poetry shell # switch to the project environment
The code in this repository is organized into three primary directories:
- src: Core RetailSynth library, featuring modules for data processing, feature engineering, and the synthesizer itself.
- analysis_workflow: Contains workflows used in our research, including data analysis, synthesizer calibration, and scenario analysis. These scripts and notebooks can be used to reproduce the results in the paper.
- tests: Unit tests ensuring code quality and reliability.
A directory tree with details of the submodules is provided below.
RetailSynth
├── LICENSE
├── poetry.lock
├── pyproject.toml
├── analysis_workflow # Research workflows
│   ├── 1_complete_journey_eda # Exploratory data analysis
│   ├── 2_synthesizer_calibration # Synthesizer Calibration
│   └── 3_scenario_analysis # Scenario Analysis
├── src
│   └── retailsynth # Core RetailSynth library
│       ├── base_config.py # Base configuration with defaults
│       ├── datasets # Module for processing Complete Journey Data
│       ├── feature_eng # Feature engineering module
│       ├── synthesizer # Core synthesizer module
│       └── utils # Misc utility functions
└── tests # Unit tests
In RetailSynth, we utilize Hydra for managing our project's configuration settings. Hydra's dynamic configuration capabilities make it simple to extend and further customize RetailSynth to different business settings.
- Base Configuration: Global default configuration values are defined in src/retailsynth/base_config.py. This base configuration serves as the foundation for all workflow-specific settings.
- Workflow Configurations: Each analysis workflow contains its own cfg subdirectory, holding configuration files tailored for individual tasks within that workflow.
Below is an example of a Hydra configuration in YAML format:
# Context parameters for script execution
hydra:
  run:
    dir: analysis_workflow/1_complete_journey_eda/outputs # Output directory
  job:
    chdir: True # Change working directory to the job's directory
# Default settings are loaded from `src/retailsynth/base_config.py`
# The '_self_' directive ensures these defaults are overridden by this file's settings
defaults:
  - base_config
  - _self_
# Custom parameters (overrides the defaults)
n_customers_sampled: 100 # Number of customers to sample
n_workers: 4 # Number of workers to use
To customize configurations for your specific needs:
- Modify the YAML files in the cfg directory of the relevant workflow.
- Adjust the parameters as needed, ensuring they align with the structure defined in base_config.py.
For more detailed information on using and customizing Hydra configurations, refer to the Hydra Documentation.
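The override order described above can be pictured as a recursive dictionary merge in which later entries win. The sketch below is plain Python rather than Hydra's actual implementation, and the base values are hypothetical placeholders, not the real defaults in base_config.py.

```python
from copy import deepcopy


def merge_configs(base: dict, override: dict) -> dict:
    """Return a new dict in which values from `override` win over `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them wholesale.
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults standing in for src/retailsynth/base_config.py.
base_config = {
    "n_customers_sampled": 1000,
    "n_workers": 1,
    "paths": {"processed_data": "data/processed/"},
}

# Settings from the workflow YAML file (the `_self_` entry in the defaults list).
workflow_config = {"n_customers_sampled": 100, "n_workers": 4}

# `_self_` is listed after `base_config`, so the workflow file is applied last.
final_config = merge_configs(base_config, workflow_config)
print(final_config)
```

Keys that the workflow file does not mention, such as the paths section here, keep their base values.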
Exploratory Data Analysis of The Complete Journey data and an overview of the pre-processing logic are contained
in the 1_preprocess_analysis.ipynb notebook. This notebook is a good entry point for becoming familiar with the
contents and structure of The Complete Journey data. The notebook utilizes a single configuration file,
located at ./cfg/real_dataset.yaml. To speed up the analysis iterations, we recommend you modify
the n_customers_sampled parameter within the configuration file to fit the compute resources available.
We followed an iterative workflow to calibrate the synthesizer distributions to match the distributions from The Complete Journey data. In practice, this translated to:
- Step 0: [Pre-requisite] Process Complete Journey data and generate baseline distributions to compare against
- Step 1: Choose a set of parameters for the Bayesian priors for the generative model that we have described in the paper
- Step 2: Generate a batch of synthetic data
- Step 3: Inspect the key distributions of interest both quantitatively and qualitatively
- Step 4: Update the parameters and repeat till the distributions match well
Initially, our approach to calibration was manual, leveraging insights from academic literature, intuitive understanding, and business judgment for parameter setting and adjustments. However, we encountered challenges in accurately replicating a few key target distributions. To help address these fitting issues, we shifted our approach to leverage Bayesian optimization for parameter sweeping using Optuna. We document both these approaches below.
We now switch to the POV of a user attempting to replicate our workflow.
To update the model parameters (Step 1), edit
./cfg/synthetic_data/synthetic_data_params.yaml. This configuration specifies the input to the
synthesizer and our current estimate of the parameter values for the prior distributions. This is a key file that you
will frequently revise during calibration. Detailed documentation for each setting is provided below:
sample_time_steps: 53 # Duration of simulation in weeks.
synthetic_data_setup:
  n_customer: 100 # Number of customers to be simulated.
  n_category: 3 # Number of product categories.
  n_product: 30 # Total number of products across all categories.
  category_product_count: [5, 10, 15] # Distribution of products across categories (the sum should equal n_product).
  store_util_marketing_feature_mode: "random" # Choose from ["random", "discount"]; set to "random" for calibration
  random_seed: 0 # Seed for random number generation, ensuring reproducibility.
  random_seed_range: 100 # Range of seed values for random processes.
  # Parameters for the prior distributions
  # details provided in Appendix A of the paper
  # coefficients to generate discount
  ...
  # coefficients to compute product price
  ...
  # coefficients to compute product utility
  ...
  # coefficients to compute category utility
  ...
  # coefficients to compute store visit probability
  ...
  # coefficients to compute product demand
  ...
Run the synthesizer_validation.ipynb notebook to perform Step 0, Step 2 and Step 3. This runs the
synthesizer using the above specified config and verifies how closely we match the target distributions. We recommend
looking at the overlap of the key probability distributions (store visit, category, and product choice probabilities)
as well as outcome distributions (time between visits, basket size, quantity purchased etc.)
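As a rough illustration of the outcome distributions mentioned above, the sketch below derives basket sizes and times between visits from a transaction list. The record layout (customer_id, week, product_id, quantity) and the definition of basket size as the number of distinct products per visit are assumptions made for this example, not the notebook's actual logic.

```python
from collections import defaultdict

# Hypothetical transaction records: (customer_id, week, product_id, quantity).
transactions = [
    (1, 1, "A", 2),
    (1, 1, "B", 1),
    (1, 4, "A", 1),
    (2, 2, "C", 3),
    (2, 3, "A", 1),
    (2, 3, "C", 2),
    (2, 7, "B", 1),
]


def basket_sizes(txns):
    """Number of distinct products per (customer, week) visit."""
    baskets = defaultdict(set)
    for customer, week, product, _quantity in txns:
        baskets[(customer, week)].add(product)
    return [len(products) for _visit, products in sorted(baskets.items())]


def time_between_visits(txns):
    """Gaps in weeks between consecutive visits of the same customer."""
    visits = defaultdict(set)
    for customer, week, _product, _quantity in txns:
        visits[customer].add(week)
    gaps = []
    for customer in sorted(visits):
        weeks = sorted(visits[customer])
        gaps.extend(later - earlier for earlier, later in zip(weeks, weeks[1:]))
    return gaps


sizes = basket_sizes(transactions)
gaps = time_between_visits(transactions)
print("basket sizes:", sizes)
print("weeks between visits:", gaps)
```

Computing the same summaries for the real and the synthetic transactions gives two samples per outcome that can then be overlaid or compared numerically.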
In addition to the configs we have already introduced (i.e. synthetic_data_params.yaml and
./cfg/real_dataset.yaml), this notebook uses synthetic_dataset.yaml. This additional config
lets you override the default paths for output.
Once you have analyzed the notebook, you can nudge the parameters in the right direction by updating
the relevant values in synthetic_data_params.yaml.
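Before re-running the notebook with edited values, a quick consistency check on the setup section can catch simple mistakes. The sketch below is a hypothetical helper, not part of the RetailSynth library; it encodes the documented constraint that category_product_count sums to n_product, and additionally assumes that the list has one entry per category.

```python
def check_synthetic_setup(setup):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    counts = setup["category_product_count"]
    if sum(counts) != setup["n_product"]:
        problems.append(
            f"category_product_count sums to {sum(counts)}, "
            f"expected n_product={setup['n_product']}"
        )
    # Assumption: one count per category.
    if len(counts) != setup["n_category"]:
        problems.append(
            f"category_product_count has {len(counts)} entries, "
            f"expected n_category={setup['n_category']}"
        )
    if setup["store_util_marketing_feature_mode"] not in ("random", "discount"):
        problems.append("store_util_marketing_feature_mode must be 'random' or 'discount'")
    return problems


# Values copied from the documented example configuration.
setup = {
    "n_customer": 100,
    "n_category": 3,
    "n_product": 30,
    "category_product_count": [5, 10, 15],
    "store_util_marketing_feature_mode": "random",
}

problems = check_synthetic_setup(setup)
print(problems or "setup looks consistent")
```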
While the calibration steps outlined above remain the same, we automate Step 1, Step 2, and Step 3 (partially) by defining an objective function and letting Optuna figure out the best set of parameters. As discussed in the paper, due to structural reasons it is difficult to achieve a perfect match between the distributions. We provide a range of different objective functions for the fine-tuning that can optimize different target distributions.
We use the KS-complement metric as the objective unless otherwise noted.
- Store Visit Choice Optimization: Optimize the fit of the store visit distribution.
- Overall Demand Optimization: Optimize the fit of the quantity purchased distribution.
- Category Choice Optimization: Optimize the fit of the category choice distribution according to a blended metric, summing the KS-complement metrics for the category choice and basket size distributions.
- Product Choice Optimization: Optimize the fit of the product choice distribution.
- Combined Optimization: Optimize the fit of the category and product choice distributions, summing their respective KS-complement metrics.
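For intuition, the objectives listed above can be sketched in a few lines, assuming that KS-complement means one minus the two-sample Kolmogorov-Smirnov statistic. This is a standard-library illustration with hypothetical toy samples; the repository's own metric implementation may differ.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # Empirical CDFs only change at observed values, so checking those is enough.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def ks_complement(sample_a, sample_b):
    """1.0 for identical empirical distributions, 0.0 for fully separated ones."""
    return 1.0 - ks_statistic(sample_a, sample_b)


# Hypothetical toy samples standing in for real and synthetic outcomes.
real_category_choice = [0.10, 0.20, 0.20, 0.30, 0.40]
synthetic_category_choice = [0.10, 0.20, 0.30, 0.30, 0.50]
real_basket_size = [1, 2, 2, 3, 5]
synthetic_basket_size = [1, 1, 2, 3, 4]

# Blended objective: sum of two KS-complements, so the best possible value is 2.
blended_score = (
    ks_complement(real_category_choice, synthetic_category_choice)
    + ks_complement(real_basket_size, synthetic_basket_size)
)
print(f"blended score: {blended_score:.3f}")
```

Because each term lies between 0 and 1, a blended objective over two distributions is maximized rather than minimized and cannot exceed 2.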
Begin the parameter sweeping process by processing The Complete Journey data (i.e. re-running Step 0). The
configuration for this script is driven by ./cfg/real_dataset.yaml. We suggest re-running this analysis rather than using data prepared previously, in case any parameters have been updated. The processed data will be saved in
the outputs/data directory by default. Run the script with the following command:
python ./analysis_workflow/2_synthesizer_calibration/1_prepare_real_data.py
Next, update ./cfg/parameter_sweeping.yaml to specify:
- the optimization strategy
hydra:
  sweeper:
    study_name: best_category_fit # choose one of ["best_category_fit", "best_product_fit", "best_overall_fit", "best_store_fit", "best_demand_fit"]
- the specific parameters that require sweeping
hydra:
  sweeper:
    params:
      # Specify the parameters and the range of values to sweep here.
      # For example, the following two parameters are key ones to compute the category utility
      synthetic_data.synthetic_data_setup.category_choice_gamma_0j_cate.loc: range(-5, -4.4, step=0.2)
      synthetic_data.synthetic_data_setup.category_choice_gamma_0j_cate.scale: 0.1,0.5
There are additional settings that control the execution of the parameter sweep. Documentation for the key settings is provided below:
hydra:
  run:
    dir: analysis_workflow/2_synthesizer_calibration/outputs
  job:
    chdir: True
  mode: MULTIRUN
  sweeper:
    sampler:
      _target_: optuna.samplers.TPESampler # see https://optuna.readthedocs.io/en/stable/reference/samplers/index.html for details
      seed: 0
    direction: maximize
    # specifies the optimization target
    study_name: best_category_fit # choose one of ["best_category_fit", "best_product_fit", "best_overall_fit", "best_store_fit", "best_demand_fit"]
    storage: null
    n_trials: 1 # specifies the maximum number of runs to perform in a sweep
    n_jobs: 1 # specifies how many trials to run in parallel
    params:
      # Specify the parameters and the range of values to sweep here.
      # For example, the following two parameters are key ones to compute the category utility
      synthetic_data.synthetic_data_setup.category_choice_gamma_0j_cate.loc: range(-5, -4.4, step=0.2)
      synthetic_data.synthetic_data_setup.category_choice_gamma_0j_cate.scale: 0.1,0.5
  sweep:
    # sweeping report will be stored locally in the following directory
    dir: analysis_workflow/2_synthesizer_calibration/multirun/${hydra.sweeper.study_name}/${now:%Y-%m-%d_%H-%M-%S}/
    subdir: ${hydra.job.num}
paths: # use the following to override the default
  processed_data: data/processed/synthetic_data_calib/
  txns_array_path: data/processed/synthetic_data_calib/txns_array/
  store_feature_path: data/processed/synthetic_data_calib/store_features/
  category_feature_path: data/processed/synthetic_data_calib/category_features/
  product_feature_path: data/processed/synthetic_data_calib/product_features/
Execute the parameter sweeping script with the following command:
python ./analysis_workflow/2_synthesizer_calibration/2_parameter_sweeping.py
Running this script effectively does Step 1, 2 and 4 multiple times. The results of the sweeping process,
including detailed reports, are stored in the
analysis_workflow/2_synthesizer_calibration/multirun/${hydra.sweeper.study_name}/${now:%Y-%m-%d_%H-%M-%S}/
directory. For example, if you set study_name: best_category_fit in the parameter_sweeping.yaml the output
would be stored in analysis_workflow/2_synthesizer_calibration/multirun/best_category_fit/${now:%Y-%m-%d_%H-%M-%S}/.
The optimal parameter values are outputted to the console and also saved in the optimization_results.yaml
file in the results directory, as shown below:
name: optuna
best_params:
  synthetic_data.synthetic_data_setup.category_choice_gamma_0j_cate.loc: -4.6
  synthetic_data.synthetic_data_setup.category_choice_gamma_0j_cate.scale: 0.1
best_value: 1.4419610352479562 # KS-Complement metric (max=2 in this case as we are matching 2 distributions)
After performing the sweep, we can update the corresponding parameter values in synthetic_dataset.yaml to become our new baseline:
# coefficients to compute category utility
category_choice_gamma_0j_cate:
  _target_: numpyro.distributions.Normal
  loc: -4.6
  scale: 0.1
Now, perform Step 3 by running the synthesizer_validation.ipynb notebook to verify the quantitative and
qualitative fit of the chosen set of parameters. The process here is the same as the one described in the manual
calibration approach.
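Copying the swept values back into a configuration can also be scripted. The sketch below applies dotted-path keys, in the form they appear under best_params, to a nested dictionary. The helper name and the starting values of loc and scale are hypothetical; reading and writing the actual YAML files is left out.

```python
def set_by_dotted_path(config: dict, dotted_key: str, value) -> None:
    """Set config['a']['b']['c'] = value for dotted_key 'a.b.c', creating levels as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value


# Best parameters as reported in the example optimization_results.yaml.
best_params = {
    "synthetic_data.synthetic_data_setup.category_choice_gamma_0j_cate.loc": -4.6,
    "synthetic_data.synthetic_data_setup.category_choice_gamma_0j_cate.scale": 0.1,
}

# Hypothetical current baseline (the loc and scale values are placeholders).
config = {
    "synthetic_data": {
        "synthetic_data_setup": {
            "category_choice_gamma_0j_cate": {
                "_target_": "numpyro.distributions.Normal",
                "loc": -5.0,
                "scale": 0.5,
            }
        }
    }
}

for dotted_key, value in best_params.items():
    set_by_dotted_path(config, dotted_key, value)

updated = config["synthetic_data"]["synthetic_data_setup"]["category_choice_gamma_0j_cate"]
print(updated)
```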
Note: We've included the optimal parameter values obtained from our experiments in the optimal directory
for each strategy. The provided synthetic_data_params.yaml file is pre-configured with these values. Simply
running the synthesizer_validation.ipynb notebook will reproduce the calibration results presented in our paper.
To replicate the scenario analysis results featured in our paper, you can run the pricing_strategy_visualization.ipynb notebook. This notebook executes the run_scenarios.py
script to generate the synthetic data for each scenario and then generates the scenario analysis results.
The policies we investigated in our paper are detailed in the table below.
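| Policy | Effective Discount | Discount State Probability | Expected Discount Depth | discount_state_a_01 Beta (α, β) | discount_state_a_11 Beta (α, β) | discount_depth_distribution Beta (α, β) |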
|---|---|---|---|---|---|---|
| I | 3% | 60% | 5% | (60,40) | (60,40) | (5,95) |
| II | 3% | 30% | 10% | (30,70) | (30,70) | (10,90) |
| III | 15% | 60% | 25% | (60,40) | (60,40) | (25,75) |
| IV | 15% | 30% | 50% | (30,70) | (30,70) | (50,50) |
| V | 24% | 60% | 40% | (60,40) | (60,40) | (40,60) |
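The percentages in the table can be cross-checked with a little arithmetic. The sketch below assumes that the three number pairs in each row are Beta (alpha, beta) parameters for the two discount-state transition probabilities and for the discount depth, in that order, that the discount state probability is the stationary probability of a two-state Markov chain evaluated at the Beta means, and that the effective discount is that probability times the mean depth. This is one reading of the table, not a statement of how the library computes these quantities.

```python
def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def stationary_discount_probability(a_01, a_11):
    """Long-run share of time spent in the discount state of a two-state Markov chain.

    a_01: P(discount next | no discount now); a_11: P(discount next | discount now).
    """
    return a_01 / (1.0 - a_11 + a_01)


# (alpha, beta) pairs copied from the table rows.
policies = {
    "I": ((60, 40), (60, 40), (5, 95)),
    "II": ((30, 70), (30, 70), (10, 90)),
    "III": ((60, 40), (60, 40), (25, 75)),
    "IV": ((30, 70), (30, 70), (50, 50)),
    "V": ((60, 40), (60, 40), (40, 60)),
}

effective_discount = {}
for name, (a_01_params, a_11_params, depth_params) in policies.items():
    state_prob = stationary_discount_probability(
        beta_mean(*a_01_params), beta_mean(*a_11_params)
    )
    depth = beta_mean(*depth_params)
    effective_discount[name] = state_prob * depth
    print(
        f"Policy {name}: state probability {state_prob:.0%}, "
        f"depth {depth:.0%}, effective discount {effective_discount[name]:.0%}"
    )
```

Under these assumptions Policies I and II share the same effective discount while differing in how often and how deeply products are discounted, and likewise for Policies III and IV.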
Configuration and Customization:
- The setup for the synthesizer in the scenario analysis builds upon the calibration step, utilizing the optimal values from synthetic_data_params.yaml. However, additional overrides are specified in synthetic_dataset.yaml to reflect our updated specification of the store utility equation.
- Policy Specification: The policies outlined in the table above are defined in ./cfg/scenario_analysis.yaml. For example, Policy I is expressed in the config as:
  scenarios:
    I: # high-frequency, low-discount policy
      synthetic_data:
        synthetic_data_setup:
          discount_depth_distribution:
            _target_: numpyro.distributions.Beta
            concentration1: 5 # alpha, as in the beta distribution
            concentration0: 95 # beta, as in the beta distribution
          discount_state_a_01:
            _target_: numpyro.distributions.Beta
            concentration1: 60
            concentration0: 40
          discount_state_a_11:
            _target_: numpyro.distributions.Beta
            concentration1: 60
            concentration0: 40
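To make the Policy I specification more concrete, the following standard-library sketch draws the three quantities from their Beta priors and simulates a weekly discount path for a single product. It assumes that discount_state_a_01 is the probability of moving from no discount to discount, that discount_state_a_11 is the probability of staying on discount, that the chain starts without a discount, and that the depth is fixed over time. The synthesizer itself uses numpyro and may differ on each of these points.

```python
import random


def simulate_discount_states(a_01, a_11, n_weeks, rng):
    """Simulate a two-state Markov chain: 1 = on discount, 0 = regular price."""
    state = 0  # assumption: start without a discount
    states = []
    for _ in range(n_weeks):
        p_discount = a_11 if state == 1 else a_01
        state = 1 if rng.random() < p_discount else 0
        states.append(state)
    return states


rng = random.Random(0)

# Policy I priors; betavariate(alpha, beta) matches (concentration1, concentration0).
a_01 = rng.betavariate(60, 40)
a_11 = rng.betavariate(60, 40)
depth = rng.betavariate(5, 95)

states = simulate_discount_states(a_01, a_11, n_weeks=53, rng=rng)
discounts = [depth * state for state in states]

print(f"a_01={a_01:.2f}, a_11={a_11:.2f}, depth={depth:.2%}")
print(f"weeks on discount: {sum(states)} of {len(states)}")
```

With both transition probabilities centered on 0.6, the product is discounted in roughly 60% of weeks over a long horizon, at a depth of around 5%.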
For support, questions, or feedback, please file an issue on our GitHub repository.