Paper: *DiffOP: Reinforcement Learning of Optimization-Based Control Policies via Implicit Policy Gradients*, accepted at AAAI 2026.
Experimental results are available at WandB DiffOP Results Link.
- Clone this repository:

  ```bash
  cd DiffOP
  ```

- Create a conda environment (recommended):

  ```bash
  conda create -n diffop python=3.9
  conda activate diffop
  ```

- Install the package:

  ```bash
  pip install -e .
  ```

- Install additional dependencies (if needed):

  ```bash
  pip install gymnasium wandb scipy pandas matplotlib
  ```

To run experiments on nonlinear control environments (Cartpole, Robotarm, Quadrotor):
```bash
./run_nonlinear_experiments.sh [max_parallel_jobs]
```

This script runs experiments on all three environments with multiple seeds. You can specify the maximum number of parallel jobs (default: 4).
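If you need to adapt the launcher, the bounded-parallelism pattern can be sketched in Python as follows. This is a sketch only: the actual internals of `run_nonlinear_experiments.sh` may differ, and the environment/seed lists and command layout here are assumptions.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def build_commands(envs, seeds):
    """Build one training command per (environment, seed) pair."""
    return [
        ["python", "train_diffop.py", "--env", env, "--seed", str(seed)]
        for env in envs
        for seed in seeds
    ]

def run_all(commands, max_parallel_jobs=4):
    """Run commands with at most max_parallel_jobs in flight at once."""
    with ThreadPoolExecutor(max_workers=max_parallel_jobs) as pool:
        return list(pool.map(subprocess.run, commands))

if __name__ == "__main__":
    # Assumed environment and seed lists; the shell script may use others.
    cmds = build_commands(["Cartpole-v0", "Robotarm-v0", "Quadrotor-v0"], range(3))
    run_all(cmds, max_parallel_jobs=4)
```

The thread pool caps concurrency the same way the `[max_parallel_jobs]` argument does: at most `max_workers` training runs execute at once, and the rest queue up.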
Example:

```bash
./run_nonlinear_experiments.sh 8  # Run with up to 8 parallel jobs
```

To run experiments on the Voltage-v0 environment:
```bash
./run_voltage_experiments.sh
```

You can also run individual training scripts directly:
```bash
cd diffop/experiments
python train_diffop.py --env Voltage-v0 --seed 0 --lr 0.5 --std 0.01 --horizon 6 --apply_horizon 1 --wandb_log
```

Available environments:

- Cartpole-v0
- Robotarm-v0
- Quadrotor-v0
- Voltage-v0
Key parameters:

- `--env`: Environment name
- `--seed`: Random seed
- `--lr`: Learning rate
- `--std`: Noise standard deviation
- `--horizon`: Planning horizon
- `--apply_horizon`: Horizon for applying actions
- `--wandb_log`: Enable WandB logging
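For reference, a parser accepting the flags above could look like the following sketch; the defaults and types used in `train_diffop.py` itself may differ.

```python
import argparse

def make_parser():
    """Argument parser mirroring the flags listed above (defaults are assumptions)."""
    parser = argparse.ArgumentParser(description="Train a DiffOP policy")
    parser.add_argument("--env", type=str, required=True, help="Environment name")
    parser.add_argument("--seed", type=int, default=0, help="Random seed")
    parser.add_argument("--lr", type=float, default=0.5, help="Learning rate")
    parser.add_argument("--std", type=float, default=0.01, help="Noise standard deviation")
    parser.add_argument("--horizon", type=int, default=6, help="Planning horizon")
    parser.add_argument("--apply_horizon", type=int, default=1, help="Horizon for applying actions")
    parser.add_argument("--wandb_log", action="store_true", help="Enable WandB logging")
    return parser

# Parse the same flags as the example command above.
args = make_parser().parse_args(
    ["--env", "Voltage-v0", "--seed", "0", "--lr", "0.5", "--wandb_log"]
)
```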
To visualize results, open the Jupyter notebook:
```bash
cd diffop/experiments
jupyter notebook visualize_results.ipynb
```

Make sure the `results/` folder contains the necessary data files for visualization.
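If you prefer to inspect results outside the notebook, a pandas-based sketch follows. The column names and CSV layout under `results/` are assumptions here (an inline stand-in file is used); adjust to the actual data files.

```python
import io
import pandas as pd

# Stand-in for a per-seed results file under results/ (layout assumed).
csv_text = """step,return,seed
0,10.0,0
1,12.5,0
0,9.0,1
1,13.5,1
"""

df = pd.read_csv(io.StringIO(csv_text))
# Average return across seeds at each training step.
mean_return = df.groupby("step")["return"].mean()
print(mean_return)
```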
