Reinforcement learning (RL) promises to solve long-horizon tasks even when training data contains only short fragments of behaviors. This capability, called stitching, is a crucial prerequisite for more general, foundational RL models. Conventional wisdom holds that only temporal difference (TD) methods can stitch fragments of experiences gathered during training. We show that while TD methods can stitch experiences in simple, low-dimensional settings, this behavior does not transfer to more complex, high-dimensional tasks. We also show that Monte Carlo (MC) methods, although still behind TD methods, can exhibit some stitching behavior. Furthermore, we find that increasing network capacity plays a critical role in closing the generalization gap, which is an encouraging direction as models grow larger in RL.
TD methods can stitch in low-dimensional tasks (left), but fail in higher-dimensional settings (right).
This repo uses uv. To install uv, follow the instructions here. To install all dependencies and create the virtual environment, run:
    uv sync

Note
We use wandb for experiment tracking by default and you may be prompted to log in when running your first experiment. If you do not want to use wandb, pass the flag --exp.mode disabled to skip wandb logging.
To run a simple training with CRL and the default environment configuration:
Warning
This repository is optimized for GPU. Running the command below without a capable GPU may be very slow.
    uv run src/train.py env:box-moving --exp.name test

The current version supports only the box-moving environment; specify env:box-moving for each experiment.
When wandb logging is enabled, experiment results (including environment and algorithm data) are logged to wandb. A short GIF of the agent's behavior is also recorded and stored in wandb.
A list of hyperparameters and options is available via:

    uv run src/train.py --help

Options are grouped by prefixes:

- exp. - General experimental settings (logging, seeds, experiment names, etc.). See ./src/config.py for details.
- env. - Environment settings (difficulty, goal/start distributions, number of boxes, grid size, etc.). See BoxMovingConfig in ./src/envs/block_moving/env_types.py.
- actor. - Algorithm settings (learning rates, batch sizes, network architectures, and algorithm choices). See ./src/impls/agents/__init__.py.
You can generate expert offline datasets with:
    uv run scripts/gather_expert_dataset.py --help

    uv run scripts/gather_expert_dataset.py \
        --output-path data/expert_default_6x6_6boxes_100traj.npy \
        --num-trajectories 100 \
        --level-generator default \
        --grid-size 6 \
        --number-of-boxes-min 6 \
        --number-of-boxes-max 6 \
        --number-of-moving-boxes-max 6

scripts/gather_expert_dataset.py supports parallel environment rollout via vmapped planning:

- --parallel-envs: number of environments collected in parallel.
- --fixed-length: stored trajectory length in transitions. Every trajectory in the saved dataset is padded to this exact length.
Example:
    uv run scripts/gather_expert_dataset.py \
        --output-path data/expert_default_6x6_6boxes_100traj_parallel10_fixed300.npy \
        --num-trajectories 100 \
        --level-generator default \
        --grid-size 6 \
        --number-of-boxes-min 6 \
        --number-of-boxes-max 6 \
        --number-of-moving-boxes-max 6 \
        --parallel-envs 10 \
        --fixed-length 300

The collector now prints detailed runtime progress by default:
- startup configuration
- JIT warmup time
- per-batch progress (accepted/skipped/success/elapsed)
- final summary
Useful controls:
- --log-every N to print progress every N batches (default: 1)
- --quiet to suppress progress logs
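As an illustration of what padding every trajectory to --fixed-length can look like, here is a minimal Python sketch. It is not the collector's code: the function name pad_to_fixed_length, the choice to repeat the last transition, the validity mask, and the error on over-long trajectories are all assumptions made for this example. Check scripts/gather_expert_dataset.py for the actual behavior.

```python
def pad_to_fixed_length(trajectory, fixed_length):
    """Pad a list of transitions to exactly `fixed_length` entries.

    Assumption for this sketch: padding repeats the last real transition,
    and a boolean mask marks which entries are real. The repository's
    collector may pad differently.
    """
    if not trajectory:
        raise ValueError("cannot pad an empty trajectory")
    if len(trajectory) > fixed_length:
        raise ValueError("trajectory is longer than fixed_length")
    n_pad = fixed_length - len(trajectory)
    padded = list(trajectory) + [trajectory[-1]] * n_pad
    mask = [True] * len(trajectory) + [False] * n_pad
    return padded, mask


# Hypothetical 3-step trajectory of (state, action) pairs, padded to length 5.
short = [("s0", "a0"), ("s1", "a1"), ("s2", "a2")]
padded, mask = pad_to_fixed_length(short, fixed_length=5)
print(len(padded), mask)
```

A fixed length keeps every stored trajectory the same shape, which is what makes batched (vmapped) collection and array-based storage straightforward.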
The Box Moving environment is a grid-world where an agent moves boxes to target locations. It supports different grid sizes, numbers of boxes, and difficulty levels. While simple conceptually, complexity grows rapidly with grid size and box count, making it well suited for testing stitching capabilities.
The environment supports two modes for sampling box and goal positions (set via --env.level_generator):
- default - Boxes and targets are spawned randomly on the grid.
- variable - Boxes and targets are spawned in grid corners. Under normal evaluation the box and goal corners are adjacent. If --exp.eval_special is passed, the algorithm is additionally evaluated with box and goal corners diagonally opposite; results from this mode are logged in wandb under the eval_special tab.
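The difference between adjacent and diagonally opposite corners in the variable mode can be sketched in a few lines of Python. This is only an illustration: the helper names corners and sample_corner_pair are hypothetical, and the sketch treats each corner as a single cell, whereas the environment in ./src/envs/block_moving/ may use corner regions or a different sampling scheme.

```python
import random


def corners(grid_size):
    """Return the four corner cells of a square grid, in clockwise order."""
    last = grid_size - 1
    return [(0, 0), (0, last), (last, last), (last, 0)]


def sample_corner_pair(grid_size, diagonal=False, rng=random):
    """Pick a box corner and a goal corner.

    With diagonal=False the goal corner is a neighbor of the box corner
    (one step around the grid); with diagonal=True it is the opposite corner.
    """
    cs = corners(grid_size)
    i = rng.randrange(4)
    offset = 2 if diagonal else rng.choice([1, 3])
    return cs[i], cs[(i + offset) % 4]


rng = random.Random(0)
adjacent_pair = sample_corner_pair(6, diagonal=False, rng=rng)
diagonal_pair = sample_corner_pair(6, diagonal=True, rng=rng)
print("adjacent:", adjacent_pair)
print("diagonal:", diagonal_pair)
```

Adjacent corners share a row or a column, while diagonal corners differ in both coordinates, so the diagonal evaluation requires covering a longer path than the adjacent one.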
We focus on goal-conditioned RL algorithms and, to isolate stitching behaviors, remove policy networks from tested algorithms. Actions are sampled directly from the Q-function via softmax sampling. The main algorithms include:
- Contrastive RL (CRL) — a Monte Carlo (MC) style algorithm running without rewards.
- C-Learning — a TD algorithm running without rewards.
- GCDQN — TD and MC variants, with rewards.
- GCIQL — TD and MC variants, with rewards.
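To make the MC/TD distinction in the list above concrete, the following tabular Python sketch shows why TD backups are conventionally associated with stitching. It is a toy under stated assumptions, not any of the algorithms above: the two-fragment dataset, the chain of states, and the helper names mc_values and td_values are invented for this example, and there is no function approximation, so it says nothing about the generalization effects studied in the paper.

```python
GAMMA = 0.9
GOAL = 4
# Two short fragments that share state 2; no single trajectory goes from 0 to 4.
TRAJECTORIES = [[0, 1, 2], [2, 3, 4]]


def mc_values(trajectories, goal, gamma):
    """Monte Carlo style: credit a state only if the goal appears later
    in the same trajectory, discounted by the number of steps to it."""
    values = {}
    for traj in trajectories:
        for i, state in enumerate(traj):
            value = 0.0
            for j in range(i, len(traj)):
                if traj[j] == goal:
                    value = gamma ** (j - i)
                    break
            values[state] = max(values.get(state, 0.0), value)
    return values


def td_values(trajectories, goal, gamma, sweeps=50):
    """TD style: bootstrap from the value of the next state, using
    individual transitions regardless of which trajectory they came from."""
    transitions = [(t[i], t[i + 1]) for t in trajectories for i in range(len(t) - 1)]
    values = {s: 0.0 for t in trajectories for s in t}
    values[goal] = 1.0
    for _ in range(sweeps):
        for state, next_state in transitions:
            if state != goal:
                values[state] = gamma * values[next_state]
    return values


mc = mc_values(TRAJECTORIES, GOAL, GAMMA)
td = td_values(TRAJECTORIES, GOAL, GAMMA)
print("MC value of state 0:", mc[0])
print("TD value of state 0:", td[0])
```

In this toy, the Monte Carlo estimate for state 0 stays at zero because the goal never appears in the fragment that contains state 0, while the TD backup propagates value through the shared state 2 and assigns state 0 a positive value.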
The experiments in the paper use clearn_search, crl_search, gciql_search, and gcdqn. These do NOT include an agent network. The repository also contains some other algorithms that were not thoroughly tested and may not be fully compatible with the latest code.
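Sampling actions directly from the Q-function via softmax, as described above, can be sketched as follows. This is a standard-library Python illustration rather than the repository's JAX code: the function names and the temperature parameter are assumptions for this example, and the actual agents in ./src/impls/agents/ may scale or parameterize the Q-values differently.

```python
import math
import random


def softmax(q_values, temperature=1.0):
    """Turn Q-values into action probabilities (numerically stable)."""
    q_max = max(q_values)
    exps = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(q_values, temperature=1.0, rng=random):
    """Sample an action index with probability proportional to exp(Q / temperature).

    `temperature` is a hypothetical knob for this sketch: lower values
    make the choice closer to greedy argmax.
    """
    probs = softmax(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


q = [1.0, 2.0, 0.5]  # made-up Q-values for three actions
probs = softmax(q)
action = sample_action(q, rng=random.Random(0))
print(probs, action)
```

Because actions come straight from the Q-values, no separate policy network has to be trained, which keeps the comparison focused on what the value function itself has learned.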
- OGBench — benchmark for offline goal-conditioned RL algorithms, which inspired parts of our code structure.
- JaxGCRL — online goal-conditioned RL benchmark with various algorithms implemented in JAX.
- Jumanji — collection of RL environments in JAX; Sokoban inspired aspects of our box-moving environment.
If you use this work, please cite:
@inproceedings{anonymous2026temporal,
title={Is Temporal Difference Learning the Gold Standard for Stitching in RL?},
author={Michał Bortkiewicz and Władysław Pałucki and Benjamin Eysenbach and Mateusz Ostaszewski},
year={2026},
url={https://michalbortkiewicz.github.io/golden-standard/}
}

Open an issue on GitHub or contact:
- Michał Bortkiewicz (michalbortkiewicz8@gmail.com)
- Władysław Pałucki (w.palucki@uw.edu.pl)

