Overview 📖

Reinforcement learning (RL) promises to solve long-horizon tasks even when training data contains only short fragments of behaviors. This capability, called stitching, is a crucial prerequisite for more general, foundational RL models. Conventional wisdom holds that only temporal difference (TD) methods can stitch fragments of experiences gathered during training. We show that while TD methods can stitch experiences in simple, low-dimensional settings, this behavior does not transfer to more complex, high-dimensional tasks. We also show that Monte Carlo (MC) methods, although still behind TD methods, can exhibit some stitching behavior. Furthermore, we find that increasing network capacity plays a critical role in closing the generalization gap, which is an encouraging direction as models grow larger in RL.

TD methods can stitch in low-dimensional tasks (left), but fail in higher-dimensional settings (right).

Installation & Setup 🔧

This repo uses uv. To install uv, follow the instructions in the official uv documentation. To install all dependencies and create the virtual environment, run:

uv sync

Note

We use wandb for experiment tracking by default and you may be prompted to log in when running your first experiment. If you do not want to use wandb, pass the flag --exp.mode disabled to skip wandb logging.

Running experiments 🔬

Warning

This repository is optimized for GPU. Running the command below without a capable GPU may be very slow.

To run a simple training job with CRL and the default environment configuration:

uv run src/train.py env:box-moving --exp.name test

The current version supports only the box-moving environment; specify env:box-moving for each experiment.

Wandb logging 📈

When wandb logging is enabled, experiment results (including environment and algorithm data) are logged to wandb. A short GIF of the agent's behavior is also recorded and stored in wandb.

Hyperparameters ⚙️

A list of hyperparameters and options is available via:

uv run src/train.py --help

Options are grouped by prefixes:

exp. - General experimental settings (logging, seeds, experiment names, etc.). See ./src/config.py for details.
env. - Environment settings (difficulty, goal/start distributions, number of boxes, grid size, etc.). See BoxMovingConfig in ./src/envs/block_moving/env_types.py.
actor. - Algorithm settings (learning rates, batch sizes, network architectures, and algorithm choices). See ./src/impls/agents/__init__.py.

Expert Dataset Collection 📦

You can generate expert offline datasets with:

uv run scripts/gather_expert_dataset.py --help

Basic example

uv run scripts/gather_expert_dataset.py \
  --output-path data/expert_default_6x6_6boxes_100traj.npy \
  --num-trajectories 100 \
  --level-generator default \
  --grid-size 6 \
  --number-of-boxes-min 6 \
  --number-of-boxes-max 6 \
  --number-of-moving-boxes-max 6

Parallel collection with equal trajectory length

scripts/gather_expert_dataset.py supports parallel environment rollout via vmapped planning:

--parallel-envs: number of environments collected in parallel.
--fixed-length: stored trajectory length in transitions. Every trajectory in the saved dataset is padded to this exact length.

Example:

uv run scripts/gather_expert_dataset.py \
  --output-path data/expert_default_6x6_6boxes_100traj_parallel10_fixed300.npy \
  --num-trajectories 100 \
  --level-generator default \
  --grid-size 6 \
  --number-of-boxes-min 6 \
  --number-of-boxes-max 6 \
  --number-of-moving-boxes-max 6 \
  --parallel-envs 10 \
  --fixed-length 300
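
To make the effect of --fixed-length concrete, here is a minimal sketch of padding variable-length trajectories to one common length so they stack into a dense array. The function name pad_to_fixed_length and the choice to repeat the final step are assumptions for illustration only; the collector's actual padding scheme and on-disk layout may differ.

```python
import numpy as np


def pad_to_fixed_length(traj: np.ndarray, fixed_length: int) -> np.ndarray:
    """Pad a (T, ...) trajectory array to exactly `fixed_length` steps.

    Assumption: padding repeats the final step. The real collector may
    pad differently (for example with zeros or an explicit mask).
    """
    num_steps = traj.shape[0]
    if num_steps == 0:
        raise ValueError("cannot pad an empty trajectory")
    if num_steps > fixed_length:
        raise ValueError(f"trajectory has {num_steps} steps, more than {fixed_length}")
    pad_count = fixed_length - num_steps
    # Repeat the last step `pad_count` times along the time axis.
    padding = np.repeat(traj[-1:], pad_count, axis=0)
    return np.concatenate([traj, padding], axis=0)


# Hypothetical trajectories of different lengths, each step a 2-d observation.
short = np.arange(6, dtype=np.float32).reshape(3, 2)
long = np.arange(10, dtype=np.float32).reshape(5, 2)

# After padding, equal-length trajectories stack into one dense array.
batch = np.stack([pad_to_fixed_length(t, 5) for t in (short, long)])
print(batch.shape)  # (2, 5, 2)
```

Equal-length trajectories are convenient with vmapped rollouts because every environment in a batch then produces an array of the same shape.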

Runtime verbosity

The collector prints detailed runtime progress by default:

startup configuration
JIT warmup time
per-batch progress (accepted/skipped/success/elapsed)
final summary

Useful controls:

--log-every N to print progress every N batches (default: 1)
--quiet to suppress progress logs

Environment 🕹️

The Box Moving environment is a grid-world where an agent moves boxes to target locations. It supports different grid sizes, numbers of boxes, and difficulty levels. While simple conceptually, complexity grows rapidly with grid size and box count, making it well suited for testing stitching capabilities.

The environment supports two modes for sampling box and goal positions (set via --env.level_generator):

default - Boxes and targets are spawned randomly on the grid.
variable - Boxes and targets are spawned in grid corners. Under normal evaluation the box and goal corners are adjacent. If --exp.eval_special is passed, the algorithm is additionally evaluated with box and goal corners diagonally opposite; results from this mode are logged in wandb under the eval_special tab.
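
The corner geometry of the variable mode can be illustrated with a small sketch. The helpers corner_cells and goal_corner are hypothetical and are not taken from the repository; the sketch only shows the difference between adjacent and diagonally opposite corners on a square grid, not how the environment actually samples levels.

```python
def corner_cells(grid_size: int) -> list[tuple[int, int]]:
    """Return the four corner cells of a square grid in clockwise order."""
    last = grid_size - 1
    return [(0, 0), (0, last), (last, last), (last, 0)]


def goal_corner(box_corner: int, special: bool) -> int:
    """Pick the goal corner index for a given box corner index.

    With clockwise ordering, index + 1 is an adjacent corner and
    index + 2 is the diagonally opposite one.
    """
    offset = 2 if special else 1
    return (box_corner + offset) % 4


corners = corner_cells(6)
box = 0
print(corners[box], corners[goal_corner(box, special=False)])  # (0, 0) (0, 5)
print(corners[box], corners[goal_corner(box, special=True)])   # (0, 0) (5, 5)
```

In this picture, a diagonal task can be covered by chaining two adjacent-corner moves, which is the kind of composition that stitching refers to.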

Supported algorithms 🧠

We focus on goal-conditioned RL algorithms and, to isolate stitching behaviors, remove policy networks from tested algorithms. Actions are sampled directly from the Q-function via softmax sampling. The main algorithms include:

Contrastive RL (CRL) — a Monte Carlo (MC) style algorithm running without rewards.
C-Learning — a TD algorithm running without rewards.
GCDQN — TD and MC variants, with rewards.
GCIQL — TD and MC variants, with rewards.

The paper's experiments use clearn_search, crl_search, gciql_search, and gcdqn. These do NOT include a policy network, in line with the setup described above. We also include some other algorithms that were not thoroughly tested and may not be fully compatible with the latest code.
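
As a rough illustration of sampling actions directly from a Q-function without a policy network, the sketch below draws an action with probability proportional to exp(Q / temperature). It uses NumPy rather than the repository's own code, and the function name sample_action, the temperature parameter, and the example Q-values are all assumptions made for this example.

```python
import numpy as np


def sample_action(q_values: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    """Sample an action index with probability proportional to exp(Q / temperature)."""
    logits = q_values / temperature
    # Subtract the max for numerical stability; this leaves the probabilities unchanged.
    logits = logits - logits.max()
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))


rng = np.random.default_rng(0)
# Hypothetical Q-values for four discrete actions in one state-goal pair.
q_values = np.array([1.0, 2.0, 0.5, 2.0])
actions = [sample_action(q_values, temperature=0.1, rng=rng) for _ in range(1000)]
counts = np.bincount(actions, minlength=4)
print(counts)
```

A low temperature concentrates nearly all samples on the highest-valued actions (here the two tied actions at indices 1 and 3), while a high temperature approaches uniform sampling.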

Also see 👀

OGBench — benchmark for offline goal-conditioned RL algorithms, which inspired parts of our code structure.
JaxGCRL — online goal-conditioned RL benchmark with various algorithms implemented in JAX.
Jumanji — collection of RL environments in JAX; Sokoban inspired aspects of our box-moving environment.

Citing 📄

If you use this work, please cite:

@inproceedings{anonymous2026temporal,
  title={Is Temporal Difference Learning the Gold Standard for Stitching in RL?},
  author={Michał Bortkiewicz and Władysław Pałucki and Benjamin Eysenbach and Mateusz Ostaszewski},
  year={2026},
  url={https://michalbortkiewicz.github.io/golden-standard/}
}

Questions or issues ❓

Open an issue on GitHub or contact:

Michał Bortkiewicz (michalbortkiewicz8@gmail.com)
Władysław Pałucki (w.palucki@uw.edu.pl)
