Remove legacy benchmark scripts (benchmark refactor, Part 5/5) by AntoineRichard · Pull Request #6206 · isaac-sim/IsaacLab

AntoineRichard · 2026-06-17T08:57:17Z

Description

Part 5 of 5 of the benchmark refactor series — removes the legacy benchmark scripts now superseded by the unified suite added in Parts 1–4.

Stacked on Part 4 (#6201). The diff against develop below also includes Parts 1–4 until they merge. For the incremental Part 5 changes only, view:
AntoineRichard/IsaacLab@antoiner/benchmark-play...antoiner/benchmark-cleanup

Series: Part 1/5 core (#6197) → Part 2/5 runtime + startup (#6198) → Part 3/5 training (#6199) → Part 4/5 play (#6201) → Part 5/5 cleanup (this PR).

Parts 1–4 add the unified runtime.py / startup.py / training.py / play.py suite alongside the legacy scripts, so nothing breaks while they merge. This final PR removes the legacy scripts once downstream consumers (OmniPerf ingestion, job runners) have migrated — kept as a draft; merge it last, on the consumers' signal.

Removes:

benchmark_non_rl.py → runtime.py
benchmark_startup.py → startup.py
benchmark_rsl_rl.py → training.py --rl_library rsl_rl
benchmark_rlgames.py → training.py --rl_library rl_games
run_non_rl_benchmarks.sh, run_physx_benchmarks.sh, run_training_benchmarks.sh — express their behavior with script args + presets=; run the PhysX micro-benchmarks under source/isaaclab_physx/benchmark/ directly.
scripts/benchmarks/utils.py (obsolete helper) and the orphaned test/test_training_metrics.py.

Also repoints benchmark_hydra_resolve.py at _common for get_backend_type, and adds the "Benchmark Scripts" section to the 3.0 migration guide documenting the full old → new command mapping (including play.py).

Backward-compatibility note (outputs): the unified scripts emit the same OmniPerf / JSON / Osmo / Summary KPI files as the legacy scripts — every pre-existing KPI row is byte-identical in name, value, and unit. The only difference is four additive peak rows contributed by the shared recorders in Part 1 (GPU [i] Memory Used peak, System Memory RSS/VMS/USS peak). Nothing is renamed, removed, or recomputed. Direct callers of the removed scripts must switch to the unified entry points per the migration guide.

Fixes # (n/a)

Type of change

Breaking change (removes the legacy benchmark entry-point scripts; a documented migration path to the unified scripts is provided)
This change requires a documentation update

Checklist

I have read and understood the contribution guidelines
I have run the pre-commit checks with ./isaaclab.sh --format
I have made corresponding changes to the documentation (3.0 migration guide "Benchmark Scripts")
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works (covered by Parts 1–4; this PR only removes superseded scripts)
I have added a changelog fragment under source/<pkg>/changelog.d/ for every touched package
I have added my name to the CONTRIBUTORS.md or my name already exists there

Introduce the capture, metrics, builders, stepping, profiling, and backend_descriptor submodules for assembling the schema-v1 benchmark bundles, add a schema output backend, and let BaseIsaacLabBenchmark emit several backends in one run via a new attach_bundle hook. Unit tests cover each submodule plus the schema backend and multi-backend finalize. Part 1 of a series splitting the oversized benchmark refactor (core -> runtime/startup -> training -> play).

Add backend-agnostic runtime.py (random-action stepping, emits a RuntimeBundle) and startup.py (cProfile startup-phase profiling, emits a StartupBundle), wired to develop's launch API (launch_simulation and add_launcher_args from isaaclab.app; preset tokens forwarded to Hydra without folding). Remove the legacy benchmark_non_rl.py and benchmark_startup.py scripts plus the run_non_rl_benchmarks.sh and run_physx_benchmarks.sh runner shells; repoint benchmark_hydra_resolve at _common.get_backend_type. Part 2 of the benchmark refactor series (core -> runtime/startup -> training -> play); stacked on Part 1 (isaac-sim#6197).

Add training.py dispatching over --rl_library {rsl_rl, rl_games, skrl, sb3}; each adapter runs real training under BenchmarkMonitor and emits a TrainingBundle via the shared core, with an optional success-metric early stop. Scripts use develop's launch API (launch_simulation from isaaclab.app; preset tokens forwarded without folding). Remove the legacy benchmark_rsl_rl.py / benchmark_rlgames.py scripts, the run_training_benchmarks.sh runner shell, and the obsolete utils.py helper. Part 3 of the benchmark refactor series (core -> runtime/startup -> training -> play); stacked on Parts 1-2 (isaac-sim#6197, isaac-sim#6198).

Introduce scripts/benchmarks/play.py, a --rl_library dispatcher mirroring training.py, plus the rsl_rl inference adapter scripts/benchmarks/rsl_rl/bench_play_rsl_rl.py. The adapter resolves a checkpoint via resolve_play_checkpoint, loads the policy the way the rsl_rl play script does, rolls it out under a BenchmarkMonitor using run_play_loop, and emits a PlayBundle.

Roll out a checkpointed skrl policy under a BenchmarkMonitor and emit a PlayBundle. The skrl env wrapper returns reward and done tensors shaped (num_envs, 1); reshape them to (num_envs,) in run_play_loop so the per-environment return accumulator broadcasts correctly across backends.

Roll out a checkpointed Stable-Baselines3 policy under a BenchmarkMonitor and emit a PlayBundle. The sb3 vec env returns NumPy reward/done arrays and a per-environment list of info dicts; coerce reward and dones onto the env device in run_play_loop so CPU NumPy returns do not clash with the on-device accumulators, and skip success extraction when the info value is not a dict.

- rl_games adapter: read obs_groups/concate_obs_groups from the agent cfg and pass them to RlGamesVecEnvWrapper, so tasks with asymmetric/non-default observation layouts feed the policy the same observation it was trained on. - _common: key the published-checkpoint lookup on the bare training-task name (drop any namespace prefix and the -Play suffix), matching the reinforcement_learning play scripts; add a unit-testable _published_task_name helper. - rl_games adapter: drop the inaccurate RNN-state-reset claim from the policy docstring.

Add PlayBundle to the attach_bundle and SchemaBundleFile.finalize bundle-type unions, and trim the over-verbose reshape comment in run_play_loop.

Remove the per-backend benchmark entry points (benchmark_non_rl.py, benchmark_startup.py, benchmark_rsl_rl.py, benchmark_rlgames.py), the run_*.sh runner shells, and the obsolete utils.py helper, all of which are replaced by the unified runtime.py / startup.py / training.py scripts added earlier in this series. Repoint benchmark_hydra_resolve.py at _common for get_backend_type and document the old->new command mapping in the 3.0 migration guide. This removal is split into its own PR so the unified scripts (Parts 1-4) can merge while the legacy scripts remain in place, letting downstream consumers (OmniPerf ingestion, job runners) migrate at their own pace before the old entry points disappear.

greptile-apps · 2026-06-17T09:50:07Z

Greptile Summary

This PR is Part 5/5 of the benchmark refactor series, removing legacy per-backend benchmark entry scripts (benchmark_non_rl.py, benchmark_rsl_rl.py, benchmark_rlgames.py, benchmark_startup.py), their shell wrappers, and the obsolete utils.py / test_training_metrics.py. It also delivers the new unified benchmark library (capture.py, metrics.py, builders.py, stepping.py, backend_descriptor.py, profiling.py, schema.PlayBundle) and the per-framework adapter scripts alongside a comprehensive 3.0 migration guide section.

Multi-backend support was added to BaseIsaacLabBenchmark: backend_type now accepts a comma-separated list, and _finalize_impl iterates over all selected backends, suffixing filenames by backend key when more than one is active.
New schema types (PlayBundle, SchemaBundleFile) and pure-stdlib helpers (capture, metrics, builders, stepping, profiling) complete the typed-bundle pipeline introduced in Parts 1–4.
Legacy removal: four benchmark entry scripts, three shell wrappers, utils.py, and test_training_metrics.py are deleted; benchmark_hydra_resolve.py is repointed to _common.get_backend_type.

Confidence Score: 4/5

The PR is safe to merge. All changes are benchmark-tooling only with no impact on the simulation runtime or training logic.

The refactor is thorough, well-tested, and the migration guide is complete. The _published_task_name issue with .replace is real but low-risk given Isaac Lab task naming conventions. The at_iteration_boundary false-positive at step_count=0 is not triggered by any existing caller. The output_file_path property silently returns a nonexistent path in multi-backend mode, which could surprise a future caller.

scripts/benchmarks/_common.py (removesuffix fix), source/isaaclab/isaaclab/test/benchmark/metrics.py (at_iteration_boundary guard), source/isaaclab/isaaclab/test/benchmark/benchmark_core.py (output_file_path multi-backend correction)

Important Files Changed

Filename	Overview
scripts/benchmarks/_common.py	New shared helpers for all unified entry scripts; _published_task_name uses str.replace instead of removesuffix, which strips ALL occurrences of -Play rather than only the suffix.
source/isaaclab/isaaclab/test/benchmark/benchmark_core.py	Multi-backend refactor is clean; output_file_path property now returns a stale single-file path when more than one backend is active, though none of the new scripts use it.
source/isaaclab/isaaclab/test/benchmark/metrics.py	New Isaac-Sim-free metrics helpers; SuccessRateTracker.at_iteration_boundary returns True before any steps have been recorded (step_count=0).
source/isaaclab/isaaclab/test/benchmark/backends.py	New SchemaBundleFile backend correctly no-ops add_metrics and serializes the typed bundle on finalize; lazy import of serialize module avoids circular imports.
source/isaaclab/isaaclab/test/benchmark/capture.py	Pure-stdlib capture helpers correctly map recorder data to schema dataclasses; graceful defaults when recorders are absent.
source/isaaclab/isaaclab/test/benchmark/stepping.py	Backend-agnostic play loop handles 4-tuple and 5-tuple Gym step APIs, tensor/numpy coercion, and per-env episode accumulation; lazy torch import is clean.
scripts/benchmarks/rl_games/bench_rl_games.py	RL-Games training adapter; correctly flushes and closes the TensorBoard writer before parsing logs, handles early-stop via RlGamesEarlyStopObserver.
scripts/benchmarks/rsl_rl/bench_rsl_rl.py	RSL-RL training adapter; derives per-iteration timing from collection_time + learning_time TensorBoard scalars, attaches bundle before _finalize_impl.
source/isaaclab/isaaclab/test/benchmark/profiling.py	Accesses the internal CPython pstats.Stats.stats dict for caller-graph traversal; comment acknowledges the risk and specifies the migration path if the internal API changes.
scripts/benchmarks/early_stop.py	Refactored to depend on SuccessRateTracker from the new metrics module; rl_games and rsl_rl wrappers cleanly separated and well-documented.
docs/source/migration/migrating_to_isaaclab_3-0.rst	New Benchmark Scripts migration section is accurate and comprehensive with a complete command-mapping table and five-step migration checklist.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[training.py / play.py / runtime.py / startup.py] -->|--rl_library dispatch| B[bench_rsl_rl / bench_rl_games / bench_sb3 / bench_skrl]
    A -->|no --rl_library| C[runtime.py direct loop]

    B --> D[BaseIsaacLabBenchmark]
    C --> D

    D -->|backend_type list| E{Multi-backend?}
    E -->|single| F[One backend: filename = prefix.json]
    E -->|multi| G[N backends: filename = prefix_key.json each]

    F --> H[SchemaBundleFile / OmniPerfKPIFile / JSONFileMetrics / OsmoKPIFile / SummaryMetrics]
    G --> H

    H -->|schema backend| I[write_bundle_file → typed JSON]
    H -->|other backends| J[flat KPI JSON / stdout summary]

    D --> K[BenchmarkMonitor background thread]
    K -->|interval| L[update_manual_recorders]
    L --> M[GPUInfoRecorder / CPUInfoRecorder / MemoryInfoRecorder / VersionInfoRecorder]
    M --> D

    D -->|attach_bundle| N[RuntimeBundle / TrainingBundle / PlayBundle / StartupBundle]
    N -->|schema finalize| I

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[training.py / play.py / runtime.py / startup.py] -->|--rl_library dispatch| B[bench_rsl_rl / bench_rl_games / bench_sb3 / bench_skrl]
    A -->|no --rl_library| C[runtime.py direct loop]

    B --> D[BaseIsaacLabBenchmark]
    C --> D

    D -->|backend_type list| E{Multi-backend?}
    E -->|single| F[One backend: filename = prefix.json]
    E -->|multi| G[N backends: filename = prefix_key.json each]

    F --> H[SchemaBundleFile / OmniPerfKPIFile / JSONFileMetrics / OsmoKPIFile / SummaryMetrics]
    G --> H

    H -->|schema backend| I[write_bundle_file → typed JSON]
    H -->|other backends| J[flat KPI JSON / stdout summary]

    D --> K[BenchmarkMonitor background thread]
    K -->|interval| L[update_manual_recorders]
    L --> M[GPUInfoRecorder / CPUInfoRecorder / MemoryInfoRecorder / VersionInfoRecorder]
    M --> D

    D -->|attach_bundle| N[RuntimeBundle / TrainingBundle / PlayBundle / StartupBundle]
    N -->|schema finalize| I

Comments Outside Diff (1)

source/isaaclab/isaaclab/test/benchmark/benchmark_core.py, line 205-208 (link)

output_file_path returns {prefix}.json regardless of the number of active backends. When more than one backend is selected _finalize_impl writes {prefix}_{key}.json files (e.g. benchmark_training_..._schema.json and benchmark_training_..._omniperf.json), so this property now points to a file that does not exist. Any caller that uses output_file_path to locate the schema output after a multi-backend run will silently reference the wrong path.

_{Reviews (1): Last reviewed commit: "Remove legacy benchmark scripts supersed..." | Re-trigger Greptile}

greptile-apps · 2026-06-17T09:50:10Z

+    Returns:
+        The bare training-task name to query for a published checkpoint.
+    """
+    return task.split(":")[-1].replace("-Play", "")


str.replace("-Play", "") removes every non-overlapping occurrence of the substring, not just the trailing suffix. A task whose name contains "-Play" as an infix — e.g. "Isaac-Play-Cube-Direct" — would be incorrectly transformed to "Isaac-Cube-Direct". Use str.removesuffix (Python 3.9+, already in scope throughout this codebase) to strip only the trailing occurrence.

Suggested change

return task.split(":")[-1].replace("-Play", "")

return task.split(":")[-1].removesuffix("-Play")

greptile-apps · 2026-06-17T09:50:11Z

+        Integrations that call :meth:`record_step` more or fewer times per env step will
+        break iteration accounting.
+        """
+        return self.num_steps_per_env > 0 and self._step_count % self.num_steps_per_env == 0


_step_count starts at 0, so 0 % num_steps_per_env == 0 is always True for any positive num_steps_per_env. A freshly constructed SuccessRateTracker incorrectly reports at_iteration_boundary = True before a single step has been recorded. The wrapper guards against this in practice (it only checks after record_step), but the property is misleading and would cause an off-by-one if ever checked eagerly. Excluding _step_count == 0 fixes the edge case.

Suggested change

return self.num_steps_per_env > 0 and self._step_count % self.num_steps_per_env == 0

return self.num_steps_per_env > 0 and self._step_count > 0 and self._step_count % self.num_steps_per_env == 0

greptile-apps · 2026-06-17T09:50:13Z

+    out: list[str] = []
+    for tok in cli_backend.split(","):
+        tok = tok.strip()
+        if not tok:
+            continue
+        canon = get_backend_type(tok)
+        if canon not in out:
+            out.append(canon)
+    return out or ["omniperf"]


Silent fallback for unknown backend tokens

get_backend_types maps any unrecognised token through get_backend_type, which silently falls back to "omniperf". If a user passes --benchmark_backend schma (typo) they receive the omniperf backend with no diagnostic. Adding a logger.warning for unrecognised tokens would make misconfigurations detectable without changing existing behaviour for callers that rely on the fallback.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

AntoineRichard added 17 commits June 17, 2026 09:33

Add PlayBundle schema type for the play benchmark

9a468cf

Add build_play_bundle builder

43c0855

Add run_play_loop policy-rollout stepping helper

727c23c

Add resolve_play_checkpoint helper to _common

0df0733

Annotate run_play_loop return type

8229034

Add rl_games play benchmark adapter

455b4b7

Add gated generate-then-play smoke tests for the play benchmark

ee33769

Document the play benchmark and add changelog

e9da0ce

Polish play-benchmark docs and types from review

f6262af

Add PlayBundle to the attach_bundle and SchemaBundleFile.finalize bundle-type unions, and trim the over-verbose reshape comment in run_play_loop.

github-actions Bot added documentation Improvements or additions to documentation isaac-lab Related to Isaac Lab team labels Jun 17, 2026

AntoineRichard marked this pull request as ready for review June 17, 2026 09:41

AntoineRichard requested review from Mayankm96, jtigue-bdai, kellyguo11 and ooctipus as code owners June 17, 2026 09:41

greptile-apps Bot reviewed Jun 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove legacy benchmark scripts (benchmark refactor, Part 5/5)#6206

Remove legacy benchmark scripts (benchmark refactor, Part 5/5)#6206
AntoineRichard wants to merge 17 commits into
isaac-sim:developfrom
AntoineRichard:antoiner/benchmark-cleanup

AntoineRichard commented Jun 17, 2026

Uh oh!

greptile-apps Bot commented Jun 17, 2026 •

edited

Loading

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot Jun 17, 2026

Uh oh!

greptile-apps Bot Jun 17, 2026

Uh oh!

greptile-apps Bot Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	return task.split(":")[-1].replace("-Play", "")
	return task.split(":")[-1].removesuffix("-Play")

	return self.num_steps_per_env > 0 and self._step_count % self.num_steps_per_env == 0
	return self.num_steps_per_env > 0 and self._step_count > 0 and self._step_count % self.num_steps_per_env == 0

Conversation

AntoineRichard commented Jun 17, 2026

Description

Type of change

Checklist

Uh oh!

greptile-apps Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented Jun 17, 2026 •

edited

Loading