Skip to content

Remove legacy benchmark scripts (benchmark refactor, Part 5/5)#6206

Open
AntoineRichard wants to merge 17 commits into
isaac-sim:developfrom
AntoineRichard:antoiner/benchmark-cleanup
Open

Remove legacy benchmark scripts (benchmark refactor, Part 5/5)#6206
AntoineRichard wants to merge 17 commits into
isaac-sim:developfrom
AntoineRichard:antoiner/benchmark-cleanup

Conversation

@AntoineRichard

Copy link
Copy Markdown
Collaborator

Description

Part 5 of 5 of the benchmark refactor series — removes the legacy benchmark scripts now superseded by the unified suite added in Parts 1–4.

Stacked on Part 4 (#6201). The diff against develop below also includes Parts 1–4 until they merge. For the incremental Part 5 changes only, view:
AntoineRichard/IsaacLab@antoiner/benchmark-play...antoiner/benchmark-cleanup

Series: Part 1/5 core (#6197) → Part 2/5 runtime + startup (#6198) → Part 3/5 training (#6199) → Part 4/5 play (#6201) → Part 5/5 cleanup (this PR).

Parts 1–4 add the unified runtime.py / startup.py / training.py / play.py suite alongside the legacy scripts, so nothing breaks while they merge. This final PR removes the legacy scripts once downstream consumers (OmniPerf ingestion, job runners) have migrated — kept as a draft; merge it last, on the consumers' signal.

Removes:

  • benchmark_non_rl.pyruntime.py
  • benchmark_startup.pystartup.py
  • benchmark_rsl_rl.pytraining.py --rl_library rsl_rl
  • benchmark_rlgames.pytraining.py --rl_library rl_games
  • run_non_rl_benchmarks.sh, run_physx_benchmarks.sh, run_training_benchmarks.sh — express their behavior with script args + presets=; run the PhysX micro-benchmarks under source/isaaclab_physx/benchmark/ directly.
  • scripts/benchmarks/utils.py (obsolete helper) and the orphaned test/test_training_metrics.py.

Also repoints benchmark_hydra_resolve.py at _common for get_backend_type, and adds the "Benchmark Scripts" section to the 3.0 migration guide documenting the full old → new command mapping (including play.py).

Backward-compatibility note (outputs): the unified scripts emit the same OmniPerf / JSON / Osmo / Summary KPI files as the legacy scripts — every pre-existing KPI row is byte-identical in name, value, and unit. The only difference is four additive peak rows contributed by the shared recorders in Part 1 (GPU [i] Memory Used peak, System Memory RSS/VMS/USS peak). Nothing is renamed, removed, or recomputed. Direct callers of the removed scripts must switch to the unified entry points per the migration guide.

Fixes # (n/a)

Type of change

  • Breaking change (removes the legacy benchmark entry-point scripts; a documented migration path to the unified scripts is provided)
  • This change requires a documentation update

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation (3.0 migration guide "Benchmark Scripts")
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works (covered by Parts 1–4; this PR only removes superseded scripts)
  • I have added a changelog fragment under source/<pkg>/changelog.d/ for every touched package
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

Introduce the capture, metrics, builders, stepping, profiling, and
backend_descriptor submodules for assembling the schema-v1 benchmark
bundles, add a schema output backend, and let BaseIsaacLabBenchmark emit
several backends in one run via a new attach_bundle hook. Unit tests
cover each submodule plus the schema backend and multi-backend finalize.

Part 1 of a series splitting the oversized benchmark refactor
(core -> runtime/startup -> training -> play).
Add backend-agnostic runtime.py (random-action stepping, emits a
RuntimeBundle) and startup.py (cProfile startup-phase profiling, emits a
StartupBundle), wired to develop's launch API (launch_simulation and
add_launcher_args from isaaclab.app; preset tokens forwarded to Hydra
without folding). Remove the legacy benchmark_non_rl.py and
benchmark_startup.py scripts plus the run_non_rl_benchmarks.sh and
run_physx_benchmarks.sh runner shells; repoint benchmark_hydra_resolve
at _common.get_backend_type.

Part 2 of the benchmark refactor series (core -> runtime/startup ->
training -> play); stacked on Part 1 (isaac-sim#6197).
Add training.py dispatching over --rl_library {rsl_rl, rl_games, skrl,
sb3}; each adapter runs real training under BenchmarkMonitor and emits a
TrainingBundle via the shared core, with an optional success-metric early
stop. Scripts use develop's launch API (launch_simulation from
isaaclab.app; preset tokens forwarded without folding). Remove the legacy
benchmark_rsl_rl.py / benchmark_rlgames.py scripts, the
run_training_benchmarks.sh runner shell, and the obsolete utils.py helper.

Part 3 of the benchmark refactor series (core -> runtime/startup ->
training -> play); stacked on Parts 1-2 (isaac-sim#6197, isaac-sim#6198).
Introduce scripts/benchmarks/play.py, a --rl_library dispatcher mirroring
training.py, plus the rsl_rl inference adapter
scripts/benchmarks/rsl_rl/bench_play_rsl_rl.py. The adapter resolves a
checkpoint via resolve_play_checkpoint, loads the policy the way the
rsl_rl play script does, rolls it out under a BenchmarkMonitor using
run_play_loop, and emits a PlayBundle.
Roll out a checkpointed skrl policy under a BenchmarkMonitor and emit a
PlayBundle. The skrl env wrapper returns reward and done tensors shaped
(num_envs, 1); reshape them to (num_envs,) in run_play_loop so the
per-environment return accumulator broadcasts correctly across backends.
Roll out a checkpointed Stable-Baselines3 policy under a BenchmarkMonitor
and emit a PlayBundle. The sb3 vec env returns NumPy reward/done arrays and
a per-environment list of info dicts; coerce reward and dones onto the env
device in run_play_loop so CPU NumPy returns do not clash with the on-device
accumulators, and skip success extraction when the info value is not a dict.
- rl_games adapter: read obs_groups/concate_obs_groups from the agent cfg and pass
  them to RlGamesVecEnvWrapper, so tasks with asymmetric/non-default observation
  layouts feed the policy the same observation it was trained on.
- _common: key the published-checkpoint lookup on the bare training-task name (drop
  any namespace prefix and the -Play suffix), matching the reinforcement_learning
  play scripts; add a unit-testable _published_task_name helper.
- rl_games adapter: drop the inaccurate RNN-state-reset claim from the policy docstring.
Add PlayBundle to the attach_bundle and SchemaBundleFile.finalize
bundle-type unions, and trim the over-verbose reshape comment in
run_play_loop.
Remove the per-backend benchmark entry points (benchmark_non_rl.py,
benchmark_startup.py, benchmark_rsl_rl.py, benchmark_rlgames.py), the
run_*.sh runner shells, and the obsolete utils.py helper, all of which
are replaced by the unified runtime.py / startup.py / training.py scripts
added earlier in this series. Repoint benchmark_hydra_resolve.py at
_common for get_backend_type and document the old->new command mapping in
the 3.0 migration guide.

This removal is split into its own PR so the unified scripts (Parts 1-4)
can merge while the legacy scripts remain in place, letting downstream
consumers (OmniPerf ingestion, job runners) migrate at their own pace
before the old entry points disappear.
@github-actions github-actions Bot added documentation Improvements or additions to documentation isaac-lab Related to Isaac Lab team labels Jun 17, 2026
@AntoineRichard AntoineRichard marked this pull request as ready for review June 17, 2026 09:41
@greptile-apps

greptile-apps Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR is Part 5/5 of the benchmark refactor series, removing legacy per-backend benchmark entry scripts (benchmark_non_rl.py, benchmark_rsl_rl.py, benchmark_rlgames.py, benchmark_startup.py), their shell wrappers, and the obsolete utils.py / test_training_metrics.py. It also delivers the new unified benchmark library (capture.py, metrics.py, builders.py, stepping.py, backend_descriptor.py, profiling.py, schema.PlayBundle) and the per-framework adapter scripts alongside a comprehensive 3.0 migration guide section.

  • Multi-backend support was added to BaseIsaacLabBenchmark: backend_type now accepts a comma-separated list, and _finalize_impl iterates over all selected backends, suffixing filenames by backend key when more than one is active.
  • New schema types (PlayBundle, SchemaBundleFile) and pure-stdlib helpers (capture, metrics, builders, stepping, profiling) complete the typed-bundle pipeline introduced in Parts 1–4.
  • Legacy removal: four benchmark entry scripts, three shell wrappers, utils.py, and test_training_metrics.py are deleted; benchmark_hydra_resolve.py is repointed to _common.get_backend_type.

Confidence Score: 4/5

The PR is safe to merge. All changes are benchmark-tooling only with no impact on the simulation runtime or training logic.

The refactor is thorough, well-tested, and the migration guide is complete. The _published_task_name issue with .replace is real but low-risk given Isaac Lab task naming conventions. The at_iteration_boundary false-positive at step_count=0 is not triggered by any existing caller. The output_file_path property silently returns a nonexistent path in multi-backend mode, which could surprise a future caller.

scripts/benchmarks/_common.py (removesuffix fix), source/isaaclab/isaaclab/test/benchmark/metrics.py (at_iteration_boundary guard), source/isaaclab/isaaclab/test/benchmark/benchmark_core.py (output_file_path multi-backend correction)

Important Files Changed

Filename Overview
scripts/benchmarks/_common.py New shared helpers for all unified entry scripts; _published_task_name uses str.replace instead of removesuffix, which strips ALL occurrences of -Play rather than only the suffix.
source/isaaclab/isaaclab/test/benchmark/benchmark_core.py Multi-backend refactor is clean; output_file_path property now returns a stale single-file path when more than one backend is active, though none of the new scripts use it.
source/isaaclab/isaaclab/test/benchmark/metrics.py New Isaac-Sim-free metrics helpers; SuccessRateTracker.at_iteration_boundary returns True before any steps have been recorded (step_count=0).
source/isaaclab/isaaclab/test/benchmark/backends.py New SchemaBundleFile backend correctly no-ops add_metrics and serializes the typed bundle on finalize; lazy import of serialize module avoids circular imports.
source/isaaclab/isaaclab/test/benchmark/capture.py Pure-stdlib capture helpers correctly map recorder data to schema dataclasses; graceful defaults when recorders are absent.
source/isaaclab/isaaclab/test/benchmark/stepping.py Backend-agnostic play loop handles 4-tuple and 5-tuple Gym step APIs, tensor/numpy coercion, and per-env episode accumulation; lazy torch import is clean.
scripts/benchmarks/rl_games/bench_rl_games.py RL-Games training adapter; correctly flushes and closes the TensorBoard writer before parsing logs, handles early-stop via RlGamesEarlyStopObserver.
scripts/benchmarks/rsl_rl/bench_rsl_rl.py RSL-RL training adapter; derives per-iteration timing from collection_time + learning_time TensorBoard scalars, attaches bundle before _finalize_impl.
source/isaaclab/isaaclab/test/benchmark/profiling.py Accesses the internal CPython pstats.Stats.stats dict for caller-graph traversal; comment acknowledges the risk and specifies the migration path if the internal API changes.
scripts/benchmarks/early_stop.py Refactored to depend on SuccessRateTracker from the new metrics module; rl_games and rsl_rl wrappers cleanly separated and well-documented.
docs/source/migration/migrating_to_isaaclab_3-0.rst New Benchmark Scripts migration section is accurate and comprehensive with a complete command-mapping table and five-step migration checklist.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[training.py / play.py / runtime.py / startup.py] -->|--rl_library dispatch| B[bench_rsl_rl / bench_rl_games / bench_sb3 / bench_skrl]
    A -->|no --rl_library| C[runtime.py direct loop]

    B --> D[BaseIsaacLabBenchmark]
    C --> D

    D -->|backend_type list| E{Multi-backend?}
    E -->|single| F[One backend: filename = prefix.json]
    E -->|multi| G[N backends: filename = prefix_key.json each]

    F --> H[SchemaBundleFile / OmniPerfKPIFile / JSONFileMetrics / OsmoKPIFile / SummaryMetrics]
    G --> H

    H -->|schema backend| I[write_bundle_file → typed JSON]
    H -->|other backends| J[flat KPI JSON / stdout summary]

    D --> K[BenchmarkMonitor background thread]
    K -->|interval| L[update_manual_recorders]
    L --> M[GPUInfoRecorder / CPUInfoRecorder / MemoryInfoRecorder / VersionInfoRecorder]
    M --> D

    D -->|attach_bundle| N[RuntimeBundle / TrainingBundle / PlayBundle / StartupBundle]
    N -->|schema finalize| I
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[training.py / play.py / runtime.py / startup.py] -->|--rl_library dispatch| B[bench_rsl_rl / bench_rl_games / bench_sb3 / bench_skrl]
    A -->|no --rl_library| C[runtime.py direct loop]

    B --> D[BaseIsaacLabBenchmark]
    C --> D

    D -->|backend_type list| E{Multi-backend?}
    E -->|single| F[One backend: filename = prefix.json]
    E -->|multi| G[N backends: filename = prefix_key.json each]

    F --> H[SchemaBundleFile / OmniPerfKPIFile / JSONFileMetrics / OsmoKPIFile / SummaryMetrics]
    G --> H

    H -->|schema backend| I[write_bundle_file → typed JSON]
    H -->|other backends| J[flat KPI JSON / stdout summary]

    D --> K[BenchmarkMonitor background thread]
    K -->|interval| L[update_manual_recorders]
    L --> M[GPUInfoRecorder / CPUInfoRecorder / MemoryInfoRecorder / VersionInfoRecorder]
    M --> D

    D -->|attach_bundle| N[RuntimeBundle / TrainingBundle / PlayBundle / StartupBundle]
    N -->|schema finalize| I
Loading

Comments Outside Diff (1)

  1. source/isaaclab/isaaclab/test/benchmark/benchmark_core.py, line 205-208 (link)

    P2 output_file_path returns {prefix}.json regardless of the number of active backends. When more than one backend is selected _finalize_impl writes {prefix}_{key}.json files (e.g. benchmark_training_..._schema.json and benchmark_training_..._omniperf.json), so this property now points to a file that does not exist. Any caller that uses output_file_path to locate the schema output after a multi-backend run will silently reference the wrong path.

Reviews (1): Last reviewed commit: "Remove legacy benchmark scripts supersed..." | Re-trigger Greptile

Returns:
The bare training-task name to query for a published checkpoint.
"""
return task.split(":")[-1].replace("-Play", "")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 str.replace("-Play", "") removes every non-overlapping occurrence of the substring, not just the trailing suffix. A task whose name contains "-Play" as an infix — e.g. "Isaac-Play-Cube-Direct" — would be incorrectly transformed to "Isaac-Cube-Direct". Use str.removesuffix (Python 3.9+, already in scope throughout this codebase) to strip only the trailing occurrence.

Suggested change
return task.split(":")[-1].replace("-Play", "")
return task.split(":")[-1].removesuffix("-Play")

Integrations that call :meth:`record_step` more or fewer times per env step will
break iteration accounting.
"""
return self.num_steps_per_env > 0 and self._step_count % self.num_steps_per_env == 0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 _step_count starts at 0, so 0 % num_steps_per_env == 0 is always True for any positive num_steps_per_env. A freshly constructed SuccessRateTracker incorrectly reports at_iteration_boundary = True before a single step has been recorded. The wrapper guards against this in practice (it only checks after record_step), but the property is misleading and would cause an off-by-one if ever checked eagerly. Excluding _step_count == 0 fixes the edge case.

Suggested change
return self.num_steps_per_env > 0 and self._step_count % self.num_steps_per_env == 0
return self.num_steps_per_env > 0 and self._step_count > 0 and self._step_count % self.num_steps_per_env == 0

Comment on lines +56 to +64
out: list[str] = []
for tok in cli_backend.split(","):
tok = tok.strip()
if not tok:
continue
canon = get_backend_type(tok)
if canon not in out:
out.append(canon)
return out or ["omniperf"]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Silent fallback for unknown backend tokens

get_backend_types maps any unrecognised token through get_backend_type, which silently falls back to "omniperf". If a user passes --benchmark_backend schma (typo) they receive the omniperf backend with no diagnostic. Adding a logger.warning for unrecognised tokens would make misconfigurations detectable without changing existing behaviour for callers that rely on the fallback.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant