single_sbatch routing: rework agent-driven (DSE/live-RL) handling and mixed-job guard ordering #937

Description

@rutayan-nv

Context

handle_dry_run_and_run (src/cloudai/cli/handlers.py) routes scenarios as:

agent_driven = [_is_dse_or_live_rl(tr) for tr in test_scenario.test_runs]
if args.single_sbatch or not any(agent_driven):  # in this mode cases are unrolled using grid search
    handle_non_dse_job(runner, args)
    return 0
if all(agent_driven):
    return handle_dse_job(runner, args)
logging.error("Mixing agent-driven (DSE / live-RL) and plain jobs is not allowed.")
return 1

The args.single_sbatch short-circuit is evaluated before the mixed-job guard. This ordering predates the live-RL work (the check originally keyed on tr.is_dse_job) and is benign for DSE, because SingleSbatchRunner deliberately grid-unrolls DSE param spaces (unroll_dse / all_trs / handle_dse). It does, however, have two rough edges:

  1. Mixed agent-driven + plain under --single-sbatch bypasses the "mixing not allowed" error and silently routes everything through handle_non_dse_job.
  2. SingleSbatchRunner has no live-RL path; it only branches on tr.is_dse_job. A live_rl_mode job under --single-sbatch would therefore silently run once as a static job, which is wrong because live-RL needs the in-process GymServer loop, not a batch grid.

A narrow stopgap guard was added for case (2) (error on live_rl_mode + single_sbatch) with a TODO pointing at this issue. This issue tracks the proper fix.

Proposed fix (dedicated PR)

  • Reject mixed agent-driven + plain before the single_sbatch short-circuit.
  • Make single_sbatch semantics explicit per agent type: DSE -> grid-unroll (current behavior), live-RL -> unsupported/error (or a live-RL-aware single-sbatch path if ever needed).
  • Add regression tests covering single_sbatch=True for DSE, live-RL, plain, and mixed scenarios.
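
The proposed ordering can be sketched as a pure decision function. This is only an illustration of the intended guard order, not the actual cloudai code: `Kind`, `Route`, and `route` are hypothetical names introduced here, and the real handler works on test runs and an `args` namespace rather than enums.

```python
from enum import Enum
from typing import Sequence


class Kind(Enum):
    """Hypothetical classification of a test run (not a cloudai type)."""
    PLAIN = "plain"
    DSE = "dse"
    LIVE_RL = "live_rl"


class Route(Enum):
    """Hypothetical routing outcomes, named after the handlers in this issue."""
    NON_DSE = "handle_non_dse_job"
    DSE = "handle_dse_job"
    ERROR = "error"


def route(kinds: Sequence[Kind], single_sbatch: bool) -> Route:
    agent_driven = [k is not Kind.PLAIN for k in kinds]

    # 1. Mixed guard first, so --single-sbatch can no longer bypass it.
    if any(agent_driven) and not all(agent_driven):
        return Route.ERROR

    # 2. Explicit per-agent semantics: live-RL has no single-sbatch path.
    if single_sbatch and Kind.LIVE_RL in kinds:
        return Route.ERROR

    # 3. DSE under single_sbatch is grid-unrolled by the non-DSE handler;
    #    plain scenarios always take this path.
    if single_sbatch or not any(agent_driven):
        return Route.NON_DSE

    # 4. Remaining case: all agent-driven, no single_sbatch.
    return Route.DSE
```

Keeping the decision separate from the handlers would let the regression tests cover the DSE, live-RL, plain, and mixed cases without constructing a runner, though whether that split suits handlers.py is a design choice for the dedicated PR.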

Files

  • src/cloudai/cli/handlers.py (handle_dry_run_and_run, _is_dse_or_live_rl)
  • src/cloudai/systems/slurm/single_sbatch_runner.py (all_trs, handle_dse)
  • tests/test_handlers.py (_run_routing + routing tests)
