Context
handle_dry_run_and_run (src/cloudai/cli/handlers.py) routes scenarios as:
    agent_driven = [_is_dse_or_live_rl(tr) for tr in test_scenario.test_runs]
    if args.single_sbatch or not any(agent_driven):  # in this mode cases are unrolled using grid search
        handle_non_dse_job(runner, args)
        return 0
    if all(agent_driven):
        return handle_dse_job(runner, args)
    logging.error("Mixing agent-driven (DSE / live-RL) and plain jobs is not allowed.")
    return 1
The args.single_sbatch short-circuit is evaluated before the mixed-job guard. This ordering predates the live-RL work (it keyed on tr.is_dse_job originally) and is benign for DSE, because SingleSbatchRunner deliberately grid-unrolls DSE param spaces (unroll_dse / all_trs / handle_dse). But it has two rough edges:
1. Mixed agent-driven + plain under --single-sbatch bypasses the "mixing not allowed" error and silently routes everything through handle_non_dse_job.
2. SingleSbatchRunner has no live-RL path; it only branches on tr.is_dse_job. A live_rl_mode job under --single-sbatch would silently run once as a static job, because live-RL needs the in-process GymServer loop, not a batch grid.
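Both rough edges can be reproduced with a stand-alone model of the routing order quoted above. This is only a sketch: FakeTestRun, is_agent_driven and route_current are hypothetical stand-ins, the live_rl_mode attribute on a test run is an assumption, and the model describes the ordering before the stopgap guard. It returns a label instead of calling the real handlers.

```python
from dataclasses import dataclass


@dataclass
class FakeTestRun:
    """Hypothetical stand-in exposing only the flags the router looks at."""
    is_dse_job: bool = False
    live_rl_mode: bool = False  # assumed attribute name


def is_agent_driven(tr: FakeTestRun) -> bool:
    # Assumed to mirror _is_dse_or_live_rl: true for DSE or live-RL runs.
    return tr.is_dse_job or tr.live_rl_mode


def route_current(test_runs: list[FakeTestRun], single_sbatch: bool) -> str:
    """Model of the quoted ordering; returns the name of the branch taken."""
    agent_driven = [is_agent_driven(tr) for tr in test_runs]
    if single_sbatch or not any(agent_driven):
        return "non_dse"  # short-circuit runs before the mixed-job guard
    if all(agent_driven):
        return "dse"
    return "error"


mixed = [FakeTestRun(is_dse_job=True), FakeTestRun()]
live_rl = [FakeTestRun(live_rl_mode=True)]

# Without --single-sbatch the mixed scenario is rejected as intended.
print(route_current(mixed, single_sbatch=False))
# With --single-sbatch the guard is never reached (rough edge 1).
print(route_current(mixed, single_sbatch=True))
# A live-RL run is routed as a plain job (rough edge 2).
print(route_current(live_rl, single_sbatch=True))
```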
A narrow stopgap guard was added for case (2) (error on live_rl_mode + single_sbatch) with a TODO pointing at this issue. This issue tracks the proper fix.
Proposed fix (dedicated PR)
- Reject mixed agent-driven + plain before the single_sbatch short-circuit.
- Make single_sbatch semantics explicit per agent type: DSE -> grid-unroll (current behavior), live-RL -> unsupported/error (or a live-RL-aware single-sbatch path if ever needed).
- Add regression tests covering single_sbatch=True for DSE, live-RL, plain, and mixed scenarios.
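One possible shape for the reordered routing, together with the regression matrix, is sketched below. It is not the actual patch: StubRun, Route and route_proposed are hypothetical names, the live_rl_mode attribute is an assumption, and the function returns a decision rather than invoking handle_non_dse_job or handle_dse_job. It takes the "unsupported/error" option for live-RL under single_sbatch.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    NON_DSE = "handle_non_dse_job"
    DSE = "handle_dse_job"
    ERROR = "error"


@dataclass
class StubRun:
    """Hypothetical stand-in for a test run."""
    is_dse_job: bool = False
    live_rl_mode: bool = False  # assumed attribute name


def route_proposed(test_runs: list[StubRun], single_sbatch: bool) -> Route:
    agent_driven = [tr.is_dse_job or tr.live_rl_mode for tr in test_runs]
    # 1. The mixed-job guard now runs before any single_sbatch handling.
    if any(agent_driven) and not all(agent_driven):
        return Route.ERROR
    # 2. single_sbatch semantics are explicit per agent type.
    if single_sbatch:
        if any(tr.live_rl_mode for tr in test_runs):
            return Route.ERROR  # live-RL needs the in-process loop
        return Route.NON_DSE  # plain, or DSE grid-unrolled by the runner
    # 3. Without single_sbatch the routing is unchanged.
    return Route.DSE if any(agent_driven) else Route.NON_DSE


plain = [StubRun()]
dse = [StubRun(is_dse_job=True)]
live_rl = [StubRun(live_rl_mode=True)]
mixed = plain + dse

# Regression matrix for single_sbatch=True: (scenario, expected route).
CASES = [
    (dse, Route.NON_DSE),
    (live_rl, Route.ERROR),
    (plain, Route.NON_DSE),
    (mixed, Route.ERROR),
]

for runs, expected in CASES:
    assert route_proposed(runs, single_sbatch=True) is expected
```

In the real test module the same matrix could be expressed as parametrized cases driven through the existing _run_routing helper, whose signature is not shown here.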
Files
src/cloudai/cli/handlers.py (handle_dry_run_and_run, _is_dse_or_live_rl)
src/cloudai/systems/slurm/single_sbatch_runner.py (all_trs, handle_dse)
tests/test_handlers.py (_run_routing + routing tests)