
DSE: single-element list dims ([1]) aren't scalarized before command-gen (no central invariant) #946

Description

@rutayan-nv

Summary

A DSE dim authored as a single-element list (e.g. p_pp = [1]) can reach command generation still as a list, producing an invalid CLI arg (--p-pp [1]). Nothing in core guarantees cmd_args is fully resolved to scalars before a workload stringifies it. #938 patches this inside the aiconfigurator command-gen, but the gap is generic.

Mechanism

  • Sweepable dims are typed Union[int, List[int]] (list = explore, scalar = fixed).
  • param_space / is_dse_job classify by isinstance(value, list), which is length-agnostic, so a single-element [1] is treated as an action dim even though it carries no real choice.
  • apply_params_set overlays only the keys present in the agent's action; it does not iterate param_space. So a degenerate dim is collapsed to a scalar only if the active agent happens to emit it:
    • grid_search: emits every dim (cartesian product) → always collapses [1] to 1.
    • GymnasiumAdapter (RL): explicitly splits tunable (len > 1) vs fixed (len == 1) and injects the fixed scalars on every step → also collapses.
    • Any other agent/path is under no obligation to, so cmd_args.p_pp stays [1].
  • Command-gen then does str(value) → --p-pp [1], which simple_predictor.py (int("[1]")) rejects.
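
The failure path described above can be reproduced in isolation. The sketch below is a minimal stand-in, not the cloudai code: param_space, apply_params_set, and build_flag are simplified hypothetical versions that only mirror the behavior described in this issue, cmd_args is modeled as a plain dict, and p_tp is an invented second dim used for illustration.

```python
from typing import Dict, List, Union

Dim = Union[int, List[int]]


def param_space(cmd_args: Dict[str, Dim]) -> Dict[str, List[int]]:
    """Length-agnostic classification: any list counts as a sweepable dim."""
    return {k: v for k, v in cmd_args.items() if isinstance(v, list)}


def apply_params_set(cmd_args: Dict[str, Dim], action: Dict[str, int]) -> Dict[str, Dim]:
    """Action-driven overlay: only keys present in the action are replaced."""
    resolved = dict(cmd_args)
    resolved.update(action)
    return resolved


def build_flag(name: str, value: Dim) -> str:
    """Stand-in for command-gen, which stringifies the value as-is."""
    return f"--{name.replace('_', '-')} {value}"


# p_tp is a hypothetical dim with a real choice; p_pp is the degenerate one.
cmd_args: Dict[str, Dim] = {"p_tp": [1, 2, 4], "p_pp": [1]}

# A grid-style agent emits every dim, so [1] is overlaid with 1.
grid_resolved = apply_params_set(cmd_args, {"p_tp": 2, "p_pp": 1})

# An agent that only emits dims with a real choice leaves p_pp as [1].
partial_resolved = apply_params_set(cmd_args, {"p_tp": 2})

grid_flag = build_flag("p_pp", grid_resolved["p_pp"])
leaked_flag = build_flag("p_pp", partial_resolved["p_pp"])

# The predictor-side parse then fails on the leaked value.
try:
    int(leaked_flag.split(" ", 1)[1])
    parse_failed = False
except ValueError:
    parse_failed = True
```

Both dims land in the search space because classification ignores list length, yet only the agent that happens to emit p_pp produces a parseable flag.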

Why it surfaced now (not RL-specific)

  • Latent for a long time: grid_search was the only DSE agent and it always collapses, and fixed dims used to be authored as scalars (p_pp = 1).
  • The auto-generated agent configs (bo, ga, nvopt, rl) now emit fixed dims uniformly as [1], and non-grid agents don't guarantee overlay — so the latent gap became reachable. It is config/agent-convention driven, not specific to RL.

Impact

  • Any workload whose cmd_args field is typed Union[scalar, list] and stringified into a CLI flag is exposed; aiconfigurator is just the first.
  • Failure is a hard crash at predictor parse time, not a silent wrong result.

Root cause

No single normalization point owns "resolve every dim to a concrete scalar before command-gen." The responsibility is implicitly spread across each agent, and apply_params_set is action-driven rather than param_space-driven. A single-element list is semantically identical to its scalar to a human, but not at the CLI boundary (str(1) vs str([1])).

Suggested fix

Lift the fixed-dim collapse into core so every agent/path benefits, instead of per-workload band-aids:

  • Option A: in param_space, treat len == 1 lists as fixed (exclude from the search space) and materialize them onto cmd_args; or
  • Option B: have apply_params_set (or a dedicated post-resolve step) always scalarize degenerate dims regardless of which keys the agent's action carries.
  • GymnasiumAdapter._fixed_params already implements exactly this tunable/fixed split — it is the reference behavior to hoist into core.
  • Keep the command-gen guard from #938 (fix(aiconfig): unwrap single-value list dims in standalone command-gen) as defense-in-depth: it should fail loud on a genuine multi-element leak, which means an unresolved sweep escaped the agent.
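
As a rough illustration of the options above, here is a minimal sketch of a central resolve step. It is not the existing cloudai API: split_dims and resolve_cmd_args are hypothetical names, cmd_args is modeled as a plain dict, and the sketch assumes every list-valued field is a DSE dim (real workloads may have list-typed args that are not sweeps).

```python
from typing import Any, Dict, List, Tuple


def split_dims(cmd_args: Dict[str, Any]) -> Tuple[Dict[str, List[Any]], Dict[str, Any]]:
    """Option A flavor: only lists with more than one value are tunable."""
    tunable: Dict[str, List[Any]] = {}
    fixed: Dict[str, Any] = {}
    for key, value in cmd_args.items():
        if isinstance(value, list) and len(value) > 1:
            tunable[key] = value
        elif isinstance(value, list) and len(value) == 1:
            fixed[key] = value[0]  # degenerate dim: no real choice
    return tunable, fixed


def resolve_cmd_args(cmd_args: Dict[str, Any], action: Dict[str, Any]) -> Dict[str, Any]:
    """Option B flavor: scalarize degenerate dims regardless of the action's keys."""
    _, fixed = split_dims(cmd_args)
    # Fixed scalars are applied first so an explicit action value still wins.
    resolved = {**cmd_args, **fixed, **action}
    # Fail loud if a list survives: an unresolved sweep escaped the agent.
    leaked = sorted(k for k, v in resolved.items() if isinstance(v, list))
    if leaked:
        raise ValueError(f"unresolved DSE dims before command-gen: {leaked}")
    return resolved


# p_tp and steps are hypothetical fields used only for this example.
example = {"p_tp": [1, 2, 4], "p_pp": [1], "steps": 10}
resolved = resolve_cmd_args(example, {"p_tp": 2})
```

With a step like this in core, an agent that emits only p_tp still yields p_pp as a scalar, while an action that omits p_tp raises instead of reaching the CLI as a list.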

Severity

Medium — hard crash, workload-reachable, currently mitigated only for aiconfigurator (#938).

Code: src/cloudai/_core/test_scenario.py (param_space, apply_params_set); src/cloudai/configurator/gymnasium_adapter.py (reference impl); src/cloudai/workloads/aiconfig/standalone_command_gen_strategy.py (current local fix, #938).
