Truncate oversized NODELIST/SCHEDNODES/EXC_NODES in squeue serialization #164

Merged: mitthu merged 1 commit into facebookresearch:main from mitthu:export-D109782894 on Jun 26, 2026

@mitthu mitthu commented Jun 26, 2026

Summary:

Context/Motivation

The GCM gcm-slurm-job-monitor on fair-sc is in a crash loop (~20x/hour for 24+ hours)
because a single job's squeue record serializes to 1,246,781 bytes, exceeding the 1MB
Scribe chunk_size limit. This causes chunk_by_json_size() to raise a ValueError, which
kills the entire collection cycle, resulting in ~51% data loss in fair_job_data.
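
The failure mode described above can be illustrated with a minimal sketch of a size-bounded chunker. This is not the GCM `chunk_by_json_size()` implementation: the function name `chunk_records_by_json_size`, its signature, and the error message are assumptions made for illustration. It only shows why one oversized record is enough to abort the whole batch.

```python
import json
from typing import Any, Iterable, Iterator


def chunk_records_by_json_size(
    records: Iterable[dict[str, Any]], max_bytes: int
) -> Iterator[list[dict[str, Any]]]:
    """Yield lists of records whose JSON array encoding fits in max_bytes.

    Hypothetical stand-in for illustration, not GCM's chunk_by_json_size().
    """
    chunk: list[dict[str, Any]] = []
    chunk_bytes = 2  # the enclosing "[" and "]"
    for record in records:
        record_bytes = len(json.dumps(record).encode("utf-8"))
        if record_bytes + 2 > max_bytes:
            # A single record that cannot fit in any chunk aborts the
            # iteration, so every record after it is lost as well.
            raise ValueError(
                f"record of {record_bytes} bytes exceeds limit of {max_bytes}"
            )
        # json.dumps separates list items with ", " (2 bytes).
        extra = record_bytes + (2 if chunk else 0)
        if chunk_bytes + extra > max_bytes:
            yield chunk
            chunk, chunk_bytes = [], 2
            extra = record_bytes
        chunk.append(record)
        chunk_bytes += extra
    if chunk:
        yield chunk
```

Because the error is raised from inside the generator, a caller that does not catch it loses the records that follow the oversized one, which is consistent with the partial data loss reported above.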

The root cause is that the NODELIST parser expands Slurm range notation into individual
node names without any cap. For very large jobs, this expansion produces entries that
exceed the Scribe message size limit when JSON-serialized.
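
To make the size blow-up concrete, here is a simplified sketch of range expansion. The helper `expand_nodelist` is hypothetical and handles only a bare hostname or one prefix with a single bracket group; the real GCM parser and full Slurm hostlist syntax are more general.

```python
import json
import re


def expand_nodelist(expr: str) -> list[str]:
    """Expand a simplified hostlist such as "node[001-003,007]".

    Illustrative only: one prefix, one bracket group, no suffix.
    """
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", expr)
    if match is None:
        return [expr]  # bare hostname, nothing to expand
    prefix, body = match.groups()
    nodes: list[str] = []
    for part in body.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo  # a single index such as "007"
        width = len(lo)  # preserve zero padding
        nodes.extend(
            f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)
        )
    return nodes


compact = "node[00001-50000]"
expanded = expand_nodelist(compact)
# The compact form is a few bytes; the expanded JSON grows linearly
# with the number of nodes.
print(len(compact), len(expanded), len(json.dumps(expanded)))
```

A short range expression therefore turns into a list whose serialized size scales with the job's node count, which is how a single record can cross the limit.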

This diff

  • Adds a _truncated_nodelist() wrapper around the nodelist parser that caps the expanded
    list at 1000 entries (≈30KB, well within the 1MB limit)
  • Appends "..." as a sentinel marker when truncation occurs
  • Logs a warning with the original count and first/last node names
  • Applies the truncation to NODELIST, EXC_NODES, and SCHEDNODES fields in JobData
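
The behavior in the list above could look roughly like the following sketch. It is not the code from this diff: the name `truncated_nodelist`, its signature (taking the parser as an argument), and the choice to append the sentinel after the 1000 kept entries are assumptions. Only the cap of 1000, the "..." marker, and the warning contents come from the summary.

```python
import json
import logging
from typing import Callable

logger = logging.getLogger(__name__)

MAX_NODES = 1000  # cap stated in the summary
TRUNCATION_MARKER = "..."  # sentinel stated in the summary


def truncated_nodelist(
    parse: Callable[[str], list[str]],
    raw: str,
    max_nodes: int = MAX_NODES,
) -> list[str]:
    """Parse raw with the given parser and cap the result at max_nodes.

    Hypothetical wrapper; the real _truncated_nodelist() may differ.
    """
    nodes = parse(raw)
    if len(nodes) <= max_nodes:
        return nodes
    logger.warning(
        "Truncating node list from %d to %d entries (first=%s, last=%s)",
        len(nodes),
        max_nodes,
        nodes[0],
        nodes[-1],
    )
    # Assumption: the marker is added on top of the kept entries.
    return nodes[:max_nodes] + [TRUNCATION_MARKER]


def split_on_commas(raw: str) -> list[str]:
    """Toy parser used only for this example."""
    return raw.split(",")


raw = ",".join(f"node{i:05d}" for i in range(5000))
capped = truncated_nodelist(split_on_commas, raw)
# The capped list serializes to far less than the 1MB limit.
assert len(json.dumps(capped)) < 1_000_000
```

Consumers of the field would need to treat a trailing "..." as "list incomplete" rather than as a node name.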

Differential Revision: D109782894

@mitthu mitthu requested review from calebho and giongto35 as code owners June 26, 2026 02:04
@meta-cla meta-cla Bot added the cla signed label Jun 26, 2026
@github-actions

CI Commands

The following CI workflows run automatically on every push and pull request:

| Workflow | What it runs |
| --- | --- |
| GPU Cluster Monitoring Python CI | lint, tests, typecheck, format, deb build, pyoxidizer builds |
| Go packages CI | shelper tests, format, lint |

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

| Command | Description | Requires approval? |
| --- | --- | --- |
| /metaci tests | Runs Meta internal integration tests (pytest) | Yes: a maintainer must trigger the command and approve the deployment request |
| /metaci integration tests | Same as above (alias) | Yes |

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

meta-codesync Bot commented Jun 26, 2026

@mitthu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109782894.

@mitthu mitthu merged commit dc3af8e into facebookresearch:main Jun 26, 2026
40 of 41 checks passed
@mitthu mitthu deleted the export-D109782894 branch June 26, 2026 03:26