Skip to content

Drop non-finite (NaN/Inf) values before ODS write#165

Merged
mitthu merged 1 commit into
facebookresearch:mainfrom
mitthu:export-D109949701
Jun 27, 2026
Merged

Drop non-finite (NaN/Inf) values before ODS write#165
mitthu merged 1 commit into
facebookresearch:mainfrom
mitthu:export-D109949701

Conversation

@mitthu

@mitthu mitthu commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Summary:
The gcm slurm_monitor ODS exporter has been failing to publish on all updated FAIR/shared clusters since the June 18 GCM rollout (S678528). Every collection cycle the write to https://graph.facebook.com/v21.0/ods_metrics is rejected with 400 (#100) param datapoints must be an array, so no fcm.* metrics land in ODS. fcm.total_gpus_up and the rest of the cluster/GPU-availability metrics have been flat since 2026-06-22, breaking the v3_gpu_availability detector and the FAIR cluster-health pages.

Root cause: a metric value computed by slurm_monitor can be non-finite (e.g. a mean/variance over zero samples => 0/0 = NaN). get_payload filtered values by isinstance(value, (int, float)), but NaN/Inf are floats and pass the check. json.dumps then serializes them as the bare tokens NaN/Infinity, which are invalid JSON, so the Graph API rejects the entire batch -- dropping every datapoint in the request, not just the offending one. That is why all fcm.* keys vanish at once.

Reproduced live against the deployed token: a minimal datapoint and a 10k-datapoint (~1.3MB) batch both return 200, a None value returns 200, but a single NaN value returns exactly the observed 400 (#100) param datapoints must be an array.

Fix: in get_payload, skip non-finite values (logging how many were dropped) and serialize with allow_nan=False as a belt-and-suspenders so we can never emit invalid JSON again. Valid metrics now publish even when one metric goes NaN. This restores fcm.* ODS publishing.

The separate scribe_category argument is missing assertion (the sdiag scribe write on clusters without sdiag_scribe_category) is tracked independently and does not affect ODS metrics.

Reviewed By: xman1979

Differential Revision: D109949701

Summary:
The `gcm slurm_monitor` ODS exporter has been failing to publish on all updated FAIR/shared clusters since the June 18 GCM rollout (S678528). Every collection cycle the write to https://graph.facebook.com/v21.0/ods_metrics is rejected with `400 (facebookresearch#100) param datapoints must be an array`, so no `fcm.*` metrics land in ODS. `fcm.total_gpus_up` and the rest of the cluster/GPU-availability metrics have been flat since 2026-06-22, breaking the v3_gpu_availability detector and the FAIR cluster-health pages.

Root cause: a metric value computed by slurm_monitor can be non-finite (e.g. a mean/variance over zero samples => 0/0 = NaN). `get_payload` filtered values by `isinstance(value, (int, float))`, but NaN/Inf are floats and pass the check. `json.dumps` then serializes them as the bare tokens `NaN`/`Infinity`, which are invalid JSON, so the Graph API rejects the entire batch -- dropping every datapoint in the request, not just the offending one. That is why all `fcm.*` keys vanish at once.

Reproduced live against the deployed token: a minimal datapoint and a 10k-datapoint (~1.3MB) batch both return 200, a `None` value returns 200, but a single NaN value returns exactly the observed `400 (facebookresearch#100) param datapoints must be an array`.

Fix: in `get_payload`, skip non-finite values (logging how many were dropped) and serialize with `allow_nan=False` as a belt-and-suspenders so we can never emit invalid JSON again. Valid metrics now publish even when one metric goes NaN. This restores `fcm.*` ODS publishing.

The separate `scribe_category argument is missing` assertion (the sdiag scribe write on clusters without `sdiag_scribe_category`) is tracked independently and does not affect ODS metrics.

Reviewed By: xman1979

Differential Revision: D109949701
@mitthu mitthu requested review from calebho and giongto35 as code owners June 27, 2026 21:44
@meta-cla meta-cla Bot added the cla signed label Jun 27, 2026
@github-actions

Copy link
Copy Markdown

CI Commands

The following CI workflows run automatically on every push and pull request:

Workflow What it runs
GPU Cluster Monitoring Python CI lint, tests, typecheck, format, deb build, pyoxidizer builds
Go packages CI shelper tests, format, lint

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

Command Description Requires approval?
/metaci tests Runs Meta internal integration tests (pytest) Yes — a maintainer must trigger the command and approve the deployment request
/metaci integration tests Same as above (alias) Yes

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

@meta-codesync

meta-codesync Bot commented Jun 27, 2026

Copy link
Copy Markdown

@mitthu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109949701.

@meta-codesync

meta-codesync Bot commented Jun 27, 2026

Copy link
Copy Markdown

@mitthu has imported this pull request. If you are a Meta employee, you can view this in D109949701.

@Yash0270 Yash0270 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@xman1979 xman1979 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just fyi, I see Roman made a similar fix here: D109949701

@mitthu

mitthu commented Jun 27, 2026

Copy link
Copy Markdown
Contributor Author

just fyi, I see Roman made a similar fix here: D109949701

Same diff that I just published via Github.

@mitthu mitthu merged commit 1f35526 into facebookresearch:main Jun 27, 2026
38 of 43 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants