Drop non-finite (NaN/Inf) values before ODS write #165
Summary: The `gcm slurm_monitor` ODS exporter has been failing to publish on all updated FAIR/shared clusters since the June 18 GCM rollout (S678528). Every collection cycle the write to https://graph.facebook.com/v21.0/ods_metrics is rejected with `400 (#100) param datapoints must be an array`, so no `fcm.*` metrics land in ODS. `fcm.total_gpus_up` and the rest of the cluster/GPU-availability metrics have been flat since 2026-06-22, breaking the v3_gpu_availability detector and the FAIR cluster-health pages.

Root cause: a metric value computed by slurm_monitor can be non-finite (e.g. a mean/variance over zero samples => 0/0 = NaN). `get_payload` filtered values by `isinstance(value, (int, float))`, but NaN/Inf are floats and pass the check. `json.dumps` then serializes them as the bare tokens `NaN`/`Infinity`, which are invalid JSON, so the Graph API rejects the entire batch -- dropping every datapoint in the request, not just the offending one. That is why all `fcm.*` keys vanish at once.

Reproduced live against the deployed token: a minimal datapoint and a 10k-datapoint (~1.3MB) batch both return 200, a `None` value returns 200, but a single NaN value returns exactly the observed `400 (#100) param datapoints must be an array`.

Fix: in `get_payload`, skip non-finite values (logging how many were dropped) and serialize with `allow_nan=False` as a belt-and-suspenders so we can never emit invalid JSON again. Valid metrics now publish even when one metric goes NaN. This restores `fcm.*` ODS publishing.

The separate `scribe_category argument is missing` assertion (the sdiag scribe write on clusters without `sdiag_scribe_category`) is tracked independently and does not affect ODS metrics.

Reviewed By: xman1979

Differential Revision: D109949701
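For readers without access to the diff, the root cause and the fix described in the summary can be sketched roughly as follows. This is a minimal illustration, not the actual `get_payload` code: the function name `build_datapoints`, its arguments, and the `entity`/`key`/`value` field names are assumptions made for the example. Only `math.isfinite` and the `allow_nan` parameter of `json.dumps` are standard-library behavior.

```python
import json
import logging
import math

logger = logging.getLogger(__name__)


def build_datapoints(metrics, entity):
    """Hypothetical stand-in for get_payload: serialize only finite numbers.

    The datapoint field names here are illustrative assumptions, not the
    real ODS schema.
    """
    datapoints = []
    dropped = 0
    for key, value in metrics.items():
        # Same type filter as described in the summary: non-numeric values
        # (including None) are skipped.
        if not isinstance(value, (int, float)):
            continue
        # NaN and +/-Inf are floats, so they pass the isinstance check above
        # and must be rejected explicitly.
        if isinstance(value, float) and not math.isfinite(value):
            dropped += 1
            continue
        datapoints.append({"entity": entity, "key": key, "value": value})

    if dropped:
        logger.warning("Dropped %d non-finite metric value(s)", dropped)

    # allow_nan=False makes json.dumps raise ValueError instead of emitting
    # the bare tokens NaN/Infinity, which are not valid JSON.
    return json.dumps(datapoints, allow_nan=False)


# By default json.dumps emits a bare NaN token, which strict parsers reject.
invalid = json.dumps({"value": float("nan")})  # '{"value": NaN}'

# With the filter in place, one bad metric no longer poisons the batch.
payload = build_datapoints(
    {"fcm.total_gpus_up": 8, "fcm.example_mean": float("nan")},
    entity="example_cluster",
)
```

Filtering before serialization is what keeps the valid datapoints publishing; `allow_nan=False` on its own would only turn the server-side 400 into a local `ValueError` for the whole batch.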
@mitthu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109949701.

@mitthu has imported this pull request. If you are a Meta employee, you can view this in D109949701.
xman1979 left a comment
just fyi, I see Roman made a similar fix here: D109949701
Same diff that I just published via Github.