Skip to content

Silent-failure ODS counters (#5850)#5850

Open
adityas-meta wants to merge 1 commit into
pytorch:mainfrom
adityas-meta:export-D107811590
Open

Silent-failure ODS counters (#5850)#5850
adityas-meta wants to merge 1 commit into
pytorch:mainfrom
adityas-meta:export-D107811590

Conversation

@adityas-meta

@adityas-meta adityas-meta commented Jun 8, 2026

Copy link
Copy Markdown

Summary:

X-link: https://github.com/facebookresearch/FBGEMM/pull/2768

Add OBC counters for 3 silent failure modes in the Raw Embedding Streaming (RES) path, closing the must-have bar of T273802725 (subtask of T269497764, RES code-side observability this half). Today these failure sites only emit XLOG(ERR) / LOG(ERROR) — nothing aggregates, so on-call cannot trigger off them or query them in ODS without tail -f on individual trainer hosts.

Counters added (all via TrainingPsOdsLogger::bumpKey, OBC category raw_embedding_streaming):

Counter Site
res.fail.shard_size_mismatch raw_embedding_streamer.cpp — both the outer indices.size(0) != weights.size(0) guard and the per-shard weights_masked.size(0) != rows_in_shard guard in tensor_stream()
res.fail.set_embeddings_rpc raw_embedding_streamer.cpp — wraps co_await res_client->co_setEmbeddings(req) in try/catch; bumps then re-throws to preserve trainer-thread termination
res.fail.write_sparse_throw TrainingPsHandler.cpp catch (const std::exception& e) block at the writeSparseListOfTensors site in co_stream_tensors

Backend (addresses review: use OBC, not fb303): all three route through the existing TrainingPsOdsLogger OBC client. The handler already owned an ods_logger_; the fbgemm-side RawEmbeddingStreamer now holds a TrainingPsOdsLogger member (forward-declared in the OSS-mirrored header, constructed only when streaming is enabled). OBC reaches ODS through the host-level agent with no per-process fb303 export/scrape config, so this revision deletes the fb303 ResFailCounters.h helper (and its call_once / addStatExportType / FOLLY_EXPORT machinery) entirely.

Counters bump AT the failure site (inside catch / right before continue) so they reflect what actually happened — applying the correctness lesson from D104076551 V4 (don't preemptively count attempts at the top of the function). Per-TBE-per-rank granularity is fine: OBC aggregates per (entity, key), so per-TBE bumps land on the same per-host key rather than separate rows.

Existing log lines preserved (additive). Existing throw-on-failure semantics preserved — the co_setEmbeddings catch re-throws after bumping so trainer-thread termination is unchanged.

Out of scope (intentional, per T269497764 boundary):

  • MPSC weights_to_stream_queue_ near-full counter — queue is unbounded; deferred to the queue-depth gauge work in T273802756.
  • Detectors / alerts on these counters — downstream consumer work.

Differential Revision: D107811590

@meta-cla meta-cla Bot added the cla signed label Jun 8, 2026
@adityas-meta adityas-meta force-pushed the export-D107811590 branch 3 times, most recently from 8900338 to 316d62e Compare June 15, 2026 14:09
Summary:

X-link: facebookresearch/FBGEMM#2768

Add OBC counters for 3 silent failure modes in the Raw Embedding Streaming (RES) path, closing the must-have bar of T273802725 (subtask of T269497764, RES code-side observability this half). Today these failure sites only emit `XLOG(ERR)` / `LOG(ERROR)` — nothing aggregates, so on-call cannot trigger off them or query them in ODS without `tail -f` on individual trainer hosts.

Counters added (all via `TrainingPsOdsLogger::bumpKey`, OBC category `raw_embedding_streaming`):

| Counter | Site |
|---|---|
| `res.fail.shard_size_mismatch` | `raw_embedding_streamer.cpp` — both the outer `indices.size(0) != weights.size(0)` guard and the per-shard `weights_masked.size(0) != rows_in_shard` guard in `tensor_stream()` |
| `res.fail.set_embeddings_rpc` | `raw_embedding_streamer.cpp` — wraps `co_await res_client->co_setEmbeddings(req)` in try/catch; bumps then re-throws to preserve trainer-thread termination |
| `res.fail.write_sparse_throw` | `TrainingPsHandler.cpp` `catch (const std::exception& e)` block at the `writeSparseListOfTensors` site in `co_stream_tensors` |

Backend (addresses review: use OBC, not fb303): all three route through the existing `TrainingPsOdsLogger` OBC client. The handler already owned an `ods_logger_`; the fbgemm-side `RawEmbeddingStreamer` now holds a `TrainingPsOdsLogger` member (forward-declared in the OSS-mirrored header, constructed only when streaming is enabled). OBC reaches ODS through the host-level agent with no per-process fb303 export/scrape config, so this revision deletes the fb303 `ResFailCounters.h` helper (and its `call_once` / `addStatExportType` / `FOLLY_EXPORT` machinery) entirely.

Counters bump AT the failure site (inside catch / right before `continue`) so they reflect what actually happened — applying the correctness lesson from D104076551 V4 (don't preemptively count attempts at the top of the function). Per-TBE-per-rank granularity is fine: OBC aggregates per (entity, key), so per-TBE bumps land on the same per-host key rather than separate rows.

Existing log lines preserved (additive). Existing throw-on-failure semantics preserved — the `co_setEmbeddings` catch re-throws after bumping so trainer-thread termination is unchanged.

Out of scope (intentional, per T269497764 boundary):
- MPSC `weights_to_stream_queue_` near-full counter — queue is unbounded; deferred to the queue-depth gauge work in T273802756.
- Detectors / alerts on these counters — downstream consumer work.

Differential Revision: D107811590
@meta-codesync meta-codesync Bot changed the title Silent-failure ODS counters Silent-failure ODS counters (#5850) Jun 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant