Silent-failure ODS counters (#5850)#5850
Open
adityas-meta wants to merge 1 commit into
Open
Conversation
8900338 to
316d62e
Compare
Summary: X-link: facebookresearch/FBGEMM#2768 Add OBC counters for 3 silent failure modes in the Raw Embedding Streaming (RES) path, closing the must-have bar of T273802725 (subtask of T269497764, RES code-side observability this half). Today these failure sites only emit `XLOG(ERR)` / `LOG(ERROR)` — nothing aggregates, so on-call cannot trigger off them or query them in ODS without `tail -f` on individual trainer hosts. Counters added (all via `TrainingPsOdsLogger::bumpKey`, OBC category `raw_embedding_streaming`): | Counter | Site | |---|---| | `res.fail.shard_size_mismatch` | `raw_embedding_streamer.cpp` — both the outer `indices.size(0) != weights.size(0)` guard and the per-shard `weights_masked.size(0) != rows_in_shard` guard in `tensor_stream()` | | `res.fail.set_embeddings_rpc` | `raw_embedding_streamer.cpp` — wraps `co_await res_client->co_setEmbeddings(req)` in try/catch; bumps then re-throws to preserve trainer-thread termination | | `res.fail.write_sparse_throw` | `TrainingPsHandler.cpp` `catch (const std::exception& e)` block at the `writeSparseListOfTensors` site in `co_stream_tensors` | Backend (addresses review: use OBC, not fb303): all three route through the existing `TrainingPsOdsLogger` OBC client. The handler already owned an `ods_logger_`; the fbgemm-side `RawEmbeddingStreamer` now holds a `TrainingPsOdsLogger` member (forward-declared in the OSS-mirrored header, constructed only when streaming is enabled). OBC reaches ODS through the host-level agent with no per-process fb303 export/scrape config, so this revision deletes the fb303 `ResFailCounters.h` helper (and its `call_once` / `addStatExportType` / `FOLLY_EXPORT` machinery) entirely. Counters bump AT the failure site (inside catch / right before `continue`) so they reflect what actually happened — applying the correctness lesson from D104076551 V4 (don't preemptively count attempts at the top of the function). Per-TBE-per-rank granularity is fine: OBC aggregates per (entity, key), so per-TBE bumps land on the same per-host key rather than separate rows. Existing log lines preserved (additive). Existing throw-on-failure semantics preserved — the `co_setEmbeddings` catch re-throws after bumping so trainer-thread termination is unchanged. Out of scope (intentional, per T269497764 boundary): - MPSC `weights_to_stream_queue_` near-full counter — queue is unbounded; deferred to the queue-depth gauge work in T273802756. - Detectors / alerts on these counters — downstream consumer work. Differential Revision: D107811590
316d62e to
96c09f7
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2768
Add OBC counters for 3 silent failure modes in the Raw Embedding Streaming (RES) path, closing the must-have bar of T273802725 (subtask of T269497764, RES code-side observability this half). Today these failure sites only emit
XLOG(ERR)/LOG(ERROR)— nothing aggregates, so on-call cannot trigger off them or query them in ODS withouttail -fon individual trainer hosts.Counters added (all via
TrainingPsOdsLogger::bumpKey, OBC categoryraw_embedding_streaming):res.fail.shard_size_mismatchraw_embedding_streamer.cpp— both the outerindices.size(0) != weights.size(0)guard and the per-shardweights_masked.size(0) != rows_in_shardguard intensor_stream()res.fail.set_embeddings_rpcraw_embedding_streamer.cpp— wrapsco_await res_client->co_setEmbeddings(req)in try/catch; bumps then re-throws to preserve trainer-thread terminationres.fail.write_sparse_throwTrainingPsHandler.cppcatch (const std::exception& e)block at thewriteSparseListOfTensorssite inco_stream_tensorsBackend (addresses review: use OBC, not fb303): all three route through the existing
TrainingPsOdsLoggerOBC client. The handler already owned anods_logger_; the fbgemm-sideRawEmbeddingStreamernow holds aTrainingPsOdsLoggermember (forward-declared in the OSS-mirrored header, constructed only when streaming is enabled). OBC reaches ODS through the host-level agent with no per-process fb303 export/scrape config, so this revision deletes the fb303ResFailCounters.hhelper (and itscall_once/addStatExportType/FOLLY_EXPORTmachinery) entirely.Counters bump AT the failure site (inside catch / right before
continue) so they reflect what actually happened — applying the correctness lesson from D104076551 V4 (don't preemptively count attempts at the top of the function). Per-TBE-per-rank granularity is fine: OBC aggregates per (entity, key), so per-TBE bumps land on the same per-host key rather than separate rows.Existing log lines preserved (additive). Existing throw-on-failure semantics preserved — the
co_setEmbeddingscatch re-throws after bumping so trainer-thread termination is unchanged.Out of scope (intentional, per T269497764 boundary):
weights_to_stream_queue_near-full counter — queue is unbounded; deferred to the queue-depth gauge work in T273802756.Differential Revision: D107811590