add job sdk observability #1103 #1146
Open
FangwenDave wants to merge 9 commits into
Wrap label extraction, soft-fail emission, and hard-fail recording in defensive try/except (_safe_record_hard / _safe_emit_soft) so a bug in instrumentation can never turn a successful phase into a failure. The original wrapped exception is always re-raised unchanged. Adds a regression test proving a reporter that always raises breaks neither the soft nor the hard path.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
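The defensive pattern described in this commit can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `_safe_call`, `monitor_phase`, `emit_soft_safely`, and `AlwaysRaisingReporter` are hypothetical names, and the real `_safe_record_hard` / `_safe_emit_soft` signatures are not shown on this page.

```python
import functools
import logging

logger = logging.getLogger("job.observability.sketch")


def _safe_call(fn, *args, **kwargs):
    """Run an instrumentation callback; log and swallow anything it raises."""
    try:
        fn(*args, **kwargs)
    except Exception:
        # Instrumentation must never change the outcome of the phase.
        logger.debug("instrumentation failed; ignoring", exc_info=True)


def emit_soft_safely(reporter, phase, detail):
    """Hypothetical soft path: report a soft fail without risking the phase."""
    _safe_call(reporter.emit_soft, phase, detail)


def monitor_phase(reporter, phase):
    """Hypothetical decorator: record a hard fail, then re-raise unchanged."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                _safe_call(reporter.record_hard, phase, exc)
                raise  # bare raise keeps the original exception and traceback
        return wrapper
    return decorator


class AlwaysRaisingReporter:
    """Worst-case reporter (assumed interface) used to exercise both paths."""
    def record_hard(self, phase, exc):
        raise RuntimeError("reporter bug")

    def emit_soft(self, phase, detail):
        raise RuntimeError("reporter bug")


reporter = AlwaysRaisingReporter()


@monitor_phase(reporter, "setup")
def ok_phase():
    emit_soft_safely(reporter, "setup", "exit 7")
    return "done"


@monitor_phase(reporter, "run")
def failing_phase():
    raise ValueError("real failure")
```

With a reporter that always raises, `ok_phase()` still returns its result and `failing_phase()` still surfaces the original `ValueError` rather than the reporter's `RuntimeError`.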
The prior commit registered atexit.register(self.shutdown) and stashed the reader/provider handles, but never defined shutdown(). With an OTLP endpoint configured, evaluating self.shutdown raised AttributeError inside _init_otel, silently degrading the reporter to log-only mode. Add an idempotent, exception-isolated shutdown() that force-flushes the MeterProvider on clean exit, so a short-lived job's last buffered exception metrics aren't lost. Covered by TestShutdown.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
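A minimal sketch of an idempotent, exception-isolated shutdown() registered with atexit might look like this. `SketchReporter` and `StubMeterProvider` are hypothetical stand-ins so the example runs without OpenTelemetry installed; the stub models only a `force_flush()` call, and the PR's actual reporter may be structured differently.

```python
import atexit
import logging
import threading

logger = logging.getLogger("job.observability.sketch")


class StubMeterProvider:
    """Stand-in for a metrics provider; only force_flush() is modelled."""
    def __init__(self, fail=False):
        self.flush_calls = 0
        self.fail = fail

    def force_flush(self):
        self.flush_calls += 1
        if self.fail:
            raise RuntimeError("collector unreachable")
        return True


class SketchReporter:
    """Hypothetical reporter showing only the shutdown wiring."""
    def __init__(self, provider=None):
        self._provider = provider
        self._lock = threading.Lock()
        self._shut_down = False
        if provider is not None:
            # shutdown() is defined on the class, so this attribute lookup
            # succeeds; a missing method would raise AttributeError here.
            atexit.register(self.shutdown)

    def shutdown(self):
        # Idempotent: only the first call gets past this guard.
        with self._lock:
            if self._shut_down:
                return
            self._shut_down = True
        if self._provider is None:
            return  # log-only mode, nothing to flush
        try:
            self._provider.force_flush()
        except Exception:
            # Exception-isolated: a failed flush must not break process exit.
            logger.debug("final metrics flush failed", exc_info=True)
```

Calling shutdown() explicitly and then letting the atexit hook fire flushes only once, and a provider that raises during the flush does not propagate the error to the caller.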
None of the existing job demos exercised the exception observability layer
(rock.sdk.job.observability). Add examples/job/observability/ with:
- observability_demo.py, with two modes:
    --mode self-test : infra-free; drives JobMetricsReporter + the
      monitor_job_phase decorator with stubs so you can see the exact
      structured ERROR log line and rock_job.exception.total counter
      increment for both a soft fail and a hard fail.
    --mode run : real Job(config).run() over a BashJobConfig; a
      --scenario chooses success / soft-fail (exit 7) / timeout, then it
      summarizes JobResult.status + per-trial exception_info and calls
      get_reporter().shutdown() to force a final metric flush.
- README.md documenting the two env knobs (ROCK_JOB_METRICS_OTLP_ENDPOINT,
  ROCK_JOB_METRICS_HIGH_CARDINALITY_LABELS), the soft/hard semantics, and
  how to wire the OTLP endpoint to a collector.
Also list the new subdir in examples/job/README.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Restructure the observability example to match the harbor example's shape: load a YAML config via JobConfig.from_yaml() and run Job(config).run().
- drop the white-box --mode self-test path (stub executor/trial/counter and private-attr poking); that capability check is already covered by tests/unit/sdk/job/test_observability.py
- add observability_job_config.yaml.template: a BashJobConfig that exits 7, producing exactly one soft-fail event to observe
- keep the two observability-specific touches: echo the two env knobs at startup, and force a final metrics flush via reporter.shutdown() on exit
- README rewritten harbor-style (template + quick run + expected output)
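The two observability-specific touches kept in the example (echo the env knobs at startup, flush on exit) could be sketched as below. The helper names `describe_env_knobs` and `run_with_final_flush` are hypothetical, and the job and reporter are passed in as plain callables so the sketch does not assume anything about the SDK beyond what the commit message states.

```python
import os

# The two env knobs named in the commit message.
ENV_KNOBS = (
    "ROCK_JOB_METRICS_OTLP_ENDPOINT",
    "ROCK_JOB_METRICS_HIGH_CARDINALITY_LABELS",
)


def describe_env_knobs(environ=os.environ):
    """Return one 'NAME=value' line per knob, marking unset ones."""
    return [f"{name}={environ.get(name, '<unset>')}" for name in ENV_KNOBS]


def run_with_final_flush(run_job, shutdown, environ=os.environ, echo=print):
    """Echo the knobs, run the job, and always attempt a final flush."""
    for line in describe_env_knobs(environ):
        echo(line)
    try:
        return run_job()
    finally:
        # Runs on success and on failure, so buffered metrics get a
        # chance to be exported before the process exits.
        shutdown()
```

In the example script, `run_job` would presumably wrap `Job(config).run()` and `shutdown` would be the reporter's `shutdown` method; both are injected here to keep the sketch runnable on its own.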
Resolves issue #1103.