add job sdk observability #1103 #1146

Open
FangwenDave wants to merge 9 commits into alibaba:master from FangwenDave:feat/job-sdk-observability

@FangwenDave

Collaborator

Resolves #1103.

FangwenDave and others added 9 commits June 15, 2026 07:14
Wrap label extraction, soft-fail emission, and hard-fail recording in
defensive try/except (_safe_record_hard / _safe_emit_soft) so a bug in
instrumentation can never turn a successful phase into a failure. The
original wrapped exception is always re-raised unchanged. Adds a
regression test proving a reporter that always raises breaks neither the
soft nor the hard path.
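The isolation pattern described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the helper names `_safe_record_hard` and `_safe_emit_soft` come from the commit message, but their signatures, the `monitor_phase` decorator, the `reporter.record(...)` method, and the `exit_code` attribute are all assumptions made for the example.

```python
import functools
import logging
from types import SimpleNamespace

logger = logging.getLogger("observability_sketch")


def _safe_record_hard(reporter, phase, exc):
    """Record a hard failure; never let a reporter bug escape."""
    try:
        reporter.record(phase=phase, kind="hard", error=type(exc).__name__)
    except Exception:
        logger.debug("hard-fail recording failed; ignoring", exc_info=True)


def _safe_emit_soft(reporter, phase, result):
    """Extract labels and emit a soft failure inside one guarded block."""
    try:
        exit_code = getattr(result, "exit_code", 0)  # assumed attribute
        if exit_code:
            reporter.record(phase=phase, kind="soft", exit_code=exit_code)
    except Exception:
        logger.debug("soft-fail emission failed; ignoring", exc_info=True)


def monitor_phase(reporter, phase):
    """Hypothetical decorator: instrument a phase without changing its outcome."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                _safe_record_hard(reporter, phase, exc)
                raise  # the original exception propagates unchanged
            _safe_emit_soft(reporter, phase, result)
            return result
        return wrapper
    return decorator


class AlwaysRaisingReporter:
    """Stand-in for a broken reporter, as in the regression test described."""
    def record(self, **labels):
        raise RuntimeError("reporter is broken")


@monitor_phase(AlwaysRaisingReporter(), "run")
def soft_failing_phase():
    return SimpleNamespace(exit_code=7)


@monitor_phase(AlwaysRaisingReporter(), "run")
def hard_failing_phase():
    raise ValueError("original failure")
```

With a reporter that always raises, `soft_failing_phase()` still returns its result and `hard_failing_phase()` still raises the original `ValueError`, which is the property the regression test is said to check.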

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The prior commit registered atexit.register(self.shutdown) and stashed
the reader/provider handles, but never defined shutdown(). With an OTLP
endpoint configured, evaluating self.shutdown raised AttributeError
inside _init_otel, silently degrading the reporter to log-only mode.

Add an idempotent, exception-isolated shutdown() that force-flushes the
MeterProvider on clean exit, so a short-lived job's last buffered
exception metrics aren't lost. Covered by TestShutdown.
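A minimal sketch of an idempotent, exception-isolated `shutdown()` registered with `atexit` might look like this. The `ReporterSketch` class and the stub providers are hypothetical; the only assumption about the real provider is that it exposes a `force_flush()` method, as the commit message implies for the MeterProvider.

```python
import atexit
import threading


class ReporterSketch:
    """Hypothetical reporter that flushes its provider once on exit."""

    def __init__(self, provider=None):
        self._provider = provider
        self._shutdown_lock = threading.Lock()
        self._shutdown_done = False
        # shutdown() is defined on the class, so this attribute lookup
        # succeeds instead of raising AttributeError.
        atexit.register(self.shutdown)

    def shutdown(self):
        # Idempotent: only the first call does any work.
        with self._shutdown_lock:
            if self._shutdown_done:
                return
            self._shutdown_done = True
        if self._provider is None:
            return  # log-only mode, nothing to flush
        try:
            self._provider.force_flush()
        except Exception:
            # Exception-isolated: a failing flush must not break process exit.
            pass


class CountingProvider:
    """Stub provider that counts flush calls."""
    def __init__(self):
        self.flushes = 0

    def force_flush(self):
        self.flushes += 1


class BrokenProvider:
    """Stub provider whose flush always fails."""
    def force_flush(self):
        raise RuntimeError("collector unreachable")
```

Calling `shutdown()` explicitly and then letting the `atexit` hook fire flushes only once, and a provider that raises during the flush does not propagate the error.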

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

None of the existing job demos exercised the exception observability layer
(rock.sdk.job.observability). Add examples/job/observability/ with:

- observability_demo.py — two modes:
    --mode self-test : infra-free; drives JobMetricsReporter + the
      monitor_job_phase decorator with stubs so you can see the exact
      structured ERROR log line and rock_job.exception.total counter
      increment for both a soft fail and a hard fail.
    --mode run       : real Job(config).run() over a BashJobConfig; a
      --scenario chooses success / soft-fail (exit 7) / timeout, then it
      summarizes JobResult.status + per-trial exception_info and calls
      get_reporter().shutdown() to force a final metric flush.
- README.md documenting the two env knobs (ROCK_JOB_METRICS_OTLP_ENDPOINT,
  ROCK_JOB_METRICS_HIGH_CARDINALITY_LABELS), the soft/hard semantics, and
  how to wire the OTLP endpoint to a collector.

Also list the new subdir in examples/job/README.md.
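To make the three scenarios concrete, here is a rough, SDK-free sketch of how success, soft-fail (exit 7), and timeout could be produced and told apart. It does not use the rock SDK: the `SCENARIOS` table, the `run_scenario` helper, and the status strings are hypothetical, and a Python child process stands in for the bash script so the example runs anywhere.

```python
import subprocess
import sys

# Hypothetical scenario table: each value is source code for a child process.
SCENARIOS = {
    "success": "raise SystemExit(0)",
    "soft-fail": "raise SystemExit(7)",
    "timeout": "import time; time.sleep(30)",
}


def run_scenario(name, timeout_s=10.0):
    """Run one scenario in a child process and classify the outcome."""
    cmd = [sys.executable, "-c", SCENARIOS[name]]
    try:
        proc = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return {"status": "timeout", "exit_code": None}
    status = "success" if proc.returncode == 0 else "soft-fail"
    return {"status": status, "exit_code": proc.returncode}
```

For instance, `run_scenario("soft-fail")` reports exit code 7, while `run_scenario("timeout", timeout_s=0.5)` gives up after half a second and reports a timeout.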

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Restructure the observability example to match the harbor example's shape:
load a YAML config via JobConfig.from_yaml() and run Job(config).run().

- drop the white-box --mode self-test path (stub executor/trial/counter and
  private-attr poking) — that capability check is already covered by
  tests/unit/sdk/job/test_observability.py
- add observability_job_config.yaml.template: a BashJobConfig that exits 7,
  producing exactly one soft-fail event to observe
- keep the two observability-specific touches: echo the two env knobs at
  startup, and force a final metrics flush via reporter.shutdown() on exit
- README rewritten harbor-style (template + quick run + expected output)
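The two observability-specific touches kept in the example (echoing the env knobs at startup and forcing a final flush on exit) could be sketched as below. The environment variable names come from the commit message; `echo_knobs`, `run_with_final_flush`, and the stub reporter are hypothetical, and the real script would call the SDK's job runner in place of the `run_job` callable.

```python
import os

KNOBS = (
    "ROCK_JOB_METRICS_OTLP_ENDPOINT",
    "ROCK_JOB_METRICS_HIGH_CARDINALITY_LABELS",
)


def echo_knobs(environ=None):
    """Print each knob with its raw value, or <unset> when it is missing."""
    environ = os.environ if environ is None else environ
    lines = [f"{name}={environ.get(name, '<unset>')}" for name in KNOBS]
    for line in lines:
        print(line)
    return lines


def run_with_final_flush(run_job, reporter, environ=None):
    """Echo the knobs, run the job, and always flush metrics afterwards."""
    echo_knobs(environ)
    try:
        return run_job()
    finally:
        # Runs on success and on failure, so buffered metrics are not lost.
        reporter.shutdown()


class StubReporter:
    """Stand-in reporter that counts shutdown calls."""
    def __init__(self):
        self.shutdown_calls = 0

    def shutdown(self):
        self.shutdown_calls += 1
```

Putting the flush in a `finally` block means it also runs when the job raises, which matters for a short-lived process whose last metrics may still be buffered.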