Skip to content

feat: add Micrometer observability to GlueSync listener and Kafka receiver #137

Merged
jamespfaulkner merged 29 commits into
mainfrom
feat/EGDL-8244/improved-observability
Jun 25, 2026
Merged

feat: add Micrometer observability to GlueSync listener and Kafka receiver #137
jamespfaulkner merged 29 commits into
mainfrom
feat/EGDL-8244/improved-observability

Conversation

@jamespfaulkner

@jamespfaulkner jamespfaulkner commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds structured JMX metrics to ApiaryGlueSync so every HMS event outcome is observable without log-scraping.

  • Structured event countersrecordEvent(operation, result, outcome) emits a listener.event counter tagged with operation (e.g. create_table), result
    (success/failure/ignored), and outcome (e.g. OperationTimeoutException, created, other). All failure paths, success paths, and ignored events are covered.
  • JMX tag promotionTaggedObjectNameFactory + taggedNameMapper() encode Micrometer tags as proper JMX ObjectName key properties
    (metrics:name=listener.event,operation=create_table,result=failure,outcome=OperationTimeoutException), making them filterable in JConsole/JMX tooling without string-parsing.
  • Bounded cardinality — ~8 operations × 3 results × 3–5 outcomes gives ≤72 unique counters.

Bug Fixes

  • Shade-safe outcome classificationtoOutcome() was silently returning "other" for all AWS errors. The Maven shade plugin relocates com.amazonaws.* to
    com.expediagroup.apiary.shaded.com.amazonaws.* at runtime, so Class.isInstance() against the original class literals never matched. Fixed by walking the exception class
    hierarchy using getSimpleName(), which is package-agnostic and survives relocation.
  • Duplicate JmxMeterRegistry preventionconfiguredRegistry() is now synchronized and guards against double-registration under concurrent HMS init.
  • consumer.close() guarantee in KafkaMessageReaderconsumer.close() was not called if kafkaClientMetrics.close() threw; fixed with a try-finally.

Performance

Registration cost is one-time per unique (operation, result, outcome) combination. After the first call, computeIfAbsent returns the cached Counter and the hot path is
counter.increment() — a lock-free LongAdder.add(1). TaggedObjectNameFactory never touches the request thread again.

recordEvent() is called inside the per-partition loop on onAddPartition/onDropPartition, where a single HMS event can carry hundreds of partitions. The string concat
forming the cache key allocates a small object per call, but computeIfAbsent hits the already-populated ConcurrentHashMap after the first call (hash lookup only). This is
negligible compared to the synchronous Glue API call that precedes each partition operation. All ≤72 counters are registered within the first minutes of steady-state traffic.

Test coverage

Baseline test coverage was established in #136. This PR adds recordEvent verifications to all existing HMS event handler tests, plus:

  • Failure metric tests covering OperationTimeoutException, InvalidInputException
  • toOutcome hierarchy-walk tests: AWS subclass not in known list → "AmazonServiceException"; unknown exception → "other"
  • TaggedObjectNameFactory unit tests
  • Rename failure metering, partition-not-found path

🤖 Generated with Claude Code

jamespfaulkner and others added 23 commits June 19, 2026 10:12
Binds the Kafka consumer's native metrics (~30 out of the box: fetch
latency, consumer lag, heartbeat rate, etc.) to a MeterRegistry via
KafkaClientMetrics. The registry is injected explicitly via the builder
withMeterRegistry(...) and defaults to null (opt-in). Bumps Micrometer
from 1.9.9 to 1.14.14 to pick up KafkaClientMetrics support.

Also adds .sdkmanrc pinning Java 11 for local builds (required by
Spotless/google-java-format compatibility).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds per-operation counters with an `outcome` tag alongside the existing
binary success/failure counters. Each of the 8 event handler methods now
emits a tagged counter (e.g. glue_listener_create_database{outcome="created"})
capturing ignored events, specific outcomes (created/updated/deleted/renamed/
not_found), and exception class names for failures — without changing existing
metric behaviour.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… varargs

Replaces the per-operation metric names with a single glue_listener_event
counter, tagged with operation, result (success/failure/ignored), and outcome
(created/updated/deleted/renamed/not_found, or exception class name).
MetricService.incrementCounter now accepts Tag... for caller-controlled
tagging. Tag keys and result values are centralised as constants in
MetricConstants.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces recordEvent(operation, result, outcome) on MetricService,
encapsulating the LISTENER_EVENT metric name and tag key construction.
Removes incrementCounter(String, Tag...) and the Tag import from
ApiaryGlueSync. Each outcome path is now a single readable line instead
of a 200-char Tag.of chain.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Record rename duration as a Micrometer Timer
  (glue_listener_table_rename_duration) so latency is visible in
  metrics, not just logged
- Handle EntityNotFoundException in onDropPartition as a success
  (not_found outcome) matching the existing onDropTable pattern, so
  idempotent deletes no longer increment the failure counter
- Store the MeterRegistry as a field in MetricService so recordEvent
  and recordDuration both use the injected registry consistently

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…attern

KafkaMessageReaderBuilder.build() now calls configuredRegistry() instead of
defaulting to null (no metrics). In plain JVM deployments a JmxMeterRegistry
is added to Metrics.globalRegistry; in Spring Boot the existing registry is
reused unchanged — matching the pattern established in MetricService.

Adds micrometer-registry-jmx to kafka-metastore-receiver (no shading needed;
this module does not run inside HMS). Both configuredRegistry() copies carry a
comment explaining why a shared utility would break the GlueSync shading.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Helps diagnose why metrics are flowing to JMX rather than an existing
registry in plain JVM deployments.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace unbounded e.getClass().getSimpleName() tag values with a fixed
vocabulary derived from a KNOWN_EXCEPTIONS allowlist, preventing
indefinite registry growth under degraded Glue conditions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… init

Synchronize configuredRegistry() in both MetricService and KafkaMessageReader
to close the TOCTOU race where two HMS worker threads could each observe an
empty registry and add a second JmxMeterRegistry, causing MBean registration
failures on startup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…throws

Wrap metric deregistration in try-finally so a JMX InstanceNotFoundException
during shutdown cannot leave the Kafka consumer's threads and broker
connections alive.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…partition loops

Replace Counter.builder().register() on every recordEvent() call with a
lazy ConcurrentHashMap cache, matching the pre-registration pattern used
by incrementCounter(). Eliminates Tags array allocation on hot partition
event paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ate recovery in ApiaryGlueSync

Replace inline nested catch blocks for AlreadyExistsException/EntityNotFoundException
recovery with a shared GlueOutcome enum (CREATED/UPDATED) and extracted private methods:
createOrUpdateDatabase, createOrUpdateTable, updateOrCreateTable, updateOrCreatePartition.
Each event handler is now a flat single-catch structure with outcome.name() as the metric tag.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extract createOrUpdatePartition (create-first, AlreadyExistsException -> update)
to flatten onAddPartition's nested catch into the same single-catch structure
used by the other event handlers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…me tag case

Add tests for the five previously uncovered recovery branches:
onCreateHiveTableThatAlreadyExists, onAlterHiveTableThatDoesntExistInGlue,
onAlterPartition (happy path), onAlterPartitionThatDoesntExistInGlue,
onAddPartitionThatAlreadyExists.

Update existing recordEvent assertions from lowercase "created"/"updated" to
"CREATED"/"UPDATED" to match the GlueOutcome enum values emitted via outcome.name().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…all handlers

Add onDropTable (happy path) and onDropTableThatDoesntExistInGlue (EntityNotFoundException
-> not_found) to cover the previously untested handler. Add
allHandlers_ignoredWhenEventStatusFalse which fires all 8 event handlers with
getStatus()=false and asserts the RESULT_IGNORED metric is recorded and glueClient
is never called.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tions=true tests

Add onDropPartition to cover the missing success path for partition deletion.
Add onCreateTable_failureMetricsRecordedOnException to verify RESULT_FAILURE metrics
and toOutcome() tag (OperationTimeoutException) are recorded when a non-recovery
exception is thrown. Add onCreateTable_exceptionRethrownWhenThrowExceptionsEnabled
to verify exceptions are wrapped as MetaException and rethrown when throwExceptions=true.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…re tests

Add onAddPartition to cover the simple create-succeeds path independently of
the service-layer retry test. Add onAlterIcebergTable_RenameTableSkipsRenameOperation
to verify Iceberg renames bypass doRenameOperation and go through updateOrCreateTable
instead. Add onAlterHiveTable_RenameOperationFailureIsMetered to verify that a
mid-rename exception is caught and recorded as RESULT_FAILURE — this was unmetered
on main before the try-block refactor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…geReader comment

Move static imports to top of MetricService per Java convention. Update comment
in KafkaMessageReader from "Codahale classpath conflict" to "classpath conflicts"
to avoid referencing a specific library name that may not always be the cause.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ument KNOWN_EXCEPTIONS ordering

GlueOutcome.name() was emitting uppercase tag values ("CREATED", "UPDATED") while all
other outcome strings were lowercase, breaking dashboard filtering on the outcome tag.
Replaced with OUTCOME_CREATED/OUTCOME_UPDATED constants in MetricConstants and updated
all call sites and test assertions. Added a comment above KNOWN_EXCEPTIONS explaining
the subclass-before-superclass ordering constraint.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jamespfaulkner jamespfaulkner force-pushed the feat/EGDL-8244/improved-observability branch from 7b7efd8 to 84e8fdb Compare June 19, 2026 13:01
jamespfaulkner and others added 6 commits June 19, 2026 15:10
…ade relocation

KNOWN_EXCEPTIONS used Class.isInstance() against class literals that don't match the
shaded runtime types (com.expediagroup.apiary.shaded.com.amazonaws.*). Walk the class
hierarchy using getSimpleName() instead so the outcome tag resolves correctly whether
or not the shade plugin has relocated the AWS SDK classes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…gedObjectNameFactory

Replaces the default JmxMeterRegistry with one backed by a Dropwizard MetricRegistry and
a custom TaggedObjectNameFactory. taggedNameMapper() encodes tags into the Dropwizard
metric name as "name[key=value,...]"; TaggedObjectNameFactory then promotes those pairs
to real JMX ObjectName key properties so monitoring tools can filter by operation/result/outcome.

Performance characteristics:
- Registration cost is one-time per unique (operation, result, outcome) combination.
  After the first call, computeIfAbsent returns the cached Counter and the hot path is
  counter.increment() — a lock-free LongAdder.add(1). TaggedObjectNameFactory never
  touches the request thread again after initial registration.
- Cardinality is bounded: ~8 operations × 3 results × 3-5 outcomes ≤ 72 unique counters,
  all registered within the first minutes of steady-state traffic.
- recordEvent() is called inside the per-partition loop on onAddPartition/onDropPartition,
  where a single HMS event can carry hundreds of partitions. The string concat for the
  cache key allocates a small object per call, but computeIfAbsent hits the already-
  populated ConcurrentHashMap after the first call (hash lookup only). This is negligible
  compared to the synchronous Glue API call that precedes each partition operation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… KafkaMessageReader

The two copies are no longer equivalent — MetricService now uses a Dropwizard-backed
JmxMeterRegistry with TaggedObjectNameFactory for tagged JMX ObjectName properties,
while KafkaMessageReader retains the plain JmxMeterRegistry. Updated comments to
reflect the intentional difference and drop the stale "keep in sync" instruction.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…essageReader

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jamespfaulkner jamespfaulkner marked this pull request as ready for review June 25, 2026 09:33
@jamespfaulkner jamespfaulkner requested a review from a team as a code owner June 25, 2026 09:33
@jamespfaulkner jamespfaulkner merged commit f40e2b9 into main Jun 25, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants