feat: add Micrometer observability to GlueSync listener and Kafka receiver #137
Merged
Conversation
Binds the Kafka consumer's native metrics (~30 out of the box: fetch latency, consumer lag, heartbeat rate, etc.) to a MeterRegistry via KafkaClientMetrics. The registry is injected explicitly via the builder withMeterRegistry(...) and defaults to null (opt-in). Bumps Micrometer from 1.9.9 to 1.14.14 to pick up KafkaClientMetrics support. Also adds .sdkmanrc pinning Java 11 for local builds (required by Spotless/google-java-format compatibility). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds per-operation counters with an `outcome` tag alongside the existing
binary success/failure counters. Each of the 8 event handler methods now
emits a tagged counter (e.g. glue_listener_create_database{outcome="created"})
capturing ignored events, specific outcomes (created/updated/deleted/renamed/
not_found), and exception class names for failures — without changing existing
metric behaviour.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… varargs Replaces the per-operation metric names with a single glue_listener_event counter, tagged with operation, result (success/failure/ignored), and outcome (created/updated/deleted/renamed/not_found, or exception class name). MetricService.incrementCounter now accepts Tag... for caller-controlled tagging. Tag keys and result values are centralised as constants in MetricConstants. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces recordEvent(operation, result, outcome) on MetricService, encapsulating the LISTENER_EVENT metric name and tag key construction. Removes incrementCounter(String, Tag...) and the Tag import from ApiaryGlueSync. Each outcome path is now a single readable line instead of a 200-char Tag.of chain. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Record rename duration as a Micrometer Timer (glue_listener_table_rename_duration) so latency is visible in metrics, not just logged - Handle EntityNotFoundException in onDropPartition as a success (not_found outcome) matching the existing onDropTable pattern, so idempotent deletes no longer increment the failure counter - Store the MeterRegistry as a field in MetricService so recordEvent and recordDuration both use the injected registry consistently Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…attern KafkaMessageReaderBuilder.build() now calls configuredRegistry() instead of defaulting to null (no metrics). In plain JVM deployments a JmxMeterRegistry is added to Metrics.globalRegistry; in Spring Boot the existing registry is reused unchanged — matching the pattern established in MetricService. Adds micrometer-registry-jmx to kafka-metastore-receiver (no shading needed; this module does not run inside HMS). Both configuredRegistry() copies carry a comment explaining why a shared utility would break the GlueSync shading. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Helps diagnose why metrics are flowing to JMX rather than an existing registry in plain JVM deployments. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace unbounded e.getClass().getSimpleName() tag values with a fixed vocabulary derived from a KNOWN_EXCEPTIONS allowlist, preventing indefinite registry growth under degraded Glue conditions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… init Synchronize configuredRegistry() in both MetricService and KafkaMessageReader to close the TOCTOU race where two HMS worker threads could each observe an empty registry and add a second JmxMeterRegistry, causing MBean registration failures on startup. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…throws Wrap metric deregistration in try-finally so a JMX InstanceNotFoundException during shutdown cannot leave the Kafka consumer's threads and broker connections alive. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…partition loops Replace Counter.builder().register() on every recordEvent() call with a lazy ConcurrentHashMap cache, matching the pre-registration pattern used by incrementCounter(). Eliminates Tags array allocation on hot partition event paths. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ate recovery in ApiaryGlueSync Replace inline nested catch blocks for AlreadyExistsException/EntityNotFoundException recovery with a shared GlueOutcome enum (CREATED/UPDATED) and extracted private methods: createOrUpdateDatabase, createOrUpdateTable, updateOrCreateTable, updateOrCreatePartition. Each event handler is now a flat single-catch structure with outcome.name() as the metric tag. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extract createOrUpdatePartition (create-first, AlreadyExistsException -> update) to flatten onAddPartition's nested catch into the same single-catch structure used by the other event handlers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…me tag case Add tests for the five previously uncovered recovery branches: onCreateHiveTableThatAlreadyExists, onAlterHiveTableThatDoesntExistInGlue, onAlterPartition (happy path), onAlterPartitionThatDoesntExistInGlue, onAddPartitionThatAlreadyExists. Update existing recordEvent assertions from lowercase "created"/"updated" to "CREATED"/"UPDATED" to match the GlueOutcome enum values emitted via outcome.name(). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…all handlers Add onDropTable (happy path) and onDropTableThatDoesntExistInGlue (EntityNotFoundException -> not_found) to cover the previously untested handler. Add allHandlers_ignoredWhenEventStatusFalse which fires all 8 event handlers with getStatus()=false and asserts the RESULT_IGNORED metric is recorded and glueClient is never called. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tions=true tests Add onDropPartition to cover the missing success path for partition deletion. Add onCreateTable_failureMetricsRecordedOnException to verify RESULT_FAILURE metrics and toOutcome() tag (OperationTimeoutException) are recorded when a non-recovery exception is thrown. Add onCreateTable_exceptionRethrownWhenThrowExceptionsEnabled to verify exceptions are wrapped as MetaException and rethrown when throwExceptions=true. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…re tests Add onAddPartition to cover the simple create-succeeds path independently of the service-layer retry test. Add onAlterIcebergTable_RenameTableSkipsRenameOperation to verify Iceberg renames bypass doRenameOperation and go through updateOrCreateTable instead. Add onAlterHiveTable_RenameOperationFailureIsMetered to verify that a mid-rename exception is caught and recorded as RESULT_FAILURE — this was unmetered on main before the try-block refactor. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…geReader comment Move static imports to top of MetricService per Java convention. Update comment in KafkaMessageReader from "Codahale classpath conflict" to "classpath conflicts" to avoid referencing a specific library name that may not always be the cause. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ument KNOWN_EXCEPTIONS ordering
GlueOutcome.name() was emitting uppercase tag values ("CREATED", "UPDATED") while all
other outcome strings were lowercase, breaking dashboard filtering on the outcome tag.
Replaced with OUTCOME_CREATED/OUTCOME_UPDATED constants in MetricConstants and updated
all call sites and test assertions. Added a comment above KNOWN_EXCEPTIONS explaining
the subclass-before-superclass ordering constraint.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
7b7efd8 to
84e8fdb
Compare
…ade relocation KNOWN_EXCEPTIONS used Class.isInstance() against class literals that don't match the shaded runtime types (com.expediagroup.apiary.shaded.com.amazonaws.*). Walk the class hierarchy using getSimpleName() instead so the outcome tag resolves correctly whether or not the shade plugin has relocated the AWS SDK classes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…gedObjectNameFactory Replaces the default JmxMeterRegistry with one backed by a Dropwizard MetricRegistry and a custom TaggedObjectNameFactory. taggedNameMapper() encodes tags into the Dropwizard metric name as "name[key=value,...]"; TaggedObjectNameFactory then promotes those pairs to real JMX ObjectName key properties so monitoring tools can filter by operation/result/outcome. Performance characteristics: - Registration cost is one-time per unique (operation, result, outcome) combination. After the first call, computeIfAbsent returns the cached Counter and the hot path is counter.increment() — a lock-free LongAdder.add(1). TaggedObjectNameFactory never touches the request thread again after initial registration. - Cardinality is bounded: ~8 operations × 3 results × 3-5 outcomes ≤ 72 unique counters, all registered within the first minutes of steady-state traffic. - recordEvent() is called inside the per-partition loop on onAddPartition/onDropPartition, where a single HMS event can carry hundreds of partitions. The string concat for the cache key allocates a small object per call, but computeIfAbsent hits the already- populated ConcurrentHashMap after the first call (hash lookup only). This is negligible compared to the synchronous Glue API call that precedes each partition operation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… KafkaMessageReader The two copies are no longer equivalent — MetricService now uses a Dropwizard-backed JmxMeterRegistry with TaggedObjectNameFactory for tagged JMX ObjectName properties, while KafkaMessageReader retains the plain JmxMeterRegistry. Updated comments to reflect the intentional difference and drop the stale "keep in sync" instruction. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…essageReader Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
javsanbel2
approved these changes
Jun 25, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds structured JMX metrics to
ApiaryGlueSyncso every HMS event outcome is observable without log-scraping.recordEvent(operation, result, outcome)emits alistener.eventcounter tagged withoperation(e.g.create_table),result(
success/failure/ignored), andoutcome(e.g.OperationTimeoutException,created,other). All failure paths, success paths, and ignored events are covered.TaggedObjectNameFactory+taggedNameMapper()encode Micrometer tags as proper JMX ObjectName key properties(
metrics:name=listener.event,operation=create_table,result=failure,outcome=OperationTimeoutException), making them filterable in JConsole/JMX tooling without string-parsing.Bug Fixes
toOutcome()was silently returning"other"for all AWS errors. The Maven shade plugin relocatescom.amazonaws.*tocom.expediagroup.apiary.shaded.com.amazonaws.*at runtime, soClass.isInstance()against the original class literals never matched. Fixed by walking the exception classhierarchy using
getSimpleName(), which is package-agnostic and survives relocation.configuredRegistry()is now synchronized and guards against double-registration under concurrent HMS init.consumer.close()guarantee in KafkaMessageReader —consumer.close()was not called ifkafkaClientMetrics.close()threw; fixed with a try-finally.Performance
Registration cost is one-time per unique
(operation, result, outcome)combination. After the first call,computeIfAbsentreturns the cachedCounterand the hot path iscounter.increment()— a lock-freeLongAdder.add(1).TaggedObjectNameFactorynever touches the request thread again.recordEvent()is called inside the per-partition loop ononAddPartition/onDropPartition, where a single HMS event can carry hundreds of partitions. The string concatforming the cache key allocates a small object per call, but
computeIfAbsenthits the already-populatedConcurrentHashMapafter the first call (hash lookup only). This isnegligible compared to the synchronous Glue API call that precedes each partition operation. All ≤72 counters are registered within the first minutes of steady-state traffic.
Test coverage
Baseline test coverage was established in #136. This PR adds
recordEventverifications to all existing HMS event handler tests, plus:OperationTimeoutException,InvalidInputExceptiontoOutcomehierarchy-walk tests: AWS subclass not in known list →"AmazonServiceException"; unknown exception →"other"TaggedObjectNameFactoryunit tests🤖 Generated with Claude Code