PE-7693 | Add ElastiCache/Redis metrics via Alloy + Redis dashboard #60

Merged: r0ohafza merged 4 commits into main from nashville-v3 on Jun 5, 2026
r0ohafza commented Jun 4, 2026

Why

We had no visibility into our Redis (ElastiCache) clusters — there were zero aws_elasticache_* metrics in Grafana, so memory pressure, evictions, or a lagging replica were effectively invisible until something broke. This PR closes that gap.

What you get

  • Redis metrics in Grafana. Alloy now scrapes the key ElastiCache health signals from CloudWatch (memory usage, freeable memory, engine CPU, connections, cache hit rate, evictions, swap, replication lag), following the same pattern we already use for ECS and RDS.
  • A ready-made Redis dashboard (dashboards/server/redis.json) — 8 panels, broken out per node (primary vs. replica) so on-call can spot an unhealthy node at a glance.
  • A CI safety net. A new "Alloy Check" workflow runs alloy fmt and alloy validate on every PR. Previously formatting was manual and nothing validated the config before it reached a running container — now a malformed or invalid config can't merge.

Rollout (action required)

Config is baked into the Docker image, so metrics start flowing only after this PR is merged, a release vX.Y.Z-<sha> is cut, and Alloy is redeployed across all environments. Redis alerting intentionally stays in CloudWatch for now; this PR makes no alerting changes.

How it was verified

  • Confirmed that the ElastiCache replication groups in us-west-2 (production / staging / testing) carry the environment tag and expose every scraped metric at per-node granularity.
  • alloy fmt and alloy validate pass in CI against the prod-pinned Alloy v1.13.2.

🤖 Generated with Claude Code

r0ohafza and others added 4 commits June 4, 2026 12:29

Add a prometheus.exporter.cloudwatch "elasticache" block discovering
AWS/ElastiCache clusters via the environment tag, with Sum statistics for
the CacheHits/CacheMisses/Evictions counters and Average/Maximum for the
gauges. Wire a create_elasticache_labels relabel (service=server, node
identity preserved via dimension_CacheClusterId) and an elasticache scrape
job into the existing pipeline. Add the Redis dashboard (8 panels, per-node
breakout) and document the new exporter in the README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The dashboard shipped with a hand-typed placeholder uid
(a1b2c3d4-redis-server-cache-0001), inconsistent with the UUIDs used by
the sibling database/service dashboards. Swap in a generated UUIDv4 to
lock in a stable, collision-free identity before the dashboard is
imported into Grafana.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

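
One way to tell a hand-typed placeholder from a generated identifier is to check the uid against the UUIDv4 layout. The sketch below is illustrative only: the helper names `is_uuid_v4` and `new_dashboard_uid` are hypothetical and not part of this PR, and the generator assumes either `uuidgen` or a Linux `/proc` filesystem is available.

```shell
# is_uuid_v4 STRING
# Succeeds only if STRING is a lowercase version-4 UUID
# (version nibble 4, variant nibble 8, 9, a, or b).
is_uuid_v4() {
  printf '%s\n' "$1" |
    grep -Eq '^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$'
}

# new_dashboard_uid (hypothetical helper)
# Prints a fresh lowercase UUID. uuidgen normally emits a random (v4) UUID,
# but that depends on the platform, so callers should still run is_uuid_v4.
new_dashboard_uid() {
  if command -v uuidgen >/dev/null 2>&1; then
    uuidgen | tr '[:upper:]' '[:lower:]'
  elif [ -r /proc/sys/kernel/random/uuid ]; then
    cat /proc/sys/kernel/random/uuid
  else
    echo "no UUID source available" >&2
    return 1
  fi
}
```

With these helpers, `is_uuid_v4 "a1b2c3d4-redis-server-cache-0001"` fails, which is the kind of placeholder this commit replaces.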
There was no automated gate ensuring the .alloy config files are
formatted and valid; formatting relied on contributors running the Alloy
VS Code extension locally, and a malformed or invalid config could only
fail at container boot in a cloud environment.

Add an "Alloy Check" workflow that runs on pull requests (and pushes to
main) and:

  - runs `alloy fmt -t` on every config/*.alloy file, failing if any file
    is not formatted correctly
  - runs `alloy validate` over the whole config directory so cross-file
    pipeline references are checked together

Both checks use the exact Alloy version read from the Dockerfile, so CI
validates against the same binary that runs in production. The validate
step passes dummy values for the sys.env() references; it inspects config
structure and does not connect to any endpoint.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

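
The two checks described in this commit could look roughly like the following. This is a minimal sketch, not the workflow that was merged: the function names are hypothetical, `REMOTE_WRITE_PASSWORD` stands in for whichever variables the config actually reads through sys.env(), and the explicit "zero files" guard is an added assumption about how an empty match should be handled.

```shell
# check_alloy_format DIR
# Runs `alloy fmt -t` on every *.alloy file under DIR.
# Returns 0 if all files are formatted, 1 if any file would be changed,
# and 2 if no files were found, so an empty match cannot pass silently.
check_alloy_format() {
  find "$1" -type f -name '*.alloy' | (
    count=0
    failed=0
    while IFS= read -r file; do
      count=$((count + 1))
      # -t only tests formatting; a non-zero exit means the file needs changes.
      if ! alloy fmt -t "$file" </dev/null >/dev/null; then
        echo "not formatted: $file" >&2
        failed=1
      fi
    done
    if [ "$count" -eq 0 ]; then
      echo "no .alloy files found under $1" >&2
      exit 2
    fi
    exit "$failed"
  )
}

# validate_alloy_config DIR
# Validates the whole directory in one call so cross-file references are
# checked together. The variable below is a hypothetical placeholder value
# for a sys.env() reference; validation does not contact any endpoint.
validate_alloy_config() {
  REMOTE_WRITE_PASSWORD=dummy alloy validate "$1"
}
```

A CI step would then run something like `check_alloy_format config && validate_alloy_config config` and rely on the exit code to pass or fail the job.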
The Alloy team documents `alloy fmt --test` and `alloy validate` as the
CI contract (exit codes signal pass/fail) but ships no dedicated
setup/fmt/validate GitHub Action, so we invoke the CLI ourselves. Running
it through `docker run` required overriding the image entrypoint and a
mounted-volume find loop; installing the released binary is simpler and
faster.

Download the alloy-linux-amd64 release matching the version pinned in the
Dockerfile (keeping CI in parity with production), then run `alloy fmt -t`
per file and `alloy validate` over the config directory. The format loop
uses `find` rather than a bash globstar so it can never silently match
zero files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
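
Keeping CI in parity with production comes down to deriving the version from the Dockerfile instead of hard-coding it. The sketch below shows one way to do that under stated assumptions: it assumes the Dockerfile pins the image with a line of the form `FROM grafana/alloy:v1.13.2`, and that the release asset is published as `alloy-linux-amd64.zip` on the project's GitHub releases page. Both function names are hypothetical.

```shell
# alloy_version_from_dockerfile FILE
# Prints the version tag (for example v1.13.2) from the first
# "FROM grafana/alloy:<tag>" line. Fails if no such line exists, so CI
# never falls back to an unpinned version.
alloy_version_from_dockerfile() {
  _alloy_version=$(
    sed -n 's/^FROM[[:space:]]\{1,\}grafana\/alloy:\(v[0-9][0-9.]*\).*/\1/p' "$1" |
      head -n 1
  )
  if [ -z "$_alloy_version" ]; then
    echo "no pinned Alloy version found in $1" >&2
    return 1
  fi
  printf '%s\n' "$_alloy_version"
}

# alloy_release_url VERSION
# Builds the download URL for the linux-amd64 binary.
# The asset name and URL layout are assumptions; confirm them against the
# release page before relying on this.
alloy_release_url() {
  printf 'https://github.com/grafana/alloy/releases/download/%s/alloy-linux-amd64.zip\n' "$1"
}
```

A workflow step could then call `alloy_release_url "$(alloy_version_from_dockerfile Dockerfile)"`, download and unzip the result, and put the binary on PATH before running the format and validate checks.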
r0ohafza marked this pull request as ready for review June 4, 2026 23:12
r0ohafza requested a review from mihoward21 June 4, 2026 23:12

mihoward21 left a comment

think this all looks good. have you run it yet in testing or anything? fine to just "test in prod" if that's easier

r0ohafza commented Jun 5, 2026

> think this all looks good. have you run it yet in testing or anything? fine to just "test in prod" if that's easier

want to test it on prod directly

r0ohafza merged commit ce60d74 into main Jun 5, 2026
2 checks passed
r0ohafza deleted the nashville-v3 branch June 5, 2026 17:01