PE-7693 | Add ElastiCache/Redis metrics via Alloy + Redis dashboard by r0ohafza · Pull Request #60 · 0xSplits/alloy

r0ohafza · 2026-06-04T19:30:03Z

Why

We had no visibility into our Redis (ElastiCache) clusters — there were zero aws_elasticache_* metrics in Grafana, so memory pressure, evictions, or a lagging replica were effectively invisible until something broke. This PR closes that gap.

What you get

Redis metrics in Grafana. Alloy now scrapes the key ElastiCache health signals from CloudWatch (memory usage, freeable memory, engine CPU, connections, cache hit rate, evictions, swap, replication lag), following the same pattern we already use for ECS and RDS.
A ready-made Redis dashboard (dashboards/server/redis.json) — 8 panels, broken out per node (primary vs. replica) so on-call can spot an unhealthy node at a glance.
A CI safety net. A new "Alloy Check" workflow runs alloy fmt and alloy validate on every PR. Previously formatting was manual and nothing validated the config before it reached a running container — now a malformed or invalid config can't merge.

Rollout (action required)

Config is baked into the Docker image, so merging alone does not start the metrics flowing: after merge, cut a release vX.Y.Z-<sha>, then redeploy Alloy across all environments. Redis alerting intentionally stays in CloudWatch for now; this PR makes no alerting changes.
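A minimal sketch of the release tag shape, assuming `<sha>` is the 7-character short commit hash (as in the `v1.12.2-ce60d74` tag from the follow-up releases PR). `release_tag` is a hypothetical helper for illustration, not something this repo ships.

```python
import re

# Hypothetical helper: the PR only states the tag shape vX.Y.Z-<sha>.
# Assumption: <sha> is the 7-character short hash of the merge commit.
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")
HEX_SHA = re.compile(r"^[0-9a-f]{7,40}$")


def release_tag(version: str, commit_sha: str, short_len: int = 7) -> str:
    """Build a vX.Y.Z-<sha> tag from a version and a (full or short) commit hash."""
    version = version.removeprefix("v")  # accept "1.2.3" or "v1.2.3"
    if not SEMVER.match(version):
        raise ValueError(f"not an X.Y.Z version: {version!r}")
    sha = commit_sha.lower()
    if not HEX_SHA.match(sha):
        raise ValueError(f"not a hex commit hash: {commit_sha!r}")
    return f"v{version}-{sha[:short_len]}"
```

For example, `release_tag("1.12.2", "ce60d74")` returns `"v1.12.2-ce60d74"`. Validating both parts up front means a typo fails before an image is built, not after a redeploy.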

How it was verified

Confirmed that the ElastiCache replication groups in us-west-2 (production / staging / testing) carry the environment tag and expose every scraped metric at per-node granularity.
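The per-node check can be sketched as a set difference per `CacheClusterId`. This is an illustration under assumptions: the CloudWatch metric names below are our guess at what backs each signal in "What you get" (in particular `DatabaseMemoryUsagePercentage` for "memory usage"), the node ids are made up, and `exposed` would have to be built by hand, for example from `aws cloudwatch list-metrics --namespace AWS/ElastiCache`. As we read the AWS docs, `ReplicationLag` is only reported by read replicas, so primaries are exempt from it.

```python
# Assumed CloudWatch metric names behind the signals listed in the PR;
# check them against the metric blocks in the actual Alloy config.
REQUIRED_METRICS = frozenset({
    "DatabaseMemoryUsagePercentage",  # "memory usage" (assumption)
    "FreeableMemory",
    "EngineCPUUtilization",
    "CurrConnections",
    "CacheHits",
    "CacheMisses",
    "Evictions",
    "SwapUsage",
    "ReplicationLag",
})

# Metrics that only read replicas publish, so primaries are not expected to have them.
REPLICA_ONLY = frozenset({"ReplicationLag"})


def missing_per_node(
    exposed: dict[str, set[str]],
    primaries: set[str],
    required: frozenset = REQUIRED_METRICS,
    replica_only: frozenset = REPLICA_ONLY,
) -> dict[str, list[str]]:
    """Map each CacheClusterId with gaps to its sorted list of missing metrics."""
    gaps: dict[str, list[str]] = {}
    for node_id, metrics in exposed.items():
        expected = required - replica_only if node_id in primaries else required
        missing = sorted(expected - metrics)
        if missing:
            gaps[node_id] = missing
    return gaps
```

An empty result means every node publishes everything the dashboard panels need.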
alloy fmt and alloy validate pass in CI against the prod-pinned Alloy v1.13.2.

🤖 Generated with Claude Code

Add a prometheus.exporter.cloudwatch "elasticache" block discovering AWS/ElastiCache clusters via the environment tag, with Sum statistics for the CacheHits/CacheMisses/Evictions counters and Average/Maximum for the gauges. Wire a create_elasticache_labels relabel (service=server, node identity preserved via dimension_CacheClusterId) and an elasticache scrape job into the existing pipeline. Add the Redis dashboard (8 panels, per-node breakout) and document the new exporter in the README. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
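With Sum statistics on `CacheHits`/`CacheMisses`, the hit rate for a period is hits / (hits + misses). A minimal sketch, assuming the rate is derived from those two counters; both function names are hypothetical. The window variant shows why summed counters matter: averaging per-period ratios weights an idle minute the same as a busy one.

```python
from typing import Optional, Sequence


def cache_hit_rate(hits_sum: float, misses_sum: float) -> Optional[float]:
    """Hit rate for one period from Sum-statistic counters.

    Returns None when the node served no lookups, instead of dividing by zero.
    """
    lookups = hits_sum + misses_sum
    if lookups == 0:
        return None
    return hits_sum / lookups


def window_hit_rate(hits: Sequence[float], misses: Sequence[float]) -> Optional[float]:
    """Hit rate over several periods: sum the counters first, then divide.

    This weights each period by its traffic; averaging per-period ratios would not.
    """
    return cache_hit_rate(sum(hits), sum(misses))
```

For example, periods with (90 hits, 10 misses) and (0 hits, 10 misses) have per-period rates of 0.9 and 0.0 (mean 0.45), while the traffic-weighted window rate is 90 / 110, about 0.82.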

The dashboard shipped with a hand-typed placeholder uid (a1b2c3d4-redis-server-cache-0001), inconsistent with the UUIDs used by the sibling database/service dashboards. Swap in a generated UUIDv4 to lock in a stable, collision-free identity before the dashboard is imported into Grafana. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
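A minimal sketch of the uid swap, assuming the dashboard JSON keeps its identifier in a top-level `"uid"` field (as Grafana dashboard exports do). `assign_uid` and `fix_dashboard_file` are hypothetical helpers. The swap is deliberately one-shot: any uid that already parses as a UUID is left alone, so rerunning the script never changes a dashboard's identity after import.

```python
import json
import uuid

PLACEHOLDER_UID = "a1b2c3d4-redis-server-cache-0001"


def assign_uid(dashboard: dict) -> dict:
    """Return a copy with a fresh UUIDv4 uid, unless the uid is already a valid UUID."""
    updated = dict(dashboard)
    try:
        uuid.UUID(str(updated.get("uid", "")))
        return updated  # already a well-formed UUID: keep it stable
    except ValueError:
        updated["uid"] = str(uuid.uuid4())
        return updated


def fix_dashboard_file(path: str) -> str:
    """Rewrite a dashboard file in place (e.g. dashboards/server/redis.json); return its uid."""
    with open(path) as f:
        dashboard = json.load(f)
    updated = assign_uid(dashboard)
    with open(path, "w") as f:
        json.dump(updated, f, indent=2)  # note: may reformat whitespace
        f.write("\n")
    return updated["uid"]
```

The placeholder fails `uuid.UUID` parsing (it contains non-hex characters), so it gets replaced; a generated UUIDv4 passes, so a second run is a no-op.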

There was no automated gate ensuring the .alloy config files are formatted and valid; formatting relied on contributors running the Alloy VS Code extension locally, and a malformed or invalid config could only fail at container boot in a cloud environment. Add an "Alloy Check" workflow that runs on pull requests (and pushes to main) and:

- runs `alloy fmt -t` on every config/*.alloy file, failing if any file is not formatted correctly
- runs `alloy validate` over the whole config directory so cross-file pipeline references are checked together

Both checks use the exact Alloy version read from the Dockerfile, so CI validates against the same binary that runs in production. The validate step passes dummy values for the sys.env() references; it inspects config structure and does not connect to any endpoint. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
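The two inputs the workflow needs (the file list and the dummy environment) can be sketched as follows. This is an illustration, not the workflow's actual script: it assumes `sys.env()` is always called with a double-quoted literal name, and it recurses into subdirectories the way `find` does by default. Like the `find` loop, it fails loudly rather than silently matching zero files. If a field's value is type-checked, a placeholder may need a more realistic shape.

```python
import re
from pathlib import Path

# Matches sys.env("NAME") calls; assumes the name is a double-quoted literal.
SYS_ENV_CALL = re.compile(r'sys\.env\(\s*"([^"]+)"\s*\)')


def alloy_files(config_dir: str) -> list[Path]:
    """All *.alloy files under config_dir; raise instead of matching zero files."""
    files = sorted(Path(config_dir).rglob("*.alloy"))
    if not files:
        raise FileNotFoundError(f"no .alloy files found under {config_dir}")
    return files


def dummy_env(files: list[Path]) -> dict[str, str]:
    """A placeholder value for every sys.env() name referenced in the given files."""
    names: set[str] = set()
    for path in files:
        names.update(SYS_ENV_CALL.findall(path.read_text()))
    return {name: f"dummy-{name.lower()}" for name in sorted(names)}
```

The resulting mapping would be merged into the environment of the `alloy validate` step (for example via `subprocess.run(..., env={**os.environ, **dummy})`), while each returned path gets its own `alloy fmt -t` call.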

The Alloy team documents `alloy fmt --test` and `alloy validate` as the CI contract (exit codes signal pass/fail) but ships no dedicated setup/fmt/validate GitHub Action, so we invoke the CLI ourselves. Running it through `docker run` required overriding the image entrypoint and a mounted-volume find loop; installing the released binary is simpler and faster. Download the alloy-linux-amd64 release matching the version pinned in the Dockerfile (keeping CI in parity with production), then run `alloy fmt -t` per file and `alloy validate` over the config directory. The format loop uses `find` rather than a bash globstar so it can never silently match zero files. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
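Keeping CI in parity with production comes down to reading one pin. A minimal sketch, assuming the Dockerfile pins Alloy with a line like `FROM grafana/alloy:v1.13.2` (adjust the pattern if the pin lives in an `ARG`), and assuming the release asset is published as `alloy-linux-amd64.zip` under the usual GitHub release-download path; verify both against the Dockerfile and the grafana/alloy release page.

```python
import re

# Assumed pin format: `FROM grafana/alloy:vX.Y.Z` (optionally followed by `AS name`).
# The lookahead rejects pre-release tags such as v1.13.2-rc.0 instead of truncating them.
ALLOY_PIN = re.compile(
    r"^FROM\s+grafana/alloy:(v\d+\.\d+\.\d+)(?=\s|$)",
    re.IGNORECASE | re.MULTILINE,
)


def pinned_alloy_version(dockerfile_text: str) -> str:
    """Return the Alloy version pinned in a Dockerfile, e.g. 'v1.13.2'."""
    match = ALLOY_PIN.search(dockerfile_text)
    if match is None:
        raise ValueError("no grafana/alloy:vX.Y.Z pin found in Dockerfile")
    return match.group(1)


def release_asset_url(version: str, asset: str = "alloy-linux-amd64.zip") -> str:
    """Download URL for a release asset (assumed GitHub release layout)."""
    return f"https://github.com/grafana/alloy/releases/download/{version}/{asset}"
```

Failing on an unrecognized pin, rather than falling back to "latest", keeps the guarantee that CI validates with the same binary that runs in production.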

mihoward21

think this all looks good. have you run it yet in testing or anything? fine to just "test in prod" if that's easier

r0ohafza · 2026-06-05T16:59:51Z

> think this all looks good. have you run it yet in testing or anything? fine to just "test in prod" if that's easier

want to test it on prod directly

r0ohafza and others added 4 commits June 4, 2026 12:29

r0ohafza marked this pull request as ready for review June 4, 2026 23:12

r0ohafza requested a review from mihoward21 June 4, 2026 23:12

mihoward21 approved these changes Jun 5, 2026


r0ohafza merged commit ce60d74 into main Jun 5, 2026
2 checks passed

r0ohafza deleted the nashville-v3 branch June 5, 2026 17:01

r0ohafza mentioned this pull request Jun 5, 2026

PE-7693: bump alloy to v1.12.2-ce60d74 0xSplits/releases#72

Merged

r0ohafza merged 4 commits into main from nashville-v3