Ethereum validator node monitoring powered by Hermes Agent.
Core principle: Prometheus Detects. Hermes Reasons.
Deterministic alerting for known failures, LLM-driven contextual analysis for everything else.
| Plane | Responsibility |
|---|---|
| Telemetry | Prometheus metrics + Loki logs from Ethereum clients |
| Detection | Alertmanager rules — deterministic, low-latency |
| Reasoning | Hermes Agent — context gathering, root-cause correlation, runbook selection |
| Action | Notifications (Discord + ntfy.sh) + approval-gated remediation (Phase 4+) |
Design docs: docs/ (architecture, webhook spec, runbook spec, notification design, alert set, memory & feedback, slashing protocol).
agentic-node-ops/
├── docs/ # Design specs (architecture, webhook, runbooks, etc.)
├── src/agentic_node_ops/ # Python source (Hermes integration & notifications)
│ ├── __init__.py
│ ├── types.py # Alert schemas, data models
│ ├── dispatcher.py # Alert routing and processing
│ ├── discord.py # Discord notification adapter
│ ├── ntfy.py # ntfy.sh notification adapter
│ ├── database.py # SQLite WAL wrapper (sole writer for incidents)
│ ├── processor.py # Async jsonl drain, payload build, dispatch, offset update
│ ├── runbooks.py # YAML runbook loading and alert_type matching
│ ├── context.py # Hermes context assembly (incidents, corrections, baselines)
│ ├── baselines.py # Nightly host baseline learning (p50/p95 percentiles)
│ ├── approval.py # Approval state machine and fatigue prevention
│ └── executor.py # Runbook diagnostics and action execution
├── tests/ # Test suite (161 tests passing, 90% coverage)
│ ├── test_database.py
│ ├── test_notifications.py
│ ├── test_processor.py
│ ├── test_runbooks.py
│ ├── test_context.py
│ ├── test_baselines.py
│ ├── test_approval.py
│ └── test_executor.py
├── runbooks/ # YAML runbooks (e.g., consensus_desync.yaml)
├── webhook-receiver/ # Standalone HTTP receiver (Phase 1 complete)
│ ├── src/webhook_receiver/ # aiohttp server, schema validation, dedup, storm protection, context fetch
│ └── tests/ # Webhook receiver test suite
└── docker-compose.yml # Phase 4 deployment config with nginx socket-proxy
| Phase | Scope | Status |
|---|---|---|
| 0 | Project scaffolding, packaging, CI/CD | ✅ Complete |
| 1 | Webhook receiver + alert normalization + dedup + storm protection + context fetch | ✅ Complete |
| 2 | Hermes integration + runbook matching + operator notifications | ✅ Complete |
| 3 | Memory layer + feedback loop + host fingerprints + context assembly | ✅ Complete |
| 4 | Tier 2 suggested actions + approval state machine + runbook executor + nginx socket-proxy migration | ✅ Complete |
| 5 | Runbook synthesis from historical incidents | Design pending |
| 6 | Semi-autonomous remediation with confidence scoring | Future |
- Python 3.12+
- Docker + Docker Compose (for deployment)
- eth-docker running on the target host
# Create virtual environment and install dependencies
python3 -m venv .venv
.venv/bin/pip install -e ".[dev]"
# Run tests
.venv/bin/pytest tests/ -v
# Run a single test module
.venv/bin/pytest tests/test_notifications.py -v# Build a wheel distribution
python3 -m pip install build
python3 -m build
# Output: dist/agentic_node_ops-0.1.0-py3-none-any.whl# Bump version in pyproject.toml, then:
python3 -m build
twine upload dist/*See implementation phases in docs/hermes-implementation-plan.md.
- Review docs/architecture.md for system design and docs/ for all design documents
- Deploy monitoring stack alongside eth-docker (Prometheus, Loki, Alertmanager)
- Configure Alertmanager to POST to
http://webhook-receiver:8090/webhook - Deploy webhook receiver (Phase 1) + hermes-agent integration (Phase 2)
- Import initial runbooks into
runbooks/
- Hermes is the reasoning layer, not the monitoring system
- Detection stays deterministic (Prometheus rules, Loki ruler)
- Slashing risk is always priority 0, never deferred, never autonomously remediated
- All approvals have explicit timeouts — silence means no action
- Docker socket access uses the socket-proxy pattern
- Single-writer boundary: webhook receiver writes
alerts.jsonlonly; hermes-agent writes SQLite - Discord + ntfy.sh two-tier notification routing (Discord for all severities, ntfy for critical/slashing)
MIT