feat: --healthcheck for monitoring and systemd watchdogs by mehowz · Pull Request #13 · zenon-network/znn_controller_dart

mehowz · 2026-04-23T23:37:09Z

Summary

Adds a `Healthcheck` menu option + `--healthcheck` CLI flag that
runs diagnostic checks against the local node and exits with a
status code suitable for monitoring tools. Until now the only
diagnostic was `Status`, which is human-readable and interactive-only,
so scripts and monitors had nothing to call.

Stacks on #11 and #12 — please review / merge those first. The
diff this PR adds over #12 is 134 insertions across one commit
(`fe9d699`).

Output format

Healthcheck
-----------
[PASS] service        go-zenon.service is active
[PASS] binary         znnd version v0.0.8-abcdef
[PASS] rpc            reachable at ws://127.0.0.1:35998
[PASS] sync           height=12345678 lag=3s (threshold: 15s healthy, 60s warn)
[PASS] pillar         mypillar1 (z1qxemdeddedxt0ken...), produced=42 this epoch
-----------
HEALTHY

Thresholds: sync lag ≤15s is `[PASS]`, ≤60s is `[WARN]`, >60s is
`[FAIL]`. These match the 10s target block time with headroom for
short-term jitter. Exit `0` if no `FAIL`, `1` otherwise. `WARN`
conditions don't fail the overall check; the operator reads the
output and decides.
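
The threshold and exit-code rules above can be sketched as follows. This is a minimal illustration in Python, not the PR's Dart code; the names `Level`, `classify_lag` and `exit_code` are hypothetical.

```python
from enum import Enum


class Level(Enum):
    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"


# Thresholds taken from the PR description (seconds of wall-clock lag).
HEALTHY_LAG_S = 15
WARN_LAG_S = 60


def classify_lag(lag_seconds):
    """Map sync lag to PASS / WARN / FAIL using inclusive upper bounds."""
    if lag_seconds <= HEALTHY_LAG_S:
        return Level.PASS
    if lag_seconds <= WARN_LAG_S:
        return Level.WARN
    return Level.FAIL


def exit_code(levels):
    """Exit 1 only if at least one check FAILed; WARN alone still exits 0."""
    return 1 if any(level is Level.FAIL for level in levels) else 0
```

With these definitions, a run whose checks are `[PASS, WARN]` exits `0`, while any single `FAIL` makes the run exit `1`.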

Checks

Service — `systemctl is-active go-zenon.service`. `FAIL` if
inactive.
Binary — `/usr/local/bin/znnd` exists and `znnd version`
returns. `FAIL` if missing.
RPC — `ws://127.0.0.1:$defaultWsPort` connects within 10s
timeout. `FAIL` if unreachable. Skipped if service is inactive
(would double-report).
Sync — `getFrontierMomentum()` with 10s timeout, compute
wall-clock lag vs `momentum.timestamp`. `FAIL` if timeout,
`PASS`/`WARN`/`FAIL` by lag threshold.
Pillar — only runs if producer config exists. Enumerates
`embedded.pillar.getAll` until the producer address is found or
pagination exhausts. `PASS` if registered and produced >0 this
epoch; `WARN` if registered but produced=0; `WARN` if absent from
the registry.
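
The pillar lookup described above might look roughly like this. It is a sketch in Python under stated assumptions: `fetch_page` stands in for a paginated call such as `embedded.pillar.getAll`, and the dictionary keys `producer` and `produced` are hypothetical, not the SDK's field names.

```python
def find_pillar(fetch_page, producer_address, page_size=50):
    """Scan pages until the producer address is found or pages run out.

    fetch_page(page_index, page_size) is a hypothetical callable returning
    a list of dicts with the assumed keys "producer" and "produced".
    """
    page = 0
    while True:
        entries = fetch_page(page, page_size)
        for entry in entries:
            if entry["producer"] == producer_address:
                return entry
        if len(entries) < page_size:
            return None  # short or empty page: pagination is exhausted
        page += 1


def classify_pillar(entry):
    """PASS if producing this epoch, WARN if registered but idle or absent."""
    if entry is None:
        return "WARN", "producer address not found in pillar registry"
    if entry["produced"] > 0:
        return "PASS", "produced=%d this epoch" % entry["produced"]
    return "WARN", "registered but produced=0 this epoch"
```

Stopping on a short page keeps the scan bounded even when the total number of pillars is an exact multiple of the page size, since the next page then comes back empty.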

All RPC calls carry explicit timeouts so a stuck node doesn't cause
the healthcheck itself to hang — which is exactly the failure mode
the check is meant to detect.
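
One way to express "every RPC call carries a timeout" is shown below. This is a Python sketch using `asyncio.wait_for`, not the PR's implementation; `check_with_timeout` and the `probe` callable are hypothetical names.

```python
import asyncio

RPC_TIMEOUT_S = 10.0  # the PR description uses a 10s timeout for RPC and sync


async def check_with_timeout(name, probe, timeout_s=RPC_TIMEOUT_S):
    """Await probe(); report a timeout as FAIL instead of hanging forever."""
    try:
        detail = await asyncio.wait_for(probe(), timeout=timeout_s)
        return ("PASS", name, detail)
    except asyncio.TimeoutError:
        return ("FAIL", name, "no response within %ss" % timeout_s)
    except OSError as exc:
        # Connection refused and similar socket errors.
        return ("FAIL", name, "unreachable: %s" % exc)
```

A stuck node then produces a `FAIL` line after at most `timeout_s` seconds, so the healthcheck's own runtime stays bounded.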

Use cases

Designed to drop into existing monitoring plumbing:

systemd watchdog — `WatchdogSec=60` plus a timer unit running
`znn-controller --healthcheck`
cron + mail-on-nonzero — for passive monitoring
Nagios / Icinga — `check_exit` wrapper consumes the exit code
Ansible / Terraform — post-deploy verification
Container liveness probe — in a containerised pillar deployment

Test plan

Docker (Ubuntu 24.04, no running znnd):

`--healthcheck` exits 1, prints `UNHEALTHY` with
`[FAIL] service` + `[FAIL] binary`
Menu item `Healthcheck` selectable in interactive mode
Output works with ANSI stripped (test script verified clean
grep-ability)
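
As an illustration of the grep-ability point, a monitoring script could strip ANSI colour codes and parse the tagged lines like this. This is a Python sketch based only on the output format shown above; the function name and the sample failure message in the usage note are hypothetical.

```python
import re

# Matches common ANSI CSI sequences such as colour codes.
ANSI_RE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")
# "[TAG] name   free-form detail"
LINE_RE = re.compile(r"^\[(PASS|WARN|FAIL)\]\s+(\S+)\s+(.*)$")


def parse_healthcheck(output):
    """Return a list of (tag, check_name, detail) from healthcheck output."""
    results = []
    for raw in output.splitlines():
        line = ANSI_RE.sub("", raw).strip()
        match = LINE_RE.match(line)
        if match:
            results.append(match.groups())
    return results
```

Lines that carry no tag, such as the `-----------` separators and the final `HEALTHY` / `UNHEALTHY` verdict, are ignored by the parser.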

A running-znnd integration test would need a real node spun up; not
in scope for the PR itself but straightforward to add to follow-up CI.

Scope

No changes to existing flows (`Status`, `Deploy`, services) — this
PR is strictly additive.
Uses `znn_sdk_dart` types already imported for `Status`
(`Zenon`, `PillarInfo`, `PillarInfoList`, `Momentum`). No new
dependency.

The Deploy flow calls apt, wget, tar, git, and go build via Process.runSync with no timeout, no DEBIAN_FRONTEND=noninteractive, no streamed stdout. On Ubuntu 24.04 the first `apt -y install linux-kernel-headers` call can block indefinitely on a dpkg lock (unattended-upgrades) or an interactive prompt, and the operator sees no output between "Git installation detected" and the hang. Reproduced on a fresh Hetzner 24.04 host; only workaround was to pre-install `golang-go build-essential` manually and rebuild znnd from source.

This change:

- Adds `_runStreaming` that uses `Process.start`, streams stdout/stderr live, accepts a timeout, and escalates SIGTERM→SIGKILL if exceeded
- Adds `_isDpkgLocked` precheck via `fuser /var/lib/dpkg/lock-frontend` so the deploy aborts with an actionable message instead of blocking
- Adds `_hasCommand` / `_hasDebPackage` helpers so already-installed tools (git, build-essential, linux-libc-dev, wget, Go) are skipped cleanly; a host with manual `apt install golang-go build-essential` now sails through prereqs instead of re-running every apt step
- Routes all apt invocations through `_aptInstall` with DEBIAN_FRONTEND=noninteractive + confold/confdef options so dpkg never waits on config-file prompts
- Replaces `linux-kernel-headers` (transitional name, may not exist on Ubuntu 22.04+) with `linux-libc-dev`, accepting either package in the skip check for legacy compatibility
- Makes `_buildFromSource` pick `go` from PATH if /usr/local/go is absent, matching the detection in the prereq step
- 15-minute timeout on `go build` (slow hosts), 10-min on apt, 5-min on git clone / wget, 2-min on tar extract

Caller in main() awaits both async functions; both were bool, now Future<bool>. No other call sites.
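
The streaming-with-timeout idea behind `_runStreaming` can be sketched as follows. This is a Python approximation, not the Dart code from the commit; `run_streaming` and `apt_install_command` are hypothetical names, and the use of `apt-get` with `--force-confdef` / `--force-confold` is an assumption about how the "confold/confdef options" are passed.

```python
import os
import subprocess
import threading


def run_streaming(cmd, timeout_s, grace_s=5.0, env=None):
    """Run cmd, echo its output as it arrives, and enforce a timeout.

    Returns True only if the process exited with code 0 before the deadline.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout
        text=True,
        env=env,
    )

    def pump():
        # Forward each line immediately so the operator sees progress.
        for line in proc.stdout:
            print(line, end="", flush=True)

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()

    timed_out = False
    try:
        proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        timed_out = True
        proc.terminate()  # SIGTERM first
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            proc.kill()  # then SIGKILL
            proc.wait()
    reader.join(timeout=1.0)
    return (not timed_out) and proc.returncode == 0


def apt_install_command(packages):
    """Build a non-interactive apt-get invocation as (command, environment)."""
    env = {**os.environ, "DEBIAN_FRONTEND": "noninteractive"}
    cmd = [
        "apt-get", "-y",
        "-o", "Dpkg::Options::=--force-confdef",
        "-o", "Dpkg::Options::=--force-confold",
        "install", *packages,
    ]
    return cmd, env
```

A caller would combine the two, for example `cmd, env = apt_install_command(["git"])` followed by `run_streaming(cmd, timeout_s=600, env=env)` for the 10-minute apt limit.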

…r bound

- goLinuxDlUrl bumped to go1.22.12 with verified go.dev/dl SHA256. go-zenon's go.mod still says `go 1.20`, so 1.20.x would technically compile it, but Go only maintains security updates for the two most recent minor versions. 1.22 is the current LTS floor.
- pubspec.yaml sdk constraint relaxed from '>=2.14.0 <3.0.0' to '>=2.14.0 <4.0.0'. The existing constraint prevents `dart pub get` on any Dart 3.x release, which is what `dart-lang/setup-dart@v1.5.0` (used by the release workflow) installs by default. dcli 3.0.2 targets Dart 3.0 exactly; keep that implicit via pubspec.lock.

Verified in Docker (Ubuntu 24.04 target):

- target-prepped (golang-go + build-essential pre-installed): controller detects each prereq, skips all apt calls, advances to `git clone` + `go build` with streamed output. Reproduces the zenonorg5 operator's working scenario.
- target-locked (flock holding /var/lib/dpkg/lock-frontend): controller aborts in ~2s with the actionable lock-contention error instead of hanging. No apt call issued.

Today the Deploy path assumes apt. On a CentOS / RHEL / Arch / NixOS host the first apt call fails with a cryptic error and the operator has to guess what went wrong. Parse /etc/os-release up front and abort with a one-shot actionable message: which distros are supported, what prereqs to install manually, and that Deploy will transparently skip the install step on a pre-prepped host thanks to the _hasCommand / _hasDebPackage detection already landed. Recognizes derivatives via ID_LIKE (e.g., Linux Mint, Raspbian) not just the narrow ID==debian/ubuntu check.
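
The distro check described in this commit could be sketched like this. It is a Python illustration with hypothetical function names; the commit's actual parsing code is not shown in the PR page.

```python
def parse_os_release(text):
    """Parse the KEY=value lines of an os-release file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def is_debian_family(fields):
    """True if ID or any ID_LIKE entry names debian or ubuntu."""
    ids = {fields.get("ID", "").lower()}
    ids.update(fields.get("ID_LIKE", "").lower().split())
    return bool(ids & {"debian", "ubuntu"})
```

For example, a file containing `ID=linuxmint` and `ID_LIKE="ubuntu debian"` is accepted through `ID_LIKE`, while `ID=arch` with no `ID_LIKE` is rejected, so Deploy can abort early with its supported-distro message.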

Two related reproducibility fixes:

- Clone pins to `--branch znnRefTag --depth 1`. Previously Deploy pulled go-zenon master at whatever commit happened to be there at deploy time, so two operators following the same runbook a day apart could land different znnd binaries. Tag pin makes the deploy output deterministic across operators and across time. Bumped in lockstep with controller releases when a new go-zenon tag is the recommended mainnet version (v0.0.8 for 0.0.5 of this tool).
- Existing clone detection: if /root/go-zenon exists and its origin remote matches znnGithubUrl, fetch + hard-reset to the pinned tag instead of deleting and re-downloading ~200MB of modules. Foreign directories (different remote, scratch files) still get nuked and fresh-cloned; only a legit prior clone is reused. Accelerates re-deploys from several minutes to seconds when the tag hasn't changed.
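
The reuse-or-reclone decision can be sketched as a function that returns the commands to run. This is a Python illustration only; `plan_checkout` is a hypothetical name, the exact git invocations in the commit may differ, and the URL in the usage note is a placeholder rather than the real value of znnGithubUrl.

```python
def plan_checkout(dest_exists, origin_url, expected_url, tag, dest):
    """Return the command lists needed to put dest at the pinned tag.

    origin_url is the existing clone's origin remote, or None if dest is
    missing or is not a git checkout.
    """
    if dest_exists and origin_url == expected_url:
        # Legitimate prior clone: update in place and keep downloaded state.
        return [
            ["git", "-C", dest, "fetch", "--depth", "1", "origin", "tag", tag],
            ["git", "-C", dest, "reset", "--hard", tag],
        ]
    commands = []
    if dest_exists:
        # Foreign directory: remove it before cloning fresh.
        commands.append(["rm", "-rf", dest])
    commands.append(
        ["git", "clone", "--branch", tag, "--depth", "1", expected_url, dest]
    )
    return commands
```

Called as `plan_checkout(True, url, url, "v0.0.8", "/root/go-zenon")` with a placeholder `url`, it yields the fetch and hard-reset pair; with a mismatching remote it yields removal followed by a tag-pinned shallow clone.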

Adds a small argparse block at the top of main() so the controller can be driven from Ansible / Terraform / bash scripts without TTY.

Flags:

- --deploy / --status / --start-service / --stop-service / --resync: Jump straight to the matching menu action.
- -y / --yes: Auto-confirm every prompt (echoes what was answered so the run log shows the choice).
- -h / --help, -v / --version: Info commands, no side effects.

Semantics:

- All confirm() calls in Deploy / Resync now go through _confirmOrDefault, returning the default in --yes mode.
- Existing-keystore password prompt reads ZNN_KEYSTORE_PASSWORD when --yes is set; aborts with an actionable error if missing rather than blocking on ask(). Retries are disabled in --yes mode to fail fast.
- Fresh-keystore generation (the common first-deploy path) is already unattended-friendly: the existing RandomStringGenerator fallback produces a password without any prompt.

Conflict detection: specifying two action flags (e.g., --deploy --status) exits 2 with a clear error instead of silently picking one.
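
The flag handling and conflict rule can be sketched as below. This is a Python illustration rather than the controller's Dart code; `parse_cli` and `confirm_or_default` are hypothetical names, and returning the exit code instead of exiting is a choice made here to keep the sketch testable.

```python
import argparse

ACTIONS = ("deploy", "status", "start-service", "stop-service", "resync")


def parse_cli(argv):
    """Return (exit_code, action, yes); exit_code 2 means conflicting actions."""
    parser = argparse.ArgumentParser(prog="znn-controller")
    for action in ACTIONS:
        parser.add_argument("--" + action, action="store_true")
    parser.add_argument("-y", "--yes", action="store_true")
    args = parser.parse_args(argv)

    # argparse stores "--start-service" under the attribute "start_service".
    chosen = [a for a in ACTIONS if getattr(args, a.replace("-", "_"))]
    if len(chosen) > 1:
        flags = ", ".join("--" + a for a in chosen)
        print("error: conflicting action flags: " + flags)
        return 2, None, args.yes
    return 0, (chosen[0] if chosen else None), args.yes


def confirm_or_default(prompt, default, yes_mode, ask=input):
    """In --yes mode, echo and return the default instead of prompting."""
    if yes_mode:
        print("%s -> %s (auto)" % (prompt, "yes" if default else "no"))
        return default
    answer = ask(prompt + " [y/n] ").strip().lower()
    if not answer:
        return default
    return answer in ("y", "yes")
```

Here `parse_cli(["--deploy", "--status"])` reports the conflict and returns exit code `2`, and `confirm_or_default("Proceed?", True, yes_mode=True)` logs the auto-answer and returns `True` without reading from a TTY.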

Adds a Healthcheck menu option + --healthcheck CLI flag that reports one line per check with [PASS] / [WARN] / [FAIL] tags and exits 0 if every check passed (or only WARN'd), 1 if any FAIL occurred.

Shape suitable for:

- systemd watchdog (WatchdogSec= + ExecReload=... --healthcheck)
- cron + mail-on-nonzero for passive monitoring
- Nagios / Icinga check_exit wrapper
- Ansible / Terraform post-deploy verification

Checks:

1. go-zenon.service active via systemctl is-active
2. /usr/local/bin/znnd binary present + reports a version
3. Local RPC reachable at ws://127.0.0.1:35998 (10s timeout)
4. Frontier momentum fetched; wall-clock lag ≤ 15s = PASS, ≤ 60s = WARN, > 60s = FAIL. Thresholds match the 10s target block time with headroom for short-term jitter.
5. (Only if producer config exists) pillar registered with the configured producer address; producedMomentums > 0 this epoch PASSes, 0 is WARN (new pillar or one that just missed).

All RPC calls carry explicit timeouts so a stuck node doesn't cause the healthcheck itself to hang, which is exactly the failure mode the check is meant to detect.

mehowz added 6 commits April 23, 2026 19:03

mehowz wants to merge 6 commits into zenon-network:master from ZenonOrg:feat/deploy-healthcheck
