feat: --healthcheck for monitoring and systemd watchdogs #13

Open
mehowz wants to merge 6 commits into zenon-network:master from ZenonOrg:feat/deploy-healthcheck

@mehowz mehowz commented Apr 23, 2026

Summary

Adds a `Healthcheck` menu option + `--healthcheck` CLI flag that
runs diagnostic checks against the local node and exits with a
status code suitable for monitoring tools. Until now the only
diagnostic was `Status`, which is human-readable and
interactive-only, so monitoring tools had nothing to consume.

Stacks on #11 and #12 — please review / merge those first. The
diff this PR adds over #12 is 134 insertions across one commit
(`fe9d699`).

Output format

Healthcheck
-----------
[PASS] service        go-zenon.service is active
[PASS] binary         znnd version v0.0.8-abcdef
[PASS] rpc            reachable at ws://127.0.0.1:35998
[PASS] sync           height=12345678 lag=3s (threshold: 15s healthy, 60s warn)
[PASS] pillar         mypillar1 (z1qxemdeddedxt0ken...), produced=42 this epoch
-----------
HEALTHY

Thresholds: sync lag ≤15s is `[PASS]`, ≤60s is `[WARN]`, >60s is
`[FAIL]`. These match the 10s target block time with headroom for
short-term jitter. The process exits `0` if no check reported `FAIL`
and `1` otherwise. `WARN` conditions don't fail the overall check;
the operator reads the output and decides.
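
The threshold and exit-code rules above can be sketched in a few lines. This is an illustrative Python sketch, not the controller's Dart code; the names `classify_lag`, `exit_code`, and `Result` are hypothetical, and only the 15s / 60s thresholds and the 0 / 1 exit codes come from this PR.

```python
from __future__ import annotations

from enum import Enum

PASS_MAX_LAG_S = 15  # from the PR: lag <= 15s is PASS
WARN_MAX_LAG_S = 60  # from the PR: lag <= 60s is WARN, anything above is FAIL


class Result(Enum):
    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"


def classify_lag(lag_seconds: float) -> Result:
    """Map wall-clock lag behind the frontier momentum to a check result."""
    if lag_seconds <= PASS_MAX_LAG_S:
        return Result.PASS
    if lag_seconds <= WARN_MAX_LAG_S:
        return Result.WARN
    return Result.FAIL


def exit_code(results: list[Result]) -> int:
    """Return 1 if any check FAILed; WARN on its own still yields 0."""
    return 1 if Result.FAIL in results else 0
```

Both boundaries are inclusive, so a lag of exactly 15s still passes and exactly 60s still only warns.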

Checks

  1. Service — `systemctl is-active go-zenon.service`. `FAIL` if
    inactive.
  2. Binary — `/usr/local/bin/znnd` exists and `znnd version`
    returns. `FAIL` if missing.
  3. RPC — `ws://127.0.0.1:$defaultWsPort` connects within 10s
    timeout. `FAIL` if unreachable. Skipped if service is inactive
    (would double-report).
  4. Sync — `getFrontierMomentum()` with 10s timeout, compute
    wall-clock lag vs `momentum.timestamp`. `FAIL` if timeout,
    `PASS`/`WARN`/`FAIL` by lag threshold.
  5. Pillar — only runs if producer config exists. Enumerates
    `embedded.pillar.getAll` until the producer address is found or
    pagination is exhausted. `PASS` if registered and produced >0 this
    epoch; `WARN` if registered but produced=0; `WARN` if the address
    is absent from the registry.
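
The pillar lookup in check 5 amounts to paging through the registry until the producer address turns up or a short page signals the end. A minimal Python sketch of that loop follows, assuming a hypothetical `fetch_page(page_index, page_size)` callable standing in for `embedded.pillar.getAll`, and hypothetical record keys `producerAddress` and `producedMomentums`; the real SDK types and field names may differ.

```python
from __future__ import annotations

from typing import Callable, Optional

# Hypothetical stand-in for one page of embedded.pillar.getAll results.
FetchPage = Callable[[int, int], list]


def find_pillar(fetch_page: FetchPage, producer_address: str,
                page_size: int = 50) -> Optional[dict]:
    """Page through the registry until the producer address is found."""
    page_index = 0
    while True:
        page = fetch_page(page_index, page_size)
        for pillar in page:
            if pillar["producerAddress"] == producer_address:
                return pillar
        if len(page) < page_size:
            return None  # short (or empty) page: pagination is exhausted
        page_index += 1


def classify_pillar(pillar: Optional[dict]) -> str:
    """PASS only when registered and producing; every other case is WARN."""
    if pillar is None:
        return "WARN"  # address not present in the registry
    return "PASS" if pillar["producedMomentums"] > 0 else "WARN"
```

Note that this check never returns `FAIL` in the sketch, matching the description above: a missing or idle pillar is surfaced to the operator without flipping the exit code.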

All RPC calls carry explicit timeouts so a stuck node doesn't cause
the healthcheck itself to hang — which is exactly the failure mode
the check is meant to detect.
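
To illustrate the timeout principle, here is a small Python sketch of a bounded reachability probe. It is a simplification: it only opens a plain TCP connection to the RPC port rather than performing the WebSocket handshake the controller uses, and the function name `check_rpc_port` is hypothetical. The 10s default and port 35998 mirror the values quoted in this PR.

```python
from __future__ import annotations

import socket


def check_rpc_port(host: str = "127.0.0.1", port: int = 35998,
                   timeout_s: float = 10.0) -> tuple[str, str]:
    """Return (status, message); never blocks longer than timeout_s."""
    try:
        # create_connection raises OSError on refusal and on timeout.
        with socket.create_connection((host, port), timeout=timeout_s):
            return ("PASS", f"reachable at {host}:{port}")
    except OSError as exc:
        return ("FAIL", f"unreachable at {host}:{port}: {exc}")
```

A refused connection and a hung listener both end up as `FAIL`, so the caller does not need to distinguish them to compute the exit code.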

Use cases

Designed to drop into existing monitoring plumbing:

  • systemd watchdog — `WatchdogSec=60` plus a timer unit running
    `znn-controller --healthcheck`
  • cron + mail-on-nonzero — for passive monitoring
  • Nagios / Icinga — `check_exit` wrapper consumes the exit code
  • Ansible / Terraform — post-deploy verification
  • Container liveness probe — in a containerised pillar deployment

Test plan

Docker (Ubuntu 24.04, no running znnd):

  • `--healthcheck` exits 1, prints `UNHEALTHY` with
    `[FAIL] service` + `[FAIL] binary`
  • Menu item `Healthcheck` selectable in interactive mode
  • Output works with ANSI stripped (test script verified clean
    grep-ability)
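
For tooling that wants more than the exit code, the one-line-per-check format can be parsed after stripping colour codes. The following Python sketch assumes the line shape shown under "Output format" (`[TAG] name  message`); the helper name `parse_healthcheck` and both regular expressions are illustrative and not part of the controller.

```python
from __future__ import annotations

import re

# Matches common ANSI CSI sequences such as colour codes.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")
# "[PASS] service        go-zenon.service is active"
CHECK_LINE = re.compile(r"^\[(PASS|WARN|FAIL)\]\s+(\S+)\s+(.*)$")


def parse_healthcheck(output: str) -> list[tuple[str, str, str]]:
    """Return (status, check name, message) for every check line."""
    rows = []
    for raw in output.splitlines():
        line = ANSI_ESCAPE.sub("", raw).strip()
        match = CHECK_LINE.match(line)
        if match:
            rows.append(match.groups())
    return rows
```

Header, separator, and summary lines (`Healthcheck`, `-----------`, `HEALTHY`) do not match the pattern and are skipped.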

A running-znnd integration test would need a real node spun up; not
in scope for the PR itself but straightforward to add to follow-up CI.

Scope

  • No changes to existing flows (`Status`, `Deploy`, services) — this
    PR is strictly additive.
  • Uses `znn_sdk_dart` types already imported for `Status`
    (`Zenon`, `PillarInfo`, `PillarInfoList`, `Momentum`). No new
    dependency.

mehowz added 6 commits April 23, 2026 19:03
The Deploy flow calls apt, wget, tar, git, and go build via
Process.runSync with no timeout, no DEBIAN_FRONTEND=noninteractive,
no streamed stdout. On Ubuntu 24.04 the first `apt -y install
linux-kernel-headers` call can block indefinitely on a dpkg lock
(unattended-upgrades) or an interactive prompt, and the operator
sees no output between "Git installation detected" and the hang.
Reproduced on a fresh Hetzner 24.04 host; only workaround was to
pre-install `golang-go build-essential` manually and rebuild znnd
from source.

This change:

- Adds `_runStreaming` that uses `Process.start`, streams stdout/
  stderr live, accepts a timeout, and escalates SIGTERM→SIGKILL if
  the timeout is exceeded
- Adds `_isDpkgLocked` precheck via `fuser /var/lib/dpkg/lock-frontend`
  so the deploy aborts with an actionable message instead of blocking
- Adds `_hasCommand` / `_hasDebPackage` helpers so already-installed
  tools (git, build-essential, linux-libc-dev, wget, Go) are skipped
  cleanly — a host with manual `apt install golang-go build-essential`
  now sails through prereqs instead of re-running every apt step
- Routes all apt invocations through `_aptInstall` with
  DEBIAN_FRONTEND=noninteractive + confold/confdef options so dpkg
  never waits on config-file prompts
- Replaces `linux-kernel-headers` (transitional name, may not exist
  on Ubuntu 22.04+) with `linux-libc-dev`, accepting either package
  in the skip check for legacy compatibility
- Makes `_buildFromSource` pick `go` from PATH if /usr/local/go is
  absent, matching the detection in the prereq step
- 15-minute timeout on `go build` (slow hosts), 10-min on apt,
  5-min on git clone / wget, 2-min on tar extract

Caller in main() awaits both async functions — both were bool, now
Future<bool>. No other call sites.
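
The streaming-with-timeout behaviour can be sketched in Python for readers unfamiliar with the pattern. This is not the Dart `_runStreaming` itself: the name `run_streaming`, the `grace_s` parameter, and the choice to merge stderr into stdout are assumptions made to keep the example short.

```python
import subprocess
import sys
import threading


def run_streaming(cmd, timeout_s, grace_s=5.0):
    """Run cmd, echo its output live, and return True on exit code 0.

    If timeout_s elapses, send SIGTERM, wait grace_s, then SIGKILL.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )

    def pump():
        # Forward each line as soon as the child emits it.
        for line in proc.stdout:
            sys.stdout.write(line)
            sys.stdout.flush()

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()
    try:
        proc.wait(timeout=timeout_s)
        ok = proc.returncode == 0
    except subprocess.TimeoutExpired:
        proc.terminate()  # polite request first
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            proc.kill()  # escalate if the child ignores SIGTERM
            proc.wait()
        ok = False
    reader.join(timeout=1.0)
    return ok
```

Reading output on a separate thread keeps the timeout honest: the main thread waits on the process, not on a pipe that a hung child may never close.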

…r bound

- goLinuxDlUrl bumped to go1.22.12 with verified go.dev/dl SHA256.
  go-zenon's go.mod still says `go 1.20`, so 1.20.x would technically
  compile it, but Go only maintains security updates for the two most
  recent minor versions. 1.22 is the current LTS floor.
- pubspec.yaml sdk constraint relaxed from '>=2.14.0 <3.0.0' to
  '>=2.14.0 <4.0.0'. The existing constraint prevents `dart pub get`
  on any Dart 3.x release, which is what `dart-lang/setup-dart@v1.5.0`
  (used by the release workflow) installs by default. dcli 3.0.2
  targets Dart 3.0 exactly; keep that implicit via pubspec.lock.

Verified in Docker (Ubuntu 24.04 target):
- target-prepped (golang-go + build-essential pre-installed): controller
  detects each prereq, skips all apt calls, advances to `git clone` +
  `go build` with streamed output. Reproduces the zenonorg5 operator's
  working scenario.
- target-locked (flock holding /var/lib/dpkg/lock-frontend): controller
  aborts in ~2s with the actionable lock-contention error instead of
  hanging. No apt call issued.

Today the Deploy path assumes apt. On a CentOS / RHEL / Arch / NixOS
host the first apt call fails with a cryptic error and the operator
has to guess what went wrong. Parse /etc/os-release up front and
abort with a one-shot actionable message: which distros are
supported, what prereqs to install manually, and that Deploy will
transparently skip the install step on a pre-prepped host thanks to
the _hasCommand / _hasDebPackage detection already landed.

Recognizes derivatives via ID_LIKE (e.g., Linux Mint, Raspbian) not
just the narrow ID==debian/ubuntu check.
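
A rough Python sketch of the distro detection described above, assuming the standard `KEY=value` layout of `/etc/os-release`. The function names and the `SUPPORTED_IDS` set are illustrative; the controller's actual list of accepted IDs may be broader.

```python
from __future__ import annotations

SUPPORTED_IDS = {"debian", "ubuntu"}  # assumed apt-based families


def parse_os_release(text: str) -> dict[str, str]:
    """Parse os-release content into a dict, dropping surrounding quotes."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip().strip("\"'")
    return fields


def is_apt_based(fields: dict[str, str]) -> bool:
    """True if ID or any space-separated ID_LIKE entry is a supported family."""
    ids = {fields.get("ID", "").lower()}
    ids.update(fields.get("ID_LIKE", "").lower().split())
    return bool(ids & SUPPORTED_IDS)
```

Checking `ID_LIKE` is what lets derivatives through: Linux Mint reports `ID=linuxmint` but lists `ubuntu debian` as its parents.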

Two related reproducibility fixes:

- Clone pins to `--branch znnRefTag --depth 1`. Previously Deploy
  pulled go-zenon master at whatever commit happened to be there at
  deploy time, so two operators following the same runbook a day
  apart could land different znnd binaries. Tag pin makes the deploy
  output deterministic across operators and across time. Bumped in
  lockstep with controller releases when a new go-zenon tag is the
  recommended mainnet version (v0.0.8 for 0.0.5 of this tool).
- Existing clone detection: if /root/go-zenon exists and its origin
  remote matches znnGithubUrl, fetch + hard-reset to the pinned tag
  instead of deleting and re-downloading ~200MB of modules. Foreign
  directories (different remote, scratch files) still get nuked and
  fresh-cloned — only a legit prior clone is reused. Accelerates
  re-deploys from several minutes to seconds when the tag hasn't
  changed.
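
The reuse-or-reclone decision can be expressed as a pure planning function. The Python sketch below only builds the command lists and runs nothing; `plan_checkout` and its parameters are hypothetical names, and the exact git invocations the controller issues may differ from the ones shown.

```python
from __future__ import annotations

from typing import Optional


def plan_checkout(repo_dir: str, dir_exists: bool, origin_url: Optional[str],
                  expected_url: str, tag: str) -> list[list[str]]:
    """Return the shell commands needed to land repo_dir on the pinned tag."""
    if dir_exists and origin_url == expected_url:
        # Legitimate prior clone: refresh the tag and hard-reset onto it.
        return [
            ["git", "-C", repo_dir, "fetch", "--depth", "1", "origin", "tag", tag],
            ["git", "-C", repo_dir, "reset", "--hard", f"refs/tags/{tag}"],
        ]
    commands = []
    if dir_exists:
        # Foreign directory (different remote or not a repo): start over.
        commands.append(["rm", "-rf", repo_dir])
    commands.append(
        ["git", "clone", "--branch", tag, "--depth", "1", expected_url, repo_dir]
    )
    return commands
```

Keeping the plan separate from execution makes the three cases (reuse, replace, fresh clone) easy to unit-test without touching the filesystem.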

Adds a small argparse block at the top of main() so the controller
can be driven from Ansible / Terraform / bash scripts without TTY.

Flags:
  --deploy / --status / --start-service / --stop-service / --resync
      Jump straight to the matching menu action.
  -y / --yes
      Auto-confirm every prompt (echoes what was answered so the run
      log shows the choice).
  -h / --help, -v / --version
      Info commands, no side effects.

Semantics:
  - All confirm() calls in Deploy / Resync now go through
    _confirmOrDefault, returning the default in --yes mode.
  - Existing-keystore password prompt reads ZNN_KEYSTORE_PASSWORD
    when --yes is set; aborts with an actionable error if missing
    rather than blocking on ask(). Retries are disabled in --yes mode
    to fail fast.
  - Fresh-keystore generation (the common first-deploy path) is
    already unattended-friendly — the existing RandomStringGenerator
    fallback produces a password without any prompt.

Conflict detection: specifying two action flags (e.g., --deploy
--status) exits 2 with a clear error instead of silently picking one.
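
The flag handling and conflict rule can be sketched with Python's standard `argparse`. This is an analogy, not the controller's Dart parser: the `parse_args` helper is hypothetical, `-v / --version` is omitted, and `-h / --help` is supplied by `argparse` itself. The action names and the exit status of 2 are taken from the description above.

```python
import argparse

# Action flags listed in this commit message.
ACTIONS = ["deploy", "status", "start-service", "stop-service", "resync"]


def parse_args(argv):
    """Return (action or None, yes). Exits with status 2 on conflicts."""
    parser = argparse.ArgumentParser(prog="znn-controller")
    for name in ACTIONS:
        parser.add_argument(f"--{name}", action="store_true")
    parser.add_argument("-y", "--yes", action="store_true",
                        help="auto-confirm every prompt")
    args = parser.parse_args(argv)

    chosen = [n for n in ACTIONS if getattr(args, n.replace("-", "_"))]
    if len(chosen) > 1:
        flags = ", ".join(f"--{n}" for n in chosen)
        # ArgumentParser.error prints the message and exits with status 2.
        parser.error(f"conflicting action flags: {flags}")
    return (chosen[0] if chosen else None, args.yes)
```

With no action flag the function returns `None`, which a caller could treat as "fall through to the interactive menu".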

Adds a Healthcheck menu option + --healthcheck CLI flag that
reports one line per check with [PASS] / [WARN] / [FAIL] tags and
exits 0 if every check passed (or only WARN'd), 1 if any FAIL
occurred. Shape suitable for:

  - systemd watchdog (WatchdogSec= + ExecReload=... --healthcheck)
  - cron + mail-on-nonzero for passive monitoring
  - Nagios / Icinga check_exit wrapper
  - Ansible / Terraform post-deploy verification

Checks:
  1. go-zenon.service active via systemctl is-active
  2. /usr/local/bin/znnd binary present + reports a version
  3. Local RPC reachable at ws://127.0.0.1:35998 (10s timeout)
  4. Frontier momentum fetched; wall-clock lag ≤ 15s = PASS,
     ≤ 60s = WARN, > 60s = FAIL. Thresholds match the 10s
     target block time with headroom for short-term jitter.
  5. (Only if producer config exists) pillar registered with the
     configured producer address; producedMomentums > 0 this epoch
     PASSes, 0 is WARN (new pillar or one that just missed).

All RPC calls carry explicit timeouts so a stuck node doesn't
cause the healthcheck itself to hang — exactly the failure mode
the check is meant to detect.