feat: --healthcheck for monitoring and systemd watchdogs #13
Open
mehowz wants to merge 6 commits into
The Deploy flow calls apt, wget, tar, git, and go build via Process.runSync with no timeout, no DEBIAN_FRONTEND=noninteractive, and no streamed stdout. On Ubuntu 24.04 the first `apt -y install linux-kernel-headers` call can block indefinitely on a dpkg lock (unattended-upgrades) or an interactive prompt, and the operator sees no output between "Git installation detected" and the hang. Reproduced on a fresh Hetzner 24.04 host; the only workaround was to pre-install `golang-go build-essential` manually and rebuild znnd from source.
This change:
- Adds `_runStreaming`, which uses `Process.start`, streams stdout/stderr live, accepts a timeout, and escalates SIGTERM→SIGKILL if the timeout is exceeded
- Adds `_isDpkgLocked` precheck via `fuser /var/lib/dpkg/lock-frontend` so the deploy aborts with an actionable message instead of blocking
- Adds `_hasCommand` / `_hasDebPackage` helpers so already-installed tools (git, build-essential, linux-libc-dev, wget, Go) are skipped cleanly; a host with manual `apt install golang-go build-essential` now sails through prereqs instead of re-running every apt step
- Routes all apt invocations through `_aptInstall` with DEBIAN_FRONTEND=noninteractive + confold/confdef options so dpkg never waits on config-file prompts
- Replaces `linux-kernel-headers` (transitional name, may not exist on Ubuntu 22.04+) with `linux-libc-dev`, accepting either package in the skip check for legacy compatibility
- Makes `_buildFromSource` pick `go` from PATH if /usr/local/go is absent, matching the detection in the prereq step
- 15-minute timeout on `go build` (slow hosts), 10-min on apt, 5-min on git clone / wget, 2-min on tar extract
Caller in main() awaits both async functions; both were bool, now Future<bool>. No other call sites.
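The timeout and signal-escalation behaviour described for `_runStreaming` can be sketched as follows. This is an illustrative Python sketch, not the controller's Dart code: the names `run_streaming`, `apt_install_command`, `noninteractive_env`, and the `grace_s` parameter are hypothetical, and the exact apt options the PR passes are not shown in the source (the `Dpkg::Options` flags below are the standard confdef/confold pair, assumed here).

```python
import os
import subprocess
import threading


def noninteractive_env():
    """Copy of the current environment with apt/dpkg prompts disabled."""
    return dict(os.environ, DEBIAN_FRONTEND="noninteractive")


def apt_install_command(packages):
    """Build an apt command that keeps existing config files without asking."""
    return [
        "apt", "-y",
        "-o", "Dpkg::Options::=--force-confdef",
        "-o", "Dpkg::Options::=--force-confold",
        "install", *packages,
    ]


def run_streaming(cmd, timeout_s, grace_s=5.0, env=None):
    """Run cmd, echo its output live, and return (exit_code, timed_out).

    On timeout the child first gets SIGTERM; if it is still alive after
    grace_s seconds it gets SIGKILL.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, env=env
    )

    def pump():
        # Forward each line as soon as the child writes it.
        for line in proc.stdout:
            print(line, end="", flush=True)

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()

    timed_out = False
    try:
        proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        timed_out = True
        proc.terminate()  # SIGTERM on POSIX
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            proc.kill()  # SIGKILL on POSIX
            proc.wait()
    reader.join(timeout=1.0)
    return proc.returncode, timed_out
```

Under these assumptions, an apt step with the 10-minute limit mentioned above would look like `run_streaming(apt_install_command(["git"]), timeout_s=600, env=noninteractive_env())`.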
…r bound
- goLinuxDlUrl bumped to go1.22.12 with verified go.dev/dl SHA256. go-zenon's go.mod still says `go 1.20`, so 1.20.x would technically compile it, but Go only maintains security updates for the two most recent minor versions. 1.22 is the current LTS floor.
- pubspec.yaml sdk constraint relaxed from '>=2.14.0 <3.0.0' to '>=2.14.0 <4.0.0'. The existing constraint prevents `dart pub get` on any Dart 3.x release, which is what `dart-lang/setup-dart@v1.5.0` (used by the release workflow) installs by default. dcli 3.0.2 targets Dart 3.0 exactly; keep that implicit via pubspec.lock.
Verified in Docker (Ubuntu 24.04 target):
- target-prepped (golang-go + build-essential pre-installed): controller detects each prereq, skips all apt calls, advances to `git clone` + `go build` with streamed output. Reproduces the zenonorg5 operator's working scenario.
- target-locked (flock holding /var/lib/dpkg/lock-frontend): controller aborts in ~2s with the actionable lock-contention error instead of hanging. No apt call issued.
Today the Deploy path assumes apt. On a CentOS / RHEL / Arch / NixOS host the first apt call fails with a cryptic error and the operator has to guess what went wrong. Parse /etc/os-release up front and abort with a one-shot actionable message: which distros are supported, what prereqs to install manually, and that Deploy will transparently skip the install step on a pre-prepped host thanks to the _hasCommand / _hasDebPackage detection already landed. Recognizes derivatives via ID_LIKE (e.g., Linux Mint, Raspbian) not just the narrow ID==debian/ubuntu check.
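The distro check described above can be sketched like this. It is an illustrative Python sketch rather than the controller's Dart implementation; the function names are hypothetical, and the only facts taken from the source are that `/etc/os-release` is parsed and that both `ID` and `ID_LIKE` are consulted.

```python
def parse_os_release(text):
    """Parse KEY=value lines from os-release content into a dict."""
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be bare or wrapped in single or double quotes.
        fields[key.strip()] = value.strip().strip("\"'")
    return fields


def is_debian_family(fields):
    """True if ID or any ID_LIKE entry names debian or ubuntu."""
    ids = {fields.get("ID", "").lower()}
    ids.update(fields.get("ID_LIKE", "").lower().split())
    return bool(ids & {"debian", "ubuntu"})
```

A caller would read `/etc/os-release`, pass its text to `parse_os_release`, and abort with the supported-distro message when `is_debian_family` returns `False`.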
Two related reproducibility fixes:
- Clone pins to `--branch znnRefTag --depth 1`. Previously Deploy pulled go-zenon master at whatever commit happened to be there at deploy time, so two operators following the same runbook a day apart could land different znnd binaries. Tag pin makes the deploy output deterministic across operators and across time. Bumped in lockstep with controller releases when a new go-zenon tag is the recommended mainnet version (v0.0.8 for 0.0.5 of this tool).
- Existing clone detection: if /root/go-zenon exists and its origin remote matches znnGithubUrl, fetch + hard-reset to the pinned tag instead of deleting and re-downloading ~200MB of modules. Foreign directories (different remote, scratch files) still get nuked and fresh-cloned; only a legit prior clone is reused. Accelerates re-deploys from several minutes to seconds when the tag hasn't changed.
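The reuse-or-reclone decision can be sketched as a small planning function. This Python sketch is illustrative only: `plan_checkout` is a hypothetical name, the git invocations are one plausible way to perform a shallow fetch and hard reset to a tag, and the commands the controller actually runs are not shown in the source.

```python
def plan_checkout(repo_dir, repo_url, tag, existing_origin):
    """Return the commands needed to get repo_dir onto the pinned tag.

    existing_origin is the directory's current origin URL, or None when the
    directory is missing or is not a git checkout.
    """
    same_remote = (
        existing_origin is not None
        and existing_origin.rstrip("/") == repo_url.rstrip("/")
    )
    if same_remote:
        # Legitimate prior clone: refresh the tag and reset in place.
        return [
            ["git", "-C", repo_dir, "fetch", "--depth", "1", "origin", "tag", tag],
            ["git", "-C", repo_dir, "reset", "--hard", tag],
        ]
    # Missing or foreign directory: remove it and clone the tag fresh.
    return [
        ["rm", "-rf", repo_dir],
        ["git", "clone", "--branch", tag, "--depth", "1", repo_url, repo_dir],
    ]
```

Returning the commands instead of running them keeps the decision easy to test; a runner with timeouts would then execute each entry in order.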
Adds a small argparse block at the top of main() so the controller
can be driven from Ansible / Terraform / bash scripts without TTY.
Flags:
  --deploy / --status / --start-service / --stop-service / --resync
      Jump straight to the matching menu action.
  -y / --yes
      Auto-confirm every prompt (echoes what was answered so the run
      log shows the choice).
  -h / --help, -v / --version
      Info commands, no side effects.
Semantics:
- All confirm() calls in Deploy / Resync now go through
_confirmOrDefault, returning the default in --yes mode.
- Existing-keystore password prompt reads ZNN_KEYSTORE_PASSWORD
when --yes is set; aborts with an actionable error if missing
rather than blocking on ask(). Retries are disabled in --yes mode
to fail fast.
- Fresh-keystore generation (the common first-deploy path) is
already unattended-friendly — the existing RandomStringGenerator
fallback produces a password without any prompt.
Conflict detection: specifying two action flags (e.g., --deploy
--status) exits 2 with a clear error instead of silently picking one.
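The conflict rule for action flags can be sketched as follows. This is an illustrative Python sketch using `argparse`, not the controller's Dart argument handling; `pick_action` and `build_parser` are hypothetical names, and only the flag names and the exit code 2 come from the text above.

```python
import argparse
import sys

ACTIONS = ["deploy", "status", "start-service", "stop-service", "resync"]


def build_parser():
    parser = argparse.ArgumentParser(prog="znn-controller")
    for name in ACTIONS:
        parser.add_argument("--" + name, action="store_true")
    parser.add_argument("-y", "--yes", action="store_true")
    return parser


def pick_action(argv):
    """Return (action, exit_code); action is None for menu mode or on error."""
    args = build_parser().parse_args(argv)
    # argparse stores "--start-service" under the attribute "start_service".
    chosen = [a for a in ACTIONS if getattr(args, a.replace("-", "_"))]
    if len(chosen) > 1:
        flags = ", ".join("--" + a for a in chosen)
        print("error: conflicting action flags: " + flags, file=sys.stderr)
        return None, 2
    return (chosen[0] if chosen else None), 0
```

Rejecting the combination outright avoids a silent precedence rule that a script author would have to discover by reading the source.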
Adds a Healthcheck menu option + --healthcheck CLI flag that
reports one line per check with [PASS] / [WARN] / [FAIL] tags and
exits 0 if every check passed (or only WARN'd), 1 if any FAIL
occurred. Shape suitable for:
- systemd watchdog (WatchdogSec= + ExecReload=... --healthcheck)
- cron + mail-on-nonzero for passive monitoring
- Nagios / Icinga check_exit wrapper
- Ansible / Terraform post-deploy verification
Checks:
1. go-zenon.service active via systemctl is-active
2. /usr/local/bin/znnd binary present + reports a version
3. Local RPC reachable at ws://127.0.0.1:35998 (10s timeout)
4. Frontier momentum fetched; wall-clock lag ≤ 15s = PASS,
≤ 60s = WARN, > 60s = FAIL. Thresholds match the 10s
target block time with headroom for short-term jitter.
5. (Only if producer config exists) pillar registered with the
configured producer address; producedMomentums > 0 this epoch
PASSes, 0 is WARN (new pillar or one that just missed).
All RPC calls carry explicit timeouts so a stuck node doesn't
cause the healthcheck itself to hang — exactly the failure mode
the check is meant to detect.
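The lag thresholds and exit-code rule stated in this commit message can be written down directly. The Python below is an illustrative sketch with hypothetical names (`classify_lag`, `exit_code`); it encodes only the numbers given above and is not the controller's Dart code.

```python
PASS, WARN, FAIL = "PASS", "WARN", "FAIL"


def classify_lag(lag_seconds):
    """Map wall-clock lag behind the frontier momentum to a check result."""
    if lag_seconds <= 15:
        return PASS
    if lag_seconds <= 60:
        return WARN
    return FAIL


def exit_code(results):
    """0 when every check passed or only warned, 1 if any check failed."""
    return 1 if FAIL in results else 0
```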
Summary
Adds a `Healthcheck` menu option + `--healthcheck` CLI flag that
runs diagnostic checks against the local node and exits with a
status code suitable for monitoring tools. Fills the gap between
`Status` (human-readable, interactive-only) and nothing.
Stacks on #11 and #12 — please review / merge those first. The
diff this PR adds over #12 is 134 insertions across one commit
(`fe9d699`).
Output format
Thresholds: sync lag ≤15s is `[PASS]`, ≤60s is `[WARN]`, >60s is
`[FAIL]`. These match the 10s target block time with headroom for
short-term jitter. Exit `0` if no `FAIL`, `1` otherwise. `WARN`
conditions don't fail the overall check; the operator reads the
output and decides.
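A monitoring wrapper that wants more than the exit code could tally the tagged lines. The Python sketch below is illustrative and assumes each check line begins with `[PASS]`, `[WARN]`, or `[FAIL]` followed by free text. The exact line layout is not shown here, so treat the pattern as an assumption; `summarize` is a hypothetical helper.

```python
import re

# Assumed shape: a bracketed tag at the start of the line, then detail text.
TAG_LINE = re.compile(r"^\[(PASS|WARN|FAIL)\]\s*(.*)$")


def summarize(output):
    """Count results per tag and collect the detail text of failed checks."""
    counts = {"PASS": 0, "WARN": 0, "FAIL": 0}
    failed = []
    for line in output.splitlines():
        match = TAG_LINE.match(line.strip())
        if match is None:
            continue  # ignore anything that is not a tagged check line
        tag, detail = match.groups()
        counts[tag] += 1
        if tag == "FAIL":
            failed.append(detail)
    return counts, failed
```

The process exit status remains the authoritative signal; parsing is only useful for richer alert text.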
Checks
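1. go-zenon.service is active per `systemctl is-active`. `FAIL` if
inactive.
2. /usr/local/bin/znnd is present and a version query
returns. `FAIL` if missing.
3. Local RPC at ws://127.0.0.1:35998 is reachable within a 10s
timeout. `FAIL` if unreachable. Skipped if service is inactive
(would double-report).
4. Frontier momentum is fetched and checked for
wall-clock lag vs `momentum.timestamp`. `FAIL` if timeout,
`PASS`/`WARN`/`FAIL` by lag threshold.
5. (Only if producer config exists) Pages through
`embedded.pillar.getAll` until the producer address is found or
pagination exhausts. `PASS` if registered and produced >0 this
epoch; `WARN` if registered but produced=0; absent from
registry → `WARN`.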
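The pillar lookup pages through the registry until the producer address turns up or the pages run out. A minimal Python sketch of that loop follows. It is illustrative only: `fetch_page` stands in for a call to `embedded.pillar.getAll`, and the dict keys `producer_address` and `produced_momentums`, the page size, and the `max_pages` guard are all assumptions.

```python
def find_pillar(fetch_page, producer_address, page_size=50, max_pages=100):
    """Return the matching pillar record, or None if pagination exhausts.

    fetch_page(page_index, page_size) must return a list of dicts.
    """
    for page_index in range(max_pages):
        page = fetch_page(page_index, page_size)
        for pillar in page:
            if pillar["producer_address"] == producer_address:
                return pillar
        if len(page) < page_size:
            return None  # short or empty page: nothing further to fetch
    return None


def classify_pillar(pillar):
    """PASS if registered and producing; WARN if idle or not registered."""
    if pillar is None:
        return "WARN"
    return "PASS" if pillar["produced_momentums"] > 0 else "WARN"
```

The `max_pages` bound keeps the loop finite even if the node keeps returning full pages, in the same spirit as the explicit RPC timeouts.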
All RPC calls carry explicit timeouts so a stuck node doesn't cause
the healthcheck itself to hang — which is exactly the failure mode
the check is meant to detect.
Use cases
Designed to drop into existing monitoring plumbing:
`znn-controller --healthcheck`
Test plan
Docker (Ubuntu 24.04, no running znnd):
`[FAIL] service` + `[FAIL] binary`
grep-ability)
A running-znnd integration test would need a real node spun up; not
in scope for the PR itself but straightforward to add to follow-up CI.
Scope
PR is strictly additive.
(`Zenon`, `PillarInfo`, `PillarInfoList`, `Momentum`). No new
dependency.