
fix: Deploy hangs indefinitely on Ubuntu 24.04 during apt install #11

Open
mehowz wants to merge 2 commits into zenon-network:master from ZenonOrg:fix/deploy-hang-ubuntu-24



mehowz commented Apr 23, 2026


Summary

The Deploy menu option hangs forever on fresh Ubuntu 24.04 hosts.
After the line `Git installation detected: git version 2.43.0` the tool
sits with no output for hours; the operator has no indication of what
failed or how to proceed. Reproducible on any stock 24.04 host.

Reproduction

  1. Fresh Ubuntu 24.04 LTS host (Hetzner, EC2, Hyper-V, anything).
  2. Download the published v0.0.4-alpha binary from Releases.
  3. Run as root; select Deploy.
  4. Tool prints `Installing Linux prerequisites ...` and `Git installation detected: git version 2.43.0`, then hangs with zero further output.

Root cause

`_installLinuxPrerequisites()` chains three
`Process.runSync('apt', ['-y', 'install', ...], runInShell: true)` calls with:

  • no timeout,
  • no `DEBIAN_FRONTEND=noninteractive`,
  • stdout/stderr captured but never displayed.

On Ubuntu 24.04 the first call (`apt -y install linux-kernel-headers`)
hits at least one of three failure modes:

  • `linux-kernel-headers` is a transitional package name; modern Ubuntu
    ships `linux-libc-dev` instead. The transitional shim can prompt.
  • `dpkg` waits for an interactive config-file prompt (`confold` /
    `confdef` not set) when a package replaces an existing config.
  • `/var/lib/dpkg/lock-frontend` is held by `unattended-upgrades`
    which is enabled by default on a fresh 24.04 image.

`Process.runSync` only returns once the child exits, and it captures
output instead of displaying it, so a stuck apt call blocks forever and
the operator sees nothing.

Fix

  • New `_runStreaming` helper: `Process.start` with live stdout/stderr
    tee, timeout + `SIGTERM`→`SIGKILL` fallback, configurable env merge.
  • New `_isDpkgLocked` precheck via `fuser /var/lib/dpkg/lock-frontend`
    so Deploy aborts in <2s with an actionable message when another
    apt/dpkg job is running, instead of blocking.
  • New `_hasCommand` / `_hasDebPackage` helpers: if a prereq is already
    present (`git`, `build-essential`, `linux-libc-dev`,
    `linux-kernel-headers`, `wget`, Go via `/usr/local/go` or system
    `go`), the apt call is skipped entirely. Makes the tool work on
    operator-prepped hosts without fighting them.
  • New `_aptInstall` wrapper: every apt call routed through it, with
    `DEBIAN_FRONTEND=noninteractive` and
    `-o DPkg::Options::=--force-confold -o DPkg::Options::=--force-confdef`
    so dpkg never waits on config prompts.
  • `linux-kernel-headers` replaced by `linux-libc-dev` (the real modern
    name); the detection check accepts either so legacy hosts still work.
  • `_buildFromSource` picks `go` from `PATH` if `/usr/local/go` is
    absent, matching the detection in the prereq step.
  • Timeouts: apt 10m, wget 5m, tar 2m, git clone 5m, go build 15m.
    Each is set well above normal run times so it should not trip on a
    healthy host, while still capping how long an operator waits on a
    stuck step.
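A minimal Dart sketch of the streaming runner and apt wrapper described in the Fix list, assuming a signature that returns the exit code, or `null` when the timeout fires. `runStreaming`, `aptInstallArgs`, `aptInstall` and the 10-second `killGrace` are illustrative names and values, not the PR's actual code. Unlike the original call, this execs `apt` directly rather than through `runInShell`.

```dart
import 'dart:async';
import 'dart:io';

/// Hypothetical runner: starts [executable], forwards its stdout/stderr live,
/// and returns the exit code, or null if [timeout] elapsed. On timeout the
/// child gets SIGTERM, then SIGKILL if still alive after [killGrace].
Future<int?> runStreaming(
  String executable,
  List<String> arguments, {
  required Duration timeout,
  Map<String, String> extraEnv = const {},
  Duration killGrace = const Duration(seconds: 10),
}) async {
  // extraEnv is layered over the inherited environment, because
  // includeParentEnvironment defaults to true.
  final process =
      await Process.start(executable, arguments, environment: extraEnv);

  // Tee child output as it arrives so the operator sees progress.
  final outDone = process.stdout.listen(stdout.add).asFuture<void>();
  final errDone = process.stderr.listen(stderr.add).asFuture<void>();

  try {
    final code = await process.exitCode.timeout(timeout);
    await Future.wait([outDone, errDone]);
    return code;
  } on TimeoutException {
    stderr.writeln('$executable exceeded $timeout; sending SIGTERM');
    process.kill(ProcessSignal.sigterm);
    try {
      await process.exitCode.timeout(killGrace);
    } on TimeoutException {
      stderr.writeln('$executable ignored SIGTERM; sending SIGKILL');
      process.kill(ProcessSignal.sigkill);
      await process.exitCode;
    }
    return null; // callers treat null as "timed out"
  }
}

/// Hypothetical argv for a non-interactive install that keeps existing
/// config files instead of prompting.
List<String> aptInstallArgs(List<String> packages) => [
      '-y',
      '-o',
      'DPkg::Options::=--force-confold',
      '-o',
      'DPkg::Options::=--force-confdef',
      'install',
      ...packages,
    ];

/// Every apt call goes through here: noninteractive frontend, 10-minute cap.
Future<bool> aptInstall(List<String> packages) async {
  final code = await runStreaming(
    'apt',
    aptInstallArgs(packages),
    timeout: const Duration(minutes: 10),
    extraEnv: const {'DEBIAN_FRONTEND': 'noninteractive'},
  );
  return code == 0; // both null (timeout) and non-zero count as failure
}
```

Calling `await aptInstall(['git', 'wget'])` from an async `main` streams apt's output live and returns `false` on failure or timeout. Output is forwarded with `listen(stdout.add)` rather than `stdout.addStream`, because an `IOSink` bound by `addStream` rejects other writes until that stream finishes, which would break the runner's own timeout messages.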

Bonus changes (same PR for atomicity)

  • `goLinuxDlUrl` bumped from `go1.20.3` → `go1.22.12`. go 1.20 is EOL
    (each Go major release is supported only until two newer major
    releases ship). 1.22 has since left that window as well, so this bump
    moves off 1.20 but does not reach a currently supported release.
    Verified SHA256 against go.dev.
  • `pubspec.yaml` SDK constraint relaxed from `<3.0.0` to `<4.0.0` so
    `dart pub get` works on Dart 3 (which is what
    `dart-lang/setup-dart@v1.5.0` installs in the existing CI workflow
    by default — the constraint was already out of sync with the CI
    reality).

Test plan

Verified in Docker (Ubuntu 24.04 base, Dart 3.0 builder):

  • Prepped host (zenonorg5 reproduction): pre-install `git`,
    `build-essential`, `linux-libc-dev`, `wget`, `golang-go`.
    Controller detects each, skips all apt, advances to `git clone` +
    `go build` with streamed output. Zero `apt-get install` calls.
  • Lock held: `flock -n /var/lib/dpkg/lock-frontend -c 'sleep 60' &`.
    Controller aborts in <2s with the "held by another process" message.
  • Argument parsing: `--help`, `--version`, unknown-arg rejection,
    conflicting-action rejection all behave correctly.
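The lock precheck exercised by the "Lock held" case might look like this in Dart, assuming psmisc `fuser` is installed and the tool runs as root (as in the reproduction), since `fuser` only sees other users' open files with enough privileges. `fuser` exits 0 when at least one process has the file open and non-zero otherwise. `isDpkgLocked`, its `fuser` parameter, and `ensureDpkgFree` with its message text are hypothetical; the parameter exists only so the check can be exercised without a real lock.

```dart
import 'dart:io';

/// Hypothetical precheck: does any process have dpkg's frontend lock open?
/// psmisc `fuser` exits 0 if at least one access is found, non-zero otherwise.
Future<bool> isDpkgLocked({
  String lockPath = '/var/lib/dpkg/lock-frontend',
  String fuser = 'fuser', // overridable for illustration/testing only
}) async {
  try {
    final result = await Process.run(fuser, [lockPath]);
    return result.exitCode == 0;
  } on ProcessException {
    // fuser is not installed: we cannot tell, so report "not locked" and
    // rely on the apt timeout as the backstop.
    return false;
  }
}

/// Example gate in front of the apt phase (message text is illustrative).
Future<bool> ensureDpkgFree() async {
  if (await isDpkgLocked()) {
    stderr.writeln('/var/lib/dpkg/lock-frontend is held by another process '
        '(often unattended-upgrades). Wait for it to finish, then re-run Deploy.');
    return false;
  }
  return true;
}
```

In the `flock -n ... -c 'sleep 60'` scenario, `flock` keeps `lock-frontend` open, so `fuser` exits 0 and the caller can abort immediately instead of queueing behind the lock.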

A full 10-test matrix passes; the harness (bash + Docker) is standalone
and can be resurrected as real CI in a follow-up PR if maintainers want.

Scope

  • Not fixed here: the build chain itself is fragile — `dcli 3.0.2`
    only compiles on Dart 3.0.x exactly (later 3.x removed `cli.waitFor`,
    3.5+ removed `UnmodifiableUint8ListView`). A dcli bump deserves its
    own PR with tests first.
  • Not fixed here: `Process.runSync` is still used for `systemctl`,
    `pgrep`, `kill`, etc. Those could also hang in theory; migrating
    them is orthogonal and not on the hot path that produced the reported
    bug.

Follow-ups

Two additional PRs are prepared, stacked on this one:

  • `feat/deploy-robustness` — distro detection, pin go-zenon to a tagged
    release, reuse existing clone on re-deploy, `--yes --deploy` for
    unattended runs.
  • `feat/deploy-healthcheck` — `--healthcheck` with `[PASS]`/`[WARN]`/
    `[FAIL]` markers and exit-code semantics suitable for monitoring /
    systemd watchdogs.

Happy to split further if the maintainer prefers smaller PRs.

mehowz added 2 commits April 23, 2026 19:03
The Deploy flow calls apt, wget, tar, git, and go build via
Process.runSync with no timeout, no DEBIAN_FRONTEND=noninteractive,
no streamed stdout. On Ubuntu 24.04 the first `apt -y install
linux-kernel-headers` call can block indefinitely on a dpkg lock
(unattended-upgrades) or an interactive prompt, and the operator
sees no output between "Git installation detected" and the hang.
Reproduced on a fresh Hetzner 24.04 host; only workaround was to
pre-install `golang-go build-essential` manually and rebuild znnd
from source.

This change:

- Adds `_runStreaming` that uses `Process.start`, streams stdout/
  stderr live, accepts a timeout, and SIGTERM→SIGKILL if exceeded
- Adds `_isDpkgLocked` precheck via `fuser /var/lib/dpkg/lock-frontend`
  so the deploy aborts with an actionable message instead of blocking
- Adds `_hasCommand` / `_hasDebPackage` helpers so already-installed
  tools (git, build-essential, linux-libc-dev, wget, Go) are skipped
  cleanly — a host with manual `apt install golang-go build-essential`
  now sails through prereqs instead of re-running every apt step
- Routes all apt invocations through `_aptInstall` with
  DEBIAN_FRONTEND=noninteractive + confold/confdef options so dpkg
  never waits on config-file prompts
- Replaces `linux-kernel-headers` (transitional name, may not exist
  on Ubuntu 22.04+) with `linux-libc-dev`, accepting either package
  in the skip check for legacy compatibility
- Makes `_buildFromSource` pick `go` from PATH if /usr/local/go is
  absent, matching the detection in the prereq step
- 15-minute timeout on `go build` (slow hosts), 10-min on apt,
  5-min on git clone / wget, 2-min on tar extract

The two functions made async here previously returned bool and now
return Future<bool>; their caller in main() awaits both. No other call
sites.
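The skip-if-present checks and the Go fallback described above can be sketched as follows. The names `hasCommand`, `isInstalledStatus`, `hasDebPackage` and `resolveGoBinary` are assumptions, as is scanning `PATH` in Dart instead of shelling out to `command -v`. `dpkg-query -W --showformat='${Status}'` prints a line such as `install ok installed`, whose last word is the package state.

```dart
import 'dart:io';

/// Hypothetical: true if an executable file named [name] is in some PATH dir.
bool hasCommand(String name) {
  final path = Platform.environment['PATH'] ?? '';
  for (final dir in path.split(':')) {
    if (dir.isEmpty) continue;
    final file = File('$dir/$name');
    // 0x49 == 0o111: any of the user/group/other execute bits.
    if (file.existsSync() && (file.statSync().mode & 0x49) != 0) {
      return true;
    }
  }
  return false;
}

/// Parses dpkg-query Status output such as "install ok installed" or
/// "hold ok installed"; only the final state word matters here.
bool isInstalledStatus(String status) {
  final words = status.trim().split(RegExp(r'\s+'));
  return words.isNotEmpty && words.last == 'installed';
}

/// Hypothetical: true if dpkg reports [package] as installed.
Future<bool> hasDebPackage(String package) async {
  try {
    final r = await Process.run(
        'dpkg-query', ['-W', '--showformat=\${Status}', package]);
    return r.exitCode == 0 && isInstalledStatus(r.stdout as String);
  } on ProcessException {
    return false; // dpkg-query not available on this host
  }
}

/// Prefer the tarball install under /usr/local/go, else any `go` on PATH.
String? resolveGoBinary({String localGo = '/usr/local/go/bin/go'}) {
  if (File(localGo).existsSync()) return localGo;
  return hasCommand('go') ? 'go' : null;
}
```

A prereq step built on these would call `hasCommand('git')` or `hasDebPackage('linux-libc-dev')` first and only fall through to apt when the check fails, which is what lets a host with `golang-go` and `build-essential` already installed skip the apt phase entirely.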
…r bound

- goLinuxDlUrl bumped to go1.22.12 with verified go.dev/dl SHA256.
  go-zenon's go.mod still says `go 1.20`, so 1.20.x would technically
  compile it, but Go only ships security updates for the two most
  recent major releases. Go has no LTS track, and 1.22 is also outside
  that window; this bump only moves off the older 1.20 toolchain.
- pubspec.yaml sdk constraint relaxed from '>=2.14.0 <3.0.0' to
  '>=2.14.0 <4.0.0'. The existing constraint prevents `dart pub get`
  on any Dart 3.x release, which is what `dart-lang/setup-dart@v1.5.0`
  (used by the release workflow) installs by default. dcli 3.0.2
  targets Dart 3.0 exactly; keep that implicit via pubspec.lock.

Verified in Docker (Ubuntu 24.04 target):
- target-prepped (golang-go + build-essential pre-installed): controller
  detects each prereq, skips all apt calls, advances to `git clone` +
  `go build` with streamed output. Reproduces the zenonorg5 operator's
  working scenario.
- target-locked (flock holding /var/lib/dpkg/lock-frontend): controller
  aborts in ~2s with the actionable lock-contention error instead of
  hanging. No apt call issued.