feat(certs): serve leaf + device CA chain from Traefik for client CA pinning#197

Merged: mairas merged 2 commits into main from feat/traefik-serve-ca-chain on Jun 22, 2026

mairas commented Jun 21, 2026


Summary

Makes Traefik serve leaf + device CA so TOFU clients (notably SensESP's Signal K CA-pinning) can capture and pin the stable CA instead of the rotating leaf. Closes #196.

Traefik served the leaf only — correct per RFC (a self-signed root may be omitted, and the HaLOS device CA is a self-signed root signing the leaf directly) but it leaves TOFU clients nothing to pin but the leaf.

What changed

halos-manage-certs appends the device CA (${CA_CRT}) to the served cert file (halos.crt) after signing, idempotently — only when the file still holds the leaf alone, so re-runs and already-migrated devices are untouched. Runs before the Cockpit install so :9090 serves the same chain. A comment marks the deliberate root inclusion so it isn't "cleaned up".

Verification

  • bash -n clean; all tests/test-halos-manage-certs.sh pass (40/40), including a new test_traefik_serves_leaf_plus_ca_chain that asserts the served file is leaf + CA, the 2nd cert is the device CA, and a rerun stays at 2 certs.
  • ⚠️ Not verified on a live device. Confirm on a HALPI with openssl s_client -connect <host>:4430 -showcerts (expect 2 certs) and that Cockpit/browser trust is unaffected.
  • ⚠️ Reload timing: on an existing device that is not re-signing, the CA is appended but Traefik's explicit dynamic-config reload is gated on a re-sign; it relies on Traefik's file-provider watcher noticing the cert file changed. Worth confirming on-device that the new chain is served without a restart.
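The on-device check above can be scripted. This is a minimal sketch, not part of the PR: `count_pem_certs` and `served_cert_count` are hypothetical helper names, and it assumes `openssl` is installed and that `s_client -showcerts` prints each presented certificate once as a PEM block.

```shell
# Count PEM certificate blocks on stdin. The pattern is anchored to the start
# of the line so stray text mentioning the marker cannot inflate the count.
count_pem_certs() {
    grep -c '^-----BEGIN CERTIFICATE-----' || true
}

# Hypothetical wrapper: print how many certificates a TLS server presents.
# Usage: served_cert_count <host> <port>
served_cert_count() {
    host=$1
    port=$2
    # </dev/null closes stdin so s_client exits right after the handshake.
    openssl s_client -connect "${host}:${port}" -showcerts </dev/null 2>/dev/null \
        | count_pem_certs
}
```

Running `served_cert_count <host> 4430` should print 2 once the chain is served; the same call against port 9090 would check the Cockpit surface.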

Rollout

SignalK/SensESP#1028 can land first; until this ships, devices degrade to leaf-fingerprint pinning (no worse than today). Once deployed and verified, SensESP's verified-upgrade path upgrades pinned devices to CA pinning automatically on a handshake authenticated by their existing trusted leaf.

🤖 Generated with Claude Code

mairas commented Jun 22, 2026


Review: serve leaf + device CA chain from Traefik

Reviewed at b566e37 with a multi-persona pass (correctness, security/PKI, reliability, testing, project-standards, adversarial) plus per-finding verification against the code, units, and packaging. Tests pass 40/40 locally.

Verdict: the core idea is sound and the security design is clean — chain order is leaf-first, the appended CA is already a public artifact (no new exposure), and the Cockpit combined PEM still validates. But it can't merge as-is: CI is red, and one load-bearing comment is factually wrong about why the change works.

The key correction from review: the feature does take effect, but via the package-upgrade stack restart, not the file-provider watcher the comment claims. postinst restarts halos-core-containers.service; halos-manage-certs.service is a non-RemainAfterExit oneshot pulled in via Requires=/Before=, so on upgrade the append runs, then docker compose down/up recreates Traefik, which re-reads /certs/halos.crt = leaf+CA. Traefik watches /etc/traefik/dynamic (a separate bind mount), never /certs — so a bare append is not served live. That directly answers the PR body's open ⚠️ ("served without a restart?"): no, it's served with the restart the upgrade already performs.


Blocker

1. version-bump-check is failing — VERSION must be bumped.
VERSION is 0.5.0; the latest stable release is v0.5.0+1 (upstream 0.5.0) — equal. This PR changes a package-affecting file (assets/halos-manage-certs), so it opens a new release cycle and must bump VERSION once. The live check is red, and the repo gate is absolute ("all checks must pass before merge").
→ ./run bumpversion minor (→ 0.6.0; this is a feat). The bump level is the author's judgment; CI only enforces that a bump happened.

Major (fix before merge)

2. The new comment (lines 298-299) is false and contradicts the file's own cited reasoning 60 lines below.

"Direct modification of CERT_FILE is picked up by Traefik's file-provider watcher; a re-sign additionally touches the dynamic config below."

Lines 363-364 (citing traefik/traefik#5495) state the opposite: the watcher fires on the dynamic-config file, not the referenced cert files. The mount topology confirms it (/certs:ro vs /etc/traefik/dynamic:ro). This isn't just inaccurate prose — it's the stated justification for why the append needs no reload, and it misattributes the actual mechanism (the upgrade restart). A future maintainer who trusts it could "optimize" the restart away and silently break live pickup.
→ Replace with the truth: a bare CERT_FILE edit is not seen by Traefik; the appended chain is served after the next halos-core-containers.service restart (which the package upgrade triggers).

3. Non-atomic cat CA >> CERT_FILE can freeze a permanently-corrupt served chain.
Every other writer of these artifacts uses stage+mv (AGENTS.md calls it "the established pattern") precisely to avoid torn reads/writes. A crash or power loss mid-append (these run on boats) leaves leaf + a truncated CA block. The next run sees grep -c == 2, concludes "already migrated," and never re-appends — the corruption is frozen. Worse, that file feeds the Cockpit :9090 override (halos_cockpit_install_leaf validates only the first cert), so the bad chain silently lands on both surfaces with no self-heal.
→ Build the chain in a .new.$$ sibling, validate that both certs parse, then mv — mirroring halos_cockpit_install_leaf. This also subsumes finding 7.
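The stage-and-mv shape asked for in finding 3 could look roughly like the sketch below. It is an illustration, not the script's actual code: `append_ca_atomically` is a hypothetical name, the `644` mode is an assumption, and the validation is a structural BEGIN/END count standing in for a real parse (the actual fix would more likely run each cert through `openssl x509`).

```shell
# Hypothetical helper: append the CA to a leaf-only cert file via stage-and-mv.
# Usage: append_ca_atomically <cert_file> <ca_file>
append_ca_atomically() {
    cert_file=$1
    ca_file=$2
    tmp="${cert_file}.new.$$"

    # Idempotency guard: act only while the file holds exactly one cert block.
    count=$(grep -c '^-----BEGIN CERTIFICATE-----' "$cert_file" 2>/dev/null) || true
    [ "$count" = "1" ] || return 0

    # Stage the full chain in a sibling file so the live file is never half-written.
    if ! cat "$cert_file" "$ca_file" > "$tmp"; then
        rm -f "$tmp"
        return 1
    fi

    # Structural validation (assumption: stands in for parsing both certs).
    begins=$(grep -c '^-----BEGIN CERTIFICATE-----' "$tmp") || true
    ends=$(grep -c '^-----END CERTIFICATE-----' "$tmp") || true
    if [ "$begins" -lt 2 ] || [ "$begins" != "$ends" ]; then
        echo "WARNING: staged chain failed validation; leaving $cert_file unchanged" >&2
        rm -f "$tmp"
        return 1
    fi

    # Publish: a rename within one directory replaces the file in a single step.
    chmod 644 "$tmp" && mv -f "$tmp" "$cert_file"
}
```

Because the guard runs against the published file and the publish is a single rename, an interrupted run leaves either the old leaf-only file (retried on the next run) or the complete chain, never a truncated second block.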

4. The new test can't catch the behavior it's named for.
test_traefik_serves_leaf_plus_ca_chain only exercises fresh first-boot, asserts on-disk bytes (a proxy, not what Traefik serves), and never seeds TRAEFIK_TLS_CONFIG — so it skips the reload path entirely. Uncovered and worth adding:

  • the migration path (pre-existing leaf-only file + valid sentinel, NEED_LEAF=false) — the scenario the PR exists for;
  • the re-sign-then-re-append path (halos_ca_sign_leaf overwrites to leaf-only → guard re-appends → still 2) — the lifetime-correctness invariant;
  • CA rotation → the re-appended cert is the new CA (the exact invariant a CA-pinning client depends on);
  • custom-CA mode (append runs there too);
  • assert the 1st cert is the leaf (CA:FALSE), not only that the 2nd is the CA.

Minor

5. Reload-on-append (optional hardening). The upgrade restart covers the real deployment path, so this isn't a functional blocker. But triggering the dynamic-config touch (and the cockpit reload) when the append actually changes the file — set a flag in the if-block and OR it into the gates at lines 347/386 — removes the dependency on "restart recreates Traefik" and makes a timer-only append serve live. Cheap belt-and-suspenders.

6. docs/CERTS.md drift. Line 86 now mis-states the Cockpit override as "leaf cert + leaf key" — it's leaf + device CA + leaf key after this change. And the deliberate served-chain behavior is undocumented in CERTS.md (the operator doc the units reference via Documentation=); add a short note + the openssl s_client -showcerts 2-cert check that issue #196's Verify section calls for.

Nit

7. grep -c 'BEGIN CERTIFICATE' counts matching lines, not PEM blocks — a hand-edited cert with a comment containing that literal would misfire and silently skip the append. Anchor it (^-----BEGIN CERTIFICATE-----) and log when the guard skips so a misfire is diagnosable. (Folds into the finding-3 rewrite.)
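The misfire described in finding 7 is easy to reproduce. The snippet below is a self-contained illustration with a made-up sample file, not the project's test: a comment line that mentions the marker inflates the unanchored count to 2, so a `== 1` guard would wrongly skip the append.

```shell
# Made-up sample: one real PEM block preceded by a hand-written comment.
sample=$(printf '%s\n' \
    '# note: paste the new cert after the BEGIN CERTIFICATE line' \
    '-----BEGIN CERTIFICATE-----' \
    'AAAA' \
    '-----END CERTIFICATE-----')

# Unanchored: counts every line containing the text, including the comment.
loose=$(printf '%s\n' "$sample" | grep -c 'BEGIN CERTIFICATE')

# Anchored: counts only lines that start with the PEM header.
strict=$(printf '%s\n' "$sample" | grep -c '^-----BEGIN CERTIFICATE-----')

echo "loose=$loose strict=$strict"   # prints: loose=2 strict=1
```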

8. The commit carries a non-standard Claude-Session: trailer (the only one in repo history) baking an access-gated URL into permanent history. Drop it on amend; keep Co-Authored-By.


What's good

  • Chain order is correct and RFC-valid; the "Do NOT remove this as redundant" intent comment (with RFC citations) is exactly the kind of WHY the standards want — it just needs the watcher sentence corrected.
  • No trust weakening / no key exposure: the CA is already published world-readable and installed in the host trust store.
  • Idempotency self-heals across leaf rotation (the grep -c == 1 guard re-appends after each re-sign); the adversarial "stale old CA stays forever after rotation" hypothesis does not occur — verified.

Net: bump VERSION (blocker), fix the false comment + make the append atomic (major), broaden the test to the migration/rotation paths (major). The optional reload-on-append and the doc updates round it out.

🤖 Synthesized from a multi-agent review (Claude Code).

mairas force-pushed the feat/traefik-serve-ca-chain branch from b566e37 to 97c9898 (June 22, 2026 13:12)
mairas commented Jun 22, 2026


On-device finding (testing SensESP CA-pinning end-to-end against a HALSER + a halosdev server): the file-provider watcher does not reload Traefik on an in-place CA append. After the script appended the CA to halos.crt, the served chain stayed at the leaf alone until tls-default.yml was touched — so on an existing device that isn't re-signing, the chain would not actually be served until the next restart/re-sign. Fixed: the append now explicitly touches the dynamic config (same reload mechanism the leaf-rotation path uses). Verified on halosdev that the touch makes Traefik serve the 2-cert chain, and that a SensESP device then captures the CA and upgrades its pin.
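The reload poke described here amounts to updating the dynamic config's mtime so the file-provider watcher fires. A minimal sketch follows, assuming the caller passes the host-side path of `tls-default.yml`; `poke_traefik_reload` is a hypothetical name, not the function used in halos-manage-certs.

```shell
# Hypothetical helper: touch Traefik's dynamic TLS config so the file-provider
# watcher notices a change and re-reads the referenced cert file.
# Usage: poke_traefik_reload <path-to-dynamic-config>
poke_traefik_reload() {
    config=$1
    if [ ! -f "$config" ]; then
        # Nothing to poke (e.g. Traefik not yet configured); treat as a no-op.
        echo "dynamic config $config not found; skipping reload poke" >&2
        return 0
    fi
    touch "$config" && echo "touched $config to trigger a Traefik reload"
}
```

Calling this only after the chain has been published successfully keeps Traefik from reloading onto an unchanged or partially written file.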

mairas and others added 2 commits June 22, 2026 17:38
SensESP's Signal K CA-pinning captures the issuing CA from the TLS
handshake to pin it instead of the rotating leaf. Traefik served the leaf
only -- correct per RFC (a self-signed root may be omitted), but it leaves
TOFU clients nothing to pin but the leaf.

Append the device CA to the served certificate file (idempotently, only
when it still holds the leaf alone) so Traefik presents leaf + CA. Clients
that already trust the CA ignore the extra cert. Runs before the Cockpit
install so :9090 serves the same chain. The chain is assembled via
stage-and-mv so an interrupted write can never freeze a half-written
second cert into the served file.

Traefik's file-provider watcher does NOT reload on a cert-file change
(verified on-device: the served chain stayed at the leaf alone until the
dynamic config was touched), so a successful append explicitly pokes the
dynamic config to force a reload -- the same mechanism the leaf-rotation
path uses.

Covered by tests for the migration, leaf-resign, CA-rotation, and
custom-CA paths; documented in docs/CERTS.md.

Refs #196

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mairas force-pushed the feat/traefik-serve-ca-chain branch from 97c9898 to 3c1badd (June 22, 2026 14:38)
mairas commented Jun 22, 2026


Review findings addressed (pushed 3c1badd)

Picked up from a second session — thanks for the on-device watcher verification, it confirmed the reload gap directly. Force-pushed (amended the feature commit + a separate version-bump commit). All checks now green.

Blocker — fixed

  • VERSION bumped 0.5.0 → 0.6.0 (./run bumpversion minor). version-bump-check is green; this opens the new release cycle (minor for a feat — CI only enforces that a bump happened).

Major — fixed

  • Atomic chain assembly. Replaced the in-place cat >> CERT_FILE with stage-and-mv (*.new.$$ → validate the leaf parses + ≥2 cert blocks → chmod → mv), mirroring halos_ca_sign_leaf / halos_cockpit_install_leaf. A torn append previously left a leaf + truncated CA block that the count guard would read as "already migrated" and freeze forever (and propagate into the :9090 override). Now the served file is always the complete chain or unchanged.
  • Reload-on-append now logged + gated on success. The touch (added in the prior push) only fires after a successful publish and emits a log line, matching the leaf-rotation path.
  • Test coverage for the paths that matter. Added four tests + extended the existing one:
    • …migration_appends_ca_and_reloads_existing_leaf — the in-place-upgrade scenario (NEED_LEAF=false, pre-existing leaf-only): asserts the CA is appended, the leaf is not re-signed, the key is untouched, and the dynamic config is touched so a running Traefik reloads.
    • …reappends_ca_after_leaf_resign — re-sign overwrites to leaf-only → CA re-appended.
    • …reappends_new_ca_after_rotation — after CA rotation the chain carries the new CA (the pinning invariant).
    • …includes_custom_ca — custom-CA mode chains the operator CA.
    • existing chain test now also asserts cert #1 is the leaf (CA:FALSE), pinning order.
    • 44/44 pass.

Minor / nit — fixed

  • Anchored the count guard to ^-----BEGIN CERTIFICATE----- (was matching any line containing that text) and added a WARNING on assembly failure.
  • docs/CERTS.md: new "Served chain: leaf + device CA" section; corrected the Cockpit override description (now leaf + device CA + leaf key); added an openssl s_client -showcerts check (expect 2 certs), the verification that issue #196 asks for.
  • Dropped the non-standard Claude-Session: commit trailer.

🤖 Generated with Claude Code

mairas marked this pull request as ready for review June 22, 2026 14:46
mairas merged commit de8f4eb into main Jun 22, 2026 (4 checks passed)
mairas deleted the feat/traefik-serve-ca-chain branch June 22, 2026 14:48