…t reach the gateway
Enrollment proves only the HTTPS/TCP path; a firewall blocking the QUIC/UDP
tunnel (UDP 4433) could let enrollment succeed yet leave the tunnel dead, while
the installer still reported success.
`agent up` now performs a one-shot QUIC + mTLS connectivity probe to the gateway
right after enrolling, and exits non-zero on failure — which the enrollment
custom action already turns into a failed (and rolled-back) install.
- agent: `probe_connectivity` reuses the live connect path (one handshake +
bounded drain); no standalone subcommand or heartbeat round-trip.
- agent-installer: on a failed `up`, roll back a freshly-persisted enrollment
only when `up` actually wrote new certs (guarded, fails safe), never the
prior install's.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Split out of #1831.
Problem: the installer reports success on enrollment, not on tunnel connectivity
Symptom: the MSI install shows the Agent Tunnel step as success, but the agent never appears online in the Gateway / DVLS agent list.
Root cause: enrollment and the tunnel use two different network paths.
`EnrollAgentTunnel` runs `devolutions-agent up` → `enroll_agent()`, whose success criteria were only:
- `POST https://<gw>:7171/jet/tunnel/enroll` (HTTPS management port, TCP) returns 2xx and issues the client cert, and
- the `Tunnel` section is persisted to `agent.json`.

The actual data path, the QUIC tunnel over UDP (4433), is established later by the agent service and was never probed at install time. Enrollment is TCP/7171; the tunnel is UDP/4433. So when UDP 4433 is blocked (firewall/NAT), the install is green while the agent silently fails to connect (`Tunnel connection lost error=QUIC handshake: timed out`).
Fix
After enrolling, `agent up` now performs a one-shot QUIC + mTLS connectivity probe to the gateway tunnel endpoint and exits non-zero on failure. `EnrollAgentTunnel` already checks `up`'s exit code, so a blocked UDP path now fails the install and rolls back, giving the operator actionable feedback (verify UDP 4433 / firewall) while they're still at the machine.
- agent (`tunnel.rs`, `main.rs`): `probe_connectivity` reuses the live connect path (the same `connect_to_gateway` the running service uses) for a single mTLS + QUIC handshake, bounded by a timeout, then drains the connection (`close` + bounded `wait_idle`) so the gateway unregisters the probe promptly. A completed handshake is sufficient proof the UDP path is open, so the standalone `probe-tunnel` subcommand and the heavier heartbeat round-trip probe were removed.
- agent-installer (`CustomActions.cs`): removed the in-CA subprocess probe (it lives in `up` now). On a failed `up`, both the timeout and non-zero-exit paths roll back a freshly-persisted enrollment via a guarded helper that cleans up only when `up` actually wrote new certs (uuid-named client cert path changed from the pre-`up` snapshot), never the prior install's certs, and never when the pre-`up` snapshot couldn't be captured. Fails safe.
Testing
Validated end-to-end on a lab agent VM against the live gateway:
- `up` exits non-zero → install fails and rolls back, no orphaned artifacts left behind.

Unit tests: `probe_fails_fast_when_tunnel_disabled`, `probe_times_out_when_gateway_unreachable`.
Reviewed via an iterative Claude + Codex review loop (converged clean).
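
The guarded rollback described under Fix comes down to a small decision. As a minimal sketch in Rust: the real helper lives in `CustomActions.cs`, so this only models the rule and is not the shipped code, and `PreUpSnapshot` and `cert_to_roll_back` are hypothetical names introduced for illustration.

```rust
use std::path::{Path, PathBuf};

/// Outcome of recording the client cert path before `up` runs.
/// Hypothetical type; it only models the guard described in the Fix section.
#[derive(Debug, Clone, PartialEq)]
pub enum PreUpSnapshot {
    /// The snapshot could not be captured (for example, config unreadable).
    Unavailable,
    /// The snapshot succeeded; `None` means no client cert was configured yet.
    Captured(Option<PathBuf>),
}

/// Returns the cert path that is safe to delete after a failed `up`,
/// or `None` when nothing should be touched.
pub fn cert_to_roll_back(pre: &PreUpSnapshot, post: Option<&Path>) -> Option<PathBuf> {
    let before = match pre {
        // Without a baseline we cannot tell new certs from old ones: do nothing.
        PreUpSnapshot::Unavailable => return None,
        PreUpSnapshot::Captured(path) => path.as_deref(),
    };
    match post {
        // The cert path differs from the baseline, so `up` wrote a new cert.
        Some(after) if Some(after) != before => Some(after.to_path_buf()),
        // Same path as before, or no cert at all: leave the prior install alone.
        _ => None,
    }
}
```

All three "do nothing" cases (snapshot unavailable, path unchanged, no cert present) return `None`, which is the fail-safe property the description calls out.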
Known follow-ups (out of scope here)
- If `up` is hard-killed (the installer's 60s timeout) after writing the cert files but before `agent.json` is updated, the new files can orphan, or a fixed-name `gateway-ca.pem` may be left overwritten. Pre-existing; the clean fix is a transactional `up` or an MSI cert-directory snapshot.
- The gateway unregisters by `agent_id` without a connection-identity check; the probe's bounded drain compensates on this path (it runs before the service starts), but adding an identity/generation check gateway-side would close the race for good.
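
One possible shape for the gateway-side identity/generation check mentioned in the follow-ups, sketched in Rust under stated assumptions: `Registry`, `register`, `unregister`, and `is_online` are hypothetical names, and nothing here reflects the gateway's actual data structures.

```rust
use std::collections::HashMap;

/// Hypothetical gateway-side registry: agent id -> generation of the
/// connection that currently owns the registration.
#[derive(Default)]
pub struct Registry {
    next_generation: u64,
    live: HashMap<String, u64>,
}

impl Registry {
    /// Registers a connection for `agent_id`, replacing any earlier one,
    /// and returns the generation token that connection must present later.
    pub fn register(&mut self, agent_id: &str) -> u64 {
        self.next_generation += 1;
        self.live.insert(agent_id.to_owned(), self.next_generation);
        self.next_generation
    }

    /// Removes the entry only if `generation` still owns it.
    /// Returns true when an entry was removed.
    pub fn unregister(&mut self, agent_id: &str, generation: u64) -> bool {
        if self.live.get(agent_id) == Some(&generation) {
            self.live.remove(agent_id);
            true
        } else {
            // A newer connection took over; a stale close must not evict it.
            false
        }
    }

    /// Reports whether any connection is currently registered for `agent_id`.
    pub fn is_online(&self, agent_id: &str) -> bool {
        self.live.contains_key(agent_id)
    }
}
```

With this shape, a late close from the install-time probe presents an outdated generation and is ignored, so it cannot evict the service's newer registration.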