Add check subcommand for endpoint connectivity diagnostics#3
Conversation
DelineaLaari
left a comment
There was a problem hiding this comment.
Overall take
Strong PR. The scope is exactly right (replace ad-hoc PowerShell/bash with one supported tool), the design choices are sensible (SNI on hostnames, skip TLS on IPs, exit non-zero), and the test layer probes real listeners rather than mocks. The probe-layer code is readable and the data model (ProbeResult → TCPProbe → TLSProbe) is clean. Approve with comments — nothing here blocks merge, but two items are worth fixing before it goes out.
Worth fixing before merge
1. os.Exit(1) bypasses cobra error handling — internal/cli/check.go:138-140
if connchk.HasFailures(summary) {
os.Exit(1)
}
return nilCalling os.Exit inside RunE skips any deferred cleanup, and depending on whether SilenceUsage/SilenceErrors is set on the root command, you can also dump the cobra usage screen on top of an already-rendered failure report — exactly the wrong UX for a CI gate.
Cleaner: return a typed error and let main set the exit code. Even simpler interim fix — return a sentinel error and silence usage on the command:
checkCmd.SilenceUsage = true
checkCmd.SilenceErrors = true // your own rendering is already complete
// ...
if connchk.HasFailures(summary) {
return errProbesFailed
}Then in main, type-assert and os.Exit(1) on that.
2. The TLS handshake doesn't inherit the parent context — internal/connchk/connchk.go:305
ctx, cancel := context.WithDeadline(context.Background(), deadline)This re-roots the TLS handshake on context.Background() instead of the ctx passed all the way down from Run → probeOne → probePort. Right now nothing cancels the parent (see #3 below), so the practical impact is zero — but the moment you add Ctrl-C handling, the TCP dial and DNS lookup will cancel cleanly while the TLS handshake keeps running until its 5s deadline. Pass ctx through handshakeTLS and derive from it.
Non-blocking but worth tracking
3. No signal-driven cancellation — internal/cli/check.go:126
ctx := context.Background() — Ctrl-C does the right thing today only because every probe has its own timeout. With hundreds of targets at concurrency 10, that can be tens of seconds of "unresponsive after Ctrl-C." signal.NotifyContext(ctx, os.Interrupt, syscall.SIGTERM) is one line and worth the polish.
4. classifyTLSErr uses string matching — internal/connchk/connchk.go:325-336
case strings.Contains(msg, "certificate signed by unknown authority"),
strings.Contains(msg, "x509: certificate"):
return "...possible SSL inspection / proxy"The second branch matches anything containing x509: certificate, including x509: certificate is valid for X, not Y — that's a hostname mismatch, not an SSL-inspection signal. Labeling it as "possible SSL inspection / proxy" misleads operators in exactly the scenario this command is supposed to clarify.
Replace with errors.As against the typed errors:
var ua x509.UnknownAuthorityError
var ci x509.CertificateInvalidError
var hn x509.HostnameError
switch {
case errors.As(err, &ua):
return fmt.Sprintf("certificate signed by unknown CA (possible SSL inspection / proxy): %s", err)
case errors.As(err, &hn):
return fmt.Sprintf("certificate hostname mismatch: %s", err)
case errors.As(err, &ci):
return fmt.Sprintf("certificate invalid: %s", err)
case errors.Is(err, context.DeadlineExceeded):
return "TLS handshake timed out"
default:
return err.Error()
}5. Cert reporting only looks at CN — internal/connchk/connchk.go:317-320
Many modern certs have empty CommonName and identify hosts purely via SAN. You'd report cert: (issuer: …) with no subject. Fall back to certs[0].DNSNames[0] if CN is empty.
6. tlsPorts is a mutable package-level map — internal/connchk/connchk.go:31-41
The test mutates it (tlsPorts[port] = true; defer delete(tlsPorts, port) in connchk_test.go:113-114). Works because the test doesn't call t.Parallel(), but it's a footgun for whoever adds the next test in this package. Either:
- Make it a
varslice and pass intoRunviaCheckOptions.TLSPorts(defaulting to the canonical set) — also opens the door to--tls-ports=…if customers ever need it, or - Use
sync.RWMutexto gate it.
7. --insecure doesn't warn
Skipping cert verification is exactly the thing operators run by mistake and then trust the green checkmarks. One fmt.Fprintln(os.Stderr, "WARNING: --insecure disables TLS certificate validation") at the top of runCheck (and consider repeating it in renderSummary when set) is cheap and saves real grief.
8. --concurrency 0 silently falls back to default
Fine, but unexpected. Either error out (return fmt.Errorf("--concurrency must be > 0")) or log the substitution. Same for --timeout 0.
Nits
check.go:218— DNS line for IP literals says(IP literal).probeDNS(line 254) also stashes the IP intoAddressesso consumers can use it, butrenderTargetonly prints(IP literal)and never the address. Minor inconsistency — either don't populateAddresses, or print it.connchk.go:288-292— the TLS branch is twoelse ifs that sharetlsPorts[port]. Collapse to a singleif tlsPorts[port] { ... }with a nested host check for readability.check.go:193-198—groupKeyincludesdirectionbut the sort comparator doesn't. Stable but order is data-input-dependent. Sort by direction too if you want fully deterministic output.connchk.go:317-323— success path callstlsConn.Close()explicitly;probePortthen defer-closes the underlyingconn. Closing the same socket twice is harmless on POSIX but explicit-then-deferred is a code smell. Just rely on the deferred close.- README/CHANGELOG — well written, accurate, links into TOC. No changes.
- Test coverage — probe-layer integration tests are solid. Missing: a test for the
runCheckflow itself (parsing → filtering → output), and a race-detector run (go test -race) — easy to add to the CI matrix if it isn't already there.
Positives worth calling out
- The TCP-first / TLS-second flow on the same connection (
probePortlines 270-294) is correct and saves a round trip — many naive implementations dial twice. - The "skip TLS for IP" branch with an explicit
Skippedstatus (rather than silently dropping it) gives operators the right mental model. buildJobsdeduplicating ports across entries that share(service, region, target)is exactly the kind of thing you'd miss on a first pass and have to backfill later. Good that it's there from day one.- Treating CIDRs and
<tenant>placeholders ascontinuerather than errors keeps the tool aligned with the source-of-truth JSON without forcing operators to pre-filter. - The test at
connchk_test.go:60-94spins up a real listener instead of mockingnet.Dial— these tests will actually catch regressions.
DelineaLaari
left a comment
There was a problem hiding this comment.
Follow-up commit bb1a55a addresses every item from my earlier review — pre-merge blockers, non-blockers, and nits. go test -race ./... passes clean locally. LGTM.
Introduce a `check` subcommand that probes the endpoints declared in
network-requirements.json from the local machine — the Go equivalent of
the PowerShell/Linux diagnostic scripts customers run when validating
firewall rules.
For each unique hostname or IP the command runs three probes:
- DNS resolution (hostnames only; IP literals are skipped)
- TCP dial against every port published for the service
- TLS handshake on TLS-typical ports (443, 5671, 8883, 636, 993, 995,
465, 5986, 8443) with SNI when the target is a hostname
CIDR ranges and values still containing the <tenant> placeholder are
skipped automatically. The process exits non-zero when any probe fails
so the command can gate firewall-readiness scripts and CI pipelines.
Flags: --region, --service (matches the JSON `id`), --tenant,
--timeout (5s default), --concurrency (10 default), --insecure.
Co-Authored-By: Claude <noreply@anthropic.com>
connchk: - Move tlsPorts from a mutable package-level map onto CheckOptions.TLSPorts, defaulting to the canonical set when unset. Removes the test-only mutation footgun. No CLI flag exposed yet — wiring stays library-internal. - Thread the caller's ctx through handshakeTLS so signal-driven cancellation reaches TLS handshakes, not just DNS and TCP. - Replace classifyTLSErr's substring matching with errors.As against typed x509 errors (UnknownAuthorityError, HostnameError, CertificateInvalidError). Fixes the regression where hostname-mismatch errors were mislabelled as "possible SSL inspection / proxy". - Fall back to certs[0].DNSNames[0] when the leaf CommonName is empty (Azure and Let's Encrypt certs frequently ship empty CN). - Drop the explicit tlsConn.Close() — the deferred net.Conn close already releases the FD. - Collapse the duplicated tlsPorts[port] branches in probePort. - Sort results by (direction, service, region, target) for fully deterministic output regardless of input ordering. cli/check + main: - Return a sentinel error (ErrProbesFailed) instead of calling os.Exit inside RunE. main.go translates any non-nil error into exit code 1 — cobra handles the message rendering, SilenceUsage on checkCmd keeps the help dump out of the failure path. - Wire signal.NotifyContext(os.Interrupt, SIGTERM) into Run so Ctrl-C aborts in-flight probes instead of waiting per-probe timeouts. - Validate --timeout and --concurrency are > 0 at the CLI boundary; the library API keeps its Go-idiomatic "zero ⇒ default" semantics. - Print a stderr warning when --insecure is set, and add a sticky note in the summary so a green report can't be silently misread as proof of certificate validity. - Render the IP literal on the DNS line for IP targets (was previously just "skipped"). - Switch render output to cmd.OutOrStdout() / cmd.ErrOrStderr() so tests can capture output cleanly. Tests: - Drop the package-level tlsPorts mutation; tests now pass TLSPorts via CheckOptions. - Add table-driven classifyTLSErr coverage including the hostname-mismatch regression case. - Add TestRunCheckFlowAgainstLocalListener covering the full load → parse → filter → probe → render pipeline against a real loopback listener with a generated JSON fixture. - Add flag-validation tests for the new --timeout/--concurrency guards and the unknown --service error. make vet and `go test ./... -race` both green. Co-Authored-By: Claude <noreply@anthropic.com>
bb1a55a to
be3baec
Compare
Summary
Adds a new
checksubcommand that probes the endpoints declared innetwork-requirements.jsonfrom the local machine — the Go equivalent of the PowerShell / Linux diagnostic scripts customers run today when troubleshooting why an on-prem engine cannot reach the Delinea Platform.For every unique hostname or IP it runs three probes:
CIDR ranges and values still containing
<tenant>are skipped automatically. The process exits non-zero on any failed probe so it can gate firewall-readiness scripts and CI.Why this matters
Today the failure mode is the same on every install we touch: customer files a ticket saying "the engine cannot connect", support asks them to run a hand-rolled PowerShell or bash script (the two snippets that motivated this work), and we burn cycles iterating over chat. Embedding the same diagnostics directly in our own CLI gives us:
network-requirements.jsonthe firewall rules come from, so DNS, TCP, and TLS coverage stays in sync automatically when new services are added.--insecureis available when the customer only wants to confirm the handshake completes.Use cases
--service platform_engine_messaging --region <region>to instantly isolate whether the problem is DNS, TCP, or TLS — no more guessing from engine logs.--tenant <name>the<tenant>placeholders are substituted before probing, so tenant-specific hostnames (<tenant>.delinea.app, etc.) are exercised too.--region <code>keeps the probe set focused.Examples
```bash
./delinea-netconfig check --service platform_engine_messaging --region us
./delinea-netconfig check --service storage --region us
```
Example output
```text
./delinea-netconfig check --service storage --region us
No file or URL specified, using default: https://setup.delinea.app/network-requirements
Parsed network requirements version 1.4.0 (updated: 2026-04-13T00:00:00Z)
Filtered to global + us: 1 of 1 entries
=== Connectivity Check ===
Source: https://setup.delinea.app/network-requirements
Targets: deriving from 1 entries...
Timeout: 5s Concurrency: 10 Insecure TLS: false
[outbound] storage / global
authstorprod8138094.blob.core.windows.net
✓ DNS: 20.150.76.4, 20.150.9.196, 20.150.9.228 (95ms)
✓ TCP 443 reachable (107ms)
✓ TLS 443 handshake OK (230ms) cert: *.blob.core.windows.net (issuer: Microsoft TLS G2 RSA CA OCSP 04)
downloads.engine-pool.services.delinea.app
✓ DNS: 150.171.109.68, 2603:1061:14:43::1 (47ms)
✓ TCP 443 reachable (48ms)
✓ TLS 443 handshake OK (109ms) cert: downloads.engine-pool.services.delinea.app (issuer: GeoTrust TLS RSA CA G1)
enginepool-downloads-prod.azureedge.net
✓ DNS: 150.171.109.68, 2603:1061:14:43::1 (47ms)
✓ TCP 443 reachable (40ms)
✓ TLS 443 handshake OK (114ms) cert: *.azureedge.net (issuer: Microsoft TLS G2 ECC CA OCSP 02)
enginepoolupdateprod.blob.core.windows.net
✓ DNS: 57.150.183.225, 57.150.75.33, 57.150.190.225 (111ms)
✓ TCP 443 reachable (44ms)
✓ TLS 443 handshake OK (116ms) cert: *.blob.core.windows.net (issuer: Microsoft Azure RSA TLS Issuing CA 03)
=== Summary ===
Targets probed: 4
DNS: 4 ok, 0 failed
TCP: 4 ok, 0 failed
TLS handshakes: 4 ok, 0 failed, 0 skipped
Result: ✓ all probes succeeded.
```
Flags
Test plan
Co-Authored-By: Claude noreply@anthropic.com