Add check subcommand for endpoint connectivity diagnostics #3

Merged
andrespires merged 2 commits into main from feat/connectivity-check on May 14, 2026

@andrespires

Collaborator

Summary

Adds a new check subcommand that probes the endpoints declared in network-requirements.json from the local machine — the Go equivalent of the PowerShell / Linux diagnostic scripts customers run today when troubleshooting why an on-prem engine cannot reach the Delinea Platform.

For every unique hostname or IP it runs three probes:

  1. DNS resolution (hostnames only; IP literals are skipped).
  2. TCP dial against every port published for the service.
  3. TLS handshake on TLS-typical ports (443, 5671, 8883, 636, 993, 995, 465, 5986, 8443) with SNI when the target is a hostname, so certificate validation is meaningful.

CIDR ranges and values still containing <tenant> are skipped automatically. The process exits non-zero on any failed probe so it can gate firewall-readiness scripts and CI.
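
The skip rules above can be sketched as a small classifier. This is a minimal illustration under assumptions, not the code from this PR: `classifyTarget` and its return labels are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// classifyTarget is a hypothetical helper: it decides how a value taken from
// network-requirements.json would be treated before any probe runs.
func classifyTarget(value string) string {
	switch {
	case strings.Contains(value, "<tenant>"):
		return "skip-placeholder" // unresolved tenant placeholder
	case strings.Contains(value, "/"):
		if _, _, err := net.ParseCIDR(value); err == nil {
			return "skip-cidr" // ranges are not probed
		}
		return "invalid"
	case net.ParseIP(value) != nil:
		return "ip" // no DNS lookup for IP literals
	default:
		return "hostname" // DNS, TCP, and TLS with SNI
	}
}

func main() {
	for _, v := range []string{
		"<tenant>.delinea.app",
		"10.0.0.0/8",
		"20.150.76.4",
		"downloads.engine-pool.services.delinea.app",
	} {
		fmt.Printf("%-45s %s\n", v, classifyTarget(v))
	}
}
```

Classifying before probing keeps the probe loop free of special cases: only `ip` and `hostname` values ever reach the dialer.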

Why this matters

Today the failure mode is the same on every install we touch: a customer files a ticket saying "the engine cannot connect", support asks them to run a hand-rolled PowerShell or bash script (the two snippets that motivated this work), and we burn cycles iterating over chat. Embedding the same diagnostics directly in our own CLI gives us:

  • A single supported tool instead of ad-hoc scripts that drift from the published requirements. The probe set is driven by the same network-requirements.json the firewall rules come from, so DNS, TCP, and TLS coverage stays in sync automatically when new services are added.
  • Honest TLS verification. The handshake uses SNI on hostnames, so a corporate SSL-inspecting proxy injecting its own certificate is reported as a TLS failure rather than masquerading as success. --insecure is available when the customer only wants to confirm the handshake completes.
  • A non-zero exit code on failure, so it slots into firewall-readiness scripts and CI pipelines: the build or upgrade fails before the engine is shipped to prod, not after.
  • Region and service scoping, matching the rest of the CLI. Operators can validate a single service after a firewall change without re-probing everything.

Use cases

  • Pre-install / pre-upgrade gate. Run on the box that will host the on-prem engine before installation to confirm every published egress works end-to-end. Catches missing firewall rules, broken DNS, and SSL-inspection problems before the engine binary is even copied over.
  • Incident triage. When an engine reports a connectivity error to a specific subsystem (e.g. messaging), run check --service platform_engine_messaging --region <region> to isolate whether the problem is DNS, TCP, or TLS, with no more guessing from engine logs.
  • Firewall change validation. After the network team opens or rotates rules, re-run scoped by service or region and use the non-zero exit code to assert the change landed.
  • SSL-inspection / proxy detection. TLS handshakes with proper SNI surface CA-injection by middleboxes as a clear handshake failure, with the bad issuer printed in the report.
  • Smoke check after tenant cutover. With --tenant <name> the <tenant> placeholders are substituted before probing, so tenant-specific hostnames (<tenant>.delinea.app, etc.) are exercised too.
  • Regional rollouts. When bringing up a new region or tenant in a region, --region <code> keeps the probe set focused.
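
The tenant substitution mentioned in the cutover use case could look roughly like this. It is a sketch under assumptions: `resolveTenant` is a hypothetical helper and `acme` is a made-up tenant name; the PR text only states that `<tenant>` placeholders are substituted before probing and that unresolved ones are skipped.

```go
package main

import (
	"fmt"
	"strings"
)

const tenantPlaceholder = "<tenant>"

// resolveTenant is a hypothetical helper. It substitutes the placeholder when
// a tenant name is given and reports whether the result can be probed.
func resolveTenant(value, tenant string) (string, bool) {
	if tenant != "" {
		value = strings.ReplaceAll(value, tenantPlaceholder, tenant)
	}
	return value, !strings.Contains(value, tenantPlaceholder)
}

func main() {
	host, ok := resolveTenant("<tenant>.delinea.app", "acme")
	fmt.Println(host, ok)

	host, ok = resolveTenant("<tenant>.delinea.app", "")
	fmt.Println(host, ok)
}
```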

Examples

```bash
./delinea-netconfig check --service platform_engine_messaging --region us
./delinea-netconfig check --service storage --region us
```

Example output

```text
./delinea-netconfig check --service storage --region us
No file or URL specified, using default: https://setup.delinea.app/network-requirements
Parsed network requirements version 1.4.0 (updated: 2026-04-13T00:00:00Z)
Filtered to global + us: 1 of 1 entries
=== Connectivity Check ===
Source: https://setup.delinea.app/network-requirements
Targets: deriving from 1 entries...
Timeout: 5s Concurrency: 10 Insecure TLS: false

[outbound] storage / global
authstorprod8138094.blob.core.windows.net
✓ DNS: 20.150.76.4, 20.150.9.196, 20.150.9.228 (95ms)
✓ TCP 443 reachable (107ms)
✓ TLS 443 handshake OK (230ms) cert: *.blob.core.windows.net (issuer: Microsoft TLS G2 RSA CA OCSP 04)
downloads.engine-pool.services.delinea.app
✓ DNS: 150.171.109.68, 2603:1061:14:43::1 (47ms)
✓ TCP 443 reachable (48ms)
✓ TLS 443 handshake OK (109ms) cert: downloads.engine-pool.services.delinea.app (issuer: GeoTrust TLS RSA CA G1)
enginepool-downloads-prod.azureedge.net
✓ DNS: 150.171.109.68, 2603:1061:14:43::1 (47ms)
✓ TCP 443 reachable (40ms)
✓ TLS 443 handshake OK (114ms) cert: *.azureedge.net (issuer: Microsoft TLS G2 ECC CA OCSP 02)
enginepoolupdateprod.blob.core.windows.net
✓ DNS: 57.150.183.225, 57.150.75.33, 57.150.190.225 (111ms)
✓ TCP 443 reachable (44ms)
✓ TLS 443 handshake OK (116ms) cert: *.blob.core.windows.net (issuer: Microsoft Azure RSA TLS Issuing CA 03)

=== Summary ===
Targets probed: 4
DNS: 4 ok, 0 failed
TCP: 4 ok, 0 failed
TLS handshakes: 4 ok, 0 failed, 0 skipped

Result: ✓ all probes succeeded.
```

Flags

  • `--region` — limit probes to global + a single region.
  • `--service` — limit probes to a single service `id`.
  • `--tenant` — substitute `<tenant>` placeholders before probing.
  • `--timeout` — per-probe timeout (default 5s).
  • `--concurrency` — parallel probes (default 10).
  • `--insecure` — skip TLS certificate validation (handshake-reachability only).

Test plan

  • `go vet ./...` clean
  • `go test ./... -race` green (new unit + integration tests for the probe engine using a real TCP listener and `httptest` TLS server)
  • Manual smoke run against `testdata/network-requirements.json --service platform_engine_messaging --region eu` — DNS, TCP, and TLS all OK with cert subject/issuer printed
  • Manual run against the live default URL (`./delinea-netconfig check --service storage --region us`) — output captured above
  • Reviewer: please validate from a region other than where you primarily work, to confirm regional filtering surfaces the expected endpoints
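
For readers who want to reproduce the "real listener" style of test locally, here is a minimal sketch that dials an `httptest` TLS server and runs the handshake on the same connection. The function names `probeTCPThenTLS` and `probeLocalTLSServer` are hypothetical and are not taken from the PR's test suite.

```go
package main

import (
	"context"
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// probeTCPThenTLS dials addr once, then runs the TLS handshake on that same
// connection. It reports how far the probe got.
func probeTCPThenTLS(ctx context.Context, addr string, cfg *tls.Config, timeout time.Duration) (tcpOK, tlsOK bool, err error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	var d net.Dialer
	conn, err := d.DialContext(ctx, "tcp", addr)
	if err != nil {
		return false, false, err
	}
	defer conn.Close() // closes the socket under the TLS layer as well

	if err := tls.Client(conn, cfg).HandshakeContext(ctx); err != nil {
		return true, false, err
	}
	return true, true, nil
}

// probeLocalTLSServer starts a throwaway TLS server on loopback and probes it.
func probeLocalTLSServer() (tcpOK, tlsOK bool, err error) {
	srv := httptest.NewTLSServer(http.HandlerFunc(func(http.ResponseWriter, *http.Request) {}))
	defer srv.Close()

	addr := srv.Listener.Addr().String()
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return false, false, err
	}

	pool := x509.NewCertPool()
	pool.AddCert(srv.Certificate()) // trust only the test server's certificate
	cfg := &tls.Config{RootCAs: pool, ServerName: host}

	return probeTCPThenTLS(context.Background(), addr, cfg, 2*time.Second)
}

func main() {
	tcpOK, tlsOK, err := probeLocalTLSServer()
	fmt.Println(tcpOK, tlsOK, err)
}
```

Because certificate validation stays on, this kind of test exercises the same code path a real probe would take rather than a mocked dialer.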

Co-Authored-By: Claude <noreply@anthropic.com>

@andrespires andrespires requested a review from DelineaLaari May 14, 2026 16:09

@DelineaLaari DelineaLaari left a comment

Collaborator

Overall take

Strong PR. The scope is exactly right (replace ad-hoc PowerShell/bash with one supported tool), the design choices are sensible (SNI on hostnames, skip TLS on IPs, exit non-zero), and the test layer probes real listeners rather than mocks. The probe-layer code is readable and the data model (ProbeResult, TCPProbe, TLSProbe) is clean. Approve with comments: nothing here blocks merge, but two items are worth fixing before it goes out.

Worth fixing before merge

1. os.Exit(1) bypasses cobra error handling — internal/cli/check.go:138-140

if connchk.HasFailures(summary) {
    os.Exit(1)
}
return nil

Calling os.Exit inside RunE skips any deferred cleanup, and depending on whether SilenceUsage/SilenceErrors is set on the root command, you can also dump the cobra usage screen on top of an already-rendered failure report — exactly the wrong UX for a CI gate.

Cleaner: return a typed error and let main set the exit code. Even simpler interim fix — return a sentinel error and silence usage on the command:

checkCmd.SilenceUsage = true
checkCmd.SilenceErrors = true   // your own rendering is already complete
// ...
if connchk.HasFailures(summary) {
    return errProbesFailed
}

Then in main, type-assert and os.Exit(1) on that.
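
A stdlib-only sketch of that shape, with cobra left out so it runs on its own. `runCheck` stands in for the RunE body, its `failed` parameter stands in for `connchk.HasFailures(summary)`, and `exitCode` is a hypothetical helper name. With a sentinel value, `errors.Is` does the job of the type assertion.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errProbesFailed means "the report is already rendered, exit non-zero".
var errProbesFailed = errors.New("one or more probes failed")

// runCheck stands in for the cobra RunE body. The failed argument is a
// placeholder for connchk.HasFailures(summary).
func runCheck(failed bool) error {
	if failed {
		return fmt.Errorf("check: %w", errProbesFailed)
	}
	return nil
}

// exitCode maps the command's error to a process exit status.
func exitCode(err error) int {
	switch {
	case err == nil:
		return 0
	case errors.Is(err, errProbesFailed):
		return 1 // report already printed, nothing more to render
	default:
		fmt.Fprintln(os.Stderr, "error:", err)
		return 1
	}
}

func main() {
	// Only main calls os.Exit, so deferred cleanup inside runCheck still runs.
	if code := exitCode(runCheck(false)); code != 0 {
		os.Exit(code)
	}
}
```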

2. The TLS handshake doesn't inherit the parent context — internal/connchk/connchk.go:305

ctx, cancel := context.WithDeadline(context.Background(), deadline)

This re-roots the TLS handshake on context.Background() instead of the ctx passed all the way down from Run → probeOne → probePort. Right now nothing cancels the parent (see item 3 below), so the practical impact is zero. But the moment you add Ctrl-C handling, the TCP dial and DNS lookup will cancel cleanly while the TLS handshake keeps running until its 5s deadline. Pass ctx through handshakeTLS and derive from it.
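
Roughly what I have in mind, as a sketch: the signature below is my assumption, not the one in connchk.go. `demoCancelledParent` uses `net.Pipe` in place of a real socket to show that a cancelled parent stops the handshake without waiting for the per-probe timeout.

```go
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

// handshakeTLS derives its deadline from the caller's ctx, so cancelling the
// parent aborts the handshake along with the DNS and TCP steps.
func handshakeTLS(ctx context.Context, conn net.Conn, serverName string, timeout time.Duration) error {
	hctx, cancel := context.WithTimeout(ctx, timeout) // child of ctx, not Background()
	defer cancel()

	tlsConn := tls.Client(conn, &tls.Config{ServerName: serverName})
	return tlsConn.HandshakeContext(hctx)
}

// demoCancelledParent cancels the parent before the handshake starts and
// reports how long the call took and which error came back.
func demoCancelledParent() (time.Duration, error) {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()

	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate Ctrl-C

	start := time.Now()
	err := handshakeTLS(ctx, client, "example.test", 5*time.Second)
	return time.Since(start), err
}

func main() {
	elapsed, err := demoCancelledParent()
	fmt.Println(elapsed < time.Second, err)
}
```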

Non-blocking but worth tracking

3. No signal-driven cancellation — internal/cli/check.go:126

ctx := context.Background() — Ctrl-C does the right thing today only because every probe has its own timeout. With hundreds of targets at concurrency 10, that can be tens of seconds of "unresponsive after Ctrl-C." signal.NotifyContext(ctx, os.Interrupt, syscall.SIGTERM) is one line and worth the polish.
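
A sketch of the wiring, assuming a probe loop that checks ctx between targets. `newSignalContext`, `runProbes`, and `runAfterStop` are hypothetical names; only `signal.NotifyContext` is the real API.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// newSignalContext returns a context that is cancelled on Ctrl-C or SIGTERM.
// Call stop once the run is over to release the signal handler.
func newSignalContext(parent context.Context) (context.Context, context.CancelFunc) {
	return signal.NotifyContext(parent, os.Interrupt, syscall.SIGTERM)
}

// runProbes is a placeholder for the probe loop: it returns early when ctx is
// cancelled instead of working through the remaining targets.
func runProbes(ctx context.Context, targets []string) (int, error) {
	probed := 0
	for range targets {
		if err := ctx.Err(); err != nil {
			return probed, err
		}
		probed++ // a real probe would take ctx and honour it
	}
	return probed, nil
}

// runAfterStop simulates an interrupt by calling stop before the loop starts;
// stop cancels the context just as a delivered signal would.
func runAfterStop(targets []string) (int, error) {
	ctx, stop := newSignalContext(context.Background())
	stop()
	return runProbes(ctx, targets)
}

func main() {
	ctx, stop := newSignalContext(context.Background())
	defer stop()

	n, err := runProbes(ctx, []string{"a.example", "b.example"})
	fmt.Println(n, err)
}
```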

4. classifyTLSErr uses string matching — internal/connchk/connchk.go:325-336

case strings.Contains(msg, "certificate signed by unknown authority"),
    strings.Contains(msg, "x509: certificate"):
    return "...possible SSL inspection / proxy"

The second branch matches anything containing x509: certificate, including x509: certificate is valid for X, not Y — that's a hostname mismatch, not an SSL-inspection signal. Labeling it as "possible SSL inspection / proxy" misleads operators in exactly the scenario this command is supposed to clarify.

Replace with errors.As against the typed errors:

var ua x509.UnknownAuthorityError
var ci x509.CertificateInvalidError
var hn x509.HostnameError
switch {
case errors.As(err, &ua):
    return fmt.Sprintf("certificate signed by unknown CA (possible SSL inspection / proxy): %s", err)
case errors.As(err, &hn):
    return fmt.Sprintf("certificate hostname mismatch: %s", err)
case errors.As(err, &ci):
    return fmt.Sprintf("certificate invalid: %s", err)
case errors.Is(err, context.DeadlineExceeded):
    return "TLS handshake timed out"
default:
    return err.Error()
}
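
To convince yourself the typed matching survives wrapping, here is a self-contained variant that returns short labels instead of full messages. The labels and the `sampleErrors` helper are illustrative only; the hand-built errors approximate the shape of what a failed handshake can return.

```go
package main

import (
	"context"
	"crypto/x509"
	"errors"
	"fmt"
)

// classifyTLSErr returns a short label for the failure class.
func classifyTLSErr(err error) string {
	var ua x509.UnknownAuthorityError
	var hn x509.HostnameError
	var ci x509.CertificateInvalidError
	switch {
	case errors.As(err, &ua):
		return "unknown-ca" // the only case that hints at SSL inspection
	case errors.As(err, &hn):
		return "hostname-mismatch"
	case errors.As(err, &ci):
		return "certificate-invalid"
	case errors.Is(err, context.DeadlineExceeded):
		return "timeout"
	default:
		return "other"
	}
}

// sampleErrors builds wrapped errors for each class.
func sampleErrors() map[string]error {
	return map[string]error{
		"unknown-ca": fmt.Errorf("handshake: %w", x509.UnknownAuthorityError{}),
		"hostname": fmt.Errorf("handshake: %w", x509.HostnameError{
			Certificate: &x509.Certificate{},
			Host:        "wrong.example",
		}),
		"expired": fmt.Errorf("handshake: %w", x509.CertificateInvalidError{Reason: x509.Expired}),
		"timeout": fmt.Errorf("handshake: %w", context.DeadlineExceeded),
	}
}

func main() {
	for name, err := range sampleErrors() {
		fmt.Printf("%s -> %s\n", name, classifyTLSErr(err))
	}
}
```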

5. Cert reporting only looks at CN — internal/connchk/connchk.go:317-320

Many modern certs have an empty CommonName and identify hosts purely via SAN. You'd report cert: (issuer: …) with no subject. Fall back to certs[0].DNSNames[0] if CN is empty and DNSNames is non-empty.
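
A sketch of the fallback with the bounds check included. `subjectLabel` and `certLabel` are hypothetical names, and the final placeholder string is my own choice for the case where neither field is set.

```go
package main

import (
	"crypto/x509"
	"fmt"
)

// subjectLabel prefers the CommonName and falls back to the first SAN entry.
func subjectLabel(commonName string, dnsNames []string) string {
	if commonName != "" {
		return commonName
	}
	if len(dnsNames) > 0 {
		return dnsNames[0]
	}
	return "(no subject name)" // neither CN nor SAN present
}

// certLabel applies subjectLabel to a parsed leaf certificate.
func certLabel(cert *x509.Certificate) string {
	return subjectLabel(cert.Subject.CommonName, cert.DNSNames)
}

func main() {
	sanOnly := &x509.Certificate{DNSNames: []string{"*.blob.core.windows.net"}}
	fmt.Println(certLabel(sanOnly))
}
```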

6. tlsPorts is a mutable package-level map — internal/connchk/connchk.go:31-41

The test mutates it (tlsPorts[port] = true; defer delete(tlsPorts, port) in connchk_test.go:113-114). Works because the test doesn't call t.Parallel(), but it's a footgun for whoever adds the next test in this package. Either:

  • Make it a var slice and pass into Run via CheckOptions.TLSPorts (defaulting to the canonical set) — also opens the door to --tls-ports=… if customers ever need it, or
  • Use sync.RWMutex to gate it.
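
The first option might look like this. It is a sketch: the `TLSPorts` field name matches my suggestion above, while `defaultTLSPorts` and `isTLSPort` are hypothetical names.

```go
package main

import "fmt"

// defaultTLSPorts mirrors the TLS-typical ports listed in the PR summary.
var defaultTLSPorts = []int{443, 5671, 8883, 636, 993, 995, 465, 5986, 8443}

// CheckOptions carries per-run settings, so tests never touch package state.
type CheckOptions struct {
	TLSPorts []int // nil or empty means "use defaultTLSPorts"
}

// isTLSPort reports whether a TLS handshake should be attempted on port.
func (o CheckOptions) isTLSPort(port int) bool {
	ports := o.TLSPorts
	if len(ports) == 0 {
		ports = defaultTLSPorts
	}
	for _, p := range ports {
		if p == port {
			return true
		}
	}
	return false
}

func main() {
	def := CheckOptions{}
	custom := CheckOptions{TLSPorts: []int{18443}} // e.g. a test listener port

	fmt.Println(def.isTLSPort(443), def.isTLSPort(80))
	fmt.Println(custom.isTLSPort(18443), custom.isTLSPort(443))
}
```

A test can then pass its ephemeral listener port through the options and run in parallel with other tests without any locking.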

7. --insecure doesn't warn

Skipping cert verification is exactly the thing operators run by mistake and then trust the green checkmarks. One fmt.Fprintln(os.Stderr, "WARNING: --insecure disables TLS certificate validation") at the top of runCheck (and consider repeating it in renderSummary when set) is cheap and saves real grief.

8. --concurrency 0 silently falls back to default

Fine, but unexpected. Either error out (return fmt.Errorf("--concurrency must be > 0")) or log the substitution. Same for --timeout 0.
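
The error-out variant is only a few lines. `validateFlags` is a hypothetical name, and the messages follow the wording suggested above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// validateFlags rejects values that would otherwise be silently replaced by
// defaults further down the stack.
func validateFlags(timeout time.Duration, concurrency int) error {
	if timeout <= 0 {
		return errors.New("--timeout must be > 0")
	}
	if concurrency <= 0 {
		return errors.New("--concurrency must be > 0")
	}
	return nil
}

func main() {
	fmt.Println(validateFlags(5*time.Second, 10))
	fmt.Println(validateFlags(5*time.Second, 0))
}
```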

Nits

  • check.go:218 — DNS line for IP literals says (IP literal). probeDNS (line 254) also stashes the IP into Addresses so consumers can use it, but renderTarget only prints (IP literal) and never the address. Minor inconsistency — either don't populate Addresses, or print it.
  • connchk.go:288-292 — the TLS branch is two else ifs that share tlsPorts[port]. Collapse to a single if tlsPorts[port] { ... } with a nested host check for readability.
  • check.go:193-198: groupKey includes direction but the sort comparator doesn't. Stable but order is data-input-dependent. Sort by direction too if you want fully deterministic output.
  • connchk.go:317-323 — success path calls tlsConn.Close() explicitly; probePort then defer-closes the underlying conn. Closing the same socket twice is harmless on POSIX but explicit-then-deferred is a code smell. Just rely on the deferred close.
  • README/CHANGELOG — well written, accurate, links into TOC. No changes.
  • Test coverage — probe-layer integration tests are solid. Missing: a test for the runCheck flow itself (parsing → filtering → output), and a race-detector run (go test -race) — easy to add to the CI matrix if it isn't already there.
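
For the sort nit, a comparator over the full key could look like the sketch below. The `result` struct and `sortResults` are hypothetical stand-ins for whatever check.go actually sorts.

```go
package main

import (
	"fmt"
	"sort"
)

// result holds only the fields that take part in ordering.
type result struct {
	Direction, Service, Region, Target string
}

// sortResults orders by direction, then service, region, and target, so the
// output no longer depends on the order of entries in the input JSON.
func sortResults(rs []result) {
	sort.Slice(rs, func(i, j int) bool {
		a, b := rs[i], rs[j]
		if a.Direction != b.Direction {
			return a.Direction < b.Direction
		}
		if a.Service != b.Service {
			return a.Service < b.Service
		}
		if a.Region != b.Region {
			return a.Region < b.Region
		}
		return a.Target < b.Target
	})
}

func main() {
	rs := []result{
		{"outbound", "storage", "global", "b.example.test"},
		{"inbound", "storage", "global", "a.example.test"},
		{"outbound", "storage", "global", "a.example.test"},
	}
	sortResults(rs)
	for _, r := range rs {
		fmt.Println(r.Direction, r.Service, r.Region, r.Target)
	}
}
```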

Positives worth calling out

  • The TCP-first / TLS-second flow on the same connection (probePort lines 270-294) is correct and saves a round trip — many naive implementations dial twice.
  • The "skip TLS for IP" branch with an explicit Skipped status (rather than silently dropping it) gives operators the right mental model.
  • buildJobs deduplicating ports across entries that share (service, region, target) is exactly the kind of thing you'd miss on a first pass and have to backfill later. Good that it's there from day one.
  • Treating CIDRs and <tenant> placeholders as continue rather than errors keeps the tool aligned with the source-of-truth JSON without forcing operators to pre-filter.
  • The test at connchk_test.go:60-94 spins up a real listener instead of mocking net.Dial — these tests will actually catch regressions.
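
For anyone reading along, the port deduplication praised above can be pictured as follows. This is my own sketch of the idea, not the PR's buildJobs: the `entry` and `job` types and the `hasPort` helper are hypothetical, and the hostname is a placeholder.

```go
package main

import "fmt"

// entry is one row of the requirements file, reduced to what matters here.
type entry struct {
	Service, Region, Target string
	Ports                   []int
}

// job is one probe unit: a target plus every distinct port to dial.
type job struct {
	Service, Region, Target string
	Ports                   []int
}

// buildJobs merges entries that share (service, region, target) and keeps
// each port once, preserving first-seen order.
func buildJobs(entries []entry) []job {
	type key struct{ service, region, target string }
	index := make(map[key]int)
	var jobs []job
	for _, e := range entries {
		k := key{e.Service, e.Region, e.Target}
		i, seen := index[k]
		if !seen {
			i = len(jobs)
			index[k] = i
			jobs = append(jobs, job{Service: e.Service, Region: e.Region, Target: e.Target})
		}
		for _, p := range e.Ports {
			if !hasPort(jobs[i].Ports, p) {
				jobs[i].Ports = append(jobs[i].Ports, p)
			}
		}
	}
	return jobs
}

// hasPort reports whether p is already in ports.
func hasPort(ports []int, p int) bool {
	for _, q := range ports {
		if q == p {
			return true
		}
	}
	return false
}

func main() {
	jobs := buildJobs([]entry{
		{"platform_engine_messaging", "us", "mq.example.test", []int{443, 5671}},
		{"platform_engine_messaging", "us", "mq.example.test", []int{443}},
	})
	fmt.Println(len(jobs), jobs[0].Ports)
}
```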

@DelineaLaari DelineaLaari left a comment

Collaborator

Follow-up commit bb1a55a addresses every item from my earlier review — pre-merge blockers, non-blockers, and nits. go test -race ./... passes clean locally. LGTM.

andrespires and others added 2 commits May 14, 2026 13:46
Introduce a `check` subcommand that probes the endpoints declared in
network-requirements.json from the local machine — the Go equivalent of
the PowerShell/Linux diagnostic scripts customers run when validating
firewall rules.

For each unique hostname or IP the command runs three probes:

  - DNS resolution (hostnames only; IP literals are skipped)
  - TCP dial against every port published for the service
  - TLS handshake on TLS-typical ports (443, 5671, 8883, 636, 993, 995,
    465, 5986, 8443) with SNI when the target is a hostname

CIDR ranges and values still containing the <tenant> placeholder are
skipped automatically. The process exits non-zero when any probe fails
so the command can gate firewall-readiness scripts and CI pipelines.

Flags: --region, --service (matches the JSON `id`), --tenant,
--timeout (5s default), --concurrency (10 default), --insecure.

Co-Authored-By: Claude <noreply@anthropic.com>

connchk:
- Move tlsPorts from a mutable package-level map onto CheckOptions.TLSPorts,
  defaulting to the canonical set when unset. Removes the test-only mutation
  footgun. No CLI flag exposed yet — wiring stays library-internal.
- Thread the caller's ctx through handshakeTLS so signal-driven cancellation
  reaches TLS handshakes, not just DNS and TCP.
- Replace classifyTLSErr's substring matching with errors.As against typed
  x509 errors (UnknownAuthorityError, HostnameError, CertificateInvalidError).
  Fixes the regression where hostname-mismatch errors were mislabelled as
  "possible SSL inspection / proxy".
- Fall back to certs[0].DNSNames[0] when the leaf CommonName is empty (Azure
  and Let's Encrypt certs frequently ship empty CN).
- Drop the explicit tlsConn.Close() — the deferred net.Conn close already
  releases the FD.
- Collapse the duplicated tlsPorts[port] branches in probePort.
- Sort results by (direction, service, region, target) for fully
  deterministic output regardless of input ordering.

cli/check + main:
- Return a sentinel error (ErrProbesFailed) instead of calling os.Exit inside
  RunE. main.go translates any non-nil error into exit code 1 — cobra handles
  the message rendering, SilenceUsage on checkCmd keeps the help dump out of
  the failure path.
- Wire signal.NotifyContext(os.Interrupt, SIGTERM) into Run so Ctrl-C aborts
  in-flight probes instead of waiting per-probe timeouts.
- Validate --timeout and --concurrency are > 0 at the CLI boundary; the
  library API keeps its Go-idiomatic "zero ⇒ default" semantics.
- Print a stderr warning when --insecure is set, and add a sticky note in
  the summary so a green report can't be silently misread as proof of
  certificate validity.
- Render the IP literal on the DNS line for IP targets (was previously just
  "skipped").
- Switch render output to cmd.OutOrStdout() / cmd.ErrOrStderr() so tests can
  capture output cleanly.

Tests:
- Drop the package-level tlsPorts mutation; tests now pass TLSPorts via
  CheckOptions.
- Add table-driven classifyTLSErr coverage including the hostname-mismatch
  regression case.
- Add TestRunCheckFlowAgainstLocalListener covering the full
  load → parse → filter → probe → render pipeline against a real loopback
  listener with a generated JSON fixture.
- Add flag-validation tests for the new --timeout/--concurrency guards and
  the unknown --service error.

make vet and `go test ./... -race` both green.

Co-Authored-By: Claude <noreply@anthropic.com>
@andrespires andrespires force-pushed the feat/connectivity-check branch from bb1a55a to be3baec Compare May 14, 2026 17:46
@andrespires andrespires merged commit 9afc661 into main May 14, 2026
7 checks passed

2 participants