Expand Agent flare to collect Agent Data Plane telemetry, config, and runtime state #1703

Description

@jszwedko

Background

Agent Data Plane (ADP) runs as a separate process alongside the Core Agent. Today, when a user runs agent flare, only ADP's logs, status, and live telemetry land in the flare. We'd like to expand the collected data to make debugging easier.

The plumbing for ADP→Agent flare exchange already exists:

  • Agent side: comp/core/remoteagentregistry/ implements gRPC clients for three services — StatusProvider, TelemetryProvider, FlareProvider — and calls each on every registered remote agent.
  • ADP side: bin/agent-data-plane/src/internal/remote_agent.rs registers all three services. get_status_details and get_telemetry are implemented; get_flare_files is a stub returning an empty HashMap.

The flare-builder hook on the Agent side (comp/core/remoteagentregistry/impl/services.go:fillFlare) already absorbs whatever get_flare_files returns and writes it into the archive. Because ADP's stub returns nothing, no flare-time ADP artifacts beyond logs and live telemetry are captured.

Currently collected

  • logs/agent-data-plane.log (active), logs/agent-data-plane.log.1 (rotated) — picked up by the generic log-dir sweep in pkg/flare/common/common.go:90.
  • telemetry.log — already contains ADP's Prometheus telemetry via the registry's GetTelemetry fan-out and the registryCollector (comp/core/remoteagentregistry/impl/services.go:122).
  • status.log — already contains ADP's status section (version, git commit, architecture, start time, DSD metrics) populated by RemoteAgentImpl::get_status_details in bin/agent-data-plane/src/internal/remote_agent.rs:444.

Proposed additions

All artifacts below ship as entries in the HashMap<String, Vec<u8>> returned by ADP's get_flare_files. The Core Agent does not scrape ADP endpoints directly; everything flows through the existing Remote Agent Registry gRPC interface. Files land under a data-plane/ subdirectory (path encoded in the map key returned by ADP).
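As a rough illustration of the map shape described above, here is a minimal sketch. The helper names (`add_artifact`, `build_flare_files`) and the string inputs are hypothetical stand-ins, not existing ADP code; only the `HashMap<String, Vec<u8>>` return type and the `data-plane/` key prefix come from this proposal.

```rust
use std::collections::HashMap;

/// Subdirectory prefix proposed in this issue for all ADP artifacts.
const FLARE_PREFIX: &str = "data-plane/";

/// Hypothetical helper: stores one artifact under the `data-plane/` prefix.
fn add_artifact(
    files: &mut HashMap<String, Vec<u8>>,
    name: &str,
    contents: impl Into<Vec<u8>>,
) {
    files.insert(format!("{}{}", FLARE_PREFIX, name), contents.into());
}

/// Sketch of the map `get_flare_files` would return. The two inputs stand in
/// for however ADP renders its resolved config and health state internally.
fn build_flare_files(config_yaml: &str, health_yaml: &str) -> HashMap<String, Vec<u8>> {
    let mut files = HashMap::new();
    add_artifact(&mut files, "runtime_config_dump.yaml", config_yaml);
    add_artifact(&mut files, "health.yaml", health_yaml);
    files
}
```

Encoding the subdirectory in the key keeps the Agent side unaware of ADP's layout: it writes each key as a path and each value as file contents.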

Configuration

  • data-plane/runtime_config_dump.yaml — resolved ADP config (equivalent of /config privileged-API endpoint output). Mirrors the existing core-agent runtime_config_dump.yaml.

Health and runtime state

  • data-plane/health.yaml — capture of /health, /ready, /live results.
  • data-plane/memory_status.json — capture of /memory/status. Useful for OOM diagnosis; no current equivalent in the core flare.

Workload / tagger

  • data-plane/workload-tags-dump.json — tagger state.
  • data-plane/workload-external-data-dump.json — external-data resolver state.

These complement the core agent's existing tagger-list.json and workload-list.log and are critical for debugging origin-detection / tag-cardinality discrepancies between ADP and core.

Process info

  • data-plane/runtime_debug_info.log — pid, uptime, RSS, args, fd count, thread count. Mirrors runtime_debug_info.log.
  • data-plane/process_tree.txt — Linux-only: pstree fragment rooted at ADP pid. Useful when ADP forks worker subprocesses.
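One possible way to assemble the process-info artifact on Linux is sketched below, assuming the RSS and thread count are read from the `VmRSS` and `Threads` fields of `/proc/self/status`. The function names and the output layout are hypothetical; fd count is left out of the sketch.

```rust
/// Extracts a numeric field such as `VmRSS` or `Threads` from the text of
/// `/proc/<pid>/status` (Linux). Returns `None` if the field is absent.
fn parse_status_field(status: &str, field: &str) -> Option<u64> {
    status.lines().find_map(|line| {
        let (key, rest) = line.split_once(':')?;
        if key != field {
            return None;
        }
        // The value is the first whitespace-separated token, e.g. "2048" in "2048 kB".
        rest.split_whitespace().next()?.parse::<u64>().ok()
    })
}

/// Hypothetical renderer for `data-plane/runtime_debug_info.log`.
/// Fields that cannot be read are reported as "unknown" rather than failing.
fn render_debug_info(pid: u32, uptime_secs: u64, args: &[String], status: &str) -> String {
    let fmt = |v: Option<u64>| v.map_or_else(|| "unknown".to_string(), |n| n.to_string());
    format!(
        "pid: {}\nuptime_secs: {}\nargs: {}\nrss_kb: {}\nthreads: {}\n",
        pid,
        uptime_secs,
        args.join(" "),
        fmt(parse_status_field(status, "VmRSS")),
        fmt(parse_status_field(status, "Threads")),
    )
}
```

In a real implementation the inputs could come from `std::process::id()`, `std::env::args().collect::<Vec<_>>()`, and `std::fs::read_to_string("/proc/self/status")`; on non-Linux platforms the status text would be empty and the affected fields would render as "unknown".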

Async runtime / stack equivalent

  • data-plane/runtime-dump.txt — Rust tokio task dump or stack snapshot. Mirrors go-routine-dump.log. (Open question — see below.)

Implementation notes

Two-side change:

ADP side (saluki repo):

  1. Implement RemoteAgentImpl::get_flare_files in bin/agent-data-plane/src/internal/remote_agent.rs:492. Build the HashMap<String, Vec<u8>> by gathering the artifacts listed above (config, health, memory, workload dumps, process info). Use map keys of the form data-plane/<artifact> so they land in a single subdirectory inside the flare archive.
  2. No scrubbing on the ADP side. Return raw bytes; the Core Agent scrubs.

Agent side (datadog-agent repo):

  1. No code change is strictly required to ingest the files: comp/core/remoteagentregistry/impl/services.go:fillFlare already absorbs whatever ADP returns and writes it into the flare archive. Scrubbing is applied by the existing flarebuilder.FlareBuilder.AddFile path (pkg/util/scrubber). Items 2 and 3 below are the only Agent-side behavior changes.
  2. Gate the entire ADP fan-out on data_plane.enabled = true. If false, skip silently.
  3. When ADP is enabled but unreachable (crashed, gRPC connection fails, timeout), write data-plane/UNREACHABLE.txt containing the error message returned by the gRPC client. Do not block or fail core-agent flare collection.
  4. Document in docs/dev/agent-data-plane/ what each artifact contains.
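The gating and unreachable behavior in items 2 and 3 can be modeled as a small decision function. The real change would live in the Go registry code; the sketch below uses Rust only to illustrate the three outcomes. `collect_adp_files` and the `fetch` closure (standing in for the FlareProvider gRPC call) are hypothetical, and timeouts are not modeled.

```rust
use std::collections::HashMap;

type FlareFiles = HashMap<String, Vec<u8>>;

/// Models the three outcomes described above:
/// - disabled: contribute nothing, and never call ADP;
/// - reachable: pass ADP's files through unchanged;
/// - unreachable: contribute a single UNREACHABLE.txt holding the error text.
fn collect_adp_files(
    enabled: bool,
    fetch: impl FnOnce() -> Result<FlareFiles, String>,
) -> FlareFiles {
    if !enabled {
        return FlareFiles::new();
    }
    match fetch() {
        Ok(files) => files,
        Err(err) => {
            let mut files = FlareFiles::new();
            files.insert("data-plane/UNREACHABLE.txt".to_string(), err.into_bytes());
            files
        }
    }
}
```

Because the error path returns a normal file map rather than an error, the rest of the flare proceeds the same way whether or not ADP responded.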

Privacy and scrubbing

Scrubbing is the Core Agent's responsibility. ADP returns raw bytes via get_flare_files; the existing Go-side flarebuilder.FlareBuilder.AddFile path runs every artifact through pkg/util/scrubber before writing to the archive. This matches how every other flare-builder caller works today.

Expected exposure profile (all already mitigated by the existing scrubber):

  • data-plane/runtime_config_dump.yaml — API keys, app keys, proxy passwords, secret URLs. Scrubbed identically to the core agent's runtime_config_dump.yaml.
  • Workload tag dumps — container IDs and pod names. Same exposure as tagger-list.json today.

Open questions

  1. Does ADP expose a pprof / task-dump endpoint equivalent to Go's /debug/pprof/goroutine? If not, is it worth adding one specifically for flare diagnostics?
  2. Size budget: the core flare has rough soft limits, and ADP artifacts can be sizable on high-cardinality hosts. Should each artifact be capped at some bound (e.g. 5 MB) with a truncation marker?
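If a per-artifact cap is adopted, it could look something like the sketch below. The marker text, the function name `cap_artifact`, and the byte-level truncation strategy are all assumptions for discussion, not a decided design.

```rust
/// Hypothetical marker appended to any artifact that was cut short.
const TRUNCATION_MARKER: &[u8] = b"\n[truncated]\n";

/// Caps an artifact at `max_len` bytes. When the input is too large, the tail
/// is dropped and the marker is appended. Provided `max_len` is at least the
/// marker length, the result is exactly `max_len` bytes long.
fn cap_artifact(mut bytes: Vec<u8>, max_len: usize) -> Vec<u8> {
    if bytes.len() <= max_len {
        return bytes;
    }
    // Reserve room for the marker inside the budget.
    let keep = max_len.saturating_sub(TRUNCATION_MARKER.len());
    bytes.truncate(keep);
    bytes.extend_from_slice(TRUNCATION_MARKER);
    bytes
}
```

Note that cutting at an arbitrary byte offset can split a multi-byte UTF-8 character or leave a JSON/YAML document syntactically incomplete, so structured artifacts such as the workload dumps might need a smarter strategy (for example, dropping whole entries).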

Decisions

  • Scrubbing: Core Agent only. ADP returns raw bytes.
  • Collection mechanism: Remote Agent Registry FlareProvider gRPC. No direct endpoint scraping from Core Agent.
  • data_plane.enabled = false: skip ADP flare fan-out entirely.
  • ADP unreachable: write data-plane/UNREACHABLE.txt containing the gRPC error message; continue with the rest of the flare.

Acceptance criteria

  • Running agent flare on a host with data_plane.enabled: true produces a data-plane/ subdirectory containing at minimum: config dump, health, memory status, workload tag dump, and process info.
  • Secret scrubbing is applied by the Core Agent; no artifact contains raw API keys.
  • When ADP is unreachable, the flare still succeeds and contains data-plane/UNREACHABLE.txt with the gRPC error; core-agent flare collection does not block or hang.
  • Flare on data_plane.enabled: false hosts contains no data-plane/ subdirectory.
  • Documentation in docs/dev/agent-data-plane/ describes each artifact and what to look for during triage.

References

  • comp/core/remoteagentregistry/impl/services.go:fillFlare — Agent-side flare fan-out
  • comp/core/remoteagentregistry/impl/client.go:31-35 — service name constants
  • bin/agent-data-plane/src/internal/remote_agent.rs:444 (saluki) — already-implemented get_status_details
  • bin/agent-data-plane/src/internal/remote_agent.rs:469 (saluki) — already-implemented get_telemetry
  • bin/agent-data-plane/src/internal/remote_agent.rs:492 (saluki) — get_flare_files stub to fill in
  • pkg/flare/common/common.go:90 — existing log-dir sweep that picks up ADP logs today

Labels

  • area/observability: Internal observability of ADP and Saluki.
  • effort/intermediate: Involves changes that can be worked on by non-experts but might require guidance.
  • type/enhancement: An enhancement in functionality or support.