Expand Agent flare to collect Agent Data Plane telemetry, config, and runtime state
Background
Agent Data Plane (ADP) runs as a separate process alongside the Core Agent. Today, when a user runs agent flare, ADP's logs, status, and live telemetry land in the flare. We'd like to expand the data that is collected to make debugging easier.
The plumbing for ADP→Agent flare exchange already exists:
- Agent side: comp/core/remoteagentregistry/ implements gRPC clients for three services (StatusProvider, TelemetryProvider, FlareProvider) and calls each on every registered remote agent.
- ADP side: bin/agent-data-plane/src/internal/remote_agent.rs registers all three services. get_status_details and get_telemetry are implemented; get_flare_files is a stub returning an empty HashMap.
The flare-builder hook on the Agent side (comp/core/remoteagentregistry/impl/services.go:fillFlare) already absorbs whatever get_flare_files returns and writes it into the archive. Because ADP's stub returns nothing, no flare-time ADP artifacts beyond logs, status, and live telemetry are captured.
Currently collected
- logs/agent-data-plane.log (active), logs/agent-data-plane.log.1 (rotated): picked up by the generic log-dir sweep in pkg/flare/common/common.go:90.
- telemetry.log: already contains ADP's Prometheus telemetry via the registry's GetTelemetry fan-out and the registryCollector (comp/core/remoteagentregistry/impl/services.go:122).
- status.log: already contains ADP's status section (version, git commit, architecture, start time, DSD metrics) populated by RemoteAgentImpl::get_status_details in bin/agent-data-plane/src/internal/remote_agent.rs:444.
Proposed additions
All artifacts below ship as entries in the HashMap<String, Vec<u8>> returned by ADP's get_flare_files. The Core Agent does not scrape ADP endpoints directly; everything flows through the existing Remote Agent Registry gRPC interface. Files land under a data-plane/ subdirectory (path encoded in the map key returned by ADP).
Configuration
- data-plane/runtime_config_dump.yaml: resolved ADP config (equivalent to the output of the privileged-API /config endpoint). Mirrors the existing core-agent runtime_config_dump.yaml.
Health and runtime state
- data-plane/health.yaml: capture of /health, /ready, and /live results.
- data-plane/memory_status.json: capture of /memory/status. Useful for OOM diagnosis; the core flare has no current equivalent.
Workload / tagger
- data-plane/workload-tags-dump.json: tagger state.
- data-plane/workload-external-data-dump.json: external-data resolver state.
These complement the core agent's existing tagger-list.json and workload-list.log and are critical for debugging origin-detection / tag-cardinality discrepancies between ADP and core.
Process info
- data-plane/runtime_debug_info.log: pid, uptime, RSS, args, fd count, thread count. Mirrors the core agent's runtime_debug_info.log.
- data-plane/process_tree.txt: Linux-only pstree fragment rooted at the ADP pid. Useful when ADP forks worker subprocesses.
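As a rough illustration of where two of the fields above (RSS and thread count) could come from on Linux, the sketch below parses the text of /proc/<pid>/status. The ProcInfo struct and parse_proc_status function are hypothetical names introduced here, not existing saluki code, and the other fields (uptime, args, fd count) are left out.

```rust
/// Hypothetical subset of the fields proposed for data-plane/runtime_debug_info.log.
#[derive(Debug, Default, PartialEq)]
struct ProcInfo {
    /// Resident set size in kB, from the "VmRSS" line.
    rss_kb: Option<u64>,
    /// Thread count, from the "Threads" line.
    threads: Option<u64>,
}

/// Parses the contents of a Linux /proc/<pid>/status file.
/// Fields that are missing or malformed stay `None`.
fn parse_proc_status(status: &str) -> ProcInfo {
    let mut info = ProcInfo::default();
    for line in status.lines() {
        if let Some((key, value)) = line.split_once(':') {
            // Values look like "   20480 kB" or "   12"; the first token is the number.
            let number = value
                .split_whitespace()
                .next()
                .and_then(|v| v.parse::<u64>().ok());
            match key {
                "VmRSS" => info.rss_kb = number,
                "Threads" => info.threads = number,
                _ => {}
            }
        }
    }
    info
}
```

On a Linux host this could be fed with std::fs::read_to_string("/proc/self/status"); taking the text as a parameter keeps the parser testable on any platform.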
Async runtime / stack equivalent
- data-plane/runtime-dump.txt: Rust tokio task dump or stack snapshot. Mirrors the core agent's go-routine-dump.log. (Open question; see below.)
Implementation notes
Two-sided change:
ADP side (saluki repo):
- Implement RemoteAgentImpl::get_flare_files in bin/agent-data-plane/src/internal/remote_agent.rs:492. Build the HashMap<String, Vec<u8>> by gathering the artifacts listed above (config, health, memory, workload dumps, process info). Use map keys of the form data-plane/<artifact> so they land in a single subdirectory inside the flare archive.
- No scrubbing on the ADP side. Return raw bytes; the Core Agent scrubs.
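The ADP-side steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the saluki implementation: the Collector type, build_flare_files, collect_health, and the ".err" key suffix for failed collectors are all hypothetical, and the real get_flare_files handler signature is not shown in this issue.

```rust
use std::collections::HashMap;

/// Subdirectory inside the flare archive that holds every ADP artifact.
const FLARE_PREFIX: &str = "data-plane";

/// Hypothetical collector shape: returns raw artifact bytes or an error message.
type Collector = fn() -> Result<Vec<u8>, String>;

/// Placeholder collector; a real one would serialize ADP's actual health state.
fn collect_health() -> Result<Vec<u8>, String> {
    Ok(b"health: ok\n".to_vec())
}

/// Builds the map that `get_flare_files` would return.
/// Keys take the form `data-plane/<artifact>`; bytes are returned raw because
/// scrubbing is left to the Core Agent.
fn build_flare_files(collectors: &[(&str, Collector)]) -> HashMap<String, Vec<u8>> {
    let mut files = HashMap::new();
    for (name, collect) in collectors {
        let key = format!("{}/{}", FLARE_PREFIX, name);
        match collect() {
            Ok(bytes) => {
                files.insert(key, bytes);
            }
            // Assumption: a failing collector is recorded next to the others
            // instead of aborting the whole flare contribution.
            Err(err) => {
                files.insert(format!("{}.err", key), err.into_bytes());
            }
        }
    }
    files
}
```

A caller would pass one entry per artifact, for example ("health.yaml", collect_health). Whether a failed collector should leave an error file or be skipped silently is a design choice this issue does not settle.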
Agent side (datadog-agent repo):
- No code change is strictly required to write the artifacts: comp/core/remoteagentregistry/impl/services.go:fillFlare already absorbs whatever ADP returns and writes it into the flare archive. Scrubbing is applied by the existing flarebuilder.FlareBuilder.AddFile path (pkg/util/scrubber).
- Gate the entire ADP fan-out on data_plane.enabled = true. If false, skip silently.
- When ADP is enabled but unreachable (crash, gRPC connection failure, timeout), write data-plane/UNREACHABLE.txt containing the error message returned by the gRPC client. Do not block or fail core-agent flare collection.
- Document in docs/dev/agent-data-plane/ what each artifact contains.
Privacy and scrubbing
Scrubbing is the Core Agent's responsibility. ADP returns raw bytes via get_flare_files; the existing Go-side flarebuilder.FlareBuilder.AddFile path runs every artifact through pkg/util/scrubber before writing to the archive. This matches how every other flare-builder caller works today.
Expected exposure profile (all already mitigated by the existing scrubber):
- data-plane/runtime_config_dump.yaml: API keys, app keys, proxy passwords, secret URLs. Scrubbed identically to the core agent's runtime_config_dump.yaml.
- Workload tag dumps: container IDs and pod names. Same exposure as tagger-list.json today.
Open questions
- Does ADP expose a pprof / task-dump endpoint equivalent to Go's /debug/pprof/goroutine? If not, is it worth adding one specifically for flare diagnostics?
- Size budget: the core flare has rough soft limits, and ADP artifacts can be sizable on high-cardinality hosts. Should each artifact be capped at some bound (e.g. 5 MB) with a truncation marker?
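If a per-artifact cap is adopted, one possible shape is sketched below. The 5 MiB constant only mirrors the "e.g." figure in the question above and is not a decided value; cap_artifact and the marker text are hypothetical.

```rust
/// Illustrative cap only; the actual bound is still an open question.
const MAX_ARTIFACT_BYTES: usize = 5 * 1024 * 1024;

/// Hypothetical marker appended when an artifact is cut short.
const TRUNCATION_MARKER: &[u8] = b"\n[truncated by ADP flare size cap]\n";

/// Returns the artifact unchanged if it fits within `limit` bytes; otherwise
/// keeps the first `limit` bytes and appends the truncation marker.
fn cap_artifact(mut bytes: Vec<u8>, limit: usize) -> Vec<u8> {
    if bytes.len() <= limit {
        return bytes;
    }
    bytes.truncate(limit);
    bytes.extend_from_slice(TRUNCATION_MARKER);
    bytes
}
```

Note that a byte-level cut can leave JSON or YAML artifacts unparseable and may split a multi-byte UTF-8 character, and the result is slightly larger than the limit because of the marker. A format-aware cap (for example, dropping whole entries from a workload dump) may be preferable for the larger artifacts.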
Decisions
- Scrubbing: Core Agent only. ADP returns raw bytes.
- Collection mechanism: Remote Agent Registry FlareProvider gRPC. No direct endpoint scraping from Core Agent.
- data_plane.enabled = false: skip ADP flare fan-out entirely.
- ADP unreachable: write data-plane/UNREACHABLE.txt containing the gRPC error message; continue with the rest of the flare.
Acceptance criteria
- Running agent flare on a host with data_plane.enabled: true produces a data-plane/ subdirectory containing at minimum: config dump, health, memory status, workload tag dump, and process info.
- Secret scrubbing is applied by the Core Agent; no artifact contains raw API keys.
- When ADP is unreachable, the flare still succeeds and contains data-plane/UNREACHABLE.txt with the gRPC error; core-agent flare collection does not block or hang.
- Flare on data_plane.enabled: false hosts contains no data-plane/ subdirectory.
- Documentation in docs/dev/ describes each artifact and what to look for during triage.
References
- comp/core/remoteagentregistry/impl/services.go:fillFlare: Agent-side flare fan-out
- comp/core/remoteagentregistry/impl/client.go:31-35: service name constants
- bin/agent-data-plane/src/internal/remote_agent.rs:444 (saluki): already-implemented get_status_details
- bin/agent-data-plane/src/internal/remote_agent.rs:469 (saluki): already-implemented get_telemetry
- bin/agent-data-plane/src/internal/remote_agent.rs:492 (saluki): get_flare_files stub to fill in
- pkg/flare/common/common.go:90: existing log-dir sweep that picks up ADP logs today