chore(config): experiment - mapless config system implementation 3b #1886
webern wants to merge 25 commits into
Rename the forwarder DatadogConfiguration to DatadogForwarderConfiguration, then generate a public nested DatadogConfiguration in datadog-agent-config for the support: full / support: partial overlay keys. The overlay selects the keys (schema pruning); typify generates the nested struct tree from the pruned JSON Schema. Numerics mirror the schema (f64); refinement is deferred to the translator. Mostly unused until the translator PR.
The name saluki-config is easily conflated with the actual configuration structs that are used to configure saluki components and agent-data-plane. In reality, the crate is a utility crate for configuration mechanics, and tools will help disambiguate as we build out the config translation system.
Adds the following new crates in an empty state:
- lib/agent-data-plane-config-system
- lib/agent-data-plane-config
- lib/saluki-component-config
A new form of testing is added at test/architecture which cuts across crate boundaries. It uses grep, crate dependency information, and similar checks to assert rules such as "crate-a must not depend on crate-b" or "ThisType must not appear in crate-c". Initially the purpose is to drive config system development, but ultimately this could be a powerful way to keep coding agents from creating architectural entropy.
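A rough sketch of what such a guard could check, assuming the manifest and source text have already been read into strings. `depends_on` and `mentions_symbol` are hypothetical helper names, not the functions in test/architecture, and the manifest scan is a naive line scan rather than a real TOML parser.

```rust
/// Returns true if `manifest` (Cargo.toml text) lists `krate` as a key in any
/// table whose name ends in "dependencies". Hypothetical helper, naive scan.
fn depends_on(manifest: &str, krate: &str) -> bool {
    let mut in_deps = false;
    for line in manifest.lines() {
        let line = line.trim();
        if line.starts_with('[') {
            // Covers [dependencies], [dev-dependencies], [workspace.dependencies], ...
            let table = line.trim_matches(|c: char| c == '[' || c == ']');
            in_deps = table.ends_with("dependencies");
            continue;
        }
        if in_deps {
            // The key is the text before '=' (or before '.' in `foo.workspace = true`).
            let key = line
                .split(|c: char| c == '=' || c == '.')
                .next()
                .unwrap_or("")
                .trim();
            if key == krate {
                return true;
            }
        }
    }
    false
}

/// Returns true if `symbol` appears in `source` as a whole identifier.
fn mentions_symbol(source: &str, symbol: &str) -> bool {
    fn is_ident(c: char) -> bool {
        c.is_alphanumeric() || c == '_'
    }
    source.match_indices(symbol).any(|(start, m)| {
        let before = source[..start].chars().next_back();
        let after = source[start + m.len()..].chars().next();
        !before.map_or(false, is_ident) && !after.map_or(false, is_ident)
    })
}
```

A guard test would then read each crate's manifest and sources and assert that `depends_on` and `mentions_symbol` return false for every forbidden pair.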
Source-agnostic component-native config structs mirroring the config-only fields of each saluki-components struct, plus ScopedConfig<T> fixed-or-live handle. Un-ignore the step-2 leaf dependency guards. Build Order step 2.
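To illustrate the fixed-or-live idea, here is a minimal sketch using only the standard library. It reuses the name `ScopedConfig<T>` for readability, but the variants, the `current` method, and the `Arc<RwLock<T>>` backing are assumptions; the type in this PR may be shaped quite differently.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for a fixed-or-live config handle.
#[derive(Clone)]
enum ScopedConfig<T> {
    /// Value decided once at startup (for example, local mode).
    Fixed(T),
    /// Value that a config router may replace at runtime (for example, stream mode).
    Live(Arc<RwLock<T>>),
}

impl<T: Clone> ScopedConfig<T> {
    /// Returns a copy of the value as it is right now.
    fn current(&self) -> T {
        match self {
            ScopedConfig::Fixed(value) => value.clone(),
            ScopedConfig::Live(cell) => cell.read().unwrap().clone(),
        }
    }
}
```

The component only ever calls `current()`, so it does not need to know whether its slice can change after startup.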
Add witness_gen.rs: generate DatadogConfigConsumer (one consume_<key> per supported key) and a fallible drive() that walks DatadogConfiguration with clobber-with-defaults semantics for absent optional sections. Add hand-written TranslateError. Relocate KEY_ALIASES and DatadogRemapper here as the authoritative Datadog source-normalization home. Fix: datadog_config_gen.rs now resolves schema $refs, so apm_config.* and multi_region_failover.* keys are no longer silently dropped from DatadogConfiguration. Witness now covers all 147 support: full/partial keys. Build Order step 4.
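The witness pattern can be sketched on a toy schema. Everything below is hypothetical (the three keys, `SourceConfig`, `ApmSection`, `Consumer`, `Recorder`); it only shows the shape: one `consume_<key>` method per key, and a fallible `drive` that substitutes defaults for an absent optional section so stale values get clobbered.

```rust
#[derive(Debug, PartialEq)]
struct TranslateError(String);

/// Hypothetical optional section; numerics stay f64 as in the schema.
#[derive(Default)]
struct ApmSection {
    enabled: bool,
    max_tps: f64,
}

/// Hypothetical source config with one required key and one optional section.
struct SourceConfig {
    log_level: String,
    apm: Option<ApmSection>,
}

/// One method per supported key. Adding a key to the trait forces every
/// implementor to handle it, or the build fails.
trait Consumer {
    fn consume_log_level(&mut self, v: &str) -> Result<(), TranslateError>;
    fn consume_apm_enabled(&mut self, v: bool) -> Result<(), TranslateError>;
    fn consume_apm_max_tps(&mut self, v: f64) -> Result<(), TranslateError>;
}

/// Walks the source and calls every consume method, stopping at the first error.
/// An absent `apm` section is driven with its defaults.
fn drive<C: Consumer>(src: &SourceConfig, c: &mut C) -> Result<(), TranslateError> {
    c.consume_log_level(&src.log_level)?;
    let default_apm = ApmSection::default();
    let apm = src.apm.as_ref().unwrap_or(&default_apm);
    c.consume_apm_enabled(apm.enabled)?;
    c.consume_apm_max_tps(apm.max_tps)?;
    Ok(())
}

/// Toy consumer that records what it saw and refines the f64 on the way in.
#[derive(Default)]
struct Recorder {
    seen: Vec<String>,
}

impl Consumer for Recorder {
    fn consume_log_level(&mut self, v: &str) -> Result<(), TranslateError> {
        self.seen.push(format!("log_level={v}"));
        Ok(())
    }
    fn consume_apm_enabled(&mut self, v: bool) -> Result<(), TranslateError> {
        self.seen.push(format!("apm.enabled={v}"));
        Ok(())
    }
    fn consume_apm_max_tps(&mut self, v: f64) -> Result<(), TranslateError> {
        if v < 0.0 {
            return Err(TranslateError(format!("apm.max_tps must be >= 0, got {v}")));
        }
        self.seen.push(format!("apm.max_tps={v}"));
        Ok(())
    }
}
```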
Define SalukiConfiguration { control, components } with per-domain group
wrappers embedding saluki-component-config leaf structs directly.
ControlConfiguration carries pipeline gates and topology-shaping.
SalukiOnlyConfiguration + seed(), BootstrapConfiguration { datadog, saluki },
authority enums, and ConfigViews. Add the binary-local dogstatsd filter and
workload leaf structs. Add Serialize/PartialEq/Eq to saluki-io ListenAddress.
Un-ignore step-3 dependency guards.
Build Order step 3.
RemoteAgentClientConfiguration (pure typed config), connect() -> DatadogAgentConnection (Arc-shared session + config stream), and a typed Attachments bundle (status/flare/telemetry/metrics/autodiscovery/host_tags). The raw map is confined inside connect(); no public signature exposes it. Build Order step 6.
Translator implements all 147 witness consume_<key> methods, delegating by ownership domain to translate/datadog/<subsystem>.rs. Endpoints assembled from scratch in finish(); translate() runs seed-then-drive as the single reusable path for startup and dynamic retranslation. Add ControlConfiguration.log_file as the destination for data_plane.log_file. Build Order step 7.
ConfigUpdateRouter ingests ConfigUpdate snapshots/partials into a retained source snapshot, retranslates via the same seed-then-drive path, and routes coarse per-slice diffs to typed ScopedConfig<T> handles (DynamicConfigHandles). Rejects malformed updates and keeps last-good. Live in stream mode, Fixed in local mode. Dynamic log level routed as a dedicated ScopedConfig<String>. Build Order step 8.
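The "reject malformed updates and keep last-good" behavior can be sketched as follows. This is a simplified model under stated assumptions: the source snapshot is a flat string map, `translate`, `Translated`, and `Router` are hypothetical names, and the routing of diffs to typed handles is reduced to a `changed` flag.

```rust
use std::collections::BTreeMap;

type Snapshot = BTreeMap<String, String>;

/// Hypothetical typed output of translation.
#[derive(Debug, Clone, PartialEq)]
struct Translated {
    log_level: String,
    port: u16,
}

/// Hypothetical translator: fails if a value cannot be refined into its type.
fn translate(src: &Snapshot) -> Result<Translated, String> {
    let log_level = src
        .get("log_level")
        .cloned()
        .unwrap_or_else(|| "info".to_string());
    let port = match src.get("dogstatsd_port") {
        Some(raw) => raw
            .parse::<u16>()
            .map_err(|e| format!("dogstatsd_port: {e}"))?,
        None => 8125,
    };
    Ok(Translated { log_level, port })
}

struct Router {
    /// Retained source snapshot that partial updates are merged into.
    source: Snapshot,
    /// Last successfully translated config.
    last_good: Translated,
}

impl Router {
    fn new(initial: Snapshot) -> Result<Self, String> {
        let last_good = translate(&initial)?;
        Ok(Self { source: initial, last_good })
    }

    /// Merges one key into a candidate snapshot and retranslates it.
    /// Returns true if the typed config changed.
    fn apply_partial(&mut self, key: &str, value: &str) -> bool {
        let mut candidate = self.source.clone();
        candidate.insert(key.to_string(), value.to_string());
        match translate(&candidate) {
            Ok(next) => {
                let changed = next != self.last_good;
                self.source = candidate;
                self.last_good = next;
                changed
            }
            // Malformed update: drop the candidate, keep snapshot and last-good.
            Err(_) => false,
        }
    }
}
```

Translating a candidate copy rather than mutating the retained snapshot in place is what makes rejection cheap: a failed update leaves no partial state behind.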
ConfigurationSystem::load loads local Datadog + Saluki sources once, parses typed BootstrapConfiguration, and decides LocalSnapshot vs AgentStream authority. start_runtime consumes the loaded object by value: in local mode the local snapshot is the runtime authority; in stream mode it connects, awaits the first config-stream snapshot, and spawns the dynamic router task. Started exposes saluki()/dynamic_handles()/attachments()/config_views(). Views regenerate live-on-request from a shared ViewSources cell, scrubbed via saluki-common. Un-ignore the step-7 config-system guard. Build Order step 5 (+ /config view producer).
Components no longer parse GenericConfiguration. Each Configuration consumes its source-agnostic leaf slice: trivial configs embed the leaf struct directly, behavior-carrying configs (forwarder, apm, obfuscation, otlp, retry/proxy) build from it via from_native adapters. Dynamic components (forwarder API-key, dsd debug-log, MRF) consume ScopedConfig<T> handles instead of string-key watches. Raw-map config validation removed from components (now config-system's job). All forbidden raw-map symbols eliminated from the crate. NOTE: bin/agent-data-plane does not yet compile -- the topology/startup cutover that feeds these typed slices lands in the next commit (atomic refactor). Build Order step 9 (components half).
Startup is now ConfigurationSystem::load -> start_runtime: bootstrap logging and early metrics from typed slices, topology consumes ControlConfiguration + ComponentConfiguration native slices and DynamicConfigHandles, internal services build on typed Attachments, /config serves scrubbed ConfigViews live-on-request, and dynamic log level rides ScopedConfig<String>. Pipeline-gate predicates moved onto ControlConfiguration. All raw-map symbols eliminated from the binary; the out-of-scope saluki-env provider layer is fed via a confined EnvConfig pass-through owned by config-system. Un-ignore all remaining step-9/10/11 guards. Build Order steps 9 (binary), 10, 11.
Fix clippy lints (char-pattern, push_str newline, field-reassign-with-default), remove dependencies orphaned by the cutover (facet/figment/serde_with/serde_yaml from saluki-components; prost-types/serde_with from agent-data-plane), sync the third-party license file, extend the Vale technical vocabulary, and repair intra-doc links. make check-all is green.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fa750be44d
     };

-    if !in_standalone_mode && !dp_config.enabled() {
+    if !control.standalone_mode && !control.enabled {
Translate data_plane gates before checking control.enabled
When data_plane.enabled: true is supplied by the Agent stream or local snapshot, the new typed translation never copies it into ControlConfiguration (the generated DatadogConfigurationDataPlane/witness omit enabled, standalone_mode, and the per-pipeline enabled gates). In that scenario control.enabled remains its default false, so this guard exits instead of starting ADP; the omitted sub-gates also leave DogStatsD/checks/OTLP at their defaults regardless of configuration.
    pid: std::process::id(),
    display_name: app_details.full_name().to_string(),
    flavor: app_details.full_name().replace(' ', "_").to_lowercase(),
    api_endpoint: secure_api_listen_address,
Register a dialable secure API endpoint
For the default secure API address, this now registers tcp://0.0.0.0:5101 with the Core Agent. The previous path converted the listen address through GrpcTargetAddress::try_from_listen_addr, which strips the tcp:// listen URI and rewrites wildcard TCP binds to 127.0.0.1:5101; without that conversion, TCP registrations advertise a non-dialable callback endpoint, so the Agent cannot call the status/flare/telemetry services even after registration succeeds.
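The conversion the comment describes can be approximated with the standard library alone. This is a sketch, not the body of `GrpcTargetAddress::try_from_listen_addr`: `dialable_target` is a hypothetical name, hostnames and non-TCP schemes are simply rejected, and mapping the IPv6 wildcard to `::1` is an assumption.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr, SocketAddr};

/// Turns a TCP listen URI such as "tcp://0.0.0.0:5101" into a "host:port"
/// string that a peer on the same host can dial. Returns None for anything
/// that is not `tcp://<ip>:<port>`.
fn dialable_target(listen: &str) -> Option<String> {
    let addr: SocketAddr = listen.strip_prefix("tcp://")?.parse().ok()?;
    let ip = match addr.ip() {
        // A wildcard bind is not a destination; advertise loopback instead.
        IpAddr::V4(v4) if v4.is_unspecified() => IpAddr::V4(Ipv4Addr::LOCALHOST),
        IpAddr::V6(v6) if v6.is_unspecified() => IpAddr::V6(Ipv6Addr::LOCALHOST),
        other => other,
    };
    Some(SocketAddr::new(ip, addr.port()).to_string())
}
```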
    let authority = if standalone_mode || !remote_agent_enabled {
        RuntimeAuthority::LocalSnapshot
    } else {
        RuntimeAuthority::AgentStream
Honor use_new_config_stream_endpoint when choosing authority
When users set data_plane.use_new_config_stream_endpoint: false but leave remote-agent registration enabled, the old flow registered the remote agent and kept using the local snapshot; this authority decision ignores that flag and always selects AgentStream whenever remote_agent_enabled is true. In that compatibility mode ADP now connects to the new config stream and waits for an initial snapshot that the deployment explicitly disabled (or may not support), preventing startup from using the local configuration.
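One way the suggestion could be applied is to fold the flag into the same decision. This is a sketch of the reviewer's proposal, not code from the PR; `choose_authority` and its parameter list are hypothetical, and only the `RuntimeAuthority` variant names come from the snippet above.

```rust
#[derive(Debug, PartialEq)]
enum RuntimeAuthority {
    LocalSnapshot,
    AgentStream,
}

/// Picks the runtime authority. The stream is used only when ADP is not
/// standalone, remote-agent registration is on, and the new config stream
/// endpoint has not been disabled.
fn choose_authority(
    standalone_mode: bool,
    remote_agent_enabled: bool,
    use_new_config_stream_endpoint: bool,
) -> RuntimeAuthority {
    if standalone_mode || !remote_agent_enabled || !use_new_config_stream_endpoint {
        RuntimeAuthority::LocalSnapshot
    } else {
        RuntimeAuthority::AgentStream
    }
}
```

In the compatibility case (registration enabled, stream endpoint disabled) this keeps the local snapshot as authority; registering with the Agent would then need to happen independently of the authority choice.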
    let logging_bootstrap = agent_data_plane_config::LoggingBootstrap {
        log_level: Some(runtime_log_level),
        ..Default::default()
Preserve non-level logging settings on runtime reload
This reload builds a LoggingBootstrap containing only log_level, but LoggingGuard::reload swaps the entire output stack. In local mode, or after the initial Agent snapshot, any configured log_format_json, console/syslog settings, rotation limits, disable_file_logging, or ADP log-file destination are reset to translator defaults as soon as run starts; carry the full runtime logging slice (or update only the base filter) instead of defaulting every non-level field.
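The first option (carry the full logging slice) amounts to changing the base of the struct update from `Default::default()` to the currently active settings. A minimal sketch, assuming a simplified `LoggingBootstrap` with only three fields; the real struct has more, and `reload_settings` is a hypothetical helper.

```rust
/// Simplified stand-in for the logging slice; field set is an assumption.
#[derive(Debug, Clone, Default, PartialEq)]
struct LoggingBootstrap {
    log_level: Option<String>,
    log_format_json: bool,
    log_file: Option<String>,
}

/// Builds the settings for a runtime reload: only the level changes,
/// every other field is carried over from the active settings.
fn reload_settings(current: &LoggingBootstrap, runtime_log_level: String) -> LoggingBootstrap {
    LoggingBootstrap {
        log_level: Some(runtime_log_level),
        ..current.clone()
    }
}
```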
                _ => {}
            }
        }
        aggregate.hist_config.statistics = statistics;
Keep histogram percentiles in translated histogram config
The witness always calls consume_histogram_aggregates with the Datadog default aggregate list, and this assignment replaces AggregateConfig::default().hist_config.statistics, which includes the default 0.95 percentile. Because histogram_percentiles is not witnessed or added back here, default DogStatsD histograms stop emitting .95percentile series (and configured percentile lists cannot take effect).
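A fix along these lines would build the statistics list from both inputs before assigning it. The sketch below is illustrative only: `HistStat` and `build_statistics` are hypothetical, percentiles are modeled as `f64` values, and unknown aggregate names are silently skipped.

```rust
/// Hypothetical histogram statistic; not the type used by the aggregate transform.
#[derive(Debug, Clone, PartialEq)]
enum HistStat {
    Count,
    Sum,
    Min,
    Max,
    Average,
    Median,
    Percentile(f64),
}

/// Combines configured aggregates and percentiles into one statistics list,
/// so replacing the default list does not drop the percentiles.
fn build_statistics(aggregates: &[&str], percentiles: &[f64]) -> Vec<HistStat> {
    let mut stats: Vec<HistStat> = aggregates
        .iter()
        .filter_map(|name| match *name {
            "count" => Some(HistStat::Count),
            "sum" => Some(HistStat::Sum),
            "min" => Some(HistStat::Min),
            "max" => Some(HistStat::Max),
            "avg" => Some(HistStat::Average),
            "median" => Some(HistStat::Median),
            _ => None,
        })
        .collect();
    stats.extend(percentiles.iter().copied().map(HistStat::Percentile));
    stats
}
```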
- Add AutoscalingFailoverConfig and ClusterAgentConfig leaf structs in saluki-component-config
- Add autoscaling_failover and cluster_agent groups to ComponentConfiguration in agent-data-plane-config
- Wire the 7 previously no-op DatadogConfigConsumer methods to their native config destinations (autoscaling, cluster_agent, dogstatsd_disable_verbose_logs)
- Regenerate schema overlay: adds the 7 missing trait methods
- Migrate ClusterAgentForwarderConfiguration and AutoscalingFailoverGatewayConfiguration to from_native constructors
- Move ClusterAgentConfiguration endpoint resolution logic into cluster_agent/mod.rs; re-add autoscaling failover pipeline in run.rs
- Fix io.rs: thread EndpointRequestMapperFactory through run_io_loop and run_endpoint_io_loop; use ServiceExt::map_request to avoid the Clone bound on Box<dyn FnMut>
- Fix cluster_agent/mod.rs: convert from_configuration to from_native using leaf types; update tests accordingly
- Fix autoscaling_failover_gateway/mod.rs: inline struct, drop dead saluki_config imports, convert tests to synchronous construction
.dockerignore excluded all of test/ except test/antithesis/, but test/architecture is a Cargo workspace member so Docker builds fail when cargo can't read its manifest.
build_collector is #[cfg(unix)] so std::future::Future is unused on Windows, which is an error under #![deny(warnings)].
Binary Size Analysis (Agent Data Plane)
Baseline: 3ecd1e0 · Comparison: 8abc34e
✅ Binary size difference within threshold
Regression Detector (Agent Data Plane)
Optimization Goals: ❌ 2 regressions detected
Experiments configured: 18
Bounds Checks: ❌ Failed (5)
Explanation: A change is flagged as a regression when |Δ mean %| > 5.00% in the regressing direction for its optimization goal AND SMP marks the experiment as a regression.
data_plane.enabled and data_plane.dogstatsd.enabled were in the Datadog core schema but sat in schema_overlay excluded:. Move them to support: full so the codegen produces fields in DatadogConfiguration and witness methods in DatadogConfigConsumer. Add patch_snapshot_pipeline_gates to bootstrap to normalize flat DD_ figment keys (e.g. data_plane_enabled) to nested JSON before serde_json::from_value, which fixes the "Agent Data Plane is not enabled" failure in integration tests. Thread standalone_mode (Saluki-internal, not in core schema) from LoadedSources through start_local so control.standalone_mode is set correctly at runtime.
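The flat-to-nested normalization can be modeled without any JSON library. The sketch below is not `patch_snapshot_pipeline_gates`: it uses a hypothetical `Node` tree in place of a JSON value, a hypothetical `GATES` table, and assumes that an already-present nested value takes precedence over the flat key.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a nested config value.
#[derive(Debug, Default, PartialEq)]
struct Node {
    value: Option<String>,
    children: BTreeMap<String, Node>,
}

/// Flat key (as produced from a DD_ env var) and the nested path it belongs at.
const GATES: &[(&str, &[&str])] = &[
    ("data_plane_enabled", &["data_plane", "enabled"]),
    ("data_plane_dogstatsd_enabled", &["data_plane", "dogstatsd", "enabled"]),
];

/// Moves each known flat gate key to its nested location, creating
/// intermediate nodes as needed.
fn patch_gates(root: &mut Node) {
    for (flat, path) in GATES {
        if let Some(flat_node) = root.children.remove(*flat) {
            let mut cur = &mut *root;
            for seg in path.iter() {
                cur = cur.children.entry(seg.to_string()).or_default();
            }
            // Assumption: a nested value that is already set wins.
            if cur.value.is_none() {
                cur.value = flat_node.value;
            }
        }
    }
}
```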
- Restore "Waiting for initial configuration..." and "Initial configuration received." log messages in start_stream (dropped when the config-system crate was extracted from the binary).
- Thread data_plane.otlp.enabled through LoadedSources so both start_local and start_stream apply it to control.otlp.enabled (excluded from schema overlay; flat-key fallback in try_get_typed handles DD_DATA_PLANE_OTLP_ENABLED).
- Add data_plane.api_listen_address to LocalApiBootstrap; in start_stream apply both api and secure_api listen addresses from bootstrap after stream translation (stream snapshot from Core Agent does not include these ADP-local env vars).
e013180 to f501ac4
- Bridge DD_DOGSTATSD_TCP_PORT from Datadog source into saluki_only.dogstatsd.tcp_port (bootstrap).
- Send Ok(()) on registration failure so ADP does not exit when Agent is unreachable.
- Add metric_tag_filterlist to overlay + translation.
- Fan out apm_config.target_traces_per_second to both sampler and encoder.
- Add apm_config.errors_per_second and error_tracking_standalone.enabled to overlay + translation (fan out to sampler and encoder).
start_local was unconditionally forcing standalone_mode=true, which killed the workload provider and broke origin detection. When standalone_mode is false (the remote_agent_enabled=false case), the Agent is still available for metadata services. Connect to it so the env-provider gets real gRPC-backed workload/host/autodiscovery providers.
Saluki-only keys set via DD_* env vars were silently dropped, causing SMP memory regressions in the DSD quality_gates benchmarks (~20% RSS increase). Adds bridge_dd_fallbacks() to read saluki-only values from the Datadog source when the Saluki source has not set them. Covers aggregate_context_limit, interner sizes, autoscale, and OTLP context limits.
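The fallback rule itself is small. A sketch, assuming both sources are flat string maps; `bridge_fallback` is a hypothetical single-key helper, not the signature of `bridge_dd_fallbacks()`.

```rust
use std::collections::HashMap;

/// Reads a numeric saluki-only setting, preferring the Saluki source and
/// falling back to the Datadog source only when the Saluki source has not
/// set the key. Returns None if the chosen value does not parse.
fn bridge_fallback(
    saluki: &HashMap<String, String>,
    datadog: &HashMap<String, String>,
    key: &str,
) -> Option<usize> {
    saluki
        .get(key)
        .or_else(|| datadog.get(key))
        .and_then(|raw| raw.parse().ok())
}
```

Because the Saluki source is consulted first, an explicit Saluki value is never overridden by a DD_* env var.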
Summary
This is a prototype of a system that eliminates untyped, stringly-typed access to a generic configuration map and produces a strongly typed barrier between agent-data-plane and Datadog Agent config. It encapsulates knowledge of Datadog Agent configuration in a single system, isolated from agent-data-plane, and provides the extension point where other configuration dialects (such as OTel) may be inserted.

This PR is not mergeable, but I need to keep it here and keep it healthy and rebased. The "real" PRs will essentially burn down this diff in more manageable chunks.
Change Type
How did you test this PR?
This PR is basically a test, in itself, of the design and the ability to pass CI with a prototype of the design.
References
#1788