feat(agent-data-plane): added workaround for tls handshake timeout config option #1819

Open
lucastemb wants to merge 4 commits into main from lt/178

Conversation

@lucastemb

Contributor

Summary

Adds a workaround to support the tls_handshake_timeout config option. tokio-rustls provides no built-in handshake timeout, so a stalled handshake could hang indefinitely. The idea is to replace hyper-rustls's HttpsConnector with direct ownership of the TLS layer, so that the handshake can be wrapped in a timeout whose duration comes from the config option.
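As a rough illustration of the idea in the summary, here is a synchronous, std-only sketch of bounding a possibly never-ending operation with a deadline. It is not the code in this PR: the PR works on an async connector and would presumably use an async timeout instead. `with_deadline`, `HandshakeError`, and the `handshake` closure are hypothetical names that stand in for the real TLS handshake.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Error returned when the wrapped operation does not produce a result.
#[derive(Debug, PartialEq)]
pub enum HandshakeError {
    /// The deadline elapsed before the operation finished.
    TimedOut,
    /// The worker exited without sending a result (for example, it panicked).
    Aborted,
}

/// Runs `handshake` on a helper thread and waits at most `limit` for its result.
/// `handshake` is a hypothetical placeholder for a TLS handshake.
pub fn with_deadline<T, F>(limit: Duration, handshake: F) -> Result<T, HandshakeError>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if the caller timed out; ignore that error.
        let _ = tx.send(handshake());
    });
    match rx.recv_timeout(limit) {
        Ok(value) => Ok(value),
        Err(mpsc::RecvTimeoutError::Timeout) => Err(HandshakeError::TimedOut),
        Err(mpsc::RecvTimeoutError::Disconnected) => Err(HandshakeError::Aborted),
    }
}
```

One difference from an async timeout is worth noting: this sketch stops waiting but leaves the helper thread running, whereas dropping a timed-out future cancels the pending work.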

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

How did you test this PR?

Unit tests

References

@lucastemb lucastemb marked this pull request as ready for review June 4, 2026 17:52
@lucastemb lucastemb requested a review from a team as a code owner June 4, 2026 17:52
@dd-octo-sts dd-octo-sts Bot added area/io General I/O and networking. area/components Sources, transforms, and destinations. labels Jun 4, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dc7d606ddd

let endpoints = config.build_routable_endpoints(live_config.clone())?;
let mut client_builder = HttpClient::builder()
.with_request_timeout(config.request_timeout())
.with_tls_handshake_timeout(config.tls_handshake_timeout())

P2 Badge Honor TLS handshake timeout for proxied HTTPS

When proxy_https is configured, this timeout only reaches the inner connector; HttpClientBuilder::build then wraps it in hyper_http_proxy::ProxyConnector, whose CONNECT path performs the destination TLS handshake itself after opening the tunnel. In that common forwarder-proxy setup, a stalled TLS handshake to the intake still ignores tls_handshake_timeout and waits for the broader request timeout, even though the config is now classified as fully supported.

Contributor Author

Worth a follow-up, since it would involve making a PR against the hyper-http-proxy fork.

@pr-commenter

pr-commenter Bot commented Jun 4, 2026

Regression Detector (Agent Data Plane)

Run ID: fcbe5fe8-4a4e-4c7c-8b9a-3ede0a1337d9
Baseline: b8dd4227 · Comparison: 43275bd3 · diff

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment (35)

Experiments configured erratic: true are tagged (ignored) and skipped when determining which experiments regressed or improved. Experiments which are detected as erratic at runtime are tagged (erratic) to flag that the run's sample dispersion was high, but their regression / improvement signal still counts.

| experiment | goal | Δ mean % | links |
|---|---|---|---|
| otlp_ingest_traces_5mb_cpu (erratic) | cpu | ⚪ +3.13 | metrics profiles logs |
| dsd_uds_500mb_3k_contexts_throughput | throughput | ⚪ -0.87 | metrics profiles logs |
| otlp_ingest_traces_ottl_transform_5mb_cpu (erratic) | cpu | ⚪ +0.79 | metrics profiles logs |
| otlp_ingest_traces_ottl_filtering_5mb_cpu (erratic) | cpu | ⚪ +0.46 | metrics profiles logs |
| dsd_uds_10mb_3k_contexts_memory | memory | ⚪ +0.37 | metrics profiles logs |
| otlp_ingest_metrics_5mb_memory | memory | ⚪ +0.33 | metrics profiles logs |
| dsd_uds_100mb_3k_contexts_cpu (erratic) | cpu | ⚪ +0.25 | metrics profiles logs |
| quality_gates_rss_idle | memory | ⚪ +0.23 | metrics profiles logs |
| otlp_ingest_logs_5mb_cpu (ignored) | cpu | ⚪ +0.18 | metrics profiles logs |
| quality_gates_rss_dsd_heavy | memory | ⚪ +0.12 | metrics profiles logs |
| otlp_ingest_traces_ottl_filtering_5mb_throughput | throughput | ⚪ -0.04 | metrics profiles logs |
| otlp_ingest_traces_5mb_memory | memory | ⚪ +0.03 | metrics profiles logs |
| otlp_ingest_metrics_5mb_throughput | throughput | ⚪ -0.03 | metrics profiles logs |
| otlp_ingest_traces_ottl_transform_5mb_memory | memory | ⚪ +0.01 | metrics profiles logs |
| dsd_uds_10mb_3k_contexts_throughput | throughput | ⚪ -0.01 | metrics profiles logs |
| dsd_uds_512kb_3k_contexts_throughput | throughput | ⚪ -0.00 | metrics profiles logs |
| dsd_uds_100mb_3k_contexts_throughput | throughput | ⚪ -0.00 | metrics profiles logs |
| dsd_uds_1mb_3k_contexts_throughput | throughput | ⚪ -0.00 | metrics profiles logs |
| dsd_uds_500mb_3k_contexts_cpu (erratic) | cpu | ⚪ -0.00 | metrics profiles logs |
| otlp_ingest_traces_ottl_filtering_5mb_memory | memory | ⚪ -0.00 | metrics profiles logs |
| dsd_uds_512kb_3k_contexts_memory | memory | ⚪ -0.01 | metrics profiles logs |
| otlp_ingest_logs_5mb_throughput (ignored) | throughput | ⚪ +0.02 | metrics profiles logs |
| otlp_ingest_metrics_5mb_cpu (erratic) | cpu | ⚪ -0.14 | metrics profiles logs |
| otlp_ingest_traces_5mb_throughput | throughput | ⚪ +0.15 | metrics profiles logs |
| dsd_uds_1mb_3k_contexts_memory | memory | ⚪ -0.20 | metrics profiles logs |
| dsd_uds_500mb_3k_contexts_memory | memory | ⚪ -0.20 | metrics profiles logs |
| dsd_uds_100mb_3k_contexts_memory | memory | ⚪ -0.21 | metrics profiles logs |
| quality_gates_rss_dsd_ultraheavy | memory | ⚪ -0.29 | metrics profiles logs |
| quality_gates_rss_dsd_low | memory | ⚪ -0.39 | metrics profiles logs |
| quality_gates_rss_dsd_medium | memory | ⚪ -0.40 | metrics profiles logs |
| otlp_ingest_traces_ottl_transform_5mb_throughput | throughput | ⚪ +0.61 | metrics profiles logs |
| dsd_uds_10mb_3k_contexts_cpu (erratic) | cpu | ⚪ -0.79 | metrics profiles logs |
| dsd_uds_512kb_3k_contexts_cpu (erratic) | cpu | ⚪ -1.19 | metrics profiles logs |
| dsd_uds_1mb_3k_contexts_cpu (erratic) | cpu | ⚪ -2.77 | metrics profiles logs |
| otlp_ingest_logs_5mb_memory (ignored) | memory | ⚪ -5.16 | metrics profiles logs |

Bounds Checks: ✅ Passed (5)

| experiment | check | replicates | observed | links |
|---|---|---|---|---|
| quality_gates_rss_dsd_heavy | memory_usage | 10/10 | ✅ 115 MiB ≤ 140 MiB | metrics profiles logs |
| quality_gates_rss_dsd_low | memory_usage | 10/10 | ✅ 39.8 MiB ≤ 50 MiB | metrics profiles logs |
| quality_gates_rss_dsd_medium | memory_usage | 10/10 | ✅ 60.6 MiB ≤ 75 MiB | metrics profiles logs |
| quality_gates_rss_dsd_ultraheavy | memory_usage | 10/10 | ✅ 183 MiB ≤ 200 MiB | metrics profiles logs |
| quality_gates_rss_idle | memory_usage | 10/10 | ✅ 26.9 MiB ≤ 40 MiB | metrics profiles logs |

Explanation

A change is flagged as a regression when |Δ mean %| > 5.00% in the regressing direction for its optimization goal AND SMP marks the experiment as a regression (is_regression: true). Improvements use the matching criteria for the improving direction. Experiments configured erratic: true (tagged (ignored)) are skipped outright; experiments detected as erratic at runtime (tagged (erratic)) still count, since that flag describes sample dispersion rather than directional certainty. The Δ mean % cell is colored accordingly: 🟢 = improvement, 🔴 = regression, ⚪ = neutral. Reduction in CPU or memory is an improvement; reduction in ingress throughput is a regression.
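The decision rule in the explanation above can be sketched as a small function. This is only an illustration of the rule as described, not the detector's actual code. `Goal`, `Verdict`, and `classify` are hypothetical names, and the `smp_is_improvement` flag is an assumed counterpart to `is_regression`, which is the only flag the explanation names.

```rust
/// Optimization goal of an experiment, as listed in the "goal" column.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Goal {
    Cpu,
    Memory,
    Throughput,
}

/// Outcome of the check (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Ignored,
    Regression,
    Improvement,
    Neutral,
}

/// Threshold on |Δ mean %| taken from the explanation (5.00%).
pub const THRESHOLD_PCT: f64 = 5.0;

pub fn classify(
    goal: Goal,
    delta_mean_pct: f64,
    configured_erratic: bool,
    smp_is_regression: bool,
    smp_is_improvement: bool, // assumed flag, mirrors `is_regression`
) -> Verdict {
    // Experiments configured `erratic: true` are skipped outright.
    if configured_erratic {
        return Verdict::Ignored;
    }
    if delta_mean_pct.abs() <= THRESHOLD_PCT {
        return Verdict::Neutral;
    }
    // More CPU or memory is worse; less ingress throughput is worse.
    let worse = match goal {
        Goal::Cpu | Goal::Memory => delta_mean_pct > 0.0,
        Goal::Throughput => delta_mean_pct < 0.0,
    };
    if worse && smp_is_regression {
        Verdict::Regression
    } else if !worse && smp_is_improvement {
        Verdict::Improvement
    } else {
        Verdict::Neutral
    }
}
```

Under this reading, otlp_ingest_logs_5mb_memory (-5.16, tagged (ignored)) is skipped even though its magnitude exceeds the threshold, because the configured-erratic check comes first.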

Comment thread lib/saluki-components/src/config_registry/classifier.rs
@lucastemb lucastemb requested a review from jszwedko June 4, 2026 18:52
@pr-commenter

pr-commenter Bot commented Jun 4, 2026

Binary Size Analysis (Agent Data Plane)

Baseline: b8dd422 · Comparison: 43275bd · diff
Analysis Configuration: stripped binaries · Pass/Fail Threshold: +5%
Sizes: 38.08 MiB (baseline) vs 38.06 MiB (comparison)
Size Change: -26.03 KiB (-0.07%)

✅ Binary size difference within threshold
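The reported percentage can be reproduced from the figures above. The following is a minimal sketch, assuming the +5% threshold applies to growth relative to the baseline size; `size_change_pct` and `within_threshold` are hypothetical helper names, not part of the analysis tool.

```rust
/// Relative size change in percent, given the baseline size and the delta in KiB.
pub fn size_change_pct(baseline_kib: f64, delta_kib: f64) -> f64 {
    delta_kib / baseline_kib * 100.0
}

/// Passes when the change does not exceed the allowed growth (assumed semantics).
pub fn within_threshold(change_pct: f64, threshold_pct: f64) -> bool {
    change_pct <= threshold_pct
}

/// Recomputes the check from the numbers in this report.
pub fn report_check() -> (f64, bool) {
    let baseline_kib = 38.08 * 1024.0; // 38.08 MiB baseline
    let pct = size_change_pct(baseline_kib, -26.03); // -26.03 KiB change
    (pct, within_threshold(pct, 5.0))
}
```

With these inputs the change comes out at roughly -0.067%, which rounds to the -0.07% shown in the report and is well inside the threshold.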

Changes by Module
| Module | File Size | Symbols |
|---|---|---|
| figment | +45.02 KiB | 94 |
| core | -24.88 KiB | 1565 |
| hyper_util | +14.35 KiB | 51 |
| saluki_components::common::datadog | -13.09 KiB | 48 |
| otlp_protos::otlp_include::opentelemetry | +10.66 KiB | 113 |
| saluki_components::common::otlp | -8.72 KiB | 21 |
| agent_data_plane::internal::env | +7.13 KiB | 9 |
| hickory_net | -6.86 KiB | 16 |
| hyper_rustls | -6.82 KiB | 8 |
| saluki_components::sources::otlp | -6.46 KiB | 11 |
| tower | -6.28 KiB | 50 |
| agent_data_plane::cli::run | -6.10 KiB | 4 |
| hashbrown | -5.28 KiB | 31 |
| alloc | -5.18 KiB | 110 |
| h2 | -5.03 KiB | 94 |
| prost | +4.73 KiB | 26 |
| [Unmapped] | -4.42 KiB | 1 |
| saluki_components::encoders::datadog | -4.16 KiB | 15 |
| anon.ae75cd3060d8b952a6fd466cba97c1c5.103.llvm.298110800033293764 | +4.09 KiB | 1 |
| anon.4706bd4aeb8942c18877ac36e20eb5a0.57.llvm.4503685933437568069 | -4.00 KiB | 1 |

Detailed Symbol Changes
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  [NEW] +47.6Ki  [NEW] +47.4Ki    saluki_components::common::datadog::io::run_endpoint_io_loop::_{{closure}}::haa53f0bdb9193162
  [NEW] +23.4Ki  [NEW] +23.3Ki    h2::proto::connection::DynConnection<B>::recv_frame::ha3f039a4a5d4b305
  [NEW] +22.6Ki  [NEW] +22.5Ki    hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_loop::h01c1b65213c02a61
  [NEW] +20.2Ki  [NEW] +20.1Ki    hyper_util::client::legacy::client::Client<C,B>::send_request::_{{closure}}::hd8e6113d7c690de3
   +51% +19.9Ki   +51% +19.9Ki    agent_data_plane::internal::env::workload::build_collector::_{{closure}}::h8518305f145d7e76
   +50% +19.2Ki   +51% +19.2Ki    agent_data_plane::internal::env::ADPEnvironmentProvider::from_configuration::_{{closure}}::h01476d2b9c923f6d
  [NEW] +16.9Ki  [NEW] +16.8Ki    _<core::pin::Pin<P> as core::future::future::Future>::poll::hb00cb23c69d84358
  [NEW] +11.4Ki  [NEW] +11.2Ki    _<figment::value::magic::RelativePathBuf as figment::value::magic::Magic>::deserialize_from::h8d37ed269d34eed5
  [NEW] +10.8Ki  [NEW] +10.7Ki    _<figment::value::magic::Tagged<T> as figment::value::magic::Magic>::deserialize_from::h5dac5c6b4df7bde9
  [NEW] +9.54Ki  [NEW] +9.38Ki    _<figment::value::de::ConfiguredValueDe<I> as serde_core::de::Deserializer>::deserialize_struct::h2b57f80a64519235
  [NEW] +9.11Ki  [NEW] +8.97Ki    _<tracing::instrument::Instrumented<T> as core::future::future::Future>::poll::ha966712a914af3db
  [NEW] +8.18Ki  [NEW] +8.05Ki    saluki_components::common::datadog::config::ForwarderConfiguration::build_routable_endpoints::h978d2ec35c207a35
  [DEL] -8.37Ki  [DEL] -8.23Ki    saluki_components::common::datadog::config::ForwarderConfiguration::build_routable_endpoints::h2d044e6702d42f69
  [DEL] -8.67Ki  [DEL] -8.53Ki    _<tracing::instrument::Instrumented<T> as core::future::future::Future>::poll::h53386ddcc8a09bda
 -73.0% -9.65Ki -73.6% -9.65Ki    saluki_io::net::client::http::client::HttpClientBuilder::build::hc31e1965fdac5778
  [DEL] -15.0Ki  [DEL] -14.8Ki    saluki_components::common::datadog::apm::_::_<impl serde_core::de::Deserialize for saluki_components::common::datadog::apm::ApmConfiguration>::deserialize::h94b605d1c568ebac
  [DEL] -15.4Ki  [DEL] -15.3Ki    _<core::pin::Pin<P> as core::future::future::Future>::poll::h3502d14eabfb0d3f
  [DEL] -17.8Ki  [DEL] -17.7Ki    hyper_util::client::legacy::client::Client<C,B>::send_request::_{{closure}}::h72267e740e2e9efc
  [DEL] -31.9Ki  [DEL] -31.7Ki    agent_data_plane::internal::env::workload::RemoteAgentWorkloadProvider::from_configuration::_{{closure}}::h7d89a227ad85041c
  [DEL] -48.0Ki  [DEL] -47.9Ki    saluki_components::common::datadog::io::run_endpoint_io_loop::_{{closure}}::h6760b52a19241012
  -1.1% -90.0Ki  -1.3% -79.2Ki    [8051 Others]
  -0.1% -26.0Ki  -0.0% -15.6Ki    TOTAL

///
/// Defaults to 10 seconds. If the TLS handshake does not complete within this duration after the
/// TCP connection is established, the connection attempt fails with a timeout error.
#[serde(default = "default_tls_handshake_timeout_secs", rename = "tls_handshake_timeout")]

Collaborator

Will this parse durations correctly? The config coming from the Core Agent will be a String "Go duration" like 10s.

Contributor Author

Good callout. I tried to base it on some of the other config keys in ForwarderConfiguration, but I don't think the same logic applies, since keys like forwarder_timeout are expressed in plain seconds. As a result, I also changed the name to tls_handshake_timeout to reflect this.
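To make the concern in this thread concrete, here is a minimal std-only sketch of parsing a Go-style duration string such as 10s into a `std::time::Duration`. It is an illustration only, not the approach taken in this PR, and `parse_go_duration` is a hypothetical helper. It covers a subset of Go's syntax: unsigned integer values with the units ns, us, µs, ms, s, m, and h. Fractions such as 1.5s and signs are not handled.

```rust
use std::time::Duration;

/// Parses strings like "10s", "1m30s", or "500ms". Returns `None` on anything
/// outside the supported subset (hypothetical helper for illustration).
pub fn parse_go_duration(input: &str) -> Option<Duration> {
    let mut rest = input.trim();
    if rest.is_empty() {
        return None;
    }
    // Go accepts a bare "0" without a unit.
    if rest == "0" {
        return Some(Duration::ZERO);
    }
    let mut total = Duration::ZERO;
    while !rest.is_empty() {
        // Split off the leading run of ASCII digits.
        let digits_end = rest
            .find(|c: char| !c.is_ascii_digit())
            .unwrap_or(rest.len());
        if digits_end == 0 {
            return None;
        }
        let value: u64 = rest[..digits_end].parse().ok()?;
        rest = &rest[digits_end..];
        // The unit is everything up to the next digit (or the end of input).
        let unit_end = rest
            .find(|c: char| c.is_ascii_digit())
            .unwrap_or(rest.len());
        let unit = &rest[..unit_end];
        rest = &rest[unit_end..];
        let part = match unit {
            "ns" => Duration::from_nanos(value),
            "us" | "µs" => Duration::from_micros(value),
            "ms" => Duration::from_millis(value),
            "s" => Duration::from_secs(value),
            "m" => Duration::from_secs(value.checked_mul(60)?),
            "h" => Duration::from_secs(value.checked_mul(3600)?),
            // Missing or unknown unit, e.g. a bare "10".
            _ => return None,
        };
        total = total.checked_add(part)?;
    }
    Some(total)
}
```

A bare number such as 10 is rejected here, which mirrors the distinction raised above between keys expressed in plain seconds and keys that carry a duration string.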

}
}

#[cfg(test)]

Collaborator

Did you mean to delete these tests?

Contributor Author

I did, because configure_tls_alpn_for_http_protocol was no longer a standalone function after refactoring, but I probably should have modified the tests instead of deleting them altogether.

Comment thread lib/saluki-io/src/net/client/http/conn.rs
@@ -378,27 +382,51 @@ impl Service<Uri> for HttpsCapableConnector {
}

fn call(&mut self, dst: Uri) -> Self::Future {

Collaborator

I'd like to get some other eyes from @tobz or @webern given the nuance involved here.

Comment thread lib/saluki-io/src/net/client/http/conn.rs Outdated
Comment thread lib/saluki-io/src/net/client/http/conn.rs Outdated
Comment thread lib/saluki-io/src/net/client/http/conn.rs Outdated
@jszwedko jszwedko left a comment

Collaborator

A few more comments, and I'd still like more review of the changes to conn.rs.

///
/// Defaults to 10 seconds. If the TLS handshake does not complete within this duration after the
/// TCP connection is established, the connection attempt fails with a timeout error.
#[serde(default = "default_tls_handshake_timeout", rename = "tls_handshake_timeout")]

Collaborator

Suggested change
#[serde(default = "default_tls_handshake_timeout", rename = "tls_handshake_timeout")]
#[serde(default = "default_tls_handshake_timeout")]

I don't think we need the rename now that the field name matches.

Comment on lines +176 to +177
/// Defaults to 10 seconds.
///

Collaborator

Suggested change
/// Defaults to 10 seconds.
///

There is no default for this function that I can see 🤔

use std::sync::Arc;

use tower::Service as _;

Collaborator

Did you mean to remove this? I think it is actually needed 🤔

Comment thread lib/saluki-io/Cargo.toml
http-body-util = { workspace = true }
proptest = { workspace = true }
rand_distr = { workspace = true }
saluki-metrics = { workspace = true, features = ["test"] }

Collaborator

Is this needed? I'm not seeing saluki-metrics used in saluki-io.

Labels

area/components Sources, transforms, and destinations. area/io General I/O and networking.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow configuring the TLS handshake timeout for HTTP clients.

2 participants