Add DTLS throughput benchmark tool and optimize send path #10551

Open
julek-wolfssl wants to merge 6 commits into wolfSSL:master from julek-wolfssl:dtls-perf-benchmark

Conversation

@julek-wolfssl

Member

Add examples/benchmark/dtls_bench, a DTLS throughput benchmark that completes a handshake and then measures bulk-send throughput. It supports DTLS 1.2 and 1.3, selectable cipher suites, an end-to-end mode, and a -z sink mode that discards records on the server after the handshake to isolate the sender's record-layer cost. The socket is set up with wolfSSL_set_dtls_fd_connected.

Optimize the send path exercised by the benchmark:

  • wolfio (EmbedSendTo): cache the per-descriptor socket-type probe (getsockopt SO_TYPE) in WOLFSSL_DTLS_CTX instead of running it on every send, removing a syscall from the record send path. The cache is invalidated whenever rfd/wfd is reassigned.

  • internal (BuildMessage): for AEAD suites whose explicit nonce is the 8-byte record sequence number, write the sequence number directly as nonce_explicit instead of drawing it from the RNG. This covers AES-GCM (RFC 5288 sec 3), AES-CCM (RFC 6655 sec 3), SM4-GCM/CCM (RFC 8998 sec 3), and Camellia-/ARIA-GCM which inherit the RFC 5288 construction; ChaCha20 uses an implicit nonce and is excluded. A new read-only PeekSEQ() helper reads the sequence number without advancing the per-direction counter, leaving the single mandated increment to writeAeadAuthData().
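
To make the second bullet concrete, here is a minimal sketch of the peek-then-write idea. It is not wolfSSL's internal code: `seq_state`, `peek_seq`, and `advance_seq` are hypothetical names, and the hi/lo split of the counter is an assumption made for illustration. It only shows that copying the sequence number into the 8-byte `nonce_explicit` field and advancing the counter can be kept as separate steps, so the increment still happens exactly once per record.

```c
#include <stddef.h>
#include <stdint.h>

#define EXPLICIT_NONCE_SZ 8 /* 8-byte nonce_explicit, as in RFC 5288 sec 3 */

/* Hypothetical per-direction record counter, split into 32-bit halves. */
typedef struct {
    uint32_t hi;
    uint32_t lo;
} seq_state;

/* Store a 32-bit value in big-endian (network) byte order. */
static void put_be32(uint32_t v, uint8_t* out)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}

/* Read-only: copy the current sequence number without advancing it. */
static void peek_seq(const seq_state* s, uint8_t out[EXPLICIT_NONCE_SZ])
{
    put_be32(s->hi, out);
    put_be32(s->lo, out + 4);
}

/* The one increment per record, kept separate from the peek. */
static void advance_seq(seq_state* s)
{
    if (++s->lo == 0)
        s->hi++;
}
```

In this sketch a caller would call `peek_seq()` while laying out the record and `advance_seq()` once when the additional authenticated data is written, which mirrors the split between PeekSEQ() and writeAeadAuthData() described above.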

Also ignore the built dtls_bench binary in .gitignore.

Copilot AI review requested due to automatic review settings May 28, 2026 14:07

Copilot AI left a comment

Contributor

Pull request overview

This PR adds a new DTLS throughput benchmark under examples/benchmark/ and makes two optimizations in the DTLS send path to better measure (and reduce) per-record overhead in wolfSSL’s record layer and socket I/O glue.

Changes:

  • Add examples/benchmark/dtls_bench.c: a DTLS 1.2/1.3 throughput benchmark with cipher selection, plain-UDP baseline mode, and a client-side “sink send” mode.
  • Optimize DTLS send path by caching the SO_TYPE (datagram vs stream) probe in WOLFSSL_DTLS_CTX instead of calling getsockopt() on every send.
  • Optimize AEAD explicit-nonce construction by writing the record sequence number directly for suites where the explicit nonce is defined as the seq number, using a new read-only PeekSEQ() helper.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| wolfssl/internal.h | Adds DTLS context fields for caching socket type probe results. |
| tests/api.c | Resets new DTLS context cache fields when copying SSL state in an API test helper. |
| src/wolfio.c | Changes datagram-vs-stream detection to cache SO_TYPE results. |
| src/ssl.c | Invalidates the DTLS socket-type cache when read/write fds are (re)assigned. |
| src/internal.c | Adds PeekSEQ() and uses it to derive AEAD explicit nonce from sequence number for applicable suites. |
| examples/benchmark/include.am | Adds dtls_bench to Automake build outputs. |
| examples/benchmark/dtls_bench.c | New DTLS benchmark tool implementation. |
| .gitignore | Ignores the built examples/benchmark/dtls_bench binary. |

Comment thread src/wolfio.c Outdated
Comment thread examples/benchmark/include.am
Comment thread examples/benchmark/dtls_bench.c
Comment thread examples/benchmark/dtls_bench.c
@julek-wolfssl julek-wolfssl self-assigned this May 28, 2026
@julek-wolfssl julek-wolfssl force-pushed the dtls-perf-benchmark branch from 8d445f1 to 32c7f0b Compare May 28, 2026 17:02
@julek-wolfssl julek-wolfssl marked this pull request as ready for review May 28, 2026 17:19
@github-actions

retest this please

@julek-wolfssl julek-wolfssl force-pushed the dtls-perf-benchmark branch 2 times, most recently from 7b5387d to 4068636 Compare May 29, 2026 16:27
@github-actions

github-actions Bot commented May 29, 2026


MemBrowse Memory Report

gcc-arm-cortex-m3

  • FLASH: .text +100 B (+0.1%, 121,357 B / 262,144 B, total: 46% used)

gcc-arm-cortex-m4

  • FLASH: .text +256 B (+0.1%, 198,958 B / 262,144 B, total: 76% used)

gcc-arm-cortex-m4-crypto-only

  • FLASH: .text +64 B (+0.0%, 173,614 B / 262,144 B, total: 66% used)

gcc-arm-cortex-m4-dtls13

  • FLASH: .text +128 B (+0.1%, 179,672 B / 1,048,576 B, total: 17% used)

gcc-arm-cortex-m4-openssl-compat

  • FLASH: .text +704 B (+0.1%, 766,828 B / 1,048,576 B, total: 73% used)

gcc-arm-cortex-m4-pkcs7

  • FLASH: .text +320 B (+0.2%, 211,313 B / 262,144 B, total: 81% used)

gcc-arm-cortex-m4-pq

  • FLASH: .text +384 B (+0.1%, 277,528 B / 1,048,576 B, total: 26% used)

gcc-arm-cortex-m4-rsa-only

  • FLASH: .text +320 B (+0.1%, 323,000 B / 1,048,576 B, total: 31% used)

gcc-arm-cortex-m4-tls12

  • FLASH: .text +128 B (+0.1%, 122,125 B / 262,144 B, total: 47% used)

gcc-arm-cortex-m4-tls13

  • FLASH: .text +64 B (+0.0%, 234,464 B / 262,144 B, total: 89% used)

gcc-arm-cortex-m7

  • FLASH: .text +256 B (+0.1%, 198,958 B / 262,144 B, total: 76% used)

gcc-arm-cortex-m7-pq

  • FLASH: .text +384 B (+0.1%, 278,104 B / 1,048,576 B, total: 27% used)

gcc-arm-cortex-m7-tls13

  • FLASH: .text +64 B (+0.0%, 234,528 B / 262,144 B, total: 89% used)

linuxkm-pie

  • Data: __patchable_function_entries +56 B (+0.2%, 24,264 B)

linuxkm-standard

@julek-wolfssl julek-wolfssl force-pushed the dtls-perf-benchmark branch from 4068636 to 0d93481 Compare May 31, 2026 14:02
@julek-wolfssl

julek-wolfssl commented May 31, 2026

Member Author

retest this please.

@julek-wolfssl julek-wolfssl force-pushed the dtls-perf-benchmark branch 2 times, most recently from 0c90533 to 1e7d632 Compare June 1, 2026 13:32
@julek-wolfssl

Member Author

The Jenkins failures are not related to this PR.

@douzzer douzzer added the Staged label (Staged for merge pending final test results and review) Jun 5, 2026
douzzer previously requested changes Jun 5, 2026
Comment thread examples/benchmark/dtls_bench.c
@douzzer douzzer removed the Staged label (Staged for merge pending final test results and review) Jun 5, 2026
@julek-wolfssl julek-wolfssl force-pushed the dtls-perf-benchmark branch from 1e7d632 to cae9827 Compare June 5, 2026 22:19
@julek-wolfssl julek-wolfssl requested a review from douzzer June 5, 2026 22:20

@dgarske dgarske left a comment

Member

Skoll Code Review

Scan type: review
Overall recommendation: COMMENT
Findings: 5 total — 5 posted, 0 skipped
5 finding(s) posted as inline comments (see file-level comments below)

Posted findings

  • [Medium] isDGram cache write is an unsynchronized data race under WOLFSSL_RW_THREADED (src/wolfio.c:655-666)
  • [Low] Single isDGram cache shared between rfd and wfd of potentially different socket types (src/wolfio.c:649-667)
  • [Low] Confirm discarded explicit-nonce value is intentional across FIPS and epoch-order paths (src/internal.c:24809-24831)
  • [Low] now_sec ignores clock_gettime failure, can return uninitialized time (examples/benchmark/dtls_bench.c:96-101)
  • [Low] -z (sink send) silently ignored when combined with -s (server) (examples/benchmark/dtls_bench.c:parse_args)

Review generated by Skoll

Comment thread src/wolfio.c Outdated
Comment thread src/wolfio.c Outdated
Comment thread src/internal.c
Comment thread examples/benchmark/dtls_bench.c
Comment thread examples/benchmark/dtls_bench.c
dtls_bench.c is built whenever DTLS and the example servers are enabled,
including the cross-mingw-all-crypto multi-test scenario, which cross-
compiles for Windows. It directly includes POSIX-only headers
(<sys/socket.h>, <arpa/inet.h>, <netdb.h>, <net/if.h>) that mingw does
not ship, so the build failed there.

Gate the networking includes and the whole benchmark body behind a
DTLS_BENCH_ENABLED check (WOLFSSL_DTLS, not USE_WINDOWS_API, not
WOLFSSL_NO_SOCK). When the platform lacks POSIX BSD sockets, compile a
small stub main() that reports the tool is unsupported, so the source
tree still builds.
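
The gating described in this commit message could look roughly like the sketch below. The three configuration macros are the ones named above, but nothing defines them in this standalone snippet; `dtls_bench_entry` is a hypothetical stand-in for the real `main()`, and the stub's message and return value are assumptions.

```c
#include <stdio.h>

/* Assumption: WOLFSSL_DTLS, USE_WINDOWS_API and WOLFSSL_NO_SOCK normally
 * come from the wolfSSL build configuration; nothing defines them here. */
#if defined(WOLFSSL_DTLS) && !defined(USE_WINDOWS_API) && \
    !defined(WOLFSSL_NO_SOCK)
    #define DTLS_BENCH_ENABLED
#endif

#ifdef DTLS_BENCH_ENABLED
/* POSIX-only headers stay inside the gate so mingw never sees them. */
#include <sys/socket.h>
#include <arpa/inet.h>
#include <netdb.h>

/* Hypothetical entry point standing in for the real benchmark body. */
static int dtls_bench_entry(void)
{
    puts("dtls_bench: POSIX sockets available");
    return 0;
}
#else
/* Stub: keeps the source tree building where the tool cannot run. */
static int dtls_bench_entry(void)
{
    fprintf(stderr, "dtls_bench: not supported in this build\n");
    return 1;
}
#endif
```

Keeping both the includes and the body behind one macro means a single condition decides whether any POSIX header is pulled in at all.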
Under WOLFSSL_RW_THREADED the read and write threads could both perform
the lazy isDGramSock() first-time cache write concurrently; the cached
bit-fields share a storage unit with other dtlsCtx flags, making this a
data race.

Instead of caching from inside the I/O callbacks, run the
getsockopt(SO_TYPE) probe where dtlsCtx.rfd/wfd is assigned and store
the result per descriptor (rfd and wfd may be different sockets of
different types). fd assignment happens during single-threaded setup,
so no thread-specific handling is needed, and the I/O callbacks reduce
to reading a struct member, so isDGramSock() is dropped in favor of
reading the flags directly. The stateless-hash test no longer needs to
mask the fields: the I/O callbacks no longer write to the WOLFSSL
object.
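
A minimal sketch of probing at assignment time, under stated assumptions: `dtls_fd_ctx`, `ctx_set_rfd`, and `ctx_set_wfd` are hypothetical names rather than wolfSSL's, and the flags are stored as whole bytes here instead of bit-fields to keep the example simple. Only `getsockopt()` with `SO_TYPE` is the real POSIX call.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical context holding one cached socket type per descriptor. */
typedef struct {
    int rfd;
    int wfd;
    unsigned char rfdIsDGram;
    unsigned char wfdIsDGram;
} dtls_fd_ctx;

/* Returns 1 if fd is a SOCK_DGRAM socket, 0 otherwise (also on error). */
static int probe_is_dgram(int fd)
{
    int type = 0;
    socklen_t len = (socklen_t)sizeof(type);

    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) != 0)
        return 0;
    return type == SOCK_DGRAM;
}

/* Probe once, while the descriptor is being assigned during setup. */
static void ctx_set_rfd(dtls_fd_ctx* c, int fd)
{
    c->rfd = fd;
    c->rfdIsDGram = (unsigned char)probe_is_dgram(fd);
}

static void ctx_set_wfd(dtls_fd_ctx* c, int fd)
{
    c->wfd = fd;
    c->wfdIsDGram = (unsigned char)probe_is_dgram(fd);
}
```

With this shape the send and receive callbacks only read `rfdIsDGram` or `wfdIsDGram`, so they never write to shared state after setup.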
…laceholders

The PeekSEQ-written explicit nonce is overwritten by the encrypt paths
before transmission (cipher-generated counter, or aead_exp_IV on legacy
FIPS/selftest builds) and the AAD sequence is written separately; the
optimization is the removal of the per-record RNG draw.
Fail loudly if clock_gettime() ever fails instead of computing
throughput from uninitialized stack, and warn when -z is combined with
-s since the sink-send only applies to the client.
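
For illustration, a timing helper that refuses to continue on failure might look like the following. The name `now_sec` comes from the review finding, but the body, the choice of `CLOCK_MONOTONIC`, and the exit behavior are assumptions rather than the code in this PR.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Monotonic time in seconds. Exits instead of returning a value built
 * from an uninitialized struct timespec when clock_gettime() fails. */
static double now_sec(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) {
        perror("clock_gettime");
        exit(EXIT_FAILURE);
    }
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}
```

Throughput would then be computed as bytes sent divided by the difference of two `now_sec()` readings taken around the send loop.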
@julek-wolfssl julek-wolfssl requested a review from dgarske June 10, 2026 12:05

@dgarske dgarske left a comment

Member

Skoll Code Review

Scan type: review
Overall recommendation: COMMENT
Findings: 5 total — 5 posted, 0 skipped
4 finding(s) posted as inline comments (see file-level comments below)

Posted findings

  • [Medium] Explicit-nonce 'placeholder' invariant not guaranteed on ATOMIC_USER AEAD path (src/internal.c:24870-24896)
  • [Medium] DTLS benchmark client aborts on transient send errors that udp_client tolerates (examples/benchmark/dtls_bench.c:710-722)
  • [Low] No regression test for seq-as-nonce path or cached socket-type fields (src/internal.c:24885-24888, src/wolfio.c:649-662)
  • [Low] Help flag -? returns failure exit code (examples/benchmark/dtls_bench.c:266-269, 786-789)
  • [Low] Large stack buffer and unvalidated numeric option parsing in benchmark (examples/benchmark/dtls_bench.c:140-156, 186-227)

Review generated by Skoll

Comment thread src/internal.c
Comment thread examples/benchmark/dtls_bench.c
Comment thread examples/benchmark/dtls_bench.c
Comment thread examples/benchmark/dtls_bench.c Outdated
- Retry wolfSSL_write on the same recoverable send errors the plain-UDP
  baseline already retries on: EAGAIN/EWOULDBLOCK surface as WANT_WRITE
  and ENOBUFS as SOCKET_ERROR_E with errno preserved. The buffered
  record is flushed by the retried call without re-encrypting.
- Treat an explicit -? as a help request: print usage to stdout and
  exit 0, keeping stderr and a failure exit for genuine option errors.
- Enumerate ciphers with wolfSSL_get_cipher_list() instead of an 8 KiB
  stack buffer, and range-check -p and -b like the other numeric
  options.
- Document in BuildMessage that the FIPS<2 path overwrites the
  explicit-nonce placeholder inside BuildMessage itself, and that the
  one path transmitting the bytes as written (ATOMIC_USER MacEncryptCb)
  still emits the sequence number that RFC 5288 et al. prescribe.
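
The retry behavior in the first item can be sketched at the errno level. This is not the benchmark's code: it uses a hypothetical `send_fn` callback in place of wolfSSL_write or a socket call, and the bounded retry count is an assumption. `flaky_send` is a demo sender that exists only to exercise the loop.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical send callback: bytes sent, or -1 with errno set. */
typedef long (*send_fn)(void* ctx, const void* buf, size_t len);

/* The transient conditions named above. */
static int is_recoverable(int err)
{
    return err == EAGAIN || err == EWOULDBLOCK || err == ENOBUFS;
}

/* Retry a send on recoverable errors, at most max_retries extra times.
 * Returns bytes sent, or -1 on a hard error or when retries run out. */
static long send_with_retry(send_fn fn, void* ctx, const void* buf,
                            size_t len, int max_retries)
{
    int attempt;

    for (attempt = 0; attempt <= max_retries; attempt++) {
        long n = fn(ctx, buf, len);
        if (n >= 0)
            return n;
        if (!is_recoverable(errno))
            return -1;
    }
    return -1;
}

/* Demo sender: fails with ENOBUFS for the first *ctx calls, then
 * reports the whole buffer as sent. */
static long flaky_send(void* ctx, const void* buf, size_t len)
{
    int* failures_left = (int*)ctx;

    (void)buf;
    if (*failures_left > 0) {
        (*failures_left)--;
        errno = ENOBUFS;
        return -1;
    }
    return (long)len;
}
```

A real sender would likely wait for the socket to become writable between attempts rather than spinning; the sketch leaves that out to keep the control flow visible.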
@julek-wolfssl julek-wolfssl requested a review from dgarske June 10, 2026 21:31
5 participants