Add DTLS throughput benchmark tool and optimize send path #10551
julek-wolfssl wants to merge 6 commits
Pull request overview
This PR adds a new DTLS throughput benchmark under examples/benchmark/ and makes two optimizations in the DTLS send path to better measure (and reduce) per-record overhead in wolfSSL’s record layer and socket I/O glue.
Changes:
- Add `examples/benchmark/dtls_bench.c`: a DTLS 1.2/1.3 throughput benchmark with cipher selection, plain-UDP baseline mode, and a client-side “sink send” mode.
- Optimize DTLS send path by caching the `SO_TYPE` (datagram vs stream) probe in `WOLFSSL_DTLS_CTX` instead of calling `getsockopt()` on every send.
- Optimize AEAD explicit-nonce construction by writing the record sequence number directly for suites where the explicit nonce is defined as the seq number, using a new read-only `PeekSEQ()` helper.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `wolfssl/internal.h` | Adds DTLS context fields for caching socket type probe results. |
| `tests/api.c` | Resets new DTLS context cache fields when copying SSL state in an API test helper. |
| `src/wolfio.c` | Changes datagram-vs-stream detection to cache SO_TYPE results. |
| `src/ssl.c` | Invalidates the DTLS socket-type cache when read/write fds are (re)assigned. |
| `src/internal.c` | Adds PeekSEQ() and uses it to derive AEAD explicit nonce from sequence number for applicable suites. |
| `examples/benchmark/include.am` | Adds dtls_bench to Automake build outputs. |
| `examples/benchmark/dtls_bench.c` | New DTLS benchmark tool implementation. |
| `.gitignore` | Ignores the built examples/benchmark/dtls_bench binary. |
8d445f1 to 32c7f0b
retest this please
7b5387d to 4068636
4068636 to 0d93481
retest this please.
0c90533 to 1e7d632
The Jenkins failures are not related to this PR.
1e7d632 to cae9827
dgarske left a comment
Skoll Code Review
Scan type: review
Overall recommendation: COMMENT
Findings: 5 total — 5 posted, 0 skipped
5 finding(s) posted as inline comments (see file-level comments below)
Posted findings
- [Medium] isDGram cache write is an unsynchronized data race under WOLFSSL_RW_THREADED: src/wolfio.c:655-666
- [Low] Single isDGram cache shared between rfd and wfd of potentially different socket types: src/wolfio.c:649-667
- [Low] Confirm discarded explicit-nonce value is intentional across FIPS and epoch-order paths: src/internal.c:24809-24831
- [Low] now_sec ignores clock_gettime failure, can return uninitialized time: examples/benchmark/dtls_bench.c:96-101
- [Low] -z (sink send) silently ignored when combined with -s (server): examples/benchmark/dtls_bench.c:parse_args
Review generated by Skoll
Add examples/benchmark/dtls_bench, a DTLS throughput benchmark that completes a handshake and then measures bulk-send throughput. It supports DTLS 1.2 and 1.3, selectable cipher suites, an end-to-end mode, and a -z sink mode that discards records on the server after the handshake to isolate the sender's record-layer cost. The socket is set up with wolfSSL_set_dtls_fd_connected.

Optimize the send path exercised by the benchmark:
- wolfio (EmbedSendTo): cache the per-descriptor socket-type probe (getsockopt SO_TYPE) in WOLFSSL_DTLS_CTX instead of running it on every send, removing a syscall from the record send path. The cache is invalidated whenever rfd/wfd is reassigned.
- internal (BuildMessage): for AEAD suites whose explicit nonce is the 8-byte record sequence number, write the sequence number directly as nonce_explicit instead of drawing it from the RNG. This covers AES-GCM (RFC 5288 sec 3), AES-CCM (RFC 6655 sec 3), SM4-GCM/CCM (RFC 8998 sec 3), and Camellia-/ARIA-GCM which inherit the RFC 5288 construction; ChaCha20 uses an implicit nonce and is excluded. A new read-only PeekSEQ() helper reads the sequence number without advancing the per-direction counter, leaving the single mandated increment to writeAeadAuthData().

Also ignore the built dtls_bench binary in .gitignore.
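The explicit-nonce change can be illustrated with a minimal sketch. This is not the wolfSSL code: `seq_state`, `peek_seq()` and `write_aad_seq()` are hypothetical names, and the counter is simplified to a single 64-bit value. It only shows the shape of the idea, a read-only peek that serializes the sequence number big-endian, with the one increment per record left to the step that builds the AAD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXPLICIT_NONCE_SZ 8

/* Hypothetical per-direction record counter; not a wolfSSL type. */
typedef struct {
    uint64_t write_seq;
} seq_state;

/* Read-only: serialize the current sequence number big-endian
 * into out[] without advancing the counter. */
static void peek_seq(const seq_state* st, uint8_t out[EXPLICIT_NONCE_SZ])
{
    int i;

    assert(st != NULL && out != NULL);
    for (i = 0; i < EXPLICIT_NONCE_SZ; i++) {
        out[i] = (uint8_t)(st->write_seq >>
                           (8 * (EXPLICIT_NONCE_SZ - 1 - i)));
    }
}

/* Stand-in for the AAD step: writes the same bytes and performs
 * the single increment for this record. */
static void write_aad_seq(seq_state* st, uint8_t out[EXPLICIT_NONCE_SZ])
{
    peek_seq(st, out);
    st->write_seq++;
}
```

Calling `peek_seq()` before `write_aad_seq()` yields identical bytes for the nonce and the AAD sequence field, and the counter moves forward exactly once per record.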
dtls_bench.c is built whenever DTLS and the example servers are enabled, including the cross-mingw-all-crypto multi-test scenario, which cross-compiles for Windows. It directly includes POSIX-only headers (<sys/socket.h>, <arpa/inet.h>, <netdb.h>, <net/if.h>) that mingw does not ship, so the build failed there. Gate the networking includes and the whole benchmark body behind a DTLS_BENCH_ENABLED check (WOLFSSL_DTLS, not USE_WINDOWS_API, not WOLFSSL_NO_SOCK). When the platform lacks POSIX BSD sockets, compile a small stub main() that reports the tool is unsupported, so the source tree still builds.
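A rough sketch of such a gate, assuming the three conditions named above are combined exactly as listed. `bench_entry()` is a hypothetical stand-in for the tool's `main()`, the messages are invented for illustration, and the stub's return value is an assumption because the commit message does not state the exit status.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed gate, built from the conditions named in the commit message. */
#if defined(WOLFSSL_DTLS) && !defined(USE_WINDOWS_API) && \
    !defined(WOLFSSL_NO_SOCK)
    #define DTLS_BENCH_ENABLED
#endif

#ifdef DTLS_BENCH_ENABLED
/* POSIX-only headers are pulled in only when the gate is open. */
#include <sys/socket.h>
#include <arpa/inet.h>
#include <netdb.h>

/* Placeholder for the real benchmark body. */
static int bench_entry(FILE* out)
{
    assert(out != NULL);
    fprintf(out, "dtls_bench: POSIX sockets available\n");
    return 0;
}
#else
/* Stub: keeps the file compiling where POSIX sockets are missing. */
static int bench_entry(FILE* out)
{
    assert(out != NULL);
    fprintf(out, "dtls_bench: not supported in this build\n");
    return 0; /* assumption: the real stub's exit status is not stated */
}
#endif
```

Both branches define the same entry point, so the build system can list the file unconditionally.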
Under WOLFSSL_RW_THREADED the read and write threads could both perform the lazy isDGramSock() first-time cache write concurrently; the cached bit-fields share a storage unit with other dtlsCtx flags, making this a data race. Instead of caching from inside the I/O callbacks, run the getsockopt(SO_TYPE) probe where dtlsCtx.rfd/wfd is assigned and store the result per descriptor (rfd and wfd may be different sockets of different types). fd assignment happens during single-threaded setup, so no thread-specific handling is needed, and the I/O callbacks reduce to reading a struct member, so isDGramSock() is dropped in favor of reading the flags directly. The stateless-hash test no longer needs to mask the fields: the I/O callbacks no longer write to the WOLFSSL object.
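The probe-at-assignment design might look roughly like the following. `dtls_fd_ctx`, `probe_is_dgram()`, `set_read_fd()` and `set_write_fd()` are hypothetical names, not wolfSSL identifiers, and treating a failed probe as "datagram" is a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical stand-in for the fd-related part of the DTLS context. */
typedef struct {
    int rfd;
    int wfd;
    unsigned int rfdIsDGram : 1;
    unsigned int wfdIsDGram : 1;
} dtls_fd_ctx;

/* One getsockopt(SO_TYPE) call. Returns 0 only when the probe succeeds
 * and reports a non-datagram socket; sketch assumption: a failed probe
 * is treated as datagram. */
static int probe_is_dgram(int fd)
{
    int type = 0;
    socklen_t len = (socklen_t)sizeof(type);

    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) == 0 &&
        type != SOCK_DGRAM) {
        return 0;
    }
    return 1;
}

/* Probe once, during single-threaded setup, and store the result
 * per descriptor. */
static void set_read_fd(dtls_fd_ctx* ctx, int fd)
{
    assert(ctx != NULL);
    ctx->rfd = fd;
    ctx->rfdIsDGram = (unsigned int)probe_is_dgram(fd);
}

static void set_write_fd(dtls_fd_ctx* ctx, int fd)
{
    assert(ctx != NULL);
    ctx->wfd = fd;
    ctx->wfdIsDGram = (unsigned int)probe_is_dgram(fd);
}
```

After setup, the send and receive callbacks only read `rfdIsDGram` or `wfdIsDGram`, so concurrent reader and writer threads never write to the shared bit-field storage.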
…laceholders
The PeekSEQ-written explicit nonce is overwritten by the encrypt paths before transmission (cipher-generated counter, or aead_exp_IV on legacy FIPS/selftest builds) and the AAD sequence is written separately; the optimization is the removal of the per-record RNG draw.
Fail loudly if clock_gettime() ever fails instead of computing throughput from uninitialized stack, and warn when -z is combined with -s since the sink-send only applies to the client.
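As an illustration of the timing fix, here is a minimal sketch of a `now_sec()` that refuses to continue when the clock read fails. The choice of `CLOCK_MONOTONIC` and the `mbit_per_sec()` helper are assumptions made for this example; the PR text does not say which clock or which units the benchmark uses.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Current time in seconds. Exits instead of returning a value
 * computed from an uninitialized timespec. */
static double now_sec(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) {
        perror("clock_gettime");
        exit(EXIT_FAILURE);
    }
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Hypothetical helper: throughput in Mbit/s for bytes sent in secs. */
static double mbit_per_sec(size_t bytes, double secs)
{
    assert(secs > 0.0);
    return ((double)bytes * 8.0) / secs / 1e6;
}
```

A caller would take `now_sec()` before and after the send loop and pass the difference to `mbit_per_sec()`.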
cae9827 to 94d0a49
dgarske left a comment
Skoll Code Review
Scan type: review
Overall recommendation: COMMENT
Findings: 5 total — 5 posted, 0 skipped
4 finding(s) posted as inline comments (see file-level comments below)
Posted findings
- [Medium] Explicit-nonce 'placeholder' invariant not guaranteed on ATOMIC_USER AEAD path: src/internal.c:24870-24896
- [Medium] DTLS benchmark client aborts on transient send errors that udp_client tolerates: examples/benchmark/dtls_bench.c:710-722
- [Low] No regression test for seq-as-nonce path or cached socket-type fields: src/internal.c:24885-24888, src/wolfio.c:649-662
- [Low] Help flag -? returns failure exit code: examples/benchmark/dtls_bench.c:266-269, 786-789
- [Low] Large stack buffer and unvalidated numeric option parsing in benchmark: examples/benchmark/dtls_bench.c:140-156, 186-227
Review generated by Skoll
- Retry wolfSSL_write on the same recoverable send errors the plain-UDP baseline already retries on: EAGAIN/EWOULDBLOCK surface as WANT_WRITE and ENOBUFS as SOCKET_ERROR_E with errno preserved. The buffered record is flushed by the retried call without re-encrypting.
- Treat an explicit -? as a help request: print usage to stdout and exit 0, keeping stderr and a failure exit for genuine option errors.
- Enumerate ciphers with wolfSSL_get_cipher_list() instead of an 8 KiB stack buffer, and range-check -p and -b like the other numeric options.
- Document in BuildMessage that the FIPS<2 path overwrites the explicit-nonce placeholder inside BuildMessage itself, and that the one path transmitting the bytes as written (ATOMIC_USER MacEncryptCb) still emits the sequence number that RFC 5288 et al. prescribe.
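The range check on numeric options could be sketched as below. `parse_ranged()` is a hypothetical helper, not a function from dtls_bench.c, and the bounds a caller passes (for example 1 to 65535 for a port) are assumptions for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Parse a decimal option value and accept it only if the whole string
 * is numeric and the value lies in [min, max].
 * Returns 0 on success, -1 otherwise; *out is written only on success. */
static int parse_ranged(const char* s, long min, long max, long* out)
{
    char* end = NULL;
    long  v;

    assert(s != NULL && out != NULL);
    errno = 0;
    v = strtol(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0')
        return -1; /* overflow, empty input, or trailing garbage */
    if (v < min || v > max)
        return -1; /* out of the caller's range */
    *out = v;
    return 0;
}
```

With this shape, an argument such as `70000` or `12ab` for a port option is rejected up front instead of being silently truncated.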