Skip to content

Release 1.3.3#498

Merged
StefanBratanov merged 27 commits into
masterfrom
1.3.3
Jun 23, 2026
Merged

Release 1.3.3#498
StefanBratanov merged 27 commits into
masterfrom
1.3.3

Conversation

@github-actions

Copy link
Copy Markdown

Please review, approve, and Regular Merge (do not squash) this PR to trigger the Cloudsmith deployment.

lucassaldanha and others added 27 commits May 21, 2026 19:16
…emeral cert subject key

Before this change, `SecureChannel.Session.remotePubKey` returned by the TLS
and QUIC transports was the ephemeral X.509 cert subject key (created per
handshake), not the libp2p host public key embedded in the libp2p cert
extension (OID 1.3.6.1.4.1.53594.1.1). This broke the invariant
`PeerId.fromPubKey(remotePubKey) == remoteId` that SecIO, Noise, and
Plaintext uphold and that downstream consumers rely on (e.g. signed-record
verification).

Changes:

* TLSSecureChannel.kt: introduce `TlsPeerIdentity(peerId, pubKey)` and
  `verifyAndExtractIdentity(chain)` which derives both fields from the same
  libp2p host key unmarshalled out of the cert extension. The legacy
  `verifyAndExtractPeerId` now delegates so existing call sites
  (CertificatesTest, Libp2pTrustManager) continue to work unchanged. The
  session-construction site in `buildTlsHandler` is updated to use the new
  helper, so `remotePubKey` is now the libp2p key, not the ephemeral
  subject key.

* QuicTransport.kt (server inbound): replace the separate
  `verifyAndExtractPeerId` + `getPublicKeyFromCert` pair with a single
  `verifyAndExtractIdentity` call. Both `remoteId` and `remotePubKey`
  now come from the same libp2p host key.

* QuicTransport.kt (client dial): delete the `Multihash.Digest.Identity`
  special case entirely. The libp2p cert extension is the single source of
  truth for the remote libp2p public key regardless of how the peer id is
  encoded in the multiaddr (identity-digest vs sha-256). Add a sanity
  check that the libp2p key extracted from the cert hashes to the
  dial-target peer id — defence-in-depth on top of `Libp2pTrustManager`.
  Remove now-unused imports (`Multihash`, `unmarshalPublicKey`,
  `getPublicKeyFromCert`).

Tests:

* TlsSecureChannelTest now extends `CipherSecureChannelTest` (matching
  the SecIO/Noise pattern), which contributes `verify secure session`
  asserting `remotePubKey == pubKey<remote>` on both sides. The shared
  malformed-cipher-bytes test is marked `open` in the base and overridden
  with `@Disabled` for TLS because OpenSSL surfaces those errors through
  a different path that is out of scope here.

* New `QuicRemotePubKeyIdentityTest` exercises a real QUIC handshake
  between two hosts and asserts
  `PeerId.fromPubKey(secureSession.remotePubKey) == secureSession.remoteId`
  on both the client and server connection.
Teku's Goodbye flow disconnects the peer (closing the QUIC connection and
all of its streams) and then half-closes the RPC stream it was responding
on. Closing a QUIC stream already sends the FIN, so the subsequent
shutdownOutput() failed with "ChannelOutputShutdownException: Fin was sent
already", surfacing in Teku as "PLEASE FIX OR REPORT" errors. The
muxer-based transports tolerate this ordering silently (a MuxChannel
disconnect after close is a no-op).

Make QUIC behave the same: closeWrite() is now idempotent and completes
successfully when the stream is already closed or its output is already
shut down, including when that state is only reached concurrently.

Add regression tests for closeWrite after connection close (the Teku
Goodbye scenario) and double closeWrite.
When a remote peer resets a stream (common during disconnects and
cancelled RPCs), Netty fires QuicStreamResetException into the
application pipeline and leaves the stream channel open, so applications
saw an unhandled exception (Teku logged "Unhandled error while processes
req/response") and the stream leaked unless they closed it themselves.

The muxer-based transports close the child channel quietly on a remote
RST (AbstractMuxHandler.onRemoteClose). Match that behavior: intercept
QuicStreamResetException in QuicStreamReadCloseEventConverter and close
the stream, so applications get the regular stream-closed flow on all
transports. Other exceptions still propagate.
exceptionCaught logged OutOfMemoryError (and other unexpected
throwables) at ERROR but left the channel open and did not rethrow,
masking the failure while the event loop kept running. Close the
channel on any unexpected throwable, and additionally rethrow fatal
JVM Errors so they are surfaced rather than downgraded to a log line.
Each outbound QUIC dial binds its own ephemeral NioDatagramChannel to
carry the connection. Closing a QuicChannel does not close its parent
datagram channel in Netty, so every dial leaked a NioDatagramChannel
(plus its codec, native quiche config and queued direct buffers) for the
lifetime of the transport. Under connection churn this manifested as a
linear off-heap/direct-memory leak (~3 GiB/day observed on a soak node).

- Tie the datagram channel lifecycle to the QuicChannel: close it on
  QuicChannel close and on connect failure.
- Close the established QuicChannel when post-handshake setup fails, so
  a failed dial cannot strand the connection (and its datagram channel).

Adds regression tests asserting the datagram channel closes with the
connection and that a post-handshake failure does not leak a connection.
NoiseXXCodec.exceptionCaught detected a fatal JVM Error nested in the
causal chain via hasCauseOfType(Error::class) but then did throw cause,
rethrowing whatever Netty handed in. On the encode() path Netty wraps
throwables in EncoderException, so an OutOfMemoryError was rethrown as a
RuntimeException wrapper, downgrading the fatal JVM signal.

Add findCauseOfType to extract the actual Error from the causal chain and
rethrow that. Covered by a test feeding EncoderException(OutOfMemoryError).
NetworkImpl.connect() races dials across addresses and cancels the losing
dial futures. QUIC dial() returns the dependent thenApply future; cancelling
it does not propagate upstream to the QUIC connect, so when the handshake
later completes the thenApply body is skipped. The established QuicChannel
was therefore never registered, handed to the caller, nor closed, leaking it
(and its underlying datagram channel) until the idle timeout.

Mirror the TCP transport: register the channel as soon as the connection is
established (before thenApply) so it is tracked and closeable, and close it
explicitly when the returned future is cancelled, cascading to the ephemeral
datagram channel via the existing close listener.
# Conflicts:
#	libp2p/src/test/java/io/libp2p/transport/quic/QuicServerTestJava.java
fix(quic): close leaked dial datagram channels; harden NoiseXXCodec OOM handling
# Conflicts:
#	libp2p/src/main/kotlin/io/libp2p/transport/quic/QuicTransport.kt
…emeral cert subject key

Before this change, `SecureChannel.Session.remotePubKey` returned by the TLS
and QUIC transports was the ephemeral X.509 cert subject key (created per
handshake), not the libp2p host public key embedded in the libp2p cert
extension (OID 1.3.6.1.4.1.53594.1.1). This broke the invariant
`PeerId.fromPubKey(remotePubKey) == remoteId` that SecIO, Noise, and
Plaintext uphold and that downstream consumers rely on (e.g. signed-record
verification).

Changes:

* TLSSecureChannel.kt: introduce `TlsPeerIdentity(peerId, pubKey)` and
  `verifyAndExtractIdentity(chain)` which derives both fields from the same
  libp2p host key unmarshalled out of the cert extension. The legacy
  `verifyAndExtractPeerId` now delegates so existing call sites
  (CertificatesTest, Libp2pTrustManager) continue to work unchanged. The
  session-construction site in `buildTlsHandler` is updated to use the new
  helper, so `remotePubKey` is now the libp2p key, not the ephemeral
  subject key.

* QuicTransport.kt (server inbound): replace the separate
  `verifyAndExtractPeerId` + `getPublicKeyFromCert` pair with a single
  `verifyAndExtractIdentity` call. Both `remoteId` and `remotePubKey`
  now come from the same libp2p host key.

* QuicTransport.kt (client dial): delete the `Multihash.Digest.Identity`
  special case entirely. The libp2p cert extension is the single source of
  truth for the remote libp2p public key regardless of how the peer id is
  encoded in the multiaddr (identity-digest vs sha-256). Add a sanity
  check that the libp2p key extracted from the cert hashes to the
  dial-target peer id — defence-in-depth on top of `Libp2pTrustManager`.
  Remove now-unused imports (`Multihash`, `unmarshalPublicKey`,
  `getPublicKeyFromCert`).

Tests:

* TlsSecureChannelTest now extends `CipherSecureChannelTest` (matching
  the SecIO/Noise pattern), which contributes `verify secure session`
  asserting `remotePubKey == pubKey<remote>` on both sides. The shared
  malformed-cipher-bytes test is marked `open` in the base and overridden
  with `@Disabled` for TLS because OpenSSL surfaces those errors through
  a different path that is out of scope here.

* New `QuicRemotePubKeyIdentityTest` exercises a real QUIC handshake
  between two hosts and asserts
  `PeerId.fromPubKey(secureSession.remotePubKey) == secureSession.remoteId`
  on both the client and server connection.
ConnectionOverNetty resolved local/remote addresses from the underlying
Netty channel on every call. For QUIC the native channel state is freed
on close, after which remoteSocketAddress() returns null and the !! in
QuicTransport.remoteAddress() throws an NPE. This surfaced when a
disconnect handler read the remote address during idle-timeout teardown.

Cache each address on first successful read while the channel is live so
teardown-time callers get a stable value. Add regression tests covering
both addresses surviving a subsequently-freed channel.
Address two review findings on the QUIC transport:

- ConnectionOverNetty only cached addresses after a successful first read,
  so a teardown handler that was the first caller after the native QUIC
  channel was freed could still hit the remoteSocketAddress()!! NPE. Add
  cacheAddresses() and warm it while the channel is live at all three QUIC
  connection-establishment points (dial, server-inbound, hole-punch) before
  the connection is exposed.

- The hole-punch thenApply registered the QuicChannel but, unlike the normal
  dial path, had no close-on-failure cleanup. A peer-id mismatch or missing
  certs left the channel registered and open while the future failed. Wrap
  the setup in the same close-on-failure pattern as the dial path.
fix(tls,quic): expose libp2p host pubkey as remotePubKey, not the ephemeral cert subject key 2/2
…ection

On the hole-punch inbound path, channelActive set the secure/muxer sessions
and completed the pending future with the raw QuicChannel before the dial
target peer id was checked (that check only ran later in dialAsListener's
thenApply). Because the server codec installs an InboundStreamHandler, a peer
that reached the expected UDP tuple but presented a valid cert for the wrong
libp2p identity could have inbound streams dispatched to protocol handlers in
the window before the later check closed the channel.

Move the dial-target validation into channelActive, before the connection is
exposed:

- Carry the expected peer id with each pending hole punch (PendingHolePunch).
- Extract holePunchIdentityMatches(expected, actual) and unit-test it.
- On mismatch, close the channel and fail the future before setting the secure
  session or completing it; channelActive and stream-activation events run
  serially on the channel event loop, so closing here closes the window.

The existing dialAsListener thenApply check is retained as defence-in-depth.
…eams

When a remote peer sends a STOP_SENDING frame (it no longer wants to read from
the stream) while the local side has queued writes, Netty fails those writes
with ChannelOutputShutdownException("STOP_SENDING frame received") and surfaces
it into the stream channel's pipeline. QuicStreamReadCloseEventConverter only
special-cased QuicStreamResetException, so this exception was forwarded to
application handlers and logged as an unhandled error.

STOP_SENDING is a benign half-close: only our write side is affected, the read
side may still deliver data. Handle ChannelOutputShutdownException in the
converter's exceptionCaught by swallowing it quietly without closing the stream
(matching Netty's own "should not close the channel" handling of STOP_SENDING),
mirroring the existing remote-reset handling.

Adds deterministic EmbeddedChannel unit tests for the converter covering the
STOP_SENDING swallow, the remote-reset quiet close, and propagation of
unrelated exceptions.
* allocations-reduction

* additional improvements

* additional comments
Drives a real QUIC handshake where the server stops reading (shutdownInput,
which makes quiche send STOP_SENDING) while the client writes to the stream.
The client write uses a void promise so the resulting ChannelOutputShutdownException
is routed through the stream pipeline's exceptionCaught — the same path that
surfaced it to application handlers in the wild. A capturer inserted immediately
after QuicStreamReadCloseEventConverter asserts the converter swallows the
exception (nothing reaches application handlers) and the stream stays open.

Verified the test fails without the converter fix (the exception surfaces to
the capturer) and passes with it.
dialAsListener accepted a target multiaddr with no /p2p peer id and
holePunchIdentityMatches(null, ...) accepted any identity, so any peer
reaching the expected UDP tuple could complete the hole punch as the
dialed peer. Require a non-null target peer id (fail fast otherwise) and
make the expected id non-null end to end.

The exception converter swallowed every ChannelOutputShutdownException;
netty's quic codec also raises it as "Fin was sent already" for a local
write-after-FIN, which would be silently dropped. Swallow only the
"STOP_SENDING frame received" case and propagate other causes.
Extract the inbound handshake routing from the channelActive handler into
QuicTransport.routeInboundHandshake so the hole-punch identity guard is
testable without a live QUIC channel. A full network e2e is impractical:
dial() binds an ephemeral source port and dialAsListener only waits, so a
mismatched peer cannot be forced to connect from the registered target
tuple. Add tests asserting a wrong-identity hole punch closes the channel
and fails the caller future without ever preparing or exposing the
connection, plus the matching and normal-inbound paths.
fix(quic): validate hole-punch peer identity early and handle STOP_SENDING

@StefanBratanov StefanBratanov left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@StefanBratanov StefanBratanov merged commit d555449 into master Jun 23, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants