Release 1.3.3#498
Merged
Merged
Conversation
…emeral cert subject key Before this change, `SecureChannel.Session.remotePubKey` returned by the TLS and QUIC transports was the ephemeral X.509 cert subject key (created per handshake), not the libp2p host public key embedded in the libp2p cert extension (OID 1.3.6.1.4.1.53594.1.1). This broke the invariant `PeerId.fromPubKey(remotePubKey) == remoteId` that SecIO, Noise, and Plaintext uphold and that downstream consumers rely on (e.g. signed-record verification). Changes: * TLSSecureChannel.kt: introduce `TlsPeerIdentity(peerId, pubKey)` and `verifyAndExtractIdentity(chain)` which derives both fields from the same libp2p host key unmarshalled out of the cert extension. The legacy `verifyAndExtractPeerId` now delegates so existing call sites (CertificatesTest, Libp2pTrustManager) continue to work unchanged. The session-construction site in `buildTlsHandler` is updated to use the new helper, so `remotePubKey` is now the libp2p key, not the ephemeral subject key. * QuicTransport.kt (server inbound): replace the separate `verifyAndExtractPeerId` + `getPublicKeyFromCert` pair with a single `verifyAndExtractIdentity` call. Both `remoteId` and `remotePubKey` now come from the same libp2p host key. * QuicTransport.kt (client dial): delete the `Multihash.Digest.Identity` special case entirely. The libp2p cert extension is the single source of truth for the remote libp2p public key regardless of how the peer id is encoded in the multiaddr (identity-digest vs sha-256). Add a sanity check that the libp2p key extracted from the cert hashes to the dial-target peer id — defence-in-depth on top of `Libp2pTrustManager`. Remove now-unused imports (`Multihash`, `unmarshalPublicKey`, `getPublicKeyFromCert`). Tests: * TlsSecureChannelTest now extends `CipherSecureChannelTest` (matching the SecIO/Noise pattern), which contributes `verify secure session` asserting `remotePubKey == pubKey<remote>` on both sides. The shared malformed-cipher-bytes test is marked `open` in the base and overridden with `@Disabled` for TLS because OpenSSL surfaces those errors through a different path that is out of scope here. * New `QuicRemotePubKeyIdentityTest` exercises a real QUIC handshake between two hosts and asserts `PeerId.fromPubKey(secureSession.remotePubKey) == secureSession.remoteId` on both the client and server connection.
Teku's Goodbye flow disconnects the peer (closing the QUIC connection and all of its streams) and then half-closes the RPC stream it was responding on. Closing a QUIC stream already sends the FIN, so the subsequent shutdownOutput() failed with "ChannelOutputShutdownException: Fin was sent already", surfacing in Teku as "PLEASE FIX OR REPORT" errors. The muxer-based transports tolerate this ordering silently (a MuxChannel disconnect after close is a no-op). Make QUIC behave the same: closeWrite() is now idempotent and completes successfully when the stream is already closed or its output is already shut down, including when that state is only reached concurrently. Add regression tests for closeWrite after connection close (the Teku Goodbye scenario) and double closeWrite.
When a remote peer resets a stream (common during disconnects and cancelled RPCs), Netty fires QuicStreamResetException into the application pipeline and leaves the stream channel open, so applications saw an unhandled exception (Teku logged "Unhandled error while processes req/response") and the stream leaked unless they closed it themselves. The muxer-based transports close the child channel quietly on a remote RST (AbstractMuxHandler.onRemoteClose). Match that behavior: intercept QuicStreamResetException in QuicStreamReadCloseEventConverter and close the stream, so applications get the regular stream-closed flow on all transports. Other exceptions still propagate.
exceptionCaught logged OutOfMemoryError (and other unexpected throwables) at ERROR but left the channel open and did not rethrow, masking the failure while the event loop kept running. Close the channel on any unexpected throwable, and additionally rethrow fatal JVM Errors so they are surfaced rather than downgraded to a log line.
Each outbound QUIC dial binds its own ephemeral NioDatagramChannel to carry the connection. Closing a QuicChannel does not close its parent datagram channel in Netty, so every dial leaked a NioDatagramChannel (plus its codec, native quiche config and queued direct buffers) for the lifetime of the transport. Under connection churn this manifested as a linear off-heap/direct-memory leak (~3 GiB/day observed on a soak node). - Tie the datagram channel lifecycle to the QuicChannel: close it on QuicChannel close and on connect failure. - Close the established QuicChannel when post-handshake setup fails, so a failed dial cannot strand the connection (and its datagram channel). Adds regression tests asserting the datagram channel closes with the connection and that a post-handshake failure does not leak a connection.
NoiseXXCodec.exceptionCaught detected a fatal JVM Error nested in the causal chain via hasCauseOfType(Error::class) but then did throw cause, rethrowing whatever Netty handed in. On the encode() path Netty wraps throwables in EncoderException, so an OutOfMemoryError was rethrown as a RuntimeException wrapper, downgrading the fatal JVM signal. Add findCauseOfType to extract the actual Error from the causal chain and rethrow that. Covered by a test feeding EncoderException(OutOfMemoryError).
NetworkImpl.connect() races dials across addresses and cancels the losing dial futures. QUIC dial() returns the dependent thenApply future; cancelling it does not propagate upstream to the QUIC connect, so when the handshake later completes the thenApply body is skipped. The established QuicChannel was therefore never registered, handed to the caller, nor closed, leaking it (and its underlying datagram channel) until the idle timeout. Mirror the TCP transport: register the channel as soon as the connection is established (before thenApply) so it is tracked and closeable, and close it explicitly when the returned future is cancelled, cascading to the ephemeral datagram channel via the existing close listener.
Fix quic close write logic
# Conflicts: # libp2p/src/test/java/io/libp2p/transport/quic/QuicServerTestJava.java
fix(quic): close leaked dial datagram channels; harden NoiseXXCodec OOM handling
# Conflicts: # libp2p/src/main/kotlin/io/libp2p/transport/quic/QuicTransport.kt
…emeral cert subject key Before this change, `SecureChannel.Session.remotePubKey` returned by the TLS and QUIC transports was the ephemeral X.509 cert subject key (created per handshake), not the libp2p host public key embedded in the libp2p cert extension (OID 1.3.6.1.4.1.53594.1.1). This broke the invariant `PeerId.fromPubKey(remotePubKey) == remoteId` that SecIO, Noise, and Plaintext uphold and that downstream consumers rely on (e.g. signed-record verification). Changes: * TLSSecureChannel.kt: introduce `TlsPeerIdentity(peerId, pubKey)` and `verifyAndExtractIdentity(chain)` which derives both fields from the same libp2p host key unmarshalled out of the cert extension. The legacy `verifyAndExtractPeerId` now delegates so existing call sites (CertificatesTest, Libp2pTrustManager) continue to work unchanged. The session-construction site in `buildTlsHandler` is updated to use the new helper, so `remotePubKey` is now the libp2p key, not the ephemeral subject key. * QuicTransport.kt (server inbound): replace the separate `verifyAndExtractPeerId` + `getPublicKeyFromCert` pair with a single `verifyAndExtractIdentity` call. Both `remoteId` and `remotePubKey` now come from the same libp2p host key. * QuicTransport.kt (client dial): delete the `Multihash.Digest.Identity` special case entirely. The libp2p cert extension is the single source of truth for the remote libp2p public key regardless of how the peer id is encoded in the multiaddr (identity-digest vs sha-256). Add a sanity check that the libp2p key extracted from the cert hashes to the dial-target peer id — defence-in-depth on top of `Libp2pTrustManager`. Remove now-unused imports (`Multihash`, `unmarshalPublicKey`, `getPublicKeyFromCert`). Tests: * TlsSecureChannelTest now extends `CipherSecureChannelTest` (matching the SecIO/Noise pattern), which contributes `verify secure session` asserting `remotePubKey == pubKey<remote>` on both sides. The shared malformed-cipher-bytes test is marked `open` in the base and overridden with `@Disabled` for TLS because OpenSSL surfaces those errors through a different path that is out of scope here. * New `QuicRemotePubKeyIdentityTest` exercises a real QUIC handshake between two hosts and asserts `PeerId.fromPubKey(secureSession.remotePubKey) == secureSession.remoteId` on both the client and server connection.
ConnectionOverNetty resolved local/remote addresses from the underlying Netty channel on every call. For QUIC the native channel state is freed on close, after which remoteSocketAddress() returns null and the !! in QuicTransport.remoteAddress() throws an NPE. This surfaced when a disconnect handler read the remote address during idle-timeout teardown. Cache each address on first successful read while the channel is live so teardown-time callers get a stable value. Add regression tests covering both addresses surviving a subsequently-freed channel.
Address two review findings on the QUIC transport: - ConnectionOverNetty only cached addresses after a successful first read, so a teardown handler that was the first caller after the native QUIC channel was freed could still hit the remoteSocketAddress()!! NPE. Add cacheAddresses() and warm it while the channel is live at all three QUIC connection-establishment points (dial, server-inbound, hole-punch) before the connection is exposed. - The hole-punch thenApply registered the QuicChannel but, unlike the normal dial path, had no close-on-failure cleanup. A peer-id mismatch or missing certs left the channel registered and open while the future failed. Wrap the setup in the same close-on-failure pattern as the dial path.
fix(tls,quic): expose libp2p host pubkey as remotePubKey, not the ephemeral cert subject key 2/2
…ection On the hole-punch inbound path, channelActive set the secure/muxer sessions and completed the pending future with the raw QuicChannel before the dial target peer id was checked (that check only ran later in dialAsListener's thenApply). Because the server codec installs an InboundStreamHandler, a peer that reached the expected UDP tuple but presented a valid cert for the wrong libp2p identity could have inbound streams dispatched to protocol handlers in the window before the later check closed the channel. Move the dial-target validation into channelActive, before the connection is exposed: - Carry the expected peer id with each pending hole punch (PendingHolePunch). - Extract holePunchIdentityMatches(expected, actual) and unit-test it. - On mismatch, close the channel and fail the future before setting the secure session or completing it; channelActive and stream-activation events run serially on the channel event loop, so closing here closes the window. The existing dialAsListener thenApply check is retained as defence-in-depth.
…eams
When a remote peer sends a STOP_SENDING frame (it no longer wants to read from
the stream) while the local side has queued writes, Netty fails those writes
with ChannelOutputShutdownException("STOP_SENDING frame received") and surfaces
it into the stream channel's pipeline. QuicStreamReadCloseEventConverter only
special-cased QuicStreamResetException, so this exception was forwarded to
application handlers and logged as an unhandled error.
STOP_SENDING is a benign half-close: only our write side is affected, the read
side may still deliver data. Handle ChannelOutputShutdownException in the
converter's exceptionCaught by swallowing it quietly without closing the stream
(matching Netty's own "should not close the channel" handling of STOP_SENDING),
mirroring the existing remote-reset handling.
Adds deterministic EmbeddedChannel unit tests for the converter covering the
STOP_SENDING swallow, the remote-reset quiet close, and propagation of
unrelated exceptions.
* allocations-reduction * additional improvements * additional comments
Drives a real QUIC handshake where the server stops reading (shutdownInput, which makes quiche send STOP_SENDING) while the client writes to the stream. The client write uses a void promise so the resulting ChannelOutputShutdownException is routed through the stream pipeline's exceptionCaught — the same path that surfaced it to application handlers in the wild. A capturer inserted immediately after QuicStreamReadCloseEventConverter asserts the converter swallows the exception (nothing reaches application handlers) and the stream stays open. Verified the test fails without the converter fix (the exception surfaces to the capturer) and passes with it.
dialAsListener accepted a target multiaddr with no /p2p peer id and holePunchIdentityMatches(null, ...) accepted any identity, so any peer reaching the expected UDP tuple could complete the hole punch as the dialed peer. Require a non-null target peer id (fail fast otherwise) and make the expected id non-null end to end. The exception converter swallowed every ChannelOutputShutdownException; netty's quic codec also raises it as "Fin was sent already" for a local write-after-FIN, which would be silently dropped. Swallow only the "STOP_SENDING frame received" case and propagate other causes.
Extract the inbound handshake routing from the channelActive handler into QuicTransport.routeInboundHandshake so the hole-punch identity guard is testable without a live QUIC channel. A full network e2e is impractical: dial() binds an ephemeral source port and dialAsListener only waits, so a mismatched peer cannot be forced to connect from the registered target tuple. Add tests asserting a wrong-identity hole punch closes the channel and fails the caller future without ever preparing or exposing the connection, plus the matching and normal-inbound paths.
fix(quic): validate hole-punch peer identity early and handle STOP_SENDING
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Please review, approve, and Regular Merge (do not squash) this PR to trigger the Cloudsmith deployment.