Fix stale connection handling in BaseServer.getConnection() and handleIncomingConnection() by elguardian · Pull Request #1026 · belaban/JGroups

elguardian · 2026-06-15T07:18:50Z

getConnection(): when the locked path finds a connection in the conns map that is not connected (stale), remove and close it before creating a new one.
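A minimal sketch of the getConnection() behaviour described above, not the actual JGroups code. `StaleAwarePool`, `PooledConn` and the string-keyed map are hypothetical stand-ins for BaseServer, its connection type and the conns map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the connection table; not the real BaseServer.
class StaleAwarePool {
    private final Map<String, PooledConn> conns = new HashMap<>();
    private final ReentrantLock lock = new ReentrantLock();

    PooledConn getConnection(String dest) {
        lock.lock();
        try {
            PooledConn conn = conns.get(dest);
            if (conn != null && conn.isConnected())
                return conn;                 // healthy entry: reuse it
            if (conn != null) {              // stale entry: remove and close it first
                conns.remove(dest);
                conn.close();
            }
            PooledConn fresh = new PooledConn();
            conns.put(dest, fresh);
            return fresh;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        StaleAwarePool pool = new StaleAwarePool();
        PooledConn first = pool.getConnection("B");
        first.drop();                        // simulate the peer going away
        PooledConn second = pool.getConnection("B");
        System.out.println("replaced=" + (first != second)
                + ", old closed=" + first.isClosed());
    }
}

// Hypothetical stand-in for a connection object.
class PooledConn {
    private boolean connected = true;
    private boolean closed;
    boolean isConnected() { return connected && !closed; }
    boolean isClosed()    { return closed; }
    void drop()           { connected = false; }
    void close()          { closed = true; connected = false; }
}
```

The point of closing the old entry before inserting the new one is that the map never holds a dead connection that a later lookup could hand out.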

handleIncomingConnection(): when the "bigger address wins" logic would reject an incoming connection, first check if the existing connection is stale (last_access older than sock_conn_timeout). If stale, accept the incoming and replace the dead connection instead of rejecting. This prevents the infinite rejection loop where a node cannot rejoin the cluster because a peer holds a stale connection entry from a previous epoch.
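The decision described above can be sketched as a pure function. This is an illustration under stated assumptions, not the code in the PR: `IncomingDecision`, `Action` and the parameter names are hypothetical, and the "bigger address wins" comparison is reduced to a boolean supplied by the caller.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical decision helper; the names are not taken from JGroups.
class IncomingDecision {
    enum Action { ACCEPT, REJECT, REPLACE_STALE }

    static Action decide(boolean hasExisting, boolean tieBreakFavorsExisting,
                         long lastAccessNanos, long nowNanos, long sockConnTimeoutMs) {
        if (!hasExisting || !tieBreakFavorsExisting)
            return Action.ACCEPT;            // nothing would reject the incoming connection
        long idleNanos = nowNanos - lastAccessNanos;
        if (idleNanos > TimeUnit.MILLISECONDS.toNanos(sockConnTimeoutMs))
            return Action.REPLACE_STALE;     // existing entry is older than the timeout
        return Action.REJECT;                // existing entry is fresh and wins
    }

    public static void main(String[] args) {
        long now = TimeUnit.SECONDS.toNanos(10);
        long timeoutMs = 2000;
        long fresh = now - TimeUnit.MILLISECONDS.toNanos(500);
        long old = now - TimeUnit.MILLISECONDS.toNanos(5000);
        System.out.println(decide(true, true, fresh, now, timeoutMs));  // REJECT
        System.out.println(decide(true, true, old, now, timeoutMs));    // REPLACE_STALE
        System.out.println(decide(false, false, 0, now, timeoutMs));    // ACCEPT
    }
}
```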


belaban · 2026-06-15T09:24:16Z

I'll look at this once we've resolved the conversation in #1024. But a quick glance shows that the fix won't work if connection reaping is disabled, because the timestamp in a connection will not be updated in this case.

Also, if you have a time service enabled whose interval is greater than sock_conn_timeout, the definition of stale will fail.
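One way to read the time service concern is sketched below. It assumes, purely for illustration, that a time service behaves like a cached clock that only advances in steps of its interval; `CoarseClockStaleness` and its methods are hypothetical and do not reflect the actual JGroups TimeService implementation.

```java
// Hypothetical model: timestamps come from a clock that only moves in
// steps of intervalMs, as a cached time service might.
class CoarseClockStaleness {
    static long cachedTime(long realMs, long intervalMs) {
        return (realMs / intervalMs) * intervalMs;   // round down to the last tick
    }

    static boolean looksStale(long lastAccessMs, long nowMs, long timeoutMs) {
        return nowMs - lastAccessMs > timeoutMs;
    }

    public static void main(String[] args) {
        long intervalMs = 2000, timeoutMs = 1000;
        long usedAt = 1990, checkedAt = 2010;        // only 20 ms apart in real time

        boolean exact = looksStale(usedAt, checkedAt, timeoutMs);
        boolean coarse = looksStale(cachedTime(usedAt, intervalMs),
                                    cachedTime(checkedAt, intervalMs), timeoutMs);

        System.out.println("exact clock: stale=" + exact);    // exact clock: stale=false
        System.out.println("coarse clock: stale=" + coarse);  // coarse clock: stale=true
    }
}
```

Under this model a connection used 20 ms ago is classified as stale, because the two cached timestamps fall on different ticks that are a full interval apart.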

elguardian · 2026-06-15T09:40:05Z

@belaban this does not try to solve the incoming/outgoing connection logic problem, since it was stated before that such a fix is far too risky. It tries to get rid of some problems I am seeing in the CI, like https://redhat.atlassian.net/browse/JGRP-3013 (which is the same problem over and over again; there are other tests too). It is more of a reactive fix. We can discuss how to reap the connection when certain things are not configured, but I think this is the safer approach.

This works (I ran it over the weekend with no failures). Normally you hit a failure, without touching the test, in roughly 3-4 hours.

This issue is extremely difficult to reproduce, as it requires a rejected connection replacement on the joining node and an outgoing connection from the coordinator. The joining node then tries to reconnect indefinitely, and the coordinator keeps rejecting because it has a good connection in its pool and the connection conflict resolution keeps working in favor of the coordinator, as its port wins the race. So the joining node will never join the cluster.

Anyway, let me know your thoughts. I have a bunch of data and tests regarding this problem.
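The loop described above can be modeled with a small discrete-time simulation. It is a sketch under simplifying assumptions: the coordinator's entry is last used at t=0 and sees no traffic afterwards, the tie-break always favors the coordinator, and `RejoinLoopSim` with all its parameters is hypothetical.

```java
// Hypothetical simulation of a joiner retrying against a coordinator
// that holds an old entry for it.
class RejoinLoopSim {
    /** Returns the 1-based attempt that is accepted, or -1 if none is. */
    static int firstAcceptedAttempt(boolean staleCheck, long timeoutMs,
                                    long retryMs, int maxAttempts) {
        long lastAccessMs = 0;                       // coordinator's entry, never refreshed
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            long nowMs = attempt * retryMs;
            boolean stale = nowMs - lastAccessMs > timeoutMs;
            if (staleCheck && stale)
                return attempt;                      // stale entry replaced, joiner accepted
            // otherwise the existing entry wins the tie-break and the attempt is rejected
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("without stale check: " + firstAcceptedAttempt(false, 2000, 500, 100));
        System.out.println("with stale check: " + firstAcceptedAttempt(true, 2000, 500, 100));
    }
}
```

With a 2000 ms timeout and a retry every 500 ms, the model accepts the fifth attempt (at 2500 ms) when the stale check is on and never accepts without it.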

belaban · 2026-06-15T14:22:31Z

Your second-to-last paragraph: can we write a reproducer? E.g. we can inject addresses which always make the joiner's address lower than the coordinator's, so that the coordinator always wins.

If you provide a more concise description of how this can be reproduced, I'll take a look at creating a reproducer. E.g. how can a connection in the client result in rejection etc.

I'm extremely hesitant to change code which works well, only to support a feature which tries to derive a perfect failure detector from connection management (which can be unreliable)...

belaban · 2026-06-15T15:00:22Z

In your opinion, would the following reproduce the issue?

A has port 1000 (local port to connect to B is 1500), B 2000 (local port to connect to A is 2500)
B sends a message to A, therefore has a connection B -> A
A clears its connection table
A sends a message to B, connects at 1500; this should be rejected by B

I'm writing a reproducer in ServerTests, trying to see if this fails.

belaban · 2026-06-16T06:34:07Z

I can actually reproduce this but only if I remove a connection on the client side without closing it. I don't see how this could happen, as connections are always closed when removed...
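The four-step scenario from the previous comment, together with the observation above, can be modeled in memory. This is not the ServerTests reproducer; `ReproducerModel`, `ModelNode` and `Link` are hypothetical, the tie-break is assumed to be "the node with the bigger port keeps its existing open connection", and a close is assumed to be visible to the peer immediately.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of the A/B scenario; no real sockets involved.
class ReproducerModel {
    static boolean run(boolean closeOnClear) {
        ModelNode a = new ModelNode("A", 1000);
        ModelNode b = new ModelNode("B", 2000);
        b.connectTo(a);                 // B sends a message to A: connection B -> A
        a.clearTable(closeOnClear);     // A clears its connection table
        return a.connectTo(b);          // A sends a message to B
    }

    public static void main(String[] args) {
        System.out.println("closed on clear: accepted=" + run(true));
        System.out.println("removed without close: accepted=" + run(false));
    }
}

class ModelNode {
    final String name;
    final int port;
    final Map<String, Link> table = new HashMap<>();

    ModelNode(String name, int port) { this.name = name; this.port = port; }

    boolean connectTo(ModelNode peer) {
        Link link = new Link();
        if (!peer.handleIncoming(this, link)) {
            link.open = false;          // rejected by the peer
            return false;
        }
        table.put(peer.name, link);
        return true;
    }

    boolean handleIncoming(ModelNode from, Link incoming) {
        Link existing = table.get(from.name);
        if (existing != null && existing.open && port > from.port)
            return false;               // assumed tie-break: bigger port keeps its entry
        if (existing != null)
            existing.open = false;      // replace the old entry
        table.put(from.name, incoming);
        return true;
    }

    void clearTable(boolean close) {
        if (close)
            table.values().forEach(l -> l.open = false);
        table.clear();
    }
}

// One shared object per connection, so a close is seen by both ends.
class Link {
    boolean open = true;
}
```

In this model A's new connection is rejected only when A drops its table entry without closing the link, which matches the condition under which the issue could be reproduced.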

elguardian · 2026-06-16T07:48:38Z

@belaban you can check the logs in the Jira I posted. The "close gracefully" that shows up tons of times is, I think, signaling that, and that is the reason why the cluster sometimes times out no matter what you do. I actually have a PR that consistently reproduces the issue in the CI:
https://ci.wildfly.org/buildConfiguration/WF_PullRequest_BootableJarLinuxJdk25/565911
wildfly/wildfly#19428
If you retest this, it fails almost 95% of the time. The test is a simple restart of the node.

belaban · 2026-06-16T09:28:00Z

I don't have a TeamCity login.

elguardian mentioned this pull request Jun 15, 2026

[TEST] EnableSuspectEventsDeterministicEventTest #1024

Open

elguardian wants to merge 1 commit into belaban:master from elguardian:JGRP-R1