Skip to content

Fix stale connection handling in BaseServer.getConnection() and handleIncomingConnection()#1026

Open
elguardian wants to merge 1 commit into
belaban:masterfrom
elguardian:JGRP-R1
Open

Fix stale connection handling in BaseServer.getConnection() and handleIncomingConnection()#1026
elguardian wants to merge 1 commit into
belaban:masterfrom
elguardian:JGRP-R1

Conversation

@elguardian

Copy link
Copy Markdown
Contributor

getConnection(): when the locked path finds a connection in the conns map that is not connected (stale), remove and close it before creating a new one.

handleIncomingConnection(): when the "bigger address wins" logic would reject an incoming connection, first check if the existing connection is stale (last_access older than sock_conn_timeout). If stale, accept the incoming and replace the dead connection instead of rejecting. This prevents the infinite rejection loop where a node cannot rejoin the cluster because a peer holds a stale connection entry from a previous epoch.

…eIncomingConnection()

getConnection(): when the locked path finds a connection in the conns map that is
not connected (stale), remove and close it before creating a new one.

handleIncomingConnection(): when the "bigger address wins" logic would reject an
incoming connection, first check if the existing connection is stale (last_access
older than sock_conn_timeout). If stale, accept the incoming and replace the dead
connection instead of rejecting. This prevents the infinite rejection loop where
a node cannot rejoin the cluster because a peer holds a stale connection entry
from a previous epoch.
@belaban

belaban commented Jun 15, 2026

Copy link
Copy Markdown
Owner

I'll look at this once we've resolved the conversation in #1024. But a quick glance shows that the fix won't work if connection reaping is disabled, because the timestamp in a connection will not be updated in this case.

Also, if you have a time service enabled, whose interval is greater than sock_conn_timeout, the definition of stable will fail.

@elguardian

Copy link
Copy Markdown
Contributor Author

@belaban it does not try to solve the problem incoming / outoing conection logic as it was stated before that is a far too risky fix. This tries to get rid of some problems I am seeinng in the CI like this https://redhat.atlassian.net/browse/JGRP-3013 (which is the same problem over and over again- There are other test too)... it is more like a reacting thing. We can discuss how to reap the connection when certains things are not configured but I think this is safer approach.

This works (I did run during the weekend with no failures)... usually you reach a failure without touching the test in 3-4 hours more or less.

This issue is extremely difficutl to reproduce as it requires a connection rejected replacement in the node joining and an outoing conection from the coordinator.. so the joining node tries to reconnect indefenetly and the coordinator keeps rejecting because it has a good connection in its pool and the conflict connection resolution keeps working in favor of the coordiantor as its port win the race... so the joining node will never joing the cluster.

anyway let me know your thoughts.... I have a bunch of data and test regarding this problem.

@belaban

belaban commented Jun 15, 2026

Copy link
Copy Markdown
Owner

Your second but last paragraph: can we write a reproducer? E.g. we can inject addresses which always makes the joiner's address lower than the coordinator's, so that the coordinator always wins.

If you provide a more concise description of how this can be reproduced, I'll take a look at creating a reproducer. E.g. how can a connection in the client result in rejection etc.

I'm extremely hesitant to change code which works well, only to support a feature which tries to derive a perfect failure detector from connection management (which can be unreliable)...

@belaban

belaban commented Jun 15, 2026

Copy link
Copy Markdown
Owner

In your opinion, would the following reproduce the issue?

  • A has port 1000 (local port to connect to B is 1500), B 2000 (local port to connect to A is 2500)
  • B sends a message to A, therefore has a connection B -> A
  • A clears its connection table
  • A sends a message to B, connects at 1500; this should be rejected by B

I'm writing a reproducer in ServerTests, trying to see if this fails.

@belaban

belaban commented Jun 16, 2026

Copy link
Copy Markdown
Owner

I can actually reproduce this but only if I remove a connection on the client side without closing it. I don't see how this could happen, as connections are always closed when removed...

@elguardian

Copy link
Copy Markdown
Contributor Author

@belaban you can check the logs in the jira I did post. the close gracefully during tons of times I think is signaling that. and that is the reason why sometimes the cluster times out no matter what you do. I have actually have a PR that consistently reproduce the issue in the CI.
https://ci.wildfly.org/buildConfiguration/WF_PullRequest_BootableJarLinuxJdk25/565911
wildfly/wildfly#19428
If you retest this almost 95 % of the time fails. The test is simple restart ot the node.

@belaban

belaban commented Jun 16, 2026

Copy link
Copy Markdown
Owner

I don't have a team city login

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants