feat(PubSub): Streaming pull keepalive#15649
Conversation
cf97acc to
8332d19
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces a protocol-compliant keep-alive ping/pong mechanism for the streaming pull client, including a timeout-based stream restart when the server becomes unresponsive. It also refactors stream disposal and updates the test suite to validate these behaviors. The review feedback highlights critical issues with the implementation: first, using a 15-second timeout for MoveNext will cause constant stream restarts during idle periods, so the timeout should be increased to 45 seconds (s_streamPingPeriod + s_streamPongPeriod); second, disposing the CancellationTokenSource while MoveNext is still active can lead to an ObjectDisposedException or race conditions; and third, the StreamTimeoutRestarts test must be updated to wait longer than the new 45-second timeout to pass successfully.
amanda-tarafa
left a comment
There was a problem hiding this comment.
Yep, generally good. Thanks.
8332d19 to
042ee2a
Compare
042ee2a to
52bb6f5
Compare
|
The breaking change check is a false positive, already being tracked in b/520002639. I've added the |
amanda-tarafa
left a comment
There was a problem hiding this comment.
Looking good, but I have a few comments.
Also, I think it'd be good to encapsulate all the ping/pong checks and times etc. on the _retryState object, so that we can use it the same or similar to how we use it for exceptions.
52bb6f5 to
74d3477
Compare
|
I've added refactored things a bit differently than moving things onto the |
amanda-tarafa
left a comment
There was a problem hiding this comment.
Happy to chat about these.
74d3477 to
baa98a3
Compare
amanda-tarafa
left a comment
There was a problem hiding this comment.
A little bit of pseudocode, but see if that seems like a good balance between encapsulation and knowing what's happening just by looking at the handlers.
77d6c51 to
abf1ebc
Compare
amanda-tarafa
left a comment
There was a problem hiding this comment.
Looking great. Still a few changes but nothing major now.
| } | ||
| } | ||
|
|
||
| // Message-stream has messages (or not, depending on moveNextResult) | ||
| private void HandlePullMessageData(Task<bool> moveNextTask) | ||
| { | ||
| if (!moveNextTask.IsCompleted) |
There was a problem hiding this comment.
We need to record the failure in _retryState as that's what increments the backoff for restarting the pull. That's why I had the idea to include what you've put in KeepAliveMonitor in RetryState instead because the lack of pong is a effectively a retriable failure.
I think we can still merge those two, and maybe call it RestartMonitor or something. But we can do that in a follow up PR, where we first clean up RetryState and then merge it with KeepAliveMonitor into RestartMonitor.
There was a problem hiding this comment.
Sounds like a plan will resolve in a follow-up.
There was a problem hiding this comment.
We need to record the failure in _retryState as that's what increments the backoff for restarting the pull.
This part we need to solve on this PR, otherwise we'll be retrying sooner than with the backoff and that might make the problem worse.
What we can do in a follow up is all the refactoring.
| _pongReceived = false; | ||
| } | ||
|
|
||
| internal void RecordPong() => _pongReceived = true; |
There was a problem hiding this comment.
You still want to use Interlock here and on RecordPing. The pong is recorded when the MoveNext completes and that happens in the background.
There was a problem hiding this comment.
Good point - I'd confused myself after all the refactoring, done for both.
| public bool DisableKeepAlive { get; private set; } | ||
| public ServerAction WithDisableKeepAlive() { DisableKeepAlive = true; return this; } |
There was a problem hiding this comment.
It makes no sense to have the With to return the same instance. Either you make the type mutable, i.e. keep the DisableKeepAlive set public, or you don't, meaning you return a new instance on this method. But this is just a detour from using properties idioamatically.
There was a problem hiding this comment.
Thanks for the pointer, I've reverted this change entirely based on refactor.
7739ec0 to
d3110ee
Compare
amanda-tarafa
left a comment
There was a problem hiding this comment.
These are good tests, thanks.
A couple of things, but looks great.
| } | ||
| } | ||
|
|
||
| // Message-stream has messages (or not, depending on moveNextResult) | ||
| private void HandlePullMessageData(Task<bool> moveNextTask) | ||
| { | ||
| if (!moveNextTask.IsCompleted) |
There was a problem hiding this comment.
We need to record the failure in _retryState as that's what increments the backoff for restarting the pull.
This part we need to solve on this PR, otherwise we'll be retrying sooner than with the backoff and that might make the problem worse.
What we can do in a follow up is all the refactoring.
There was a problem hiding this comment.
Is it OK to change these properties based on the values in a pong? I'm guessing yes, because the next move next with real messages will have the right values, but just double checking, what do you think?
There was a problem hiding this comment.
Thanks for pointing this out. There is a case where we HandleAckResponse and for that we will have incorrect _exactlyonceDelieveryEnabled field set. Fixed.
| /// <summary> | ||
| /// For testing only. Disables stream timeout enforcement when no messages are being received. | ||
| /// </summary> | ||
| internal bool DisableKeepAliveTimeout { get; set; } |
There was a problem hiding this comment.
Let's make this one read-only and pass the value in the constructor. And if my recommendation of configuring the pong period seems good, you'll have to rename it and change the type, of course.
There was a problem hiding this comment.
Really good point, thanks Amanda. I've updated the code.
| internal void RecordPing() | ||
| { | ||
| long nowTicks = _clock.GetCurrentDateTimeUtc().Ticks; | ||
| Interlocked.Exchange(ref _lastPingTicks, nowTicks); |
There was a problem hiding this comment.
I think you only need interlock for the pong.
There was a problem hiding this comment.
Yes you are right, the ping is grabbed synchronously on the main processing thread. Removed the interlocked.
| /// <summary> | ||
| /// For testing only. Disables stream timeout enforcement when no messages are being received. | ||
| /// </summary> | ||
| internal bool DisableKeepAliveTimeout { get; set; } |
There was a problem hiding this comment.
Let's keep this one read only as well, and receive it on the internal constructor, we can use that one from the tests if we have to.
And can we just configure the pong timeout instead, for tests we set to inifnity (or whatever that translates to in TimeSpan) and we don't have to add ifs on production code. If that works, rename this as PongPeriod or something like that.
There was a problem hiding this comment.
That's a nice way to avoid ifs in production code 👍 . I've updated accordingly
| return Task.FromResult(SubscriberClient.Reply.Ack); | ||
| }); | ||
| // Wait for more than the timeout. | ||
| // With dynamic timeout, the first stream starts at T=0. |
There was a problem hiding this comment.
nit: What does the word dynamic mean here?
There was a problem hiding this comment.
I meant to convey that the timeout on the first was at 45 seconds instead of 30 seconds. I just removed it I think it's probably clearer without.
d3110ee to
7d4c98d
Compare
7d4c98d to
974ea8e
Compare
amanda-tarafa
left a comment
There was a problem hiding this comment.
LGTM!
Make sure to run all integration tests locally before merging.
b/427319804