Fix the force-kill disposition race and harden the elevation connect harness #612
Merged
jschick04 merged 1 commit on Jun 21, 2026
Pull request overview
Fixes two intermittent CI flakes (and one production correctness issue) in the elevation-helper flow by making cancellation/force-kill outcome reporting deterministic and by hardening the integration-test helper startup harness to surface early-exit failures with better diagnostics.
Changes:
- Ensure the force-kill timer task’s `KillDisposition` is observed before translating the final outcome (avoids losing the “force-killed” recovery hint).
- Improve the integration-test helper startup harness by racing pipe-connect vs. early process exit, increasing connect timeout, and capturing a bounded stderr tail for diagnosis.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/EventLogExpert.Runtime/DatabaseTools/Elevation/ElevatedDatabaseToolsRunner.cs` | Captures and awaits the cancellation force-kill timer task so disposition is set before `TranslateOutcome` reads it. |
| `tests/Integration/EventLogExpert.ElevationHelper.IntegrationTests/TestUtils/TestElevatedHelperProcessHost.cs` | Adds early-exit detection, longer connect timeout, and bounded stderr capture to make helper-start failures diagnosable. |
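The early-exit detection in the test harness can be pictured as a three-way race between pipe connect, process exit, and a timeout. The following is a minimal sketch under stated assumptions, not the code from this PR: `StartupRaceSketch`, `WaitForConnectOrExitAsync`, and `BoundedTail` are hypothetical names, and the connect and exit sources are passed in as delegates so the example runs without a real helper process. In a real harness those delegates would likely wrap calls such as `NamedPipeServerStream.WaitForConnectionAsync` and `Process.WaitForExitAsync`.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Demo with simulated sources: the pipe never connects and the "helper" exits after 20 ms.
var tail = new BoundedTail(capacity: 2);
tail.Add("starting helper");
tail.Add("loading dependencies");
tail.Add("fatal: missing dependency"); // the oldest line is dropped here

try
{
    await StartupRaceSketch.WaitForConnectOrExitAsync(
        connect: ct => Task.Delay(Timeout.Infinite, ct),
        waitForExit: async ct => { await Task.Delay(20, ct); return 3; },
        stderrTail: tail.Snapshot,
        connectTimeout: TimeSpan.FromSeconds(60));
}
catch (InvalidOperationException ex)
{
    // Fails fast with the exit code and the last two stderr lines.
    Console.WriteLine(ex.Message);
}

// Hypothetical helper: keeps only the last N stderr lines, guarded by a lock.
sealed class BoundedTail
{
    private readonly object _gate = new();
    private readonly Queue<string> _lines = new();
    private readonly int _capacity;

    public BoundedTail(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _capacity = capacity;
    }

    public void Add(string line)
    {
        lock (_gate)
        {
            if (_lines.Count == _capacity) _lines.Dequeue();
            _lines.Enqueue(line);
        }
    }

    public string Snapshot()
    {
        lock (_gate) { return string.Join(Environment.NewLine, _lines); }
    }
}

// Hypothetical sketch of the race; not the harness's actual implementation.
static class StartupRaceSketch
{
    public static async Task WaitForConnectOrExitAsync(
        Func<CancellationToken, Task> connect,
        Func<CancellationToken, Task<int>> waitForExit,
        Func<string> stderrTail,
        TimeSpan connectTimeout)
    {
        using var cts = new CancellationTokenSource();
        Task connectTask = connect(cts.Token);
        Task<int> exitTask = waitForExit(cts.Token);
        // The timeout is its own task, so a timeout is never confused with an exit.
        Task timeoutTask = Task.Delay(connectTimeout, cts.Token);

        try
        {
            Task winner = await Task.WhenAny(connectTask, exitTask, timeoutTask);

            if (winner == connectTask)
            {
                await connectTask; // propagate a connect failure, if any
                return;
            }

            if (winner == exitTask)
            {
                int code = await exitTask;
                throw new InvalidOperationException(
                    $"Helper exited early with code {code}. stderr tail:{Environment.NewLine}{stderrTail()}");
            }

            throw new TimeoutException(
                $"Helper did not connect within {connectTimeout.TotalSeconds:0}s.");
        }
        finally
        {
            // Cancel the losers and observe every task so none is left unobserved.
            cts.Cancel();
            foreach (Task t in new Task[] { connectTask, exitTask, timeoutTask })
            {
                try { await t; } catch (Exception) { /* cancelled or already reported */ }
            }
        }
    }
}
```

Keeping the timeout as a third task is one way to get the "timeout cannot be misreported as an exit" property; the PR may achieve it differently.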
What
Fixes two intermittent CI test flakes in the elevation-helper area. Both are independent of the build/ARM64 work in #609 and are based on `main`.
1. Force-kill disposition race (#593) - also a production correctness bug
`ElevatedDatabaseToolsRunner` records the force-kill `KillDisposition` on a fire-and-forget kill-timer task: it calls `process.Kill()` and only then `MarkKillSucceeded()`. `Kill()` closes the helper pipe (the helper dies), which unblocks the runner's drain loop, so the main flow can reach `TranslateOutcome` and read `KillState.Disposition` before the timer task records it. When that read wins, the outcome is the generic `"Cancelled."` instead of the `"Cancelled (... force-killed). If you ran an Upgrade, a .bak of the original target may remain ... rename it to recover."` message.
That is the unit-test flake `RunAsync_WhenHelperKillSucceedsAfterCancel_ReportsForceKilled` (passes in isolation, fails under parallel contention), and the same ordering applies in production - a force-killed helper can lose the recovery hint.
Fix: the kill-timer task is captured in `KillState`; after `CancelGraceTimer()` and before the exit-wait, the runner awaits it (bounded by `_exitGrace`), so the disposition write happens-before the disposition read. The bound preserves the existing "runner never hangs" guarantee.
2. Elevation helper connect harness masks early exit
`TestElevatedHelperProcessHost.StartAsync` waited only for the pipe to connect (15s) and never observed early helper-process exit, so a crash / wrong deps / cold-Docker startup surfaced as an opaque `"did not connect within 15s"` after wasting the full window.
Fix: race the connect against a separate exit-watch source (so a connect timeout cannot be misreported as an exit); on early exit, throw a diagnosable error with the exit code and a bounded, lock-guarded stderr tail; raise the cold-start timeout to 60s; the `finally` cancels and observes both tasks so neither leaks.
Why it matters
Flake 1 reds `build-and-test` intermittently and, in production, can drop the data-recovery hint after a force-kill. Flake 2 turns a fast, diagnosable helper-startup failure into a slow blind timeout (and would mask a real RID/arch regression if the build graph ever changes).
Verification
- … `Runtime.Tests` and `ElevationHelper.IntegrationTests`).
- `RunAsync_WhenHelperKillSucceedsAfterCancel...` + the unkillable sister test 15/15 green; full `Runtime.Tests` 1629/1629 x3.
The force-kill race is timing-dependent and did not reproduce locally before the fix; the diagnosis is grounded in source analysis and the existing #593 CI failures, and the fix makes the disposition read deterministic.
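To illustrate why awaiting the kill-timer task makes the disposition read deterministic, here is a minimal sketch under stated assumptions. Only the names `KillState`, `KillDisposition`, and `MarkKillSucceeded` come from the PR description; the enum members, the `KillTimerTask` property, `RunnerSketch`, and the 50 ms delay that widens the race window are hypothetical, and a `TaskCompletionSource` stands in for the pipe closing when the helper is killed.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// With the bounded await in place, the read always sees the timer task's write.
KillDisposition disposition =
    await RunnerSketch.CancelAndReadDispositionAsync(exitGrace: TimeSpan.FromSeconds(5));
Console.WriteLine(disposition); // prints "Succeeded"

// Hypothetical enum members; the PR only names the type.
enum KillDisposition { None, Succeeded }

// Hypothetical shape of the state shared between the timer task and the main flow.
sealed class KillState
{
    private int _disposition;

    // Captured so the main flow can await the write before reading.
    public Task KillTimerTask { get; set; } = Task.CompletedTask;

    public KillDisposition Disposition => (KillDisposition)Volatile.Read(ref _disposition);

    public void MarkKillSucceeded() =>
        Volatile.Write(ref _disposition, (int)KillDisposition.Succeeded);
}

static class RunnerSketch
{
    public static async Task<KillDisposition> CancelAndReadDispositionAsync(TimeSpan exitGrace)
    {
        var state = new KillState();
        var pipeClosed = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);

        // Kill-timer task: the "kill" closes the pipe first and records the disposition afterwards.
        state.KillTimerTask = Task.Run(async () =>
        {
            pipeClosed.TrySetResult();   // stands in for process.Kill() closing the helper pipe
            await Task.Delay(50);        // artificial delay that widens the race window
            state.MarkKillSucceeded();
        });

        // The drain loop unblocks as soon as the pipe closes.
        await pipeClosed.Task;

        // The ordering fix: wait for the timer task, but never longer than the grace period.
        try
        {
            await state.KillTimerTask.WaitAsync(exitGrace);
        }
        catch (TimeoutException)
        {
            // Bound hit: proceed anyway so the caller never hangs.
        }

        // Stands in for the read performed while translating the outcome.
        return state.Disposition;
    }
}
```

Removing the `WaitAsync` call from this sketch would let the read run during the 50 ms window and return `None`, which mirrors the generic `"Cancelled."` outcome described above.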
Scope
Two files: `src/EventLogExpert.Runtime/.../ElevatedDatabaseToolsRunner.cs` (production) and `tests/Integration/.../TestUtils/TestElevatedHelperProcessHost.cs` (test harness). No public API change (the runner additions are on a private nested type).
Closes #593.