Fix the force-kill disposition race and harden the elevation connect harness by jschick04 · Pull Request #612 · microsoft/EventLogExpert

jschick04 · 2026-06-21T00:51:40Z

What

Fixes two intermittent CI test flakes in the elevation-helper area. Both are independent of the build/ARM64 work in #609 and are based on main.

1. Force-kill disposition race (#593) - also a production correctness bug

ElevatedDatabaseToolsRunner records the force-kill KillDisposition on a fire-and-forget kill-timer task: it calls process.Kill() and only then MarkKillSucceeded(). Kill() closes the helper pipe (the helper dies), which unblocks the runner's drain loop, so the main flow can reach TranslateOutcome and read KillState.Disposition before the timer task records it. When that read wins, the outcome is the generic "Cancelled." instead of the "Cancelled (... force-killed). If you ran an Upgrade, a .bak of the original target may remain ... rename it to recover." message.

That is the unit-test flake RunAsync_WhenHelperKillSucceedsAfterCancel_ReportsForceKilled (passes in isolation, fails under parallel contention), and the same ordering applies in production - a force-killed helper can lose the recovery hint.

Fix: the kill-timer task is captured in KillState; after CancelGraceTimer() and before the exit-wait, the runner awaits it (bounded by _exitGrace), so the disposition write happens-before the disposition read. The bound preserves the existing "runner never hangs" guarantee.

2. Elevation helper connect harness masks early exit

TestElevatedHelperProcessHost.StartAsync waited only for the pipe to connect (15s) and never observed early helper-process exit, so a crash / wrong deps / cold-Docker startup surfaced as an opaque "did not connect within 15s" after wasting the full window.

Fix: race the connect against a separate exit-watch source (so a connect timeout cannot be misreported as an exit); on early exit, throw a diagnosable error with the exit code and a bounded, lock-guarded stderr tail; raise the cold-start timeout to 60s; the finally cancels and observes both tasks so neither leaks.

Why it matters

Flake 1 reds build-and-test intermittently and, in production, can drop the data-recovery hint after a force-kill. Flake 2 turns a fast, diagnosable helper-startup failure into a slow blind timeout (and would mask a real RID/arch regression if the build graph ever changes).

Verification

Build: 0 warnings / 0 errors (Runtime.Tests and ElevationHelper.IntegrationTests).
Flake 1: RunAsync_WhenHelperKillSucceedsAfterCancel... + the unkillable sister test 15/15 green; full Runtime.Tests 1629/1629 x3.
Flake 2: harness compiles; the integration suite itself is container-gated and runs in CI.

The local force-kill race is timing-dependent and did not reproduce locally before the fix; the diagnosis is grounded in source analysis and the existing #593 CI failures, and the fix makes the disposition read deterministic.

Scope

Two files: src/EventLogExpert.Runtime/.../ElevatedDatabaseToolsRunner.cs (production) and tests/Integration/.../TestUtils/TestElevatedHelperProcessHost.cs (test harness). No public API change (the runner additions are on a private nested type).

Closes #593.

Copilot

Pull request overview

Fixes two intermittent CI flakes (and one production correctness issue) in the elevation-helper flow by making cancellation/force-kill outcome reporting deterministic and by hardening the integration-test helper startup harness to surface early-exit failures with better diagnostics.

Changes:

Ensure the force-kill timer task’s KillDisposition is observed before translating the final outcome (avoids losing the “force-killed” recovery hint).
Improve the integration-test helper startup harness by racing pipe-connect vs. early process exit, increasing connect timeout, and capturing a bounded stderr tail for diagnosis.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
`src/EventLogExpert.Runtime/DatabaseTools/Elevation/ElevatedDatabaseToolsRunner.cs`	Captures and awaits the cancellation force-kill timer task so disposition is set before `TranslateOutcome` reads it.
`tests/Integration/EventLogExpert.ElevationHelper.IntegrationTests/TestUtils/TestElevatedHelperProcessHost.cs`	Adds early-exit detection, longer connect timeout, and bounded stderr capture to make helper-start failures diagnosable.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings June 21, 2026 00:51

Copilot started reviewing on behalf of jschick04 June 21, 2026 00:52 View session

Copilot AI reviewed Jun 21, 2026

View reviewed changes

Comment thread ...n/EventLogExpert.ElevationHelper.IntegrationTests/TestUtils/TestElevatedHelperProcessHost.cs

Comment thread ...n/EventLogExpert.ElevationHelper.IntegrationTests/TestUtils/TestElevatedHelperProcessHost.cs Outdated

Fix the force-kill disposition race and the helper connect harness

f749fd4

jschick04 force-pushed the jschick/elevation-forcekill-race branch from 3f129cc to f749fd4 Compare June 21, 2026 01:09

jschick04 changed the base branch from main to jschick/embed-symbols-drop-m1 June 21, 2026 01:09

jschick04 requested a review from Copilot June 21, 2026 01:13

Copilot started reviewing on behalf of jschick04 June 21, 2026 01:14 View session

Copilot AI reviewed Jun 21, 2026

View reviewed changes

jschick04 marked this pull request as ready for review June 21, 2026 01:16

jschick04 requested a review from a team as a code owner June 21, 2026 01:16

jschick04 merged commit 2a02c9b into jschick/embed-symbols-drop-m1 Jun 21, 2026
8 checks passed

jschick04 deleted the jschick/elevation-forcekill-race branch June 21, 2026 01:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix the force-kill disposition race and harden the elevation connect harness#612

Fix the force-kill disposition race and harden the elevation connect harness#612
jschick04 merged 1 commit into
jschick/embed-symbols-drop-m1from
jschick/elevation-forcekill-race

jschick04 commented Jun 21, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

jschick04 commented Jun 21, 2026

What

1. Force-kill disposition race (#593) - also a production correctness bug

2. Elevation helper connect harness masks early exit

Why it matters

Verification

Scope

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants