Skip to content

Fix the force-kill disposition race and harden the elevation connect harness#612

Merged
jschick04 merged 1 commit into
jschick/embed-symbols-drop-m1from
jschick/elevation-forcekill-race
Jun 21, 2026
Merged

Fix the force-kill disposition race and harden the elevation connect harness#612
jschick04 merged 1 commit into
jschick/embed-symbols-drop-m1from
jschick/elevation-forcekill-race

Conversation

@jschick04

Copy link
Copy Markdown
Collaborator

What

Fixes two intermittent CI test flakes in the elevation-helper area. Both are independent of the build/ARM64 work in #609 and are based on main.

1. Force-kill disposition race (#593) - also a production correctness bug

ElevatedDatabaseToolsRunner records the force-kill KillDisposition on a fire-and-forget kill-timer task: it calls process.Kill() and only then MarkKillSucceeded(). Kill() closes the helper pipe (the helper dies), which unblocks the runner's drain loop, so the main flow can reach TranslateOutcome and read KillState.Disposition before the timer task records it. When that read wins, the outcome is the generic "Cancelled." instead of the "Cancelled (... force-killed). If you ran an Upgrade, a .bak of the original target may remain ... rename it to recover." message.

That is the unit-test flake RunAsync_WhenHelperKillSucceedsAfterCancel_ReportsForceKilled (passes in isolation, fails under parallel contention), and the same ordering applies in production - a force-killed helper can lose the recovery hint.

Fix: the kill-timer task is captured in KillState; after CancelGraceTimer() and before the exit-wait, the runner awaits it (bounded by _exitGrace), so the disposition write happens-before the disposition read. The bound preserves the existing "runner never hangs" guarantee.

2. Elevation helper connect harness masks early exit

TestElevatedHelperProcessHost.StartAsync waited only for the pipe to connect (15s) and never observed early helper-process exit, so a crash / wrong deps / cold-Docker startup surfaced as an opaque "did not connect within 15s" after wasting the full window.

Fix: race the connect against a separate exit-watch source (so a connect timeout cannot be misreported as an exit); on early exit, throw a diagnosable error with the exit code and a bounded, lock-guarded stderr tail; raise the cold-start timeout to 60s; the finally cancels and observes both tasks so neither leaks.

Why it matters

Flake 1 reds build-and-test intermittently and, in production, can drop the data-recovery hint after a force-kill. Flake 2 turns a fast, diagnosable helper-startup failure into a slow blind timeout (and would mask a real RID/arch regression if the build graph ever changes).

Verification

  • Build: 0 warnings / 0 errors (Runtime.Tests and ElevationHelper.IntegrationTests).
  • Flake 1: RunAsync_WhenHelperKillSucceedsAfterCancel... + the unkillable sister test 15/15 green; full Runtime.Tests 1629/1629 x3.
  • Flake 2: harness compiles; the integration suite itself is container-gated and runs in CI.

The local force-kill race is timing-dependent and did not reproduce locally before the fix; the diagnosis is grounded in source analysis and the existing #593 CI failures, and the fix makes the disposition read deterministic.

Scope

Two files: src/EventLogExpert.Runtime/.../ElevatedDatabaseToolsRunner.cs (production) and tests/Integration/.../TestUtils/TestElevatedHelperProcessHost.cs (test harness). No public API change (the runner additions are on a private nested type).

Closes #593.

Copilot AI review requested due to automatic review settings June 21, 2026 00:51

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes two intermittent CI flakes (and one production correctness issue) in the elevation-helper flow by making cancellation/force-kill outcome reporting deterministic and by hardening the integration-test helper startup harness to surface early-exit failures with better diagnostics.

Changes:

  • Ensure the force-kill timer task’s KillDisposition is observed before translating the final outcome (avoids losing the “force-killed” recovery hint).
  • Improve the integration-test helper startup harness by racing pipe-connect vs. early process exit, increasing connect timeout, and capturing a bounded stderr tail for diagnosis.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/EventLogExpert.Runtime/DatabaseTools/Elevation/ElevatedDatabaseToolsRunner.cs Captures and awaits the cancellation force-kill timer task so disposition is set before TranslateOutcome reads it.
tests/Integration/EventLogExpert.ElevationHelper.IntegrationTests/TestUtils/TestElevatedHelperProcessHost.cs Adds early-exit detection, longer connect timeout, and bounded stderr capture to make helper-start failures diagnosable.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jschick04 jschick04 force-pushed the jschick/elevation-forcekill-race branch from 3f129cc to f749fd4 Compare June 21, 2026 01:09
@jschick04 jschick04 changed the base branch from main to jschick/embed-symbols-drop-m1 June 21, 2026 01:09
@jschick04 jschick04 requested a review from Copilot June 21, 2026 01:13

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@jschick04 jschick04 marked this pull request as ready for review June 21, 2026 01:16
@jschick04 jschick04 requested a review from a team as a code owner June 21, 2026 01:16
@jschick04 jschick04 merged commit 2a02c9b into jschick/embed-symbols-drop-m1 Jun 21, 2026
8 checks passed
@jschick04 jschick04 deleted the jschick/elevation-forcekill-race branch June 21, 2026 01:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: ElevatedDatabaseToolsRunnerTests.RunAsync_WhenHelperKillSucceedsAfterCancel_ReportsForceKilled fails intermittently under parallel test runs

2 participants