
Commit 9524e32

kotlarmilos and Copilot committed
[ci-scanner] remove outage circuit breaker
Remove the outage circuit breaker step and its associated template. The per-run trip thresholds were too aggressive in practice: any significant outage immediately tripped them and produced one consolidated tracking issue instead of per-failure KBEs, which is not actionable. Renumber steps 5/6/7 to 4/5/6 accordingly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
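For reference, the trip conditions deleted by this commit can be sketched roughly as below. This is a minimal Python illustration of the three thresholds quoted in the removed Step 4 text of the diff; the function name, argument names, and the input shape (a dict of signature tuple to occurrence count) are hypothetical and not part of the workflow.

```python
from typing import Dict, List


def breaker_trips(occurrences: Dict[tuple, int], mostly_red_pipelines: int) -> List[str]:
    """Return the names of the breached thresholds (empty list means clean).

    `occurrences` maps each distinct signature tuple to its occurrence count.
    Names and input shape are assumptions; the three thresholds are taken
    from the removed Step 4 text.
    """
    total = len(occurrences)  # total-actionable-failures
    # top-signature-share = max(occurrence_count) / total-actionable-failures
    top_share = max(occurrences.values()) / total if total else 0.0

    breached = []
    if total > 30:
        breached.append("total-actionable-failures")
    if mostly_red_pipelines >= 4:
        breached.append("pipelines-mostly-red")
    if top_share >= 0.5:
        breached.append("top-signature-share")
    return breached
```

Under this reading of the formula, a run with three distinct signatures where one of them occurred twice already reaches a share of 2/3 and trips the breaker, which may be the kind of behavior the commit message calls too aggressive.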
1 parent 52d2d06 commit 9524e32

1 file changed

Lines changed: 26 additions & 87 deletions


.github/workflows/ci-failure-scan.md

@@ -111,14 +111,13 @@ For every actionable failure, converge on these artifacts:
111111

112112
## Step-by-step
113113

114-
Walk the steps in order. Do not skip. Stop at Step 7.
114+
Walk the steps in order. Do not skip. Stop at Step 6.
115115

116116
### Step 1 — Orient
117117

118118
Read once at start:
119119

120-
- The skill matching the pipeline you are about to scan (routing table in Step 5.1). Skills live under `.github/skills/`.
121-
- `/tmp/gh-aw/agent/coverage/_breaker.txt` — if present and dated within the last 12h, the previous run tripped the outage breaker. Re-evaluate in Step 4.
120+
- The skill matching the pipeline you are about to scan (routing table in Step 4.1). Skills live under `.github/skills/`.
122121

123122
### Step 2 — Walk pipelines
124123

@@ -169,13 +168,13 @@ For each row in the pipeline table below, in order:
169168

170169
Decide the class of every failed timeline record before passing it to Step 4. The timeline graph is `Stage -> Phase -> Job -> Task`; walk it via `parentId`. Drill into one representative console log per signature to confirm the shape.
171170

172-
1. **Build break.** Failed task is `Build product` / `Build native components` / `Configure CMake` / any pre-test compile step, AND `Send to Helix` is `skipped`. -> Step 6 Branch D (tracking issue). Do NOT file a KBE.
171+
1. **Build break.** Failed task is `Build product` / `Build native components` / `Configure CMake` / any pre-test compile step, AND `Send to Helix` is `skipped`. -> Step 5 Branch D (tracking issue). Do NOT file a KBE.
173172
2. **Phase/Stage-only failure with no failed Job underneath.** Compile breaks aggregated at phase level (e.g. `windows-arm64 checked` on JIT stress pipelines). Open the Phase log + the latest log of any non-succeeded child Task -> classify as build break.
174-
3. **Helix work-item failure.** `Send to Helix` succeeded but Job still failed. Extract Helix job IDs from the `Send to Helix` log (`Sent Helix Job: <GUID>`), query Helix work items, fetch the failing console log, locate the `[FAIL]` line -> Step 5 (test failure).
175-
4. **Dead-lettered Helix work item.** Console URI contains `helix-workitem-deadletter` -> Step 6 Branch E (grouped infra issue).
176-
5. **Infra-shaped Job failure with no Helix work items.** `Initialize job` failed / agent disconnect / `Pool is offline` -> Step 6 Branch E.
173+
3. **Helix work-item failure.** `Send to Helix` succeeded but Job still failed. Extract Helix job IDs from the `Send to Helix` log (`Sent Helix Job: <GUID>`), query Helix work items, fetch the failing console log, locate the `[FAIL]` line -> Step 4 (test failure).
174+
4. **Dead-lettered Helix work item.** Console URI contains `helix-workitem-deadletter` -> Step 5 Branch E (grouped infra issue).
175+
5. **Infra-shaped Job failure with no Helix work items.** `Initialize job` failed / agent disconnect / `Pool is offline` -> Step 5 Branch E.
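The five classification rules above (using the post-commit step numbers) can be sketched as a single function. This is only an illustration: the `FailedRecord` fields are hypothetical summaries of what the scanner would have gathered from the timeline and Helix, the set of pre-test compile tasks lists only the three names given in rule 1, and checking the dead-letter rule before the generic Helix rule is an assumption about precedence.

```python
from dataclasses import dataclass
from typing import Optional

# Only the task names spelled out in rule 1; "any pre-test compile step"
# would need a broader match in practice.
PRE_TEST_COMPILE_TASKS = {"Build product", "Build native components", "Configure CMake"}
INFRA_MARKERS = ("Initialize job", "agent disconnect", "Pool is offline")


@dataclass
class FailedRecord:
    # All field names are hypothetical.
    record_type: str                     # "Stage", "Phase", "Job", or "Task"
    failed_task: Optional[str] = None
    send_to_helix: Optional[str] = None  # "succeeded", "skipped", or None
    has_failed_job_below: bool = True
    console_uri: str = ""
    failure_text: str = ""


def classify(rec: FailedRecord) -> str:
    # Rule 1: a pre-test compile step failed and Helix was never reached.
    if rec.failed_task in PRE_TEST_COMPILE_TASKS and rec.send_to_helix == "skipped":
        return "build-break (Step 5 Branch D)"
    # Rule 2: Phase/Stage-only failure with no failed Job underneath.
    if rec.record_type in ("Stage", "Phase") and not rec.has_failed_job_below:
        return "build-break (Step 5 Branch D)"
    # Rule 4, checked before rule 3 because a dead-lettered item was also sent to Helix.
    if "helix-workitem-deadletter" in rec.console_uri:
        return "infra (Step 5 Branch E)"
    # Rule 3: Helix accepted the work but the Job still failed.
    if rec.send_to_helix == "succeeded":
        return "test-failure (Step 4)"
    # Rule 5: infra-shaped Job failure with no Helix work items.
    if any(marker in rec.failure_text for marker in INFRA_MARKERS):
        return "infra (Step 5 Branch E)"
    return "unclassified"
```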
177176

178-
For each Step 5 candidate, compute the signature tuple `(definition_id, work_item_or_phase, queue, stress_mode, [FAIL]-or-compile-error signature)`. Look back ~10 prior completed builds in the same definition for first-seen-in-window timestamp and occurrence count.
177+
For each Step 4 candidate, compute the signature tuple `(definition_id, work_item_or_phase, queue, stress_mode, [FAIL]-or-compile-error signature)`. Look back ~10 prior completed builds in the same definition for first-seen-in-window timestamp and occurrence count.
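As an illustration of the signature tuple and the look-back count, a minimal Python sketch follows. The `Signature` type mirrors the five tuple fields named above; `occurrence_counts` and its input shape (one list of signatures per prior build) are hypothetical, and counting a signature at most once per build is an assumption.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Signature(NamedTuple):
    definition_id: int
    work_item_or_phase: str
    queue: str
    stress_mode: str
    failure_signature: str  # the [FAIL] line or the compile-error text


def occurrence_counts(prior_builds: Iterable[List[Signature]]) -> Counter:
    """Count in how many prior builds each signature appeared."""
    counts: Counter = Counter()
    for build_signatures in prior_builds:
        # A set ensures a signature repeated within one build counts once.
        counts.update(set(build_signatures))
    return counts
```

The resulting count is what a stability check such as ">= 2 occurrences in window" would read.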
179178

180179
#### Data sources
181180

@@ -185,34 +184,11 @@ For each Step 5 candidate, compute the signature tuple `(definition_id, work_ite
185184
- **Helix REST.** `https://helix.dot.net/api/jobs/{jobId}/workitems?api-version=2019-06-17`. Each item has `Name`, `State`, `ExitCode`, `ConsoleOutputUri`. Failed: `ExitCode != 0` or `State == "Failed"`.
186185
- **Build Analysis attachment (best-effort).** `https://dev.azure.com/dnceng-public/public/_apis/build/builds/{id}/attachments/Build_Analysis_KnownIssues_v1?api-version=7.1`. Use to dedupe. 404 = none attached; do not fail.
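The Helix failure predicate can be sketched as a small filter over the already-parsed JSON response. The URL template and the field names `State` and `ExitCode` come from the text above; the helper names are hypothetical, and treating a missing `ExitCode` as "not failed" is an assumption.

```python
from typing import Any, Dict, List

HELIX_WORKITEMS_URL = "https://helix.dot.net/api/jobs/{jobId}/workitems?api-version=2019-06-17"


def workitems_url(job_id: str) -> str:
    """Fill the job ID (from a `Sent Helix Job: <GUID>` log line) into the URL."""
    return HELIX_WORKITEMS_URL.format(jobId=job_id)


def failed_work_items(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep items where ExitCode != 0 or State == "Failed"."""
    failed = []
    for item in items:
        exit_code = item.get("ExitCode")
        # Assumption: an absent ExitCode alone does not mark the item as failed.
        nonzero_exit = exit_code is not None and exit_code != 0
        if nonzero_exit or item.get("State") == "Failed":
            failed.append(item)
    return failed
```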
187186

188-
### Step 4 — Outage circuit breaker
187+
### Step 4 — Per-signature walk
189188

190-
After Step 3 has produced the full signature set, BEFORE emitting any safe-output, compute:
189+
For each `(definition_id, phase, queue, stress_mode, signature)` produced by Step 3:
191190

192-
- `total-actionable-failures` = distinct `(definition_id, phase, queue, stress_mode, signature)` tuples this run.
193-
- `pipelines-mostly-red` = count of pipelines whose latest scanned build has >= 50% of legs failing.
194-
- `top-signature-share` = max(occurrence_count) / total-actionable-failures.
195-
196-
Trip the breaker if ANY of:
197-
198-
- `total-actionable-failures > 30`
199-
- `pipelines-mostly-red >= 4`
200-
- `top-signature-share >= 0.5`
201-
202-
On trip:
203-
204-
1. Do NOT emit per-failure KBEs, muting PRs, or fix PRs.
205-
2. Emit exactly ONE `create_issue` using the Outage summary template (see Templates).
206-
3. Persist `/tmp/gh-aw/agent/coverage/_breaker.txt` with the three totals + which threshold(s) breached.
207-
4. Skip Steps 5–6; jump to Step 7.
208-
209-
If clean -> continue to Step 5.
210-
211-
### Step 5 — Per-signature walk
212-
213-
For each `(definition_id, phase, queue, stress_mode, signature)` surviving Step 4:
214-
215-
#### Step 5.1 — Load the matching skill
191+
#### Step 4.1 — Load the matching skill
216192

217193
| Pipeline category | Skill |
218194
|---|---|
@@ -222,27 +198,27 @@ For each `(definition_id, phase, queue, stress_mode, signature)` surviving Step
222198
| NativeAOT outer loop | Check `eng/testing/tests.*aot*.targets` and the test `.csproj` for AOT-specific conditions before suggesting a fix. |
223199
| Generic | `ci-pipeline-monitor/SKILL.md` |
224200

225-
#### Step 5.2 — Search for an existing KBE
201+
#### Step 4.2 — Search for an existing KBE
226202

227203
`is:issue is:open label:"Known Build Error" in:body "<error-signature>"`. Try variations: full `[FAIL]` line; assertion text; exception class + test name. On hit, record `existing-kbe #<n>` and continue (the walk does not end — a KBE hit changes the final action, not the inspection).
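The query variations could be generated along these lines. This is a sketch only: the helper name and its parameters are hypothetical, it does not escape quotes inside the inputs, and searching exception class and test name as two separately quoted terms is an assumption.

```python
from typing import List

KBE_QUERY_BASE = 'is:issue is:open label:"Known Build Error" in:body'


def kbe_queries(fail_line: str, assertion_text: str,
                exception_class: str, test_name: str) -> List[str]:
    """Build one search query per variation, skipping empty inputs."""
    queries = []
    if fail_line.strip():
        queries.append(f'{KBE_QUERY_BASE} "{fail_line.strip()}"')
    if assertion_text.strip():
        queries.append(f'{KBE_QUERY_BASE} "{assertion_text.strip()}"')
    if exception_class.strip() and test_name.strip():
        queries.append(
            f'{KBE_QUERY_BASE} "{exception_class.strip()}" "{test_name.strip()}"'
        )
    return queries
```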
228204

229-
#### Step 5.3 — Search for an area-team tracker (no KBE label)
205+
#### Step 4.3 — Search for an area-team tracker (no KBE label)
230206

231207
`is:issue is:open in:title "<test-name>"` AND `in:body "<test-file-path>"`. On hit, record `linked-tracker #<n>`. A plain tracker is NOT a KBE substitute (Build Analysis only matches `Known Build Error`-labeled issues with a valid JSON body). File a fresh KBE and cross-link the tracker as `Tracking: dotnet/runtime#<tracker>` inside the KBE body and the muting PR body.
232208

233-
#### Step 5.4 — Search for an existing muting PR
209+
#### Step 4.4 — Search for an existing muting PR
234210

235211
`is:pr is:open in:title "<test-name>" "[ci-scan]"` and `is:pr is:open "<test-name>" ActiveIssue`. On hit, record `existing-PR #<n>` (muting) and stop the walk for this signature.
236212

237-
#### Step 5.5 — Search for an in-flight fix PR by anyone
213+
#### Step 4.5 — Search for an in-flight fix PR by anyone
238214

239215
Broad search (NOT only `[ci-scan]` PRs): `is:pr is:open "<test-name>"`, `is:pr is:open "<test-file-path>"`, `is:pr is:open "<assembly>" in:title`. Fetch each candidate body; if it claims to fix this failure or links the same KBE, record `existing-PR #<n>` (in-flight fix) and stop.
240216

241-
#### Step 5.6 — Verify every embedded issue number exists
217+
#### Step 4.6 — Verify every embedded issue number exists
242218

243219
For every `<n>` you plan to write into source (`[ActiveIssue("...issues/<n>")]`, `Linked KBE: #<n>`, inline `<!-- ...issues/<n> -->`) call `issue_read` with `get` and `{owner: "dotnet", repo: "runtime", issue_number: <n>}`. Confirm it returns an open issue. If it does not -> stop. A dead-link annotation in source requires a follow-up PR to remove.
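Collecting the issue numbers to verify can be sketched with a regular expression over the text about to be written. The helper name is hypothetical, and the pattern covers only the three shapes listed above; the `issue_read` call itself is not shown.

```python
import re
from typing import List

# Matches ".../issues/<n>" (ActiveIssue attributes, inline comments)
# and "Linked KBE: #<n>".
ISSUE_REF = re.compile(r"issues/(\d+)|Linked KBE: #(\d+)")


def embedded_issue_numbers(text: str) -> List[int]:
    """Return the distinct issue numbers referenced in `text`, sorted."""
    numbers = set()
    for match in ISSUE_REF.finditer(text):
        numbers.add(int(match.group(1) or match.group(2)))
    return sorted(numbers)
```

Each returned number would then be checked once, so a number that appears in both an annotation and the PR body is not verified twice.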
244220

245-
#### Step 5.7 — Confirm muting is welcome on the candidate issue
221+
#### Step 4.7 — Confirm muting is welcome on the candidate issue
246222

247223
Read the candidate KBE / tracker body + its most recent area-owner comment. Skip muting (record `-> skipped: do-not-mute on issue #<n>`) if ANY of:
248224

@@ -252,7 +228,7 @@ Read the candidate KBE / tracker body + its most recent area-owner comment. Skip
252228

253229
When in doubt -> skip muting and let the next run revisit.
254230

255-
#### Step 5.8 — Verify the candidate KBE actually matches (4-question check)
231+
#### Step 4.8 — Verify the candidate KBE actually matches (4-question check)
256232

257233
Before writing `Linked KBE: #<n>` or `[ActiveIssue("...issues/<n>")]`, answer:
258234

@@ -265,19 +241,19 @@ If any answer is no -> file a fresh KBE this run instead. Embed the four answers
265241

266242
Optional fifth check when the candidate KBE is older than ~14 days: confirm Build Analysis is still matching it. `gh api graphql` over `userContentEdits` gives the edit timeline; a stale never-edited body hints the signature went bad.
267243

268-
### Step 6 — Decide and emit
244+
### Step 5 — Decide and emit
269245

270246
Exactly one of these branches fires per signature.
271247

272248
**Branch A — No existing KBE; test failure; signature is stable (>= 2 occurrences in window).**
273249

274250
Emit one `create_issue` with `temporary_id: "aw_kbe<N>"` (fresh `<N>` per KBE) plus one matching `update_project`. Same-run, same agent output batch. See *Same-run KBE + project linkage payload* in Templates.
275251

276-
If Step 5.3 found a tracker, cross-link as `Tracking: dotnet/runtime#<tracker>` in the KBE body. Muting PR is deferred to the next run.
252+
If Step 4.3 found a tracker, cross-link as `Tracking: dotnet/runtime#<tracker>` in the KBE body. Muting PR is deferred to the next run.
277253

278-
**Branch B — Existing KBE; no muting PR; muting is welcome (Step 5.7 clean).**
254+
**Branch B — Existing KBE; no muting PR; muting is welcome (Step 4.7 clean).**
279255

280-
Emit one `create_pull_request` using the Muting PR template. Diff <= 5 lines; only test annotations or csproj flags. Body MUST include `Linked KBE: #<n>` as a top-level line plus the Step 5.8 four-question block.
256+
Emit one `create_pull_request` using the Muting PR template. Diff <= 5 lines; only test annotations or csproj flags. Body MUST include `Linked KBE: #<n>` as a top-level line plus the Step 4.8 four-question block.
281257

282258
If the existing KBE is not yet on the project board (check via project queries first), also emit one `update_project` referencing the real issue number.
283259

@@ -299,9 +275,9 @@ Group all infra failures in this run into ONE tracking issue. Before emitting, `
299275

300276
Emit one `create_issue` using the Tracking issue template (or the JIT pipeline template for JIT/GC/PGO/stress pipelines). Call out the signature problem in `Recommended action`.
301277

302-
After emitting, record the outcome per signature (Step 7).
278+
After emitting, record the outcome per signature (Step 6).
303279

304-
### Step 7 — Per-pipeline tally + end-of-run summary
280+
### Step 6 — Per-pipeline tally + end-of-run summary
305281

306282
Per signature, append one outcome line to `/tmp/gh-aw/agent/coverage/<pipeline>.txt`:
307283

@@ -311,16 +287,14 @@ Per signature, append one outcome line to `/tmp/gh-aw/agent/coverage/<pipeline>.
311287

312288
`<outcome>` is one of: `filed-issue #aw_<id>`, `filed-PR #aw_<id>`, `existing-issue #<n>`, `existing-PR #<n>`, `linked-to-project #<n>`, `skipped: <reason>`.
313289

314-
A skipped signature MUST have a reason (e.g., `build canceled`, `< 2 occurrences and not blocking`, `do-not-mute on issue #<n>`, `cap reached`, `breaker-tripped`).
290+
A skipped signature MUST have a reason (e.g., `build canceled`, `< 2 occurrences and not blocking`, `do-not-mute on issue #<n>`, `cap reached`).
315291

316292
At end of run, print this table to the agent log:
317293

318294
```
319295
| pipeline | total-signatures | issues-filed | prs-filed | reused-existing | linked-to-project | skipped-with-reason |
320296
```
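One way to produce a row of this table from the recorded outcomes is sketched below. The column names come from the table header and the outcome prefixes from the `<outcome>` list above; the mapping from prefix to column (in particular, counting both `existing-issue` and `existing-PR` as `reused-existing`) and the function name are assumptions.

```python
from collections import Counter
from typing import List

# Hypothetical mapping from outcome prefix to summary column.
COLUMN_FOR_PREFIX = [
    ("filed-issue", "issues-filed"),
    ("filed-PR", "prs-filed"),
    ("existing-issue", "reused-existing"),
    ("existing-PR", "reused-existing"),
    ("linked-to-project", "linked-to-project"),
    ("skipped:", "skipped-with-reason"),
]
COLUMNS = ["issues-filed", "prs-filed", "reused-existing",
           "linked-to-project", "skipped-with-reason"]


def tally_row(pipeline: str, outcomes: List[str]) -> str:
    """Format one summary row; `outcomes` holds one outcome per signature."""
    counts: Counter = Counter()
    for outcome in outcomes:
        for prefix, column in COLUMN_FOR_PREFIX:
            if outcome.startswith(prefix):
                counts[column] += 1
                break
    cells = [pipeline, str(len(outcomes))] + [str(counts[c]) for c in COLUMNS]
    return "| " + " | ".join(cells) + " |"
```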
321297

322-
If the breaker tripped in Step 4, the table contains one row: `_breaker | <total-actionable-failures> | 1 | 0 | 0 | 0 | <breached-thresholds>`.
323-
324298
## Templates
325299

326300
Emit each template verbatim except for `<placeholder>` slots. Match headings exactly — Build Analysis is strict about `## Error Message` and the JSON fence shape.
@@ -521,7 +495,7 @@ Branch handling: branch from `origin/main`. Stage only files you intend to chang
521495
Linked KBE: #<n>
522496
<if applicable: Tracking: dotnet/runtime#<tracker-n>>
523497

524-
Match verification (from Step 5.8):
498+
Match verification (from Step 4.8):
525499
1. Same test/family: <yes + evidence>
526500
2. Same failure signature: <yes + evidence>
527501
3. Same OS: <yes + evidence>
@@ -629,41 +603,6 @@ Used for tracking issues against JIT/GC/PGO/stress pipelines (definitions 109–
629603

630604
Do NOT propose any `area-*` label yourself. Area triage (`area-CodeGen-coreclr` / `area-GC-coreclr` / `area-PGO-coreclr` / `area-Tools-ILVerification`) is added later by a human reviewer.
631605

632-
### Template: Outage summary issue body
633-
634-
Emitted only when Step 4 trips the breaker. Title: `[ci-scan] Outage suspected: <short shape description>`. NO `Known Build Error` label.
635-
636-
````markdown
637-
## Reasoning
638-
The CI scanner's outage circuit breaker tripped this run. Per-failure KBEs/PRs were NOT emitted; this issue is the single consolidated record so the failure set is not lost.
639-
640-
Trip totals:
641-
- total-actionable-failures: <n>
642-
- pipelines-mostly-red: <n>
643-
- top-signature-share: <fraction>
644-
645-
Breached threshold(s): <list which of the three trip conditions fired>
646-
647-
## Impact on platforms
648-
- <top affected pipelines + leg counts>
649-
650-
## Errors log
651-
```
652-
<top signatures by occurrence, with one representative log excerpt per signature>
653-
```
654-
655-
## First build it occurred
656-
- Earliest build in the scanned window with any of these signatures: <link>
657-
- Window size: <n> builds
658-
659-
## Recommended action
660-
- Human review needed before per-failure processing resumes.
661-
- The next scheduled run re-evaluates from a fresh build sample. If the totals fall below the thresholds, per-failure KBEs/PRs resume automatically.
662-
- A human can short-circuit by filing the root-cause KBE manually; once Build Analysis matches the dominant signature, per-leg counts fall and subsequent runs converge.
663-
- Affected pipelines: <list>
664-
- Top signatures: <list>
665-
````
666-
667606
### Template: Sanitization
668607

669608
When pasting log excerpts into issue/PR bodies, strip:
@@ -700,4 +639,4 @@ These look like permission errors but are physical.
700639
- Don't comment on existing KBEs (Build Analysis tracks occurrence counts in the issue body).
701640
- Don't emit `noop`. Either a PR or an issue must come out of every actionable failure.
702641
- One signature = one outcome line in `/tmp/gh-aw/agent/coverage/<pipeline>.txt`.
703-
- The final agent log MUST include the Step 7 summary table.
642+
- The final agent log MUST include the Step 6 summary table.
