Remove the outage circuit breaker step and its associated template.
The per-run trip thresholds were too aggressive in practice — any
significant outage immediately tripped them and produced one
consolidated tracking issue instead of per-failure KBEs, which is
not actionable. Renumber steps 5/6/7 to 4/5/6 accordingly.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
.github/workflows/ci-failure-scan.md
+26 -87 Lines changed: 26 additions & 87 deletions
Original file line number
Diff line number
Diff line change
@@ -111,14 +111,13 @@ For every actionable failure, converge on these artifacts:
111
111
112
112
## Step-by-step
113
113
114
-
Walk the steps in order. Do not skip. Stop at Step 7.
114
+
Walk the steps in order. Do not skip. Stop at Step 6.
115
115
116
116
### Step 1 — Orient
117
117
118
118
Read once at start:
119
119
120
-
- The skill matching the pipeline you are about to scan (routing table in Step 5.1). Skills live under `.github/skills/`.
121
-
-`/tmp/gh-aw/agent/coverage/_breaker.txt` — if present and dated within the last 12h, the previous run tripped the outage breaker. Re-evaluate in Step 4.
120
+
- The skill matching the pipeline you are about to scan (routing table in Step 4.1). Skills live under `.github/skills/`.
122
121
123
122
### Step 2 — Walk pipelines
124
123
@@ -169,13 +168,13 @@ For each row in the pipeline table below, in order:
169
168
170
169
Decide the class of every failed timeline record before passing it to Step 4. The timeline graph is `Stage -> Phase -> Job -> Task`; walk it via `parentId`. Drill into one representative console log per signature to confirm the shape.
171
170
172
-
1. **Build break.** Failed task is `Build product` / `Build native components` / `Configure CMake` / any pre-test compile step, AND `Send to Helix` is `skipped`. -> Step 6 Branch D (tracking issue). Do NOT file a KBE.
171
+
1. **Build break.** Failed task is `Build product` / `Build native components` / `Configure CMake` / any pre-test compile step, AND `Send to Helix` is `skipped`. -> Step 5 Branch D (tracking issue). Do NOT file a KBE.
173
172
2. **Phase/Stage-only failure with no failed Job underneath.** Compile breaks aggregated at phase level (e.g. `windows-arm64 checked` on JIT stress pipelines). Open the Phase log + the latest log of any non-succeeded child Task -> classify as build break.
174
-
3. **Helix work-item failure.** `Send to Helix` succeeded but Job still failed. Extract Helix job IDs from the `Send to Helix` log (`Sent Helix Job: <GUID>`), query Helix work items, fetch the failing console log, locate the `[FAIL]` line -> Step 5 (test failure).
175
-
4. **Dead-lettered Helix work item.** Console URI contains `helix-workitem-deadletter` -> Step 6 Branch E (grouped infra issue).
176
-
5. **Infra-shaped Job failure with no Helix work items.** `Initialize job` failed / agent disconnect / `Pool is offline` -> Step 6 Branch E.
173
+
3. **Helix work-item failure.** `Send to Helix` succeeded but Job still failed. Extract Helix job IDs from the `Send to Helix` log (`Sent Helix Job: <GUID>`), query Helix work items, fetch the failing console log, locate the `[FAIL]` line -> Step 4 (test failure).
174
+
4. **Dead-lettered Helix work item.** Console URI contains `helix-workitem-deadletter` -> Step 5 Branch E (grouped infra issue).
175
+
5. **Infra-shaped Job failure with no Helix work items.** `Initialize job` failed / agent disconnect / `Pool is offline` -> Step 5 Branch E.
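The five classes above can be read as a small decision function. The sketch below is one possible encoding, not the workflow's actual implementation. The `FailedRecord` fields, the returned labels, and the check order are assumptions (dead-letter is tested first because `Send to Helix` also succeeded in that case), and only the three task names listed above are treated as build tasks.

```python
from dataclasses import dataclass
from typing import Optional

# Only the task names spelled out in the classification list; other
# "pre-test compile steps" would have to be added by hand (assumption).
BUILD_TASKS = {"Build product", "Build native components", "Configure CMake"}
INFRA_MARKERS = ("Initialize job", "agent disconnect", "Pool is offline")


@dataclass
class FailedRecord:
    """Hypothetical flattened view of one failed timeline record."""
    record_type: str                      # "Stage", "Phase", "Job", or "Task"
    failed_task: str = ""                 # name of the failed task, if any
    send_to_helix: Optional[str] = None   # "succeeded", "skipped", or None
    console_uri: str = ""                 # Helix console URI, if any
    has_failed_job: bool = True           # False for Phase/Stage-only failures


def classify(rec: FailedRecord) -> str:
    """Return an illustrative class label for a failed record."""
    if "helix-workitem-deadletter" in rec.console_uri:
        return "infra"                    # class 4: dead-lettered work item
    if rec.failed_task in BUILD_TASKS and rec.send_to_helix == "skipped":
        return "build-break"              # class 1
    if rec.record_type in ("Stage", "Phase") and not rec.has_failed_job:
        return "build-break"              # class 2: aggregated compile break
    if rec.send_to_helix == "succeeded":
        return "helix-test-failure"       # class 3
    if any(m.lower() in rec.failed_task.lower() for m in INFRA_MARKERS):
        return "infra"                    # class 5
    return "unclassified"
```

Anything that comes back `unclassified` still needs the manual drill-down into a representative console log.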
177
176
178
-
For each Step 5 candidate, compute the signature tuple `(definition_id, work_item_or_phase, queue, stress_mode, [FAIL]-or-compile-error signature)`. Look back ~10 prior completed builds in the same definition for first-seen-in-window timestamp and occurrence count.
177
+
For each Step 4 candidate, compute the signature tuple `(definition_id, work_item_or_phase, queue, stress_mode, [FAIL]-or-compile-error signature)`. Look back ~10 prior completed builds in the same definition for first-seen-in-window timestamp and occurrence count.
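As a rough illustration of the signature tuple and the look-back, the sketch below models the tuple as a `NamedTuple` and counts occurrences across prior builds. The `Signature` class, the `window_stats` helper, and the shape of `prior_builds` are hypothetical. Timestamps are assumed to be ISO-8601 strings in one timezone, so they sort lexicographically.

```python
from typing import Iterable, NamedTuple, Optional, Set, Tuple


class Signature(NamedTuple):
    definition_id: int
    work_item_or_phase: str
    queue: str
    stress_mode: str
    error_signature: str   # the [FAIL] line or compile-error text


def window_stats(
    sig: Signature,
    prior_builds: Iterable[Tuple[str, Set[Signature]]],
) -> Tuple[Optional[str], int]:
    """Return (first-seen timestamp, occurrence count) within the window.

    prior_builds holds one (finish_time, signatures_seen) pair per build,
    e.g. the ~10 most recent completed builds of the same definition.
    """
    hits = sorted(finish for finish, sigs in prior_builds if sig in sigs)
    first_seen = hits[0] if hits else None
    return first_seen, len(hits)
```

A count of 2 or more from `window_stats` corresponds to the "stable" condition that Branch A requires.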
179
178
180
179
#### Data sources
181
180
@@ -185,34 +184,11 @@ For each Step 5 candidate, compute the signature tuple `(definition_id, work_ite
185
184
- **Helix REST.** `https://helix.dot.net/api/jobs/{jobId}/workitems?api-version=2019-06-17`. Each item has `Name`, `State`, `ExitCode`, `ConsoleOutputUri`. Failed: `ExitCode != 0` or `State == "Failed"`.
186
185
- **Build Analysis attachment (best-effort).** `https://dev.azure.com/dnceng-public/public/_apis/build/builds/{id}/attachments/Build_Analysis_KnownIssues_v1?api-version=7.1`. Use to dedupe. 404 = none attached; do not fail.
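The Helix failure rule above (`ExitCode != 0` or `State == "Failed"`) could be applied to an already-fetched work-item list roughly as follows. This is a sketch with no network access. The helper names are hypothetical, and treating a missing `ExitCode` as "not failed by exit code" is an assumption the source does not state.

```python
from typing import Any, Dict, List

# URL shape copied from the data-source list; job_id is filled in by the caller.
HELIX_WORKITEMS_URL = (
    "https://helix.dot.net/api/jobs/{job_id}/workitems?api-version=2019-06-17"
)


def workitems_url(job_id: str) -> str:
    """Build the work-items URL for one Helix job ID."""
    return HELIX_WORKITEMS_URL.format(job_id=job_id)


def failed_work_items(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep items whose ExitCode is non-zero or whose State is "Failed"."""
    failed = []
    for item in items:
        exit_code = item.get("ExitCode")
        # Assumption: an absent exit code does not count as a failure by itself.
        bad_exit = exit_code is not None and exit_code != 0
        if bad_exit or item.get("State") == "Failed":
            failed.append(item)
    return failed
```

Each surviving item's `ConsoleOutputUri` is then the log to fetch when looking for the `[FAIL]` line.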
187
186
188
-
### Step 4 — Outage circuit breaker
187
+
### Step 4 — Per-signature walk
189
188
190
-
After Step 3 has produced the full signature set, BEFORE emitting any safe-output, compute:
189
+
For each `(definition_id, phase, queue, stress_mode, signature)` produced by Step 3:
1. Do NOT emit per-failure KBEs, muting PRs, or fix PRs.
205
-
2. Emit exactly ONE `create_issue` using the Outage summary template (see Templates).
206
-
3. Persist `/tmp/gh-aw/agent/coverage/_breaker.txt` with the three totals + which threshold(s) breached.
207
-
4. Skip Steps 5–6; jump to Step 7.
208
-
209
-
If clean -> continue to Step 5.
210
-
211
-
### Step 5 — Per-signature walk
212
-
213
-
For each `(definition_id, phase, queue, stress_mode, signature)` surviving Step 4:
214
-
215
-
#### Step 5.1 — Load the matching skill
191
+
#### Step 4.1 — Load the matching skill
216
192
217
193
| Pipeline category | Skill |
218
194
|---|---|
@@ -222,27 +198,27 @@ For each `(definition_id, phase, queue, stress_mode, signature)` surviving Step
222
198
| NativeAOT outer loop | Check `eng/testing/tests.*aot*.targets` and the test `.csproj` for AOT-specific conditions before suggesting a fix. |
223
199
| Generic | `ci-pipeline-monitor/SKILL.md` |
224
200
225
-
#### Step 5.2 — Search for an existing KBE
201
+
#### Step 4.2 — Search for an existing KBE
226
202
227
203
`is:issue is:open label:"Known Build Error" in:body "<error-signature>"`. Try variations: full `[FAIL]` line; assertion text; exception class + test name. On hit, record `existing-kbe #<n>` and continue (the walk does not end — a KBE hit changes the final action, not the inspection).
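One way to generate the query variations is to feed each candidate phrase (full `[FAIL]` line, assertion text, exception class plus test name) through a small builder. The `kbe_queries` helper below is hypothetical. Dropping embedded double quotes and collapsing whitespace is an assumption made to keep the phrase inside one quoted term, not documented search behavior.

```python
from typing import Iterable, List


def kbe_queries(variations: Iterable[str]) -> List[str]:
    """Build one KBE search query per distinct, non-empty phrase."""
    queries: List[str] = []
    seen = set()
    for text in variations:
        # Assumption: strip double quotes and collapse runs of whitespace.
        phrase = " ".join(text.replace('"', " ").split())
        if not phrase or phrase in seen:
            continue
        seen.add(phrase)
        queries.append(
            f'is:issue is:open label:"Known Build Error" in:body "{phrase}"'
        )
    return queries
```

The queries are tried in the order given, so the most specific phrase should come first.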
228
204
229
-
#### Step 5.3 — Search for an area-team tracker (no KBE label)
205
+
#### Step 4.3 — Search for an area-team tracker (no KBE label)
230
206
231
207
`is:issue is:open in:title "<test-name>"` AND `in:body "<test-file-path>"`. On hit, record `linked-tracker #<n>`. A plain tracker is NOT a KBE substitute (Build Analysis only matches `Known Build Error`-labeled issues with a valid JSON body). File a fresh KBE and cross-link the tracker as `Tracking: dotnet/runtime#<tracker>` inside the KBE body and the muting PR body.
232
208
233
-
#### Step 5.4 — Search for an existing muting PR
209
+
#### Step 4.4 — Search for an existing muting PR
234
210
235
211
`is:pr is:open in:title "<test-name>" "[ci-scan]"` and `is:pr is:open "<test-name>" ActiveIssue`. On hit, record `existing-PR #<n>` (muting) and stop the walk for this signature.
236
212
237
-
#### Step 5.5 — Search for an in-flight fix PR by anyone
213
+
#### Step 4.5 — Search for an in-flight fix PR by anyone
238
214
239
215
Broad search (NOT only `[ci-scan]` PRs): `is:pr is:open "<test-name>"`, `is:pr is:open "<test-file-path>"`, `is:pr is:open "<assembly>" in:title`. Fetch each candidate body; if it claims to fix this failure or links the same KBE, record `existing-PR #<n>` (in-flight fix) and stop.
240
216
241
-
#### Step 5.6 — Verify every embedded issue number exists
217
+
#### Step 4.6 — Verify every embedded issue number exists
242
218
243
219
For every `<n>` you plan to write into source (`[ActiveIssue("...issues/<n>")]`, `Linked KBE: #<n>`, inline `<!-- ...issues/<n> -->`) call `issue_read` with `get` and `{owner: "dotnet", repo: "runtime", issue_number: <n>}`. Confirm it returns an open issue. If it does not -> stop. A dead-link annotation in source requires a follow-up PR to remove.
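Before calling `issue_read`, the numbers to verify have to be collected from the planned source text. A minimal sketch, assuming the three annotation shapes named above are the only ones in play; the helper name and regular expression are illustrative.

```python
import re
from typing import List

# Matches ".../issues/<n>" (ActiveIssue attributes and inline comments)
# and "Linked KBE: #<n>" lines.
ISSUE_REF = re.compile(r"issues/(\d+)|Linked KBE: #(\d+)")


def embedded_issue_numbers(text: str) -> List[int]:
    """Return the distinct issue numbers referenced in text, sorted."""
    numbers = set()
    for match in ISSUE_REF.finditer(text):
        numbers.add(int(match.group(1) or match.group(2)))
    return sorted(numbers)
```

Every number returned still needs its own `issue_read` check; the sketch only builds the list.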
244
220
245
-
#### Step 5.7 — Confirm muting is welcome on the candidate issue
221
+
#### Step 4.7 — Confirm muting is welcome on the candidate issue
246
222
247
223
Read the candidate KBE / tracker body + its most recent area-owner comment. Skip muting (record `-> skipped: do-not-mute on issue #<n>`) if ANY of:
248
224
@@ -252,7 +228,7 @@ Read the candidate KBE / tracker body + its most recent area-owner comment. Skip
252
228
253
229
When in doubt -> skip muting and let the next run revisit.
Before writing `Linked KBE: #<n>` or `[ActiveIssue("...issues/<n>")]`, answer:
258
234
@@ -265,19 +241,19 @@ If any answer is no -> file a fresh KBE this run instead. Embed the four answers
265
241
266
242
Optional fifth check when the candidate KBE is older than ~14 days: confirm Build Analysis is still matching it. `gh api graphql` over `userContentEdits` gives the edit timeline; a stale never-edited body hints the signature went bad.
267
243
268
-
### Step 6 — Decide and emit
244
+
### Step 5 — Decide and emit
269
245
270
246
Exactly one of these branches fires per signature.
271
247
272
248
**Branch A — No existing KBE; test failure; signature is stable (>= 2 occurrences in window).**
273
249
274
250
Emit one `create_issue` with `temporary_id: "aw_kbe<N>"` (fresh `<N>` per KBE) plus one matching `update_project`. Same-run, same agent output batch. See *Same-run KBE + project linkage payload* in Templates.
275
251
276
-
If Step 5.3 found a tracker, cross-link as `Tracking: dotnet/runtime#<tracker>` in the KBE body. Muting PR is deferred to the next run.
252
+
If Step 4.3 found a tracker, cross-link as `Tracking: dotnet/runtime#<tracker>` in the KBE body. Muting PR is deferred to the next run.
277
253
278
-
**Branch B — Existing KBE; no muting PR; muting is welcome (Step 5.7 clean).**
254
+
**Branch B — Existing KBE; no muting PR; muting is welcome (Step 4.7 clean).**
279
255
280
-
Emit one `create_pull_request` using the Muting PR template. Diff <= 5 lines; only test annotations or csproj flags. Body MUST include `Linked KBE: #<n>` as a top-level line plus the Step 5.8 four-question block.
256
+
Emit one `create_pull_request` using the Muting PR template. Diff <= 5 lines; only test annotations or csproj flags. Body MUST include `Linked KBE: #<n>` as a top-level line plus the Step 4.8 four-question block.
281
257
282
258
If the existing KBE is not yet on the project board (check via project queries first), also emit one `update_project` referencing the real issue number.
283
259
@@ -299,9 +275,9 @@ Group all infra failures in this run into ONE tracking issue. Before emitting, `
299
275
300
276
Emit one `create_issue` using the Tracking issue template (or the JIT pipeline template for JIT/GC/PGO/stress pipelines). Call out the signature problem in `Recommended action`.
301
277
302
-
After emitting, record the outcome per signature (Step 7).
278
+
After emitting, record the outcome per signature (Step 6).
Per signature, append one outcome line to `/tmp/gh-aw/agent/coverage/<pipeline>.txt`:
307
283
@@ -311,16 +287,14 @@ Per signature, append one outcome line to `/tmp/gh-aw/agent/coverage/<pipeline>.
311
287
312
288
`<outcome>` is one of: `filed-issue #aw_<id>`, `filed-PR #aw_<id>`, `existing-issue #<n>`, `existing-PR #<n>`, `linked-to-project #<n>`, `skipped: <reason>`.
313
289
314
-
A skipped signature MUST have a reason (e.g., `build canceled`, `< 2 occurrences and not blocking`, `do-not-mute on issue #<n>`, `cap reached`, `breaker-tripped`).
290
+
A skipped signature MUST have a reason (e.g., `build canceled`, `< 2 occurrences and not blocking`, `do-not-mute on issue #<n>`, `cap reached`).
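The allowed `<outcome>` values and the "skipped MUST have a reason" rule can be checked mechanically before a line is appended. The validator below is a sketch: the `\w+` shape for `<id>` is an assumption, and it deliberately rejects any trailing annotation after the issue or PR number.

```python
import re

# One pattern per outcome form listed above.
OUTCOME_PATTERNS = [
    re.compile(r"filed-issue #aw_\w+"),
    re.compile(r"filed-PR #aw_\w+"),
    re.compile(r"existing-issue #\d+"),
    re.compile(r"existing-PR #\d+"),
    re.compile(r"linked-to-project #\d+"),
    re.compile(r"skipped: \S.*"),   # a reason is mandatory
]


def is_valid_outcome(outcome: str) -> bool:
    """True if outcome matches exactly one of the allowed forms."""
    return any(p.fullmatch(outcome) for p in OUTCOME_PATTERNS)
```

For example, `is_valid_outcome("skipped: cap reached")` is accepted, while a bare `skipped:` with no reason is rejected.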
If the breaker tripped in Step 4, the table contains one row: `_breaker | <total-actionable-failures> | 1 | 0 | 0 | 0 | <breached-thresholds>`.
323
-
324
298
## Templates
325
299
326
300
Emit each template verbatim except for `<placeholder>` slots. Match headings exactly — Build Analysis is strict about `## Error Message` and the JSON fence shape.
@@ -521,7 +495,7 @@ Branch handling: branch from `origin/main`. Stage only files you intend to chang
@@ -629,41 +603,6 @@ Used for tracking issues against JIT/GC/PGO/stress pipelines (definitions 109–
629
603
630
604
Do NOT propose any `area-*` label yourself. Area triage (`area-CodeGen-coreclr` / `area-GC-coreclr` / `area-PGO-coreclr` / `area-Tools-ILVerification`) is added later by a human reviewer.
631
605
632
-
### Template: Outage summary issue body
633
-
634
-
Emitted only when Step 4 trips the breaker. Title: `[ci-scan] Outage suspected: <short shape description>`. NO `Known Build Error` label.
635
-
636
-
````markdown
637
-
## Reasoning
638
-
The CI scanner's outage circuit breaker tripped this run. Per-failure KBEs/PRs were NOT emitted; this issue is the single consolidated record so the failure set is not lost.
<top signatures by occurrence, with one representative log excerpt per signature>
653
-
```
654
-
655
-
## First build it occurred
656
-
- Earliest build in the scanned window with any of these signatures: <link>
657
-
- Window size: <n> builds
658
-
659
-
## Recommended action
660
-
- Human review needed before per-failure processing resumes.
661
-
- The next scheduled run re-evaluates from a fresh build sample. If the totals fall below the thresholds, per-failure KBEs/PRs resume automatically.
662
-
- A human can short-circuit by filing the root-cause KBE manually; once Build Analysis matches the dominant signature, per-leg counts fall and subsequent runs converge.
663
-
- Affected pipelines: <list>
664
-
- Top signatures: <list>
665
-
````
666
-
667
606
### Template: Sanitization
668
607
669
608
When pasting log excerpts into issue/PR bodies, strip:
@@ -700,4 +639,4 @@ These look like permission errors but are physical.
700
639
- Don't comment on existing KBEs (Build Analysis tracks occurrence counts in the issue body).
701
640
- Don't emit `noop`. Either a PR or an issue must come out of every actionable failure.
702
641
- One signature = one outcome line in `/tmp/gh-aw/agent/coverage/<pipeline>.txt`.
703
-
- The final agent log MUST include the Step 7 summary table.
642
+
- The final agent log MUST include the Step 6 summary table.