[WIP]feat(ci): add testgrid-publish workflow (TG5)#1477
Conversation
178e97c to
01fee9b
Compare
Coverage Report ✅
Coverage BadgeNo Go source files changed in this PR. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a new GitHub Actions workflow for TestGrid publishing. It runs after completed UAT GCP or AWS workflows and supports manual dispatch. The workflow validates the producer repository, resolves environment-scoped GCP settings, downloads and parses evidence pointer artifacts when present, conditionally authenticates to GCP and GHCR, builds Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 9
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/testgrid-publish.yml:
- Around line 56-58: The TestGrid publishing workflow currently maps the prod
environment to suffixed resource names, which conflicts with the intended
unsuffixed prod resources. Update the environment-to-resource expressions in the
workflow so the prod branch uses the plain TestGrid bucket, GitHub secret, and
publish secret names, while only non-prod keeps the suffixed staging-style
values; check the environment mapping near the environment input and the related
expressions referenced in the later section of the same workflow.
- Around line 68-71: The TestGrid publish trigger is too broad because it runs
on failed workflow_run conclusions even though the producer jobs only upload
pointer.yaml on success, so the download will fail. Update the workflow
condition in the publish job to match the artifact contract, or change the UAT
producer steps that emit the pointer artifact to always publish a failure
evidence pointer; use the existing workflow gate and the
conformance/pointer.yaml upload path to locate the affected logic.
- Around line 167-170: The Publish to TestGrid step in the workflow is
interpolating steps.bundle.outputs.bundle_ref directly into a shell command,
which can allow shell injection. Update the Publish to TestGrid run block to
pass the bundle reference through env first, then reference the shell variable
inside the command instead; use the existing Publish to TestGrid job step and
bundle output wiring to locate the fix.
- Around line 53-55: The publish workflow is ignoring the `source_class`
dispatch input and always using `uat`, which misclassifies manual backfills.
Update the testgrid publish command in the workflow to read from the
`source_class` input instead of hardcoding `uat`, and make sure the same input
is honored anywhere else the publish step is assembled so the value passed to
the publish action matches the manual dispatch request.
- Around line 134-142: The bundle reference assembly in the workflow step that
reads `OCI` and `DIGEST` from `POINTER` only validates `OCI`, so a missing or
null digest can still produce an invalid `BUNDLE_REF`. Add a digest check
alongside the existing OCI validation before constructing `bundle_ref`, and fail
early with a clear error message if `.attestations[0].bundle.digest` is missing
or null. Keep the fix in the same shell block that sets `OCI`, `DIGEST`, and
`BUNDLE_REF` so the pointer contract is enforced before publishing.
- Around line 42-44: The workflow_run trigger is referencing the wrong producer
workflow names, so the publish job will never fire. Update the workflows list in
the testgrid publish workflow to match the exact producer names used elsewhere,
specifically the UAT GCP and UAT AWS workflow identifiers in the workflow_run
configuration.
- Around line 68-75: The workflow_run job currently grants id-token: write and
publishes artifacts without verifying the completed producer run came from the
expected repository/default branch. Add a trusted-ref guard to the job’s if
condition in testgrid-publish so it only runs for workflow_run events from the
intended repo and default branch, while still allowing workflow_dispatch; use
the existing workflow_run and permissions block to locate the fix.
- Around line 108-143: The manual backfill path is still tied to
`github.event.workflow_run.id`, so `workflow_dispatch` cannot download the
artifact and the bundle extraction ignores the required `inputs.bundle_ref`.
Update the `Download evidence pointer` / `Extract bundle ref` logic in
`testgrid-publish.yml` to branch on the trigger: use the artifact-based
`pointer.yaml` flow only for workflow-run driven executions, and for manual
dispatch set `bundle_ref` directly from `inputs.bundle_ref` so the later steps
consume the provided value instead of trying to derive it.
- Around line 73-75: The workflow job is missing the `actions: read` permission
needed by `actions/download-artifact` when using `run-id` and `github-token`.
Update the permissions block in the testgrid publish job to include `actions:
read` alongside the existing `contents: read` and `id-token: write` entries so
the artifact download can access the triggering run.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: a01c8ebc-fd54-489b-944b-da7f2a68750b
📒 Files selected for processing (1)
.github/workflows/testgrid-publish.yml
There was a problem hiding this comment.
♻️ Duplicate comments (1)
.github/workflows/testgrid-publish.yml (1)
131-149: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick winMove
bundle_refinput out of the shell template.Line 133 still injects the manual
bundle_refinput directly intorun:. A value containing shell metacharacters can break out before WIF-backed publishing. Pass it viaenvand validate the final ref before writingGITHUB_OUTPUT.🛡️ Proposed fix
- name: Resolve bundle ref id: bundle + env: + EVENT_NAME: ${{ github.event_name }} + INPUT_BUNDLE_REF: ${{ inputs.bundle_ref }} run: | - if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then - BUNDLE_REF="${{ inputs.bundle_ref }}" + if [ "${EVENT_NAME}" = "workflow_dispatch" ]; then + BUNDLE_REF="${INPUT_BUNDLE_REF}" else POINTER="$(find evidence-pointer/ -name 'pointer.yaml' | head -1)" if [ -z "${POINTER}" ]; then echo "::error::No pointer.yaml found in evidence pointer artifact" @@ fi BUNDLE_REF="${OCI}@${DIGEST}" fi - echo "bundle_ref=${BUNDLE_REF}" >> "${GITHUB_OUTPUT}" + if [[ ! "${BUNDLE_REF}" =~ ^ghcr\.io/nvidia/[^[:space:]]+@sha256:[0-9a-fA-F]{64}$ ]]; then + echo "::error::bundle_ref must be ghcr.io/nvidia/...@sha256:<64 hex chars>" + exit 1 + fi + printf 'bundle_ref=%s\n' "${BUNDLE_REF}" >> "${GITHUB_OUTPUT}" echo "Resolved bundle: ${BUNDLE_REF}"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/testgrid-publish.yml around lines 131 - 149, The workflow step still interpolates the manual bundle_ref input directly inside the shell script, which can allow shell metacharacters to affect execution. Update the publishing step to pass bundle_ref through env instead of embedding it in run, then use that variable in the workflow_dispatch branch and validate the final BUNDLE_REF before echoing it to GITHUB_OUTPUT. Keep the fix localized to the workflow step that sets BUNDLE_REF and writes bundle_ref.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In @.github/workflows/testgrid-publish.yml:
- Around line 131-149: The workflow step still interpolates the manual
bundle_ref input directly inside the shell script, which can allow shell
metacharacters to affect execution. Update the publishing step to pass
bundle_ref through env instead of embedding it in run, then use that variable in
the workflow_dispatch branch and validate the final BUNDLE_REF before echoing it
to GITHUB_OUTPUT. Keep the fix localized to the workflow step that sets
BUNDLE_REF and writes bundle_ref.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: a872862d-dd25-4b4e-a7fd-d1982f392cc9
📒 Files selected for processing (1)
.github/workflows/testgrid-publish.yml
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/testgrid-publish.yml (1)
171-194: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick winValidate
BUNDLE_REFbefore writing it toGITHUB_OUTPUT.Manual input can be empty/malformed, and any newline in
BUNDLE_REFcan inject additional outputs. Enforce the digest-pinned ref shape before Line 194.🛡️ Proposed fix
else POINTER="$(find evidence-pointer/ -name 'pointer.yaml' 2>/dev/null | head -1)" @@ fi BUNDLE_REF="${OCI}@${DIGEST}" fi + if ! printf '%s' "${BUNDLE_REF}" | grep -Eq '^ghcr\.io/nvidia/[^[:space:]]+@sha256:[0-9a-fA-F]{64}$'; then + echo "::error::Invalid bundle ref '${BUNDLE_REF}'" + exit 1 + fi echo "bundle_ref=${BUNDLE_REF}" >> "${GITHUB_OUTPUT}"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/testgrid-publish.yml around lines 171 - 194, The TestGrid publish step writes BUNDLE_REF to GITHUB_OUTPUT without validating the final value, so malformed manual input or embedded newlines could inject extra outputs. In the workflow logic around the BUNDLE_REF assignment in the shell block, validate that BUNDLE_REF is non-empty, matches a digest-pinned OCI ref shape, and contains no newline characters before echoing it to GITHUB_OUTPUT. Apply this check for both the workflow_dispatch path and the pointer-derived path, and fail fast with an error if validation does not pass.
♻️ Duplicate comments (1)
.github/workflows/testgrid-publish.yml (1)
95-107: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick winReject untrusted producer heads before publishing.
This guard still publishes non-
mainruns and checksworkflow_run.repository.full_nameinstead of the producer head repository. Usegithub.event.workflow_run.head_repository.full_nameand fail on non-mainso fork/branch runs cannot reach WIF-backed publishing.🛡️ Proposed fix
- PRODUCER_REPO: ${{ github.event.workflow_run.repository.full_name }} + PRODUCER_REPO: ${{ github.event.workflow_run.head_repository.full_name }} PRODUCER_BRANCH: ${{ github.event.workflow_run.head_branch }} @@ if [ "${PRODUCER_BRANCH}" != "main" ]; then - echo "::warning::Producer ran on branch '${PRODUCER_BRANCH}' (not main) — publishing anyway" + echo "::error::Producer ran on branch '${PRODUCER_BRANCH}' (not main) — rejecting" + exit 1 fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/testgrid-publish.yml around lines 95 - 107, In the Validate producer step, the workflow is still allowing non-main producer runs to publish and is checking the wrong repository field. Update the guard in the workflow_run path to use github.event.workflow_run.head_repository.full_name instead of github.event.workflow_run.repository.full_name, and make the branch check on github.event.workflow_run.head_branch fail the job for anything other than main so untrusted fork/branch heads cannot reach the publishing flow.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/testgrid-publish.yml:
- Around line 220-234: The release lookup in the testgrid-publish workflow
currently falls back to the latest GitHub release, which can reuse an older
binary for untagged commits. Update the logic around the RELEASE check in the
testgrid-publish step so it only downloads when HEAD is an exact tag (for
example by verifying the current ref/tag before calling gh release view);
otherwise, always take the existing source-build path for
./tools/testgrid-publish.
In @.goreleaser.yaml:
- Around line 201-207: The testgrid-publish release artifact is being shipped as
an unverified tarball, so update the goreleaser config for the testgrid-publish
build artifact and the .github/workflows/testgrid-publish.yml consumer to add an
integrity verification step before extraction. Either publish and verify signed
checksum/attestation metadata for the testgrid-publish artifact using the
existing release artifact pattern as a guide, or remove this binary download
path and build from source in the workflow instead.
---
Outside diff comments:
In @.github/workflows/testgrid-publish.yml:
- Around line 171-194: The TestGrid publish step writes BUNDLE_REF to
GITHUB_OUTPUT without validating the final value, so malformed manual input or
embedded newlines could inject extra outputs. In the workflow logic around the
BUNDLE_REF assignment in the shell block, validate that BUNDLE_REF is non-empty,
matches a digest-pinned OCI ref shape, and contains no newline characters before
echoing it to GITHUB_OUTPUT. Apply this check for both the workflow_dispatch
path and the pointer-derived path, and fail fast with an error if validation
does not pass.
---
Duplicate comments:
In @.github/workflows/testgrid-publish.yml:
- Around line 95-107: In the Validate producer step, the workflow is still
allowing non-main producer runs to publish and is checking the wrong repository
field. Update the guard in the workflow_run path to use
github.event.workflow_run.head_repository.full_name instead of
github.event.workflow_run.repository.full_name, and make the branch check on
github.event.workflow_run.head_branch fail the job for anything other than main
so untrusted fork/branch heads cannot reach the publishing flow.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 6f188cb3-9041-43c5-9752-7fe9041bf5e0
📒 Files selected for processing (2)
.github/workflows/testgrid-publish.yml.goreleaser.yaml
03dac49 to
b5dd3dd
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (3)
.goreleaser.yaml (1)
201-207: 🔒 Security & Privacy | 🟠 Major | 🏗️ Heavy liftRelease artifact ships without integrity/attestation material.
This
testgrid-publisharchive is downloaded and executed by.github/workflows/testgrid-publish.ymlunder WIF-issued GCP credentials, yet—unlike theaicrarchive (lines 214-222)—it carries no checksum/attestation bundle. Either publish and verify signed integrity metadata for this artifact, or keep the workflow source-built only.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.goreleaser.yaml around lines 201 - 207, The testgrid-publish release artifact is being shipped without any integrity or attestation material, unlike the aicr artifact setup. Update the goreleaser configuration for the testgrid-publish entry so it publishes signed checksum/attestation metadata, and make sure the .github/workflows/testgrid-publish.yml flow verifies that metadata before execution. If signed verification cannot be added, adjust the workflow to build from source instead of downloading and running the release archive..github/workflows/testgrid-publish.yml (2)
164-178: 🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy liftDon’t silently skip failed UAT runs that should produce columns.
The workflow advertises failure columns, but missing pointer artifacts on failed producer runs become a successful skip. Either make producers always upload a failure evidence pointer, or restrict this workflow to conclusions with a guaranteed pointer contract.
Also applies to: 207-214
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/testgrid-publish.yml around lines 164 - 178, The “Download evidence pointer” step in the testgrid publish workflow is treating missing pointer artifacts from failed UAT runs as a benign skip, which can hide failure columns. Update the workflow logic around the download-pointer step and the downstream bundle resolution so it only runs when the producer guarantees a pointer contract, or change the producer side to always upload a failure evidence pointer for failed runs; use the existing workflow_run gating and the download-pointer step’s outcome handling to make the failure path explicit instead of silently succeeding.
87-90: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick winReject non-main producer runs before WIF-capable publishing.
Line 121 only warns for non-
mainproducers, but the job still proceeds to download attacker-influenced artifacts and later obtains GCP credentials. Tighten the job condition and make the producer validation fail closed;workflow_runis specifically called out by GitHub as risky when it can grant write/secrets access from triggering-workflow data. (docs.github.com)🛡️ Proposed fix
if: > (github.event_name == 'workflow_dispatch' && github.ref == 'refs/heads/main') || - github.event.workflow_run.conclusion == 'success' || - github.event.workflow_run.conclusion == 'failure' + ( + github.event_name == 'workflow_run' && + github.event.workflow_run.head_repository.full_name == github.repository && + github.event.workflow_run.head_branch == 'main' && + ( + github.event.workflow_run.conclusion == 'success' || + github.event.workflow_run.conclusion == 'failure' + ) + ) @@ - PRODUCER_REPO: ${{ github.event.workflow_run.repository.full_name }} + PRODUCER_REPO: ${{ github.event.workflow_run.head_repository.full_name }} PRODUCER_BRANCH: ${{ github.event.workflow_run.head_branch }} @@ if [ "${PRODUCER_BRANCH}" != "main" ]; then - echo "::warning::Producer ran on branch '${PRODUCER_BRANCH}' (not main) — publishing anyway" + echo "::error::Producer ran on branch '${PRODUCER_BRANCH}' (not main) — rejecting" + exit 1 fiAlso applies to: 113-123
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/testgrid-publish.yml around lines 87 - 90, The testgrid publish workflow currently allows workflow_run executions from non-main producer branches to continue, which can lead to downloading untrusted artifacts before GCP/WIF access is granted. Tighten the job-level `if` in testgrid-publish.yml so `workflow_run` only proceeds when the producing workflow is on `main`, and make the producer validation step in the publish job fail closed instead of only warning. Use the existing `workflow_run` gating and the validation logic around the producer branch check to locate the change.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/testgrid-publish.yml:
- Around line 64-69: The workflow dispatch inputs for source_class and
environment are currently free-form, which can lead to invalid publish target
derivation. Constrain these inputs in the workflow definition used by
testgrid-publish so only the supported values are accepted, and update the
target derivation logic to assume validated values rather than handling
arbitrary strings. Make the same fix in the publish-target mapping section
referenced by the testgrid-publish workflow so the WIF provider, bucket, and
backfill classification are selected only from known source_class/environment
combinations.
- Around line 199-230: The bundle reference is written to GITHUB_OUTPUT without
enforcing the expected ghcr.io/nvidia/...@sha256:<64-hex> contract, so validate
BUNDLE_REF once before the final echo in this workflow step. Add a single
validation point covering both DISPATCH_BUNDLE_REF and the OCI/DIGEST values
derived from pointer.yaml, and reject any ref containing newlines or not
matching the required registry/digest format. Use the existing bundle_ref
assembly block and the pointer.yaml extraction logic to keep the fix localized.
---
Duplicate comments:
In @.github/workflows/testgrid-publish.yml:
- Around line 164-178: The “Download evidence pointer” step in the testgrid
publish workflow is treating missing pointer artifacts from failed UAT runs as a
benign skip, which can hide failure columns. Update the workflow logic around
the download-pointer step and the downstream bundle resolution so it only runs
when the producer guarantees a pointer contract, or change the producer side to
always upload a failure evidence pointer for failed runs; use the existing
workflow_run gating and the download-pointer step’s outcome handling to make the
failure path explicit instead of silently succeeding.
- Around line 87-90: The testgrid publish workflow currently allows workflow_run
executions from non-main producer branches to continue, which can lead to
downloading untrusted artifacts before GCP/WIF access is granted. Tighten the
job-level `if` in testgrid-publish.yml so `workflow_run` only proceeds when the
producing workflow is on `main`, and make the producer validation step in the
publish job fail closed instead of only warning. Use the existing `workflow_run`
gating and the validation logic around the producer branch check to locate the
change.
In @.goreleaser.yaml:
- Around line 201-207: The testgrid-publish release artifact is being shipped
without any integrity or attestation material, unlike the aicr artifact setup.
Update the goreleaser configuration for the testgrid-publish entry so it
publishes signed checksum/attestation metadata, and make sure the
.github/workflows/testgrid-publish.yml flow verifies that metadata before
execution. If signed verification cannot be added, adjust the workflow to build
from source instead of downloading and running the release archive.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: cb309592-598b-4012-9a40-c98c2e2ffeb6
📒 Files selected for processing (2)
.github/workflows/testgrid-publish.yml.goreleaser.yaml
Wires tools/testgrid-publish into CI so every completed UAT run
(GCP or AWS) automatically publishes to the AICR TestGrid dashboard.
Trigger: workflow_run on "UAT - GCP" / "UAT - AWS" (success or failure).
Also supports workflow_dispatch (main branch only) for backfills.
Pipeline:
1. Validate producer run is from this repository (pre-credential guard)
2. Download evidence pointer artifact; distinguish network errors from
benign "conformance did not complete" skips
3. Extract OCI bundle ref + sha256 digest from pointer.yaml; validate
both fields and enforce ghcr.io/nvidia/...@sha256:<64hex> format
before writing to GITHUB_OUTPUT (guards against newline injection)
4. Authenticate to GCP via WIF (publish SA, objectCreator on groups/ only)
5. Login to GHCR (packages: read) so testgrid-publish can pull the bundle
6. Build testgrid-publish from source (vendor — no integrity risk from
downloading a release artifact into a privileged WIF context)
7. Publish: pull OCI bundle → parse predicate → write GCS in order
Security hardening:
- Top-level permissions: contents: read (least privilege default)
- Job-level: contents:read + actions:read + packages:read + id-token:write
- No project IDs/numbers in source — all from vars.GCP_PROJECT_NUMBER
and vars.GCP_PROJECT_ID (repo variables, not secrets)
- All actions pinned to SHA
- All user-controlled values through env: (no inline ${{ }} in run:)
- set -euo pipefail in every shell script
- workflow_dispatch restricted to refs/heads/main
- workflow_dispatch inputs use type: choice / type: string to prevent
typos silently deriving non-existent WIF pools
- BUNDLE_REF validated against ghcr.io/nvidia/...@sha256:<64hex> regex
- concurrency block (one publish per triggering run, no cancel)
- timeout-minutes: 15
- Release branches intentionally allowed (qualify release → publish)
Environment-aware: defaults to prod (gs://aicr-testgrid). Override to
staging via workflow_dispatch(environment=staging).
Pending before this can merge:
- TG2 (#1447) must be merged first (tools/testgrid-publish must exist)
- Apply prod Terraform in aicr-testgrid (creates aicr-testgrid-prod-github
WIF pool and aicr-testgrid-prod-publish SA)
- Set vars.GCP_PROJECT_NUMBER and vars.GCP_PROJECT_ID in repo settings
b5dd3dd to
37366ea
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
.github/workflows/testgrid-publish.yml (2)
131-135: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick winReject unexpected same-repo branches, not just forks.
Line 133 currently allows every non-
mainbranch and only logs a notice, so any same-repo feature branch running a matching UAT workflow can drive a privileged prod TestGrid publish. If release branches are intended, allowlist them explicitly and fail everything else before WIF auth.🛡️ Proposed fix
- # Release branches (e.g. release/v1.x) intentionally publish — - # their UAT runs qualify the release and belong in TestGrid. - if [ "${PRODUCER_BRANCH}" != "main" ]; then - echo "::notice::Producer ran on branch '${PRODUCER_BRANCH}' — publishing (release branches are intentional)" - fi + case "${PRODUCER_BRANCH}" in + main|release/*) + echo "::notice::Producer ran on branch '${PRODUCER_BRANCH}' — publishing" + ;; + *) + echo "::error::Producer ran on unexpected branch '${PRODUCER_BRANCH}' — rejecting" + exit 1 + ;; + esac🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/testgrid-publish.yml around lines 131 - 135, The branch gate in the publish workflow is too broad because it treats every non-main PRODUCER_BRANCH as publishable; tighten the check so only explicitly allowlisted release branches are accepted and all other same-repo branches fail before WIF auth. Update the branch validation logic in the testgrid publish workflow around the PRODUCER_BRANCH notice to reject unexpected branches, keeping the intent for release branches clear while preventing privileged publishes from feature branches.Source: Linters/SAST tools
242-248: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick winValidate the whole bundle ref before writing
GITHUB_OUTPUT.
grep -qEis line-based, sovalid-ref\nextra=valuecan pass because one line matches, then Line 248 writes injected output lines. Use Bash regex on the variable and a stricter path character class.🛡️ Proposed fix
- if ! echo "${BUNDLE_REF}" | grep -qE '^ghcr\.io/nvidia/[^@]+@sha256:[0-9a-f]{64}$'; then + if [[ ! "${BUNDLE_REF}" =~ ^ghcr\.io/nvidia/[A-Za-z0-9._/-]+@sha256:[0-9a-f]{64}$ ]]; then echo "::error::BUNDLE_REF '${BUNDLE_REF}' does not match expected format ghcr.io/nvidia/...@sha256:<64hex>" exit 1 fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/testgrid-publish.yml around lines 242 - 248, The bundle ref check in the testgrid publish workflow is only validating a single line with grep, so a multiline value can still inject extra GITHUB_OUTPUT entries. Update the validation around BUNDLE_REF to use a Bash regex match on the full variable before the echo to GITHUB_OUTPUT, and tighten the allowed ghcr.io/nvidia path character class so only a single safe bundle ref can pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/testgrid-publish.yml:
- Around line 217-226: The evidence-pointer lookup in the TestGrid publish
workflow is too brittle under set -euo pipefail because the find-based POINTER
assignment can terminate the script before the missing-artifact skip logic runs.
Update the pointer discovery in the workflow step to tolerate a absent
evidence-pointer/ directory while preserving the existing branch that checks
POINTER, DOWNLOAD_OUTCOME, and UAT_CONCLUSION so the skip=true path remains
reachable.
---
Duplicate comments:
In @.github/workflows/testgrid-publish.yml:
- Around line 131-135: The branch gate in the publish workflow is too broad
because it treats every non-main PRODUCER_BRANCH as publishable; tighten the
check so only explicitly allowlisted release branches are accepted and all other
same-repo branches fail before WIF auth. Update the branch validation logic in
the testgrid publish workflow around the PRODUCER_BRANCH notice to reject
unexpected branches, keeping the intent for release branches clear while
preventing privileged publishes from feature branches.
- Around line 242-248: The bundle ref check in the testgrid publish workflow is
only validating a single line with grep, so a multiline value can still inject
extra GITHUB_OUTPUT entries. Update the validation around BUNDLE_REF to use a
Bash regex match on the full variable before the echo to GITHUB_OUTPUT, and
tighten the allowed ghcr.io/nvidia path character class so only a single safe
bundle ref can pass.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 175178f3-cd87-4a45-ac0f-d4db1f89650e
📒 Files selected for processing (1)
.github/workflows/testgrid-publish.yml
| POINTER="$(find evidence-pointer/ -name 'pointer.yaml' 2>/dev/null | head -1)" | ||
| DOWNLOAD_OUTCOME="${DOWNLOAD_POINTER_OUTCOME}" | ||
| if [ -z "${POINTER}" ]; then | ||
| if [ "${DOWNLOAD_OUTCOME}" = "failure" ] && [ "${UAT_CONCLUSION}" = "success" ]; then | ||
| echo "::error::UAT succeeded but evidence pointer download failed — possible network error" | ||
| exit 1 | ||
| fi | ||
| echo "::notice::No evidence pointer found (conformance did not complete) — skipping TestGrid publish" | ||
| echo "skip=true" >> "${GITHUB_OUTPUT}" | ||
| exit 0 |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Keep the missing-artifact skip path reachable.
With set -euo pipefail, POINTER="$(find evidence-pointer/ ... | head -1)" can exit the script when evidence-pointer/ does not exist, before Line 219 handles the missing pointer. Make the lookup tolerate the absent directory.
🐛 Proposed fix
- POINTER="$(find evidence-pointer/ -name 'pointer.yaml' 2>/dev/null | head -1)"
+ POINTER="$(find evidence-pointer/ -name 'pointer.yaml' -print -quit 2>/dev/null || true)"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| POINTER="$(find evidence-pointer/ -name 'pointer.yaml' 2>/dev/null | head -1)" | |
| DOWNLOAD_OUTCOME="${DOWNLOAD_POINTER_OUTCOME}" | |
| if [ -z "${POINTER}" ]; then | |
| if [ "${DOWNLOAD_OUTCOME}" = "failure" ] && [ "${UAT_CONCLUSION}" = "success" ]; then | |
| echo "::error::UAT succeeded but evidence pointer download failed — possible network error" | |
| exit 1 | |
| fi | |
| echo "::notice::No evidence pointer found (conformance did not complete) — skipping TestGrid publish" | |
| echo "skip=true" >> "${GITHUB_OUTPUT}" | |
| exit 0 | |
| POINTER="$(find evidence-pointer/ -name 'pointer.yaml' -print -quit 2>/dev/null || true)" | |
| DOWNLOAD_OUTCOME="${DOWNLOAD_POINTER_OUTCOME}" | |
| if [ -z "${POINTER}" ]; then | |
| if [ "${DOWNLOAD_OUTCOME}" = "failure" ] && [ "${UAT_CONCLUSION}" = "success" ]; then | |
| echo "::error::UAT succeeded but evidence pointer download failed — possible network error" | |
| exit 1 | |
| fi | |
| echo "::notice::No evidence pointer found (conformance did not complete) — skipping TestGrid publish" | |
| echo "skip=true" >> "${GITHUB_OUTPUT}" | |
| exit 0 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/testgrid-publish.yml around lines 217 - 226, The
evidence-pointer lookup in the TestGrid publish workflow is too brittle under
set -euo pipefail because the find-based POINTER assignment can terminate the
script before the missing-artifact skip logic runs. Update the pointer discovery
in the workflow step to tolerate a absent evidence-pointer/ directory while
preserving the existing branch that checks POINTER, DOWNLOAD_OUTCOME, and
UAT_CONCLUSION so the skip=true path remains reachable.
mchmarny
left a comment
There was a problem hiding this comment.
Clean, well-structured CI workflow — injection hygiene is solid (all user-controlled values flow through env, anchored format regex, digest-pinned bundle refs), actions are SHA-pinned, permissions are least-privilege at both workflow and job level, and cancel-in-progress: false correctly protects in-flight GCS writes. Verified against the repo: UAT workflow names, the uat-*-evidence-pointer-<run_id> artifact naming, the attestations[0].bundle.oci/.digest pointer schema, and the referenced composite actions all line up.
No blocking issues. Three non-critical notes inline: a community source_class option that the nvidia-only bundle regex makes unusable (medium), a producer-guard comment that overstates what the repo-name check actually does (low — moot since UAT can't be fork-triggered), and unguarded vars.GCP_PROJECT_* lookups (low).
WIP as titled — go build ./tools/testgrid-publish will fail until TG2 (#1447) merges, and the workflow is inert until the prod WIF pool/SA exist; both are already documented in the description. Nothing here blocks merge once those land.
| fi | ||
| # Validate format: ghcr.io/nvidia/...@sha256:<64 hex chars> | ||
| # Also guards against newline injection into GITHUB_OUTPUT. | ||
| if ! echo "${BUNDLE_REF}" | grep -qE '^ghcr\.io/nvidia/[^@]+@sha256:[0-9a-f]{64}$'; then |
There was a problem hiding this comment.
source_class: community (line 71) is offered as a dispatch option, but this regex hard-codes ^ghcr\.io/nvidia/. Committed community evidence bundles live under contributor namespaces — e.g. recipes/evidence/gb200-eks-ubuntu-training.yaml points at ghcr.io/yuanchen8911/aicr-evidence:.... So a dispatch with source_class: community and a real community bundle_ref always fails this check. Either the community option is dead, or the regex needs to permit non-nvidia namespaces (while still requiring @sha256:<64hex>).
| # Guard: reject workflow_run events from forks or unexpected branches | ||
| # before any credential exchange. WIF is gated by the attribute_condition | ||
| # in wif-publish.tf, but an explicit check here fails fast and produces | ||
| # a clear error message rather than a cryptic WIF rejection. | ||
| - name: Validate producer | ||
| if: github.event_name == 'workflow_run' | ||
| env: | ||
| PRODUCER_REPO: ${{ github.event.workflow_run.repository.full_name }} | ||
| PRODUCER_BRANCH: ${{ github.event.workflow_run.head_branch }} | ||
| run: | | ||
| set -euo pipefail | ||
| if [ "${PRODUCER_REPO}" != "${GITHUB_REPOSITORY}" ]; then | ||
| echo "::error::Producer workflow is from '${PRODUCER_REPO}', expected '${GITHUB_REPOSITORY}' — rejecting" |
There was a problem hiding this comment.
The comment says this "rejects workflow_run events from forks," but github.event.workflow_run.repository.full_name is always the repo that hosts the workflow (the base repo) — never the fork. For a fork-originated run the fork identity is in workflow_run.head_repository.full_name, so this != GITHUB_REPOSITORY check can never fire on a fork.
In practice this is moot: UAT only triggers on workflow_dispatch + schedule, never fork PRs, so no fork can ever produce a UAT - GCP/AWS run. But the comment overstates the guard. Either soften the comment to match what the check actually does (base-repo sanity check) or, if fork defense-in-depth is the intent, check head_repository.full_name.
| GCP_PROJECT_NUMBER: ${{ vars.GCP_PROJECT_NUMBER }} | ||
| GCP_PROJECT_ID: ${{ vars.GCP_PROJECT_ID }} | ||
| run: | | ||
| set -euo pipefail | ||
| if [ "${TG_ENV}" = "prod" ]; then | ||
| CLUSTER="aicr-testgrid" | ||
| else | ||
| CLUSTER="aicr-testgrid-${TG_ENV}" | ||
| fi | ||
| echo "cluster=${CLUSTER}" >> "${GITHUB_OUTPUT}" | ||
| echo "bucket=${CLUSTER}" >> "${GITHUB_OUTPUT}" | ||
| echo "wif_provider=projects/${GCP_PROJECT_NUMBER}/locations/global/workloadIdentityPools/${CLUSTER}-github/providers/github-actions" >> "${GITHUB_OUTPUT}" | ||
| echo "wif_sa=${CLUSTER}-publish@${GCP_PROJECT_ID}.iam.gserviceaccount.com" >> "${GITHUB_OUTPUT}" |
There was a problem hiding this comment.
vars.GCP_PROJECT_NUMBER / GCP_PROJECT_ID are consumed unguarded. If either repo var is unset, wif_provider silently becomes projects//locations/global/... and wif_sa becomes ...-publish@.iam.gserviceaccount.com, so the failure surfaces later as a cryptic WIF rejection. A quick [ -n "${GCP_PROJECT_NUMBER}" ]/[ -n "${GCP_PROJECT_ID}" ] guard with an ::error:: would fail fast here — same "fail fast with a clear message" rationale you gave for the producer guard.
njhensley
left a comment
There was a problem hiding this comment.
🔍 Reconciled multi-persona review
Method: 4 independent persona reviewers (Security/CI-supply-chain · Operability/CI-DX · Correctness · Domain & Architecture) → adversarial senior meta-reviewer re-derived every finding from the resolved code, then reconciled against all prior reviews on this PR (CodeRabbit ×5 runs, @mchmarny's human review, pre-merge bot + zizmor). Findings are deduplicated and re-tiered to the cross-source consensus; line links pin to head 37366ea.
Legend: 🔴 Blocker · 🟠 Major · 🟡 Minor · 🔵 Nitpick
Overall
This is a carefully built, security-conscious workflow — all three review sources agree the logic is safe (injection hygiene via env-routing, SHA-pinned actions, least-privilege per-job permissions, persist-credentials: false, cancel-in-progress: false, build-from-vendor) and that the live risks are operability + the prod-Terraform sequencing, not the workflow logic. @mchmarny found no blocking issues; this reconciliation concurs — no hard blocker remains now that TG2 (#1447) has merged.
Recommendation: Approve-with-comments, gated on two operational items before prod traffic:
- Reconcile the prod WIF pool/SA naming against the actual
wif-publish.tf(the inline 🟠 at the Resolve-GCP-config step) — with TG2 merged this is the most likely rollout-breaker. - Merge after the prod Terraform apply, or add a feature-flag guard — otherwise every passing UAT goes red at WIF auth until the pool/SA exist.
The other two 🟠 (line-based grep + its inaccurate comment; unguarded vars.GCP_PROJECT_*) are quick fixes worth landing in this PR.
TG2 status (confirmed on main)
tools/testgrid-publish is now merged, and its flags match the workflow exactly — --bundle / --bucket / --source-class (main.go:71-74), --source-class defaulting to uat and accepting community. So the earlier flag-drift concern is resolved; the build step will now succeed.
Reconciliation with existing reviews
Already addressed by the author on earlier commits (CodeRabbit threads marked ✅): producer workflow names "UAT - GCP"/"UAT - AWS", source_class wired through env, bundle_ref moved to env, actions: read added, prod bucket un-suffixed.
Moot for the current PR: CodeRabbit's .goreleaser.yaml release-artifact-integrity / unsigned-tarball findings — the PR no longer touches .goreleaser.yaml and the workflow builds from source only. zizmor concurrency-limits is resolved (concurrency block present); zizmor's workflow_run dangerous-triggers is informational and mitigated by the producer guard.
Process note — linked-issue / scope mismatch: the PR says Fixes #1268, but the pre-merge bot flags #1268 as TG3 (processing workers + config-gen), not TG5 (publishing) → "Out of Scope" + "Linked Issues" warnings. Worth correcting the linked issue or the Fixes target.
Correction to my own panel: my Security/meta pass initially called the grep regex a newline-injection guard — CodeRabbit is right that it is not (see the inline 🟠). Reconciliation defers to CodeRabbit there.
Confirmed non-issues (examined, agreed by panel + @mchmarny)
Script-injection hygiene (all event/user values env-routed, never inline ${{ }} in run:), SHA-pinned actions, least-privilege per-job permissions, persist-credentials: false, cancel-in-progress: false on the upload job, build-from-vendor in the privileged context, and the producer-repo guard running before any credential exchange (defense-in-depth — moot since UAT can't be fork-triggered). Digest construction, yq null/empty handling, the skip output coverage, and the top-level if: matrix were all traced correct.
Tier tally (this review)
| 🔴 Blocker | 🟠 Major | 🟡 Minor | 🔵 Nitpick |
|---|---|---|---|
| 0 | 5 | 7 | 2 |
🤖 Posted via the reviewing-prs-multipersona skill (multi-persona + adversarial meta-review, reconciled across CodeRabbit and human reviews).
| GCP_PROJECT_ID: ${{ vars.GCP_PROJECT_ID }} | ||
| run: | | ||
| set -euo pipefail | ||
| if [ "${TG_ENV}" = "prod" ]; then |
There was a problem hiding this comment.
🟠 Major — prod WIF pool/SA naming contradicts the PR's stated Terraform
The prod branch emits unsuffixed names: WIF pool aicr-testgrid-github, SA aicr-testgrid-publish, bucket aicr-testgrid. The PR body states prod Terraform creates aicr-testgrid-prod-github and aicr-testgrid-prod-publish (only the bucket matches). CodeRabbit flagged the inverse on an earlier commit; the bucket got reconciled but the pool/SA names still disagree.
Blast radius: Once prod Terraform is applied, a name mismatch = silent WIF auth failure on every publish → the dashboard never updates. With TG2 now merged, this is the single most likely rollout-breaker.
Fix: Reconcile against the actual wif-publish.tf in the aicr-testgrid repo before/with the apply. Drop the -prod suffix in Terraform, or set CLUSTER="aicr-testgrid-prod" here. (Source: persona panel + CodeRabbit history.)
| fi | ||
| # Validate format: ghcr.io/nvidia/...@sha256:<64 hex chars> | ||
| # Also guards against newline injection into GITHUB_OUTPUT. | ||
| if ! echo "${BUNDLE_REF}" | grep -qE '^ghcr\.io/nvidia/[^@]+@sha256:[0-9a-f]{64}$'; then |
There was a problem hiding this comment.
🟠 Major — grep -qE is line-based — does NOT prevent GITHUB_OUTPUT newline injection; the comment is wrong
echo "${BUNDLE_REF}" | grep -qE '^...$' matches per line, so a value like valid-ref\nfoo=bar passes (line 1 matches) and the subsequent echo "bundle_ref=${BUNDLE_REF}" >> GITHUB_OUTPUT writes both lines, injecting foo=bar as a second output. The inline comment claiming this 'guards against newline injection into GITHUB_OUTPUT' is therefore inaccurate.
Blast radius: Bounded — exploitable via the workflow_dispatch bundle_ref input by a main-branch actor (or a newline reaching the trusted pointer). Low likelihood, but the guard does not do what its comment says.
Fix: Use a whole-string Bash match instead of line-based grep: if [[ ! "${BUNDLE_REF}" =~ ^ghcr\.io/nvidia/[A-Za-z0-9._/-]+@sha256:[0-9a-f]{64}$ ]]; then .... Then fix the comment. (Source: CodeRabbit — corrects an earlier claim in my own panel.)
| id: gcp | ||
| env: | ||
| TG_ENV: ${{ inputs.environment || 'prod' }} | ||
| GCP_PROJECT_NUMBER: ${{ vars.GCP_PROJECT_NUMBER }} |
There was a problem hiding this comment.
🟠 Major — Unguarded vars.GCP_PROJECT_NUMBER / GCP_PROJECT_ID → malformed WIF, cryptic failure
No non-empty guard (set -u does not catch empty-but-set env vars). If either repo var is unset, the step still succeeds and emits projects//locations/.../aicr-testgrid-github/... and ...-publish@.iam.gserviceaccount.com, which then fail opaquely inside google-github-actions/auth three steps later.
Blast radius: Operator sees a cryptic WIF rejection instead of 'GCP_PROJECT_NUMBER is unset' — the same legibility the Validate-producer step provides, missing for config vars.
Fix: Fail closed early: : "${GCP_PROJECT_NUMBER:?repo var GCP_PROJECT_NUMBER is unset}" (and GCP_PROJECT_ID) with an ::error::. (Source: persona panel + mchmarny + CodeRabbit.)
| # a failure run is a valid TestGrid column showing what broke. | ||
| # For workflow_dispatch: restrict to main to prevent feature-branch | ||
| # code from obtaining WIF credentials (see comment on trigger above). | ||
| if: > |
There was a problem hiding this comment.
🟠 Major — Job if: silently drops cancelled / timed_out / skipped UAT conclusions
The gate admits only workflow_dispatch-on-main or workflow_run conclusion success/failure. A cancelled, timed_out, action_required, or skipped UAT (e.g. an operator-aborted or 180-min-capped run) makes the whole job skip with no step output.
Blast radius: Observability gap that contradicts the header comment ('after every completed cluster run'): no TestGrid column AND no log explaining the no-op. On-call investigating 'why no column for run X' finds a job that simply didn't run.
Fix: Either broaden the gate and let Resolve-bundle-ref decide (it already handles no-pointer → skip), or add a ::notice:: documenting that non-terminal conclusions are intentionally not published. (Source: persona panel — net-new, not raised by prior reviewers.)
|
|
||
| # Authenticate to GCP via WIF using the testgrid publish SA. | ||
| # This SA has objectCreator on gs://aicr-testgrid[-<env>]/groups/ only. | ||
| - name: Authenticate to GCP |
There was a problem hiding this comment.
🟠 Major — Not 'inert until Terraform': successful UATs go red at WIF auth until prod Terraform applies
TG2 is now merged so go build succeeds, but on a passing UAT the job still reaches Authenticate to GCP and fails there until the prod WIF pool/SA exist. The skip/no-pointer path is genuinely green; the success path is not.
Blast radius: Every passing UAT produces a red TestGrid Publish run (after exercising prod auth) between merge and the Terraform apply — alert noise that erodes CI signal.
Fix: Merge after the prod Terraform apply, OR gate the job behind a vars.TESTGRID_PUBLISH_ENABLED == 'true' feature flag (also makes 'inert until Terraform' literally true and gives on-call one switch). (Source: persona panel + mchmarny; re-pointed from the TG2 gate now that TG2 merged.)
| - name: Resolve GCP config | ||
| id: gcp | ||
| env: | ||
| TG_ENV: ${{ inputs.environment || 'prod' }} |
There was a problem hiding this comment.
🟡 Minor — environment input selects bucket/WIF but is never passed to the tool
environment (prod/staging) flows only into bucket/WIF resource names; the binary call passes --bucket and --source-class but no --environment. Architecturally fine iff the bucket fully encodes env (it does: bucket == cluster == aicr-testgrid[-]).
Blast radius: None today; a coupling worth documenting so a future change doesn't assume the tool sees the env dimension.
Fix: Add a one-line comment that env is encoded in the bucket. (Source: persona panel.)
| # workflow code could otherwise obtain GCS write credentials. | ||
| workflow_dispatch: | ||
| inputs: | ||
| bundle_ref: |
There was a problem hiding this comment.
🟡 Minor — workflow_dispatch bundle_ref provenance unverified beyond prefix+digest
The regex constrains namespace+digest shape but not provenance. A main-branch operator (or anyone with push to ghcr.io/nvidia/*) could feed an arbitrary conforming bundle to the publish binary, which pulls and parses it under WIF creds.
Blast radius: Bounded by the SA's groups/-prefix-only objectCreator scope and the main-only dispatch gate. Real but well-contained.
Fix: Optionally verify the bundle's attestation/signature before pull, or document workflow_dispatch as trusted-operator-only. (Source: persona panel.)
| POINTER="$(find evidence-pointer/ -name 'pointer.yaml' 2>/dev/null | head -1)" | ||
| DOWNLOAD_OUTCOME="${DOWNLOAD_POINTER_OUTCOME}" | ||
| if [ -z "${POINTER}" ]; then | ||
| if [ "${DOWNLOAD_OUTCOME}" = "failure" ] && [ "${UAT_CONCLUSION}" = "success" ]; then |
There was a problem hiding this comment.
🟡 Minor — Narrow false hard-error on a manual skip_tests dispatch
Producer uploads the pointer only if conformance succeeded. The DOWNLOAD_OUTCOME=='failure' && UAT_CONCLUSION=='success' branch is meant to catch a network error, but a deliberate skip_tests UAT concludes success with no pointer → this branch fires exit 1. (Downgraded from Major by meta-review: a normal conformance failure forces conclusion=failure, so this path is reachable only via the manual skip_tests dispatch, which has no evidence to publish anyway.)
Blast radius: A skip_tests UAT shows a confusing red publish run; the not-found-vs-network distinction remains imperfect (both surface as outcome=failure).
Fix: Accept the narrow edge, or have the producer always upload a sentinel pointer on success so absence unambiguously means failure. (Source: persona panel.)
| # If the download step failed, distinguish "artifact not found" | ||
| # (conformance did not complete — benign skip) from a real error | ||
| # (network failure — should not silently drop the publish). | ||
| POINTER="$(find evidence-pointer/ -name 'pointer.yaml' 2>/dev/null | head -1)" |
There was a problem hiding this comment.
🔵 Nitpick — find evidence-pointer/ ... | head -1 brittle under set -euo pipefail
If the download dir is ever absent, find <missing-dir> exits 1 and pipefail+set -e kills the step before the -z disambiguation; head -1 closing the pipe can also SIGPIPE find on many matches. In practice download-artifact@v8 pre-creates the path: dir and merge-multiple yields one file, so it's benign today.
Blast radius: Latent; would surface only on an unexpected layout/failure mode, producing an opaque exit-1 instead of the intended notice/error.
Fix: Cheap insurance: POINTER="$(find evidence-pointer/ -name pointer.yaml 2>/dev/null | head -1 || true)" (or guard [ -d evidence-pointer/ ] first). (Source: persona panel + CodeRabbit.)
| echo "skip=true" >> "${GITHUB_OUTPUT}" | ||
| exit 0 | ||
| fi | ||
| OCI=$(yq '.attestations[0].bundle.oci' "${POINTER}") |
There was a problem hiding this comment.
🔵 Nitpick — yq .attestations[0] hardcodes index 0
Correct for schema V1 (BuildPointer always emits exactly one attestation), but couples to that invariant with no guard.
Blast radius: None today; a future V2 multi-attestation pointer could silently pick the wrong entry.
Fix: Optional fail-loud guard: assert yq '.attestations | length' == 1, or select by .bundle.predicateType. (Source: persona panel.)
Summary
Adds
.github/workflows/testgrid-publish.yml— the GitHub Actions workflow that automatically publishes AICR UAT evidence bundles to the TestGrid dashboard after every completed cluster run.Motivation / Context
TG5 from the AICR TestGrid epic (#1263). Once TG2 (
tools/testgrid-publish) merges, this wires it into CI so every UAT run appears as a column in the TestGrid dashboard without manual intervention.Fixes: #1268
Related: #1263, #1267
Type of Change
Component(s) Affected
.github/workflows/testgrid-publish.yml(new CI workflow)Implementation Notes
Trigger:
workflow_runon "UAT GCP" and "UAT AWS" (success or failure — failing runs are valid TestGrid columns showing which tests broke). Also supportsworkflow_dispatchfor manual backfills.Pipeline:
Auth: WIF pool
aicr-testgrid-prod-github(provisioned in aicr-testgrid Terraform). Workflow filename starts withtestgrid-publishto satisfy the WIFattribute_conditioninwif-publish.tf.Environment: Defaults to prod (
gs://aicr-testgrid). Override to staging viaworkflow_dispatch(environment=staging)for testing. WIF pool, publish SA, and bucket are all derived from the environment input.Dependency: Requires TG2 (
tools/testgrid-publish) — PR #1447.Before merge:
aicr-testgridrepo (createsaicr-testgrid-prod-githubWIF pool andaicr-testgrid-prod-publishSA withobjectCreatorongs://aicr-testgrid/groups/)Testing
Tested the full pipeline manually today:
Risk Assessment
Rollout notes: Workflow is inert until prod WIF pool and SA exist. Safe to merge ahead of Terraform apply.
Checklist
make testwith-race)make lint)git commit -S)