failure-triage-agent: Integrate into daily integration test ansible pipeline by hritik0101 · Pull Request #5830 · GoogleCloudPlatform/cluster-toolkit

hritik0101 · 2026-06-23T06:08:58Z

Overview
This PR integrates the new Failure Triage Agent into our daily integration test pipelines. When a test fails, this PR ensures that before the cluster infrastructure is torn down, a secure webhook is fired to the agent to immediately begin log analysis, root-cause determination and potential next steps.

This significantly reduces the manual effort required to debug daily test failures by providing inline forensic summaries directly in the Cloud Build logs.

Key Changes

Build Context Injection
Modified the Cloud Build test configurations (e.g., ansible-vm.yaml, gke-*.yaml, etc.) to securely pass the $BUILD_ID down to the Ansible playbooks via the full_build_id extra variable. This allows the triage agent to map its analysis back to the exact failing Cloud Build run.
Playbook Orchestration (rescue_gcluster_failure.yml)
Hooked the triage agent trigger into the always block of the base-integration-test.yml by calling it inside rescue_gcluster_failure.yml. This ensures the agent is only invoked upon test failure, and guarantees it runs before the infrastructure is destroyed so logs and states remain intact for the agent.
Trigger Logic (trigger_failure_triage_agent.yml) Created a robust new Ansible task dedicated to invoking the agent:
Feature Flag/Kill Switch: Validates the presence of gs://g-ift-agent-bucket/ENABLE_TRIAGE_AGENT. If missing, the triage agent is skipped. This allows us to disable the agent globally without requiring a code merge.
Secure Authentication: Impersonates the triage-invoker service account to generate a secure OIDC token, ensuring the Cloud Run endpoint remains fully authenticated and protected.
Inline Reporting (Polling): Instead of a fire-and-forget approach, the task implements an asynchronous wait (polling state.json every 60s for up to ~30 mins). Once the analysis completes, it parses the JSON and prints the LLM-generated executive_summary directly to the Cloud Build stdout. Developers can see the root cause without leaving the Google Cloud console.

Testing/Validation
Verified that the trigger task accurately captures the full_build_id and checks the GCS feature flag.
Verified that the identity token is correctly generated using the impersonated service account.
Verified the playbook correctly hits the Cloud Run endpoint via an authenticated POST request.
Verified the polling loop correctly blocks until status: completed and prints the summary report inline.

Notes for Reviewers
Timeouts: Because the playbook now waits for the triage agent to finish its analysis, the total duration of failing builds will increase by up to 15 minutes.

Tested the following daily test trigger with these changes:

Next Steps: The application code for the Failure Triage Agent itself (the Cloud Run service) is being reviewed and merged into a separate Gerrit PR.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

Fork your PR branch from the Toolkit "develop" branch (not main)
Test all changes with pre-commit in a local branch #
Confirm that "make tests" passes all tests
Add or modify unit tests to cover code changes
Ensure that unit test coverage remains above 80%
Update all applicable documentation
Follow Cluster Toolkit Contribution guidelines #

gemini-code-assist · 2026-06-23T06:09:54Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates an automated Failure Triage Agent into the daily integration test pipelines. By triggering this agent during test failures, the system can now perform automated log analysis and provide forensic summaries directly within the Cloud Build console, significantly reducing manual debugging time.

Highlights

Agent Integration: Introduced a new Ansible task to trigger the Failure Triage Agent upon test failure.
Build Context Injection: Updated Cloud Build configurations to pass the BUILD_ID to Ansible playbooks for accurate forensic mapping.
Authentication & Control: Implemented secure OIDC token generation and a GCS-based feature flag to manage agent execution.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

hritik0101 · 2026-06-23T06:12:07Z

/gemini review

gemini-code-assist

Code Review

This pull request integrates a Failure Triage Agent into the daily integration tests. It introduces a new Ansible task file to trigger the agent, wait for analysis, and print reports, and updates various Cloud Build configurations and playbooks to pass the necessary build ID variables. The code review feedback highlights critical security and efficiency improvements for the new Ansible task file. Specifically, it recommends passing variables via the environment block to prevent command injection vulnerabilities in shell scripts and curl payloads, and replacing a complex, hardcoded shell sleep loop with Ansible's native, more efficient 'until' loop for polling the agent's status.

gemini-code-assist

Code Review

This pull request integrates a Failure Triage Agent into the daily integration tests by adding a new Ansible task file and updating various playbooks and Cloud Build configurations to pass the build ID. The review feedback highlights several critical improvements: fixing a path resolution error when including the triage agent task in rescue_gcluster_failure.yml, refactoring a complex bash wait loop to use Ansible's native until and retries features, parameterizing hardcoded configuration variables to allow overrides, and passing variables to shell tasks via the environment keyword to prevent syntax errors or command injection.

hritik0101 · 2026-06-23T08:56:01Z

/gemini review

gemini-code-assist

Code Review

This pull request integrates a Failure Triage Agent into the daily integration tests by updating Cloud Build configurations to pass the build ID and adding Ansible tasks to trigger and poll the agent. The review feedback highlights several critical improvements: moving the triage agent trigger from the 'always' block to a 'rescue' block to prevent unnecessary execution on successful runs, improving error visibility by removing stderr silencing in gcloud and curl commands, and optimizing the polling loop by removing the initial 60-second sleep.

LAVEEN · 2026-06-23T12:15:28Z

https://storage.cloud.google.com/g-ift-agent-bucket/20ad5942-2241-4397-9995-e07eb919b8e0/report.tx
im not able to view this file.

hritik0101 · 2026-06-23T12:49:56Z

https://storage.cloud.google.com/g-ift-agent-bucket/20ad5942-2241-4397-9995-e07eb919b8e0/report.tx im not able to view this file.

https://storage.cloud.google.com/g-ift-agent-bucket/20ad5942-2241-4397-9995-e07eb919b8e0/report.txt
this the whole link
you missed the "txt"

LAVEEN

LGTM

AdarshK15

LGTM, please run few daily tests and PR test with this commit to test.

kvenkatachala333 · 2026-06-25T10:47:28Z

+- name: Set Triage Agent Configuration
+  ansible.builtin.set_fact:
+    triage_gcs_bucket: "{{ triage_gcs_bucket_override | default('g-ift-agent-bucket') }}"
+    triage_project_number: "{{ triage_project_number_override | default('508417052821') }}"
+    triage_invoker_sa: "{{ triage_invoker_sa_override | default('triage-invoker@hpc-toolkit-dev.iam.gserviceaccount.com') }}"
+    triage_cloud_run_url: "{{ triage_cloud_run_url_override | default('https://failure-triage-agent-508417052821.us-central1.run.app') }}"


Since cluster-toolkit is an open-source, it is not recommended to hardcode internal GCP project numbers (508417052821), buckets, and service accounts as fallback defaults.

Please remove these internal defaults (defaulting them to '' instead) and pass them exclusively via extra-vars in the Cloud Build CI configuration. You can add a quick check in the prerequisite step to skip the agent if these variables are missing

Ansible changes: Integration of Failure-Triage-agent

34b8644

gemini-code-assist Bot reviewed Jun 23, 2026

View reviewed changes

LAVEEN assigned hritik0101 Jun 23, 2026

hritik0101 changed the title ~~Ansible changes: Integration of Failure-Triage-agent~~ failure-triage-agent: Integrate into daily integration test ansible pipeline Jun 23, 2026

Fix include path and improve triage agent trigger playbook

2d8cfb0

gemini-code-assist Bot reviewed Jun 23, 2026

View reviewed changes

chore: remove stderr suppression from gcloud and curl commands

c2b0668

kvenkatachala333 reviewed Jun 23, 2026

View reviewed changes

Comment thread tools/cloud-build/daily-tests/ansible_playbooks/tasks/trigger_failure_triage_agent.yml Outdated

hritik0101 added 2 commits June 24, 2026 03:50

Updating the Environment vars

6e9582c

Kill switch updated

abc6625

hritik0101 marked this pull request as ready for review June 24, 2026 08:33

hritik0101 requested a review from a team as a code owner June 24, 2026 08:33

resolved few comments

6882199

LAVEEN added the release-improvements Added to release notes under the "Improvements" heading. label Jun 24, 2026

LAVEEN assigned AdarshK15 Jun 24, 2026

LAVEEN previously approved these changes Jun 24, 2026

View reviewed changes

LAVEEN assigned kvenkatachala333 and unassigned hritik0101 and kvenkatachala333 Jun 24, 2026

LAVEEN requested review from AdarshK15 and kvenkatachala333 June 24, 2026 15:37

AdarshK15 reviewed Jun 24, 2026

View reviewed changes

AdarshK15 reviewed Jun 25, 2026

View reviewed changes

Comment thread tools/cloud-build/daily-tests/ansible_playbooks/tasks/trigger_failure_triage_agent.yml Outdated

Added a check for state file existence to fail fast

8dd7d1c

hritik0101 dismissed LAVEEN’s stale review via 8dd7d1c June 25, 2026 04:43

hritik0101 added 3 commits June 25, 2026 04:56

Updated ansible message for state file

1e34630

corrected the position of env var

ebd355a

corrected the position of env var

4f30632

AdarshK15 previously approved these changes Jun 25, 2026

View reviewed changes

kvenkatachala333 reviewed Jun 25, 2026

View reviewed changes

Comment thread tools/cloud-build/daily-tests/ansible_playbooks/tasks/trigger_failure_triage_agent.yml

LAVEEN reviewed Jun 25, 2026

View reviewed changes

Comment thread tools/cloud-build/daily-tests/ansible_playbooks/multigroup-integration-test.yml

chore(ci): configure failure triage agent with Secret Manager

4ab6c54

hritik0101 dismissed AdarshK15’s stale review via 4ab6c54 June 25, 2026 18:21

hritik0101 added 2 commits June 25, 2026 18:30

resolved a pr review

0bafeba

Updating secret vars for some build files

6181922

Uh oh!

Conversation

hritik0101 commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Submission Checklist

Uh oh!

gemini-code-assist Bot commented Jun 23, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

hritik0101 commented Jun 23, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

hritik0101 commented Jun 23, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

LAVEEN commented Jun 23, 2026

Uh oh!

hritik0101 commented Jun 23, 2026

Uh oh!

LAVEEN left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

AdarshK15 left a comment

Choose a reason for hiding this comment

Uh oh!

kvenkatachala333 Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

hritik0101 commented Jun 23, 2026 •

edited

Loading