failure-triage-agent: Integrate into daily integration test ansible pipeline #5830

Open
hritik0101 wants to merge 13 commits into GoogleCloudPlatform:develop from hritik0101:feature/ansible-triage-trigger

Conversation

@hritik0101 hritik0101 commented Jun 23, 2026

Contributor

Overview
This PR integrates the new Failure Triage Agent into our daily integration test pipelines. When a test fails, the pipeline fires an authenticated webhook to the agent before the cluster infrastructure is torn down, so the agent can immediately begin log analysis, determine the root cause, and suggest next steps.

This significantly reduces the manual effort required to debug daily test failures by providing inline forensic summaries directly in the Cloud Build logs.

Key Changes

  1. Build Context Injection
    Modified the Cloud Build test configurations (e.g., ansible-vm.yaml, gke-*.yaml) to pass the $BUILD_ID down to the Ansible playbooks via the full_build_id extra variable. This allows the triage agent to map its analysis back to the exact failing Cloud Build run.

  2. Playbook Orchestration (rescue_gcluster_failure.yml)
    Hooked the triage agent trigger into the always block of the base-integration-test.yml by calling it inside rescue_gcluster_failure.yml. This ensures the agent is only invoked upon test failure, and guarantees it runs before the infrastructure is destroyed so logs and states remain intact for the agent.

  3. Trigger Logic (trigger_failure_triage_agent.yml)
    Created a robust new Ansible task dedicated to invoking the agent:
    Feature Flag/Kill Switch: Validates the presence of gs://g-ift-agent-bucket/ENABLE_TRIAGE_AGENT. If missing, the triage agent is skipped. This allows us to disable the agent globally without requiring a code merge.
    Secure Authentication: Impersonates the triage-invoker service account to generate a secure OIDC token, ensuring the Cloud Run endpoint remains fully authenticated and protected.
    Inline Reporting (Polling): Instead of a fire-and-forget approach, the task implements an asynchronous wait (polling state.json every 60s for up to ~30 mins). Once the analysis completes, it parses the JSON and prints the LLM-generated executive_summary directly to the Cloud Build stdout. Developers can see the root cause without leaving the Google Cloud console.
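
The trigger flow described in item 3 could look roughly like the sketch below. It is not the task file from this PR: it uses the uri module instead of curl, and the payload shape, the accepted status codes, and the registered variable names (triage_flag, triage_token) are assumptions made for illustration. The triage_* configuration variables are the ones set later in this conversation.

```yaml
# Sketch only, not the PR's actual task file.
- name: Check whether the triage agent feature flag exists
  ansible.builtin.command:
    argv:
      - gcloud
      - storage
      - ls
      - "gs://{{ triage_gcs_bucket }}/ENABLE_TRIAGE_AGENT"
  register: triage_flag      # hypothetical variable name
  failed_when: false         # a missing flag means "skip", not "fail the build"
  changed_when: false

- name: Trigger the agent only when the flag is present
  when: triage_flag.rc == 0
  block:
    - name: Mint an identity token by impersonating the invoker service account
      ansible.builtin.command:
        argv:
          - gcloud
          - auth
          - print-identity-token
          - "--impersonate-service-account={{ triage_invoker_sa }}"
          - "--audiences={{ triage_cloud_run_url }}"
      register: triage_token   # hypothetical variable name
      changed_when: false
      no_log: true             # keep the token out of the build log

    - name: POST the build ID to the Cloud Run endpoint
      ansible.builtin.uri:
        url: "{{ triage_cloud_run_url }}"
        method: POST
        headers:
          Authorization: "Bearer {{ triage_token.stdout }}"
        body_format: json
        body:
          build_id: "{{ full_build_id }}"   # payload shape is an assumption
        status_code: [200, 202]             # accepted codes are an assumption
```

Deleting the ENABLE_TRIAGE_AGENT object from the bucket makes the first task return a non-zero code, which skips the whole block without any code change.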

Testing/Validation
Verified that the trigger task accurately captures the full_build_id and checks the GCS feature flag.
Verified that the identity token is correctly generated using the impersonated service account.
Verified the playbook correctly hits the Cloud Run endpoint via an authenticated POST request.
Verified the polling loop correctly blocks until status: completed and prints the summary report inline.

Notes for Reviewers
Timeouts: Because the playbook now waits for the triage agent to finish its analysis, the total duration of failing builds will increase by up to 15 minutes.

Tested the following daily test triggers with these changes:

  1. DAILY-test-ml-h4d-onspot-slurm
  2. DAILY-test-gke-h4d-onspot
  3. DAILY-test-gke-tpu-7x

Next Steps: The application code for the Failure Triage Agent itself (the Cloud Run service) is being reviewed and merged separately through Gerrit.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates an automated Failure Triage Agent into the daily integration test pipelines. By triggering this agent during test failures, the system can now perform automated log analysis and provide forensic summaries directly within the Cloud Build console, significantly reducing manual debugging time.

Highlights

  • Agent Integration: Introduced a new Ansible task to trigger the Failure Triage Agent upon test failure.
  • Build Context Injection: Updated Cloud Build configurations to pass the BUILD_ID to Ansible playbooks for accurate forensic mapping.
  • Authentication & Control: Implemented secure OIDC token generation and a GCS-based feature flag to manage agent execution.
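
As an illustration of the build context injection, a Cloud Build step might pass the build ID to the playbook as shown below. This is a sketch under assumptions: the step id, the builder image name, and the playbook path are hypothetical, and only $BUILD_ID and the full_build_id extra variable come from the PR description.

```yaml
steps:
  - id: integration-test
    name: example-test-runner   # hypothetical builder image with Ansible installed
    entrypoint: /bin/bash
    args:
      - -c
      - |
        set -e
        # $BUILD_ID is a default Cloud Build substitution, filled in before the step runs.
        # The playbook path below is an assumption.
        ansible-playbook tools/cloud-build/daily-tests/ansible_playbooks/base-integration-test.yml \
          --extra-vars="full_build_id=$BUILD_ID"
```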
Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@hritik0101

Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request integrates a Failure Triage Agent into the daily integration tests. It introduces a new Ansible task file to trigger the agent, wait for analysis, and print reports, and updates various Cloud Build configurations and playbooks to pass the necessary build ID variables. The code review feedback highlights critical security and efficiency improvements for the new Ansible task file. Specifically, it recommends passing variables via the environment block to prevent command injection vulnerabilities in shell scripts and curl payloads, and replacing a complex, hardcoded shell sleep loop with Ansible's native, more efficient 'until' loop for polling the agent's status.
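
A minimal sketch of the environment-based approach recommended here, assuming jq and curl are available on the runner. The environment variable names and the triage_token variable are hypothetical, and the payload shape is an assumption. Because the values reach the script as environment variables rather than as templated text, a build ID containing quotes or shell metacharacters is treated as data.

```yaml
- name: Trigger the agent without templating values into the script body
  ansible.builtin.shell: |
    set -euo pipefail
    # Build the JSON payload with jq so the build ID is escaped as data.
    payload="$(jq -n --arg id "$BUILD_ID_VALUE" '{build_id: $id}')"
    curl --fail --silent --show-error \
      -X POST "$AGENT_URL" \
      -H "Authorization: Bearer $ID_TOKEN" \
      -H "Content-Type: application/json" \
      -d "$payload"
  args:
    executable: /bin/bash        # pipefail requires bash
  environment:
    BUILD_ID_VALUE: "{{ full_build_id }}"
    AGENT_URL: "{{ triage_cloud_run_url }}"
    ID_TOKEN: "{{ triage_token.stdout }}"   # hypothetical registered variable
  no_log: true                   # keeps the token out of verbose task output
```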

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request integrates a Failure Triage Agent into the daily integration tests by adding a new Ansible task file and updating various playbooks and Cloud Build configurations to pass the build ID. The review feedback highlights several critical improvements: fixing a path resolution error when including the triage agent task in rescue_gcluster_failure.yml, refactoring a complex bash wait loop to use Ansible's native until and retries features, parameterizing hardcoded configuration variables to allow overrides, and passing variables to shell tasks via the environment keyword to prevent syntax errors or command injection.
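
The until/retries refactor suggested here could be sketched as follows. The object path of state.json inside the bucket is an assumption, as is the triage_state variable name. The status and executive_summary fields are the ones named in the PR description.

```yaml
- name: Poll state.json until the analysis completes
  ansible.builtin.command:
    argv:
      - gcloud
      - storage
      - cat
      - "gs://{{ triage_gcs_bucket }}/{{ full_build_id }}/state.json"   # path is an assumption
  register: triage_state
  changed_when: false
  ignore_errors: true   # a triage timeout should not mask the original test failure
  retries: 30           # with delay: 60 this waits roughly 30 minutes at most
  delay: 60
  until: >-
    triage_state.rc == 0 and
    (triage_state.stdout | from_json).get('status') == 'completed'

- name: Print the executive summary inline
  ansible.builtin.debug:
    msg: "{{ (triage_state.stdout | from_json).get('executive_summary') }}"
  when:
    - triage_state.rc == 0
    - (triage_state.stdout | from_json).get('status') == 'completed'
```

Ansible runs the first attempt immediately and only sleeps between retries, so a loop like this does not pay an up-front delay before the first check.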

Comment thread tools/cloud-build/daily-tests/ansible_playbooks/tasks/rescue_gcluster_failure.yml Outdated
@hritik0101 hritik0101 changed the title from "Ansible changes: Integration of Failure-Triage-agent" to "failure-triage-agent: Integrate into daily integration test ansible pipeline" Jun 23, 2026
@hritik0101

Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request integrates a Failure Triage Agent into the daily integration tests by updating Cloud Build configurations to pass the build ID and adding Ansible tasks to trigger and poll the agent. The review feedback highlights several critical improvements: moving the triage agent trigger from the 'always' block to a 'rescue' block to prevent unnecessary execution on successful runs, improving error visibility by removing stderr silencing in gcloud and curl commands, and optimizing the polling loop by removing the initial 60-second sleep.
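
For reference, the difference between the 'always' and 'rescue' placements can be sketched with a generic block. The included file names run_test.yml and destroy.yml are hypothetical, and the relative include paths are assumptions. Only trigger_failure_triage_agent.yml is named in this PR.

```yaml
- name: Run the integration test
  block:
    - name: Deploy and test the cluster
      ansible.builtin.include_tasks: tasks/run_test.yml        # hypothetical file

  rescue:
    # Runs only when a task in the block fails, while resources still exist.
    - name: Trigger the failure triage agent
      ansible.builtin.include_tasks: tasks/trigger_failure_triage_agent.yml

    - name: Re-raise the failure so the build is still marked as failed
      ansible.builtin.fail:
        msg: "Integration test failed; see the triage summary above."

  always:
    # Runs on success and on failure, after rescue has finished.
    - name: Destroy the cluster
      ansible.builtin.include_tasks: tasks/destroy.yml         # hypothetical file
```

With this layout a successful run never reaches the trigger task, and teardown in 'always' still happens after the agent has been invoked.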

@LAVEEN

LAVEEN commented Jun 23, 2026

Collaborator

@hritik0101

Contributor Author

@hritik0101 hritik0101 marked this pull request as ready for review June 24, 2026 08:33
@hritik0101 hritik0101 requested a review from a team as a code owner June 24, 2026 08:33
@LAVEEN LAVEEN added the release-improvements Added to release notes under the "Improvements" heading. label Jun 24, 2026
LAVEEN previously approved these changes Jun 24, 2026

@LAVEEN LAVEEN left a comment

Collaborator

LGTM

Comment thread tools/cloud-build/daily-tests/builds/gke-a2-highgpu-kueue-onspot.yaml Outdated
AdarshK15 previously approved these changes Jun 25, 2026

@AdarshK15 AdarshK15 left a comment

Member

LGTM. Please run a few daily tests and the PR tests against this commit to verify.

Comment on lines +15 to +20
- name: Set Triage Agent Configuration
  ansible.builtin.set_fact:
    triage_gcs_bucket: "{{ triage_gcs_bucket_override | default('g-ift-agent-bucket') }}"
    triage_project_number: "{{ triage_project_number_override | default('508417052821') }}"
    triage_invoker_sa: "{{ triage_invoker_sa_override | default('triage-invoker@hpc-toolkit-dev.iam.gserviceaccount.com') }}"
    triage_cloud_run_url: "{{ triage_cloud_run_url_override | default('https://failure-triage-agent-508417052821.us-central1.run.app') }}"

Member

Since cluster-toolkit is open source, internal GCP project numbers (508417052821), buckets, and service accounts should not be hardcoded as fallback defaults.

Please remove these internal defaults (defaulting them to '' instead) and pass them exclusively via extra-vars in the Cloud Build CI configuration. You can add a quick check in the prerequisite step to skip the agent if these variables are missing.
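
One possible shape for this change, as a sketch rather than the final implementation: the triage_agent_configured fact is a hypothetical name, and the override variables are assumed to arrive via extra-vars from the Cloud Build configuration.

```yaml
- name: Set triage agent configuration from extra-vars only
  ansible.builtin.set_fact:
    triage_gcs_bucket: "{{ triage_gcs_bucket_override | default('') }}"
    triage_project_number: "{{ triage_project_number_override | default('') }}"
    triage_invoker_sa: "{{ triage_invoker_sa_override | default('') }}"
    triage_cloud_run_url: "{{ triage_cloud_run_url_override | default('') }}"

- name: Decide whether the agent is fully configured
  ansible.builtin.set_fact:
    # True only when all four values are non-empty.
    triage_agent_configured: >-
      {{ [triage_gcs_bucket, triage_project_number,
          triage_invoker_sa, triage_cloud_run_url]
         | select | list | length == 4 }}

- name: Skip the triage agent when configuration is missing
  ansible.builtin.debug:
    msg: "Triage agent configuration is incomplete; skipping."
  when: not (triage_agent_configured | bool)
```

The remaining trigger tasks would then carry `when: triage_agent_configured | bool`, so forks and external contributors without the internal values never attempt to call the agent.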

Labels

release-improvements Added to release notes under the "Improvements" heading.

4 participants