Add Chaibot test failure triage workflow to ci-chat-bot by chaclark1974 · Pull Request #80476 · openshift/release

chaclark1974 · 2026-06-12T15:42:21Z

Add Chaibot test failure triage workflow to ci-chat-bot

This PR adds Chaibot, an AI-powered Slack workflow that automatically
triages and analyzes test failures posted in designated Slack channels.

Overview

Chaibot extends the existing ci-chat-bot service to monitor Slack channels
(initially #opp-discussion) for test failure messages, analyze failures using
OpenAI GPT-4, and post detailed triage analysis in threads.

What's Added

Configuration Files

core-services/ci-chat-bot/triage-config.yaml - Main Chaibot configuration
clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml - Kubernetes ConfigMap
clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml - Prometheus alerts
core-services/ci-secret-bootstrap/chaibot-secret-config.yaml - Secret config guide

Deployment Changes

clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml - Updated with:
- Chaibot triage-config and secrets volumes
- CHAIBOT_ENABLED and OPENAI_API_KEY environment variables
- --enable-triage command line argument

Documentation

docs/chaibot-test-failure-triage.md - Comprehensive user/admin guide
core-services/ci-chat-bot/CHAIBOT.md - Quick reference
CHAIBOT_QUICKSTART.md - Quick start guide
DEPLOY_CHAIBOT.md - Deployment instructions

Features

Automatic Detection: Monitors channels for Prow job failures
AI Analysis: Uses OpenAI to categorize failures (infrastructure, flaky, bug, config)
Historical Context: Integrates with Sippy for past failure patterns
JIRA Integration: Searches for related known issues
Actionable Output: Posts analysis with recommendations in Slack threads
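
The detection step above can be sketched as follows. This is a minimal illustration, not the ci-chat-bot implementation: the keyword list and the `find_prow_failures` helper are hypothetical, and the URL pattern assumes failure messages link to `prow.ci.openshift.org` job views. The real patterns would come from triage-config.yaml.

```python
import re

# Hypothetical keyword list; the real one would be read from triage-config.yaml.
FAILURE_KEYWORDS = ("failed", "failure", "flake", "timed out")

# Assumed shape of a Prow job link. Slack wraps links as <url|label>,
# so the match stops at whitespace, '>' or '|'.
PROW_URL_RE = re.compile(r"https://prow\.ci\.openshift\.org/view/gs/[^\s>|]+")


def find_prow_failures(message_text: str) -> list[str]:
    """Return Prow job URLs from a message that also mentions a failure keyword."""
    lowered = message_text.lower()
    if not any(keyword in lowered for keyword in FAILURE_KEYWORDS):
        return []
    return PROW_URL_RE.findall(message_text)
```

Requiring both a keyword and a Prow URL keeps the bot from reacting to ordinary discussion that happens to include a job link.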

Example Output

When a failure is posted, Chaibot responds with:

Root cause identification (with confidence %)
Evidence from logs
Historical failure patterns
Specific recommendations
Links to Sippy, logs, and related JIRA issues
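
As a rough sketch of how such a response could be assembled, the snippet below renders the listed elements into a thread reply. The `TriageResult` type, its field names, and `format_thread_reply` are assumptions made for illustration; the actual response format is defined by the triage configuration and the code in openshift/ci-tools.

```python
from dataclasses import dataclass, field


@dataclass
class TriageResult:
    # Every field name here is illustrative, not taken from triage-config.yaml.
    category: str          # one of: infrastructure, flaky, bug, config
    root_cause: str
    confidence: float      # 0.0 to 1.0
    evidence: list[str] = field(default_factory=list)
    history: str = ""
    recommendations: list[str] = field(default_factory=list)
    links: dict[str, str] = field(default_factory=dict)


def format_thread_reply(result: TriageResult) -> str:
    """Render a triage result as plain text for a Slack thread reply."""
    lines = [
        f"Root cause ({result.category}, {result.confidence:.0%} confidence): "
        f"{result.root_cause}"
    ]
    if result.evidence:
        lines.append("Evidence from logs:")
        lines.extend(f"- {item}" for item in result.evidence)
    if result.history:
        lines.append(f"Historical pattern: {result.history}")
    if result.recommendations:
        lines.append("Recommendations:")
        lines.extend(f"- {item}" for item in result.recommendations)
    for label, url in result.links.items():
        lines.append(f"{label}: {url}")
    return "\n".join(lines)
```

Empty sections are skipped, so a reply with no historical data or recommendations stays short.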

Configuration Required

Before this can function, the following must be configured:

1. Slack Channel ID: Update chaibot-configmap.yaml with actual channel ID for #opp-discussion
2. OpenAI API Key: Add to ci-secret-bootstrap (see chaibot-secret-config.yaml)
3. Slack App Permissions: Ensure ci-chat-bot app has required OAuth scopes

Implementation Note

⚠️ This PR provides the complete configuration and deployment manifests, but
requires code implementation in openshift/ci-tools (cmd/ci-chat-bot) to
actually process the configuration and perform analysis.

Without the code implementation, the deployment will succeed but Chaibot
will not respond to messages (the --enable-triage flag will be ignored).

Cost Estimate

GPT-4: ~$0.03/analysis (~$90/month at 100 failures/day)
GPT-3.5-turbo: ~$0.003/analysis (~$9/month at 100 failures/day)
Rate limiting configured to prevent cost overruns
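
The monthly figures above follow from cost per analysis × failures per day × days per month. The sketch below reproduces that arithmetic; the 30-day month is an assumption, and the `monthly_cost` helper is illustrative only.

```python
def monthly_cost(cost_per_analysis: float, failures_per_day: int, days: int = 30) -> float:
    """Estimated monthly spend in dollars, assuming one analysis per failure."""
    return round(cost_per_analysis * failures_per_day * days, 2)


gpt4_estimate = monthly_cost(0.03, 100)      # GPT-4 estimate from the list above
gpt35_estimate = monthly_cost(0.003, 100)    # GPT-3.5-turbo estimate from the list above
```

With the per-analysis prices quoted above and 100 failures per day, this gives the ~$90 and ~$9 monthly estimates.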

Testing

After deployment:

1. Update ConfigMap with actual Slack channel ID
2. Configure OpenAI API key secret
3. Post test failure message with Prow URL in #opp-discussion
4. Verify Chaibot responds in thread within 60 seconds
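
The final check can be made mechanical by comparing message timestamps. The sketch below assumes timestamps in Slack's `ts` string form (epoch seconds with a fractional suffix); `replied_within` is a hypothetical helper, and fetching the thread replies from Slack is left out.

```python
def replied_within(parent_ts: str, reply_timestamps: list[str],
                   limit_seconds: float = 60.0) -> bool:
    """True if any reply was posted no more than limit_seconds after the parent."""
    parent = float(parent_ts)
    return any(0 <= float(ts) - parent <= limit_seconds for ts in reply_timestamps)
```

Passing the failure message's timestamp and the timestamps of its thread replies yields a pass/fail result for the last step.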

Summary by CodeRabbit

This PR introduces Chaibot, an AI-powered Slack workflow extension to the OpenShift CI's ci-chat-bot service that automatically triages Prow test failures. The feature monitors designated Slack channels (initially #opp-discussion) for test failure messages, analyzes them using OpenAI's language models, and posts detailed triage analyses directly in Slack threads.

Key Additions

Configuration & Deployment:

triage-config.yaml — Core Chaibot configuration defining monitored channels, failure detection patterns, AI analysis parameters, failure categorization rules (infrastructure, flaky tests, bugs, configuration), integrations (Sippy for historical context, JIRA for known issues, Prow for logs), rate limiting, and metrics settings
Kubernetes manifests — A ConfigMap (chaibot-configmap.yaml) embedding the triage configuration, updated deployment manifest (ci-chat-bot.yaml) adding volumes, environment variables, and CLI flags to enable triage, and Prometheus alert rules for monitoring Chaibot health
Secret bootstrap configuration — Adds automatic syncing of the OpenAI API key from Vault to the ci-chat-bot-chaibot-secrets secret in the ci namespace

Documentation:

Setup guides (CHAIBOT_QUICKSTART.md, DEPLOY_CHAIBOT.md) — Step-by-step instructions for deploying Chaibot, including credential setup, secret configuration, manifest application, and validation
Operational documentation (core-services/ci-chat-bot/CHAIBOT.md, docs/chaibot-test-failure-triage.md) — Detailed configuration reference, integration guidance, monitoring/metrics setup, troubleshooting, cost estimation, and security best practices

Key Features

Automatic detection of Prow job failures via configurable keywords and URL patterns
AI-powered analysis that categorizes failures and provides actionable recommendations with links to logs, Sippy queries, and JIRA issues
Slack integration with threaded responses and interactive actions (view logs, retest jobs, create JIRA tickets, mark tests as flaky)
Rate limiting & cost controls including per-user/channel limits and OpenAI usage monitoring
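
Per-user and per-channel limits of the kind described above are commonly built on a sliding window. The following is a minimal sketch under that assumption; the `SlidingWindowLimiter` class, its parameters, and the key scheme are illustrative, not the limits or mechanism defined in triage-config.yaml.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per key within the last window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        events = self._events[key]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_requests:
            return False
        events.append(now)
        return True
```

Using keys such as a channel ID or a user ID lets a single limiter instance enforce both kinds of limit independently, which also bounds the number of OpenAI calls per window.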

Important Implementation Note

The manifests and configuration are fully provided and deployment-ready, but the runtime implementation in the openshift/ci-tools repository (specifically cmd/ci-chat-bot) must be completed separately for Chaibot to function. Without that code, the deployment will not enable triage behavior despite the configuration being in place.

Infrastructure Impact

This adds a new operational capability to the OpenShift CI tooling for test failure analysis, reducing manual triage work and providing engineers with rapid, AI-assisted insight into test failures in Slack.

Related

Extends existing ci-chat-bot service
Integrates with Sippy for historical data
Complements retester for automated failure handling

/cc @openshift/test-platform

Add Vault sync configuration for the Chaibot OpenAI API key stored in selfservice/cspi-qe/chaibot-openai-key. This configures ci-secret-bootstrap to automatically sync the key from Vault to the ci-chat-bot-chaibot-secrets Kubernetes secret in the ci namespace on the app.ci cluster.

Vault path: selfservice/cspi-qe/chaibot-openai-key
Target secret: ci-chat-bot-chaibot-secrets (ci namespace, app.ci cluster)

openshift-merge-bot · 2026-06-12T15:42:37Z

[REHEARSALNOTIFIER]
@chaclark1974: no rehearsable tests are affected by this change

Note: If this PR includes changes to step registry files (ci-operator/step-registry/) and you expected jobs to be found, try rebasing your PR onto the base branch. This helps pj-rehearse accurately detect changes when the base branch has moved forward.

openshift-ci · 2026-06-12T15:42:47Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: chaclark1974
Once this PR has been reviewed and has the lgtm label, please assign jmguzik for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-06-12T15:42:51Z

Caution

Review failed

An error occurred during the review process. Please try again later.

Walkthrough

This PR adds Chaibot, an AI-powered Slack workflow that automatically triages OpenShift CI test failures. It includes quick-start and deployment documentation, triage configuration files, Kubernetes manifest updates, secrets bootstrap configuration, and comprehensive operational guides across the repository.

Changes

Chaibot Test Failure Triage Feature

| Layer / File(s) | Summary |
| --- | --- |
| Feature Overview and Quick-Start Guide: `CHAIBOT_QUICKSTART.md`, `core-services/ci-chat-bot/CHAIBOT.md` | Documents Chaibot purpose, capabilities, example analysis output, configuration fields, monitoring guidance, troubleshooting, cost estimates, and implementation requirements in the ci-tools repo. |
| Triage Configuration and Integration Setup: `core-services/ci-chat-bot/triage-config.yaml`, `clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml`, `core-services/ci-secret-bootstrap/chaibot-secret-config.yaml` | Defines triage configuration with failure detection patterns, AI analysis parameters, categorization rules with confidence thresholds, Slack response formatting, integrations (Sippy, JIRA, Prow, OpenAI), rate limiting, and monitoring settings. Also documents secrets setup for OpenAI API keys and Slack scopes. |
| Kubernetes Deployment Manifest Wiring: `clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml`, `clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml`, `core-services/ci-secret-bootstrap/_config.yaml` | Updates ci-chat-bot Deployment with triage-config ConfigMap volume, secrets volume for OpenAI API key, environment variables (`CHAIBOT_ENABLED`, `OPENAI_API_KEY`), and startup arguments (`--enable-triage=true`, config path). Includes reference patch with PrometheusRule alerts (API error rate, timeout, service downtime) and secret distribution configuration. |
| Comprehensive Deployment and Operations Guide: `DEPLOY_CHAIBOT.md`, `docs/chaibot-test-failure-triage.md` | Provides step-by-step deployment runbook, functional testing instructions, troubleshooting procedures, operational readiness checklist, rollback guidance, and comprehensive user/operations guide covering usage modes, configuration options, Prometheus metrics/alerting, cost analysis, security practices, local development, feature additions, and capability roadmap. |
| Owner Alias Update: `OWNERS_ALIASES` | Adds team member to cspi-qe-ocp-lp alias list. |

🎯 2 (Simple) | ⏱️ ~12 minutes


chaclark1974 · 2026-06-12T16:10:19Z

/retest

chaclark1974 · 2026-06-12T16:24:35Z

/retest

openshift-ci · 2026-06-12T16:27:48Z

@chaclark1974: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/ci-secret-bootstrap-config-validation | `8423b12` | link | true | `/test ci-secret-bootstrap-config-validation` |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

chaclark1974 added 3 commits May 19, 2026 14:59

Added my RH user to cspi-qe-ocp-lp group

62baadd

openshift-ci Bot requested a review from a team June 12, 2026 15:42

openshift-merge-bot Bot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Jun 12, 2026

chaclark1974 wants to merge 3 commits into openshift:main from chaclark1974:chaibot-test-triage
