Add Chaibot test failure triage workflow to ci-chat-bot#80476

Open
chaclark1974 wants to merge 3 commits into
openshift:mainfrom
chaclark1974:chaibot-test-triage

@chaclark1974 chaclark1974 commented Jun 12, 2026

Contributor

Add Chaibot test failure triage workflow to ci-chat-bot

This PR adds Chaibot, an AI-powered Slack workflow that automatically
triages and analyzes test failures posted in designated Slack channels.

Overview

Chaibot extends the existing ci-chat-bot service to monitor Slack channels
(initially #opp-discussion) for test failure messages, analyze failures using
OpenAI GPT-4, and post detailed triage analysis in threads.

What's Added

Configuration Files

  • core-services/ci-chat-bot/triage-config.yaml - Main Chaibot configuration
  • clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml - Kubernetes ConfigMap
  • clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml - Prometheus alerts
  • core-services/ci-secret-bootstrap/chaibot-secret-config.yaml - Secret config guide

Deployment Changes

  • clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml - Updated with:
    • Chaibot triage-config and secrets volumes
    • CHAIBOT_ENABLED and OPENAI_API_KEY environment variables
    • --enable-triage command line argument

Documentation

  • docs/chaibot-test-failure-triage.md - Comprehensive user/admin guide
  • core-services/ci-chat-bot/CHAIBOT.md - Quick reference
  • CHAIBOT_QUICKSTART.md - Quick start guide
  • DEPLOY_CHAIBOT.md - Deployment instructions

Features

  • Automatic Detection: Monitors channels for Prow job failures
  • AI Analysis: Uses OpenAI to categorize failures (infrastructure, flaky, bug, config)
  • Historical Context: Integrates with Sippy for past failure patterns
  • JIRA Integration: Searches for related known issues
  • Actionable Output: Posts analysis with recommendations in Slack threads
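The Automatic Detection feature could be sketched roughly as below. This is an illustration only, not the ci-chat-bot implementation: the keyword list, the URL pattern, and the function name `find_prow_failure_url` are assumptions made for the example, and the real values would come from `triage-config.yaml`.

```python
import re
from typing import Optional

# Hypothetical detection settings (assumed for illustration; the actual
# keywords and URL patterns are defined in triage-config.yaml).
FAILURE_KEYWORDS = ("failed", "failure", "flake")
# Slack wraps links as <url> or <url|label>, so stop at whitespace, '>' or '|'.
PROW_URL_PATTERN = re.compile(r"https://prow\.ci\.openshift\.org/view/[^\s>|]+")


def find_prow_failure_url(message: str) -> Optional[str]:
    """Return the first Prow job URL if the message looks like a failure report."""
    text = message.lower()
    # Require at least one failure keyword before looking for a job link.
    if not any(keyword in text for keyword in FAILURE_KEYWORDS):
        return None
    match = PROW_URL_PATTERN.search(message)
    return match.group(0) if match else None
```

A message that contains a failure keyword but no Prow link, or a Prow link but no failure keyword, is ignored in this sketch, which keeps the bot from replying to ordinary discussion.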

Example Output

When a failure is posted, Chaibot responds with:

  • Root cause identification (with confidence %)
  • Evidence from logs
  • Historical failure patterns
  • Specific recommendations
  • Links to Sippy, logs, and related JIRA issues
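One way to picture the reply described above is as a small record that is rendered into a thread message. The sketch below is hypothetical: the `TriageResult` fields, the `Category` enum, and the text layout are assumptions chosen to mirror the bullet points, not the format Chaibot actually emits.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Category(Enum):
    # The four categories named in the PR description.
    INFRASTRUCTURE = "infrastructure"
    FLAKY = "flaky"
    BUG = "bug"
    CONFIG = "config"


@dataclass
class TriageResult:
    """Hypothetical container for one analysis (field names are assumed)."""
    category: Category
    confidence: float  # 0.0 to 1.0
    root_cause: str
    evidence: List[str] = field(default_factory=list)
    history: str = ""
    recommendations: List[str] = field(default_factory=list)
    links: Dict[str, str] = field(default_factory=dict)


def render_thread_reply(result: TriageResult) -> str:
    """Render the result as plain text suitable for a Slack thread reply."""
    lines = [
        f"Root cause ({result.category.value}, "
        f"{result.confidence:.0%} confidence): {result.root_cause}"
    ]
    if result.evidence:
        lines.append("Evidence from logs:")
        lines.extend(f"- {item}" for item in result.evidence)
    if result.history:
        lines.append(f"Historical pattern: {result.history}")
    if result.recommendations:
        lines.append("Recommendations:")
        lines.extend(f"- {item}" for item in result.recommendations)
    if result.links:
        lines.append("Links:")
        lines.extend(f"- {name}: {url}" for name, url in result.links.items())
    return "\n".join(lines)
```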

Configuration Required

Before this can function, the following must be configured:

  1. Slack Channel ID: Update chaibot-configmap.yaml with the actual channel ID for #opp-discussion
  2. OpenAI API Key: Add the key to ci-secret-bootstrap (see chaibot-secret-config.yaml)
  3. Slack App Permissions: Ensure the ci-chat-bot app has the required OAuth scopes

Implementation Note

⚠️ This PR provides the complete configuration and deployment manifests, but
requires code implementation in openshift/ci-tools (cmd/ci-chat-bot) to
actually process the configuration and perform analysis.

Without the code implementation, the deployment will succeed but Chaibot
will not respond to messages (the --enable-triage flag will be ignored).

Cost Estimate

  • GPT-4: ~$0.03/analysis (~$90/month at 100 failures/day)
  • GPT-3.5-turbo: ~$0.003/analysis (~$9/month at 100 failures/day)
  • Rate limiting configured to prevent cost overruns
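The monthly figures follow from multiplying the per-analysis cost by the daily volume and the number of days. The sketch below reproduces that arithmetic; the 30-day month is an assumption inferred from the numbers above, and `monthly_cost` is a hypothetical helper name.

```python
def monthly_cost(cost_per_analysis: float, failures_per_day: int,
                 days_per_month: int = 30) -> float:
    """Estimate monthly spend, assuming one analysis per failure."""
    return cost_per_analysis * failures_per_day * days_per_month


# Per-analysis costs taken from the estimate above.
for model, cost in (("GPT-4", 0.03), ("GPT-3.5-turbo", 0.003)):
    print(f"{model}: ${monthly_cost(cost, 100):.2f}/month")
```

The estimate scales linearly, so a channel that sees more than 100 failures a day raises the monthly cost in proportion unless rate limiting caps the number of analyses.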

Testing

After deployment:

  1. Update ConfigMap with actual Slack channel ID
  2. Configure OpenAI API key secret
  3. Post test failure message with Prow URL in #opp-discussion
  4. Verify Chaibot responds in thread within 60 seconds

Related

  • Extends existing ci-chat-bot service
  • Integrates with Sippy for historical data
  • Complements retester for automated failure handling

/cc @openshift/test-platform

Summary by CodeRabbit

This PR introduces Chaibot, an AI-powered Slack workflow extension to the OpenShift CI's ci-chat-bot service that automatically triages Prow test failures. The feature monitors designated Slack channels (initially #opp-discussion) for test failure messages, analyzes them using OpenAI's language models, and posts detailed triage analyses directly in Slack threads.

Key Additions

Configuration & Deployment:

  • triage-config.yaml — Core Chaibot configuration defining monitored channels, failure detection patterns, AI analysis parameters, failure categorization rules (infrastructure, flaky tests, bugs, configuration), integrations (Sippy for historical context, JIRA for known issues, Prow for logs), rate limiting, and metrics settings
  • Kubernetes manifests — A ConfigMap (chaibot-configmap.yaml) embedding the triage configuration, updated deployment manifest (ci-chat-bot.yaml) adding volumes, environment variables, and CLI flags to enable triage, and Prometheus alert rules for monitoring Chaibot health
  • Secret bootstrap configuration — Adds automatic syncing of the OpenAI API key from Vault to the ci-chat-bot-chaibot-secrets secret in the ci namespace

Documentation:

  • Setup guides (CHAIBOT_QUICKSTART.md, DEPLOY_CHAIBOT.md) — Step-by-step instructions for deploying Chaibot, including credential setup, secret configuration, manifest application, and validation
  • Operational documentation (core-services/ci-chat-bot/CHAIBOT.md, docs/chaibot-test-failure-triage.md) — Detailed configuration reference, integration guidance, monitoring/metrics setup, troubleshooting, cost estimation, and security best practices

Key Features

  • Automatic detection of Prow job failures via configurable keywords and URL patterns
  • AI-powered analysis that categorizes failures and provides actionable recommendations with links to logs, Sippy queries, and JIRA issues
  • Slack integration with threaded responses and interactive actions (view logs, retest jobs, create JIRA tickets, mark tests as flaky)
  • Rate limiting & cost controls including per-user/channel limits and OpenAI usage monitoring
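The per-user and per-channel limits mentioned above could be enforced with a sliding-window counter keyed by user or channel ID. The following is a minimal sketch under that assumption; the class name, the limit values, and the choice of a sliding window are illustrative and are not taken from the PR.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowLimiter:
    """Allow at most `limit` events per key within `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        """Record an event for `key` and report whether it is within the limit."""
        now = time.monotonic() if now is None else now
        events = self._events[key]
        # Drop timestamps that have fallen out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.limit:
            return False
        events.append(now)
        return True


# Usage: one limiter per scope, keyed by a (hypothetical) Slack user ID.
per_user = SlidingWindowLimiter(limit=2, window_seconds=60)
print(per_user.allow("U123", now=0.0), per_user.allow("U123", now=1.0),
      per_user.allow("U123", now=2.0))
```

Keeping a separate limiter for users and for channels means a single busy user cannot exhaust the budget for a whole channel, and rejected requests never reach the OpenAI API.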

Important Implementation Note

The manifests and configuration are fully provided and deployment-ready, but the runtime implementation in the openshift/ci-tools repository (specifically cmd/ci-chat-bot) must be completed separately for Chaibot to function. Without that code, the deployment will not enable triage behavior despite the configuration being in place.

Infrastructure Impact

This adds a new operational capability to the OpenShift CI tooling for test failure analysis, reducing manual triage work and providing engineers with rapid, AI-assisted insight into test failures in Slack.

Add Vault sync configuration for the Chaibot OpenAI API key stored in
selfservice/cspi-qe/chaibot-openai-key.

This configures ci-secret-bootstrap to automatically sync the key from
Vault to the ci-chat-bot-chaibot-secrets Kubernetes secret in the ci
namespace on the app.ci cluster.

Vault path: selfservice/cspi-qe/chaibot-openai-key
Target secret: ci-chat-bot-chaibot-secrets (ci namespace, app.ci cluster)
@openshift-ci openshift-ci Bot requested a review from a team June 12, 2026 15:42
@openshift-merge-bot

Contributor

[REHEARSALNOTIFIER]
@chaclark1974: no rehearsable tests are affected by this change

Note: If this PR includes changes to step registry files (ci-operator/step-registry/) and you expected jobs to be found, try rebasing your PR onto the base branch. This helps pj-rehearse accurately detect changes when the base branch has moved forward.

@openshift-merge-bot openshift-merge-bot Bot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Jun 12, 2026
@openshift-ci

openshift-ci Bot commented Jun 12, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: chaclark1974
Once this PR has been reviewed and has the lgtm label, please assign jmguzik for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Contributor

Caution

Review failed

An error occurred during the review process. Please try again later.

Walkthrough

This PR adds Chaibot, an AI-powered Slack workflow that automatically triages OpenShift CI test failures. It includes quick-start and deployment documentation, triage configuration files, Kubernetes manifest updates, secrets bootstrap configuration, and comprehensive operational guides across the repository.

Changes

Chaibot Test Failure Triage Feature

| Layer / File(s) | Summary |
| --- | --- |
| Feature Overview and Quick-Start Guide: CHAIBOT_QUICKSTART.md, core-services/ci-chat-bot/CHAIBOT.md | Documents Chaibot purpose, capabilities, example analysis output, configuration fields, monitoring guidance, troubleshooting, cost estimates, and implementation requirements in the ci-tools repo. |
| Triage Configuration and Integration Setup: core-services/ci-chat-bot/triage-config.yaml, clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml, core-services/ci-secret-bootstrap/chaibot-secret-config.yaml | Defines triage configuration with failure detection patterns, AI analysis parameters, categorization rules with confidence thresholds, Slack response formatting, integrations (Sippy, JIRA, Prow, OpenAI), rate limiting, and monitoring settings. Also documents secrets setup for OpenAI API keys and Slack scopes. |
| Kubernetes Deployment Manifest Wiring: clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml, clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml, core-services/ci-secret-bootstrap/_config.yaml | Updates ci-chat-bot Deployment with triage-config ConfigMap volume, secrets volume for OpenAI API key, environment variables (CHAIBOT_ENABLED, OPENAI_API_KEY), and startup arguments (--enable-triage=true, config path). Includes reference patch with PrometheusRule alerts (API error rate, timeout, service downtime) and secret distribution configuration. |
| Comprehensive Deployment and Operations Guide: DEPLOY_CHAIBOT.md, docs/chaibot-test-failure-triage.md | Provides step-by-step deployment runbook, functional testing instructions, troubleshooting procedures, operational readiness checklist, rollback guidance, and comprehensive user/operations guide covering usage modes, configuration options, Prometheus metrics/alerting, cost analysis, security practices, local development, feature additions, and capability roadmap. |
| Owner Alias Update: OWNERS_ALIASES | Adds team member to cspi-qe-ocp-lp alias list. |

🎯 2 (Simple) | ⏱️ ~12 minutes


@chaclark1974

Contributor Author

/retest

1 similar comment
@chaclark1974

Contributor Author

/retest

@openshift-ci

openshift-ci Bot commented Jun 12, 2026

Contributor

@chaclark1974: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/ci-secret-bootstrap-config-validation | 8423b12 | link | true | /test ci-secret-bootstrap-config-validation |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
