DevOps Incident Triage Model

Portfolio-grade NLP and MLOps project for classifying DevOps incident text into the most relevant operational domain for first-pass routing.

This repository focuses on production-minded engineering rather than demo-only modeling:

reproducible training and evaluation pipelines
local inference, FastAPI serving, and async batch jobs
observability with request tracing and Prometheus-style metrics
Docker, CI, release workflow, and Hugging Face publishing
explicit documentation of data limitations and operational scope

Overview

The model takes incident summaries, deployment failures, and log-style operational messages and predicts which domain should review the issue first.

Current label set:

Label	Description
`k8s_cluster`	Kubernetes scheduling, node, or cluster-state issues
`cicd_pipeline`	CI/CD build, test, or deployment pipeline failures
`aws_iam_network`	AWS IAM, VPC, network, or permission-related issues
`deployment_release`	Helm, rollout, or release operation issues
`container_runtime`	Docker, containerd, image, or runtime issues
`observability_alerting`	Monitoring, logging, tracing, or alerting issues
`database_state`	Database connectivity, replication, lock, or storage-state issues
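
For reference, the label set above can be written as a pair of lookup tables of the kind a classification model config usually carries. This is a minimal sketch: the label names come from the table, but the index order and the constant names are assumptions and may not match the trained model.

```python
# Label names are taken from the table above; the index order is an assumption.
LABELS = (
    "k8s_cluster",
    "cicd_pipeline",
    "aws_iam_network",
    "deployment_release",
    "container_runtime",
    "observability_alerting",
    "database_state",
)

# Mappings in the shape commonly stored as label2id / id2label in a model config.
LABEL2ID = {label: index for index, label in enumerate(LABELS)}
ID2LABEL = {index: label for label, index in LABEL2ID.items()}
```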

Repository Scope

This repository includes more than a trained classifier:

Model training with transformers
Evaluation with confusion matrix and threshold-based review analysis
CLI inference for single and batch inputs
FastAPI endpoints for real-time and batch inference
Async batch job API for queue-like inference workflows
Benchmark automation across multiple backbone models
Release and documentation flow suitable for a portfolio-grade MLOps project

Data Honesty

The starter dataset in data/sample/incidents_synthetic.csv is synthetic.

It is not collected from a real production environment.
Reported scores should not be interpreted as validated real-world generalization.
Real anonymized incident data is required before any serious operational use.

This limitation is documented deliberately rather than hidden behind the evaluation results.

Model And Experimentation

Baseline model:

distilbert-base-uncased

Why this baseline:

DevOps logs and error messages are often English-dominant
training and iteration costs remain practical on a personal machine
the same pipeline can be reused for multilingual backbones such as xlm-roberta-base

Benchmark workflow:

uv run ditri-benchmark \
  --data-dir data/processed \
  --models distilbert-base-uncased,sentence-transformers/all-MiniLM-L6-v2,xlm-roberta-base \
  --epochs 4 \
  --skip-existing

Generated outputs:

reports/model_benchmark.json
reports/model_benchmark.md
models/benchmarks/<model-slug>/
reports/benchmarks/<model-slug>/

Quickstart

1. Environment

uv python install 3.12
uv sync --extra dev --extra api --extra viz --extra peft --extra gradio

2. Data Preparation

Using the synthetic starter dataset:

uv run ditri-data-prep \
  --input-path data/sample/incidents_synthetic.csv \
  --output-dir data/processed \
  --seed 42

Using anonymized real data:

uv run ditri-ingest-raw \
  --input-path data/raw/incidents_template.csv \
  --output-canonical-path data/raw/incidents_canonical.csv \
  --output-training-path data/raw/incidents_training_ready.csv \
  --report-path reports/raw_ingestion_report.json

uv run ditri-data-prep \
  --input-path data/raw/incidents_training_ready.csv \
  --output-dir data/processed \
  --seed 42
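
The internals of ditri-data-prep are not described here. As a rough illustration of why a fixed --seed matters, the sketch below performs a seeded, label-stratified train/val/test split. The function name, the split ratios, and the (text, label) row shape are all assumptions, not the project's actual implementation.

```python
import random
from collections import defaultdict


def stratified_split(rows, seed=42, val_ratio=0.1, test_ratio=0.1):
    """Split (text, label) rows into train/val/test while keeping label balance.

    Hypothetical helper: the ratios and the row shape are illustrative only.
    """
    rng = random.Random(seed)  # local RNG so the same seed gives the same split
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))

    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_label):  # sorted for a deterministic iteration order
        group = by_label[label]
        rng.shuffle(group)
        n_val = int(len(group) * val_ratio)
        n_test = int(len(group) * test_ratio)
        splits["val"].extend(group[:n_val])
        splits["test"].extend(group[n_val:n_val + n_test])
        splits["train"].extend(group[n_val + n_test:])
    return splits
```

Running the function twice with the same seed yields identical splits, which is the property that makes later evaluation numbers comparable across runs.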

3. Training

uv run ditri-train \
  --data-dir data/processed \
  --output-dir models/devops-incident-triage \
  --model-name distilbert-base-uncased \
  --epochs 4

Optional PEFT:

uv run ditri-train \
  --data-dir data/processed \
  --output-dir models/devops-incident-triage \
  --model-name distilbert-base-uncased \
  --use-peft

4. Evaluation

uv run ditri-eval \
  --model-path models/devops-incident-triage \
  --data-dir data/processed \
  --report-dir reports \
  --confidence-thresholds 0.4,0.5,0.6,0.7

Key artifacts:

reports/evaluation_metrics.json
reports/per_label_metrics.json
reports/threshold_metrics.json
reports/confusion_matrix.csv
reports/figures/confusion_matrix.png
reports/sample_predictions.jsonl
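
To make the threshold-based review analysis concrete, the sketch below computes, for each confidence threshold, how many predictions would be auto-routed and how accurate those are. It is an illustration only: the function name, the output keys, and the use of `>=` for acceptance are assumptions and may differ from what reports/threshold_metrics.json contains.

```python
def threshold_report(predictions, thresholds=(0.4, 0.5, 0.6, 0.7)):
    """Summarize coverage and accuracy of auto-routed predictions per threshold.

    `predictions` is a list of (true_label, predicted_label, confidence) tuples.
    The output keys are hypothetical, not the project's report schema.
    """
    report = []
    total = len(predictions)
    for threshold in thresholds:
        # Predictions at or above the threshold are routed without human review.
        accepted = [p for p in predictions if p[2] >= threshold]
        correct = sum(1 for true, pred, _ in accepted if true == pred)
        report.append({
            "threshold": threshold,
            "coverage": len(accepted) / total if total else 0.0,
            "accuracy_on_accepted": correct / len(accepted) if accepted else None,
            "sent_to_review": total - len(accepted),
        })
    return report
```

Raising the threshold typically trades coverage for accuracy on the accepted subset, so a table like this helps pick the value passed to --confidence-threshold at inference time.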

Inference And Serving

CLI Prediction

Single input:

uv run ditri-predict \
  --model-path models/devops-incident-triage \
  --confidence-threshold 0.6 \
  --review-queue sre_manual_triage \
  --text "EKS worker nodes became NotReady after CNI upgrade."

Batch input:

uv run ditri-predict \
  --model-path models/devops-incident-triage \
  --input-file data/sample/incidents_synthetic.csv \
  --text-column text \
  --output-file reports/batch_predictions.jsonl
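
The --confidence-threshold and --review-queue flags in the single-input example suggest a simple routing rule: trust the top label when the model is confident, otherwise hand the incident to a human queue. The sketch below shows one way such a rule could look. The class, the function, and the convention of naming the target queue after the label are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RoutingDecision:
    label: str
    confidence: float
    needs_review: bool
    queue: str


def route(probabilities, confidence_threshold=0.6, review_queue="sre_manual_triage"):
    """Pick the top label and fall back to a review queue when confidence is low."""
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    needs_review = confidence < confidence_threshold
    # Assumed convention: confident predictions go to a queue named after the label.
    queue = review_queue if needs_review else label
    return RoutingDecision(label, confidence, needs_review, queue)
```

For example, `route({"k8s_cluster": 0.55, "container_runtime": 0.45})` falls below the default threshold and is sent to the review queue, while a 0.82 score for `k8s_cluster` is routed directly.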

Demo Showcase Report

For live demos and portfolio sharing, generate a curated showcase report from representative incident examples:

uv run ditri-demo-showcase \
  --model-path models/devops-incident-triage \
  --confidence-threshold 0.6 \
  --review-queue sre_manual_triage

Generated artifacts:

reports/demo_showcase.json
reports/demo_showcase.md

This workflow is useful when preparing terminal demos, README evidence, or GIF/video walkthroughs because the terminal summary and saved report come from the same curated examples.

FastAPI

CONFIDENCE_THRESHOLD=0.6 REVIEW_QUEUE=sre_manual_triage BATCH_MAX_ITEMS=32 uv run ditri-api

Available endpoints:

GET /health
POST /predict
POST /predict/batch
POST /predict/batch/async
GET /predict/batch/async/{job_id}
POST /retrieve
POST /assist
GET /metrics

Operational features:

X-Request-ID response header for traceability
confidence-threshold-based human review routing
async batch job flow for queue-like consumption
preview RAG retrieval over local runbook documents
deterministic incident-assist beta response over classifier and retrieval evidence
Prometheus-compatible metrics exposure
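
The async batch job flow listed above follows a submit-then-poll pattern: one call returns a job_id and a later call checks on it. The sketch below illustrates that pattern with an in-memory store and a thread pool. The class name, the status strings, and the response keys are assumptions and do not describe the project's actual API schema.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class BatchJobStore:
    """In-memory job registry: submit() returns a job_id, status() polls it."""

    def __init__(self, predict_fn, max_workers=2):
        self._predict_fn = predict_fn
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}

    def submit(self, texts):
        job_id = uuid.uuid4().hex
        # Copy the input so later mutations by the caller do not affect the job.
        self._futures[job_id] = self._executor.submit(self._run, list(texts))
        return job_id

    def _run(self, texts):
        return [self._predict_fn(text) for text in texts]

    def status(self, job_id):
        future = self._futures.get(job_id)
        if future is None:
            return {"job_id": job_id, "status": "not_found"}
        if not future.done():
            # Covers both queued and currently executing jobs.
            return {"job_id": job_id, "status": "running"}
        if future.exception() is not None:
            return {"job_id": job_id, "status": "failed", "error": str(future.exception())}
        return {"job_id": job_id, "status": "completed", "results": future.result()}
```

An in-memory store like this loses jobs on restart, which is acceptable for a queue-like demo flow but would need a durable backend in a real deployment.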

RAG Preview Retrieval

release-2026.06-rag-preview introduces a lightweight retrieval layer for runbook evidence. It uses a local scikit-learn TF-IDF sparse vector index over docs/runbooks/ and applies a domain-aware ranking boost from the predicted classifier label.

This is a preview retrieval implementation, not a production Vector DB deployment and not a full LLM assistant.

curl -X POST http://localhost:8000/retrieve \
  -H "Content-Type: application/json" \
  -d '{
    "text": "EKS worker nodes became NotReady after a CNI upgrade.",
    "predicted_domain": "k8s_cluster",
    "top_k": 5
  }'

The response includes cited evidence with document_id, domain, section, score, citation, and excerpt fields.

Retrieval observability is exposed through /metrics with ditri_retrieval_requests_total and ditri_retrieval_latency_seconds.
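
The domain-aware ranking described in this section can be illustrated without any dependencies. The sketch below is a pure-Python stand-in for a scikit-learn TF-IDF index, not the project's implementation: it scores documents by TF-IDF cosine similarity and adds a flat bonus when a document's domain matches the predicted label. The function names, the additive form of the boost, and its 0.2 value are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(token_lists):
    """Return ({term: weight} per document, idf table) using smoothed IDF."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in token_lists]
    return vectors, idf


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, docs, predicted_domain=None, top_k=5, domain_boost=0.2):
    """Rank docs by TF-IDF cosine similarity plus a flat boost for a matching domain."""
    vectors, idf = tfidf_vectors([tokenize(doc["text"]) for doc in docs])
    # Query terms unseen in the corpus carry no weight and are dropped.
    query_vec = {t: c * idf[t] for t, c in Counter(tokenize(query)).items() if t in idf}
    scored = []
    for doc, vec in zip(docs, vectors):
        score = cosine(query_vec, vec)
        if predicted_domain and doc["domain"] == predicted_domain:
            score += domain_boost  # assumed additive boost
        scored.append({"document_id": doc["document_id"], "domain": doc["domain"], "score": score})
    return sorted(scored, key=lambda item: item["score"], reverse=True)[:top_k]
```

With an additive boost, a runbook from the predicted domain can outrank a slightly better lexical match from another domain, which is the intended effect of letting the classifier steer retrieval.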

Incident Assist Beta

release-2026.07-incident-assist-beta adds a deterministic POST /assist flow that combines classifier output, retrieved runbook evidence, citations, and safety notes.

This beta endpoint is intentionally LLM-ready but not LLM-powered yet. It does not call an external model or execute remediation actions.

curl -X POST http://localhost:8000/assist \
  -H "Content-Type: application/json" \
  -d '{
    "text": "GitHub Actions deployment failed because the runner could not assume the production IAM role.",
    "top_k": 5
  }'

The response includes incident, retrieval, assistant_response, and metadata sections so the guidance remains auditable and citation-grounded.
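
As an illustration of what "deterministic" means here, the sketch below assembles a response from classifier output and retrieved evidence using fixed templates, with no model call. Only the four top-level section names come from the description above. Every nested field, the function name, and the wording of the safety note are hypothetical.

```python
SAFETY_NOTE = "Guidance only: verify against the cited runbook before acting."


def build_assist_response(text, prediction, evidence, confidence_threshold=0.6):
    """Assemble a templated response from classifier output and retrieved evidence.

    `prediction` is assumed to look like {"label": ..., "confidence": ...} and each
    evidence item is assumed to carry "citation" and "excerpt" keys.
    """
    citations = [item["citation"] for item in evidence]
    steps = [f"Review {item['citation']}: {item['excerpt']}" for item in evidence]
    return {
        "incident": {
            "text": text,
            "predicted_domain": prediction["label"],
            "confidence": prediction["confidence"],
            "needs_review": prediction["confidence"] < confidence_threshold,
        },
        "retrieval": {"top_k": len(evidence), "evidence": evidence},
        "assistant_response": {
            "summary": f"Likely domain: {prediction['label']}.",
            "suggested_steps": steps,
            "citations": citations,
            "safety_notes": [SAFETY_NOTE],
        },
        # No external model is called, so identical inputs give identical outputs.
        "metadata": {"mode": "deterministic", "llm_used": False},
    }
```

Because every suggested step is derived from a cited excerpt, a reviewer can trace each line of guidance back to a runbook section.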

Delivery And Release

This repository follows a lightweight GitFlow-style process:

main: release-ready branch
develop: integration branch
feature/*: scoped feature work
release/*: release stabilization

Current stable release:

release-2026.05-classifier-core

Release Strategy

This project now uses a product-style Release Train in addition to traditional semantic versioning. The classifier is already a useful stable baseline, but the next phase is broader than a single model version: the roadmap extends the project toward a future Classifier + RAG + LLM DevOps Incident Triage Assistant.

Release train naming makes the roadmap easier to read as a product plan. Each release tag describes the delivery window and focus area, while the channel communicates maturity. The current stable baseline remains the Transformer-based classifier core; RAG preview retrieval is implemented, and assistant features are evolving through beta releases before any production-style LLM integration.

Release channels:

Channel	Meaning
`experimental`	Early prototype or internal validation work
`preview`	Feature-complete direction, not production-ready
`beta`	Integrated and tested, but still evolving
`stable`	Production-style release with documentation, tests, and monitoring

Release roadmap:

Release tag	Channel	Focus
`release-2026.05-classifier-core`	stable	Transformer classifier, FastAPI inference, batch jobs, evaluation reports, Docker, CI
`release-2026.06-rag-preview`	preview	Runbook corpus loading, domain-aware TF-IDF retrieval, preview vector index selection, `/retrieve` API
`release-2026.07-incident-assist-beta`	beta	Classifier + RAG integration, deterministic `/assist` API, evidence citations, LLM-ready response contract
`release-2026.08-eval-observability`	beta	RAG evaluation, groundedness checks, hallucination checks, retrieval/generation latency metrics
`release-2026.09-cloud-stable`	stable	AWS deployment roadmap, production-style service architecture, monitoring, CI/CD release flow
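
The tags in the roadmap follow a visible release-YYYY.MM-focus pattern. The sketch below parses that pattern so tags can be validated or sorted by delivery window. The regular expression is inferred from the five tags listed above, and the helper names are hypothetical.

```python
import re
from typing import NamedTuple

# Pattern inferred from the tags in the roadmap table; an assumption, not a spec.
RELEASE_TAG = re.compile(r"^release-(?P<year>\d{4})\.(?P<month>\d{2})-(?P<focus>[a-z0-9-]+)$")


class ReleaseTag(NamedTuple):
    year: int
    month: int
    focus: str


def parse_release_tag(tag):
    """Split a release-train tag into its delivery window and focus area."""
    match = RELEASE_TAG.match(tag)
    if match is None:
        raise ValueError(f"not a release-train tag: {tag!r}")
    month = int(match["month"])
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in tag: {tag!r}")
    return ReleaseTag(int(match["year"]), month, match["focus"])
```

Since the parsed tuple starts with year and month, `sorted(tags, key=parse_release_tag)` orders releases chronologically.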

Concise roadmap:

Phase	Outcome
Classifier core	Keep the current Transformer classifier as the stable routing baseline
RAG preview	Retrieve cited runbook evidence with a lightweight local vector index
Incident assistant beta	Combine predicted domain, retrieved evidence, deterministic guidance, and citations
Evaluation and observability	Measure retrieval quality, groundedness, citations, latency, and service health
Cloud stable	Document an AWS-ready service shape with monitoring and release operations

Detailed planning docs:

Hugging Face Publishing

export HF_TOKEN="hf_xxx"

uv run ditri-publish \
  --model-dir models/devops-incident-triage \
  --repo-id <your-hf-username>/devops-incident-triage

The publish flow copies docs/model_card.md into the model artifact directory as README.md when needed.

Repository Layout

.
├─ src/devops_incident_triage/
│  ├─ data_prep.py
│  ├─ train.py
│  ├─ evaluate.py
│  ├─ benchmark_models.py
│  ├─ predict.py
│  ├─ api.py
│  ├─ hf_publish.py
│  └─ ingest_raw.py
├─ tests/
├─ data/
├─ reports/
├─ models/
├─ docs/
├─ .github/workflows/
├─ Dockerfile
├─ Makefile
└─ pyproject.toml

Limitations

training data is synthetic in the current public baseline
the task is single-label even though real incidents may span multiple domains
long multi-line logs and highly noisy contexts need additional validation
the model is intended for triage support, not autonomous remediation

License

MIT
