dongkoony/DevOps-Incident-Triage-Model

DevOps Incident Triage Model

Portfolio-grade NLP and MLOps project for classifying DevOps incident text into the most relevant operational domain for first-pass routing.

This repository focuses on production-minded engineering rather than demo-only modeling:

  • reproducible training and evaluation pipelines
  • local inference, FastAPI serving, and async batch jobs
  • observability with request tracing and Prometheus-style metrics
  • Docker, CI, release workflow, and Hugging Face publishing
  • explicit documentation of data limitations and operational scope

Overview

The model takes incident summaries, deployment failures, and log-style operational messages and predicts which domain should review the issue first.

Current label set:

| Label | Description |
| --- | --- |
| k8s_cluster | Kubernetes scheduling, node, or cluster-state issues |
| cicd_pipeline | CI/CD build, test, or deployment pipeline failures |
| aws_iam_network | AWS IAM, VPC, network, or permission-related issues |
| deployment_release | Helm, rollout, or release operation issues |
| container_runtime | Docker, containerd, image, or runtime issues |
| observability_alerting | Monitoring, logging, tracing, or alerting issues |
| database_state | Database connectivity, replication, lock, or storage-state issues |
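
A classifier head needs a stable integer id for each of these labels. The sketch below builds the `label2id` and `id2label` dictionaries in the shape that Hugging Face model configs accept. The label names come from the table above; the index order is an assumption and may not match the order used by the trained model.

```python
# Label names copied from the label table; the index order is an assumption.
LABELS = [
    "k8s_cluster",
    "cicd_pipeline",
    "aws_iam_network",
    "deployment_release",
    "container_runtime",
    "observability_alerting",
    "database_state",
]

# Forward and reverse mappings between label names and class ids.
label2id = {label: idx for idx, label in enumerate(LABELS)}
id2label = {idx: label for label, idx in label2id.items()}
```

Keeping both dictionaries next to the model artifacts makes it possible to decode predictions without re-reading the training data.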

Repository Scope

This repository includes more than a trained classifier.

  • Model training with transformers
  • Evaluation with confusion matrix and threshold-based review analysis
  • CLI inference for single and batch inputs
  • FastAPI endpoints for real-time and batch inference
  • Async batch job API for queue-like inference workflows
  • Benchmark automation across multiple backbone models
  • Release and documentation flow suitable for a portfolio-grade MLOps project

Data Honesty

The starter dataset in data/sample/incidents_synthetic.csv is synthetic.

  • It is not collected from a real production environment.
  • Reported scores should not be interpreted as validated real-world generalization.
  • Real anonymized incident data is required before any serious operational use.

This limitation is documented deliberately; it is not hidden behind the evaluation results.

Model And Experimentation

Baseline model:

  • distilbert-base-uncased

Why this baseline:

  • DevOps logs and error messages are mostly written in English
  • training and iteration costs stay practical on a personal machine
  • the same pipeline can be reused with multilingual backbones such as xlm-roberta-base

Benchmark workflow:

uv run ditri-benchmark \
  --data-dir data/processed \
  --models distilbert-base-uncased,sentence-transformers/all-MiniLM-L6-v2,xlm-roberta-base \
  --epochs 4 \
  --skip-existing

Generated outputs:

  • reports/model_benchmark.json
  • reports/model_benchmark.md
  • models/benchmarks/<model-slug>/
  • reports/benchmarks/<model-slug>/
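
The per-model output directories need a filesystem-safe name, because Hub ids such as sentence-transformers/all-MiniLM-L6-v2 contain a slash. The exact slug rule used by ditri-benchmark is not documented here, so the function below is a hypothetical illustration of one reasonable rule, not the repository's implementation.

```python
import re


def model_slug(model_name: str) -> str:
    """Turn a Hub model id into a directory-safe slug (hypothetical rule)."""
    # Replace every run of characters outside [A-Za-z0-9._-] with one hyphen.
    slug = re.sub(r"[^A-Za-z0-9._-]+", "-", model_name.strip())
    # Trim stray hyphens and lowercase the result for stable directory names.
    return slug.strip("-").lower()
```

Under this assumed rule, the three benchmark backbones map to distinct directory names, so `--skip-existing` style checks can rely on directory presence.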

Quickstart

1. Environment

uv python install 3.12
uv sync --extra dev --extra api --extra viz --extra peft --extra gradio

2. Data Preparation

Using the synthetic starter dataset:

uv run ditri-data-prep \
  --input-path data/sample/incidents_synthetic.csv \
  --output-dir data/processed \
  --seed 42

Using anonymized real data:

uv run ditri-ingest-raw \
  --input-path data/raw/incidents_template.csv \
  --output-canonical-path data/raw/incidents_canonical.csv \
  --output-training-path data/raw/incidents_training_ready.csv \
  --report-path reports/raw_ingestion_report.json

uv run ditri-data-prep \
  --input-path data/raw/incidents_training_ready.csv \
  --output-dir data/processed \
  --seed 42
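
The `--seed 42` flag in the commands above implies that data preparation is meant to be reproducible. What ditri-data-prep does internally is not shown in this README, so the following is only a sketch of a seeded, label-stratified split using the standard library. The `label` column name, the split names, and the split fractions are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="label", val_frac=0.1, test_frac=0.1, seed=42):
    """Split rows into train/validation/test while keeping label proportions."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    splits = {"train": [], "validation": [], "test": []}
    # Iterate labels in sorted order so the result depends only on the seed.
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        n_val = int(len(group) * val_frac)
        n_test = int(len(group) * test_frac)
        splits["validation"].extend(group[:n_val])
        splits["test"].extend(group[n_val:n_val + n_test])
        splits["train"].extend(group[n_val + n_test:])
    return splits
```

Running the function twice with the same seed and input returns identical splits, which is the property a fixed seed is meant to guarantee.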

3. Training

uv run ditri-train \
  --data-dir data/processed \
  --output-dir models/devops-incident-triage \
  --model-name distilbert-base-uncased \
  --epochs 4

Optional PEFT:

uv run ditri-train \
  --data-dir data/processed \
  --output-dir models/devops-incident-triage \
  --model-name distilbert-base-uncased \
  --use-peft

4. Evaluation

uv run ditri-eval \
  --model-path models/devops-incident-triage \
  --data-dir data/processed \
  --report-dir reports \
  --confidence-thresholds 0.4,0.5,0.6,0.7

Key artifacts:

  • reports/evaluation_metrics.json
  • reports/per_label_metrics.json
  • reports/threshold_metrics.json
  • reports/confusion_matrix.csv
  • reports/figures/confusion_matrix.png
  • reports/sample_predictions.jsonl
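
The `--confidence-thresholds` sweep trades coverage against accuracy: a higher threshold sends more incidents to human review but should leave fewer wrong automatic routes. The sketch below shows one way such a table could be computed from per-example confidences and correctness flags. The field names are hypothetical and are not taken from reports/threshold_metrics.json.

```python
def threshold_report(confidences, correct, thresholds):
    """Report coverage and accepted-set accuracy for each confidence threshold."""
    total = len(confidences)
    rows = []
    for threshold in thresholds:
        # Predictions at or above the threshold are routed automatically.
        accepted = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
        rows.append({
            "threshold": threshold,
            "coverage": len(accepted) / total if total else 0.0,
            # Accuracy is undefined when nothing is accepted.
            "accepted_accuracy": sum(accepted) / len(accepted) if accepted else None,
            "review_count": total - len(accepted),
        })
    return rows
```

Reading the rows from the lowest to the highest threshold shows where extra manual review stops buying additional accuracy.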

Inference And Serving

CLI Prediction

Single input:

uv run ditri-predict \
  --model-path models/devops-incident-triage \
  --confidence-threshold 0.6 \
  --review-queue sre_manual_triage \
  --text "EKS worker nodes became NotReady after CNI upgrade."

Batch input:

uv run ditri-predict \
  --model-path models/devops-incident-triage \
  --input-file data/sample/incidents_synthetic.csv \
  --text-column text \
  --output-file reports/batch_predictions.jsonl
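
The `--confidence-threshold` and `--review-queue` flags in the single-input command suggest a simple routing rule: accept the top label when its probability clears the threshold, otherwise hand the incident to a manual queue. A minimal sketch of that rule follows. The output field names are hypothetical, and the real CLI output format may differ.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities with the usual max-shift for stability."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def route(logits, labels, confidence_threshold=0.6, review_queue="sre_manual_triage"):
    """Pick the top label and decide whether a human should review it."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    needs_review = confidence < confidence_threshold
    return {
        "predicted_label": labels[best],
        "confidence": confidence,
        "needs_review": needs_review,
        # Low-confidence predictions go to the review queue instead of the domain team.
        "routed_to": review_queue if needs_review else labels[best],
    }
```

The predicted label is kept even when review is required, so the reviewer still sees the model's best guess.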

Demo Showcase Report

For live demos and portfolio sharing, generate a curated showcase report from representative incident examples:

uv run ditri-demo-showcase \
  --model-path models/devops-incident-triage \
  --confidence-threshold 0.6 \
  --review-queue sre_manual_triage

Generated artifacts:

  • reports/demo_showcase.json
  • reports/demo_showcase.md

This workflow is useful when preparing terminal demos, README evidence, or GIF/video walkthroughs because the terminal summary and saved report come from the same curated examples.

FastAPI

CONFIDENCE_THRESHOLD=0.6 REVIEW_QUEUE=sre_manual_triage BATCH_MAX_ITEMS=32 uv run ditri-api
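
The command above configures the service through environment variables. A minimal sketch of how such settings could be read and validated at startup is shown below. The `ServingSettings` class and the fallback defaults are assumptions (the defaults simply mirror the values in the example command); only the three variable names come from this README.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingSettings:
    confidence_threshold: float
    review_queue: str
    batch_max_items: int


def load_settings(env=None):
    """Read serving settings from the environment (or a dict, for testing)."""
    env = os.environ if env is None else env
    settings = ServingSettings(
        confidence_threshold=float(env.get("CONFIDENCE_THRESHOLD", "0.6")),
        review_queue=env.get("REVIEW_QUEUE", "sre_manual_triage"),
        batch_max_items=int(env.get("BATCH_MAX_ITEMS", "32")),
    )
    # Fail fast on values that would silently break routing or batching.
    if not 0.0 <= settings.confidence_threshold <= 1.0:
        raise ValueError("CONFIDENCE_THRESHOLD must be between 0 and 1")
    if settings.batch_max_items < 1:
        raise ValueError("BATCH_MAX_ITEMS must be a positive integer")
    return settings
```

Validating once at startup keeps misconfiguration out of the request path.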

Available endpoints:

  • GET /health
  • POST /predict
  • POST /predict/batch
  • POST /predict/batch/async
  • GET /predict/batch/async/{job_id}
  • POST /retrieve
  • POST /assist
  • GET /metrics

Operational features:

  • X-Request-ID response header for traceability
  • confidence-threshold-based human review routing
  • async batch job flow for queue-like consumption
  • preview RAG retrieval over local runbook documents
  • deterministic incident-assist beta response over classifier and retrieval evidence
  • Prometheus-compatible metrics exposure
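
The async batch flow pairs a submit endpoint with a polling endpoint keyed by `job_id`. The sketch below models that pattern with an in-memory job store backed by a thread pool. `BatchJobStore`, the status strings, and the result shape are hypothetical; the service's actual job handling is not described in this README.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class BatchJobStore:
    """In-memory submit/poll job store (illustrative; jobs are lost on restart)."""

    def __init__(self, predict_fn, max_items=32, workers=2):
        self._predict_fn = predict_fn
        self._max_items = max_items
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs = {}

    def submit(self, texts):
        """Queue a batch and return its job id (the POST side of the flow)."""
        items = list(texts)
        if len(items) > self._max_items:
            raise ValueError(f"batch exceeds {self._max_items} items")
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(
            lambda: [self._predict_fn(text) for text in items]
        )
        return job_id

    def status(self, job_id):
        """Report job state and results when ready (the GET side of the flow)."""
        future = self._jobs.get(job_id)
        if future is None:
            return {"job_id": job_id, "status": "not_found"}
        if not future.done():
            return {"job_id": job_id, "status": "pending"}
        error = future.exception()
        if error is not None:
            return {"job_id": job_id, "status": "failed", "error": str(error)}
        return {"job_id": job_id, "status": "completed", "results": future.result()}
```

An in-process store like this is enough for queue-like demos; a multi-worker deployment would need shared storage for job state.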

RAG Preview Retrieval

release-2026.06-rag-preview introduces a lightweight retrieval layer for runbook evidence. It uses a local scikit-learn TF-IDF sparse vector index over docs/runbooks/ and applies a domain-aware ranking boost from the predicted classifier label.

This is a preview retrieval implementation. It is neither a production vector database deployment nor a full LLM assistant.
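
To make the ranking idea concrete, here is a small pure-Python stand-in for TF-IDF retrieval with a domain-aware boost. It does not use scikit-learn and is not the repository's implementation: the `TinyTfidfIndex` class, the multiplicative boost, and the 0.2 boost value are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyTfidfIndex:
    def __init__(self, documents):
        # documents: list of dicts with "document_id", "domain", and "text" keys
        self.documents = documents
        tokenized = [tokenize(doc["text"]) for doc in documents]
        doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
        n_docs = len(documents)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        self.idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
        self.vectors = [self._vectorize(tokens) for tokens in tokenized]

    def _vectorize(self, tokens):
        # L2-normalized TF-IDF weights; unknown terms are ignored.
        counts = Counter(t for t in tokens if t in self.idf)
        weights = {t: c * self.idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        return {t: w / norm for t, w in weights.items()} if norm else {}

    def search(self, text, predicted_domain=None, top_k=5, domain_boost=0.2):
        query = self._vectorize(tokenize(text))
        hits = []
        for doc, vec in zip(self.documents, self.vectors):
            score = sum(w * vec.get(t, 0.0) for t, w in query.items())  # cosine similarity
            if score <= 0:
                continue
            if predicted_domain and doc["domain"] == predicted_domain:
                score *= 1.0 + domain_boost  # favor runbooks from the predicted domain
            hits.append({"document_id": doc["document_id"], "domain": doc["domain"], "score": score})
        return sorted(hits, key=lambda hit: hit["score"], reverse=True)[:top_k]
```

Because the boost only rescales documents that already match the query, a wrong classifier prediction cannot pull in unrelated runbooks; it can only reorder matching ones.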

curl -X POST http://localhost:8000/retrieve \
  -H "Content-Type: application/json" \
  -d '{
    "text": "EKS worker nodes became NotReady after a CNI upgrade.",
    "predicted_domain": "k8s_cluster",
    "top_k": 5
  }'

The response includes cited evidence with document_id, domain, section, score, citation, and excerpt fields.
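
The same request can be issued from Python with only the standard library. The sketch below mirrors the curl example; the helper names are hypothetical, and the response is returned as parsed JSON without assuming anything about its envelope beyond what is listed above.

```python
import json
import urllib.request


def build_retrieve_request(text, predicted_domain, top_k=5, base_url="http://localhost:8000"):
    """Build the POST /retrieve request without sending it."""
    payload = {"text": text, "predicted_domain": predicted_domain, "top_k": top_k}
    return urllib.request.Request(
        f"{base_url}/retrieve",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def retrieve(text, predicted_domain, top_k=5, base_url="http://localhost:8000", timeout=10.0):
    """Send the request to a running service and return the decoded JSON body."""
    request = build_retrieve_request(text, predicted_domain, top_k, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Splitting request construction from sending makes the payload easy to inspect or test without a live server.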

Retrieval observability is exposed through /metrics with ditri_retrieval_requests_total and ditri_retrieval_latency_seconds.
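
For readers unfamiliar with the Prometheus text format, the sketch below shows how these two metric names could be tracked and rendered by hand. It is an illustration only: the `RetrievalMetrics` class is hypothetical, the latency metric is rendered as a summary with `_sum` and `_count` series, and the service may well expose a histogram through a metrics library instead.

```python
import time


class RetrievalMetrics:
    """Minimal counter plus latency summary, rendered in Prometheus text format."""

    def __init__(self):
        self.requests_total = 0
        self.latency_sum = 0.0
        self.latency_count = 0

    def observe(self, seconds):
        self.requests_total += 1
        self.latency_sum += seconds
        self.latency_count += 1

    def render(self):
        lines = [
            "# HELP ditri_retrieval_requests_total Total retrieval requests.",
            "# TYPE ditri_retrieval_requests_total counter",
            f"ditri_retrieval_requests_total {self.requests_total}",
            "# HELP ditri_retrieval_latency_seconds Retrieval latency in seconds.",
            "# TYPE ditri_retrieval_latency_seconds summary",
            f"ditri_retrieval_latency_seconds_sum {self.latency_sum}",
            f"ditri_retrieval_latency_seconds_count {self.latency_count}",
        ]
        return "\n".join(lines) + "\n"


def timed_retrieval(metrics, retrieve_fn, *args, **kwargs):
    """Call retrieve_fn and record its latency, even when it raises."""
    start = time.perf_counter()
    try:
        return retrieve_fn(*args, **kwargs)
    finally:
        metrics.observe(time.perf_counter() - start)
```

Dividing the `_sum` series by the `_count` series over a time window gives the average retrieval latency.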

Incident Assist Beta

release-2026.07-incident-assist-beta adds a deterministic POST /assist flow that combines classifier output, retrieved runbook evidence, citations, and safety notes.

This beta endpoint is intentionally LLM-ready but not LLM-powered yet. It does not call an external model or execute remediation actions.

curl -X POST http://localhost:8000/assist \
  -H "Content-Type: application/json" \
  -d '{
    "text": "GitHub Actions deployment failed because the runner could not assume the production IAM role.",
    "top_k": 5
  }'

The response includes incident, retrieval, assistant_response, and metadata sections so the guidance remains auditable and citation-grounded.
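
A deterministic assembly step of this kind can be expressed as a pure function over the classifier output and the retrieved evidence. The sketch below uses the four top-level section names listed above; every nested field name, the wording of the notes, and the function itself are hypothetical.

```python
def build_assist_response(text, predicted_domain, confidence, evidence, confidence_threshold=0.6):
    """Combine classifier output and runbook evidence without calling any model."""
    needs_review = confidence < confidence_threshold
    safety_notes = ["Guidance is advisory; no remediation action is executed."]
    if needs_review:
        safety_notes.append("Classifier confidence is below the review threshold; confirm the domain manually.")
    if not evidence:
        safety_notes.append("No runbook evidence was retrieved.")
    return {
        "incident": {
            "text": text,
            "predicted_domain": predicted_domain,
            "confidence": confidence,
            "needs_review": needs_review,
        },
        "retrieval": {"evidence": evidence},
        "assistant_response": {
            # Each suggested step points back to the evidence item it came from.
            "suggested_steps": [f"Review {item['citation']}: {item['excerpt']}" for item in evidence],
            "citations": [item["citation"] for item in evidence],
            "safety_notes": safety_notes,
        },
        "metadata": {"llm_used": False, "deterministic": True},
    }
```

Since the function has no hidden state, the same incident and evidence always produce the same guidance, which is what keeps the output auditable.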

Delivery And Release

This repository follows a lightweight GitFlow-style process:

  • main: release-ready branch
  • develop: integration branch
  • feature/*: scoped feature work
  • release/*: release stabilization

Current stable release:

  • release-2026.05-classifier-core

Release Strategy

This project uses a product-style release train in addition to traditional semantic versioning. The classifier is already a useful, stable baseline, but the next phase is broader than a single model version: the roadmap extends the project toward a combined Classifier + RAG + LLM DevOps Incident Triage Assistant.

Release train naming makes the roadmap easier to read as a product plan. Each release tag names the delivery window and focus area, and the channel communicates maturity. The stable baseline remains the Transformer-based classifier core. RAG preview retrieval is implemented, and assistant features are moving through beta releases ahead of any production-style LLM integration.

Release channels:

| Channel | Meaning |
| --- | --- |
| experimental | Early prototype or internal validation work |
| preview | Feature-complete direction, not production-ready |
| beta | Integrated and tested, but still evolving |
| stable | Production-style release with documentation, tests, and monitoring |

Release roadmap:

| Release tag | Channel | Focus |
| --- | --- | --- |
| release-2026.05-classifier-core | stable | Transformer classifier, FastAPI inference, batch jobs, evaluation reports, Docker, CI |
| release-2026.06-rag-preview | preview | Runbook corpus loading, domain-aware TF-IDF retrieval, preview vector index selection, /retrieve API |
| release-2026.07-incident-assist-beta | beta | Classifier + RAG integration, deterministic /assist API, evidence citations, LLM-ready response contract |
| release-2026.08-eval-observability | beta | RAG evaluation, groundedness checks, hallucination checks, retrieval/generation latency metrics |
| release-2026.09-cloud-stable | stable | AWS deployment roadmap, production-style service architecture, monitoring, CI/CD release flow |
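
The tags in the roadmap follow a visible `release-YYYY.MM-focus` pattern, which makes them easy to validate and sort in tooling. The parser below is a sketch inferred from the tags listed here; the pattern is an assumption, not a documented naming rule.

```python
import re
from typing import NamedTuple


class ReleaseTag(NamedTuple):
    year: int
    month: int
    focus: str


# Pattern inferred from the roadmap tags, e.g. release-2026.06-rag-preview.
_TAG_PATTERN = re.compile(r"^release-(\d{4})\.(\d{2})-([a-z0-9]+(?:-[a-z0-9]+)*)$")


def parse_release_tag(tag):
    """Parse a release-train tag into (year, month, focus) or raise ValueError."""
    match = _TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a release-train tag: {tag!r}")
    year, month, focus = int(match[1]), int(match[2]), match[3]
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in tag: {tag!r}")
    return ReleaseTag(year, month, focus)
```

Because `ReleaseTag` is a tuple ordered by year and month first, sorting parsed tags yields chronological order.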

Concise roadmap:

Phase Outcome
Classifier core Keep the current Transformer classifier as the stable routing baseline
RAG preview Retrieve cited runbook evidence with a lightweight local vector index
Incident assistant beta Combine predicted domain, retrieved evidence, deterministic guidance, and citations
Evaluation and observability Measure retrieval quality, groundedness, citations, latency, and service health
Cloud stable Document an AWS-ready service shape with monitoring and release operations

Hugging Face Publishing

export HF_TOKEN="hf_xxx"

uv run ditri-publish \
  --model-dir models/devops-incident-triage \
  --repo-id <your-hf-username>/devops-incident-triage

The publish flow copies docs/model_card.md into the model artifact directory as README.md when needed.
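
One plausible reading of "when needed" is that the model card is copied only if the model directory has no README.md yet. The sketch below implements that reading with the standard library; the function name and the skip-if-present rule are assumptions, not the behavior of ditri-publish.

```python
import shutil
from pathlib import Path


def ensure_model_card(model_dir, card_path="docs/model_card.md"):
    """Copy the model card to <model_dir>/README.md unless one already exists.

    Returns True when a copy was made and False when README.md was left untouched.
    """
    target = Path(model_dir) / "README.md"
    if target.exists():
        return False  # assumed rule: never overwrite an existing README.md
    source = Path(card_path)
    if not source.is_file():
        raise FileNotFoundError(f"model card not found: {source}")
    shutil.copyfile(source, target)
    return True
```

Returning a boolean lets the caller log whether the published README came from the model card or from a file that was already present.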

Repository Layout

.
├─ src/devops_incident_triage/
│  ├─ data_prep.py
│  ├─ train.py
│  ├─ evaluate.py
│  ├─ benchmark_models.py
│  ├─ predict.py
│  ├─ api.py
│  ├─ hf_publish.py
│  └─ ingest_raw.py
├─ tests/
├─ data/
├─ reports/
├─ models/
├─ docs/
├─ .github/workflows/
├─ Dockerfile
├─ Makefile
└─ pyproject.toml

Limitations

  • training data is synthetic in the current public baseline
  • the task is single-label even though real incidents may span multiple domains
  • long multi-line logs and highly noisy contexts need additional validation
  • the model is intended for triage support, not autonomous remediation

License

MIT
