Portfolio-grade NLP and MLOps project for classifying DevOps incident text into the most relevant operational domain for first-pass routing.
This repository focuses on production-minded engineering rather than demo-only modeling:
- reproducible training and evaluation pipelines
- local inference, FastAPI serving, and async batch jobs
- observability with request tracing and Prometheus-style metrics
- Docker, CI, release workflow, and Hugging Face publishing
- explicit documentation of data limitations and operational scope
The model takes incident summaries, deployment failures, and log-style operational messages and predicts which domain should review the issue first.
Current label set:
| Label | Description |
|---|---|
| `k8s_cluster` | Kubernetes scheduling, node, or cluster-state issues |
| `cicd_pipeline` | CI/CD build, test, or deployment pipeline failures |
| `aws_iam_network` | AWS IAM, VPC, network, or permission-related issues |
| `deployment_release` | Helm, rollout, or release operation issues |
| `container_runtime` | Docker, containerd, image, or runtime issues |
| `observability_alerting` | Monitoring, logging, tracing, or alerting issues |
| `database_state` | Database connectivity, replication, lock, or storage-state issues |
This repository includes more than a trained classifier.
- Model training with `transformers`
- Evaluation with confusion matrix and threshold-based review analysis
- CLI inference for single and batch inputs
- FastAPI endpoints for real-time and batch inference
- Async batch job API for queue-like inference workflows
- Benchmark automation across multiple backbone models
- Release and documentation flow suitable for a portfolio-grade MLOps project
The starter dataset in data/sample/incidents_synthetic.csv is synthetic.
- It is not collected from a real production environment.
- Reported scores should not be interpreted as validated real-world generalization.
- Real anonymized incident data is required before any serious operational use.
That limitation is a deliberate part of the project documentation and not hidden in the evaluation results.
Baseline model: `distilbert-base-uncased`
Why this baseline:
- DevOps logs and error messages are often English-dominant
- training and iteration cost remain practical on a personal environment
- the same pipeline can be reused for multilingual backbones such as `xlm-roberta-base`
Benchmark workflow:

```bash
uv run ditri-benchmark \
  --data-dir data/processed \
  --models distilbert-base-uncased,sentence-transformers/all-MiniLM-L6-v2,xlm-roberta-base \
  --epochs 4 \
  --skip-existing
```

Generated outputs:
- `reports/model_benchmark.json`
- `reports/model_benchmark.md`
- `models/benchmarks/<model-slug>/`
- `reports/benchmarks/<model-slug>/`
Install Python and the project dependencies:

```bash
uv python install 3.12
uv sync --extra dev --extra api --extra viz --extra peft --extra gradio
```

Using the synthetic starter dataset:

```bash
uv run ditri-data-prep \
  --input-path data/sample/incidents_synthetic.csv \
  --output-dir data/processed \
  --seed 42
```

Using anonymized real data:

```bash
uv run ditri-ingest-raw \
  --input-path data/raw/incidents_template.csv \
  --output-canonical-path data/raw/incidents_canonical.csv \
  --output-training-path data/raw/incidents_training_ready.csv \
  --report-path reports/raw_ingestion_report.json
uv run ditri-data-prep \
  --input-path data/raw/incidents_training_ready.csv \
  --output-dir data/processed \
  --seed 42
```

Train the model:

```bash
uv run ditri-train \
  --data-dir data/processed \
  --output-dir models/devops-incident-triage \
  --model-name distilbert-base-uncased \
  --epochs 4
```

Optional PEFT:

```bash
uv run ditri-train \
  --data-dir data/processed \
  --output-dir models/devops-incident-triage \
  --model-name distilbert-base-uncased \
  --use-peft
```

Evaluate the model:

```bash
uv run ditri-eval \
  --model-path models/devops-incident-triage \
  --data-dir data/processed \
  --report-dir reports \
  --confidence-thresholds 0.4,0.5,0.6,0.7
```

Key artifacts:
- `reports/evaluation_metrics.json`
- `reports/per_label_metrics.json`
- `reports/threshold_metrics.json`
- `reports/confusion_matrix.csv`
- `reports/figures/confusion_matrix.png`
- `reports/sample_predictions.jsonl`
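
The threshold sweep above can be pictured as a trade-off between how many predictions are accepted automatically and how accurate those accepted predictions are. The following is a minimal sketch of that idea, not the project's evaluation code: the `Prediction` type, the `threshold_report` function, and the output keys (`coverage`, `accuracy_on_accepted`, `sent_to_review`) are hypothetical names chosen for illustration, and the actual contents of `reports/threshold_metrics.json` may differ.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    # Hypothetical record shape used only for this sketch.
    label: str         # predicted domain
    confidence: float  # probability of the predicted class
    gold: str          # reference label from the evaluation split


def threshold_report(preds, thresholds):
    """For each threshold, report coverage and accuracy on accepted predictions."""
    rows = []
    for t in thresholds:
        # Predictions at or above the threshold are accepted automatically.
        accepted = [p for p in preds if p.confidence >= t]
        correct = sum(p.label == p.gold for p in accepted)
        rows.append({
            "threshold": t,
            "coverage": len(accepted) / len(preds) if preds else 0.0,
            "accuracy_on_accepted": correct / len(accepted) if accepted else None,
            "sent_to_review": len(preds) - len(accepted),
        })
    return rows


if __name__ == "__main__":
    sample = [
        Prediction("k8s_cluster", 0.92, "k8s_cluster"),
        Prediction("cicd_pipeline", 0.55, "deployment_release"),
        Prediction("database_state", 0.65, "database_state"),
        Prediction("aws_iam_network", 0.45, "aws_iam_network"),
    ]
    for row in threshold_report(sample, [0.4, 0.5, 0.6, 0.7]):
        print(row)
```

Raising the threshold typically lowers coverage and sends more incidents to manual review, which is the trade-off the `--confidence-thresholds` values are meant to expose.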
Single input:

```bash
uv run ditri-predict \
  --model-path models/devops-incident-triage \
  --confidence-threshold 0.6 \
  --review-queue sre_manual_triage \
  --text "EKS worker nodes became NotReady after CNI upgrade."
```

Batch input:

```bash
uv run ditri-predict \
  --model-path models/devops-incident-triage \
  --input-file data/sample/incidents_synthetic.csv \
  --text-column text \
  --output-file reports/batch_predictions.jsonl
```

For live demos and portfolio sharing, generate a curated showcase report from representative incident examples:

```bash
uv run ditri-demo-showcase \
  --model-path models/devops-incident-triage \
  --confidence-threshold 0.6 \
  --review-queue sre_manual_triage
```

Generated artifacts:
- `reports/demo_showcase.json`
- `reports/demo_showcase.md`

This workflow is useful when preparing terminal demos, README evidence, or GIF/video walkthroughs because the terminal summary and saved report come from the same curated examples.
Start the API server:

```bash
CONFIDENCE_THRESHOLD=0.6 REVIEW_QUEUE=sre_manual_triage BATCH_MAX_ITEMS=32 uv run ditri-api
```

Available endpoints:
- `GET /health`
- `POST /predict`
- `POST /predict/batch`
- `POST /predict/batch/async`
- `GET /predict/batch/async/{job_id}`
- `POST /retrieve`
- `GET /metrics`
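
The two async batch endpoints suggest a submit-then-poll pattern: one call returns a job ID, and a later call reads the job state. The sketch below illustrates that pattern with an in-memory registry and a thread pool. It is an assumption about the general shape, not the project's implementation: `BatchJobStore`, its method names, and the status strings are hypothetical, and `predict_fn` stands in for the real classifier. The `max_items` default mirrors the `BATCH_MAX_ITEMS=32` setting shown above.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class BatchJobStore:
    """In-memory job registry: submit returns an ID, status is polled later."""

    def __init__(self, predict_fn, max_items=32, workers=2):
        self._predict_fn = predict_fn  # stand-in for the real classifier call
        self._max_items = max_items
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._futures = {}

    def submit(self, texts):
        """Roughly what a POST to the async batch endpoint could do."""
        items = list(texts)
        if len(items) > self._max_items:
            raise ValueError(f"batch exceeds {self._max_items} items")
        job_id = uuid.uuid4().hex
        self._futures[job_id] = self._pool.submit(
            lambda: [self._predict_fn(t) for t in items]
        )
        return job_id

    def status(self, job_id):
        """Roughly what a GET on the job URL could return (hypothetical fields)."""
        future = self._futures.get(job_id)
        if future is None:
            return {"job_id": job_id, "status": "not_found"}
        if not future.done():
            return {"job_id": job_id, "status": "running"}
        error = future.exception()
        if error is not None:
            return {"job_id": job_id, "status": "failed", "error": str(error)}
        return {"job_id": job_id, "status": "completed", "results": future.result()}
```

An in-memory registry like this loses jobs on restart, so a real deployment would likely need a persistent queue or store behind the same interface.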
Operational features:
- `X-Request-ID` response header for traceability
- confidence-threshold-based human review routing
- async batch job flow for queue-like consumption
- preview RAG retrieval over local runbook documents
- deterministic incident-assist beta response over classifier and retrieval evidence
- Prometheus-compatible metrics exposure
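
Confidence-threshold routing from the list above can be sketched as follows: the top class is used for routing only when its probability clears the threshold, otherwise the incident goes to the review queue. This is a minimal illustration under stated assumptions. The `route` function and the response keys (`predicted_label`, `confidence`, `needs_human_review`, `routed_to`) are hypothetical; the defaults mirror the `0.6` threshold and `sre_manual_triage` queue used elsewhere in this README.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits, labels, threshold=0.6, review_queue="sre_manual_triage"):
    """Pick the top label, or fall back to manual review when confidence is low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confident = probs[best] >= threshold
    return {
        "predicted_label": labels[best],
        "confidence": probs[best],
        "needs_human_review": not confident,
        "routed_to": labels[best] if confident else review_queue,
    }


if __name__ == "__main__":
    labels = ["k8s_cluster", "cicd_pipeline", "database_state"]
    print(route([5.0, 0.0, 0.0], labels))  # clear winner, routed to the domain
    print(route([0.1, 0.0, 0.0], labels))  # near-uniform, sent to review
```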
release-2026.06-rag-preview introduces a lightweight retrieval layer for runbook evidence. It uses a local scikit-learn TF-IDF sparse vector index over docs/runbooks/ and applies a domain-aware ranking boost from the predicted classifier label.
This is a preview retrieval implementation, not a production Vector DB deployment and not a full LLM assistant.
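
To make the ranking idea concrete, here is a dependency-free sketch of TF-IDF retrieval with a domain-aware boost. It deliberately reimplements TF-IDF by hand instead of using scikit-learn, so it is a stand-in for the technique and not the project's code. The `RunbookIndex` class, the input document shape, the multiplicative form of the boost, and the `boost=1.25` default are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9_]+", text.lower())


class RunbookIndex:
    """Tiny TF-IDF index over runbook snippets (hand-rolled, not scikit-learn)."""

    def __init__(self, docs):
        # docs: list of dicts with "document_id", "domain", and "text" keys
        # (a hypothetical shape for this sketch).
        self.docs = docs
        counts = [Counter(tokenize(d["text"])) for d in docs]
        df = Counter(term for c in counts for term in c)
        n = len(docs)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
        self.idf = {t: math.log((1 + n) / (1 + f)) + 1 for t, f in df.items()}
        self.vectors = [self._vectorize(c) for c in counts]

    def _vectorize(self, counts):
        # Terms unseen at index time are dropped; vectors are L2-normalized.
        vec = {t: tf * self.idf[t] for t, tf in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {t: w / norm for t, w in vec.items()}

    def search(self, query, predicted_domain=None, top_k=5, boost=1.25):
        q = self._vectorize(Counter(tokenize(query)))
        hits = []
        for doc, vec in zip(self.docs, self.vectors):
            # Dot product of normalized vectors equals cosine similarity.
            score = sum(w * vec.get(t, 0.0) for t, w in q.items())
            if predicted_domain and doc["domain"] == predicted_domain:
                score *= boost  # assumed multiplicative domain boost
            if score > 0:
                hits.append({
                    "document_id": doc["document_id"],
                    "domain": doc["domain"],
                    "score": score,
                })
        return sorted(hits, key=lambda h: h["score"], reverse=True)[:top_k]
```

With a modest boost, a strongly matching document from another domain can still outrank a weak match from the predicted domain, which keeps a classifier mistake from hiding relevant evidence.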

```bash
curl -X POST http://localhost:8000/retrieve \
  -H "Content-Type: application/json" \
  -d '{
    "text": "EKS worker nodes became NotReady after a CNI upgrade.",
    "predicted_domain": "k8s_cluster",
    "top_k": 5
  }'
```

The response includes cited evidence with `document_id`, `domain`, `section`, `score`, `citation`, and `excerpt` fields.
Retrieval observability is exposed through /metrics with ditri_retrieval_requests_total and ditri_retrieval_latency_seconds.
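
As a rough picture of what those two metrics could look like in the Prometheus text exposition format, the sketch below counts requests and accumulates latency, then renders both. It is an assumption-heavy illustration: the README does not say whether the project uses `prometheus_client` or custom code, and the latency metric is rendered here as a summary (`_sum` and `_count`) even though the real service may expose a histogram. `RetrievalMetrics` and its methods are hypothetical names.

```python
import time
from contextlib import contextmanager


class RetrievalMetrics:
    """Minimal in-process counters rendered in Prometheus text format."""

    def __init__(self):
        self.requests_total = 0
        self.latency_sum = 0.0
        self.latency_count = 0

    @contextmanager
    def track(self):
        """Time one retrieval call and count it, even if it raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.requests_total += 1
            self.latency_sum += time.perf_counter() - start
            self.latency_count += 1

    def render(self):
        lines = [
            "# TYPE ditri_retrieval_requests_total counter",
            f"ditri_retrieval_requests_total {self.requests_total}",
            # Assumed metric type; the real endpoint may use a histogram.
            "# TYPE ditri_retrieval_latency_seconds summary",
            f"ditri_retrieval_latency_seconds_sum {self.latency_sum:.6f}",
            f"ditri_retrieval_latency_seconds_count {self.latency_count}",
        ]
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    metrics = RetrievalMetrics()
    with metrics.track():
        time.sleep(0.01)  # stand-in for a retrieval call
    print(metrics.render())
```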
release-2026.07-incident-assist-beta adds a deterministic POST /assist flow that combines classifier output, retrieved runbook evidence, citations, and safety notes.
This beta endpoint is intentionally LLM-ready but not LLM-powered yet. It does not call an external model or execute remediation actions.

```bash
curl -X POST http://localhost:8000/assist \
  -H "Content-Type: application/json" \
  -d '{
    "text": "GitHub Actions deployment failed because the runner could not assume the production IAM role.",
    "top_k": 5
  }'
```

The response includes `incident`, `retrieval`, `assistant_response`, and `metadata` sections so the guidance remains auditable and citation-grounded.
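
One way to picture a deterministic assist response is a pure function that assembles the four sections from classifier output and retrieved evidence, with no model call involved. The sketch below is illustrative only: the four top-level keys come from the description above, while `build_assist_response`, every nested field name, and the wording of the summary and safety note are hypothetical.

```python
def build_assist_response(text, prediction, evidence, threshold=0.6):
    """Assemble an auditable response from classifier output and evidence.

    prediction: dict with "label" and "confidence" (hypothetical shape).
    evidence:   list of dicts that each carry a "citation" key.
    """
    needs_review = prediction["confidence"] < threshold
    citations = [item["citation"] for item in evidence]
    if evidence:
        summary = (
            f"Likely domain: {prediction['label']}. "
            "Review the cited runbook sections first."
        )
    else:
        summary = (
            f"Likely domain: {prediction['label']}. "
            "No runbook evidence was retrieved."
        )
    return {
        "incident": {
            "text": text,
            "predicted_domain": prediction["label"],
            "confidence": prediction["confidence"],
            "needs_human_review": needs_review,
        },
        "retrieval": {"result_count": len(evidence), "results": evidence},
        "assistant_response": {
            "summary": summary,
            "citations": citations,
            "safety_notes": ["No remediation action is executed automatically."],
        },
        # Deterministic template output; no external model is called.
        "metadata": {"mode": "deterministic", "llm_used": False},
    }
```

Because the same inputs always produce the same output, a response like this can be snapshot-tested, and a later LLM-backed version could fill the same contract.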
This repository follows a lightweight GitFlow-style process:
- `main`: release-ready branch
- `develop`: integration branch
- `feature/*`: scoped feature work
- `release/*`: release stabilization
Current project release: `release-2026.05-classifier-core`
This project now uses a product-style Release Train in addition to traditional semantic versioning. The classifier is already a useful stable baseline, but the next phase is broader than a single model version: the roadmap extends the project toward a future Classifier + RAG + LLM DevOps Incident Triage Assistant.
Release train naming makes the roadmap easier to read as a product plan. Each release tag describes the delivery window and focus area, while the channel communicates maturity. The current stable baseline remains the Transformer-based classifier core; RAG preview retrieval is implemented, and assistant features are evolving through beta releases before any production-style LLM integration.
Release channels:
| Channel | Meaning |
|---|---|
| `experimental` | Early prototype or internal validation work |
| `preview` | Feature-complete direction, not production-ready |
| `beta` | Integrated and tested, but still evolving |
| `stable` | Production-style release with documentation, tests, and monitoring |
Release roadmap:
| Release tag | Channel | Focus |
|---|---|---|
| `release-2026.05-classifier-core` | stable | Transformer classifier, FastAPI inference, batch jobs, evaluation reports, Docker, CI |
| `release-2026.06-rag-preview` | preview | Runbook corpus loading, domain-aware TF-IDF retrieval, preview vector index selection, `/retrieve` API |
| `release-2026.07-incident-assist-beta` | beta | Classifier + RAG integration, deterministic `/assist` API, evidence citations, LLM-ready response contract |
| `release-2026.08-eval-observability` | beta | RAG evaluation, groundedness checks, hallucination checks, retrieval/generation latency metrics |
| `release-2026.09-cloud-stable` | stable | AWS deployment roadmap, production-style service architecture, monitoring, CI/CD release flow |
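
The tags in the table share a visible shape: `release-`, a year and month, then a hyphenated focus slug. The sketch below parses that shape so tags can be validated or sorted by delivery window. The pattern is inferred from the five tags listed here and is not a documented convention of the project; `TAG_PATTERN`, `ReleaseTag`, and `parse_release_tag` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Inferred from the tags in the roadmap table, e.g. release-2026.06-rag-preview.
TAG_PATTERN = re.compile(
    r"^release-(?P<year>\d{4})\.(?P<month>0[1-9]|1[0-2])"
    r"-(?P<focus>[a-z0-9]+(?:-[a-z0-9]+)*)$"
)


@dataclass(frozen=True)
class ReleaseTag:
    year: int
    month: int
    focus: str


def parse_release_tag(tag):
    """Split a release-train tag into delivery window and focus area."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a release-train tag: {tag!r}")
    return ReleaseTag(int(match["year"]), int(match["month"]), match["focus"])


if __name__ == "__main__":
    tags = [
        "release-2026.07-incident-assist-beta",
        "release-2026.05-classifier-core",
        "release-2026.06-rag-preview",
    ]
    # Sort by delivery window (year, then month).
    for tag in sorted(tags, key=lambda t: (parse_release_tag(t).year,
                                           parse_release_tag(t).month)):
        print(tag, parse_release_tag(tag))
```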
Concise roadmap:
| Phase | Outcome |
|---|---|
| Classifier core | Keep the current Transformer classifier as the stable routing baseline |
| RAG preview | Retrieve cited runbook evidence with a lightweight local vector index |
| Incident assistant beta | Combine predicted domain, retrieved evidence, deterministic guidance, and citations |
| Evaluation and observability | Measure retrieval quality, groundedness, citations, latency, and service health |
| Cloud stable | Document an AWS-ready service shape with monitoring and release operations |
Detailed planning docs:
- Release strategy
- RAG roadmap
- Classifier core release evidence
- RAG preview release evidence
- Incident assist beta release evidence
- RAG evaluation plan
Publish to the Hugging Face Hub:

```bash
export HF_TOKEN="hf_xxx"
uv run ditri-publish \
  --model-dir models/devops-incident-triage \
  --repo-id <your-hf-username>/devops-incident-triage
```

The publish flow copies `docs/model_card.md` into the model artifact directory as `README.md` when needed.
Project structure:

```text
.
├─ src/devops_incident_triage/
│  ├─ data_prep.py
│  ├─ train.py
│  ├─ evaluate.py
│  ├─ benchmark_models.py
│  ├─ predict.py
│  ├─ api.py
│  ├─ hf_publish.py
│  └─ ingest_raw.py
├─ tests/
├─ data/
├─ reports/
├─ models/
├─ docs/
├─ .github/workflows/
├─ Dockerfile
├─ Makefile
└─ pyproject.toml
```

Known limitations:
- training data is synthetic in the current public baseline
- the task is single-label even though real incidents may span multiple domains
- long multi-line logs and highly noisy contexts need additional validation
- the model is intended for triage support, not autonomous remediation
License: MIT