A production-grade LLMOps platform for autonomous infrastructure root cause analysis (RCA) — leveraging multi-agent orchestration, retrieval-augmented generation (RAG), self-correcting LLM workflows on AWS Bedrock, and a fully serverless human-in-the-loop (HITL) review layer. Built for SRE/DevOps teams requiring zero-touch incident triage with full observability, experiment tracking, and quality gates.
sequenceDiagram
participant Airflow
participant Orchestrator
participant RAG as ChromaDB RAG
participant Bedrock as AWS Bedrock<br/>(Llama 3 / Mistral 7B)
participant Critic
participant MLflow
participant SF as Step Functions
Airflow->>Orchestrator: Trigger RCA (normalized_log)
loop Self-Correction (max 2 retries)
Orchestrator->>RAG: Query runbook context (semantic search)
RAG-->>Orchestrator: Top-6 relevant chunks
Orchestrator->>Bedrock: Agent 1: Investigator (log + context)
Bedrock-->>Orchestrator: Incident summary
Orchestrator->>Bedrock: Agent 2: Root Cause (summary + context)
Bedrock-->>Orchestrator: Hypothesis + evidence
Orchestrator->>Bedrock: Agent 3: Fix Generator (root cause + context)
Bedrock-->>Orchestrator: Remediation steps
Orchestrator->>Bedrock: Agent 4: Formatter (all outputs)
Bedrock-->>Orchestrator: Structured RCA JSON
Orchestrator->>Critic: Evaluate RCA quality
Critic->>Bedrock: Score faithfulness + relevancy (Mistral-7B)
Bedrock-->>Critic: Quality metrics
Critic-->>Orchestrator: Score + feedback
Orchestrator->>MLflow: Log attempt metrics (score, tokens, latency)
alt Score ≥ 0.8
Orchestrator->>SF: trigger_sf_review (fire and forget)
Orchestrator->>MLflow: Mark run SUCCESS
Orchestrator-->>Airflow: Complete
else Score < 0.8 && retries left
Note over Orchestrator: Inject critic feedback into next iteration
else Max retries exceeded
Orchestrator->>MLflow: Mark run FAILED
Orchestrator-->>Airflow: Fail with diagnostics
end
end
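The control flow in the diagram above can be sketched in a few lines of Python. This is a minimal illustration, not the contents of `workflow.py`: the `generate` and `critique` callables, the `Verdict` type, and the result dictionary are hypothetical names, and "max 2 retries" is assumed to mean one initial attempt plus up to two retries.

```python
from dataclasses import dataclass
from typing import Callable, Optional

QUALITY_THRESHOLD = 0.8  # pass mark taken from the diagram
MAX_RETRIES = 2          # assumption: retries allowed after the first attempt


@dataclass
class Verdict:
    score: float
    feedback: str


def run_with_self_correction(
    generate: Callable[[str, Optional[str]], dict],
    critique: Callable[[dict], Verdict],
    normalized_log: str,
) -> dict:
    """Run generate -> critique until the score passes or retries run out."""
    feedback: Optional[str] = None
    for attempt in range(1, MAX_RETRIES + 2):  # first attempt + retries
        rca = generate(normalized_log, feedback)
        verdict = critique(rca)
        if verdict.score >= QUALITY_THRESHOLD:
            return {"status": "SUCCESS", "attempts": attempt, "rca": rca}
        feedback = verdict.feedback  # injected into the next iteration
    return {"status": "FAILED", "attempts": attempt, "rca": rca}


# Stub agents (hypothetical): the second attempt improves once feedback exists.
def fake_generate(log: str, feedback: Optional[str]) -> dict:
    return {"root_cause": "improved" if feedback else "first guess"}


def fake_critique(rca: dict) -> Verdict:
    if rca["root_cause"] == "improved":
        return Verdict(0.9, "")
    return Verdict(0.5, "cite log evidence")


result = run_with_self_correction(fake_generate, fake_critique, "ERROR ...")
print(result["status"], result["attempts"])  # SUCCESS 2
```

Passing the agents in as callables keeps the loop testable without Bedrock access; the real orchestrator would also log each attempt to MLflow before deciding whether to retry.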
graph LR
subgraph "Indexing Phase (One-time)"
MD["Markdown Runbooks<br/>SOP Documents"]
SPLIT["RecursiveCharacterTextSplitter<br/>chunk_size=1000 · overlap=200"]
EMB_IDX["Bedrock Titan Embed<br/>1024-dim vectors"]
STORE[("ChromaDB<br/>Persistent Storage")]
MD --> SPLIT --> EMB_IDX --> STORE
end
subgraph "Query Phase (Per RCA)"
QUERY["Log Error Context<br/>+ Agent Question"]
EMB_Q["Bedrock Titan Embed<br/>Query Vector"]
SEARCH["Cosine Similarity<br/>Top-K=6"]
RERANK["MMR Reranking<br/>Diversity Filter"]
CONTEXT["Augmented Context<br/>to LLM Prompt"]
QUERY --> EMB_Q --> SEARCH --> RERANK --> CONTEXT
STORE -."Vector Search".-> SEARCH
end
style STORE fill:#4a3a1a,color:#fff
style CONTEXT fill:#1a4a2e,color:#fff
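The MMR reranking step in the query phase trades relevance against redundancy. The sketch below shows the general technique in plain Python. It is not the project's retrieval code (which presumably goes through LangChain and ChromaDB), and `mmr_select`, `lambda_mult`, and the toy two-dimensional vectors are illustrative assumptions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    k: int = 6,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def mmr_score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Redundancy is the similarity to the closest already-selected chunk.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy vectors: chunks 0 and 1 are near-duplicates, chunk 2 covers another angle.
query = [1.0, 0.0]
chunks = [[1.0, 0.1], [1.0, 0.12], [1.0, -1.0]]
print(mmr_select(query, chunks, k=2))  # [0, 2]
```

With `lambda_mult=0.5` the near-duplicate chunk 1 is skipped in favour of the less similar chunk 2, which is the diversity effect the diagram refers to.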
Rather than writing RCA results directly to storage, the pipeline hands off to a fully serverless HITL review layer built on AWS Step Functions, Lambda, API Gateway, and DynamoDB. Airflow fires and forgets — the DAG completes as SUCCESS without waiting for a human decision, keeping pipeline slots free regardless of how long review takes.
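A fire-and-forget handoff of this kind might look like the sketch below. It assumes a boto3 Step Functions client is passed in. The function body, the execution naming scheme, and the `{"rca": ...}` input shape are illustrative assumptions, not the contents of `core/sfn_client.py`.

```python
import json
import uuid


def trigger_sf_review(sfn_client, state_machine_arn: str, rca: dict) -> str:
    """Start a review execution and return its ARN without waiting for a verdict."""
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        # Execution names must be unique per state machine; incident_id plus a
        # random suffix is an assumed naming scheme.
        name=f"rca-{rca['incident_id']}-{uuid.uuid4().hex[:8]}",
        input=json.dumps({"rca": rca}),  # assumed payload shape
    )
    return response["executionArn"]


# Typical wiring inside the last Airflow task (requires boto3 and credentials):
#   import boto3, os
#   sfn = boto3.client("stepfunctions", region_name="ap-south-1")
#   trigger_sf_review(sfn, os.environ["SF_STATE_MACHINE_ARN"], rca)
```

Because `start_execution` returns as soon as the execution is accepted, the Airflow task finishes immediately and the review wait happens entirely inside Step Functions.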
flowchart TB
subgraph AIRFLOW["Airflow — last task"]
TRG["review_sent
sfn.start_execution()
returns immediately"]
end
subgraph SF["AWS Step Functions state machine"]
STORE["StoreForReview Lambda
writes RCA + task token to DynamoDB"]
WAIT["WaitForTaskToken
paused · zero cost · 72h heartbeat"]
APPROVE["OnApprove Lambda"]
REJECT["OnReject Lambda"]
TIMEOUT["EscalateTimeout
auto-reject after 72h"]
STORE --> WAIT
WAIT -->|"HeartbeatTimeout"| TIMEOUT
end
subgraph UI["Review UI — API Gateway + Lambda"]
Q["GET /queue
list pending RCAs from DynamoDB"]
D["GET /rca/{id}
full RCA + AI critic + raw log"]
A["POST /approve
send_task_success(token)"]
R["POST /reject
send_task_failure(token)
+ feedback type + reason"]
end
subgraph ON_APPROVE["On Approve"]
S3M["S3 raw/ → processed/
log archived"]
S3R["S3 rca-results/
RCA JSON written"]
ML["MLflow tag
human_verdict=approved · rater_id"]
end
subgraph ON_REJECT["On Reject — feedback-in-log"]
FBL["New timestamped log → S3 raw/
original log + AI critic output
+ human feedback as # === blocks"]
NEXT["Next scheduled DAG run
picks up naturally
agents read prior context"]
end
TRG -->|"starts execution"| STORE
STORE -->|"task token stored in DynamoDB"| WAIT
WAIT -.->|"SRE opens UI"| Q
Q --> D
D --> A & R
A -->|"send_task_success · resumes SF"| APPROVE
R -->|"send_task_failure · resumes SF"| REJECT
TIMEOUT --> REJECT
APPROVE --> S3M & S3R & ML
REJECT --> FBL --> NEXT
NEXT -->|"retried on next run"| TRG
classDef airflow fill:#534AB7,color:#EEEDFE,stroke:#3C3489
classDef sf fill:#185FA5,color:#E6F1FB,stroke:#0C447C
classDef ui fill:#0F6E56,color:#E1F5EE,stroke:#085041
classDef approve fill:#3B6D11,color:#EAF3DE,stroke:#27500A
classDef reject fill:#993C1D,color:#FAECE7,stroke:#712B13
class TRG airflow
class STORE,WAIT,APPROVE,REJECT,TIMEOUT sf
class Q,D,A,R ui
class S3M,S3R,ML approve
class FBL,NEXT reject
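The approve and reject routes in the flowchart resume the paused execution with the stored task token. A minimal sketch of those two handlers follows, assuming a boto3 Step Functions client is injected. The function names, the output payload, and the choice to carry `feedback_type` in the `error` field are assumptions for illustration, not the actual ReviewUI Lambda.

```python
import json

# Feedback categories accepted on rejection (as listed in this README).
FEEDBACK_TYPES = {"wrong_rc", "wrong_fix", "hallucination", "incomplete"}


def approve(sfn_client, task_token: str, rater_id: str) -> None:
    """Resume the state machine on the success path."""
    sfn_client.send_task_success(
        taskToken=task_token,
        output=json.dumps({"human_verdict": "approved", "rater_id": rater_id}),
    )


def reject(sfn_client, task_token: str, feedback_type: str, reason: str) -> None:
    """Resume the state machine on the failure path with structured feedback."""
    if feedback_type not in FEEDBACK_TYPES:
        raise ValueError(f"unknown feedback_type: {feedback_type}")
    sfn_client.send_task_failure(
        taskToken=task_token,
        error=feedback_type,  # assumption: feedback type carried as the error name
        cause=reason,         # free-text reason from the SRE
    )
```

In a state machine built this way, a `Catch` on the waiting task would route the failure to the reject handler, while a success continues to the approve handler.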
Every RCA passes through two critics before reaching an SRE:
- AI critic (Mistral 7B via DeepEval) — scores faithfulness, answer relevancy, and contextual recall. Triggers the self-correction loop if score < 0.8.
- Human critic (SRE via Review UI) — assesses semantic and domain correctness the AI cannot evaluate.
On rejection, the SRE provides a structured feedback_type (wrong_rc / wrong_fix / hallucination / incomplete) plus free-text reason and an optional corrected root cause. Both the AI critic output and the human feedback are embedded into the rejected log file:
<original log content>
# === RCA OUTPUT ===
# summary: PostgreSQL connection pool exhausted
# root_cause: ORM session leak in user_service.py
# fix: Restart pods + add session.close()
# === HUMAN FEEDBACK ===
# feedback_type: wrong_rc
# reason: Actual cause was OOM killer terminating the pod
# corrected_root_cause: Memory limit breached — container hit 512Mi ceiling
# timestamp: 2026-03-29T10:22:00Z
The new timestamped file (raw/kubelet_20260315_rejected_20260329_102200.log) is picked up on the next scheduled DAG run. The investigator prompt instructs agents to treat # === sections as prior analysis context and human corrections — avoiding the same mistake while re-examining the raw evidence independently.
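The feedback-in-log format described above is simple enough to build and parse with string operations. The sketch below is an illustration under assumptions: `build_rejected_log`, `rejected_key`, and `split_feedback` are hypothetical helper names, the original key `raw/kubelet_20260315.log` is assumed, and the timestamp is assumed to be UTC.

```python
from datetime import datetime, timezone
from typing import Tuple


def build_rejected_log(original: str, rca: dict, feedback: dict, when: datetime) -> str:
    """Append RCA output and human feedback as '# ===' comment blocks."""
    lines = [original.rstrip("\n"), "# === RCA OUTPUT ==="]
    lines += [f"# {key}: {rca[key]}" for key in ("summary", "root_cause", "fix")]
    lines.append("# === HUMAN FEEDBACK ===")
    lines += [f"# {key}: {value}" for key, value in feedback.items()]
    lines.append(f"# timestamp: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}")
    return "\n".join(lines) + "\n"


def rejected_key(original_key: str, when: datetime) -> str:
    """Derive the new timestamped S3 key for the rejected log."""
    stem = original_key.rsplit(".log", 1)[0]
    return f"{stem}_rejected_{when.strftime('%Y%m%d_%H%M%S')}.log"


def split_feedback(log_text: str) -> Tuple[str, str]:
    """Separate raw log lines from trailing '# ===' context blocks."""
    raw, context = [], []
    in_block = False
    for line in log_text.splitlines():
        if line.startswith("# ==="):
            in_block = True
        (context if in_block else raw).append(line)
    return "\n".join(raw), "\n".join(context)


when = datetime(2026, 3, 29, 10, 22, 0, tzinfo=timezone.utc)
print(rejected_key("raw/kubelet_20260315.log", when))
# raw/kubelet_20260315_rejected_20260329_102200.log
```

Keeping the feedback as comment lines at the end of the file means the raw evidence stays byte-for-byte intact above it, so the investigator can re-examine it independently.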
Why no auto-retrigger? Deliberately omitted to prevent race conditions between a rejection-triggered run and an in-progress pipeline processing a different log batch.
| Component | Service | Purpose |
|---|---|---|
| State machine | AWS Step Functions | waitForTaskToken pause — zero cost while waiting |
| Pending queue | DynamoDB rca_reviews | Stores RCA + task token per incident |
| Review interface | Lambda + API Gateway | Serverless HTML/JS SRE queue; no container to maintain |
| Approve action | Lambda InfraMind-OnApprove | S3 move, rca-results write, MLflow tag |
| Reject action | Lambda InfraMind-OnReject | Feedback-in-log write to raw/, MLflow tag |
| Handoff | Lambda InfraMind-StoreForReview | Receives RCA from Airflow, writes to DynamoDB |
| Layer | Technology | Purpose |
|---|---|---|
| Orchestration | Apache Airflow 3 (Astro CLI) | DAG-based pipeline scheduling, task dependency management |
| LLM Runtime | AWS Bedrock — Llama 3 8B / 70B + Mistral 7B | Serverless LLM inference — model auto-selected by log size |
| Embeddings | AWS Bedrock Titan Embed v2 | 1024-dim semantic vectors (the v2 default) for RAG retrieval |
| Vector DB | ChromaDB 0.4.x | Persistent HNSW index with cosine similarity search |
| RAG Framework | LangChain 0.1.x | Prompt engineering, retrieval chains, agent orchestration |
| HITL Orchestration | AWS Step Functions | waitForTaskToken stateful human review pause |
| HITL Compute | AWS Lambda (Python 3.11) | StoreForReview · OnApprove · OnReject · ReviewUI |
| HITL API | AWS API Gateway (REST, Regional) | /queue · /rca/{id} · /approve · /reject · /log |
| HITL Queue | AWS DynamoDB (on-demand) | Pending RCA store with task token per incident |
| Experiment Tracking | MLflow 2.x (DagsHub) | Hyperparameter logging, metric tracking, human verdict tags |
| LLM Evaluation | DeepEval 0.21.x | Faithfulness, answer relevancy, contextual recall metrics |
| Object Storage | AWS S3 | Raw logs, processed logs, RCA results, model artifacts |
| Monitoring | Prometheus + Grafana | Pipeline metrics, token usage, latency tracking |
InfraMind/
├── dags/
│ ├── tasks/
│ │ ├── embed.py
│ │ ├── fetch.py
│ │ ├── normalize.py
│ │ ├── rca.py
│ │ └── review.py
│ ├── dag.py # Airflow DAG — fetch → normalize → run_rca → trigger_sf_review
│ ├── workflow.py # RCA orchestrator + self-correction loop
│ └── ingestion.py # S3 log fetching
├── agents/
│ ├── investigator.py
│ ├── root_cause.py
│ ├── fix_generator.py
│ ├── formatter.py
│ └── critic.py
├── core/
│ ├── vectordb.py # ChromaDB + Bedrock embeddings
│ ├── normalizer.py # Multi-format log parser
│ ├── evaluator.py # DeepEval integration
│ ├── tracker.py # MLflow helpers
│ ├── sfn_client.py # Step Functions start_execution wrapper
│ └── bedrock_client.py
├── aws/
│ ├── lambdas/
│ │ ├── StoreForReview.py # StoreForReview Lambda
│ │ ├── OnApprove.py # OnApprove Lambda
│ │ ├── OnReject.py # OnReject Lambda
│ │ └── ReviewUI.py # ReviewUI Lambda — serves HTML + API routes
│ └── step-functions/
│ └── Inframind-HILF.json # Step Functions ASL definition
├── config/
│ ├── config.py # Single source of truth for all config
│ ├── settings.yaml
│ ├── logging.yaml
│ └── models.yaml
├── monitoring/ # Prometheus setup
├── prompts/ # Agent prompt templates
├── runbook/ # Markdown runbooks (RAG knowledge base)
├── Dockerfile # Astro Runtime + PYTHONPATH
├── docker-compose.override.yml # ChromaDB volume persistence
├── requirements.txt
├── restart.ps1 # Windows clean restart script
└── restart.sh # macOS/Linux clean restart script
graph TB
subgraph "Docker Compose Stack"
subgraph "Airflow Components"
WEBSERVER["Webserver :8080<br/>UI + REST API"]
SCHEDULER["Scheduler<br/>DAG parsing · Task scheduling"]
TRIGGERER["Triggerer<br/>Async task support"]
POSTGRES[("PostgreSQL<br/>Metadata DB")]
end
subgraph "Persistent Volumes"
DAGS_VOL["./dags/<br/>DAG definitions"]
CHROMA_VOL["./chroma_data/<br/>Vector DB persist"]
LOGS_VOL["./logs/<br/>Task logs"]
end
end
subgraph "AWS Services"
BEDROCK["AWS Bedrock<br/>ap-south-1 · LLM inference"]
S3["S3 Bucket<br/>Log storage + RCA results"]
MLFLOW["DagsHub MLflow<br/>Experiment tracking"]
SF_AWS["Step Functions<br/>HITL state machine"]
LAMBDA["Lambda<br/>HITL compute"]
APIGW["API Gateway<br/>Review UI"]
DYNAMO["DynamoDB<br/>Pending queue"]
end
WEBSERVER --> POSTGRES
SCHEDULER --> POSTGRES
SCHEDULER --> DAGS_VOL & CHROMA_VOL
SCHEDULER --> BEDROCK & S3 & MLFLOW
SCHEDULER -->|"trigger_sf_review"| SF_AWS
SF_AWS --> LAMBDA --> DYNAMO & S3
APIGW --> LAMBDA
style SCHEDULER fill:#1a3a5c,color:#fff
style SF_AWS fill:#185FA5,color:#fff
style LAMBDA fill:#534AB7,color:#fff
Task Pool Configuration
- single_thread_pool(slots=1): Serializes ChromaDB writes to prevent race conditions
- default_pool(slots=128): Parallel execution for fetch/normalize tasks
| Component | Requirement | Notes |
|---|---|---|
| Astro CLI | v1.20+ | Install guide |
| Docker Desktop | 4.25+ | 8GB RAM, 4 CPU cores recommended |
| AWS Account | Bedrock + Lambda + Step Functions enabled | All services in ap-south-1 region |
| S3 Bucket | Standard tier | Versioning + lifecycle policies recommended |
| DynamoDB Table | rca_reviews on-demand | Created automatically or via console |
| DagsHub Account | Free tier | MLflow tracking backend |
Enable the following models in AWS Console → Bedrock → Model access:
- meta.llama3-8b-instruct-v1:0 (fast inference for short logs)
- meta.llama3-70b-instruct-v1:0 (deep reasoning for large logs)
- mistral.mistral-7b-instruct-v0:2 (critic / quality scoring agent)
- amazon.titan-embed-text-v2:0 (RAG embeddings)
Region: ap-south-1 (Mumbai) — lowest latency for Asia-Pacific
Your Airflow IAM user needs the following in addition to existing Bedrock + S3 permissions:
{
"Effect": "Allow",
"Action": "states:StartExecution",
"Resource": "arn:aws:states:ap-south-1:YOUR_ACCOUNT:stateMachine:InfraMind-HITL"
}

git clone https://github.com/nasim-raj-laskar/InfraMind.git
cd InfraMind

Create .env at project root:
# AWS Credentials (IAM user with Bedrock + S3 + Step Functions access)
AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
AWS_REGION=ap-south-1
# DagsHub MLflow Tracking
DAGSHUB_USERNAME=your_username
DAGSHUB_TOKEN=your_dagshub_token
MLFLOW_TRACKING_URI=https://dagshub.com/your_username/InfraMind.mlflow
# HITL — Step Functions
SF_STATE_MACHINE_ARN=arn:aws:states:ap-south-1:YOUR_ACCOUNT:stateMachine:InfraMind-HITL
INFRAMIND_S3_BUCKET=your-bucket-name

Deploy the four Lambda functions in hitl/lambdas/ to AWS (ap-south-1):
| Function | Trigger | Timeout | Layer needed |
|---|---|---|---|
| InfraMind-StoreForReview | Step Functions | 10s | None |
| InfraMind-OnApprove | Step Functions | 30s | mlflow |
| InfraMind-OnReject | Step Functions | 30s | mlflow |
| InfraMind-ReviewUI | API Gateway | 10s | None |
Build the mlflow Lambda Layer:
mkdir -p python
pip install mlflow requests urllib3 packaging \
--target python/ \
--platform manylinux2014_x86_64 \
--python-version 3.11 \
--only-binary=:all:
zip -r mlflow-layer.zip python/
# Upload to Lambda → Layers → Create layer → attach to OnApprove + OnReject

Go to AWS Console → Step Functions → Create state machine → paste hitl/state_machine.json. Copy the ARN into your .env as SF_STATE_MACHINE_ARN.
Create a REST API (InfraMind-HITL-API, Regional, ap-south-1) with the following routes, all pointing to InfraMind-ReviewUI with Lambda Proxy integration enabled:
GET / → serves the HTML review UI
GET /queue → list pending RCAs from DynamoDB
GET /rca/{id} → full RCA detail + AI critic + raw log
GET /log → raw log content from S3 (?key=raw/...)
POST /approve → send_task_success(token)
POST /reject → send_task_failure(token) + human feedback
Enable CORS on all resources → Deploy to stage prod → copy the invoke URL into InfraMind-ReviewUI Lambda as the API constant.
Windows (PowerShell):
.\restart.ps1

macOS/Linux:
chmod +x restart.sh
./restart.sh

Both scripts stop existing containers, start Astro, wait for the scheduler, then apply pools and Airflow variables automatically.
Via Airflow UI: Navigate to http://localhost:8080 (admin / admin) → enable inframind_rca_pipeline → click Trigger DAG.
Via CLI:
docker exec $(docker ps --format "{{.Names}}" | grep scheduler) \
airflow dags trigger inframind_rca_pipeline

Via REST API:
curl -X POST "http://localhost:8080/api/v1/dags/inframind_rca_pipeline/dagRuns" \
-H "Content-Type: application/json" \
-u "admin:admin" \
-d '{"conf": {}}'

s3://your-bucket/
├── raw/ ← Airflow picks up from here
│ ├── app_api_20260316.log ← fresh log
│ └── kubelet_rejected_*.log ← rejected logs with embedded feedback
├── processed/ ← moved here on SRE approval
│ └── app_api_20260316.log
└── rca-results/ ← written on SRE approval
└── results_<incident_id>.json
Lifecycle Policies
{
"Rules": [
{
"Id": "ArchiveProcessedLogs",
"Filter": {"Prefix": "processed/"},
"Transitions": [
{"Days": 30, "StorageClass": "STANDARD_IA"},
{"Days": 90, "StorageClass": "GLACIER"}
]
},
{
"Id": "RetainRCAResults",
"Filter": {"Prefix": "rca-results/"},
"Transitions": [{"Days": 365, "StorageClass": "GLACIER_DEEP_ARCHIVE"}]
}
]
}

graph TB
subgraph "Configuration Sources"
YAML["config/settings.yaml · Pipeline defaults"]
MODELS["config/models.yaml · LLM model specs"]
ENV[".env · Secrets & credentials"]
AIRFLOW_VARS["Airflow Variables · Runtime overrides"]
end
subgraph "Config Loader (config/config.py)"
LOADER["Singleton ConfigManager · Merge + validate"]
end
subgraph "Runtime Components"
DAG["Airflow DAG"]
AGENTS["LLM Agents"]
RAG["RAG Pipeline"]
HITL_CFG["HITL Lambda"]
end
YAML & MODELS & ENV & AIRFLOW_VARS --> LOADER
LOADER --> DAG & AGENTS & RAG & HITL_CFG
style LOADER fill:#4a3a1a,color:#fff
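The merge-and-validate step in the diagram can be illustrated with a small singleton. This is a sketch under assumptions, not the contents of `config/config.py`: the `deep_merge` helper, the method names, the single validation rule, and the precedence order (later sources override earlier ones, with Airflow Variables applied last as runtime overrides) are all hypothetical.

```python
from collections.abc import Mapping
from typing import Any


def deep_merge(base: Mapping, override: Mapping) -> dict:
    """Recursively merge override into base; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


class ConfigManager:
    _instance = None

    def __new__(cls):
        # Singleton: every caller shares one merged configuration.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._data = {}
        return cls._instance

    def load(self, *sources: Mapping) -> "ConfigManager":
        for source in sources:  # later sources take precedence (assumed order)
            self._data = deep_merge(self._data, source)
        self._validate()
        return self

    def _validate(self) -> None:
        threshold = self._data.get("pipeline", {}).get("quality_threshold")
        if threshold is not None and not 0.0 <= threshold <= 1.0:
            raise ValueError("pipeline.quality_threshold must be between 0 and 1")

    def get(self, dotted_key: str, default: Any = None) -> Any:
        node: Any = self._data
        for part in dotted_key.split("."):
            if not isinstance(node, Mapping) or part not in node:
                return default
            node = node[part]
        return node


settings_yaml = {"pipeline": {"max_retries": 2, "quality_threshold": 0.8}}
airflow_vars = {"pipeline": {"quality_threshold": 0.85}}  # runtime override
config = ConfigManager().load(settings_yaml, airflow_vars)
print(config.get("pipeline.quality_threshold"))  # 0.85
print(config.get("pipeline.max_retries"))        # 2
```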
Pipeline Settings (config/settings.yaml)
| Parameter | Default | Description |
|---|---|---|
| pipeline.max_retries | 2 | Self-correction loop iterations |
| pipeline.quality_threshold | 0.8 | Minimum AI critic score to pass to HITL |
| pipeline.timeout_seconds | 300 | Max execution time per log |
| vectordb.chunk_k | 6 | RAG retrieval count |
| vectordb.chunk_size | 1000 | Text splitter chunk size |
| vectordb.chunk_overlap | 200 | Overlap between chunks |
Airflow Variables
| Variable | Default | Description |
|---|---|---|
| INFRAMIND_S3_BUCKET | (unset) | S3 bucket name (required) |
| INFRAMIND_MAX_LOGS | 3 | Batch size per DAG run |
| INFRAMIND_FORCE_REBUILD | false | Rebuild ChromaDB index |
| INFRAMIND_SLACK_WEBHOOK | (unset) | Incident notification endpoint |
| INFRAMIND_ENABLE_CACHE | true | Cache LLM responses (dev mode) |
| INFRAMIND_LOG_LEVEL | INFO | Logging verbosity |
| SF_STATE_MACHINE_ARN | (unset) | Step Functions state machine ARN (required) |
Each approved RCA is written to s3://bucket/rca-results/results_<incident_id>.json:
{
"ai_critic": {
"score": 8,
"reasoning": "The root cause identified in the RCA report is technically correct for the reported error type (TLS handshake timeout). The immediate fix suggested in the report also addresses the root cause by allowing outbound traffic on port 443. However, the severity level seems appropriate for this issue, but it's essential to consider the context, such as the number of affected nodes and the impact on the application, to ensure the severity is accurate."
},
"rca_output": {
"raw_log": "{\"timestamp\":\"2024-01-15T10:23:45Z\",\"level\":\"ERROR\",\"service\":\"kubelet\",\"message\":\"Failed to pull image nginx:latest: RPC error: net/http: TLS handshake timeout\"}",
"summary": "Outbound 443 blocked causing TLS handshake timeout error",
"attempts": 1,
"severity": "High",
"confidence": 0.9,
"log_format": "cloudwatch_json",
"model_used": "Llama-3-8B",
"root_cause": "Outbound 443 blocked",
"incident_id": "10d675cc-fbbc-44b2-a252-c5d81bd106d3",
"log_service": "kubelet",
"log_severity": "ERROR",
"immediate_fix": "Check node's security group configuration to ensure port 443 is allowed for outbound traffic. If not, add an outbound rule to allow traffic on port 443.",
"mlflow_run_id": "ea917216426x49afaax8706x5d70e6a"
}
}

Every RCA execution is tracked in MLflow (DagsHub backend). Human verdicts are written back to the original MLflow run as tags post-review, enabling correlation between AI critic scores and SRE-assessed quality over time.
Tracked per run:
- Parameters: model_name, temperature, max_tokens, chunk_k, quality_threshold
- Metrics: attempt_N_score, final_critic_score, faithfulness, answer_relevancy, total_tokens, inference_latency_ms, cost_usd
- Tags (post-HITL): human_verdict, rater_id, feedback_type
- Artifacts: rca_output.json, critic_feedback.txt, retrieved_context.md
Access MLflow UI: https://dagshub.com/<username>/InfraMind.mlflow
| Metric | Target | Alert Threshold |
|---|---|---|
| RCA Success Rate | ≥ 95% | < 90% |
| Avg AI Critic Score | ≥ 0.85 | < 0.75 |
| SRE Approval Rate | ≥ 80% | < 60% |
| P95 Latency | ≤ 30s | > 60s |
| Token Cost/RCA | ≤ $0.02 | > $0.05 |
| Faithfulness Score | ≥ 0.80 | < 0.70 |
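The alert thresholds in the table translate directly into a rule set. The sketch below shows one way to evaluate a metrics snapshot against them. The metric key names, the use of fractions for percentages, and the `breached` helper are illustrative assumptions; the project's actual alerting is presumably configured in Prometheus and Grafana.

```python
import operator
from typing import Dict, List

# metric name -> (comparison that fires the alert, alert threshold)
# Percentages from the table are written as fractions; key names are hypothetical.
ALERT_RULES = {
    "rca_success_rate": (operator.lt, 0.90),
    "avg_ai_critic_score": (operator.lt, 0.75),
    "sre_approval_rate": (operator.lt, 0.60),
    "p95_latency_s": (operator.gt, 60.0),
    "token_cost_per_rca_usd": (operator.gt, 0.05),
    "faithfulness_score": (operator.lt, 0.70),
}


def breached(metrics: Dict[str, float]) -> List[str]:
    """Return the sorted names of metrics that cross their alert threshold."""
    alerts = []
    for name, value in metrics.items():
        if name not in ALERT_RULES:
            continue  # ignore metrics without a rule
        fires, threshold = ALERT_RULES[name]
        if fires(value, threshold):
            alerts.append(name)
    return sorted(alerts)


snapshot = {"rca_success_rate": 0.93, "p95_latency_s": 72.0, "faithfulness_score": 0.81}
print(breached(snapshot))  # ['p95_latency_s']
```

Note that a value can miss its target without alerting: a 93% success rate is below the 95% target but above the 90% alert threshold, so only the latency rule fires here.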
Monthly (1000 logs/month, 2 retries avg, 50/50 8B/70B split):
| Component | Cost |
|---|---|
| Llama 3 8B — 500 logs × 2 × 6000 tokens | $1.80 |
| Llama 3 70B — 500 logs × 2 × 6000 tokens | $15.90 |
| Mistral 7B critic — 1000 × 2 × 2000 tokens | $0.80 |
| Titan Embed v2 RAG queries | $0.10 |
| S3 storage | $1.27 |
| Step Functions — 5 transitions × 1000 executions | $0.13 |
| Lambda invocations | ~$0.50 |
| DynamoDB on-demand | ~$0.10 |
| Total (self-hosted Airflow) | ~$21/month |
| Total (Astro Cloud) | ~$196/month |
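The self-hosted total can be checked by summing the rows above. The short script below reproduces that arithmetic and also derives the per-1K-token rate each LLM row implies. Those rates are back-calculated from the table, not quoted from a price list, and the variable names are illustrative.

```python
# (tokens per month, monthly cost in USD) taken from the LLM rows of the table.
llm_rows = {
    "Llama 3 8B": (500 * 2 * 6000, 1.80),
    "Llama 3 70B": (500 * 2 * 6000, 15.90),
    "Mistral 7B critic": (1000 * 2 * 2000, 0.80),
}

# Remaining rows, monthly cost in USD (approximate values as given).
other_rows = {
    "Titan Embed v2 RAG queries": 0.10,
    "S3 storage": 1.27,
    "Step Functions": 0.13,
    "Lambda invocations": 0.50,
    "DynamoDB on-demand": 0.10,
}

for name, (tokens, cost) in llm_rows.items():
    implied_rate = cost / tokens * 1000  # USD per 1K tokens implied by the row
    print(f"{name}: {tokens:,} tokens, implied ${implied_rate:.5f} per 1K tokens")

total = sum(cost for _, cost in llm_rows.values()) + sum(other_rows.values())
print(f"Total: ${total:.2f}")  # Total: $20.60
```

The sum comes to $20.60, which the table rounds to about $21/month; the 70B row accounts for roughly three quarters of it.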
Each agent uses a Jinja2 template in prompts/. Key variables injected per call:
| Variable | Source | Used by |
|---|---|---|
| {{log_content}} | Normalized log | Investigator |
| {{rag_context}} | ChromaDB top-6 chunks | All agents |
| {{previous_output}} | Prior agent output | Root Cause, Fix Generator |
| {{critic_feedback}} | AI critic text | Investigator (on retry) |
| {{feedback_history}} | Embedded # === sections in log | Investigator (on HITL rejection retry) |
The investigator prompt explicitly instructs: "If the log contains lines starting with # ===, treat them as previous analysis context and human corrections. Do NOT repeat a rejected root cause. Re-examine raw log evidence independently to verify any provided correction."
| Term | Definition |
|---|---|
| RAG | Retrieval-Augmented Generation — LLM technique combining vector search with generation |
| HITL | Human-in-the-Loop — human review gate integrated into an automated pipeline |
| HNSW | Hierarchical Navigable Small World — graph-based approximate nearest neighbor algorithm |
| LLMOps | LLM Operations — practices for deploying and managing LLM systems in production |
| RLHF | Reinforcement Learning from Human Feedback — fine-tuning using human preferences |
| Faithfulness | Metric measuring if LLM output is grounded in provided context (no hallucinations) |
| Answer Relevancy | Metric measuring if LLM output addresses the original query |
| Critic Agent | LLM agent (Mistral 7B) that evaluates quality of other agents' outputs |
| Self-Correction Loop | Iterative refinement where critic feedback improves subsequent attempts |
| waitForTaskToken | Step Functions mechanism that pauses a state machine until an external signal resumes it |
| XCom | Airflow's cross-communication mechanism for passing data between tasks |
| DAG | Directed Acyclic Graph — Airflow's workflow definition structure |
Apache License 2.0 — see LICENSE for details.
@software{inframind2026,
author = {Nasim Raj Laskar},
title = {InfraMind: Autonomous Root Cause Analysis with Multi-Agent LLMs and Human-in-the-Loop Review},
year = {2026},
url = {https://github.com/nasim-raj-laskar/InfraMind}
}

- Issues: GitHub Issues
- Discussions: GitHub Discussions
- MAIL: Nasim Raj Laskar
Built with ❤️ for SRE teams fighting alert fatigue
Last Updated: April 2026 · Version: 2.0.0 · Maintainer: Nasim Raj Laskar


