An AI-powered multi-agent system for automated incident triage, enrichment, and notification on data platform environments.
ARIA is a multi-agent AI system that automates the first-response lifecycle of infrastructure incidents on data platform environments. Instead of an on-call engineer waking up at 3am to manually investigate a raw alert, ARIA does the preliminary work — correlates logs, identifies affected services, classifies the error pattern, and notifies the right people with a structured findings summary.
ARIA targets data platform environments specifically — a space largely ignored by existing AIOps tools that focus almost exclusively on Kubernetes and cloud-native microservices:
- On-premise: Cloudera CDP, Oracle
- Cloud: GCP, AWS, Azure, Databricks
- Workflow engines: Azure Data Factory, Apache Airflow
This is a proof-of-concept in active development, intended to evolve into a fully open-source project.
The AIOps space is active. HolmesGPT, IncidentFox, FuzzyLabs SRE Agent, PagerDuty SRE Agent, and Dash0 Agent0 all exist and are solving adjacent problems. After studying them, here is what we learned:
| What the competition does | What ARIA does differently |
|---|---|
| All focused on Kubernetes / cloud-native | Targets data platforms: on-premise (CDP, Oracle) and cloud (Databricks, GCP, AWS, Azure) |
| Some build autonomous agents from day 1 | Phase 1 is notify-only — builds trust before write access |
| Most treat log retrieval as a RAG dump | Surgical time-windowed + platform-tagged log queries |
| Few show confidence scoring | Every classification includes a confidence band |
| Build integrations from scratch | Uses pre-built SDKs (ServiceNow Python SDK, Slack Bolt) |
| Vendor lock-in (GCP, AWS, Azure) | Plugin architecture — runs anywhere |
Key lessons absorbed from the community:
- Never present uncertain root cause with confident language — HolmesGPT literally shipped a
fix-holmes-overconfidencepatch - Engineers don't trust agents that write to production without oversight — earn trust in read-only mode first
- The #1 value is surfacing data fast, not deciding for engineers
- Memory compounds over time — past incident history is the "make or break" feature (Phase 2+)
ARIA is delivered across three distinct phases, each adding capability while maintaining architectural consistency.
ARIA investigates and notifies. Human updates ticket manually.
New incident → Identify resource → Find logs → Classify error
→ Notify team → [HUMAN UPDATES TICKET MANUALLY]
Goal: Build trust. Engineers see ARIA's findings, validate them in practice, understand system behavior.
A 6-sprint bridge between the Phase 1 POC and Phase 2 production deployment. No new user-facing capabilities — this phase makes ARIA deployable, observable, and testable at scale.
| Sprint | Focus |
|---|---|
| S1 | Structured logging (structlog, run_id, lifecycle events) |
| S2 | Monitoring foundation — RunRecord, SQLite run store, REST API, Alpine.js dashboard, operating mode scaffold |
| S3 | Docker packaging — Dockerfile, ARIA_CONFIG_PATH + ConfigMap pattern, VertexAILLMClient, LLM provider DI |
| S4 | Testing infrastructure wiring — UC1/UC2/UC3 cluster setup, KB runbooks, CMDB validation |
| S5 | Round 2 acceptance testing — 30 incidents across UC1 + UC2 on real simulated infrastructure |
| S6 | GCP native service connectors — configurable resource type templates for BQ, Cloud Functions, Pub/Sub, GCS |
ARIA investigates, notifies with approval buttons. Writes to ticket only after human approval.
New incident → Identify resource → Find logs → Classify error
→ Notify team with Approve/Reject buttons
→ [IF APPROVED] → Write findings to ticket
Goal: Add automation while keeping human control. ARIA can write, but only after explicit approval.
ARIA acts. The critical addition is auto-acknowledgement, which directly impacts MTTA (Mean Time To Acknowledge) — the metric that governs SLA compliance. A ticket sitting unacknowledged at 3am kills your SLA score even if an engineer fixes it in 10 minutes.
New incident → Auto-acknowledge (MTTA impact) → Identify service
→ Read logs → Root cause analysis → Aggregate all findings
→ Write to ticket → Notify human to resolve
Goal: Full automation of investigation phase. ARIA acknowledges tickets immediately, investigates, writes findings.
ARIA is composed of five agents. Each agent (1–4) is a standalone Python class with a run(PipelineState) → PipelineState interface and an injected LLM client. Agent 0 is the planned LangGraph orchestrator that will wire them into a stateful pipeline — it is the only component that uses LangGraph directly.
ServiceNow ──► Agent 0 (Orchestrator — LangGraph pipeline)
│
▼
Agent 1 (Incident Reader) ◄── CMDB / LLM
│
▼
IncidentMetadata
│
▼
Agent 2 (Log Extractor) ◄── KnowledgeBase / LLM (query planning)
│ ▲
│ │ ReAct loop
▼ │
LogQueryResult
│
▼
Agent 3 (Classifier) ◄── LLM
│
▼
ClassificationResult
│
▼
Agent 4 (Notifier) ──► Slack / MS Teams / LLM (Phase 2)
Agents 2 and 3 form a ReAct loop: if the classifier determines the log evidence is insufficient, it signals Agent 2 to run an additional targeted query. The loop runs in-memory until the classifier has enough evidence or the iteration budget is exhausted.
File: core/orchestrator/pipeline.py
The LangGraph pipeline that owns the shared PipelineState and coordinates the full run — launching Agent 1, threading state through each subsequent agent, managing the Agent 2 ↔ 3 ReAct loop, and surfacing errors. This is the only component that uses LangGraph; agents 1–4 are plain Python nodes composed by it.
Dry-run mode (ARIA_DRY_RUN=true): all connectors are replaced with in-memory stubs — no ServiceNow, SSH, or Slack credentials required. Only ARIA_AGENT1_MODEL is needed (Agent 1 still calls the LLM for CI resolution).
File: core/agents/incident_reader.py
Fetches the raw incident from ServiceNow and resolves the affected CI to a specific node or service so Agent 2 always receives a concrete SSH/API target — never a cluster name.
Three-path CI resolution:
| Path | Condition | What happens |
|---|---|---|
| 1 — Fast path | CI is a known service/node | IP resolved from CMDB. Description scanned for sibling names. No LLM call. |
| 2 — Cluster | CI is a cluster or absent | LLM extracts resource name(s), validated against CMDB membership. IP resolved per resource. |
| 3 — Unknown | CI class unknown, no cluster context | LLM extraction from free-text description. affected_ci_ip not set. |
LLM failures are non-fatal: raw fields pass through with a WARNING. platform_tag (cdp, gcp, aws, azure, databricks, oracle) is always resolved here — Agent 2 uses it for connector routing.
File: core/agents/log_extractor.py
Queries logs via a two-tier strategy:
- Tier 1 (fast path): if
LOG_AGGREGATOR_URLis configured, queries Splunk/ELK directly. - Tier 2 (connector dispatch): queries
KnowledgeBaseInterfaceforLogAccessHint(log paths, keywords), then dispatches to the platform connector.
Time window: opened_at − 30 min. On empty result, retries once with a 60-min window (static routing only). Vault-backed credentials — never hardcoded. Non-fatal: connector failures return empty LogQueryResult.
Optional LLM query planning: when agent2 is set in conf.yaml, Agent 2 calls the LLM before connector dispatch to produce a LogQueryPlan — choosing the connector, log paths, keywords, and time window for the specific incident. Falls back silently to static platform_tag → connector routing on any LLM failure. The plan is exposed in the API response as log_query_plan.
Cross-service log resolution (ReAct loop): when Agent 3 signals that evidence points to a different host, Agent 2 re-runs with a pending_log_request. It resolves the named CI to an IP via data/cluster_hosts.json, fetches logs from that host, and merges the new lines with the existing log_result so Agent 3 receives all evidence on the next pass.
Implemented connectors:
SSHLogConnector(implementations/clusters/onprem/) — provider-agnostic SSH connector for any on-premise cluster (CDP, HDP, Oracle RAC, MapR, etc.). Log dirs and SSH credentials are constructor params.GCPLogConnector(implementations/clusters/cloud/gcp/) — Cloud Logging API with vault-backed service account.
Cloud stubs: Databricks, AWS EMR, Azure Monitor — raise NotImplementedError, full implementations planned.
File: core/agents/classifier.py
Classifies the root cause of an incident from metadata and log evidence using LLM reasoning over the extracted log lines. Returns a ClassificationResult with mandatory confidence scoring — a low-confidence result is never presented as definitive (AC-05 compliance).
Error classes: oom | cpu | disk | network | auth | db_lock | pipeline | unknown
Confidence bands: high (≥0.7) | medium (0.5–0.69) | low (<0.5) — derived from the confidence float, never trusted from the LLM.
Model is injected at construction via LLMClientInterface — fully provider-agnostic. Model name configured via ARIA_AGENT3_MODEL. When no LLM client is injected (dry-run mode), the agent falls back to stub behaviour (error_class="unknown", LOW confidence) without crashing the pipeline.
ReAct loop trigger: when log evidence explicitly names a different host as the root cause, the LLM sets a log_request field in its response instead of classifying. Agent 3 writes this to state.pending_log_request and returns without a classification — the orchestrator routes back to Agent 2 for a targeted cross-service log fetch. The loop is capped at 5 iterations.
File: core/agents/notifier.py
Accepts the completed PipelineState, formats a NotificationPayload, and delivers it to any channel injected at construction via CommunicatorInterface. Channel selection is handled at the DI layer; the agent has zero channel-specific logic. An LLM client is injected at construction but unused in Phase 1 — wired for Phase 2 response interpretation (generating human-readable summaries before write-back).
Notification format: Slack Block Kit attachment with a colour-coded sidebar (green = HIGH confidence, amber = MEDIUM, red = LOW, grey = partial). Partial notification (classification not yet available) is sent automatically — on-call engineers are always informed even if Agent 3 did not run.
Implemented connectors (swap in api/dependencies.py, no agent code changes):
| Connector | Status | Location |
|---|---|---|
SlackConnector |
✅ Full | implementations/coms/slack/connector.py |
TeamsConnector |
✅ Full | implementations/coms/teams/connector.py |
GoogleChatConnector |
✅ Full | implementations/coms/google_chat/connector.py |
TelegramConnector |
🔜 Scaffold | implementations/coms/telegram/connector.py |
WhatsAppConnector |
🔜 Scaffold | implementations/coms/whatsapp/connector.py |
Phase 1 is notify-only — no write-back to ServiceNow. Phase 2 adds interactive Approve/Reject buttons via Slack Bolt (no migration required — slack-bolt is already the underlying library).
Every agent exposes a REST API (FastAPI). This enables two things:
- Individual agent testing — call any agent in isolation and inspect its JSON output without running the full pipeline.
- API mode — agents communicate via HTTP instead of in-process LangGraph, enabling microservice deployments where each agent runs as a separate service.
Two operating modes:
| Mode | Set via | Agent communication | Use case |
|---|---|---|---|
workflow (default) |
ARIA_MODE=workflow |
In-process LangGraph | Single-server deployment |
api |
ARIA_MODE=api |
HTTP calls between agents | Distributed / microservice |
Start the API server:
uvicorn api.main:app --reload
# Swagger UI → http://localhost:8000/docsCall Agent 1 directly:
curl -X POST http://localhost:8000/api/v1/agent1/run \
-H "Content-Type: application/json" \
-d '{"incident_number": "INC0010001"}'{
"status": "success",
"agent": "agent1",
"incident_number": "INC0010001",
"duration_ms": 843,
"data": {
"incident_number": "INC0010001",
"short_description": "Daily quota for dataflow X not reached",
"priority": "P3",
"affected_ci": "cdp-cluster-prod-01",
"llm_extraction": {
"affected_ci": "cdp-cluster-prod-01",
"platform_tag": "cdp",
"confidence": "medium"
}
},
"error": null
}All responses are JSON. All errors use the same envelope — no HTML error pages.
Agent API status:
| Agent | Endpoint | Status |
|---|---|---|
| Agent 1 — Incident Reader | POST /api/v1/agent1/run |
✅ Implemented |
| Agent 2 — Log Extractor | POST /api/v1/agent2/run |
✅ Implemented |
| Agent 3 — Classifier | POST /api/v1/agent3/run |
✅ Implemented |
| Agent 4 — Notifier | POST /api/v1/agent4/run |
✅ Implemented |
| Agent 0 — Pipeline (full run) | POST /api/v1/pipeline/run |
✅ Implemented |
See documentation/aria_apis.md for the full API specification including request/response schemas, error codes, and API mode configuration.
ARIA emits one canonical structured event stream (via structlog), rendered two ways from a single pass:
- stdout — human-readable coloured output for ops by default; set
ARIA_LOG_FORMAT=jsonto emit JSON on stdout instead (for containers that scrape stdout). - rolling file — always JSON, daily rotation with 30-day retention, at
${ARIA_LOG_DIR}/aria.log. This is the machine feed: it is the substrate the Phase 1.5 S2 monitoring sprint queries, and the long-term corpus ARIA learns from.
Every event carries ambient context bound once at pipeline entry — run_id, incident_number, schema_version, service — so a whole run is reconstructable with grep run_id=<uuid> aria.log. Third-party logs (paramiko, anthropic, langgraph, uvicorn) flow through the same sinks with the same structure.
Configuration (env vars):
| Variable | Default | Purpose |
|---|---|---|
ARIA_LOG_DIR |
logs |
Directory for the rolling aria.log file |
ARIA_LOG_FORMAT |
(console) | json forces JSON on stdout; otherwise pretty console |
ARIA_LOG_LEVEL |
INFO |
Root log level |
ARIA_LOG_PII |
(redact) | allow disables incident free-text redaction (debug only) |
Canonical event vocabulary (the frozen contract — core/observability.py):
| Event | Emitted by | Key fields |
|---|---|---|
pipeline_started / pipeline_completed |
Orchestrator | start_time / full RunRecord (status, per-agent durations, token totals, confidence, error class) |
agent_started / agent_completed / agent_failed |
Lifecycle decorator | agent_name, duration_ms, error_class |
ci_resolved |
Agent 1 | resolution_path, platform_tag, ci_count |
log_query_completed |
Agent 2 | connector, lines_returned, total_scanned, window_minutes |
classification_completed |
Agent 3 | error_class, confidence, confidence_band, evidence_count |
react_loop_iteration / routing_decision |
Orchestrator | iteration / from_agent, to_agent, reason |
llm_call_completed |
LLM clients | model, tokens_in, tokens_out, duration_ms |
notification_sent |
Agent 4 | channel, is_partial |
Each run is summarised into a RunRecord (core/models.py) at completion — the same model the S2 monitoring store persists, so logging and monitoring share one contract with no duplicate instrumentation.
PII safety: incident free-text fields (description, long_description, raw_record, caller) are redacted to [REDACTED:<len>] before any sink, keeping both the log file and the learning corpus clean.
Core principle: ARIA's core engine is pure Python with ZERO cloud dependencies. All infrastructure concerns (connectors, queues, state stores) are abstracted behind Python ABCs (Abstract Base Classes).
ARIA targets data platform environments — on-premise (Cloudera CDP, Oracle) and cloud (Databricks, GCP, AWS, Azure) — where cloud vendor lock-in is unacceptable. The plugin architecture ensures ARIA can run anywhere:
- Local development: In-memory queue, SQLite state store, local log files
- On-premise: Kafka queue, PostgreSQL state store, Splunk/ELK log connectors
- Cloud (GCP): Pub/Sub queue, Firestore state store, BigQuery log connector
- Cloud (AWS): SQS queue, DynamoDB state store, CloudWatch log connector
- Cloud (Azure): Service Bus queue, Cosmos DB state store, Log Analytics connector
┌─────────────────────────────────────────────────┐
│ Core Engine (Pure Python) │
│ Agents · LangGraph Pipeline · CMDBResolver │
│ ZERO cloud dependencies │
└─────────────────┬───────────────────────────────┘
│
┌─────────────────▼───────────────────────────────┐
│ Interfaces (Abstract Base Classes) │
│ LogStoreInterface · ConnectorInterface │
│ VaultInterface · KnowledgeBaseInterface · etc. │
└─────────────────┬───────────────────────────────┘
│
┌─────────────────▼───────────────────────────────┐
│ Implementations │
│ clusters/onprem/ ← SSHLogConnector (any VM) │
│ clusters/cloud/ ← GCP / AWS / Databricks / │
│ Azure │
│ itsm/servicenow/ ← ServiceNowConnector │
│ vault/ ← EnvVarVault · GCP SM · │
│ HashiCorp / AWS / Azure KV │
│ memory/ ← Testing stubs │
└─────────────────────────────────────────────────┘
Every layer below is provider-agnostic — each is abstracted behind an interface. The providers listed under "Dev & POC" are what the team used during development; they are reference choices, not requirements.
| Layer | Interface | Dev & POC provider |
|---|---|---|
| Core engine | — | Python 3.11+ |
| Agent orchestration | — | LangGraph 0.2+ |
| LLM | LLMClientInterface |
Claude Sonnet 4.6 (Anthropic API) · Vertex AI (Gemini / Claude-on-Vertex) — P1.5 S3 |
| ITSM / incident source | ConnectorInterface |
ServiceNow REST Table API |
| Log store | LogStoreInterface |
BigQuery + Cloud Storage (GCP) |
| Notifications | CommunicatorInterface |
Slack Bolt (aria_bot) + MS Teams Webhooks |
| Queue | QueueInterface |
In-memory (POC) — Pub/Sub planned |
| State store | StateStoreInterface |
In-memory (POC) — Firestore planned |
| Secrets / vault | VaultInterface |
Environment variables (POC) · GCP Secret Manager (P1.5 S3) · HashiCorp Vault / AWS SM / Azure KV |
| Testing | — | pytest + fixtures |
| Dataset | Source | Used for |
|---|---|---|
| Loghub (HDFS, Spark, OpenStack, BGL) | github.com/logpai/loghub | Few-shot examples for LLM-based classification |
| AIOps Challenge 2020/2022 | competition.aiops-challenge.com | Validation dataset |
| Stack Overflow data dump | archive.org/details/stackexchange | NLP enrichment for context |
| Numenta Anomaly Benchmark (NAB) | github.com/numenta/NAB | Time-series anomaly baseline |
| NASA HTTP Logs | ita.ee.lbl.gov | Baseline log parsing |
Note: No traditional ML training in Phase 1. Datasets are used for LLM few-shot examples and validation only.
Confidence scoring is mandatory. Every Agent 3 output includes a confidence band: high (≥0.7), medium (0.5–0.69), or low (<0.5). A low-confidence result is displayed with explicit caveats. This was the most common failure mode in comparable open-source projects and is a non-negotiable requirement.
Phase 1 is notify-only. ARIA does not write to ServiceNow in Phase 1. Engineers see findings, validate them in practice, build trust. Write access comes in Phase 2 with human approval gate.
Surgical log queries, not RAG dumps. Logs are queried with mandatory filters: time window (incident timestamp ± 30 minutes) and platform tag. No vector database dump of all historical logs. This was documented as a critical failure pattern in production AIOps deployments.
Pre-built connectors over custom integration. The ServiceNow Python SDK, Slack Bolt, and LangChain tool library are used wherever they exist. Building custom OAuth flows and API clients is an integration tax that kills POC timelines.
Cloud-agnostic core. ARIA's core engine has ZERO cloud dependencies. All infrastructure is abstracted behind Python ABCs. This ensures ARIA can run on any platform without vendor lock-in.
aria/
├── api/ # REST API layer (FastAPI)
│ ├── main.py # App entry point — uvicorn api.main:app
│ ├── schemas.py # Pydantic request/response models
│ ├── dependencies.py # Shared DI (agent singletons)
│ ├── static/dashboard/ # Alpine.js dashboard (P1.5 S2)
│ └── routers/ # One router per agent + health
│ ├── health.py
│ ├── agent1.py # ✅ POST /api/v1/agent1/run
│ ├── agent2.py # ✅ POST /api/v1/agent2/run
│ ├── agent3.py # ✅ POST /api/v1/agent3/run
│ ├── agent4.py # ✅ POST /api/v1/agent4/run
│ ├── pipeline.py # ✅ POST /api/v1/pipeline/run
│ └── monitoring.py # P1.5 S2 — GET /api/v1/runs
├── core/ # Pure Python, zero cloud dependencies
│ ├── agents/ # Agent implementations
│ ├── interfaces/ # ABCs: connector, log_store, llm_client, vault, knowledge_base, queue, state_store
│ ├── config.py # conf.yaml loader with env var fallback
│ ├── models.py # Shared data models (IncidentMetadata, LogQueryResult, PipelineState, etc.)
│ ├── exceptions.py # Domain exceptions
│ └── cmdb_resolver.py # ServiceNow CMDB CI relationship queries
├── implementations/
│ ├── clusters/
│ │ ├── onprem/ # SSHLogConnector — any bare-metal/VM cluster (CDP, HDP, Oracle RAC, MapR, etc.)
│ │ └── cloud/
│ │ ├── gcp/ # GCPLogConnector — Cloud Logging API
│ │ ├── databricks/ # stub — planned
│ │ ├── aws/ # stub — planned
│ │ └── azure/ # stub — planned
│ ├── itsm/
│ │ └── servicenow/ # ServiceNowConnector
│ ├── coms/
│ │ ├── slack/ # Slack Bolt client (aria_bot)
│ │ └── teams/ # MS Teams webhook
│ ├── llm/
│ │ ├── anthropic/ # AnthropicLLMClient
│ │ ├── claude_code/ # ClaudeCodeLLMClient (local dev)
│ │ └── vertex_ai/ # VertexAILLMClient — ADC auth (P1.5 S3)
│ ├── vault/ # EnvVarVault (+ HashiCorp, AWS SM, Azure KV)
│ ├── knowledge_base/ # FileKnowledgeBase (+ Chroma/PGVector planned)
│ └── memory/ # In-memory stubs for unit tests
├── tests/
│ ├── unit/ # Mock-based, no network required
│ ├── integration/ # Require real external services
│ └── fixtures/ # Sample incidents, CDP log fixtures (JSONL)
├── deployment/
│ └── monolithic/ # docker-compose + conf.yaml.example (P1.5 S3)
├── documentation/ # MkDocs site source (mkdocs serve)
├── infra/
│ └── terraform/
│ └── uc_testing/ # UC1 (Hadoop VMs) · UC2 (Dataproc) · UC3 (GCP native)
├── ml/ # Datasets, few-shot prompt assets, evaluation scripts
├── tests/acceptance/ # ground_truth.json · round results · AC reports
├── Dockerfile # P1.5 S3 — python:3.11-slim, non-root, single stage
├── conf_template.yaml # Non-secret config — copy to conf.yaml (or mount as ConfigMap)
├── .env.example # Secrets template
├── requirements.txt
└── README.md
See documentation/guides/getting-started.md for the full walkthrough.
- Python 3.11+
- ServiceNow developer instance (free at developer.servicenow.com)
- Slack app with
chat:writescope - API key for your LLM provider (Anthropic Claude Sonnet 4.6 used as the dev reference — bring your own via
LLMClientInterface)
git clone https://github.com/bayrem/aria.git
cd aria
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# Configuration
cp conf_template.yaml conf.yaml # non-secret config — fill in your values
cp .env.example .env # secrets — fill in your credentials
# Run
uvicorn api.main:app --reload
# Swagger UI → http://localhost:8000/docsPhase 1 is complete when all of the following pass on 10 consecutive test incidents:
| ID | Criterion | Target |
|---|---|---|
| AC-01 | ARIA reads new SNow incident within 60s of creation | Latency < 60s |
| AC-02 | Affected resource correctly identified | ≥ 80% accuracy |
| AC-03 | At least 1 relevant log line returned for incidents with available logs | ≥ 80% recall |
| AC-04 | Error classification label is correct | ≥ 70% accuracy |
| AC-05 | Confidence score shown in every notification | 100% |
| AC-06 | Notification received in Slack/Teams within 3 minutes | Latency < 180s |
| Phase | Milestone | Status |
|---|---|---|
| Phase 0 | Setup: GitHub, Slack, ServiceNow dev instance, core interfaces | ✅ Done |
| Phase 1 | M1: Core interfaces, LLM abstraction, CI/CD foundation | ✅ Done |
| Phase 1 | M2: Agent 1 + ServiceNow connector | ✅ Done |
| Phase 1 | M3: Agent 2 + log connectors (CDP, GCP) + stubs + REST API | ✅ Done |
| Phase 1 | M3.5: Restructure + cloud connectors (Databricks, AWS, Azure) + vault + vector KB | ✅ Done |
| Phase 1 | S5.5: LLM mode selector + Agent 2 optional LLM query planning (LogQueryPlan) |
✅ Done |
| Phase 1 | M4: Agent 3 — LLM-based classifier with confidence scoring | ✅ Done |
| Phase 1 | M5: Agent 4 — Notifier (Slack/Teams/Google Chat) | ✅ Done |
| Phase 1 | M6: Orchestration + full pipeline | ✅ Done |
| Phase 1 | S8: ReAct loop trigger — cross-service log requests | ✅ Done |
| Phase 1 | M7: Acceptance criteria validated on local environment | ✅ Done |
| Phase 1.5 | S1: Structured logging — structlog, run_id, lifecycle events, RunRecord |
✅ Done |
| Phase 1.5 | S2: Monitoring foundation — run store, REST API, Alpine.js dashboard, mode scaffold | 🔜 Next |
| Phase 1.5 | S3: Docker + ARIA_CONFIG_PATH + VertexAILLMClient + LLM provider DI |
🔜 Planned |
| Phase 1.5 | S4: Testing infrastructure — UC1/UC2/UC3 cluster wiring, KB runbooks, CMDB validation | 🔜 Planned |
| Phase 1.5 | S5: Round 2 acceptance testing — 30 incidents on UC1 + UC2 real infrastructure | 🔜 Planned |
| Phase 1.5 | S6: GCP native connectors — BQ, Cloud Functions, Pub/Sub, GCS | 🔜 Planned |
| Phase 2 | Human validation gate + write-back to ServiceNow | 💡 Planned |
| Phase 3 | Autonomous mode with auto-acknowledgement (MTTA impact) | 💡 Vision |
| Risk | Mitigation |
|---|---|
| LLM classification accuracy insufficient | Confidence scoring + Phase 1 human validation in practice |
| Log data unavailable for a platform | Agent 2 returns empty gracefully; notifies human with "no logs found" |
| ServiceNow API rate limiting | Exponential backoff + circuit breaker |
| Plugin architecture adds complexity | Start with 1-2 implementations, document patterns clearly |
| Training data insufficient for Oracle/CDP | Flag as low confidence with explicit platform caveat |
| Engineer distrust of AI-generated findings | Notify-only Phase 1 builds trust before write access |
ARIA was designed with awareness of the following open-source and commercial projects:
- HolmesGPT — CNCF Sandbox, cloud-native/K8s focus, SNow integration. Gap: no data platform support.
- IncidentFox — Multi-agent SRE platform, Slack-first, 85–95% alert noise reduction. Gap: K8s/cloud-native only.
- FuzzyLabs SRE Agent — Lightweight Claude-powered agent, closest in architecture to ARIA Phase 1.
- PagerDuty SRE Agent — Best-in-class memory architecture. Gap: closed source, enterprise pricing.
- Dash0 Agent0 — Transparency-first, OpenTelemetry-based. Lesson adopted: show every reasoning step.
Key lesson: Nobody is focused on data platform incidents — on-premise (CDP, Oracle) or cloud (Databricks). That is ARIA's moat.
Contributions are welcome. See CONTRIBUTING.md for guidelines.
Licensed under the Apache License, Version 2.0.
This project is a proof-of-concept. It is not production-ready.