Production-focused causal incident response copilot for SRE teams.
SRE-Nidaan is a 3-service system (Face + Body + Brain) that combines:
- structured incident analysis with causal DAG output
- grounding against telemetry + knowledge base evidence
- MCP-style tool routing in the backend
- strict human approval gating before interventions
- analyst feedback capture for continuous improvement
| Layer | Visitor signal |
|---|---|
| Face | Next.js operator UI for incident entry, causal graph review, and safety-gated remediation. |
| Body | FastAPI orchestration layer for telemetry lookup, evidence retrieval, tool routing, and persistence. |
| Brain | vLLM-backed Llama inference path with LoRA adapter preparation and structured response constraints. |
| Proof points | Live deployment links, narrated demo, generated architecture diagram, evaluation charts, and reproducible run commands. |
- Product (canonical): https://sre-nidaan-122722888597.us-east4.run.app
- Product alias: https://sre-nidaan-face-122722888597.us-east4.run.app
- Body API: https://sre-nidaan-body-122722888597.us-east4.run.app
- Brain API: https://sre-nidaan-brain-ciiiagnzaq-uk.a.run.app
In incidents, vanilla LLM responses can sound confident but still be unsafe:
- they often miss confounders
- they can recommend generic actions without grounding
- they do not naturally enforce safety gates
SRE-Nidaan addresses this by forcing structured causal outputs, scoring evidence overlap, and requiring human approval before high-impact actions.
- Face (Next.js): operator UI, incident input, graph workspace, safety and feedback actions.
- Body (FastAPI): orchestration, grounding retrieval, MCP tool calls, candidate verification, and persistence.
- Brain (vLLM + LoRA): OpenAI-compatible inference serving Meta-Llama-3-8B-Instruct with production adapter.
- Operator provides incident brief and telemetry context.
- Body fetches telemetry (sre.telemetry.get_snapshot) and retrieves grounding evidence from ops/knowledge_base.json.
- Body prompts Brain with strict schema constraints (guided_json).
- Body verifies candidate quality (evidence overlap, telemetry overlap, structural viability).
- Body either returns accepted live analysis or deterministic grounded fallback.
- Face renders DAG + reasoning; intervention requires human authorization.
- Analyst feedback is persisted for future reward/preference improvements.
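
The verification and fallback steps in the flow above can be sketched roughly as follows. This is a minimal illustration, not the Body's actual implementation: the lexical overlap metric, the threshold constants, the `Candidate` type, and `select_analysis` are all hypothetical names and choices made for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical thresholds; the real service may weigh these signals differently.
MIN_EVIDENCE_OVERLAP = 0.2
MIN_TELEMETRY_OVERLAP = 0.2


def tokenize(text: str) -> set[str]:
    """Lowercase word tokens, used for a crude lexical overlap score."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def overlap(candidate: str, reference: str) -> float:
    """Fraction of reference tokens that also appear in the candidate."""
    ref = tokenize(reference)
    if not ref:
        return 0.0
    return len(ref & tokenize(candidate)) / len(ref)


def is_dag(edges: list[tuple[str, str]]) -> bool:
    """Structural viability: the causal graph must be non-empty and acyclic."""
    if not edges:
        return False
    graph: dict[str, list[str]] = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    state: dict[str, int] = {}  # 1 = visiting, 2 = done

    def visit(node: str) -> bool:
        if state.get(node) == 1:
            return False  # back edge, so there is a cycle
        if state.get(node) == 2:
            return True
        state[node] = 1
        ok = all(visit(nxt) for nxt in graph[node])
        state[node] = 2
        return ok

    return all(visit(node) for node in graph)


@dataclass
class Candidate:
    reasoning: str
    edges: list[tuple[str, str]]


def select_analysis(candidates, evidence: str, telemetry: str):
    """Return ("live", candidate) for the best passing candidate, else ("fallback", None)."""
    best, best_score = None, -1.0
    for cand in candidates:
        ev = overlap(cand.reasoning, evidence)
        tel = overlap(cand.reasoning, telemetry)
        if ev < MIN_EVIDENCE_OVERLAP or tel < MIN_TELEMETRY_OVERLAP:
            continue  # not grounded enough in evidence or telemetry
        if not is_dag(cand.edges):
            continue  # causal graph is empty or contains a cycle
        if ev + tel > best_score:
            best, best_score = cand, ev + tel
    return ("live", best) if best is not None else ("fallback", None)
```

When every candidate is rejected, the caller would return the deterministic grounded fallback instead of a live analysis.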
All charts below are generated from project artifacts, not hand-drawn placeholders.
SRE-Nidaan/
├── backend/ # FastAPI body service
│ └── main.py
├── frontend/ # Next.js face service
│ ├── src/app/page.tsx
│ └── src/components/CausalGraph.tsx
├── src/ # Training + runtime libraries
│ ├── data/
│ ├── training/
│ ├── evaluation/
│ ├── runtime/
│ └── utils/
├── scripts/ # Pipeline and utility scripts
│ ├── 01_generate_dataset.py
│ ├── 02_run_sft.py
│ ├── 03_train_reward_model.py
│ ├── 04_run_rlhf.py
│ ├── 05_run_evaluation.py
│ ├── 06_select_best_sft_checkpoint.py
│ ├── 07_generate_structured_response.py
│ ├── 08_prepare_production_adapter.py
│ ├── record_site_demo.py
│ ├── add_demo_voiceover.py
│ └── generate_readme_charts.py
├── data/sre_nidaan_dataset.json
├── results/final_evaluation_report.json
├── ops/knowledge_base.json
├── inference_server.py # Brain service
├── docker-compose.yml
└── deploy/gcp/ # Cloud Run deployment assets
- Python 3.9+
- Node.js 18+
- Docker (optional, recommended for full stack)
- NVIDIA GPU (required for Brain inference/training workloads)
Install dependencies:
git clone https://github.com/RitwijParmar/SRE-Nidaan.git
cd SRE-Nidaan
pip install -r requirements.txt

Prepare the production adapter:
export HF_TOKEN="your_hf_token"
python scripts/08_prepare_production_adapter.py

Start the Brain:
export MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
export NEXUS_LORA_PATH="$(pwd)/results/production_adapter"
python inference_server.py

Start the Body:
export VLLM_ENDPOINT="http://localhost:8000/v1"
export MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
export PRODUCTION_ARTIFACT_LABEL="checkpoint-1064"
uvicorn backend.main:app --host 0.0.0.0 --port 8001 --reload

Start the Face:
cd frontend
npm install
NEXT_PUBLIC_API_URL=http://localhost:8001 npm run dev

Alternatively, run the full stack with Docker Compose:
export HF_TOKEN="your_hf_token"
export MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
python scripts/08_prepare_production_adapter.py
docker-compose up --build -d

Default ports:
- Face: 3000
- Body: 8001
- Brain: 8000
Use the deployment script:
export PROJECT_ID="your-gcp-project"
export REGION="us-east4"
export HF_TOKEN="your_hf_token"
bash deploy/gcp/deploy_cloud_run.sh

What it does:
- builds/pushes Face, Body, Brain images via Cloud Build
- deploys Brain on GPU (nvidia-l4)
- deploys Body and wires it to Brain /v1
- deploys Face and wires it to Body URL
- GET /health
- GET /api/integration-check
- GET /api/telemetry
- POST /api/analyze-incident
- POST /api/interventions/authorize
- POST /api/analysis-feedback
- GET /api/mcp/tools
- POST /api/mcp/call
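
For callers that prefer Python over the command line, a request to POST /api/analyze-incident might be built like this. It is a minimal sketch using only the standard library; `build_analyze_request` and `analyze_incident` are hypothetical helper names, and because the response schema is not documented here, the body is returned as a plain decoded JSON value.

```python
import json
import urllib.request


def build_analyze_request(base_url, tenant_id, incident_summary, candidate_count=3):
    """Build (but do not send) a POST request for /api/analyze-incident."""
    payload = {
        "incident_summary": incident_summary,
        "candidate_count": candidate_count,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/analyze-incident",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-tenant-id": tenant_id},
        method="POST",
    )


def analyze_incident(base_url, tenant_id, incident_summary, timeout=60.0):
    """Send the request and decode the JSON body; raises on HTTP or network errors."""
    request = build_analyze_request(base_url, tenant_id, incident_summary)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a local Body running on port 8001, `analyze_incident("http://localhost:8001", "demo-tenant", "Auth p95 latency rose ...")` would issue the call.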
curl -X POST "http://localhost:8001/api/analyze-incident" \
-H "Content-Type: application/json" \
-H "x-tenant-id: demo-tenant" \
-d '{
"incident_summary": "Auth p95 latency rose from 210ms to 1.8s and DB connections reached 99%.",
"candidate_count": 3
}'

The full model workflow follows:
- SFT (QLoRA) on causal SRE examples
- Reward Modeling for preference signal
- RLHF for policy refinement
Run sequence:
python scripts/01_generate_dataset.py
python scripts/02_run_sft.py
python scripts/03_train_reward_model.py
python scripts/04_run_rlhf.py
python scripts/05_run_evaluation.py

The production serving strategy in this repo prioritizes the stable SFT path (checkpoint-1064) with the verifier and safety plane, while RLHF remains an optional research track.
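
To run the training sequence above unattended and stop at the first failing stage, a small driver along these lines could be used. It is a sketch only: `build_commands` and `run_pipeline` are hypothetical helpers, and it assumes each script is invoked without arguments from the repository root, as in the run sequence.

```python
import subprocess
import sys

# Script order taken from the run sequence in this README.
PIPELINE = [
    "scripts/01_generate_dataset.py",
    "scripts/02_run_sft.py",
    "scripts/03_train_reward_model.py",
    "scripts/04_run_rlhf.py",
    "scripts/05_run_evaluation.py",
]


def build_commands(scripts=PIPELINE, python=sys.executable):
    """Return one argv list per pipeline stage."""
    return [[python, script] for script in scripts]


def run_pipeline(commands):
    """Run stages in order; a non-zero exit raises CalledProcessError and halts."""
    for command in commands:
        print("running:", " ".join(command))
        subprocess.run(command, check=True)
```

Calling `run_pipeline(build_commands())` executes all five stages; passing a shorter list to `build_commands` skips the optional RLHF stage.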
- tenant header enforcement (x-tenant-id) for API calls
- optional API key auth middleware (x-api-key) with env toggles
- human-in-the-loop intervention authorization endpoint
- analysis persistence and audit trail in SQLite (feedback/analyst_feedback.db)
- deterministic fallback generation when live inference is unavailable or low confidence
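
The tenant and API key checks listed above could look roughly like the following framework-independent function. This is an illustrative sketch rather than the Body's middleware: `check_request_headers` is a hypothetical name, and the 400/401 status codes and fail-closed behavior are assumptions.

```python
import hmac
from typing import Mapping, Optional, Tuple


def check_request_headers(
    headers: Mapping[str, str],
    require_tenant_id: bool = True,
    require_api_auth: bool = False,
    api_auth_token: Optional[str] = None,
) -> Tuple[int, str]:
    """Return (status_code, reason); 200 means the request may proceed."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {key.lower(): value for key, value in headers.items()}

    if require_tenant_id and not normalized.get("x-tenant-id", "").strip():
        return 400, "missing x-tenant-id header"

    if require_api_auth:
        supplied = normalized.get("x-api-key", "")
        # Fail closed when auth is required but no token is configured.
        if not api_auth_token:
            return 401, "invalid or missing x-api-key"
        # Constant-time comparison avoids leaking the token via timing.
        if not hmac.compare_digest(supplied.encode(), api_auth_token.encode()):
            return 401, "invalid or missing x-api-key"

    return 200, "ok"
```

The two boolean parameters are meant to mirror the REQUIRE_TENANT_ID and REQUIRE_API_AUTH toggles, with API_AUTH_TOKEN supplying the expected key.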
Body:
- VLLM_ENDPOINT
- MODEL_ID
- PRODUCTION_ARTIFACT_LABEL
- GROUNDING_KB_PATH
- GENERATION_CANDIDATES
- GENERATION_MAX_TOKENS
- LIVE_ANALYSIS_TIMEOUT_SECONDS
- REQUIRE_API_AUTH
- API_AUTH_TOKEN
- REQUIRE_TENANT_ID
- ALLOWED_ORIGINS
- FEEDBACK_LOG_PATH
- FEEDBACK_DB_PATH

Brain:
- MODEL_ID
- NEXUS_LORA_PATH
- SERVING_BACKEND
- MAX_LORA_RANK
- MAX_MODEL_LEN
- GPU_MEMORY_UTILIZATION
- PORT
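
One way to read a subset of these variables into a typed settings object is sketched below. `BodySettings` and `load_body_settings` are hypothetical names. The string defaults are the values shown elsewhere in this README; the numeric and boolean defaults are assumptions, not the service's documented behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class BodySettings:
    vllm_endpoint: str
    model_id: str
    production_artifact_label: str
    grounding_kb_path: str
    generation_candidates: int
    require_api_auth: bool
    require_tenant_id: bool


def load_body_settings(env: Mapping[str, str] = os.environ) -> BodySettings:
    """Read Body settings from an environment mapping, applying assumed defaults."""
    return BodySettings(
        vllm_endpoint=env.get("VLLM_ENDPOINT", "http://localhost:8000/v1"),
        model_id=env.get("MODEL_ID", "meta-llama/Meta-Llama-3-8B-Instruct"),
        production_artifact_label=env.get("PRODUCTION_ARTIFACT_LABEL", "checkpoint-1064"),
        grounding_kb_path=env.get("GROUNDING_KB_PATH", "ops/knowledge_base.json"),
        # Assumed default of 3 candidates, matching the example request.
        generation_candidates=int(env.get("GENERATION_CANDIDATES", "3")),
        # Assumed defaults: auth off, tenant header required.
        require_api_auth=_as_bool(env.get("REQUIRE_API_AUTH", "false")),
        require_tenant_id=_as_bool(env.get("REQUIRE_TENANT_ID", "true")),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests without touching the process environment.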
If dataset/evaluation files change, regenerate charts:
python scripts/generate_readme_charts.py

Generated outputs:
- assets/readme/architecture_split_compute.png
- assets/readme/dataset_domain_distribution.png
- assets/readme/dataset_pearl_level_mix.png
- assets/readme/evaluation_category_scores.png
- assets/readme/evaluation_domain_scores.png
- assets/readme/evaluation_quality_signals.png
- assets/readme/training_runtime_profile.png
- telemetry source defaults to static snapshot when live source is unavailable
- no full enterprise RBAC/SSO flow yet
- evaluation outcomes are highly sensitive to strict schema settings and prompt strategy
- RLHF path can underperform without careful reward-data alignment and checkpoint selection
Apache 2.0






