Failure-mode repository, event triage engine, and shift handoff workflow for autonomous delivery robot operations.
Autonomy Fleet Triage converts raw robot events into operator-ready mitigation guidance and engineering-ready case records. It keeps failure modes as data, scores incoming events against known symptoms and subsystems, promotes safety-sensitive incidents, assigns SLA targets, and renders a static shift dashboard for operations review.
- SQLite-backed failure-mode repository with owners, runbook links, symptoms, likely causes, and mitigation steps
- JSONL robot event ingestion for shift replay or operational logs
- Transparent triage engine using symptom overlap, subsystem match, free-text evidence, and severity promotion rules
- SLA assignment for low, medium, high, and critical cases
- CLI for initializing the repository, ingesting events, triaging a full shift, triaging a single live event, and rendering a dashboard
- Static HTML operations dashboard for shift review and cross-functional handoff
- Idempotent case upserts so rerunning a shift replay refreshes existing cases instead of creating duplicates
- Tests for repository round trips, triage matching, escalation routing, and safety-sensitive battery handling
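
The transparent scoring idea in the list above can be sketched as follows. This is a minimal illustration, not the code in triage.py: the `FailureMode` fields, the `score_event` helper, and the 0.6 / 0.3 / 0.1 weights are all assumptions chosen for readability.

```python
from dataclasses import dataclass, field


@dataclass
class FailureMode:
    """Hypothetical, simplified view of one failure-mode record."""
    name: str
    subsystem: str
    symptoms: frozenset
    keywords: frozenset = field(default_factory=frozenset)


def score_event(mode, subsystem, symptoms, notes):
    """Score one event against one failure mode and collect evidence.

    Returns (score, evidence), where score is in [0, 1] and evidence is a
    list of human-readable strings explaining each contribution.
    """
    evidence = []

    # Symptom overlap: fraction of the mode's known symptoms seen in the event.
    overlap = mode.symptoms & set(symptoms)
    symptom_score = len(overlap) / len(mode.symptoms) if mode.symptoms else 0.0
    if overlap:
        evidence.append(f"symptoms matched: {sorted(overlap)}")

    # Subsystem match: exact match on the reported subsystem.
    subsystem_score = 1.0 if subsystem == mode.subsystem else 0.0
    if subsystem_score:
        evidence.append(f"subsystem matched: {subsystem}")

    # Free-text evidence: fraction of mode keywords found in the operator notes.
    words = set(notes.lower().split())
    hits = mode.keywords & words
    text_score = len(hits) / len(mode.keywords) if mode.keywords else 0.0
    if hits:
        evidence.append(f"keywords in notes: {sorted(hits)}")

    # Assumed weights; a real deployment would tune these.
    score = 0.6 * symptom_score + 0.3 * subsystem_score + 0.1 * text_score
    return round(score, 2), evidence
```

Returning the evidence list alongside the score is what keeps the result traceable: an operator can see which symptoms, subsystem, and note keywords drove a match.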
- Initialize the SQLite repository with reviewed failure modes and runbook links.
- Ingest JSONL robot events from a shift replay or live operations export.
- Score each event against known failure modes using subsystem, symptom, and text evidence.
- Promote severity for blocked missions, unsafe stops, weak network signals, and low battery conditions.
- Store idempotent case records with escalation team, SLA, evidence, and recommended mitigation.
- Render a dashboard that gives Operations and Engineering the same case context.
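
The idempotent storage step can be sketched with SQLite's upsert syntax. This is an illustration under stated assumptions: the `cases` table, its columns, and the `upsert_case` helper are hypothetical and may not match the schema in repository.py. `ON CONFLICT ... DO UPDATE` requires SQLite 3.24 or newer.

```python
import sqlite3

# Hypothetical schema: one case per event, keyed by a unique event_id.
SCHEMA = """
CREATE TABLE IF NOT EXISTS cases (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    event_id TEXT NOT NULL UNIQUE,
    severity TEXT NOT NULL,
    score REAL NOT NULL,
    escalation_team TEXT NOT NULL
)
"""

# Insert a new case, or refresh the existing row when event_id already exists.
UPSERT = """
INSERT INTO cases (event_id, severity, score, escalation_team)
VALUES (:event_id, :severity, :score, :escalation_team)
ON CONFLICT(event_id) DO UPDATE SET
    severity = excluded.severity,
    score = excluded.score,
    escalation_team = excluded.escalation_team
"""


def upsert_case(conn, case):
    """Write one case dict; rerunning with the same event_id updates in place."""
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT, case)


def count_cases(conn):
    """Return the number of stored cases."""
    return conn.execute("SELECT COUNT(*) FROM cases").fetchone()[0]
```

Keying the upsert on the event ID means a second replay of the same shift changes row contents but not row count.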
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -e ".[dev]"
autonomy-triage init
autonomy-triage ingest
autonomy-triage triage
autonomy-triage dashboard
python -m pytest -q

Open reports/dashboard.html after running the dashboard command.
autonomy-triage triage-one \
--event-id EVT-LIVE-001 \
--robot-id DL-104 \
--timestamp 2026-05-08T22:41:00+00:00 \
--location "San Francisco / Mission Bay" \
--subsystem connectivity \
--symptoms telemetry_gap,low_rssi,command_timeout \
--notes "Video froze during handoff and operator saw delayed command acknowledgment" \
--network-rssi -87 \
--battery-pct 44 \
--mission-state blocked

src/autonomy_fleet_triage/
  cli.py          command-line workflow
  dashboard.py    static HTML dashboard renderer
  models.py       dataclasses and severity model
  repository.py   SQLite schema, seed, ingest, case storage
  triage.py       matching, severity promotion, SLA logic
data/
  seed_failure_modes.json
  sample_events.jsonl
docs/
  architecture.md
  runbooks.md
tests/
- Explainable triage first: live support needs traceable evidence more than opaque automation.
- Failure modes as data: Operations and Engineering can review the repository without editing Python code.
- Safety-aware escalation: blocked missions, unsafe stops, low battery, and weak network signals promote severity.
- Local-first architecture: SQLite plus static HTML keeps the workflow reproducible and easy to run during interviews.
- Automation-ready output: every triage case includes severity, SLA, evidence, escalation team, and recommended actions.
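
Safety-aware escalation and SLA assignment might look like the sketch below. Every threshold and SLA value here is a placeholder assumption, as are the `unsafe_stop` state name and the one-level-per-rule promotion policy; the actual rules live in triage.py.

```python
SEVERITY_ORDER = ["low", "medium", "high", "critical"]

# Hypothetical SLA targets in minutes, one per severity level.
SLA_MINUTES = {"low": 480, "medium": 120, "high": 30, "critical": 10}

# Hypothetical thresholds for the promotion rules.
WEAK_RSSI_DBM = -85
LOW_BATTERY_PCT = 20


def bump(severity):
    """Raise severity by one level, capped at critical."""
    index = SEVERITY_ORDER.index(severity)
    return SEVERITY_ORDER[min(index + 1, len(SEVERITY_ORDER) - 1)]


def promote_severity(base, mission_state, network_rssi=None, battery_pct=None):
    """Apply each triggered safety rule once and record why it fired."""
    severity = base
    reasons = []

    # Blocked missions and unsafe stops ("unsafe_stop" is an assumed value).
    if mission_state in ("blocked", "unsafe_stop"):
        severity = bump(severity)
        reasons.append(f"mission state is {mission_state}")

    # Weak network signal.
    if network_rssi is not None and network_rssi <= WEAK_RSSI_DBM:
        severity = bump(severity)
        reasons.append(f"RSSI {network_rssi} dBm at or below {WEAK_RSSI_DBM}")

    # Low battery.
    if battery_pct is not None and battery_pct <= LOW_BATTERY_PCT:
        severity = bump(severity)
        reasons.append(f"battery {battery_pct}% at or below {LOW_BATTERY_PCT}%")

    return severity, reasons
```

The final severity then indexes `SLA_MINUTES`, and the `reasons` list can be stored with the case as promotion evidence.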
case=1 event=EVT-1001 robot=DL-042 severity=high score=0.99 mode=Localization drift near dense curbside pickup
case=2 event=EVT-1002 robot=DL-017 severity=high score=0.89 mode=Network handoff degradation during live mission
case=3 event=EVT-1003 robot=DL-088 severity=high score=0.99 mode=Planner regression after autonomy software release
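
Output lines in this shape can be consumed by other scripts. The parser below is a hypothetical helper, not part of the CLI, and assumes the field order shown above with `mode` as the final field, since its value contains spaces.

```python
import re

# Fields appear in a fixed order; mode runs to the end of the line.
LINE_RE = re.compile(
    r"case=(?P<case>\d+) event=(?P<event>\S+) robot=(?P<robot>\S+) "
    r"severity=(?P<severity>\S+) score=(?P<score>[\d.]+) mode=(?P<mode>.+)"
)


def parse_case_line(line):
    """Parse one triage output line into a dict with typed case and score."""
    match = LINE_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized triage line: {line!r}")
    record = match.groupdict()
    record["case"] = int(record["case"])
    record["score"] = float(record["score"])
    return record
```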
- Initial debugging and troubleshooting through triage-one.
- Failure-mode repository ownership through seed_failure_modes.json.
- Cross-functional escalation to Robot Operations, Hardware Operations, Platform Engineering, or Autonomy Software.
- Self-service documentation through runbook links and case evidence.
- Workflow improvement through stored case history and recurring pattern review.