amlfarhad/autonomy-fleet-triage

Autonomy Fleet Triage

Failure-mode repository, event triage engine, and shift handoff workflow for autonomous delivery robot operations.

Autonomy Fleet Triage converts raw robot events into operator-ready mitigation guidance and engineering-ready case records. It keeps failure modes as data, scores incoming events against known symptoms and subsystems, promotes safety-sensitive incidents, assigns SLA targets, and renders a static shift dashboard for operations review.

Features

  • SQLite-backed failure-mode repository with owners, runbook links, symptoms, likely causes, and mitigation steps
  • JSONL robot event ingestion for shift replay or operational logs
  • Transparent triage engine using symptom overlap, subsystem match, free-text evidence, and severity promotion rules
  • SLA assignment for low, medium, high, and critical cases
  • CLI for initializing the repository, ingesting events, triaging a full shift, triaging a single live event, and rendering a dashboard
  • Static HTML operations dashboard for shift review and cross-functional handoff
  • Idempotent case upserts so rerunning a shift replay refreshes existing cases instead of creating duplicates
  • Tests for repository round trips, triage matching, escalation routing, and safety-sensitive battery handling
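The idempotent upsert behaviour listed above can be illustrated with a small sketch. This is not the project's actual schema: the `cases` table, its columns, and the `upsert_case` helper are hypothetical names chosen for illustration, and the sketch assumes each case is keyed by a unique event ID. It relies on SQLite's `ON CONFLICT ... DO UPDATE` clause (SQLite 3.24 or newer).

```python
import sqlite3

def upsert_case(conn, event_id, severity, score, mode):
    """Insert a case, or refresh the existing row for the same event."""
    conn.execute(
        """
        INSERT INTO cases (event_id, severity, score, mode)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(event_id) DO UPDATE SET
            severity = excluded.severity,
            score = excluded.score,
            mode = excluded.mode
        """,
        (event_id, severity, score, mode),
    )

# Hypothetical schema: the UNIQUE constraint on event_id is what makes
# a replayed shift update rows instead of duplicating them.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cases ("
    "id INTEGER PRIMARY KEY, "
    "event_id TEXT UNIQUE NOT NULL, "
    "severity TEXT NOT NULL, "
    "score REAL NOT NULL, "
    "mode TEXT NOT NULL)"
)

# Triage the same event twice, as a rerun of a shift replay would.
upsert_case(conn, "EVT-1001", "medium", 0.80, "Localization drift")
upsert_case(conn, "EVT-1001", "high", 0.99, "Localization drift")

rows = conn.execute("SELECT event_id, severity, score FROM cases").fetchall()
```

After both calls the table holds a single row for `EVT-1001` carrying the values from the second call.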

Operational Workflow

  1. Initialize the SQLite repository with reviewed failure modes and runbook links.
  2. Ingest JSONL robot events from a shift replay or live operations export.
  3. Score each event against known failure modes using subsystem, symptom, and text evidence.
  4. Promote severity for blocked missions, unsafe stops, weak network signals, and low battery conditions.
  5. Store idempotent case records with escalation team, SLA, evidence, and recommended mitigation.
  6. Render a dashboard that gives Operations and Engineering the same case context.
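Steps 3 and 4 above might look roughly like the following sketch. The weights (0.6, 0.3, 0.1), the RSSI and battery thresholds, the `unsafe_stop` state name, the one-level promotion rule, and all names (`FailureMode`, `score_event`, `promote`) are assumptions made for illustration; the real logic lives in `triage.py` and may differ.

```python
from dataclasses import dataclass, field

SEVERITIES = ["low", "medium", "high", "critical"]

@dataclass
class FailureMode:
    name: str
    subsystem: str
    symptoms: set[str]
    keywords: set[str] = field(default_factory=set)
    base_severity: str = "medium"

def score_event(event: dict, mode: FailureMode) -> float:
    """Combine symptom overlap, subsystem match, and free-text evidence."""
    event_symptoms = set(event.get("symptoms", []))
    overlap = (
        len(event_symptoms & mode.symptoms) / len(mode.symptoms)
        if mode.symptoms else 0.0
    )
    subsystem = 1.0 if event.get("subsystem") == mode.subsystem else 0.0
    notes = event.get("notes", "").lower()
    text = 1.0 if any(k in notes for k in mode.keywords) else 0.0
    # Assumed weights; they only need to sum to 1.0 for a 0..1 score.
    return round(0.6 * overlap + 0.3 * subsystem + 0.1 * text, 2)

def promote(severity: str, event: dict) -> str:
    """Raise severity by one level if any safety trigger fires (assumed rule)."""
    rssi = event.get("network_rssi")
    battery = event.get("battery_pct")
    triggered = (
        event.get("mission_state") in {"blocked", "unsafe_stop"}
        or (rssi is not None and rssi <= -85)        # assumed threshold
        or (battery is not None and battery <= 15)   # assumed threshold
    )
    level = SEVERITIES.index(severity) + (1 if triggered else 0)
    return SEVERITIES[min(level, len(SEVERITIES) - 1)]

mode = FailureMode(
    name="Network handoff degradation during live mission",
    subsystem="connectivity",
    symptoms={"telemetry_gap", "low_rssi", "command_timeout"},
    keywords={"handoff"},
)
event = {
    "subsystem": "connectivity",
    "symptoms": ["telemetry_gap", "low_rssi", "command_timeout"],
    "notes": "Video froze during handoff",
    "network_rssi": -87,
    "battery_pct": 44,
    "mission_state": "blocked",
}
score = score_event(event, mode)
severity = promote(mode.base_severity, event)
```

Keeping each score component as a separate term is one way to keep the result traceable: an operator can see which piece of evidence contributed to the match.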

Quick Start

python3 -m venv .venv
source .venv/bin/activate
python -m pip install -e ".[dev]"

autonomy-triage init
autonomy-triage ingest
autonomy-triage triage
autonomy-triage dashboard
python -m pytest -q

Open reports/dashboard.html after running the dashboard command.

Live Event Example

autonomy-triage triage-one \
  --event-id EVT-LIVE-001 \
  --robot-id DL-104 \
  --timestamp 2026-05-08T22:41:00+00:00 \
  --location "San Francisco / Mission Bay" \
  --subsystem connectivity \
  --symptoms telemetry_gap,low_rssi,command_timeout \
  --notes "Video froze during handoff and operator saw delayed command acknowledgment" \
  --network-rssi -87 \
  --battery-pct 44 \
  --mission-state blocked
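For batch ingestion, the same event could plausibly be written as one JSONL line. The sketch below assumes the JSON keys mirror the CLI flag names with underscores; that mapping and the `parse_event` helper are hypothetical, so check `data/sample_events.jsonl` for the actual field names.

```python
import json
from datetime import datetime

# Hypothetical JSONL record; keys are assumed to mirror the CLI flags.
line = json.dumps({
    "event_id": "EVT-LIVE-001",
    "robot_id": "DL-104",
    "timestamp": "2026-05-08T22:41:00+00:00",
    "location": "San Francisco / Mission Bay",
    "subsystem": "connectivity",
    "symptoms": "telemetry_gap,low_rssi,command_timeout",
    "network_rssi": -87,
    "battery_pct": 44,
    "mission_state": "blocked",
})

def parse_event(raw_line: str) -> dict:
    """Parse one JSONL line and normalise the timestamp and symptom list."""
    event = json.loads(raw_line)
    event["timestamp"] = datetime.fromisoformat(event["timestamp"])
    # Accept either a comma-separated string (as on the CLI) or a list.
    if isinstance(event.get("symptoms"), str):
        event["symptoms"] = [
            s.strip() for s in event["symptoms"].split(",") if s.strip()
        ]
    return event

event = parse_event(line)
```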

Project Structure

src/autonomy_fleet_triage/
  cli.py          command-line workflow
  dashboard.py    static HTML dashboard renderer
  models.py       dataclasses and severity model
  repository.py   SQLite schema, seed, ingest, case storage
  triage.py       matching, severity promotion, SLA logic
data/
  seed_failure_modes.json
  sample_events.jsonl
docs/
  architecture.md
  runbooks.md
tests/

Architecture Decisions

  • Explainable triage first: live support needs traceable evidence more than opaque automation.
  • Failure modes as data: Operations and Engineering can review the repository without editing Python code.
  • Safety-aware escalation: blocked missions, unsafe stops, low battery, and weak network signals promote severity.
  • Local-first architecture: SQLite plus static HTML keeps the workflow reproducible and easy to run during interviews.
  • Automation-ready output: every triage case includes severity, SLA, evidence, escalation team, and recommended actions.
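As an illustration of the SLA field in that output, the sketch below maps each severity level to a response window. The minute values are placeholders invented for this example, not the project's configured targets, and `SLA_MINUTES` and `sla_due` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Placeholder response windows in minutes; the real targets are defined
# by the project and may differ.
SLA_MINUTES = {"critical": 15, "high": 60, "medium": 240, "low": 1440}

def sla_due(severity: str, opened_at: datetime) -> datetime:
    """Return the time by which a case of this severity should be handled."""
    return opened_at + timedelta(minutes=SLA_MINUTES[severity])

opened = datetime(2026, 5, 8, 22, 41, tzinfo=timezone.utc)
due = sla_due("high", opened)
```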

Example Output

case=1 event=EVT-1001 robot=DL-042 severity=high score=0.99 mode=Localization drift near dense curbside pickup
case=2 event=EVT-1002 robot=DL-017 severity=high score=0.89 mode=Network handoff degradation during live mission
case=3 event=EVT-1003 robot=DL-088 severity=high score=0.99 mode=Planner regression after autonomy software release
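Downstream tooling may want these lines as structured records. A minimal parser, assuming the `key=value` layout shown above with `mode` always last (since it contains spaces), could look like this; `parse_case_line` is a hypothetical helper, not part of the CLI.

```python
import re

# The mode field is free text with spaces, so it captures the rest of the line.
LINE_RE = re.compile(
    r"case=(?P<case>\d+) event=(?P<event>\S+) robot=(?P<robot>\S+) "
    r"severity=(?P<severity>\S+) score=(?P<score>[\d.]+) mode=(?P<mode>.+)"
)

def parse_case_line(line: str) -> dict:
    """Turn one triage output line into a dict with typed case and score."""
    match = LINE_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognised triage line: {line!r}")
    record = match.groupdict()
    record["case"] = int(record["case"])
    record["score"] = float(record["score"])
    return record

record = parse_case_line(
    "case=1 event=EVT-1001 robot=DL-042 severity=high score=0.99 "
    "mode=Localization drift near dense curbside pickup"
)
```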

Support Surface

  • Initial debugging and troubleshooting through triage-one.
  • Failure-mode repository ownership through seed_failure_modes.json.
  • Cross-functional escalation to Robot Operations, Hardware Operations, Platform Engineering, or Autonomy Software.
  • Self-service documentation through runbook links and case evidence.
  • Workflow improvement through stored case history and recurring pattern review.
