Xander is a local Kubernetes telemetry demo for spotting pod resource pressure and hidden dependencies. It collects container metrics from a local k3s cluster, stores them in SQLite, builds rolling aggregates, and exposes a small HTTP API and a Streamlit UI for quick investigation.
telemetry-collector/— DaemonSet collector for pod discovery, cgroup metrics, events, and node-local SQLite storage.context-engine/— Runs as a node-local context-engine container or local CLI, reads collector metrics once per cycle, builds rolling aggregates, evaluates rule-engine findings, and produces context JSON.telemetry-api/— Small read-only API over the collector metrics database.agent/— Python analysis agent that loads context files and produces analysis reports (supportsanalyze,daemon,watch).streamlit_app.py— Lightweight interactive UI for exploring metrics and aggregates.
- Go 1.21+
- Python 3.8+
- Docker, used by
k3dand for local collector image builds kubectlk3dfor the local k3s clustersqlite3(for manual inspection)
The easiest way to get a working local environment is the root Makefile:
cd /path/to/xander
make upThis installs Python deps into a virtualenv, downloads Go modules, creates or reuses a single-node k3d cluster, deploys the noisy-neighbor scenario, and deploys the collector with the context-engine container.
Typical next steps:
source .venv/bin/activate
streamlit run streamlit_app.pyOpen the Streamlit UI and, in the sidebar, point it at /tmp/collector-metrics.db (or the path you copied from the collector pod).
To sanity-check that the scenario is producing pod metrics:
make verify-scenarioThese sections describe how to run each component individually for development and testing.
- Collector and node-local context service (k8s): Build and deploy the collector plus context-engine container images into your local cluster.
cd telemetry-collector
make docker-build-all
make deploy-k3- Copy the metrics DB from the collector pod for local use:
POD_NAME=$(kubectl get pods -n telemetry-system -l app=telemetry-collector -o jsonpath='{.items[0].metadata.name}')
kubectl cp -c collector telemetry-system/$POD_NAME:/data/metrics.db ./metrics.db
sqlite3 ./metrics.db "SELECT COUNT(*) FROM metrics;"- Aggregates and context:
cd context-engine
make aggregates DB=../telemetry-collector/metrics.db # aggregate JSON only
make findings DB=../telemetry-collector/metrics.db # aggregate JSON + separate rule findings JSON
make run DB=../telemetry-collector/metrics.db # aggregates + contextThe context engine reads ../telemetry-collector/metrics.db once per run. That single raw dataset feeds both rolling aggregation and the isolated rule engine. Rule findings are written as a separate node-local findings_*.json artifact and are not mixed into the context output yet.
- Node-local findings:
make sync-db
make findings DB=telemetry-collector/metrics.dbThis is the current Option A architecture: each node collector owns that node's raw metrics, the node-local analyzer produces compact findings, and a later cluster-level API/UI can collect findings without moving raw metrics between nodes.
- Node-local context service:
make sync-db
make context-service DB=telemetry-collector/metrics.dbThe service wakes on an interval, reads recent raw SQLite samples once, builds aggregates, evaluates rules, and persists timestamped JSON files under context-engine/service-output/. It also maintains *_latest.json copies for easy ingestion.
Inside Kubernetes, the same service runs as the context-engine container in each collector pod. The collector writes /data/metrics.db; the context-engine container reads that same node-local DB and writes /data/context-engine/{aggregates,findings,context} plus /data/context-engine/results.db.
When rule-based detection finds a problem, context-engine writes a notification under /data/agent/inbox/. The agent can also query rolling metrics through the persisted SQLite DB instead of reading raw collector data directly.
- Telemetry API:
cd telemetry-api
make run DB=../telemetry-collector/metrics.dbEndpoints include GET /cluster-summary, GET /top-risk, and GET /incidents.
- Agent (analysis CLI):
The agent/ package includes a small CLI entrypoint at agent/main.py.
cd agent
source ../.venv/bin/activate # or use your Python env
# to test/analyze the agent working
python main.py analyze --context-file /path/to/context_123.json
# or analyze latest context in the configured directory
python main.py analyze --latest
# Run as a daemon monitoring for new context files (writes analyses to `analyses/` next to contexts)
python main.py daemon --poll-interval 60
# Query rolling metrics persisted by context-engine
python main.py query-metrics --recent-findings
python main.py query-metrics --pod default/pod-nameThe analyze command supports --output-format (markdown or json) and --output-file.
Run the dashboard from the repo root (after activating the .venv created by start-proj.sh):
source .venv/bin/activate
streamlit run streamlit_app.pyThe sidebar lets you point the UI at a local metrics.db file and toggle live charting.
- If Docker is not reachable, ensure it is running and that your user has permission to access the daemon.
- If
k3dis missing, install it first and rerunmake up. - The collector stores the live DB inside the pod at
/data/metrics.dbwhen deployed with the context-engine container. Older local runs may still use/tmp/metrics.db. make sync-dbmerges collector DBs if a multi-node cluster already exists.
Contributions are welcome.
For development, use make up, make down, and make clean from the repo root.