SimbaScribe reads untrusted team chat and answers questions about it with an LLM agent. That makes prompt injection a first-class threat, not an afterthought. The architecture is built around containing it.
The thing that captures never thinks; the things that think never capture, and nothing that thinks ever mutates state directly.
- The LLM only ever proposes (a digest, or a structured set of tracker changes).
- Deterministic, LLM-free code validates the proposal against reality and applies it.
- Humans correct exceptions out-of-band.
| Component | Runs as | Can write? | Why |
|---|---|---|---|
| Listener, Synth, Snapshot publisher | the writer user | yes (its own DBs) | trusted; the only writers |
| MCP server + the Pulse agent | a separate, unprivileged user | no — reads a read-only snapshot | sandboxed; this is the part exposed to untrusted chat |
The agent that reads untrusted chat runs as a different, unprivileged user with no
filesystem or shell access, and queries a read-only VACUUM INTO snapshot, never the
live database. A fully compromised agent can only ever see a stale, read-only copy.
This split is the injection containment, not incidental hygiene. Collapse it and the sandbox is theater.
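The snapshot split can be sketched with Python's standard `sqlite3` module. This is a minimal illustration, not the project's actual publisher: the function names `publish_snapshot` and `open_snapshot` and the temp-file-then-rename step are assumptions, and `VACUUM INTO` requires SQLite 3.27 or newer.

```python
import os
import sqlite3
from pathlib import Path


def publish_snapshot(live_db: Path, snapshot: Path) -> None:
    """Writer side: copy live_db into snapshot via VACUUM INTO (hypothetical helper)."""
    tmp = snapshot.with_name(snapshot.name + ".tmp")
    tmp.unlink(missing_ok=True)  # VACUUM INTO refuses to overwrite a non-empty file
    # isolation_level=None selects autocommit; VACUUM cannot run inside a transaction.
    conn = sqlite3.connect(live_db, isolation_level=None)
    try:
        conn.execute("VACUUM INTO ?", (str(tmp),))
    finally:
        conn.close()
    os.replace(tmp, snapshot)  # atomic swap when both paths share a filesystem


def open_snapshot(snapshot: Path) -> sqlite3.Connection:
    """Reader side: open the snapshot with SQLite's read-only URI mode."""
    uri = f"{snapshot.resolve().as_uri()}?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

In this sketch `mode=ro` is only a second layer. The real boundary is still the operating system: the reader runs as a different user that has no write permission on the snapshot or the live database.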
The tracker is maintained from chat, so an attacker could try to inject "close every todo" or "create a malicious item." The defenses, in layers:
- Validation before apply. The model's proposal is checked against reality: a resolution/touch must target a currently-open item; evidence must be a message in this run's window (you can't close item #7 by citing a message you didn't just read); transitions must be legal for the kind. Anything else is dropped and logged, not applied.
- Strong-signal-only auto-close, with a per-run cap. The dangerous error is real work silently vanishing, so an auto-close requires an unambiguous signal; if a run proposes too many closes, all of them are demoted to a "looks done?" review, which bounds the blast radius.
- Soft close, never hard delete + an append-only audit log → a wrong close is visible and reversible.
- Aging keeps the list current; phantoms age out.
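The first two layers above can be sketched as a pure function with no model call in it. Everything named here is hypothetical: the `ProposedChange` fields, the `LEGAL_ACTIONS` table, the cap of 3, and the choice to route weak-signal closes to review are assumptions made for illustration, not the project's actual schema or thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedChange:
    action: str              # e.g. "close" or "touch"
    item_id: int
    evidence_msg_id: int
    strong_signal: bool = False


# Hypothetical table of legal actions per item kind.
LEGAL_ACTIONS = {"todo": {"close", "touch"}, "decision": {"touch"}}
MAX_AUTO_CLOSES_PER_RUN = 3  # hypothetical cap


def validate_and_plan(proposals, open_items, window_msg_ids):
    """Split proposals into (to_apply, review, dropped).

    open_items maps item_id -> kind for currently open items;
    window_msg_ids is the set of message ids read in this run.
    """
    to_apply, review, dropped = [], [], []
    for p in proposals:
        kind = open_items.get(p.item_id)
        if kind is None:
            dropped.append((p, "target is not an open item"))
        elif p.evidence_msg_id not in window_msg_ids:
            dropped.append((p, "evidence outside this run's window"))
        elif p.action not in LEGAL_ACTIONS.get(kind, set()):
            dropped.append((p, "illegal transition for kind"))
        elif p.action == "close" and not p.strong_signal:
            review.append(p)  # assumption: weak closes go to human review
        else:
            to_apply.append(p)

    closes = [p for p in to_apply if p.action == "close"]
    if len(closes) > MAX_AUTO_CLOSES_PER_RUN:
        # Too many closes in one run: demote all of them, not just the excess.
        to_apply = [p for p in to_apply if p.action != "close"]
        review.extend(closes)
    return to_apply, review, dropped
```

The dropped list carries a reason with each entry so that the caller can log it, matching the "dropped and logged, not applied" rule.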
The human correction signal is a reaction (✅/❌) on the bot's own message. A prompt injection can change what the AI proposes, but it cannot manufacture a real teammate's reaction — the signal lives outside any text channel an injection controls.
This holds only because reactions are filtered to human users (the platform's own bot/app flag). A bot/webhook/agent reaction is ignored, so even a compromised agent cannot ✅ its own malicious proposal. Reaction apply is idempotent and the human override is sticky.
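A minimal sketch of the reaction handling described above. The `Reaction` and `ReviewState` types, the `event_id` used for de-duplication, and the `may_auto_change` check are all hypothetical; in particular, reading "sticky" as "automation never revisits a proposal a human has ruled on" is an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reaction:
    event_id: str       # unique id of this reaction event (assumed available)
    proposal_id: int
    emoji: str          # "✅" or "❌"
    user_is_bot: bool   # taken from the platform's own bot/app flag


@dataclass
class ReviewState:
    verdicts: dict = field(default_factory=dict)    # proposal_id -> verdict
    seen_events: set = field(default_factory=set)

    def apply_reaction(self, r: Reaction) -> bool:
        """Apply a human reaction; return True only if state changed."""
        if r.user_is_bot:
            return False  # bot, webhook, and agent reactions are ignored
        if r.event_id in self.seen_events:
            return False  # replayed event: applying twice changes nothing
        verdict = {"✅": "confirmed", "❌": "rejected"}.get(r.emoji)
        if verdict is None:
            return False  # unrelated emoji
        self.seen_events.add(r.event_id)
        changed = self.verdicts.get(r.proposal_id) != verdict
        self.verdicts[r.proposal_id] = verdict
        return changed

    def may_auto_change(self, proposal_id: int) -> bool:
        """Sticky override: automation leaves human-ruled proposals alone."""
        return proposal_id not in self.verdicts
```

Because the bot flag comes from the platform rather than from message text, nothing the model reads or writes can flip `user_is_bot` in this sketch.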
Every digest and every agent answer sends chat content, along with any
knowledge-base doc the agent cites, to an external model provider (the one named
in your workspace profile). For a client or regulated deployment, that data
flow may matter. The provider is swappable by config (provider in the profile:
any Anthropic-protocol endpoint, including self-hosted or alternative
providers). Decide and document where chat goes before pointing this at
sensitive workspaces.
API keys live only in .env; the workspace profile stores the name of the env
var, never the key, and the schema rejects an inlined credential. Never commit
.env or a real profile (both are gitignored). The snapshot directory is
group-readable but not group-writable (mode 2750, setgid), so the sandboxed
reader cannot plant files there.
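The two rules above can be sketched as small checks. The profile field name `api_key_env`, the environment-variable naming pattern, and both function names are assumptions for illustration; the real schema may use different names and a stricter test. The pattern check is only a heuristic: it rejects values that do not look like an environment variable name, which covers typical keys containing lowercase letters or hyphens.

```python
import os
import re
import stat

# Heuristic: upper-case letters, digits, and underscores, not starting with a digit.
ENV_NAME = re.compile(r"[A-Z_][A-Z0-9_]*")


def resolve_api_key(profile: dict, environ=os.environ) -> str:
    """Look up the key named by the profile; refuse anything that looks inlined."""
    name = profile.get("api_key_env", "")  # hypothetical field name
    if not ENV_NAME.fullmatch(name):
        raise ValueError("api_key_env must name an environment variable, not hold a key")
    try:
        return environ[name]
    except KeyError:
        raise ValueError(f"environment variable {name} is not set") from None


def snapshot_dir_mode_ok(path: str) -> bool:
    """True if path has exactly mode 2750 (setgid, rwxr-x---)."""
    return stat.S_IMODE(os.stat(path).st_mode) == 0o2750
```

A startup check could call `snapshot_dir_mode_ok` and refuse to publish if the directory has drifted to a group-writable mode.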
This is currently a self-hosted tool with no central service. If you find a vulnerability, please report it privately via this repository's GitHub Security Advisories ("Report a vulnerability" under the Security tab) rather than opening a public issue. We'll respond as soon as we can.