Phase 7 docs #576
Open
DanielKim03 wants to merge 25 commits into
Conversation
Introduces backend/app/llm/ with an abstract LLMBackend interface and three concrete implementations: Ollama (local), OpenAI-SDK-compat (OpenAI, Anthropic, Together, DeepInfra, Groq, Fireworks), and vLLM (thin specialization of openai_compat). ModelRouter resolves task roles (fast/balanced/heavy/embed) to backends, wraps every call with exponential-backoff retry + configurable fallback chain, and persists per-call token/latency/cost to a SQLite llm_calls table for the acceptance-check cache-hit-rate metric. Also defers flask imports in app/__init__.py so `import app.llm` works in unit tests without installing the full Flask stack.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
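A minimal sketch of the routing shape this describes, with hypothetical names (the real interface lives in backend/app/llm/ and will differ in detail):

```python
# Sketch only: role -> fallback chain of backends, exponential-backoff retry.
from abc import ABC, abstractmethod
import time


class LLMBackend(ABC):
    @abstractmethod
    def complete(self, messages: list[dict], **opts) -> dict:
        """Return {"text": ..., "usage": {...}} or raise a backend error."""


class ModelRouter:
    def __init__(self, chains: dict[str, list[LLMBackend]], max_retries: int = 3):
        self.chains = chains            # e.g. {"fast": [...], "heavy": [...]}
        self.max_retries = max_retries

    def chat(self, role: str, messages: list[dict], **opts) -> dict:
        last_err = None
        for backend in self.chains[role]:          # configurable fallback chain
            for attempt in range(self.max_retries):
                try:
                    return backend.complete(messages, **opts)
                except Exception as err:           # real code skips non-retryables
                    last_err = err
                    time.sleep(2 ** attempt)       # exponential backoff
        raise last_err or RuntimeError(f"no backend configured for role {role!r}")
```

Per-call accounting (tokens/latency/cost into llm_calls) would wrap the `backend.complete` call inside the same loop.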
LLMClient becomes a thin back-compat shim over ModelRouter.default(). Its public methods (chat, chat_json) keep the same signatures, so every existing caller continues to work. Adds chat_raw() for callers that need token counts or finish_reason (used by profile / sim-config generators to detect truncation).
Migrates the three duplicated OpenAI(...) instantiation sites:
  - oasis_profile_generator.py: direct client -> LLMClient(role=BALANCED)
  - simulation_config_generator.py: direct client -> LLMClient(role=BALANCED)
  - utils/llm_client.py: openai.OpenAI -> router delegate
Assigns task-appropriate roles at the remaining callers:
  - ontology_generator, zep_tools -> fast
  - report_agent -> heavy
Prompts are unchanged; only the transport moves. cache_key hints are added at the two migrated sites so Anthropic / OpenAI prompt caching actually kicks in on the stable system-prompt prefix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
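A hedged usage sketch of the shim surface (the import path and BALANCED constant are assumptions based on the description above):

```python
# Hypothetical import path; the shim lives in utils/llm_client.py per the text.
from app.utils.llm_client import LLMClient, BALANCED

client = LLMClient(role=BALANCED)      # delegates to ModelRouter.default()
system = "You are a persona generator."
user = "Generate a profile for this entity."
text = client.chat(system, user)       # same signature as before the migration
raw = client.chat_raw(system, user)    # new: exposes usage + finish_reason
if raw.finish_reason == "length":
    pass  # truncation detected -- retry or trim, as the generators do
```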
Adds BACKEND_MODE (local|cloud|custom), LLM_ROLE_<role>_BACKEND/MODEL/..., VLLM_DRAFT_MODEL / VLLM_SPECULATIVE_TOKENS, LLM_MAX_RETRIES, LLM_CALLS_DB, LLM_PRICING_JSON. Back-compat: legacy LLM_API_KEY/LLM_BASE_URL/LLM_MODEL_NAME remain the default for any cloud role whose per-role keys are unset. .env.example documents the full shape with commented Anthropic-Haiku/Opus example.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
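An illustrative per-role block following that pattern (key names per the description; model ids and endpoint are placeholders — .env.example carries the canonical commented example):

```
BACKEND_MODE=cloud
LLM_ROLE_FAST_BACKEND=openai_compat
LLM_ROLE_FAST_MODEL=claude-haiku-REPLACE-ME      # placeholder model id
LLM_ROLE_HEAVY_BACKEND=openai_compat
LLM_ROLE_HEAVY_MODEL=claude-opus-REPLACE-ME      # placeholder model id
LLM_MAX_RETRIES=3
# Legacy fallback for any cloud role whose per-role keys are unset:
LLM_API_KEY=sk-REPLACE-ME
LLM_BASE_URL=https://api.example.com/v1          # placeholder endpoint
```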
Coverage per module:
base.py 5 tests -- complete/stream defaults, BackendError shape
openai_compat.py 7 tests -- usage parsing, cache_key routing by provider,
Anthropic cache_control tagging, error classification
ollama.py 5 tests -- token counts, JSON mode, 5xx retryable,
network wrap, per-text embed loop
vllm.py 2 tests -- extra_body forwarding, cache_key suppression
accounting.py 7 tests -- cost table, cached-rate math, SQLite round-trip,
cache-hit-rate aggregation
router.py 7 tests -- happy path, retry-then-succeed, non-retryable
skip, fallback chain, accounting wiring,
missing-role error, embed dispatch
conftest.py redirects LLM_CALLS_DB to a tmp path per test and resets the
ModelRouter default singleton between tests to keep runs hermetic.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds backend/app/memory/ with:
base.py -- MemoryBackend abstract interface; Observation/Reflection
dataclasses; Namespace helpers (agent:<sim>:<id> /
public:<sim>:timeline); base cosine + recency + importance
scoring helpers shared across backends
in_memory.py -- dict-backed reference implementation (tests + minimal
local runs)
zep_cloud.py -- adapter around the existing Zep graph.add / graph.search
pipeline; stores observations as marker-prefixed episodes
so they can be parsed back on read
neo4j_local.py -- self-hosted Neo4j 5.x via bolt:// with a clean Cypher
schema (Observation/Reflection/Namespace nodes + IN /
DERIVED_FROM / CONTRADICTS edges); cosine sim computed
client-side for portability, with a commented upgrade
path to native vector indexes
neo4j_aura.py -- managed AuraDB subclass, warns on non-TLS URIs
hierarchical.py -- ImportanceScorer (fast LLM, 1-10, fallback 5),
ReflectionScheduler (every N rounds, top-K by importance,
balanced LLM, 3-5 beliefs with source pointers),
ContradictionDetector (fast LLM binary, top-3 neighbors,
writes conflict_edge on sentiment flip)
router.py -- MemoryRouter picks backend from MEMORY_BACKEND env with
auto-heuristic (NEO4J_URI > ZEP_API_KEY > in_memory)
manager.py -- MemoryManager wraps a backend + the three hierarchical
services. Enforces per-agent namespace isolation: public
posts get mirrored to public:<sim>:timeline, private
observations stay in agent:<sim>:<id>, cross-agent reads
never traverse another agent's private partition.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
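The shared scoring helpers combine cosine, recency, and importance in the style of generative-agents retrieval; a minimal sketch, assuming the MEMORY_ALPHA/BETA/GAMMA weights map to those three terms and that records carry embedding / created_ts / importance fields (field names are assumptions):

```python
import math
import time


def combined_score(query_vec, record, alpha=1.0, beta=1.0, gamma=1.0,
                   half_life_s=3600.0):
    # Cosine similarity between the query and the stored embedding.
    dot = sum(a * b for a, b in zip(query_vec, record.embedding))
    norm = (math.sqrt(sum(a * a for a in query_vec))
            * math.sqrt(sum(b * b for b in record.embedding)))
    cosine = dot / norm if norm else 0.0
    # Exponential recency decay; importance is the 1-10 score scaled to [0, 1].
    recency = math.exp(-(time.time() - record.created_ts) / half_life_s)
    importance = record.importance / 10.0
    return alpha * cosine + beta * recency + gamma * importance
```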
Integration points:
* ZepGraphMemoryManager grows .get_memory_manager(sim_id) which lazily
instantiates a MemoryManager per simulation. The existing Zep batch
updater keeps running for document-seeded graph enrichment; every
add_activity() now also mirrors the activity into the MemoryManager
so Phase-2 features (importance scoring, reflection, contradiction,
retrieval) light up without touching simulation_runner.py.
* CREATE_POST / QUOTE_POST / REPOST / CREATE_COMMENT are mirrored to
the public:<sim>:timeline namespace so peer agents can see them via
retrieve_for_agent(include_public=True). Non-public actions stay
private. stop_updater() also closes the associated MemoryManager.
New blueprint /api/agents:
GET /api/agents/<id>/reflections?simulation_id=...
GET /api/agents/<id>/conflicts?simulation_id=...
POST /api/agents/<id>/retrieve (body: simulation_id, query, top_k, weights)
Config + env additions: MEMORY_BACKEND (auto/in_memory/zep_cloud/neo4j_local/
neo4j_aura), NEO4J_URI/USER/PASSWORD/DATABASE, REFLECTION_EVERY_N_ROUNDS,
REFLECTION_TOP_K_SOURCES, MEMORY_ALPHA/BETA/GAMMA, MEMORY_ENABLE_*. .env.example
documents the full surface with commented examples for Aura / local Neo4j.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
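A hedged example call against the retrieve endpoint (host/port and the weight key names are assumptions; body fields follow the route listing above):

```python
import requests

resp = requests.post(
    "http://localhost:5000/api/agents/agent-42/retrieve",
    json={
        "simulation_id": "sim-001",
        "query": "stance on the carbon tax",
        "top_k": 5,
        "weights": {"alpha": 1.0, "beta": 0.5, "gamma": 0.5},  # hypothetical keys
    },
    timeout=30,
)
print(resp.json())
```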
Coverage per module:
base.py 11 tests -- namespace factories + parsing, record kind
defaults, recency/importance/cosine helpers,
error shape
in_memory.py 9 tests -- namespace validation, combined-score ranking,
cross-agent isolation, vector KNN ignores
recency/importance, reflection source-id
validation, conflict edge persistence &
endpoint checks, summarize_window ordering
hierarchical.py 9 tests -- importance parsing + verbose-reply recovery
+ LLM-error fallback; reflection scheduler
writes beliefs, skips below cadence, skips
with too-few sources; contradiction detector
writes edges on positive, skips without
embedding, no-op on negative classification
manager.py 7 tests -- private namespace writes, public mirroring,
cross-agent read isolation, reflection cadence
wiring, stance-flip end-to-end (Phase-2
acceptance criterion), close() propagation
router.py 5 tests -- explicit selection, auto-heuristic picking
in_memory / neo4j_local / neo4j_aura /
zep_cloud based on env, unknown-kind error
LLM calls are fully stubbed via ScriptedRouter / FakeRouter — no network is
required.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New package backend/app/transport/:
base.py -- Transport (backend) + ServerTransport (subprocess) ABCs;
Command / Response / Event dataclasses with JSON +
(topic, body) frame roundtrip; in-memory pair factory for tests
file_ipc.py -- preserves the original file-poll protocol for back-compat.
Adds an append-only jsonl events channel so both sides can
at least tail events (not real-time)
zmq_transport.py -- DEALER/ROUTER for commands (lets the backend issue
concurrent requests without REQ/REP turn-taking) + PUB/SUB
for events. Uses ipc:// sockets by default; TCP available
via env. A doc comment explains the grpc-vs-zmq tradeoff
per the phase ground rules
factory.py -- build_client_transport / build_server_transport pick the
right pair based on IPC_TRANSPORT env (default: zmq)
The legacy simulation_ipc.py is untouched. Callers migrate incrementally by
swapping SimulationIPCClient for build_client_transport(); the two can
coexist per-simulation during rollout.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
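A standalone pyzmq sketch (not the zmq_transport.py code) showing why DEALER/ROUTER permits concurrent in-flight commands where REQ/REP would force strict turn-taking:

```python
import zmq

ctx = zmq.Context.instance()
router = ctx.socket(zmq.ROUTER)
router.bind("inproc://cmd")        # ipc:// by default in the package; tcp via env
dealer = ctx.socket(zmq.DEALER)
dealer.connect("inproc://cmd")

dealer.send_json({"cmd": "step", "run_id": "r1"})  # two commands in flight at
dealer.send_json({"cmd": "step", "run_id": "r2"})  # once -- impossible with REQ/REP

for _ in range(2):
    identity, payload = router.recv_multipart()    # ROUTER prepends peer identity
    router.send_multipart([identity, payload])     # echo back to the right DEALER

print(dealer.recv_json(), dealer.recv_json())
ctx.destroy()
```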
Adds backend/app/ws/:
bridge.py -- EventBridge: one background worker thread per run_id
tails the transport event stream and fans out to every
registered subscriber. Thread-safe; subscribe() returns
an unsubscribe closure; stop_run() tears down the worker
+ transport on simulation teardown. Process-wide singleton
accessed via get_bridge().
streaming.py -- flask-sock routes. No-op when flask-sock isn't installed
so the HTTP API keeps working on bare flask installs.
/ws/simulation/<run_id> live event feed
/ws/simulation/<run_id>/interview streaming token-by-token reply
via router.stream_chat() — skips
the subprocess round trip so
latency drops from ~200ms to
single-digit ms per token.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
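Hedged usage sketch of the bridge surface named above (the callback signature is an assumption):

```python
from app.ws.bridge import get_bridge  # module path as described above

bridge = get_bridge()                             # process-wide singleton
unsubscribe = bridge.subscribe("run-123", print)  # per-event fan-out callback
# ... the per-run worker thread tails the transport and invokes subscribers ...
unsubscribe()                                     # returned closure halts delivery
bridge.stop_run("run-123")                        # tears down worker + transport
```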
Adds backend/app/checkpoint/:
serializer.py -- collect_checkpoint() walks every namespace the manager
has seen (per-agent + public timeline) and snapshots
every record + conflict edge into a CheckpointData
dataclass. restore_into() replays records in a safe
order (observations before reflections, conflicts last)
so source_ids always resolve. format_version mismatch
raises so stale archives can't silently corrupt state.
archiver.py -- save_checkpoint / restore_checkpoint pack the snapshot
into .tar.zst (with gzip fallback when zstandard is
unavailable). Archive layout is boring tar so operators
can `zstd -d | tar -tv` for inspection.
API endpoints mounted under /api/simulation/<sim_id>:
POST /checkpoint -- capture round state to disk
POST /restore -- restore by path or by round_num
GET /checkpoints -- list archived checkpoints with size + mtime
Phase-3 config additions: IPC_TRANSPORT, IPC_CMD_ENDPOINT, IPC_EVENT_ENDPOINT.
requirements.txt adds pyzmq, flask-sock, zstandard, and neo4j (phase 2
backend driver; optional — not installed unless MEMORY_BACKEND=neo4j_*).
.env.example documents the transport and WebSocket endpoints.
create_app() now also calls register_ws_routes(app) so /ws/* endpoints are
attached automatically when flask-sock is installed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
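The pack step with the gzip fallback might look like this sketch (not archiver.py; layout and naming will differ):

```python
import tarfile


def pack(src_dir: str, dest_base: str) -> str:
    try:
        import zstandard
        path = dest_base + ".tar.zst"
        with open(path, "wb") as fh, \
             zstandard.ZstdCompressor().stream_writer(fh) as zfh, \
             tarfile.open(fileobj=zfh, mode="w|") as tar:  # plain streaming tar
            tar.add(src_dir, arcname=".")
    except ImportError:                                    # zstandard unavailable
        path = dest_base + ".tar.gz"
        with tarfile.open(path, "w:gz") as tar:
            tar.add(src_dir, arcname=".")
    return path
```

Because the payload is an ordinary tar stream, `zstd -d | tar -tv` works for inspection, as noted above.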
…green)
Coverage per module:
transport/base 6 tests -- command/response/event JSON roundtrip,
event frame encoding, in-memory pair
happy-path + timeout-on-silent-server
transport/file_ipc 3 tests -- file-based command roundtrip, append-only
event tailing, per-run event filtering
transport/zmq 3 tests -- inproc:// command roundtrip (DEALER/ROUTER),
timeout when server is silent, PUB/SUB
slow-joiner-aware multi-event fan-out
ws/bridge 3 tests -- multi-subscriber fan-out correctness,
unsubscribe halts delivery, stop_run
releases transport + worker
checkpoint 5 tests -- captures all namespaces (agent private +
public timeline), tar.zst archive roundtrip,
restore_into reproduces state in a fresh
manager (Phase-3 acceptance criterion),
format_version mismatch raises, archive
path contains round number
Totals: 33 (phase 1) + 41 (phase 2) + 20 (phase 3) = 94 passing.
No network; ZMQ tests use inproc:// endpoints; WS bridge tests drive the
file transport for deterministic timing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds backend/app/personas/:
schema.py -- StructuredPersona dataclass (Big Five traits + conviction +
credibility + background + initial stance); Archetype enum
with defaults per archetype (conviction floor = 1.0 for
bots/trolls; credibility ceiling matching persona type).
Background hard-capped at 200 chars for prefix cacheability.
Clean JSON round-trip (acceptance criterion).
prompts.py -- persona_system_block(): the fixed template injected into
every agent prompt. Stable prefix (archetype rules + scoring
scales) first, volatile persona block last — this ordering
is what lets Anthropic / OpenAI prompt caching actually
catch the common prefix across every agent in the run.
generator.py -- PersonaGenerator uses the `balanced` LLM role + strict JSON
schema to fill in Big Five / stance / background. Procedural
fallback when the LLM fails so simulation doesn't die. BOT
and TROLL personas bypass the LLM entirely — their behavior
is dictated by the archetype.
population.py -- build_population() mixes normal / media / expert / bot /
troll agents by percentage, with deterministic seeding.
build_bot_persona / build_troll_persona produce procedural
personas with the right extras ({narrative} / {tone}).
inertia.py -- StanceInertia: per-agent counter of opposing vs supporting
posts seen. Valence threshold (0.2) filters out noise.
should_allow_flip(persona) enforces the
ceil(10*conviction) spec rule. Snapshot / restore for
checkpoints.
credibility.py -- CredibilityWeighter: re-ranks retrieval results by author
credibility. Formula: base * (1 + weight * (cred - 0.5)).
Unknown authors get neutral (0.5) — posts are never
silently dropped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
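The two spec formulas quoted above, written out as plain functions for reference (signatures are hypothetical):

```python
import math


def opposing_posts_needed(conviction: float) -> int:
    # ceil(10 * conviction): an agent at conviction 0.9 needs 9 opposing posts
    # before should_allow_flip() lets it flip.
    return math.ceil(10 * conviction)


def reweighted_score(base: float, credibility: float, weight: float) -> float:
    # base * (1 + weight * (cred - 0.5)): cred 0.5 (unknown author) is neutral,
    # weight 0.0 disables re-ranking; posts are re-ranked, never dropped.
    return base * (1 + weight * (credibility - 0.5))
```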
Integration points:
* MemoryManager grows an optional credibility_weighter parameter +
set_credibility_weighter() setter. retrieve_for_agent() applies the
weighter after merging private + public records, before the final sort.
When unset, behavior is identical to Phase 2.
* OasisProfileGenerator gains two Phase-4 methods:
generate_structured_persona_for_entity(entity, user_id, archetype,
topic_summary)
-> StructuredPersona via PersonaGenerator
attach_structured_persona(profile, persona)
-> splices the prompt block + JSON tag into the OASIS profile's
`persona` field. The original prose-based path is untouched
so legacy callers keep working.
Config + env additions:
BOT_POPULATION_PCT, TROLL_POPULATION_PCT (default 0/0 — enabling these
changes outcomes per the phase spec)
MEDIA_POPULATION_PCT, EXPERT_POPULATION_PCT (institutional boosts)
POPULATION_SEED (deterministic mixing for reproducible eval runs)
CREDIBILITY_WEIGHT (re-rank strength; 0.0 disables)
.env.example documents each knob with the "enabling these changes outcomes"
warning called out by the spec.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Coverage per module:
schema.py 8 tests -- Big Five clamping, stance valence clamping,
background truncation, JSON round-trip
(Phase-4 acceptance), opposing_posts_needed
scales with conviction, stance_is_opposed_by
sign check + neutral-stance safeguard, bot
archetype default floors, strict from_dict
population.py 8 tests -- default all-normal, floor rounding, exact
percentages, deterministic seeding, over-100
rejection, negative rejection, bot narrative
extras, troll tone extras
inertia.py 8 tests -- high-conviction resists single opposing post
(Phase-4 acceptance), resists 20 rounds with
8 opposing (below threshold), flips once
threshold crossed, low-conviction flips
quickly, valence-threshold filters noise,
supporting posts counted separately, reset
clears counters, snapshot/restore round-trip
credibility.py 5 tests -- high cred outranks low cred at tied base
score, multiplier formula, weight=0 is noop,
unknown author uses neutral, non-mutating
prompts.py 6 tests -- stable prefix identical across agents (prefix
cache correctness), bot narrative embedded,
troll tone embedded, conviction + opposing-
needed count appear in volatile, topic summary
appended, archetype rules vary per archetype
generator.py 5 tests -- LLM JSON -> persona assembly, code-fence
stripping, fallback on any LLM error (runtime /
network / parse), background length cap,
archetype floor clamping even from LLM output
integration.py 4 tests -- credibility reweights public-timeline
retrieval (end-to-end), bot population
changes retrievable content (Phase-4
acceptance), high-conviction agent holds
across 20 rounds (Phase-4 acceptance),
deterministic population seed
Totals: 33 (p1) + 41 (p2) + 20 (p3) + 44 (p4) = 138 passing.
All LLM calls stubbed — no network required.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…unner)
New package backend/eval/:
determinism.py -- DeterministicClock + deterministic_run() context manager.
Replaces wall-clock now_ts(), seeds the global RNG, and
restores state on exit. DETERMINISTIC_VERSION constant
bumped on any math / mock-table change so CI catches drift.
scoring.py -- Pure functions: directional_accuracy, magnitude_error,
calibration. Composite = 0.5*dir + 0.3*(1-mag) + 0.2*cal.
Direction synonyms (support/oppose) accepted.
verdict.py -- verdict_from_public_timeline aggregates public-namespace
posts into a signed support_ratio, weighted by author
credibility when personas are supplied. verdict_from_report
parses the optional ReportAgent JSON surface.
mocks.py -- MockRouter: deterministic drop-in for ModelRouter. SHA-256
hashes prompt+salt for every decision; importance returns
an integer 1-10; reflection returns 3 canned beliefs;
contradiction is True ~25% via hash bucket; persona JSON
is synthesized from the entity name so downstream code
sees varied credibility / valence distributions.
pipeline.py -- run_case() orchestrates: build_population -> PersonaGenerator
-> MemoryManager (Phase-2/4 features per FeatureFlags) ->
rounds of posts with stance-anchored valence -> Verdict.
FeatureFlags is the knob the ablation tool sweeps over.
storage.py -- JSONL append / read for the eval-results dashboard.
EVAL_RESULTS_PATH env override.
runner.py -- CLI. `python -m backend.eval.runner --case <name>
--deterministic --mock-llm` produces a numeric score.
Two runs with those flags are BYTE-IDENTICAL (Phase-5
acceptance). Warns if --deterministic is passed without
--mock-llm.
ablation.py -- CLI. Sweeps baseline + 7 variants (no_importance,
no_reflection, no_contradiction, no_credibility,
no_conviction, no_phase2, no_phase4) and prints a
comparison table with Δ vs baseline.
backend/__init__.py added so `python -m backend.eval.runner` resolves and
so modules inside eval/ can reach backend/app/* via the sys.path guard
at the top of runner.py / ablation.py.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
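The composite from scoring.py, restated as a worked example:

```python
def composite(directional: float, magnitude_error: float, calibration: float) -> float:
    # Composite = 0.5*dir + 0.3*(1 - mag) + 0.2*cal, as specified above.
    return 0.5 * directional + 0.3 * (1 - magnitude_error) + 0.2 * calibration


# Perfect direction, 0.1 magnitude error, 0.8 calibration:
# 0.5*1 + 0.3*0.9 + 0.2*0.8 = 0.93
assert abs(composite(1.0, 0.1, 0.8) - 0.93) < 1e-9
```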
Cases (backend/eval/datasets/<name>/{seed.md, question.md, truth.json}):
sample_policy_carbon_tax -- truth: negative, magnitude 0.45
sample_product_vr_headset -- truth: positive, magnitude 0.55
sample_policy_remote_work -- truth: positive, magnitude 0.35
sample_product_ai_service -- truth: neutral (polarized), magnitude 0.10
sample_election_incumbent_mayor -- truth: positive, magnitude 0.30
Each truth.json cites a comparable real-world analog in its `notes` field.
README.md spells out the schema and flags these as starter fixtures to be
replaced with peer-reviewed cases before publishing benchmark numbers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
API:
GET /api/eval/results?limit=50[&case=<name>]
-> {"count": N, "results": [<record>, ...]} — newest first
Reads from the JSONL store populated by `runner.py --persist`.
CI workflow (.github/workflows/eval-smoke.yml):
On every PR against main/master:
1. Install the minimal deterministic-path deps (no Zep / OASIS / Neo4j)
2. pytest backend/tests/
3. Run `backend.eval.runner` twice with --deterministic --mock-llm,
--output-json; `diff -q` enforces byte-identical output
(Phase-5 acceptance criterion)
4. Smoke-run backend.eval.ablation to confirm the table format stays stable
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Coverage per module:
scoring.py 7 tests -- directional synonym handling, magnitude clipping,
calibration rewards/penalties, composite uses
weights, custom weights, clamping
verdict.py 6 tests -- empty timeline -> neutral/0 confidence,
consistent positive maps to positive, credibility
tips split votes, ReportAgent JSON parsing with
code fences, None on garbage
determinism.py 5 tests -- clock monotonic + reproducible, wall-clock
fallback outside block, global RNG state
restored, seeded_random isolation, version
constant exposed
mocks.py 6 tests -- chat determinism, importance integer range,
reflection 3 beliefs, contradiction mixed
distribution, embed determinism, persona schema
validity
storage.py 7 tests -- append creates file, recorded_ts auto-added,
newest-first ordering, limit honored, case
filter, missing file -> empty, malformed line
skipped
runner.py + ablation.py 6 tests (subprocess-based) — numeric score emitted
                (Phase-5 acceptance #1), byte-identical output
                across two runs (Phase-5 acceptance #2),
                deterministic warning without mock-llm, ablation
                emits table with all variants, --output-json
                parseable (Phase-5 acceptance #3)
Totals: 33 (p1) + 41 (p2) + 20 (p3) + 44 (p4) + 37 (p5) = 175 passing
in 14s. Subprocess-based runner tests exercise the real CLI end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds backend/app/observability/:
logging.py -- configure_logging() configures structlog -> JSON on stdout
when structlog is installed; else falls back to a stdlib
JSON formatter that still honors bind_context(). Every log
line carries run_id / agent_id / phase via contextvars
without code changes at the call sites.
metrics.py -- Prometheus registry exposing the Phase-6-spec metrics:
llm_calls_total{role,provider,model,status}
llm_tokens_total{role,kind}
llm_cache_hit_ratio (rolling gauge, recomputed on emit)
memory_op_duration_seconds{op,backend} histogram
simulation_active_runs, simulation_rounds_total{platform}
auth_rejections_total{reason}
Degrades to a minimal in-process counter store with a
human-readable banner if prometheus_client is missing, so
/metrics still responds 200.
tracing.py -- OTel setup with OTLP/HTTP exporter. start_span() is a
no-op context manager when OTel SDK isn't installed or
OTEL_EXPORTER_OTLP_ENDPOINT is unset, so callers can use
it unconditionally.
Wires observe_llm_call() into the LLM router. Every successful and every
failed backend call (including retries) records prompt/completion/cached
tokens + status into Prometheus. Metric emission is wrapped in a bare
try/except so it can never break the LLM call path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
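Both degradation patterns reduce to a few lines; a sketch using the names above (the bodies are assumptions):

```python
from contextlib import contextmanager


@contextmanager
def start_span(name: str, **attrs):
    # No-op when the OTel SDK is missing or OTEL_EXPORTER_OTLP_ENDPOINT is
    # unset, so call sites can use it unconditionally.
    yield None


def observe_llm_call(record_fn, *args, **kwargs):
    try:
        record_fn(*args, **kwargs)   # metric emission...
    except Exception:
        pass                         # ...must never break the LLM call path
```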
Adds backend/app/auth/:
keys.py -- SQLite-backed ApiKeyStore. Plaintext key format is
`mf_<8-hex-id>_<~40-char-urlsafe-secret>`. Only the SHA-256
hash is stored; plaintext is returned ONCE at issue time.
Constant-time compare on verify to avoid timing leaks.
quotas.py -- QuotaTracker with atomic check-and-debit. Raises
QuotaExceeded on over-cap with a structured .to_dict()
for the 429 response body. Preview-mode non-mutating
check powers the cost-estimator approval flow.
30-day rolling window (env overridable).
middleware.py -- @require_api_key Flask decorator. Accepts
X-MiroFish-Key header (preferred) or ?api_key= query
(fallback). ALLOW_ANONYMOUS_API=true bypasses with a
metric increment so dashboards see reliance on anon.
Adds backend/app/cost/:
estimator.py -- estimate_simulation_cost(agents, rounds, role_models)
multiplies per-role default token budgets (tuned against
observed qwen-plus runs) by agents × rounds × calls.
Resolves (provider, model) -> price via Phase-1's
_PRICING table; unknown pairs annotate `note` without
crashing. ApprovalRequired raised by require_approval()
when estimate exceeds user_cap_usd.
New endpoints:
GET /metrics -- Prometheus scrape target
POST /api/auth/keys -- issue (admin-only)
GET /api/auth/keys -- list (admin-only)
DELETE /api/auth/keys/<id> -- revoke (admin-only)
GET /api/auth/quota -- current key's usage
POST /api/simulation/estimate-cost -- pre-flight estimate
create_app() now calls configure_logging() + configure_tracing() at
startup. Admin endpoints require `X-MiroFish-Admin-Token: $ADMIN_TOKEN`
and return 503 when ADMIN_TOKEN is unset (makes misconfiguration loud).
Config + env additions: ADMIN_TOKEN, ALLOW_ANONYMOUS_API, AUTH_DB_PATH,
QUOTA_DB_PATH, OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME,
COST_BUDGET_<ROLE>_{CALLS,IN,OUT,CACHED} overrides.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
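A minimal sketch of the hash-at-rest / constant-time-verify scheme described for keys.py (the real ApiKeyStore persists to SQLite and differs in detail):

```python
import hashlib
import hmac
import secrets


def issue_key():
    key_id = secrets.token_hex(4)           # 8 hex chars
    secret = secrets.token_urlsafe(30)      # ~40 urlsafe chars
    plaintext = f"mf_{key_id}_{secret}"     # returned ONCE at issue time
    digest = hashlib.sha256(plaintext.encode()).hexdigest()
    return key_id, plaintext, digest        # only digest goes to storage


def verify(presented: str, stored_digest: str) -> bool:
    candidate = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(candidate, stored_digest)  # constant-time compare
```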
deploy/helm/mirofish/:
Chart.yaml -- appVersion 0.6.0, kubeVersion >=1.24
values.yaml -- every key documented inline; groups: backend / frontend /
redis / vllm / memory / llm / auth / observability / ingress.
Defaults: backend x2 replicas, redis on, frontend + vllm
+ ingress off. Neo4j expected external (Aura) per spec.
templates/
_helpers.tpl -- fullname + labels
backend-{dep,svc}.yaml
redis.yaml -- inline, toggleable via .Values.redis.enabled
vllm.yaml -- optional GPU deployment
configmap.yaml -- flattens values into the backend's envFrom
secret.yaml -- placeholder Secret for ADMIN_TOKEN / LLM_API_KEY /
ZEP_API_KEY / neo4j-password (populate externally)
ingress.yaml -- optional, ingressClassName-aware
README.md -- install + lint + values overview
requirements.txt: structlog, prometheus_client, opentelemetry-api + sdk +
otlp-proto-http. Neo4j 5.x driver stays optional (only installed when
MEMORY_BACKEND=neo4j_*).
.env.example: documents ALLOW_ANONYMOUS_API, ADMIN_TOKEN, AUTH_DB_PATH,
QUOTA_DB_PATH, OTEL_*, and COST_BUDGET_* overrides in a single Phase-6
block before the existing Phase-4 persona section.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tal)
Coverage per module:
observability/logging 5 tests -- bind_context roundtrip, nested
LIFO unbind, stdlib JSON formatter
w/ contextvars merge,
configure_logging idempotency,
structlog path selection
observability/metrics 8 tests -- llm_call metric labels + cache
ratio update, memory_op histogram
buckets, active_run gauge, auth
rejection counter, content-type,
fallback when prometheus_client
missing, singleton accessor
observability/tracing 4 tests -- no-endpoint -> disabled, no-op span
when disabled, configure is
idempotent, attribute-setting span
opens + shuts down cleanly
auth/keys 9 tests -- issue returns plaintext once,
verify roundtrip, rejects garbage
+ tampered + revoked, list filters
by owner, excludes revoked by
default, to_dict strips secret,
quotas stored on key
auth/quotas 8 tests -- unlimited key passthrough, token
quota enforced, usd quota enforced,
atomic debit, failed-debit doesn't
apply (critical), preview non-
mutating, reset, fresh-key zeros
auth/middleware 6 tests -- missing header 401, valid key
accepted, invalid key 401, revoked
key 401, anonymous flag bypass,
query-string fallback
cost/estimator 8 tests -- linear scaling, unknown vendor ->
zero cost + note, cached fraction
discounts, approval flag when over
cap, ApprovalRequired exception,
zero cap disables, per-role
breakdown present, env budget
overrides merge
Totals: 33 (p1) + 41 (p2) + 20 (p3) + 44 (p4) + 37 (p5) + 48 (p6) = 223
passing in ~24s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MIGRATION.md at repo root: TL;DR + per-phase rundown, running in each mode (local / cloud / vLLM / Kubernetes), full new-keys table with per-phase origin, and notes on what's breaking vs additive. Flags that the public HTTP surface has ZERO breaking changes — every upstream endpoint behaves identically.
README.md: inserts a prominent "MiroFish-Cloud (Phase 1-6)" banner above Quick Start pointing at MIGRATION / architecture / BENCHMARKS. Adds Option 2 (multi-provider cloud), Option 3 (local-only via Ollama), and Option 4 (Helm chart) alongside the existing npm-dev quickstart.
docs/architecture.md: full module tree for every phase, abstract-backend diagrams (LLM router, memory layer, transport), request data-flow traces for a normal round + streaming interview, the Phase-6 Kubernetes deployment topology, and a cross-phase "notable design decisions" table.
BENCHMARKS.md: specifies the four benchmarks that matter (throughput, interview latency, eval scores, cost per 1k-agent sim), gives exact reproduction commands, carries the captured deterministic-ablation table as an in-repo number (verified by CI), and holds ⚠️-marked placeholder tables for the live-LLM numbers operators fill in after their first production runs. The test-suite runtime table (223 tests, ~24s) ships as a CI regression-guard baseline.
All docs-only; pytest still green at 223/223.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Config.validate() previously required LLM_API_KEY and ZEP_API_KEY
unconditionally, which blocked run.py from starting in Path A's local-only
mode (Ollama + in-memory backend, no cloud creds). Three relaxations:
* When BACKEND_MODE=local, no cloud LLM key is required. The router
uses Ollama defaults for every role.
* When MEMORY_BACKEND is in_memory or neo4j_*, ZEP_API_KEY stops being
required — Zep is only used under MEMORY_BACKEND=zep_cloud (or the
`auto` default).
* In cloud/custom mode, LLM_ROLE_BALANCED_API_KEY satisfies the check
when a per-role key has been configured without the legacy fallback.
backend/pyproject.toml grows entries for every phase-1-6 runtime dep that
already lives in requirements.txt (pyzmq, flask-sock, zstandard, neo4j,
structlog, prometheus_client, opentelemetry-*, requests). A single
`uv sync` now pulls the complete set — no follow-up `pip install` needed.
Verified live in local mode:
GET /health -> 200
GET /metrics -> 200 (Prometheus text + phase-6 metrics)
POST /api/simulation/estimate-cost -> 200 with full per-role breakdown
Every phase 1-6 blueprint + WebSocket route registered at startup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
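The relaxed checks boil down to something like this sketch (hypothetical; the real Config.validate() covers more cases):

```python
import os


def validate():
    if os.getenv("BACKEND_MODE", "cloud") != "local":
        # Legacy key or a configured per-role key both satisfy the check.
        if not (os.getenv("LLM_API_KEY") or os.getenv("LLM_ROLE_BALANCED_API_KEY")):
            raise RuntimeError("cloud/custom mode needs an LLM API key")
    backend = os.getenv("MEMORY_BACKEND", "auto")
    if backend in ("zep_cloud", "auto") and not os.getenv("ZEP_API_KEY"):
        raise RuntimeError("MEMORY_BACKEND=zep_cloud (or auto) needs ZEP_API_KEY")
```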
.env.example restructured for production deployment:
* Adds a prominent "CLOUD DEPLOYMENT QUICKSTART" header block at the top
listing the 3 secret lines to replace (LLM_API_KEY, ADMIN_TOKEN, and
optionally NEO4J_*).
* Default LLM config is now single-vendor OpenAI (sk-REPLACE-ME /
gpt-4o-mini) — simplest cloud path, one bill, built-in price table,
works for both chat and embeddings. Aliyun DashScope kept as a
commented alternative for existing upstream users.
* Uncommented LLM_ROLE_HEAVY_* and LLM_ROLE_EMBED_* so the default setup
uses gpt-4o for ReportAgent synthesis and text-embedding-3-large for
vector retrieval.
* FLASK_HOST=0.0.0.0 so containers / load balancers reach the backend.
* FLASK_DEBUG=false — auto-reload off in production.
* ADMIN_TOKEN gets a literal placeholder (was a commented hint) so
cloud deployments fail loud at boot when it's not filled in.
* Neo4j Aura block moved above local CE (cloud-first ordering) with
explicit connection-string format and REPLACE-ME placeholders.
* ZEP_API_KEY placeholder updated to `zk-REPLACE-ME` to match the
expected format.
backend/uv.lock: locks every phase 1-6 runtime dep added to pyproject.toml
in the previous commit — opentelemetry-{api,sdk,exporter-otlp-proto-http},
prometheus-client, structlog, flask-sock, pyzmq, zstandard, neo4j, etc.
Pinned versions match what `uv sync` produced in Path A.
frontend/package-lock.json: regenerated by `npm install` during Path A
setup; no version drift, just lockfile metadata refresh.
No functional code changes; .env itself stays gitignored.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>