
Phase 7 docs #576

Open
DanielKim03 wants to merge 25 commits into 666ghj:main from DanielKim03:phase-7-docs

Conversation

@DanielKim03

No description provided.

MiroFish Migration and others added 23 commits April 23, 2026 15:50
Introduces backend/app/llm/ with an abstract LLMBackend interface and three
concrete implementations: Ollama (local), OpenAI-SDK-compat (OpenAI, Anthropic,
Together, DeepInfra, Groq, Fireworks), and vLLM (thin specialization of
openai_compat). ModelRouter resolves task roles (fast/balanced/heavy/embed) to
backends, wraps every call with exponential-backoff retry + configurable
fallback chain, and persists per-call token/latency/cost to a SQLite
llm_calls table for the acceptance-check cache-hit-rate metric.

Also defers flask imports in app/__init__.py so `import app.llm` works in
unit tests without installing the full Flask stack.
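The retry-plus-fallback behavior described above can be sketched roughly as follows. This is a minimal illustration of the shape, not the actual ModelRouter API; `call_with_fallback` and the `retryable` flag are hypothetical names.

```python
import time

class BackendError(Exception):
    """Error shape carrying a retryable flag (illustrative)."""
    def __init__(self, msg, retryable=True):
        super().__init__(msg)
        self.retryable = retryable

def call_with_fallback(backends, prompt, max_retries=3, base_delay=0.5):
    """Try each backend in the fallback chain in order; retry retryable
    errors with exponential backoff before moving on to the next one."""
    last_err = None
    for backend in backends:
        for attempt in range(max_retries):
            try:
                return backend(prompt)
            except BackendError as err:
                last_err = err
                if not err.retryable:
                    break  # non-retryable: skip straight to the next backend
                time.sleep(base_delay * (2 ** attempt))
    raise last_err
```

The key property the router tests exercise ("retry-then-succeed", "non-retryable skip", "fallback chain") falls out of the two nested loops.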

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LLMClient becomes a thin back-compat shim over ModelRouter.default().
Its public methods (chat, chat_json) keep the same signatures, so every
existing caller continues to work. Adds chat_raw() for callers that need
token counts or finish_reason (used by profile / sim-config generators to
detect truncation).

Migrates the three duplicated OpenAI(...) instantiation sites:
  - oasis_profile_generator.py: direct client -> LLMClient(role=BALANCED)
  - simulation_config_generator.py: direct client -> LLMClient(role=BALANCED)
  - utils/llm_client.py: openai.OpenAI -> router delegate

Assigns task-appropriate roles at the remaining callers:
  - ontology_generator, zep_tools -> fast
  - report_agent -> heavy

Prompts are unchanged; only the transport moves. cache_key hints are
added at the two migrated sites so Anthropic / OpenAI prompt caching
actually kicks in on the stable system-prompt prefix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds BACKEND_MODE (local|cloud|custom), LLM_ROLE_<role>_BACKEND/MODEL/...,
VLLM_DRAFT_MODEL / VLLM_SPECULATIVE_TOKENS, LLM_MAX_RETRIES, LLM_CALLS_DB,
LLM_PRICING_JSON. Back-compat: legacy LLM_API_KEY/LLM_BASE_URL/LLM_MODEL_NAME
remain the default for any cloud role whose per-role keys are unset. .env.example
documents the full shape with commented Anthropic-Haiku/Opus example.
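The per-role-overrides-legacy-defaults resolution can be sketched like this; `resolve_role` and the default backend value are illustrative, only the env var names come from the commit.

```python
import os

def resolve_role(role, env=None):
    """Resolve config for one task role. Per-role keys
    (LLM_ROLE_<ROLE>_MODEL etc.) win; legacy flat keys
    (LLM_API_KEY / LLM_BASE_URL / LLM_MODEL_NAME) remain the
    default when the per-role keys are unset."""
    env = env if env is not None else os.environ
    prefix = f"LLM_ROLE_{role.upper()}_"
    return {
        "backend": env.get(prefix + "BACKEND", "openai_compat"),
        "model": env.get(prefix + "MODEL") or env.get("LLM_MODEL_NAME"),
        "api_key": env.get(prefix + "API_KEY") or env.get("LLM_API_KEY"),
        "base_url": env.get(prefix + "BASE_URL") or env.get("LLM_BASE_URL"),
    }
```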

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Coverage per module:
  base.py           5 tests -- complete/stream defaults, BackendError shape
  openai_compat.py  7 tests -- usage parsing, cache_key routing by provider,
                               Anthropic cache_control tagging, error classification
  ollama.py         5 tests -- token counts, JSON mode, 5xx retryable,
                               network wrap, per-text embed loop
  vllm.py           2 tests -- extra_body forwarding, cache_key suppression
  accounting.py     7 tests -- cost table, cached-rate math, SQLite round-trip,
                               cache-hit-rate aggregation
  router.py         7 tests -- happy path, retry-then-succeed, non-retryable
                               skip, fallback chain, accounting wiring,
                               missing-role error, embed dispatch

conftest.py redirects LLM_CALLS_DB to a tmp path per test and resets the
ModelRouter default singleton between tests to keep runs hermetic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds backend/app/memory/ with:
  base.py          -- MemoryBackend abstract interface; Observation/Reflection
                      dataclasses; Namespace helpers (agent:<sim>:<id> /
                      public:<sim>:timeline); base cosine + recency + importance
                      scoring helpers shared across backends
  in_memory.py     -- dict-backed reference implementation (tests + minimal
                      local runs)
  zep_cloud.py     -- adapter around the existing Zep graph.add / graph.search
                      pipeline; stores observations as marker-prefixed episodes
                      so they can be parsed back on read
  neo4j_local.py   -- self-hosted Neo4j 5.x via bolt:// with a clean Cypher
                      schema (Observation/Reflection/Namespace nodes + IN /
                      DERIVED_FROM / CONTRADICTS edges); cosine sim computed
                      client-side for portability, with a commented upgrade
                      path to native vector indexes
  neo4j_aura.py    -- managed AuraDB subclass, warns on non-TLS URIs
  hierarchical.py  -- ImportanceScorer (fast LLM, 1-10, fallback 5),
                      ReflectionScheduler (every N rounds, top-K by importance,
                      balanced LLM, 3-5 beliefs with source pointers),
                      ContradictionDetector (fast LLM binary, top-3 neighbors,
                      writes conflict_edge on sentiment flip)
  router.py        -- MemoryRouter picks backend from MEMORY_BACKEND env with
                      auto-heuristic (NEO4J_URI > ZEP_API_KEY > in_memory)
  manager.py       -- MemoryManager wraps a backend + the three hierarchical
                      services. Enforces per-agent namespace isolation: public
                      posts get mirrored to public:<sim>:timeline, private
                      observations stay in agent:<sim>:<id>, cross-agent reads
                      never traverse another agent's private partition.
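The shared cosine + recency + importance scoring in base.py presumably combines three terms weighted by the MEMORY_ALPHA/BETA/GAMMA knobs; a sketch under that assumption (the exponential recency decay and the /10 importance normalization are guesses, not confirmed by the source):

```python
import math

def cosine(a, b):
    """Plain client-side cosine similarity (portable across backends)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def combined_score(query_vec, rec_vec, age_seconds, importance,
                   alpha=1.0, beta=1.0, gamma=1.0, half_life=3600.0):
    """alpha*similarity + beta*recency + gamma*importance.
    Recency decays with age; importance is the 1-10 scorer output
    normalized to [0, 1]. Decay curve and half-life are assumptions."""
    recency = 0.5 ** (age_seconds / half_life)
    return (alpha * cosine(query_vec, rec_vec)
            + beta * recency
            + gamma * (importance / 10.0))
```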

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Integration points:
  * ZepGraphMemoryManager grows .get_memory_manager(sim_id) which lazily
    instantiates a MemoryManager per simulation. The existing Zep batch
    updater keeps running for document-seeded graph enrichment; every
    add_activity() now also mirrors the activity into the MemoryManager
    so Phase-2 features (importance scoring, reflection, contradiction,
    retrieval) light up without touching simulation_runner.py.
  * CREATE_POST / QUOTE_POST / REPOST / CREATE_COMMENT are mirrored to
    the public:<sim>:timeline namespace so peer agents can see them via
    retrieve_for_agent(include_public=True). Non-public actions stay
    private. stop_updater() also closes the associated MemoryManager.

New blueprint /api/agents:
  GET  /api/agents/<id>/reflections?simulation_id=...
  GET  /api/agents/<id>/conflicts?simulation_id=...
  POST /api/agents/<id>/retrieve  (body: simulation_id, query, top_k, weights)

Config + env additions: MEMORY_BACKEND (auto/in_memory/zep_cloud/neo4j_local/
neo4j_aura), NEO4J_URI/USER/PASSWORD/DATABASE, REFLECTION_EVERY_N_ROUNDS,
REFLECTION_TOP_K_SOURCES, MEMORY_ALPHA/BETA/GAMMA, MEMORY_ENABLE_*. .env.example
documents the full surface with commented examples for Aura / local Neo4j.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Coverage per module:
  base.py          11 tests -- namespace factories + parsing, record kind
                               defaults, recency/importance/cosine helpers,
                               error shape
  in_memory.py      9 tests -- namespace validation, combined-score ranking,
                               cross-agent isolation, vector KNN ignores
                               recency/importance, reflection source-id
                               validation, conflict edge persistence &
                               endpoint checks, summarize_window ordering
  hierarchical.py   9 tests -- importance parsing + verbose-reply recovery
                               + LLM-error fallback; reflection scheduler
                               writes beliefs, skips below cadence, skips
                               with too-few sources; contradiction detector
                               writes edges on positive, skips without
                               embedding, no-op on negative classification
  manager.py        7 tests -- private namespace writes, public mirroring,
                               cross-agent read isolation, reflection cadence
                               wiring, stance-flip end-to-end (Phase-2
                               acceptance criterion), close() propagation
  router.py         5 tests -- explicit selection, auto-heuristic picking
                               in_memory / neo4j_local / neo4j_aura /
                               zep_cloud based on env, unknown-kind error

LLM calls are fully stubbed via ScriptedRouter / FakeRouter — no network is
required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New package backend/app/transport/:
  base.py     -- Transport (backend) + ServerTransport (subprocess) ABCs;
                 Command / Response / Event dataclasses with JSON +
                 (topic, body) frame roundtrip; in-memory pair factory for tests
  file_ipc.py -- preserves the original file-poll protocol for back-compat.
                 Adds an append-only jsonl events channel so both sides can
                 at least tail events (not real-time)
  zmq_transport.py -- DEALER/ROUTER for commands (lets the backend issue
                 concurrent requests without REQ/REP turn-taking) + PUB/SUB
                 for events. Uses ipc:// sockets by default; TCP available
                 via env. A doc comment explains the grpc-vs-zmq tradeoff
                 per the phase ground rules
  factory.py  -- build_client_transport / build_server_transport pick the
                 right pair based on IPC_TRANSPORT env (default: zmq)

The legacy simulation_ipc.py is untouched. Callers migrate incrementally by
swapping SimulationIPCClient for build_client_transport(); the two can
coexist per-simulation during rollout.
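The (topic, body) frame roundtrip for events might look roughly like this; the `Event` fields and the `sim.<run_id>` topic scheme are illustrative, but prefix-filterable topics are what makes per-run PUB/SUB subscription cheap:

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class Event:
    run_id: str
    kind: str
    payload: dict = field(default_factory=dict)

    def to_frame(self):
        # Topic carries the run_id so SUB sockets can prefix-filter
        # per simulation; the body is the JSON-encoded event.
        topic = f"sim.{self.run_id}".encode()
        return topic, json.dumps(asdict(self)).encode()

    @classmethod
    def from_frame(cls, topic, body):
        return cls(**json.loads(body.decode()))
```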

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds backend/app/ws/:
  bridge.py    -- EventBridge: one background worker thread per run_id
                  tails the transport event stream and fans out to every
                  registered subscriber. Thread-safe; subscribe() returns
                  an unsubscribe closure; stop_run() tears down the worker
                  + transport on simulation teardown. Process-wide singleton
                  accessed via get_bridge().
  streaming.py -- flask-sock routes. No-op when flask-sock isn't installed
                  so the HTTP API keeps working on bare flask installs.
        /ws/simulation/<run_id>            live event feed
        /ws/simulation/<run_id>/interview  streaming token-by-token reply
                                           via router.stream_chat() — skips
                                           the subprocess round trip so
                                           latency drops from ~200ms to
                                           single-digit ms per token.
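The subscribe-returns-an-unsubscribe-closure pattern can be sketched as below (shape only, minus the per-run worker threads of the real class):

```python
import threading

class EventBridge:
    """Fans events out to subscribers; subscribe() hands back a closure
    that removes the subscriber again."""
    def __init__(self):
        self._lock = threading.Lock()
        self._subs = []

    def subscribe(self, callback):
        with self._lock:
            self._subs.append(callback)
        def unsubscribe():
            with self._lock:
                if callback in self._subs:
                    self._subs.remove(callback)
        return unsubscribe

    def publish(self, event):
        with self._lock:
            subs = list(self._subs)  # snapshot: callbacks may unsubscribe
        for cb in subs:
            cb(event)
```

Returning a closure keeps the API symmetric without exposing subscriber handles or list indices.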

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds backend/app/checkpoint/:
  serializer.py -- collect_checkpoint() walks every namespace the manager
                   has seen (per-agent + public timeline) and snapshots
                   every record + conflict edge into a CheckpointData
                   dataclass. restore_into() replays records in a safe
                   order (observations before reflections, conflicts last)
                   so source_ids always resolve. format_version mismatch
                   raises so stale archives can't silently corrupt state.
  archiver.py   -- save_checkpoint / restore_checkpoint pack the snapshot
                   into .tar.zst (with gzip fallback when zstandard is
                   unavailable). Archive layout is boring tar so operators
                   can `zstd -d | tar -tv` for inspection.
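The safe replay order (observations before reflections, conflicts last) amounts to a sort by record kind; a minimal sketch with an assumed record shape:

```python
# Replay order matters: a reflection's source_ids must already exist
# when it is inserted, and conflict edges reference records of either
# kind, so they go last. (Record dict shape is illustrative.)
_KIND_ORDER = {"observation": 0, "reflection": 1, "conflict": 2}

def replay_order(records):
    """Sort checkpoint records so every reference resolves on insert;
    within a kind, preserve timestamp order."""
    return sorted(records,
                  key=lambda r: (_KIND_ORDER[r["kind"]], r.get("ts", 0)))
```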

API endpoints mounted under /api/simulation/<sim_id>:
  POST /checkpoint   -- capture round state to disk
  POST /restore      -- restore by path or by round_num
  GET  /checkpoints  -- list archived checkpoints with size + mtime

Phase-3 config additions: IPC_TRANSPORT, IPC_CMD_ENDPOINT, IPC_EVENT_ENDPOINT.
requirements.txt adds pyzmq, flask-sock, zstandard, and neo4j (phase 2
backend driver; optional — not installed unless MEMORY_BACKEND=neo4j_*).
.env.example documents the transport and WebSocket endpoints.

create_app() now also calls register_ws_routes(app) so /ws/* endpoints are
attached automatically when flask-sock is installed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…green)

Coverage per module:
  transport/base        6 tests -- command/response/event JSON roundtrip,
                                   event frame encoding, in-memory pair
                                   happy-path + timeout-on-silent-server
  transport/file_ipc    3 tests -- file-based command roundtrip, append-only
                                   event tailing, per-run event filtering
  transport/zmq         3 tests -- inproc:// command roundtrip (DEALER/ROUTER),
                                   timeout when server is silent, PUB/SUB
                                   slow-joiner-aware multi-event fan-out
  ws/bridge             3 tests -- multi-subscriber fan-out correctness,
                                   unsubscribe halts delivery, stop_run
                                   releases transport + worker
  checkpoint            5 tests -- captures all namespaces (agent private +
                                   public timeline), tar.zst archive roundtrip,
                                   restore_into reproduces state in a fresh
                                   manager (Phase-3 acceptance criterion),
                                   format_version mismatch raises, archive
                                   path contains round number

Totals: 33 (phase 1) + 41 (phase 2) + 20 (phase 3) = 94 passing.
No network; ZMQ tests use inproc:// endpoints; WS bridge tests drive the
file transport for deterministic timing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds backend/app/personas/:
  schema.py      -- StructuredPersona dataclass (Big Five traits + conviction +
                    credibility + background + initial stance); Archetype enum
                    with defaults per archetype (conviction floor = 1.0 for
                    bots/trolls; credibility ceiling matching persona type).
                    Background hard-capped at 200 chars for prefix cacheability.
                    Clean JSON round-trip (acceptance criterion).
  prompts.py     -- persona_system_block(): the fixed template injected into
                    every agent prompt. Stable prefix (archetype rules + scoring
                    scales) first, volatile persona block last — this ordering
                    is what lets Anthropic / OpenAI prompt caching actually
                    catch the common prefix across every agent in the run.
  generator.py   -- PersonaGenerator uses the `balanced` LLM role + strict JSON
                    schema to fill in Big Five / stance / background. Procedural
                    fallback when the LLM fails so simulation doesn't die. BOT
                    and TROLL personas bypass the LLM entirely — their behavior
                    is dictated by the archetype.
  population.py  -- build_population() mixes normal / media / expert / bot /
                    troll agents by percentage, with deterministic seeding.
                    build_bot_persona / build_troll_persona produce procedural
                    personas with the right extras ({narrative} / {tone}).
  inertia.py     -- StanceInertia: per-agent counter of opposing vs supporting
                    posts seen. Valence threshold (0.2) filters out noise.
                    should_allow_flip(persona) enforces the
                    ceil(10*conviction) spec rule. Snapshot / restore for
                    checkpoints.
  credibility.py -- CredibilityWeighter: re-ranks retrieval results by author
                    credibility. Formula: base * (1 + weight * (cred - 0.5)).
                    Unknown authors get neutral (0.5) — posts are never
                    silently dropped.
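The two formulas quoted above are small enough to show directly; function names are illustrative, the arithmetic is taken from the commit text:

```python
import math

def opposing_posts_needed(conviction):
    """Spec rule: an agent flips stance only after seeing
    ceil(10 * conviction) opposing posts above the valence threshold."""
    return math.ceil(10 * conviction)

def reweight(base_score, author_cred, weight):
    """Credibility re-rank: base * (1 + weight * (cred - 0.5)).
    Unknown authors pass cred=0.5, so the score is unchanged rather
    than the post being silently dropped."""
    return base_score * (1 + weight * (author_cred - 0.5))
```

Note that at `weight=0` the re-rank is a no-op, which is what the CREDIBILITY_WEIGHT=0.0 disable knob relies on.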

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Integration points:
  * MemoryManager grows an optional credibility_weighter parameter +
    set_credibility_weighter() setter. retrieve_for_agent() applies the
    weighter after merging private + public records, before the final sort.
    When unset, behavior is identical to Phase 2.
  * OasisProfileGenerator gains two Phase-4 methods:
        generate_structured_persona_for_entity(entity, user_id, archetype,
                                                topic_summary)
            -> StructuredPersona via PersonaGenerator
        attach_structured_persona(profile, persona)
            -> splices the prompt block + JSON tag into the OASIS profile's
               `persona` field. The original prose-based path is untouched
               so legacy callers keep working.

Config + env additions:
    BOT_POPULATION_PCT, TROLL_POPULATION_PCT (default 0/0 — enabling these
        changes outcomes per the phase spec)
    MEDIA_POPULATION_PCT, EXPERT_POPULATION_PCT (institutional boosts)
    POPULATION_SEED (deterministic mixing for reproducible eval runs)
    CREDIBILITY_WEIGHT (re-rank strength; 0.0 disables)
.env.example documents each knob with the "enabling these changes outcomes"
warning called out by the spec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Coverage per module:
  schema.py           8 tests -- Big Five clamping, stance valence clamping,
                                 background truncation, JSON round-trip
                                 (Phase-4 acceptance), opposing_posts_needed
                                 scales with conviction, stance_is_opposed_by
                                 sign check + neutral-stance safeguard, bot
                                 archetype default floors, strict from_dict
  population.py       8 tests -- default all-normal, floor rounding, exact
                                 percentages, deterministic seeding, over-100
                                 rejection, negative rejection, bot narrative
                                 extras, troll tone extras
  inertia.py          8 tests -- high-conviction resists single opposing post
                                 (Phase-4 acceptance), resists 20 rounds with
                                 8 opposing (below threshold), flips once
                                 threshold crossed, low-conviction flips
                                 quickly, valence-threshold filters noise,
                                 supporting posts counted separately, reset
                                 clears counters, snapshot/restore round-trip
  credibility.py      5 tests -- high cred outranks low cred at tied base
                                 score, multiplier formula, weight=0 is noop,
                                 unknown author uses neutral, non-mutating
  prompts.py          6 tests -- stable prefix identical across agents (prefix
                                 cache correctness), bot narrative embedded,
                                 troll tone embedded, conviction + opposing-
                                 needed count appear in volatile, topic summary
                                 appended, archetype rules vary per archetype
  generator.py        5 tests -- LLM JSON -> persona assembly, code-fence
                                 stripping, fallback on any LLM error (runtime /
                                 network / parse), background length cap,
                                 archetype floor clamping even from LLM output
  integration.py      4 tests -- credibility reweights public-timeline
                                 retrieval (end-to-end), bot population
                                 changes retrievable content (Phase-4
                                 acceptance), high-conviction agent holds
                                 across 20 rounds (Phase-4 acceptance),
                                 deterministic population seed

Totals: 33 (p1) + 41 (p2) + 20 (p3) + 44 (p4) = 138 passing.
All LLM calls stubbed — no network required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…unner)

New package backend/eval/:
  determinism.py -- DeterministicClock + deterministic_run() context manager.
                    Replaces wall-clock now_ts(), seeds the global RNG, and
                    restores state on exit. DETERMINISTIC_VERSION constant
                    bumped on any math / mock-table change so CI catches drift.
  scoring.py     -- Pure functions: directional_accuracy, magnitude_error,
                    calibration. Composite = 0.5*dir + 0.3*(1-mag) + 0.2*cal.
                    Direction synonyms (support/oppose) accepted.
  verdict.py     -- verdict_from_public_timeline aggregates public-namespace
                    posts into a signed support_ratio, weighted by author
                    credibility when personas are supplied. verdict_from_report
                    parses the optional ReportAgent JSON surface.
  mocks.py       -- MockRouter: deterministic drop-in for ModelRouter. SHA-256
                    hashes prompt+salt for every decision; importance returns
                    an integer 1-10; reflection returns 3 canned beliefs;
                    contradiction is True ~25% via hash bucket; persona JSON
                    is synthesized from the entity name so downstream code
                    sees varied credibility / valence distributions.
  pipeline.py    -- run_case() orchestrates: build_population -> PersonaGenerator
                    -> MemoryManager (Phase-2/4 features per FeatureFlags) ->
                    rounds of posts with stance-anchored valence -> Verdict.
                    FeatureFlags is the knob the ablation tool sweeps over.
  storage.py     -- JSONL append / read for the eval-results dashboard.
                    EVAL_RESULTS_PATH env override.
  runner.py      -- CLI. `python -m backend.eval.runner --case <name>
                    --deterministic --mock-llm` produces a numeric score.
                    Two runs with those flags are BYTE-IDENTICAL (Phase-5
                    acceptance). Warns if --deterministic is passed without
                    --mock-llm.
  ablation.py    -- CLI. Sweeps baseline + 7 variants (no_importance,
                    no_reflection, no_contradiction, no_credibility,
                    no_conviction, no_phase2, no_phase4) and prints a
                    comparison table with Δ vs baseline.

backend/__init__.py added so `python -m backend.eval.runner` resolves and
so modules inside eval/ can reach backend/app/* via the sys.path guard
at the top of runner.py / ablation.py.
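The composite described under scoring.py is a weighted sum; a sketch using the weights from the commit message (the function name and the tuple-of-weights parameter are illustrative):

```python
def composite_score(directional, magnitude_error, calibration,
                    weights=(0.5, 0.3, 0.2)):
    """Composite = 0.5*dir + 0.3*(1 - mag_error) + 0.2*cal.
    Magnitude error is inverted so that all three terms reward
    accuracy in the same direction."""
    w_dir, w_mag, w_cal = weights
    return (w_dir * directional
            + w_mag * (1 - magnitude_error)
            + w_cal * calibration)
```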

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cases (backend/eval/datasets/<name>/{seed.md, question.md, truth.json}):
  sample_policy_carbon_tax        -- truth: negative, magnitude 0.45
  sample_product_vr_headset       -- truth: positive, magnitude 0.55
  sample_policy_remote_work       -- truth: positive, magnitude 0.35
  sample_product_ai_service       -- truth: neutral (polarized), magnitude 0.10
  sample_election_incumbent_mayor -- truth: positive, magnitude 0.30

Each truth.json cites a comparable real-world analog in its `notes` field.
README.md spells out the schema and flags these as starter fixtures to be
replaced with peer-reviewed cases before publishing benchmark numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
API:
  GET /api/eval/results?limit=50[&case=<name>]
    -> {"count": N, "results": [<record>, ...]} — newest first
  Reads from the JSONL store populated by `runner.py --persist`.

CI workflow (.github/workflows/eval-smoke.yml):
  On every PR against main/master:
    1. Install the minimal deterministic-path deps (no Zep / OASIS / Neo4j)
    2. pytest backend/tests/
    3. Run `backend.eval.runner` twice with --deterministic --mock-llm,
       --output-json; `diff -q` enforces byte-identical output
       (Phase-5 acceptance criterion)
    4. Smoke-run backend.eval.ablation to confirm the table format stays stable

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Coverage per module:
  scoring.py       7 tests -- directional synonym handling, magnitude clipping,
                              calibration rewards/penalties, composite uses
                              weights, custom weights, clamping
  verdict.py       6 tests -- empty timeline -> neutral/0 confidence,
                              consistent positive maps to positive, credibility
                              tips split votes, ReportAgent JSON parsing with
                              code fences, None on garbage
  determinism.py   5 tests -- clock monotonic + reproducible, wall-clock
                              fallback outside block, global RNG state
                              restored, seeded_random isolation, version
                              constant exposed
  mocks.py         6 tests -- chat determinism, importance integer range,
                              reflection 3 beliefs, contradiction mixed
                              distribution, embed determinism, persona schema
                              validity
  storage.py       7 tests -- append creates file, recorded_ts auto-added,
                              newest-first ordering, limit honored, case
                              filter, missing file -> empty, malformed line
                              skipped
  runner.py + ablation.py  6 tests (subprocess-based) -- numeric score emitted
                              (Phase-5 acceptance #1), byte-identical output
                              across two runs (Phase-5 acceptance #2),
                              deterministic warning without mock-llm, ablation
                              emits table with all variants, --output-json
                              parseable (Phase-5 acceptance #3)

Totals: 33 (p1) + 41 (p2) + 20 (p3) + 44 (p4) + 37 (p5) = 175 passing
in 14s. Subprocess-based runner tests exercise the real CLI end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds backend/app/observability/:
  logging.py  -- configure_logging() configures structlog -> JSON on stdout
                 when structlog is installed; else falls back to a stdlib
                 JSON formatter that still honors bind_context(). Every log
                 line carries run_id / agent_id / phase via contextvars
                 without code changes at the call sites.
  metrics.py  -- Prometheus registry exposing the Phase-6-spec metrics:
                   llm_calls_total{role,provider,model,status}
                   llm_tokens_total{role,kind}
                   llm_cache_hit_ratio (rolling gauge, recomputed on emit)
                   memory_op_duration_seconds{op,backend} histogram
                   simulation_active_runs, simulation_rounds_total{platform}
                   auth_rejections_total{reason}
                 Degrades to a minimal in-process counter store with a
                 human-readable banner if prometheus_client is missing, so
                 /metrics still responds 200.
  tracing.py  -- OTel setup with OTLP/HTTP exporter. start_span() is a
                 no-op context manager when OTel SDK isn't installed or
                 OTEL_EXPORTER_OTLP_ENDPOINT is unset, so callers can use
                 it unconditionally.

Wires observe_llm_call() into the LLM router. Every successful and every
failed backend call (including retries) records prompt/completion/cached
tokens + status into Prometheus. Metric emission is wrapped in a bare
try/except so it can never break the LLM call path.
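The never-break-the-call-path guarantee is just a broad exception guard around emission; a sketch of the shape (decorator name is illustrative):

```python
def observe_safely(metric_fn):
    """Wrap metric emission so a failing registry can never break the
    LLM call path, mirroring the bare try/except described above."""
    def wrapped(*args, **kwargs):
        try:
            metric_fn(*args, **kwargs)
        except Exception:
            pass  # metrics are best-effort; never propagate
    return wrapped
```

The trade-off is that emission failures are silent; in practice one would at least count them, but the priority here is that the LLM call always completes.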

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds backend/app/auth/:
  keys.py       -- SQLite-backed ApiKeyStore. Plaintext key format is
                   `mf_<8-hex-id>_<~40-char-urlsafe-secret>`. Only the SHA-256
                   hash is stored; plaintext is returned ONCE at issue time.
                   Constant-time compare on verify to avoid timing leaks.
  quotas.py     -- QuotaTracker with atomic check-and-debit. Raises
                   QuotaExceeded on over-cap with a structured .to_dict()
                   for the 429 response body. Preview-mode non-mutating
                   check powers the cost-estimator approval flow.
                   30-day rolling window (env overridable).
  middleware.py -- @require_api_key Flask decorator. Accepts
                   X-MiroFish-Key header (preferred) or ?api_key= query
                   (fallback). ALLOW_ANONYMOUS_API=true bypasses with a
                   metric increment so dashboards see reliance on anon.
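The keys.py scheme above (hash-only storage, one-time plaintext, constant-time verify) can be sketched with the stdlib; function names are illustrative, the key format is from the commit:

```python
import hashlib
import hmac
import secrets

def issue_key():
    """Key format: mf_<8-hex-id>_<~40-char-urlsafe-secret>. Only the
    SHA-256 hash is stored; the plaintext is returned once, here."""
    key_id = secrets.token_hex(4)  # 8 hex chars
    plaintext = f"mf_{key_id}_{secrets.token_urlsafe(30)}"
    stored_hash = hashlib.sha256(plaintext.encode()).hexdigest()
    return plaintext, key_id, stored_hash

def verify(presented, stored_hash):
    """Constant-time compare on the digest to avoid timing leaks."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(digest, stored_hash)
```

Hashing before `hmac.compare_digest` also means the comparison length is fixed, so nothing about the stored secret's length leaks either.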

Adds backend/app/cost/:
  estimator.py  -- estimate_simulation_cost(agents, rounds, role_models)
                   multiplies per-role default token budgets (tuned against
                   observed qwen-plus runs) by agents × rounds × calls.
                   Resolves (provider, model) -> price via Phase-1's
                   _PRICING table; unknown pairs annotate `note` without
                   crashing. ApprovalRequired raised by require_approval()
                   when estimate exceeds user_cap_usd.
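The linear-scaling estimate reduces to one multiplication chain; a sketch with illustrative parameter names and prices (the real estimator resolves budgets per role and prices from the Phase-1 pricing table):

```python
def estimate_cost(agents, rounds, calls_per_agent_round,
                  tokens_in, tokens_out,
                  price_in_per_1k, price_out_per_1k):
    """Per-call token budgets multiplied by agents x rounds x calls,
    then priced per 1K tokens."""
    n_calls = agents * rounds * calls_per_agent_round
    usd = n_calls * (tokens_in / 1000 * price_in_per_1k
                     + tokens_out / 1000 * price_out_per_1k)
    return {"calls": n_calls, "usd": round(usd, 4)}

def needs_approval(estimate, user_cap_usd):
    """True when the pre-flight estimate exceeds the user's cap."""
    return estimate["usd"] > user_cap_usd
```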

New endpoints:
    GET  /metrics                          -- Prometheus scrape target
    POST /api/auth/keys                    -- issue (admin-only)
    GET  /api/auth/keys                    -- list (admin-only)
    DELETE /api/auth/keys/<id>             -- revoke (admin-only)
    GET  /api/auth/quota                   -- current key's usage
    POST /api/simulation/estimate-cost     -- pre-flight estimate

create_app() now calls configure_logging() + configure_tracing() at
startup. Admin endpoints require `X-MiroFish-Admin-Token: $ADMIN_TOKEN`
and return 503 when ADMIN_TOKEN is unset (makes misconfiguration loud).

Config + env additions: ADMIN_TOKEN, ALLOW_ANONYMOUS_API, AUTH_DB_PATH,
QUOTA_DB_PATH, OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME,
COST_BUDGET_<ROLE>_{CALLS,IN,OUT,CACHED} overrides.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
deploy/helm/mirofish/:
  Chart.yaml   -- appVersion 0.6.0, kubeVersion >=1.24
  values.yaml  -- every key documented inline; groups: backend / frontend /
                  redis / vllm / memory / llm / auth / observability / ingress.
                  Defaults: backend x2 replicas, redis on, frontend + vllm
                  + ingress off. Neo4j expected external (Aura) per spec.
  templates/
    _helpers.tpl          -- fullname + labels
    backend-{dep,svc}.yaml
    redis.yaml             -- inline, toggleable via .Values.redis.enabled
    vllm.yaml              -- optional GPU deployment
    configmap.yaml         -- flattens values into the backend's envFrom
    secret.yaml            -- placeholder Secret for ADMIN_TOKEN / LLM_API_KEY /
                              ZEP_API_KEY / neo4j-password (populate externally)
    ingress.yaml           -- optional, ingressClassName-aware
  README.md   -- install + lint + values overview

requirements.txt: structlog, prometheus_client, opentelemetry-api + sdk +
otlp-proto-http. Neo4j 5.x driver stays optional (only installed when
MEMORY_BACKEND=neo4j_*).

.env.example: documents ALLOW_ANONYMOUS_API, ADMIN_TOKEN, AUTH_DB_PATH,
QUOTA_DB_PATH, OTEL_*, and COST_BUDGET_* overrides in a single Phase-6
block before the existing Phase-4 persona section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tal)

Coverage per module:
  observability/logging        5 tests -- bind_context roundtrip, nested
                                          LIFO unbind, stdlib JSON formatter
                                          w/ contextvars merge,
                                          configure_logging idempotency,
                                          structlog path selection
  observability/metrics        8 tests -- llm_call metric labels + cache
                                          ratio update, memory_op histogram
                                          buckets, active_run gauge, auth
                                          rejection counter, content-type,
                                          fallback when prometheus_client
                                          missing, singleton accessor
  observability/tracing        4 tests -- no-endpoint -> disabled, no-op span
                                          when disabled, configure is
                                          idempotent, attribute-setting span
                                          opens + shuts down cleanly
  auth/keys                    9 tests -- issue returns plaintext once,
                                          verify roundtrip, rejects garbage
                                          + tampered + revoked, list filters
                                          by owner, excludes revoked by
                                          default, to_dict strips secret,
                                          quotas stored on key
  auth/quotas                  8 tests -- unlimited key passthrough, token
                                          quota enforced, usd quota enforced,
                                          atomic debit, failed-debit doesn't
                                          apply (critical), preview non-
                                          mutating, reset, fresh-key zeros
  auth/middleware              6 tests -- missing header 401, valid key
                                          accepted, invalid key 401, revoked
                                          key 401, anonymous flag bypass,
                                          query-string fallback
  cost/estimator               8 tests -- linear scaling, unknown vendor ->
                                          zero cost + note, cached fraction
                                          discounts, approval flag when over
                                          cap, ApprovalRequired exception,
                                          zero cap disables, per-role
                                          breakdown present, env budget
                                          overrides merge
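The "atomic debit, failed-debit doesn't apply" behavior that the auth/quotas tests pin down can be sketched with a conditional SQLite UPDATE — the whole check-and-debit happens in one statement, so an over-quota request never partially applies. Table and column names here are illustrative, not the real schema:

```python
import sqlite3

def debit_tokens(conn: sqlite3.Connection, key_id: str, tokens: int) -> bool:
    """Atomically debit `tokens` from a key's remaining quota.

    The UPDATE only matches when enough quota remains, so a failed
    debit leaves the row untouched.
    """
    with conn:  # transaction: commit on success, rollback on error
        cur = conn.execute(
            "UPDATE quotas SET tokens_remaining = tokens_remaining - ? "
            "WHERE key_id = ? AND tokens_remaining >= ?",
            (tokens, key_id, tokens),
        )
        return cur.rowcount == 1  # True iff the debit actually applied

# Setup and usage:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quotas (key_id TEXT PRIMARY KEY, tokens_remaining INTEGER)")
conn.execute("INSERT INTO quotas VALUES ('k1', 100)")
assert debit_tokens(conn, "k1", 60) is True    # 40 left
assert debit_tokens(conn, "k1", 60) is False   # over quota: row unchanged
assert conn.execute("SELECT tokens_remaining FROM quotas").fetchone()[0] == 40
```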

Totals: 33 (p1) + 41 (p2) + 20 (p3) + 44 (p4) + 37 (p5) + 48 (p6) = 223
passing in ~24s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MIGRATION.md at repo root: TL;DR + per-phase rundown, running in each
mode (local / cloud / vLLM / Kubernetes), full new-keys table with per-
phase origin, and notes on what's breaking vs additive. Flags that the
public HTTP surface has ZERO breaking changes — every upstream endpoint
behaves identically.

README.md: inserts a prominent "MiroFish-Cloud (Phase 1-6)" banner
above Quick Start pointing at MIGRATION / architecture / BENCHMARKS.
Adds Option 2 (multi-provider cloud), Option 3 (local-only via Ollama),
and Option 4 (Helm chart) alongside the existing npm-dev quickstart.

docs/architecture.md: full module tree for every phase, abstract-backend
diagrams (LLM router, memory layer, transport), request data-flow traces
for a normal round + streaming interview, the Phase-6 Kubernetes
deployment topology, and a cross-phase "notable design decisions" table.

BENCHMARKS.md: specifies the four benchmarks that matter (throughput,
interview latency, eval scores, cost per 1k-agent sim), gives exact
reproduction commands, carries the captured deterministic-ablation
table as an in-repo number (verified by CI), and holds ⚠️-marked
placeholder tables for the live-LLM numbers operators fill in after
their first production runs. Test-suite runtime table (223 tests,
~24s) ships as a CI regression guard baseline.

All docs-only; pytest still green at 223/223.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

MiroFish Migration and others added 2 commits April 23, 2026 20:15
Config.validate() previously required LLM_API_KEY and ZEP_API_KEY
unconditionally, which blocked run.py from starting in Path A's local-only
mode (Ollama + in-memory backend, no cloud creds). Two relaxations:

  * When BACKEND_MODE=local, no cloud LLM key is required. The router
    uses Ollama defaults for every role.
  * When MEMORY_BACKEND is in_memory or neo4j_*, ZEP_API_KEY stops being
    required — Zep is only used under MEMORY_BACKEND=zep_cloud (or the
    `auto` default).
  * In cloud/custom mode, a per-role key such as LLM_ROLE_BALANCED_API_KEY
    now satisfies the check on its own — the legacy top-level LLM_API_KEY
    fallback is no longer mandatory when per-role keys are configured.
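The relaxed validation above amounts to two conditional branches. A minimal sketch, assuming the key names from this commit message (the function shape does not mirror the real Config.validate()):

```python
import os

def validate_config(env=os.environ):
    """Raise ValueError listing credentials still required at startup."""
    missing = []
    mode = env.get("BACKEND_MODE", "cloud")
    memory = env.get("MEMORY_BACKEND", "auto")

    # Local mode runs against Ollama defaults: no cloud LLM key needed.
    if mode != "local":
        if not (env.get("LLM_API_KEY") or env.get("LLM_ROLE_BALANCED_API_KEY")):
            missing.append("LLM_API_KEY (or LLM_ROLE_BALANCED_API_KEY)")

    # Zep is only the active memory layer under zep_cloud (or auto).
    if memory in ("zep_cloud", "auto") and not env.get("ZEP_API_KEY"):
        missing.append("ZEP_API_KEY")

    if missing:
        raise ValueError("missing required config: " + ", ".join(missing))

# Path A (local-only) now boots without any cloud credentials:
validate_config({"BACKEND_MODE": "local", "MEMORY_BACKEND": "in_memory"})
```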

backend/pyproject.toml grows entries for every phase-1-6 runtime dep that
already lives in requirements.txt (pyzmq, flask-sock, zstandard, neo4j,
structlog, prometheus_client, opentelemetry-*, requests). A single
`uv sync` now pulls the complete set — no follow-up `pip install` needed.

Verified live in local mode:
  GET  /health                         -> 200
  GET  /metrics                        -> 200 (Prometheus text + phase-6 metrics)
  POST /api/simulation/estimate-cost   -> 200 with full per-role breakdown
  Every phase 1-6 blueprint + WebSocket route registered at startup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
.env.example restructured for production deployment:
  * Adds a prominent "CLOUD DEPLOYMENT QUICKSTART" header block at the top
    listing the 3 secret lines to replace (LLM_API_KEY, ADMIN_TOKEN, and
    optionally NEO4J_*).
  * Default LLM config is now single-vendor OpenAI (sk-REPLACE-ME /
    gpt-4o-mini) — simplest cloud path, one bill, built-in price table,
    works for both chat and embeddings. Aliyun DashScope kept as a
    commented alternative for existing upstream users.
  * Uncommented LLM_ROLE_HEAVY_* and LLM_ROLE_EMBED_* so the default setup
    uses gpt-4o for ReportAgent synthesis and text-embedding-3-large for
    vector retrieval.
  * FLASK_HOST=0.0.0.0 so containers / load balancers reach the backend.
  * FLASK_DEBUG=false — auto-reload off in production.
  * ADMIN_TOKEN gets a literal placeholder (was a commented hint) so
    cloud deployments fail loudly at boot when it's not filled in.
  * Neo4j Aura block moved above local CE (cloud-first ordering) with
    explicit connection-string format and REPLACE-ME placeholders.
  * ZEP_API_KEY placeholder updated to `zk-REPLACE-ME` to match the
    expected format.
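The per-role defaults above (gpt-4o-mini for everyday chat, gpt-4o for heavy synthesis, text-embedding-3-large for retrieval) suggest a lookup like the following. This is a sketch of the resolution order only — the fast-role default and the LLM_ROLE_<ROLE>_MODEL variable name are assumptions, and the real router's logic may differ:

```python
import os

# Defaults matching the .env described above (fast default is assumed).
DEFAULT_MODELS = {
    "fast": "gpt-4o-mini",
    "balanced": "gpt-4o-mini",
    "heavy": "gpt-4o",
    "embed": "text-embedding-3-large",
}

def resolve_model(role: str, env=os.environ) -> str:
    """Per-role env override falls back to the role's built-in default."""
    return env.get(f"LLM_ROLE_{role.upper()}_MODEL", DEFAULT_MODELS[role])

resolve_model("heavy", {})                                  # → 'gpt-4o'
resolve_model("fast", {"LLM_ROLE_FAST_MODEL": "qwen2.5"})   # → 'qwen2.5'
```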

backend/uv.lock: locks every phase 1-6 runtime dep added to pyproject.toml
in the previous commit — opentelemetry-{api,sdk,exporter-otlp-proto-http},
prometheus-client, structlog, flask-sock, pyzmq, zstandard, neo4j, etc.
Pinned versions match what `uv sync` produced in Path A.

frontend/package-lock.json: regenerated by `npm install` during Path A
setup; no version drift, just lockfile metadata refresh.

No functional code changes; .env itself stays gitignored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>