Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
42 changes: 25 additions & 17 deletions .cursor/rules/architecture.mdc
Original file line number Diff line number Diff line change
Expand Up @@ -5,22 +5,30 @@ alwaysApply: true

# Architecture

- `domain.py` holds all data structures (not `models.py`, which is ambiguous in an LLM project)
- `openrouter.py` is the HTTP client layer (separate from orchestration)
- `simulation.py` holds all dry-run fakes (LLM responses and search results)
- `prompts.py` contains prompt builders (pure functions) and output parsers (review verdict, synthesis decision)
- `compression.py` handles both low-level text compression and prompt fitting within token budgets
- `progress.py` defines the `ProgressCallback` protocol and `NoOpProgress` fallback
- `reviewers.py` holds the reviewer-to-candidate assignment algorithm
- `exclamations.py` holds the Simpsons-quote prefix helper used to dress up error messages
- Cross-group model overlap is allowed: a model may generate and review in the same round
- Compression priority: candidates first, then reviews, then context; the task instruction is never compressed

## Dependency direction
## One rule

- `cli.py` depends on `core/` and `ui/`; nothing in `core/` or `ui/` imports from `cli.py`
- `ui/tui.py` depends on `core/progress.py` (protocol) and `core/domain.py` (types); it never imports from `core/orchestrator.py`
- Within `core/`, `orchestrator.py` is the only module that imports from `openrouter.py`, `compression.py`, and `reviewers.py`; `search.py` is imported by `orchestrator.py` only but may itself import `simulation.py` for its dry-run path
- `prompts.py` depends only on `domain.py`; it has no runtime dependencies on other core modules
- `exclamations.py` is a leaf module (stdlib-only); it may be imported from anywhere in `core/`
- `config.py` loads `crossfire.toml` with CLI override precedence: CLI flags > TOML values > defaults; per-mode overrides go in `[modes.<mode>.*]`, missing roles fall back to global `[models.*]` defaults

Within `core/`, modules import freely from each other as needed.

## Module responsibilities

- `domain.py` — all data structures (not `models.py`, which is ambiguous in an LLM project)
- `orchestrator.py` — the generate → review → synthesize loop, concurrency, failure handling
- `openrouter.py` — OpenRouter HTTP client, retry logic, model ID utilities
- `prompts.py` — prompt builders (pure functions) and output parsers (review verdict, synthesis decision)
- `compression.py` — extractive text compression and prompt fitting within token budgets
- `config.py` — TOML loading with CLI override precedence: CLI flags > TOML values > defaults; per-mode overrides in `[modes.<mode>.*]`
- `pricing.py` — OpenRouter pricing cache (`pricing.json`) and cost estimation for dry runs
- `simulation.py` — deterministic fakes for dry-run mode (LLM responses and search results)
- `search.py` — Tavily web search integration
- `reviewers.py` — reviewer-to-candidate assignment algorithm
- `tokens.py` — tiktoken-based token estimation
- `progress.py` — `ProgressCallback` protocol and `NoOpProgress` fallback
- `exclamations.py` — Simpsons-quote prefix helper for error messages
- `archive.py` — disk archival of run artifacts

## Design decisions

- Cross-group model overlap is allowed: a model may generate and review in the same round
- Compression priority: candidates first, then reviews, then context; the task instruction is never compressed
17 changes: 17 additions & 0 deletions .cursor/rules/planning.mdc
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
---
description: Plan validation — end-to-end trace, contradiction check, edge cases
alwaysApply: true
---

# Planning

Before finalizing any plan that touches 3+ files:

- Trace every data flow end-to-end through the actual code (not the abstraction): verify function signatures, call sites, and who calls whom
- Check that the plan is internally consistent: no section may contradict another
- Identify every existing function, configuration object, or data structure the new code depends on and verify it provides what the plan assumes (e.g. which configuration variant: base vs mode-resolved?)
- For every external input (files, API responses, environment variables), list what happens when it is missing, empty, malformed, or the wrong type
- Flag every assumption about the runtime context: sync vs async, working directory, which process or event loop, which thread
- Separate input estimates from output estimates: never use the same ceiling for both unless explicitly justified
- After drafting the plan, re-read it as a hostile reviewer looking for gaps, contradictions, and unstated assumptions; fix before presenting
- After implementation, verify the code matches the plan: trace actual variable values through each formula and code path, checking that upstream mutations (e.g. enrichment rewriting an instruction) are reflected in every downstream use of that variable
20 changes: 18 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@
### Install

```bash
uv python install 3.12 # transitive incompatibility in docformatter->untokenize

Check warning on line 39 in README.md

View workflow job for this annotation

GitHub Actions / spellcheck

Unknown word (untokenize)
uv venv --python 3.12
uv sync
```
Expand Down Expand Up @@ -96,6 +96,16 @@
--context-file paper.pdf
```

### Cost estimation
Cost estimation in ``--dry-run`` requires current model prices from OpenRouter.
These can be grabbed and stored in `pricing.json` with the following command:

```bash
uv run crossfire prices
```

Since it fetches pricing on _all_ OpenRouter models, we can add moves to `crossfire.toml` without re-fetching.

### Clean up
Remove all generated and cached files (runs, `.venv`, caches, bytecode):

Expand Down Expand Up @@ -152,7 +162,7 @@
If all reviews report no weaknesses, the remaining rounds are skipped, as there is no value in further refinement.
Disable with `--no-early-stop`.

### Generator refusal
### Generator refusal detection
Some models occasionally refuse to produce output, responding with meta-commentary like "the sources are insufficient" instead of answering the prompt.
Crossfire detects these refusals and attempts a replacement model from the generator pool.
If no replacement is available, the generator is dropped and the round fails gracefully.
Expand Down Expand Up @@ -214,7 +224,7 @@
## Development

```bash
uv python install 3.12 # transitive incompatibility in docformatter->untokenize

Check warning on line 227 in README.md

View workflow job for this annotation

GitHub Actions / spellcheck

Unknown word (untokenize)
uv venv --python 3.12
uv sync
uv run pre-commit install # set up git hooks (recommended)
Expand All @@ -228,6 +238,7 @@

Use `--dry-run` to verify your changes without making API calls.
It produces deterministic synthetic outputs via SHA-256 hashing.
If `pricing.json` is present (from `crossfire prices`), the summary table includes an upper-bound cost estimate.

```bash
uv run crossfire run \
Expand Down Expand Up @@ -267,6 +278,7 @@
│ │ ├── progress.py # progress reporting
│ │ ├── reviewers.py # reviewer-to-candidate assignment
│ │ ├── search.py # search integration with Tavily
│ │ ├── pricing.py # OpenRouter pricing cache and cost estimation
│ │ ├── exclamations.py # The Simpsons prefixes for error messages
│ │ └── archive.py # disk archival
│ ├── ui/
Expand Down Expand Up @@ -296,6 +308,11 @@
When prompts exceed the token budget, Crossfire drops sections and sentences rather than summarizing.
The task instruction is _never_ compressed.

**Cost estimates are approximate.**
The dry-run estimate uses fixed output-token defaults (~5,000 tokens per generator/synthesizer call, ~2,000 per reviewer) and average pricing across each model group.
It does not predict early stopping or actual output lengths, so it typically overestimates by 2-4x for runs that stop early.
If the instruction contains an explicit word or page count, the estimate uses that instead, but the regex may also match counts that describe the input rather than the desired output.

**No streaming.**
Responses are received in full.

Expand All @@ -305,6 +322,5 @@
### Ideas for improvements

- Resume: restart interrupted runs from the last completed round
- Cost prediction: estimate total run costs upfront from prompt sizes and model pricing, reported as part of `--dry-run` output
- Local UI: browser-based interface with live progress, output panel, and searchable run history
- Library and containerization: extract a reusable library and Docker image so Crossfire can run as a service (switch to UTC-based logs)
Loading
Loading