Token counting update, document visuals as vision input, attachment modernization#489
Merged
MaxEriksson2000 merged 20 commits intoJun 12, 2026
Merged
Conversation
… overhead
Token counting previously measured concatenated text only, with three
systematic errors:
- count_tokens received the bare model name while litellm is invoked
with "{provider_type}/{name}", silently resolving the default
tokenizer for Azure/vLLM/Mistral deployments
- images (current question and history) counted as 0 tokens
- tool definitions (generate_image + MCP tools) and per-message chat
scaffolding counted as 0 tokens
Counting now mirrors the OpenAI-format payload the adapter builds:
litellm.token_counter(messages=...) includes image_url content and
message wrappers, tools are counted via the tools= parameter, and the
MCP proxy is created before context building so its tool definitions
shrink the history/knowledge budgets.
Also fixes add_knowledge subtracting the prompt tokens a second time
from a budget that already excluded them.
…ploads PDF attachments previously reached the model as extracted text only — embedded photos and scanned pages were silently dropped. Uploading a PDF now also renders its embedded image regions (pdfplumber + pypdfium, already transitive dependencies) into derived image files linked via files.parent_file_id, capped per document and filtered by source size to skip logos. At ask time the attachment list is expanded with derived images only when the model supports vision; without vision the PDF degrades to a plain text attachment instead of erroring. History replay now also strips images for non-vision models, which previously produced provider errors after a model switch. All vision images (direct uploads included) are downscaled to max 2048px and recompressed before storage, since blobs are base64-encoded into every request and replayed for the rest of the session.
- PDF: per-page [PAGE N] markers and tables rendered as markdown (table text excluded from the running text to avoid duplication); image-only PDFs return empty text instead of bare markers - PPTX: speaker notes are extracted (previously discarded) - Attachment prompt block: compact 'FILE: name (mimetype)' format replaces JSON-per-line (less token overhead, gives the model the file type), and each file is capped at a configurable token budget with a visible truncation notice instead of failing the whole request via QueryException
iPhone photos default to HEIC, which was rejected. pillow-heif decodes them and the existing downscale step converts to JPEG before storage — providers never see image/heic. Formats outside PNG/JPEG/WEBP are now always converted even when conversion does not shrink the blob. The frontend accept list picks the new mimetypes up automatically via the limits endpoint (ImageMimeTypes drives FormatLimit).
🧹 Dead-code & unused-dependency reportAdvisory — never gates the PR. Whole-repo scan, so some findings may be false positives (dynamic dispatch, framework hooks, runtime-resolved imports). Triage before removing. ✅ No dead code or unused dependencies detected. |
📊 Patch coverageShare of this PR's new/changed lines exercised by tests. Report-only — never gates the PR.
Uncovered lines — Frontend (9 files)
Uncovered lines — Backend (13 files)
|
Replace per-image bbox cropping with whole-page rendering for pages that carry visual content (raster images, many curves, diagonal lines, or filled rects outside detected tables), so vector-drawn charts and diagrams reach vision models too. Render resolution is capped on both axes, bounding memory even for degenerate page sizes. DOCX/PPTX uploads now yield their embedded raster media as derived images (EMF/WMF and oversized zip entries are skipped). Derived files are named after their source page when known. Image uploads get their checksum recomputed after downscaling so it describes the stored blob. Settings renamed to attachment_image_extraction and attachment_max_extracted_images (previously pdf_attachment_*), since extraction is no longer PDF-only.
…etail Images are now sent and counted with an explicit detail=high block — litellm prices detail-less image payloads at a flat 85 tokens while providers resolve auto to high, undercounting a 2048px image by roughly a thousand tokens (more on Anthropic models). Derived images (rendered PDF pages etc.) no longer leak to user-facing surfaces: they are expanded into the completion payload at ask time (current files, assistant attachments and session history) instead of being persisted on the question, and the file library listing hides them. Apps and services expand their input files the same way, and attachment images (prompt_files) now ride on the current user message for vision models instead of being silently dropped. The token preflight prices files like the real request: same litellm model identifier as the adapter, with image files and document-derived images included for vision models.
…-vision-and-token-accuracy
After merging develop, b4f2a9c1e7d3 and the develop merge migration 202606111200 were both children of branchpoint 3eb6a34b6733, producing two heads and breaking 'alembic upgrade head'. Re-point this migration's down_revision to 202606111200 so the chain is linear with a single head. Safe: it only touches the files table, independent of the governance/ reasoning changes in the develop branch.
The chat initialised the AttachmentManager with rules=undefined (the real getAttachmentRulesStore call was commented out), so client-side validation was disabled: oversized/unsupported files were sent to the server and only rejected there. Combined with hardcoded English error strings, users saw no clear feedback about what went wrong or what is allowed. - Wire getAttachmentRulesStore in ConversationView so uploads are validated client-side against the backend per-format size limits and the partner model's vision support — oversized/unsupported files are now rejected instantly with a clear message instead of failing silently server-side. - Translate all validation and upload-failure messages to paraglide (sv+en), stating the actual size and the allowed maximum. - Surface paste-upload validation errors (were silently dropped). - getUploadErrorMessage now returns a file-name-free reason so callers don't duplicate the file name; update JobManager accordingly.
…toast Upload/validation errors were shown as a transient toast that is easy to miss. Surface them inline next to the chat input instead, matching the existing context-window error banner, and keep them visible until the user retries or dismisses them. - AttachmentManager gains an uploadError store (single source for validation and server-side failures) plus a clearUploadError action and an inlineErrors option that suppresses the toast when a consumer renders the store inline. - ConversationView opts the chat in (inlineErrors) and passes it to the drop area; ConversationInput renders a persistent, dismissible error banner in the same style as the context-window banner. - Chat-only entry points (icon button, paste) no longer toast; the shared AttachmentDropArea keeps toasting outside the chat via the inlineErrors prop, so app upload flows are unchanged.
The preflight token estimate keyed derived-image lookup on text_files, which required extractable text (f.text). An image-only PDF (e.g. scanned pages or photos) has no text, so it fell out of that set: its derived vision images were never fetched or priced, and it was reported as an excluded file. The preflight showed ~15 tokens for a 9 MB image-only PDF. Key derived-image lookup on every document upload (FileType.TEXT) instead, mirroring file_service.with_derived_images on the real ask path, and treat a document as included when it contributes either inlined text or derived images. Add a regression test for the image-only PDF case.
litellm.token_counter applies OpenAI's tile formula to every provider, so Claude image attachments were counted ~30% too low in the token preflight (an 8-page PDF reported 8,840 tokens vs ~13,100 actually billed). Count Anthropic image blocks from their pixel dimensions instead — (width × height) / 750 after capping the long edge to 1568px, per Anthropic's documented image cost — and let litellm handle only text and message scaffolding. Applied at the count_message_tokens chokepoint, so both the preflight estimate and the context-build counting are corrected. OpenAI/Azure counting is unchanged (litellm is already exact there).
…model _with_vision_derivatives checked the assistant's configured model, but governance policy can override the answering model. With a non-vision configured model and a policy-enforced vision model, document-derived images were never attached even though preflight counted them — the vision model silently lost the document pages and the token projection diverged from the real payload. Gate on the effective model instead.
Image token cost depends only on pixel dimensions, so count both providers with their documented formulas (OpenAI 85 + 170/tile at high detail, Anthropic (w*h)/750) read straight from the stored blob. This removes the per-request base64 round-trips that counting used to do for every current and history image — previously each request encoded all blobs to data URLs, and the Anthropic path decoded them again and PIL-opened each one, all synchronously on the event loop. Detect Claude behind any provider route (openai-compatible, bedrock style prefixes), not only "anthropic/" — otherwise its images were priced ~30% too low. The flat image fallback is now derived from the OpenAI formula instead of a magic 1105. Extract the canonical message/content builders into message_payload.py, used by both TenantModelAdapter and token counting, so the counted shape cannot drift from the sent shape. The replayed tool_calls structure is now counted as the JSON the provider actually receives instead of a name+arguments approximation. Preflight now counts textless documents too — the real request still sends their FILE header block and preamble, so an image-only PDF no longer undercounts. Attachment truncation budgets the appended notice and re-checks token-dense text, keeping each file within attachment_max_tokens_per_file.
Every completion response carries the authoritative prompt_tokens. No external source maintains the image-pricing formulas for us (litellm prices all images with OpenAI's tile formula and has no Anthropic formula at all), so instead of hoping ours stay current, compare each estimate against the provider-reported count and log a warning when they diverge more than 20%. Any staleness — provider pricing changes, tokenizer swaps, payload-shape drift — surfaces in logs within days instead of silently skewing context budgets. Hooked where prediction and actuals already meet: the non-streaming response assembly in completion_service (covers apps, services, group chats) and the streaming usage reconciliation in assistant_service.
The (w*h)/750 approximation with only a long-edge cap overcounted real usage by ~33% on image-heavy PDFs: preflight said 18,581 tokens where the provider billed 13,967. Anthropic's documented cost is one token per 28x28 pixel patch after resizing to fit BOTH a long-edge limit and a per-image token budget (1568px/1568 tokens on most models, 2576px/ 4784 on the high-resolution Opus 4.7+ family) — the missing token budget is what capped the real cost. Mirror the reference implementation from Anthropic's vision docs and pin the documented pricing-table values as tests. A rendered A4 page at 2048px now counts 1551 tokens instead of 2319; eight pages land within text-overhead distance of the observed actuals.
Anthropic's vision pricing is the only provider formula we track by hand, so make it a cleanly removable unit: anthropic_image_pricing.py plus its test file, reached through a single commented dispatch branch in count_image_tokens. Dropping the special handling is now two file deletions and one if-statement.
OpenAI's image cost is not one formula: gpt-4o/4.1/4.5 use the classic 85 + 170/tile method, but the mini/nano/o4-mini families count 32px patches (capped at 1536) times a per-model multiplier, and several tile models have their own base/tile constants (gpt-4o-mini 2833/5667, o1/o3 75/150, gpt-5 70/140). Pricing everything with the classic formula undercounted an image-heavy PDF on gpt-4.1-mini by ~2.2x: preflight said 8,885 tokens where the provider billed ~22,900. Mirror the documented tables in openai_image_pricing.py — isolated the same way as anthropic_image_pricing.py, reached through the single dispatch point in count_image_tokens, removable the same way. Unknown models fall back to the classic formula and the drift alarm flags them.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR improves attachments and token accounting for assistants that need to read images embedded in PDF protocols and other documents.
1. Token counting mirrors the actual payload
2. Document visuals reach vision models
3. Attachment extraction and upload behavior
Review notes
max_output_tokensis deliberately not reserved from the configured input budget.ATTACHMENT_IMAGE_EXTRACTION,ATTACHMENT_MAX_EXTRACTED_IMAGES, andATTACHMENT_MAX_TOKENS_PER_FILE.Validation