Token counting update, document visuals as vision input, attachment modernization by MaxEriksson2000 · Pull Request #489 · eneo-ai/eneo

MaxEriksson2000 · 2026-06-11T18:26:48Z

Summary

This PR improves attachments and token accounting for assistants that need to read images embedded in PDF protocols and other documents.

1. Token counting mirrors the actual payload

Count message scaffolding, images, tool definitions, replayed tool calls, and model-specific tokenization from the same payload shape sent to providers.
Price OpenAI and Anthropic images from stored pixel dimensions using their documented model-family formulas, including GPT-5.4/5.5's 2500-patch high-detail budget.
Include document-derived images and image-only PDFs in token preflight for vision models.
Compare estimates with provider-reported prompt usage and warn when they drift materially.
Treat preflight as an advisory estimate: local overflow no longer blocks sending, the provider validates the final request, and a provider rejection leaves the message and attachments in the composer.

2. Document visuals reach vision models

Render visually meaningful PDF pages, including raster images, scanned pages, vector charts, and diagrams, into bounded derived images.
Extract embedded raster media from DOCX and PPTX while skipping unsupported vector formats and oversized archive entries.
Expand derived images only when building completion payloads for the governance-effective vision model; they are not persisted as user question attachments or exposed in the file library.
Strip replayed images when switching to a non-vision model.
Downscale vision uploads to a bounded size before storage so the stored, sent, and counted payloads agree.
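The "strip replayed images" step in the list above can be sketched roughly as follows. This assumes OpenAI-format chat messages whose `content` is either a plain string or a list of typed parts; the function name `strip_images` is hypothetical and not taken from the PR.

```python
from typing import Any

Message = dict[str, Any]


def strip_images(messages: list[Message]) -> list[Message]:
    """Return a copy of the history with image_url parts removed.

    Hypothetical helper: illustrates the idea only, not the PR's code.
    """
    cleaned: list[Message] = []
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            # Keep every non-image part (text etc.) in its original order.
            parts = [p for p in content if p.get("type") != "image_url"]
            message = {**message, "content": parts}
        cleaned.append(message)
    return cleaned
```

The input list is left untouched, so the same stored history can still be replayed with images if the user switches back to a vision model.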

3. Attachment extraction and upload behavior

Add PDF page markers and markdown table extraction without duplicating table text.
Extract PPTX speaker notes.
Replace JSON-per-line attachment prompts with a compact file header and token-bounded truncation notice.
Accept HEIC/HEIF uploads and convert them to a provider-compatible stored format.
Validate upload limits in the chat, translate failures, and keep upload errors visible beside the composer.

Review notes

max_output_tokens is deliberately not reserved from the configured input budget.
New settings: ATTACHMENT_IMAGE_EXTRACTION, ATTACHMENT_MAX_EXTRACTED_IMAGES, and ATTACHMENT_MAX_TOKENS_PER_FILE.
Files uploaded before this change cannot be backfilled because original document bytes are not retained for text files.
Extraction is keyed on the upload content type; there is no file-signature sniffing.
Preflight still excludes MCP tool definitions.

Validation

The full branch backend suite previously passed with 3,181 tests.
Latest targeted backend verification passed 64 tests covering context building, streaming overflow forwarding, preflight, and image pricing.
Frontend unit tests, Svelte check, Ruff, Pyright, formatting, lint, i18n, schema drift, and pre-push checks passed.
Alembic has a single migration head.

… overhead

Token counting previously measured concatenated text only, with three systematic errors:

- count_tokens received the bare model name while litellm is invoked with "{provider_type}/{name}", silently resolving the default tokenizer for Azure/vLLM/Mistral deployments
- images (current question and history) counted as 0 tokens
- tool definitions (generate_image + MCP tools) and per-message chat scaffolding counted as 0 tokens

Counting now mirrors the OpenAI-format payload the adapter builds: litellm.token_counter(messages=...) includes image_url content and message wrappers, tools are counted via the tools= parameter, and the MCP proxy is created before context building so its tool definitions shrink the history/knowledge budgets. Also fixes add_knowledge subtracting the prompt tokens a second time from a budget that already excluded them.

…ploads PDF attachments previously reached the model as extracted text only — embedded photos and scanned pages were silently dropped. Uploading a PDF now also renders its embedded image regions (pdfplumber + pypdfium, already transitive dependencies) into derived image files linked via files.parent_file_id, capped per document and filtered by source size to skip logos. At ask time the attachment list is expanded with derived images only when the model supports vision; without vision the PDF degrades to a plain text attachment instead of erroring. History replay now also strips images for non-vision models, which previously produced provider errors after a model switch. All vision images (direct uploads included) are downscaled to max 2048px and recompressed before storage, since blobs are base64-encoded into every request and replayed for the rest of the session.
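The "downscaled to max 2048px" step comes down to an aspect-preserving fit of the long edge. A minimal sketch of the dimension arithmetic, assuming nearest-integer rounding (the PR does not state its rounding rule, and `fit_within` is a hypothetical name):

```python
def fit_within(width: int, height: int, max_edge: int = 2048) -> tuple[int, int]:
    """Scale (width, height) so the longer edge is at most max_edge.

    Images that already fit are returned unchanged (no upscaling).
    """
    longest = max(width, height)
    if longest <= max_edge:
        return width, height
    scale = max_edge / longest
    # Clamp to 1px so extreme aspect ratios never produce a zero dimension.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4032, 3024))  # a 12 MP phone photo
```

Because the stored blob already has these dimensions, the sent and counted payloads can be derived from the same pixel size.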

- PDF: per-page [PAGE N] markers and tables rendered as markdown (table text excluded from the running text to avoid duplication); image-only PDFs return empty text instead of bare markers
- PPTX: speaker notes are extracted (previously discarded)
- Attachment prompt block: compact 'FILE: name (mimetype)' format replaces JSON-per-line (less token overhead, gives the model the file type), and each file is capped at a configurable token budget with a visible truncation notice instead of failing the whole request via QueryException
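The compact attachment block and per-file cap described above might look roughly like this. It is a sketch under stated assumptions: the whitespace word counter stands in for the real model tokenizer, the notice wording is invented, and `format_attachment` is a hypothetical name.

```python
NOTICE = "[truncated: file exceeds the per-file token budget]"  # invented wording


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): one token per whitespace-delimited word.
    return len(text.split())


def format_attachment(name: str, mimetype: str, text: str, max_tokens: int) -> str:
    """Render one attachment as a 'FILE: name (mimetype)' block, capped in size."""
    header = f"FILE: {name} ({mimetype})"
    if count_tokens(text) <= max_tokens:
        return f"{header}\n{text}"
    # Reserve room for the notice so body + notice stay within the budget.
    budget = max(max_tokens - count_tokens(NOTICE), 0)
    # Joining on spaces collapses the original whitespace; fine for a sketch.
    kept = " ".join(text.split()[:budget])
    return f"{header}\n{kept}\n{NOTICE}"
```

With a real tokenizer the truncated text would need to be re-counted after cutting, since token density varies; the word-based stand-in sidesteps that.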

iPhone photos default to HEIC, which was rejected. pillow-heif decodes them and the existing downscale step converts to JPEG before storage — providers never see image/heic. Formats outside PNG/JPEG/WEBP are now always converted even when conversion does not shrink the blob. The frontend accept list picks the new mimetypes up automatically via the limits endpoint (ImageMimeTypes drives FormatLimit).

github-actions · 2026-06-11T18:27:40Z

🧹 Dead-code & unused-dependency report

Advisory — never gates the PR. Whole-repo scan, so some findings may be false positives (dynamic dispatch, framework hooks, runtime-resolved imports). Triage before removing.

✅ No dead code or unused dependencies detected.

github-actions · 2026-06-11T18:48:24Z

📊 Patch coverage

Share of this PR's new/changed lines exercised by tests. Report-only — never gates the PR.

Area	Changed	Uncovered	Coverage
Frontend	126	99	21%
Backend	600	63	89%

Uncovered lines — Frontend (9 files)

…/components/conversation/conversationInputState.ts — 40–43, 49–53
…/attachments/components/AttachmentUploadIconButton.svelte — 25
…/components/conversation/ConversationView.svelte — 4–5, 32–36, 38, 40–42, 144
…/components/conversation/ConversationInput.svelte — 19, 21, 34, 36–37, 114, 150–153, 156, 160–163, 176, 292–297, 341–342, 346–348, 350, 352, 354, 356
…/features/attachments/getUploadErrorMessage.ts — 20–23
…/features/attachments/AttachmentManager.ts — 8, 69, 87, 184–187, 200–201, 203–204, 213, 221, 235–238, 249–253, 271–274, 282–286, 301, 304, 314, 316
…/features/jobs/JobManager.ts — 246
…/attachments/components/AttachmentDropArea.svelte — 15, 42
…/components/conversation/ContextUsageBar.svelte — 185, 194, 196, 292

Uncovered lines — Backend (13 files)

…/intric/files/image_processing.py — 101, 172, 224–225, 270, 274, 276, 280
…/intric/services/service_runner.py — 67–69
…/completion_models/infrastructure/completion_service.py — 302, 307, 311–313, 411
…/intric/files/file_repo.py — 47–48, 50, 56–57
…/completion_models/infrastructure/context_builder.py — 143–144
…/intric/tokens/token_utils.py — 73, 124, 143, 154
…/infrastructure/adapters/base_adapter.py — 28
…/infrastructure/adapters/tenant_model_adapter.py — 478, 1538
…/intric/files/text.py — 150, 175–177, 202–203, 208
…/intric/assistants/assistant_service.py — 1212–1213, 1422
…/conversations/application/conversation_service.py — 40
…/intric/files/file_protocol.py — 86, 195–196, 201, 207–208, 214, 226–227, 229–230, 232, 239–241, 246, 256
…/apps/apps/app_service.py — 396–399

Replace per-image bbox cropping with whole-page rendering for pages that carry visual content (raster images, many curves, diagonal lines, or filled rects outside detected tables), so vector-drawn charts and diagrams reach vision models too. Render resolution is capped on both axes, bounding memory even for degenerate page sizes. DOCX/PPTX uploads now yield their embedded raster media as derived images (EMF/WMF and oversized zip entries are skipped). Derived files are named after their source page when known. Image uploads get their checksum recomputed after downscaling so it describes the stored blob. Settings renamed to attachment_image_extraction and attachment_max_extracted_images (previously pdf_attachment_*), since extraction is no longer PDF-only.

…etail Images are now sent and counted with an explicit detail=high block — litellm prices detail-less image payloads at a flat 85 tokens while providers resolve auto to high, undercounting a 2048px image by roughly a thousand tokens (more on Anthropic models). Derived images (rendered PDF pages etc.) no longer leak to user-facing surfaces: they are expanded into the completion payload at ask time (current files, assistant attachments and session history) instead of being persisted on the question, and the file library listing hides them. Apps and services expand their input files the same way, and attachment images (prompt_files) now ride on the current user message for vision models instead of being silently dropped. The token preflight prices files like the real request: same litellm model identifier as the adapter, with image files and document-derived images included for vision models.
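For reference, an OpenAI-format image content part with an explicit detail level has the shape below. The helper name `image_part` and the default mimetype are assumptions for illustration; only the `image_url` / `detail` structure follows the OpenAI chat format.

```python
import base64


def image_part(blob: bytes, mimetype: str = "image/jpeg") -> dict:
    """Build an OpenAI-format image content part with detail pinned to 'high'."""
    encoded = base64.b64encode(blob).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {
            "url": f"data:{mimetype};base64,{encoded}",
            # Explicit detail so the counted and the sent payload agree.
            "detail": "high",
        },
    }
```

Using one builder for both sending and counting is what keeps the estimate from silently falling back to a detail-less price.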

…-vision-and-token-accuracy

After merging develop, b4f2a9c1e7d3 and the develop merge migration 202606111200 were both children of branchpoint 3eb6a34b6733, producing two heads and breaking 'alembic upgrade head'. Re-point this migration's down_revision to 202606111200 so the chain is linear with a single head. Safe: it only touches the files table, independent of the governance/reasoning changes in the develop branch.
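In Alembic terms the fix is a one-line change to the migration's module-level identifiers. The fragment below uses the revision ids quoted in the commit message; the rest of the migration file is not shown in this PR page and is omitted here.

```python
# Header of the files-table migration after the fix.
revision = "b4f2a9c1e7d3"
# Before the fix this pointed at the shared branchpoint "3eb6a34b6733",
# which left two heads once develop's 202606111200 was merged in.
down_revision = "202606111200"
```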

The chat initialised the AttachmentManager with rules=undefined (the real getAttachmentRulesStore call was commented out), so client-side validation was disabled: oversized/unsupported files were sent to the server and only rejected there. Combined with hardcoded English error strings, users saw no clear feedback about what went wrong or what is allowed.

- Wire getAttachmentRulesStore in ConversationView so uploads are validated client-side against the backend per-format size limits and the partner model's vision support; oversized/unsupported files are now rejected instantly with a clear message instead of failing silently server-side.
- Translate all validation and upload-failure messages to paraglide (sv+en), stating the actual size and the allowed maximum.
- Surface paste-upload validation errors (were silently dropped).
- getUploadErrorMessage now returns a file-name-free reason so callers don't duplicate the file name; update JobManager accordingly.

…toast

Upload/validation errors were shown as a transient toast that is easy to miss. Surface them inline next to the chat input instead, matching the existing context-window error banner, and keep them visible until the user retries or dismisses them.

- AttachmentManager gains an uploadError store (single source for validation and server-side failures) plus a clearUploadError action and an inlineErrors option that suppresses the toast when a consumer renders the store inline.
- ConversationView opts the chat in (inlineErrors) and passes it to the drop area; ConversationInput renders a persistent, dismissible error banner in the same style as the context-window banner.
- Chat-only entry points (icon button, paste) no longer toast; the shared AttachmentDropArea keeps toasting outside the chat via the inlineErrors prop, so app upload flows are unchanged.

The preflight token estimate keyed derived-image lookup on text_files, which required extractable text (f.text). An image-only PDF (e.g. scanned pages or photos) has no text, so it fell out of that set: its derived vision images were never fetched or priced, and it was reported as an excluded file. The preflight showed ~15 tokens for a 9 MB image-only PDF. Key derived-image lookup on every document upload (FileType.TEXT) instead, mirroring file_service.with_derived_images on the real ask path, and treat a document as included when it contributes either inlined text or derived images. Add a regression test for the image-only PDF case.

litellm.token_counter applies OpenAI's tile formula to every provider, so Claude image attachments were counted ~30% too low in the token preflight (an 8-page PDF reported 8,840 tokens vs ~13,100 actually billed). Count Anthropic image blocks from their pixel dimensions instead — (width × height) / 750 after capping the long edge to 1568px, per Anthropic's documented image cost — and let litellm handle only text and message scaffolding. Applied at the count_message_tokens chokepoint, so both the preflight estimate and the context-build counting are corrected. OpenAI/Azure counting is unchanged (litellm is already exact there).

…model _with_vision_derivatives checked the assistant's configured model, but governance policy can override the answering model. With a non-vision configured model and a policy-enforced vision model, document-derived images were never attached even though preflight counted them — the vision model silently lost the document pages and the token projection diverged from the real payload. Gate on the effective model instead.

Image token cost depends only on pixel dimensions, so count both providers with their documented formulas (OpenAI 85 + 170/tile at high detail, Anthropic (w*h)/750) read straight from the stored blob. This removes the per-request base64 round-trips that counting used to do for every current and history image: previously each request encoded all blobs to data URLs, and the Anthropic path decoded them again and PIL-opened each one, all synchronously on the event loop.

Detect Claude behind any provider route (openai-compatible, bedrock style prefixes), not only "anthropic/"; otherwise its images were priced ~30% too low. The flat image fallback is now derived from the OpenAI formula instead of a magic 1105.

Extract the canonical message/content builders into message_payload.py, used by both TenantModelAdapter and token counting, so the counted shape cannot drift from the sent shape. The replayed tool_calls structure is now counted as the JSON the provider actually receives instead of a name+arguments approximation.

Preflight now counts textless documents too: the real request still sends their FILE header block and preamble, so an image-only PDF no longer undercounts. Attachment truncation budgets the appended notice and re-checks token-dense text, keeping each file within attachment_max_tokens_per_file.
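The classic OpenAI high-detail formula mentioned above (85 + 170 per tile) can be computed from pixel dimensions alone. A minimal sketch, assuming the documented resize steps (fit within 2048x2048, then bring the short side down to 768, then count 512px tiles); the function name is hypothetical and this is not the PR's code.

```python
from math import ceil


def openai_tile_tokens(width: int, height: int, base: int = 85, per_tile: int = 170) -> int:
    """Classic high-detail image cost: base + per_tile * number of 512px tiles."""
    # Step 1: fit the image inside a 2048 x 2048 square (downscale only).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shorter side is at most 768px (downscale only).
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count the 512px tiles needed to cover the result.
    tiles = ceil(w / 512) * ceil(h / 512)
    return base + per_tile * tiles


print(openai_tile_tokens(1448, 2048))  # portrait page with a 2048px long edge
```

A portrait page stored at a 2048px long edge covers six tiles under this formula, which is where the former constant 1105 (85 + 6 * 170) comes from.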

Every completion response carries the authoritative prompt_tokens. No external source maintains the image-pricing formulas for us (litellm prices all images with OpenAI's tile formula and has no Anthropic formula at all), so instead of hoping ours stay current, compare each estimate against the provider-reported count and log a warning when they diverge more than 20%. Any staleness — provider pricing changes, tokenizer swaps, payload-shape drift — surfaces in logs within days instead of silently skewing context budgets. Hooked where prediction and actuals already meet: the non-streaming response assembly in completion_service (covers apps, services, group chats) and the streaming usage reconciliation in assistant_service.
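The drift check described above amounts to a relative comparison against the provider-reported count. A minimal sketch, assuming drift is measured relative to the actual prompt_tokens (the commit message does not say which side is the denominator); logger and function names are hypothetical.

```python
import logging

logger = logging.getLogger("token_drift")  # hypothetical logger name
DRIFT_THRESHOLD = 0.20  # "more than 20%" from the commit message


def relative_drift(estimated: int, actual: int) -> float:
    """Absolute difference as a fraction of the provider-reported count."""
    if actual <= 0:
        return 0.0  # nothing to compare against
    return abs(estimated - actual) / actual


def check_drift(estimated: int, actual: int, model: str) -> bool:
    """Log a warning and return True when the estimate drifts past the threshold."""
    drift = relative_drift(estimated, actual)
    if drift > DRIFT_THRESHOLD:
        logger.warning(
            "token estimate drift %.0f%% for %s: estimated=%d actual=%d",
            drift * 100, model, estimated, actual,
        )
        return True
    return False
```

Under this definition both mismatches quoted elsewhere in this PR (18,581 vs 13,967 and 8,885 vs ~22,900) would have raised the warning.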

The (w*h)/750 approximation with only a long-edge cap overcounted real usage by ~33% on image-heavy PDFs: preflight said 18,581 tokens where the provider billed 13,967. Anthropic's documented cost is one token per 28x28 pixel patch after resizing to fit BOTH a long-edge limit and a per-image token budget (1568px/1568 tokens on most models, 2576px/4784 on the high-resolution Opus 4.7+ family); the missing token budget is what capped the real cost. Mirror the reference implementation from Anthropic's vision docs and pin the documented pricing-table values as tests. A rendered A4 page at 2048px now counts 1551 tokens instead of 2319; eight pages land within text-overhead distance of the observed actuals.
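The patch-based rule can be illustrated as below. This is a sketch of the description in the commit message, not the reference implementation from Anthropic's docs: it assumes the image is shrunk to the largest scale that satisfies both limits, uses exact fractions instead of integer pixel rounding, and the function name is hypothetical.

```python
from fractions import Fraction
from math import ceil


def anthropic_image_tokens(width: int, height: int, max_edge: int = 1568,
                           max_tokens: int = 1568, patch: int = 28) -> int:
    """One token per patch x patch block after fitting both limits (assumed rule)."""

    def patches(scale: Fraction) -> int:
        return ceil(width * scale / patch) * ceil(height * scale / patch)

    # Limit 1: cap the long edge (never upscale).
    start = min(Fraction(1), Fraction(max_edge, max(width, height)))
    if patches(start) <= max_tokens:
        return patches(start)

    # Limit 2: the patch count only changes where an axis crosses a patch
    # boundary, so the largest feasible scale is one of these candidates.
    candidates = set()
    for side in (width, height):
        for k in range(1, ceil(side * start / patch) + 1):
            scale = Fraction(k * patch, side)
            if scale <= start:
                candidates.add(scale)
    best = max(s for s in candidates if patches(s) <= max_tokens)
    return patches(best)


print(anthropic_image_tokens(1448, 2048))  # A4-like page, assumed 1448 x 2048
```

For an assumed 1448 x 2048 page this yields 33 x 47 patches, which matches the 1551 tokens quoted above; passing `max_edge=2576, max_tokens=4784` gives the high-resolution variant.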

Anthropic's vision pricing is the only provider formula we track by hand, so make it a cleanly removable unit: anthropic_image_pricing.py plus its test file, reached through a single commented dispatch branch in count_image_tokens. Dropping the special handling is now two file deletions and one if-statement.

OpenAI's image cost is not one formula: gpt-4o/4.1/4.5 use the classic 85 + 170/tile method, but the mini/nano/o4-mini families count 32px patches (capped at 1536) times a per-model multiplier, and several tile models have their own base/tile constants (gpt-4o-mini 2833/5667, o1/o3 75/150, gpt-5 70/140). Pricing everything with the classic formula undercounted an image-heavy PDF on gpt-4.1-mini by ~2.6x: preflight said 8,885 tokens where the provider billed ~22,900. Mirror the documented tables in openai_image_pricing.py, isolated the same way as anthropic_image_pricing.py, reached through the single dispatch point in count_image_tokens, and removable the same way. Unknown models fall back to the classic formula and the drift alarm flags them.
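The per-model tile constants listed above lend themselves to a small lookup with a classic fallback. A sketch only: the constants are the ones quoted in the commit message, while the exact-name lookup and the function name are simplifying assumptions (a real table would need family matching, and patch-based models are handled elsewhere).

```python
# (base, per_tile) pairs quoted in the commit message.
TILE_CONSTANTS: dict[str, tuple[int, int]] = {
    "gpt-4o-mini": (2833, 5667),
    "o1": (75, 150),
    "o3": (75, 150),
    "gpt-5": (70, 140),
}
CLASSIC = (85, 170)  # gpt-4o / 4.1 / 4.5, and the fallback for unknown models


def tile_model_tokens(model: str, tiles: int) -> int:
    """Price an image that covers `tiles` 512px tiles on a tile-based model."""
    base, per_tile = TILE_CONSTANTS.get(model, CLASSIC)
    return base + per_tile * tiles
```

For a six-tile page this gives 1105 tokens on the classic formula but 36,835 on gpt-4o-mini, which is why a single formula cannot serve every model.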

MaxEriksson2000 added 4 commits June 11, 2026 19:47

MaxEriksson2000 mentioned this pull request Jun 11, 2026

Harden LiteLLM provider flow and upgrade to 1.88.1 #490

Merged

MaxEriksson2000 added 2 commits June 11, 2026 22:14

MaxEriksson2000 changed the title ~~Accurate token counting, PDF-embedded images as vision input, attachment modernization~~ Accurate token counting, document visuals as vision input, attachment modernization Jun 11, 2026

MaxEriksson2000 added 14 commits June 12, 2026 08:08

Merge remote-tracking branch 'origin/develop' into feature/attachment…

f902c71

…-vision-and-token-accuracy

fix(tokens): price GPT-5.4 images with patch budget

2ff207c

fix(chat): treat context preflight as advisory

79b645d

MaxEriksson2000 changed the title ~~Accurate token counting, document visuals as vision input, attachment modernization~~ Token counting update, document visuals as vision input, attachment modernization Jun 12, 2026

MaxEriksson2000 merged commit 008842e into develop Jun 12, 2026
15 checks passed

MaxEriksson2000 deleted the feature/attachment-vision-and-token-accuracy branch June 12, 2026 19:40

MaxEriksson2000 merged 20 commits into develop from feature/attachment-vision-and-token-accuracy
