Token counting update, document visuals as vision input, attachment modernization#489

Merged
MaxEriksson2000 merged 20 commits into develop from feature/attachment-vision-and-token-accuracy
Jun 12, 2026
MaxEriksson2000 commented Jun 11, 2026

Collaborator

Summary

This PR improves attachments and token accounting for assistants that need to read images embedded in PDF protocols and other documents.

1. Token counting mirrors the actual payload

  • Count message scaffolding, images, tool definitions, replayed tool calls, and model-specific tokenization from the same payload shape sent to providers.
  • Price OpenAI and Anthropic images from stored pixel dimensions using their documented model-family formulas, including GPT-5.4/5.5's 2500-patch high-detail budget.
  • Include document-derived images and image-only PDFs in token preflight for vision models.
  • Compare estimates with provider-reported prompt usage and warn when they drift materially.
  • Treat preflight as an advisory estimate: local overflow no longer blocks sending, the provider validates the final request, and a provider rejection leaves the message and attachments in the composer.

2. Document visuals reach vision models

  • Render visually meaningful PDF pages, including raster images, scanned pages, vector charts, and diagrams, into bounded derived images.
  • Extract embedded raster media from DOCX and PPTX while skipping unsupported vector formats and oversized archive entries.
  • Expand derived images only when building completion payloads for the governance-effective vision model; they are not persisted as user question attachments or exposed in the file library.
  • Strip replayed images when switching to a non-vision model.
  • Downscale vision uploads to a bounded size before storage so the stored, sent, and counted payloads agree.

3. Attachment extraction and upload behavior

  • Add PDF page markers and markdown table extraction without duplicating table text.
  • Extract PPTX speaker notes.
  • Replace JSON-per-line attachment prompts with a compact file header and token-bounded truncation notice.
  • Accept HEIC/HEIF uploads and convert them to a provider-compatible stored format.
  • Validate upload limits in the chat, translate failures, and keep upload errors visible beside the composer.

Review notes

  • max_output_tokens is deliberately not reserved from the configured input budget.
  • New settings: ATTACHMENT_IMAGE_EXTRACTION, ATTACHMENT_MAX_EXTRACTED_IMAGES, and ATTACHMENT_MAX_TOKENS_PER_FILE.
  • Files uploaded before this change cannot be backfilled because original document bytes are not retained for text files.
  • Extraction is keyed on the upload content type; there is no file-signature sniffing.
  • Preflight still excludes MCP tool definitions.

Validation

  • The full branch backend suite previously passed with 3,181 tests.
  • Latest targeted backend verification passed 64 tests covering context building, streaming overflow forwarding, preflight, and image pricing.
  • Frontend unit tests, Svelte check, Ruff, Pyright, formatting, lint, i18n, schema drift, and pre-push checks passed.
  • Alembic has a single migration head.

… overhead

Token counting previously measured concatenated text only, with three
systematic errors:

- count_tokens received the bare model name while litellm is invoked
  with "{provider_type}/{name}", silently resolving the default
  tokenizer for Azure/vLLM/Mistral deployments
- images (current question and history) counted as 0 tokens
- tool definitions (generate_image + MCP tools) and per-message chat
  scaffolding counted as 0 tokens

Counting now mirrors the OpenAI-format payload the adapter builds:
litellm.token_counter(messages=...) includes image_url content and
message wrappers, tools are counted via the tools= parameter, and the
MCP proxy is created before context building so its tool definitions
shrink the history/knowledge budgets.

Also fixes add_knowledge subtracting the prompt tokens a second time
from a budget that already excluded them.

…ploads

PDF attachments previously reached the model as extracted text only —
embedded photos and scanned pages were silently dropped. Uploading a
PDF now also renders its embedded image regions (pdfplumber + pypdfium,
already transitive dependencies) into derived image files linked via
files.parent_file_id, capped per document and filtered by source size
to skip logos.

At ask time the attachment list is expanded with derived images only
when the model supports vision; without vision the PDF degrades to a
plain text attachment instead of erroring. History replay now also
strips images for non-vision models, which previously produced provider
errors after a model switch.
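The history-replay stripping described above could look roughly like the following sketch. It assumes OpenAI-format messages whose `content` is either a string or a list of typed parts; `strip_images` and the `supports_vision` flag are hypothetical names, not the helpers this PR actually adds.

```python
from typing import Any

Message = dict[str, Any]


def strip_images(messages: list[Message], supports_vision: bool) -> list[Message]:
    """Return messages without image parts when the model has no vision support."""
    if supports_vision:
        return messages
    cleaned: list[Message] = []
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            # Keep every part that is not an image (typically the text parts).
            content = [part for part in content if part.get("type") != "image_url"]
        # Copy the message so the stored history is left untouched.
        cleaned.append({**message, "content": content})
    return cleaned
```

Returning copies rather than mutating in place means the stored history still carries the images if the user later switches back to a vision model.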

All vision images (direct uploads included) are downscaled to max
2048px and recompressed before storage, since blobs are base64-encoded
into every request and replayed for the rest of the session.
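As a minimal sketch of the size bound described above, the target dimensions can be computed from the original ones before an imaging library (Pillow, for example) does the actual resampling. The function name `bounded_size` is hypothetical, and the sketch assumes "max 2048px" refers to the longer edge.

```python
def bounded_size(width: int, height: int, max_edge: int = 2048) -> tuple[int, int]:
    """Scale (width, height) down so the longer edge is at most max_edge."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # small images are never upscaled
    scale = max_edge / longest
    # Round to whole pixels and keep each side at least 1px.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(bounded_size(4096, 2048))
print(bounded_size(1000, 500))
```

Because the stored blob already has these bounded dimensions, the same width and height can be used for both sending and token counting.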
- PDF: per-page [PAGE N] markers and tables rendered as markdown
  (table text excluded from the running text to avoid duplication);
  image-only PDFs return empty text instead of bare markers
- PPTX: speaker notes are extracted (previously discarded)
- Attachment prompt block: compact 'FILE: name (mimetype)' format
  replaces JSON-per-line (less token overhead, gives the model the
  file type), and each file is capped at a configurable token budget
  with a visible truncation notice instead of failing the whole
  request via QueryException
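The compact attachment block with a per-file cap might be sketched as follows. This is an illustration only: `render_attachment`, `count_tokens`, and the notice wording are hypothetical, and the whitespace word count stands in for the real model tokenizer.

```python
TRUNCATION_NOTICE = "[truncated: file exceeded the per-file token budget]"


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def render_attachment(name: str, mimetype: str, text: str, max_tokens: int) -> str:
    """Render one attachment as a 'FILE: name (mimetype)' block, capped at max_tokens."""
    header = f"FILE: {name} ({mimetype})"
    if count_tokens(text) <= max_tokens:
        return f"{header}\n{text}"
    # Reserve room for the notice so body plus notice stays within the cap.
    budget = max(0, max_tokens - count_tokens(TRUNCATION_NOTICE))
    kept = " ".join(text.split()[:budget])
    return f"{header}\n{kept}\n{TRUNCATION_NOTICE}"
```

Truncating with a visible notice keeps the request valid and tells the model that the file continues beyond what it can see.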
iPhone photos default to HEIC, which was rejected. pillow-heif decodes
them and the existing downscale step converts to JPEG before storage —
providers never see image/heic. Formats outside PNG/JPEG/WEBP are now
always converted even when conversion does not shrink the blob.

The frontend accept list picks the new mimetypes up automatically via
the limits endpoint (ImageMimeTypes drives FormatLimit).

github-actions Bot commented Jun 11, 2026

🧹 Dead-code & unused-dependency report

Advisory — never gates the PR. Whole-repo scan, so some findings may be false positives (dynamic dispatch, framework hooks, runtime-resolved imports). Triage before removing.

No dead code or unused dependencies detected.

github-actions Bot commented Jun 11, 2026

📊 Patch coverage

Share of this PR's new/changed lines exercised by tests. Report-only — never gates the PR.

| Area | Changed | Uncovered | Coverage |
| --- | --- | --- | --- |
| Frontend | 126 | 99 | 21% |
| Backend | 600 | 63 | 89% |

Uncovered lines — Frontend (9 files)
  • …/components/conversation/conversationInputState.ts — 40–43, 49–53
  • …/attachments/components/AttachmentUploadIconButton.svelte — 25
  • …/components/conversation/ConversationView.svelte — 4–5, 32–36, 38, 40–42, 144
  • …/components/conversation/ConversationInput.svelte — 19, 21, 34, 36–37, 114, 150–153, 156, 160–163, 176, 292–297, 341–342, 346–348, 350, 352, 354, 356
  • …/features/attachments/getUploadErrorMessage.ts — 20–23
  • …/features/attachments/AttachmentManager.ts — 8, 69, 87, 184–187, 200–201, 203–204, 213, 221, 235–238, 249–253, 271–274, 282–286, 301, 304, 314, 316
  • …/features/jobs/JobManager.ts — 246
  • …/attachments/components/AttachmentDropArea.svelte — 15, 42
  • …/components/conversation/ContextUsageBar.svelte — 185, 194, 196, 292
Uncovered lines — Backend (13 files)
  • …/intric/files/image_processing.py — 101, 172, 224–225, 270, 274, 276, 280
  • …/intric/services/service_runner.py — 67–69
  • …/completion_models/infrastructure/completion_service.py — 302, 307, 311–313, 411
  • …/intric/files/file_repo.py — 47–48, 50, 56–57
  • …/completion_models/infrastructure/context_builder.py — 143–144
  • …/intric/tokens/token_utils.py — 73, 124, 143, 154
  • …/infrastructure/adapters/base_adapter.py — 28
  • …/infrastructure/adapters/tenant_model_adapter.py — 478, 1538
  • …/intric/files/text.py — 150, 175–177, 202–203, 208
  • …/intric/assistants/assistant_service.py — 1212–1213, 1422
  • …/conversations/application/conversation_service.py — 40
  • …/intric/files/file_protocol.py — 86, 195–196, 201, 207–208, 214, 226–227, 229–230, 232, 239–241, 246, 256
  • …/apps/apps/app_service.py — 396–399

Replace per-image bbox cropping with whole-page rendering for pages that
carry visual content (raster images, many curves, diagonal lines, or
filled rects outside detected tables), so vector-drawn charts and
diagrams reach vision models too. Render resolution is capped on both
axes, bounding memory even for degenerate page sizes.

DOCX/PPTX uploads now yield their embedded raster media as derived
images (EMF/WMF and oversized zip entries are skipped). Derived files
are named after their source page when known. Image uploads get their
checksum recomputed after downscaling so it describes the stored blob.

Settings renamed to attachment_image_extraction and
attachment_max_extracted_images (previously pdf_attachment_*), since
extraction is no longer PDF-only.

…etail

Images are now sent and counted with an explicit detail=high block —
litellm prices detail-less image payloads at a flat 85 tokens while
providers resolve auto to high, undercounting a 2048px image by roughly
a thousand tokens (more on Anthropic models).
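For illustration, an image content part with an explicit detail level in the OpenAI chat format looks like the sketch below. The helper name `image_part` is hypothetical; the `image_url` / `detail` structure is the standard OpenAI-format shape.

```python
import base64


def image_part(blob: bytes, mimetype: str = "image/jpeg") -> dict:
    """Build an OpenAI-format image content part that requests high detail."""
    encoded = base64.b64encode(blob).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {
            "url": f"data:{mimetype};base64,{encoded}",
            # Explicit detail so sender and counter agree on the pricing mode.
            "detail": "high",
        },
    }
```

Using the same builder for the request and for token counting keeps the two from disagreeing about the detail level.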

Derived images (rendered PDF pages etc.) no longer leak to user-facing
surfaces: they are expanded into the completion payload at ask time
(current files, assistant attachments and session history) instead of
being persisted on the question, and the file library listing hides
them. Apps and services expand their input files the same way, and
attachment images (prompt_files) now ride on the current user message
for vision models instead of being silently dropped.

The token preflight prices files like the real request: same litellm
model identifier as the adapter, with image files and document-derived
images included for vision models.
MaxEriksson2000 changed the title from "Accurate token counting, PDF-embedded images as vision input, attachment modernization" to "Accurate token counting, document visuals as vision input, attachment modernization" on Jun 11, 2026

After merging develop, b4f2a9c1e7d3 and the develop merge migration
202606111200 were both children of branchpoint 3eb6a34b6733, producing
two heads and breaking 'alembic upgrade head'. Re-point this migration's
down_revision to 202606111200 so the chain is linear with a single head.
Safe: it only touches the files table, independent of the governance/
reasoning changes in the develop branch.
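The two-heads situation can be illustrated with a small, Alembic-free sketch: a head is any revision that no other revision names as its down_revision. The `find_heads` helper is hypothetical, and the branchpoint is shown with no parent only to keep the example short.

```python
from typing import Optional


def find_heads(down_revisions: dict[str, Optional[str]]) -> set[str]:
    """Return revisions that no other revision points to as its parent."""
    parents = {parent for parent in down_revisions.values() if parent is not None}
    return set(down_revisions) - parents


# Before the fix: both migrations descend from the branchpoint.
before = {
    "3eb6a34b6733": None,  # branchpoint (earlier history omitted)
    "202606111200": "3eb6a34b6733",
    "b4f2a9c1e7d3": "3eb6a34b6733",
}
# After the fix: this PR's migration follows the develop merge migration.
after = {**before, "b4f2a9c1e7d3": "202606111200"}

print(sorted(find_heads(before)))
print(sorted(find_heads(after)))
```

In the real migration file the change is just the single `down_revision` assignment; the sketch only shows why that assignment restores a single head.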
The chat initialised the AttachmentManager with rules=undefined (the real
getAttachmentRulesStore call was commented out), so client-side validation
was disabled: oversized/unsupported files were sent to the server and only
rejected there. Combined with hardcoded English error strings, users saw no
clear feedback about what went wrong or what is allowed.

- Wire getAttachmentRulesStore in ConversationView so uploads are validated
  client-side against the backend per-format size limits and the partner
  model's vision support — oversized/unsupported files are now rejected
  instantly with a clear message instead of failing silently server-side.
- Translate all validation and upload-failure messages to paraglide (sv+en),
  stating the actual size and the allowed maximum.
- Surface paste-upload validation errors (were silently dropped).
- getUploadErrorMessage now returns a file-name-free reason so callers don't
  duplicate the file name; update JobManager accordingly.

…toast

Upload/validation errors were shown as a transient toast that is easy to
miss. Surface them inline next to the chat input instead, matching the
existing context-window error banner, and keep them visible until the user
retries or dismisses them.

- AttachmentManager gains an uploadError store (single source for validation
  and server-side failures) plus a clearUploadError action and an inlineErrors
  option that suppresses the toast when a consumer renders the store inline.
- ConversationView opts the chat in (inlineErrors) and passes it to the drop
  area; ConversationInput renders a persistent, dismissible error banner in the
  same style as the context-window banner.
- Chat-only entry points (icon button, paste) no longer toast; the shared
  AttachmentDropArea keeps toasting outside the chat via the inlineErrors prop,
  so app upload flows are unchanged.

The preflight token estimate keyed derived-image lookup on text_files, which
required extractable text (f.text). An image-only PDF (e.g. scanned pages or
photos) has no text, so it fell out of that set: its derived vision images
were never fetched or priced, and it was reported as an excluded file. The
preflight showed ~15 tokens for a 9 MB image-only PDF.

Key derived-image lookup on every document upload (FileType.TEXT) instead,
mirroring file_service.with_derived_images on the real ask path, and treat a
document as included when it contributes either inlined text or derived
images. Add a regression test for the image-only PDF case.

litellm.token_counter applies OpenAI's tile formula to every provider, so
Claude image attachments were counted ~30% too low in the token preflight
(an 8-page PDF reported 8,840 tokens vs ~13,100 actually billed).

Count Anthropic image blocks from their pixel dimensions instead —
(width × height) / 750 after capping the long edge to 1568px, per
Anthropic's documented image cost — and let litellm handle only text and
message scaffolding. Applied at the count_message_tokens chokepoint, so
both the preflight estimate and the context-build counting are corrected.
OpenAI/Azure counting is unchanged (litellm is already exact there).

…model

_with_vision_derivatives checked the assistant's configured model, but
governance policy can override the answering model. With a non-vision
configured model and a policy-enforced vision model, document-derived
images were never attached even though preflight counted them — the
vision model silently lost the document pages and the token projection
diverged from the real payload. Gate on the effective model instead.

Image token cost depends only on pixel dimensions, so count both
providers with their documented formulas (OpenAI 85 + 170/tile at high
detail, Anthropic (w*h)/750) read straight from the stored blob. This
removes the per-request base64 round-trips that counting used to do for
every current and history image — previously each request encoded all
blobs to data URLs, and the Anthropic path decoded them again and
PIL-opened each one, all synchronously on the event loop.
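The classic OpenAI high-detail tile formula mentioned above can be computed from pixel dimensions alone. The sketch below follows the publicly documented steps (fit within 2048x2048, scale the shortest side down to 768px, count 512px tiles). The function name is hypothetical, `base` and `per_tile` are parameters so other constants can be passed in, and the sketch assumes images are only ever downscaled.

```python
import math


def openai_tile_tokens(width: int, height: int, base: int = 85, per_tile: int = 170) -> int:
    """Estimate high-detail image tokens with the classic tile method."""
    # Step 1: fit the image inside a 2048x2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count the 512px tiles needed to cover the image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return base + per_tile * tiles


print(openai_tile_tokens(1024, 1024))  # 765
print(openai_tile_tokens(2048, 4096))  # 1105
```

The second call shows where a flat figure of 1105 can come from: it is the classic cost of a six-tile image.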

Detect Claude behind any provider route (openai-compatible, bedrock
style prefixes), not only "anthropic/" — otherwise its images were
priced ~30% too low. The flat image fallback is now derived from the
OpenAI formula instead of a magic 1105.
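One simple way to recognize Claude behind arbitrary provider routes is to match on the model name rather than the provider prefix. This is only a sketch of the idea: `is_claude_model` is a hypothetical helper, the identifiers in the usage lines are illustrative strings, and the actual detection in this PR may be stricter.

```python
def is_claude_model(model_id: str) -> bool:
    """Return True when the identifier names a Claude model on any route."""
    # Match the model family anywhere in the identifier, ignoring case,
    # so prefixes such as "anthropic/" or "bedrock/" do not matter.
    return "claude" in model_id.lower()


print(is_claude_model("anthropic/claude-example"))
print(is_claude_model("bedrock/anthropic.claude-example"))
print(is_claude_model("openai/gpt-4o"))
```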

Extract the canonical message/content builders into message_payload.py,
used by both TenantModelAdapter and token counting, so the counted
shape cannot drift from the sent shape. The replayed tool_calls
structure is now counted as the JSON the provider actually receives
instead of a name+arguments approximation.

Preflight now counts textless documents too — the real request still
sends their FILE header block and preamble, so an image-only PDF no
longer undercounts. Attachment truncation budgets the appended notice
and re-checks token-dense text, keeping each file within
attachment_max_tokens_per_file.

Every completion response carries the authoritative prompt_tokens. No
external source maintains the image-pricing formulas for us (litellm
prices all images with OpenAI's tile formula and has no Anthropic
formula at all), so instead of hoping ours stay current, compare each
estimate against the provider-reported count and log a warning when
they diverge more than 20%. Any staleness — provider pricing changes,
tokenizer swaps, payload-shape drift — surfaces in logs within days
instead of silently skewing context budgets.

Hooked where prediction and actuals already meet: the non-streaming
response assembly in completion_service (covers apps, services, group
chats) and the streaming usage reconciliation in assistant_service.
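The drift check described above might be sketched like this. The function and logger names are hypothetical, and measuring the divergence relative to the provider-reported count is an assumption; the 20% threshold is the one quoted in the commit message.

```python
import logging

logger = logging.getLogger("token_drift")


def check_drift(estimated: int, actual: int, threshold: float = 0.20) -> bool:
    """Warn and return True when the estimate diverges from provider usage."""
    if actual <= 0:
        return False  # nothing authoritative to compare against
    drift = abs(estimated - actual) / actual
    if drift > threshold:
        logger.warning(
            "Token estimate drifted %.0f%%: estimated=%d, provider reported=%d",
            drift * 100, estimated, actual,
        )
        return True
    return False


print(check_drift(18_581, 13_967))  # True: about 33% above the billed count
print(check_drift(1_000, 1_100))    # False: within the 20% tolerance
```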
The (w*h)/750 approximation with only a long-edge cap overcounted real
usage by ~33% on image-heavy PDFs: preflight said 18,581 tokens where
the provider billed 13,967. Anthropic's documented cost is one token
per 28x28 pixel patch after resizing to fit BOTH a long-edge limit and
a per-image token budget (1568px/1568 tokens on most models, 2576px/
4784 on the high-resolution Opus 4.7+ family) — the missing token
budget is what capped the real cost.

Mirror the reference implementation from Anthropic's vision docs and
pin the documented pricing-table values as tests. A rendered A4 page at
2048px now counts 1551 tokens instead of 2319; eight pages land within
text-overhead distance of the observed actuals.
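The patch-based rule described above can be sketched as follows: cap the long edge, then shrink further until the number of 28x28 patches fits the per-image token budget. This is not Anthropic's reference implementation. The function name and the 0.5% shrink step are assumptions made for illustration, and only the limits quoted in the commit message are used.

```python
import math


def anthropic_patch_tokens(
    width: int,
    height: int,
    max_long_edge: int = 1568,
    max_tokens: int = 1568,
    patch: int = 28,
) -> int:
    """Estimate image tokens as 28x28 patches under both resize limits."""

    def patches(w: float, h: float) -> int:
        return math.ceil(w / patch) * math.ceil(h / patch)

    # Limit 1: the long edge may not exceed max_long_edge.
    scale = min(1.0, max_long_edge / max(width, height))
    w, h = width * scale, height * scale
    if patches(w, h) > max_tokens:
        # Limit 2: jump close to the token budget, then step down until it fits.
        scale = math.sqrt(max_tokens * patch * patch / (w * h))
        w, h = w * scale, h * scale
        while patches(w, h) > max_tokens:
            w, h = w * 0.995, h * 0.995
    return patches(w, h)


print(anthropic_patch_tokens(1448, 2048))
```

For a 1448x2048 input, roughly an A4 page rendered at 2048px, this sketch returns 1551, which agrees with the figure quoted above; agreement on other sizes is not guaranteed. The high-resolution limits from the commit message can be passed as `max_long_edge=2576, max_tokens=4784`.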
Anthropic's vision pricing is the only provider formula we track by
hand, so make it a cleanly removable unit: anthropic_image_pricing.py
plus its test file, reached through a single commented dispatch branch
in count_image_tokens. Dropping the special handling is now two file
deletions and one if-statement.

OpenAI's image cost is not one formula: gpt-4o/4.1/4.5 use the classic
85 + 170/tile method, but the mini/nano/o4-mini families count 32px
patches (capped at 1536) times a per-model multiplier, and several tile
models have their own base/tile constants (gpt-4o-mini 2833/5667, o1/o3
75/150, gpt-5 70/140). Pricing everything with the classic formula
undercounted an image-heavy PDF on gpt-4.1-mini by ~2.6x: preflight
said 8,885 tokens where the provider billed ~22,900.

Mirror the documented tables in openai_image_pricing.py — isolated the
same way as anthropic_image_pricing.py, reached through the single
dispatch point in count_image_tokens, removable the same way. Unknown
models fall back to the classic formula and the drift alarm flags them.
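The per-model tile constants and the fallback for unknown models could be organized as a small lookup, sketched below. The table holds only the base/tile pairs quoted in the commit message; `tile_constants` and the exact-name matching are hypothetical simplifications (a real lookup would need to handle dated model names and the patch-based families separately).

```python
CLASSIC = (85, 170)  # base tokens, tokens per 512px tile

# Base/tile pairs quoted above, keyed by bare model name.
TILE_CONSTANTS: dict[str, tuple[int, int]] = {
    "gpt-4o": CLASSIC,
    "gpt-4.1": CLASSIC,
    "gpt-4.5": CLASSIC,
    "gpt-4o-mini": (2833, 5667),
    "o1": (75, 150),
    "o3": (75, 150),
    "gpt-5": (70, 140),
}


def tile_constants(model_id: str) -> tuple[tuple[int, int], bool]:
    """Return ((base, per_tile), known) for a model identifier."""
    # Drop a provider prefix such as "openai/" or "azure/" before the lookup.
    name = model_id.rsplit("/", 1)[-1].lower()
    if name in TILE_CONSTANTS:
        return TILE_CONSTANTS[name], True
    # Unknown model: price with the classic formula and report it as unknown
    # so the caller can rely on the drift warning to flag mispricing.
    return CLASSIC, False


print(tile_constants("openai/gpt-4o-mini"))
print(tile_constants("azure/some-new-model"))
```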
MaxEriksson2000 changed the title from "Accurate token counting, document visuals as vision input, attachment modernization" to "Token counting update, document visuals as vision input, attachment modernization" on Jun 12, 2026
MaxEriksson2000 merged commit 008842e into develop on Jun 12, 2026
15 checks passed
MaxEriksson2000 deleted the feature/attachment-vision-and-token-accuracy branch on June 12, 2026 19:40