Harden LiteLLM provider flow and upgrade to 1.88.1 #490
Merged
… overhead
Token counting previously measured concatenated text only, with three
systematic errors:
- count_tokens received the bare model name while litellm is invoked
with "{provider_type}/{name}", silently resolving the default
tokenizer for Azure/vLLM/Mistral deployments
- images (current question and history) counted as 0 tokens
- tool definitions (generate_image + MCP tools) and per-message chat
scaffolding counted as 0 tokens
Counting now mirrors the OpenAI-format payload the adapter builds:
litellm.token_counter(messages=...) includes image_url content and
message wrappers, tools are counted via the tools= parameter, and the
MCP proxy is created before context building so its tool definitions
shrink the history/knowledge budgets.
Also fixes add_knowledge subtracting the prompt tokens a second time
from a budget that already excluded them.
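The budgeting described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `fake_count` is a hypothetical stand-in for the tokenizer call (the commit uses litellm.token_counter for that role), and `qualified_model`, `remaining_budget`, and `knowledge_budget` are assumed names.

```python
from typing import Callable

def fake_count(model: str, messages: list[dict], tools: list[dict]) -> int:
    # Hypothetical tokenizer stand-in: counts whitespace-separated words in
    # message contents plus one word per tool name.
    text = " ".join(str(m.get("content", "")) for m in messages)
    text += " " + " ".join(t["function"]["name"] for t in tools)
    return len(text.split())

def qualified_model(provider_type: str, name: str) -> str:
    # Count with the same "{provider_type}/{name}" string used for the
    # completion call, so both resolve the same tokenizer.
    return f"{provider_type}/{name}"

def remaining_budget(
    context_window: int,
    provider_type: str,
    name: str,
    prompt_messages: list[dict],
    tools: list[dict],
    count: Callable[[str, list[dict], list[dict]], int] = fake_count,
) -> int:
    # Prompt messages and tool definitions both consume context.
    used = count(qualified_model(provider_type, name), prompt_messages, tools)
    return max(context_window - used, 0)

def knowledge_budget(remaining: int, history_tokens: int) -> int:
    # `remaining` already excludes the prompt tokens, so only the history
    # is subtracted here (subtracting the prompt again was the bug).
    return max(remaining - history_tokens, 0)
```

Passing the counter as a parameter keeps the arithmetic testable without a provider or tokenizer installed.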
…ploads
PDF attachments previously reached the model as extracted text only; embedded photos and scanned pages were silently dropped. Uploading a PDF now also renders its embedded image regions (pdfplumber + pypdfium, already transitive dependencies) into derived image files linked via files.parent_file_id, capped per document and filtered by source size to skip logos.
At ask time the attachment list is expanded with derived images only when the model supports vision; without vision the PDF degrades to a plain text attachment instead of erroring. History replay now also strips images for non-vision models, which previously produced provider errors after a model switch.
All vision images (direct uploads included) are downscaled to max 2048px and recompressed before storage, since blobs are base64-encoded into every request and replayed for the rest of the session.
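A minimal sketch of the ask-time expansion and history stripping described above. The `File` dataclass, `expand_attachments`, and `strip_images` are hypothetical names for illustration; only `parent_file_id` and the OpenAI-style `image_url` content part come from the commit text.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class File:
    id: str
    mimetype: str
    parent_file_id: Optional[str] = None  # set on images derived from a PDF

def expand_attachments(
    attachments: list[File],
    derived_by_parent: dict[str, list[File]],
    supports_vision: bool,
) -> list[File]:
    # Add a PDF's derived images only for vision models; otherwise the PDF
    # stays a plain text attachment.
    expanded: list[File] = []
    for f in attachments:
        expanded.append(f)
        if supports_vision and f.mimetype == "application/pdf":
            expanded.extend(derived_by_parent.get(f.id, []))
    return expanded

def strip_images(messages: list[dict], supports_vision: bool) -> list[dict]:
    # Drop image_url parts from replayed history when the current model
    # cannot accept them (for example after a model switch).
    if supports_vision:
        return messages
    stripped = []
    for m in messages:
        content = m["content"]
        if isinstance(content, list):
            content = [p for p in content if p.get("type") != "image_url"]
        stripped.append({**m, "content": content})
    return stripped
```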
- PDF: per-page [PAGE N] markers and tables rendered as markdown (table text excluded from the running text to avoid duplication); image-only PDFs return empty text instead of bare markers
- PPTX: speaker notes are extracted (previously discarded)
- Attachment prompt block: compact 'FILE: name (mimetype)' format replaces JSON-per-line (less token overhead, gives the model the file type), and each file is capped at a configurable token budget with a visible truncation notice instead of failing the whole request via QueryException
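The attachment prompt block could look roughly like this. It is a sketch under assumptions: tokens are approximated by whitespace-separated words, and the wording of the truncation notice and the function names are hypothetical; only the 'FILE: name (mimetype)' header comes from the commit text.

```python
def truncate_to_budget(text: str, max_tokens: int) -> tuple[str, bool]:
    # Assumption: whitespace-separated words approximate tokens.
    words = text.split()
    if len(words) <= max_tokens:
        return text, False
    return " ".join(words[:max_tokens]), True

def format_attachment(name: str, mimetype: str, text: str, max_tokens: int) -> str:
    # Compact header instead of one JSON object per line.
    body, truncated = truncate_to_budget(text, max_tokens)
    lines = [f"FILE: {name} ({mimetype})", body]
    if truncated:
        # Visible notice (hypothetical wording) instead of failing the request.
        lines.append(f"[truncated to {max_tokens} tokens]")
    return "\n".join(lines)
```

Capping each file separately means one oversized attachment shortens only itself rather than rejecting the whole question.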
iPhone photos default to HEIC, which was rejected. pillow-heif decodes them and the existing downscale step converts to JPEG before storage — providers never see image/heic. Formats outside PNG/JPEG/WEBP are now always converted even when conversion does not shrink the blob. The frontend accept list picks the new mimetypes up automatically via the limits endpoint (ImageMimeTypes drives FormatLimit).
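The storage decision above can be sketched as a pure function. This is an assumption-laden illustration: `choose_stored_blob` and `PASSTHROUGH_MIMETYPES` are hypothetical names, decoding and downscaling are assumed to have already produced `converted_jpeg`, and the rule that PNG/JPEG/WEBP keep the original unless the converted blob is smaller is inferred from the commit text rather than stated in it.

```python
PASSTHROUGH_MIMETYPES = {"image/png", "image/jpeg", "image/webp"}

def choose_stored_blob(
    mimetype: str, original: bytes, converted_jpeg: bytes
) -> tuple[str, bytes]:
    if mimetype not in PASSTHROUGH_MIMETYPES:
        # e.g. image/heic: always store the JPEG, even if it is larger,
        # so providers never receive an unsupported format.
        return "image/jpeg", converted_jpeg
    if len(converted_jpeg) < len(original):
        # Supported format: keep the conversion only when it saves space.
        return "image/jpeg", converted_jpeg
    return mimetype, original
```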
LiteLLM BadRequestError was mapped to OpenAIException, which the embedding/transcription retry policies do not exempt: a deterministic upstream 400 was retried three times (up to ~40s of backoff) and then surfaced as HTTP 503, masking tenant configuration errors as availability incidents. Introduce ProviderRejectedRequestException (subclass of OpenAIException so existing provider-error handling keeps catching it) mapped to 400 with error code 9041, and a shared NON_RETRYABLE_PROVIDER_ERRORS tuple used by both adapter retry policies. Invalid API credentials (APIKeyNotConfiguredException) are excluded from retries as well.
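A minimal sketch of the retry exemption described above. The exception and tuple names come from the commit text; the class bodies, the base class of APIKeyNotConfiguredException, and the `call_with_retry` helper are assumptions for illustration and do not reflect the adapters' actual retry policies.

```python
import time

class OpenAIException(Exception):
    """Generic provider error (surfaced as 503 per the commit text)."""

class ProviderRejectedRequestException(OpenAIException):
    # Subclass so existing `except OpenAIException` handlers still catch it.
    status_code = 400
    error_code = 9041

class APIKeyNotConfiguredException(Exception):
    """Base class is an assumption; the commit text does not state it."""

NON_RETRYABLE_PROVIDER_ERRORS = (
    ProviderRejectedRequestException,
    APIKeyNotConfiguredException,
)

def call_with_retry(fn, attempts: int = 3, base_delay: float = 0.0):
    # Hypothetical retry loop with exponential backoff.
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except NON_RETRYABLE_PROVIDER_ERRORS:
            # Deterministic failures: retrying cannot help, re-raise at once.
            raise
        except OpenAIException:
            if attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

The non-retryable clause must come before the generic one, because ProviderRejectedRequestException would otherwise match `except OpenAIException` and be retried.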
Capability snapshots ignored the model's reasoning flag: LiteLLM discovery is name-based, so opaque routes (e.g. Azure deployment names) missed reasoning_effort entirely, and the discovery-failure fallback hardcoded reasoning=False. Since the persisted snapshot acts as an explicit override at resolve time and is never widened again, reasoning models could permanently lose the reasoning_effort control. snapshot_supported_model_kwargs now takes the admin-declared reasoning flag: the fallback respects it, and reasoning_effort is forced on when discovery misses it. update() re-snapshots when name or reasoning changes (previously only name), with an explicit capability payload still winning over the refreshed snapshot.
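The snapshot behaviour above can be sketched as follows. Only the function name snapshot_supported_model_kwargs and the reasoning_effort key come from the commit text; the signature, the use of `None` to signal a discovery failure, the contents of the fallback, and `needs_resnapshot` are assumptions for illustration.

```python
from typing import Optional

def snapshot_supported_model_kwargs(
    discovered: Optional[set[str]], reasoning: bool
) -> set[str]:
    # `discovered` is None when name-based discovery failed (assumed signal).
    if discovered is None:
        # Fallback respects the admin-declared flag instead of hardcoding False.
        return {"reasoning_effort"} if reasoning else set()
    snapshot = set(discovered)
    if reasoning:
        # Force the control on when discovery missed it, e.g. for opaque
        # deployment names.
        snapshot.add("reasoning_effort")
    return snapshot

def needs_resnapshot(
    old_name: str, new_name: str, old_reasoning: bool, new_reasoning: bool
) -> bool:
    # Previously only a name change triggered a refresh.
    return old_name != new_name or old_reasoning != new_reasoning
```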
Base automatically changed from feature/attachment-vision-and-token-accuracy to develop June 12, 2026 19:40
…er-flow
# Conflicts:
#	backend/alembic/versions/b4f2a9c1e7d3_add_parent_file_id_to_files.py
#	backend/src/intric/assistants/assistant_service.py
#	backend/src/intric/completion_models/infrastructure/adapters/base_adapter.py
#	backend/src/intric/completion_models/infrastructure/adapters/tenant_model_adapter.py
#	backend/src/intric/completion_models/infrastructure/completion_service.py
#	backend/src/intric/completion_models/infrastructure/context_builder.py
#	backend/src/intric/files/file_protocol.py
#	backend/src/intric/files/file_service.py
#	backend/src/intric/files/image_processing.py
#	backend/src/intric/main/config.py
#	backend/src/intric/tokens/token_utils.py
#	backend/tests/unit/test_completion_service_streaming.py
#	backend/tests/unittests/ai_models/test_context_builder.py
#	backend/tests/unittests/assistants/test_assistants_service.py
#	backend/tests/unittests/files/test_file_service.py
#	backend/tests/unittests/files/test_image_processing.py
#	backend/tests/unittests/tokens/test_token_utils.py
🧹 Dead-code & unused-dependency report
Advisory: never gates the PR. Whole-repo scan, so some findings may be false positives (dynamic dispatch, framework hooks, runtime-resolved imports). Triage before removing.
✅ No dead code or unused dependencies detected.
📊 Patch coverage
Share of this PR's new/changed lines exercised by tests. Report-only: never gates the PR.
Uncovered lines — Backend (11 files)
Summary
- Upgrade LiteLLM from 1.83.10 to 1.88.1 and python-dotenv from 1.0.1 to 1.2.2
- … PreparedModelStream
Why
LiteLLM had accumulated transport, provider, orchestration, policy, and error-handling responsibilities inside the completion adapter. This duplicated provider logic across model types, caused streaming and non-streaming behavior to diverge, allowed dependency metadata to alter model availability, and risked exposing raw provider details.
The new structure keeps LiteLLM as a transport dependency while Eneo owns tenant boundaries, model policy, capabilities, orchestration, and public contracts.
Impact
- model_kwargs_capabilities added to tenant completion model create/update
Validation
- 72 focused unit tests passed
- 66 relevant integration tests passed
- 3170 passed, with one pre-existing unrelated failure in tests/unit/test_route_error_contract.py::test_detects_positional_http_exception_status
- @intric/intric-js lint passed
Stacked PR
This PR is intentionally based on #489 (feature/attachment-vision-and-token-accuracy) because the work was developed and verified on that branch. After #489 merges, this PR can be rebased or retargeted to develop.