Support modular Sentence Transformers cross-encoder rerankers (e.g. ettin-reranker)#867
hotchpotch wants to merge 18 commits into
A transient download failure of the pooling/final-dense config used to warn and continue, letting `detect_modular_reranker` fall back to the embedding path and silently disable `/rerank`. Require those configs for transformer -> pooling -> ... -> dense pipelines, and document why the local-read fallbacks differ before vs. after the reranker is confirmed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
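The behavior change in this commit can be pictured with a small sketch. It is only an illustration under assumptions: `detect_sketch` and its parameters are hypothetical and do not mirror the real `detect_modular_reranker` signature; the `out_features == 1` criterion is taken from the PR description.

```rust
/// Hypothetical detection step. `final_dense_out_features` stands in for the
/// result of downloading and parsing the final Dense config.
fn detect_sketch(
    ends_in_dense: bool,
    final_dense_out_features: Result<usize, String>,
) -> Result<bool, String> {
    if !ends_in_dense {
        // No Dense tail: a plain embedding model, so the config is not needed.
        return Ok(false);
    }
    // Propagate the download error instead of warning and falling back to the
    // embedding path, which would silently disable reranking.
    let out_features = final_dense_out_features?;
    // A single output feature marks a scoring head; anything else is an
    // embedding model that ends in a Dense projection.
    Ok(out_features == 1)
}
```

With this shape, a transient failure surfaces as an `Err` at startup rather than as a server that loads but has no rerank route.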
Wrapping the ORT and Python startup blocks in `else { ... }` re-indented
every original line, swamping the diff. Use `'label: { ... break }` early
guards so the unchanged backend bodies keep their original indentation and
the diff only shows the post-pooling-prediction skip.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
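The labeled-block guard mentioned in this commit message looks roughly like the sketch below. The function name, the flag, and the log strings are hypothetical; only the `'label: { ... break 'label; }` pattern is the point.

```rust
/// Hypothetical stand-in for the backend startup code.
/// Returns the log lines it would have emitted.
fn start_backends(skip_post_pooling: bool) -> Vec<&'static str> {
    let mut log = Vec::new();

    // `break 'ort` leaves the labeled block early, so the body below the
    // guard keeps its original indentation instead of moving into an `else`.
    'ort: {
        if skip_post_pooling {
            log.push("ort: skipped (post-pooling prediction is Candle-only)");
            break 'ort;
        }
        log.push("ort: started");
    }

    'python: {
        if skip_post_pooling {
            log.push("python: skipped (post-pooling prediction is Candle-only)");
            break 'python;
        }
        log.push("python: started");
    }

    log
}
```

Labeled block expressions are stable Rust, so the guard adds one nesting level for the block itself but no `else` around the existing body.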
The post-pooling prediction loader was nested inside the existing Dense
loop's `if`, pushing the original Dense-loading branch into an `else` and
re-indenting every line. Split it into `if use_post_pooling_prediction {
.. } else if let Some(dense_paths) ..` so the embedding Dense loop is
byte-identical to main, and fold the "requires at least one module" guard
into the new branch.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
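A minimal sketch of the branch split described above, assuming a simplified loader. `LoadedHead` and `load_head` are hypothetical names, and paths are plain strings here rather than whatever types the real loader uses.

```rust
/// Hypothetical result of the head-loading step.
#[derive(Debug, PartialEq)]
enum LoadedHead {
    /// Reranker scoring head built from the post-pooling modules.
    Prediction(Vec<String>),
    /// Embedding Dense projection layers.
    Dense(Vec<String>),
    /// No post-pooling modules at all.
    None,
}

fn load_head(
    use_post_pooling_prediction: bool,
    dense_paths: Option<Vec<String>>,
) -> Result<LoadedHead, String> {
    if use_post_pooling_prediction {
        // The "requires at least one module" guard lives in the new branch.
        let paths = dense_paths.unwrap_or_default();
        if paths.is_empty() {
            return Err("post-pooling prediction requires at least one module".to_string());
        }
        Ok(LoadedHead::Prediction(paths))
    } else if let Some(dense_paths) = dense_paths {
        // Embedding path: same nesting depth as before the new branch existed.
        Ok(LoadedHead::Dense(dense_paths))
    } else {
        Ok(LoadedHead::None)
    }
}
```

Because the new case is a sibling branch rather than a wrapper, the embedding branch body does not move in the diff.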
- Drop the unused `module_input_name`/`module_output_name` fields from the Candle `DenseConfig` (they were `#[allow(dead_code)]`); the detection-time copy lives in the router's `DenseDetectionConfig`.
- Document why `PredictionHeadModule` is kept separate from `DenseLayer` despite the identical signature.
- Reword the detection-config download failure so it is not reranker-specific: the same `transformer -> pooling -> ... -> dense` shape also covers embedding models ending in a Dense projection.
- Apply rustfmt to the prediction-head loader and detection guard so the pre-commit `fmt` hook passes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add the post-pooling-head reranker family (e.g. `cross-encoder/ettin-reranker-*`) to the supported re-rankers list, and note that it is scored by a post-pooling Dense head rather than a `*ForSequenceClassification` head.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cover the post-pooling prediction-head path end to end with `cross-encoder/ettin-reranker-17m-v1`, mirroring the existing `gte-reranker-modernbert` classification test:
- `download_modular_reranker_artifacts` fetches the embedding backbone plus the post-pooling scoring head modules (`2_Dense`, `3_LayerNorm`, `4_Dense`), parsing `modules.json` leniently so both legacy and current module type strings work.
- `test_modernbert_modular_reranker` loads it via `new_with_post_pooling_prediction` (CLS pooling) and snapshots the score.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
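One way to read "parsing `modules.json` leniently" is to classify each module by the last dotted segment of its `type` string, so the package path in front of the class name does not matter. The sketch below assumes that approach; it is not taken from the PR, `ModuleKind` and `module_kind` are hypothetical names, and the thread does not show the exact legacy or current type strings.

```rust
/// Module kinds this sketch recognizes.
#[derive(Debug, PartialEq, Clone, Copy)]
enum ModuleKind {
    Transformer,
    Pooling,
    Dense,
    LayerNorm,
    Unsupported,
}

/// Classify a `modules.json` `type` string by its final dotted segment.
fn module_kind(type_str: &str) -> ModuleKind {
    match type_str.rsplit('.').next().unwrap_or("") {
        "Transformer" => ModuleKind::Transformer,
        "Pooling" => ModuleKind::Pooling,
        "Dense" => ModuleKind::Dense,
        "LayerNorm" => ModuleKind::LayerNorm,
        // Anything else is reported rather than silently ignored.
        _ => ModuleKind::Unsupported,
    }
}
```

For example, `module_kind("sentence_transformers.models.Dense")` and `module_kind("some.other.path.Dense")` (a made-up path) both yield `ModuleKind::Dense`.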
Reword the `PredictionHeadModule` doc comment and the post-pooling loader comment to state what the code is (the reranker scoring head, distinct from the embedding `DenseLayer` projection) rather than narrating which paths were kept separate or left untouched.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop the doc comments added on private/internal helpers (the repo does not `///` private fns or test helpers) and shorten the inline rationale comments, keeping only the non-obvious "why" notes in the terse style used elsewhere.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Link the ettin reranker family collection instead of a single checkpoint in the supported re-rankers table and intro sentence.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Some throughput numbers vs the reference Sentence Transformers Benchmark target:
Both runners used the same GPU (NVIDIA RTX PRO 6000 Blackwell), float16, max_length=512, batch size 512,

Speed

TEI is about 1.5x faster than CrossEncoder + FlashAttention 2 and about 2.5x faster than

Methodology note: the benchmark drives TEI's

Score agreement (full dataset)

All 100,231 pairs, raw scores:

At matched float16 the mean absolute score difference over the full dataset is about 8e-4 with Pearson
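For readers who want to reproduce this kind of score-agreement check, the two statistics named above can be computed as in the sketch below. It assumes two equally long, non-empty slices of raw scores (one per runner); the function names are hypothetical and this is not the benchmark script used in the comment.

```rust
/// Mean absolute difference between two equally long score slices.
fn mean_abs_diff(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum::<f64>() / a.len() as f64
}

/// Pearson correlation coefficient of two equally long score slices.
fn pearson(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len());
    let n = a.len() as f64;
    let mean_a = a.iter().sum::<f64>() / n;
    let mean_b = b.iter().sum::<f64>() / n;
    let (mut cov, mut var_a, mut var_b) = (0.0f64, 0.0f64, 0.0f64);
    for (x, y) in a.iter().zip(b) {
        let (dx, dy) = (x - mean_a, y - mean_b);
        cov += dx * dy;
        var_a += dx * dx;
        var_b += dy * dy;
    }
    // The 1/n factors cancel between numerator and denominator.
    cov / (var_a.sqrt() * var_b.sqrt())
}
```

A small mean absolute difference together with a Pearson value near 1 indicates that the two runners agree both in absolute score and in ranking trend.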
Single-pair parity and the effect of FlashAttention
Verification (selected)
Deployed the fork using

GPU: NVIDIA A10G (23 GB VRAM), CUDA Driver 13.0

ERROR rerank:predict{truncate=true truncation_direction=Right raw_scores=false}: text_embeddings_core::infer: core/src/infer.rs:450: MatMulUnexpectedStriding { lhs_l: Layout { shape: [2, 1024], stride: [31744, 1], start_offset: 0 }, rhs_l: Layout { shape: [1024, 1024], stride: [1, 1024], start_offset: 0 }, bmnk: (1, 2, 1024, 1024), msg: "non-contiguous lhs" }
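To see why the left-hand layout in this error counts as non-contiguous, compare its strides with the row-major strides a densely packed tensor of the same shape would have. The helpers below are a hypothetical sketch and are not Candle's implementation.

```rust
/// Row-major strides a densely packed tensor of this shape would have.
fn contiguous_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}

/// True when `stride` matches the dense row-major layout for `shape`.
fn is_contiguous(shape: &[usize], stride: &[usize]) -> bool {
    contiguous_strides(shape) == stride
}
```

For shape `[2, 1024]` the dense strides are `[1024, 1]`, while the log reports `[31744, 1]`. Since 31744 = 31 x 1024, the layout is consistent with a view that picks one 1024-wide row per sequence out of a `[2, 31, 1024]` tensor without copying, although the thread does not confirm this. Candle tensors expose a `contiguous()` method that materializes such a view; whether that is the appropriate fix here is not established in this conversation.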
Hello! @nikmall

Thanks for the report. I could not test on an A10G directly, but I verified the same Docker CUDA/FA2 build path on my local GPU.

My local environment:

I built the runtime image for my local GPU with FA2 enabled:

docker build -f Dockerfile-cuda \
  --build-arg CUDA_COMPUTE_CAP=120 \
  -t tei-cuda120-fa2-runtime .

Then I started TEI on GPU 0:

docker run -d --name tei-cuda120-fa2-repro \
  --gpus '"device=0"' \
  -p 18083:80 \
  -v ~/.cache/huggingface:/data \
  tei-cuda120-fa2-runtime \
  --model-id cross-encoder/ettin-reranker-17m-v1 \
  --dtype float16 \
  --max-batch-tokens 32768 \
  --max-batch-requests 64

The logs show that the CUDA FlashModernBert backend was used:

I tested

curl -sS -w '\nHTTP %{http_code}\n' 127.0.0.1:18083/predict \
  -X POST -H 'Content-Type: application/json' \
  -d '{"inputs":[["What is Deep Learning?","Deep Learning is not..."],["What is Machine Learning?","Machine learning is a field of AI."]],"truncate":true,"truncation_direction":"Right","raw_scores":false}'

This returned HTTP 200. I also tested a batch of 16 and 32 concurrent

I also confirmed that the A10G-targeted Docker build completes successfully:

docker build -f Dockerfile-cuda \
  --build-arg CUDA_COMPUTE_CAP=86 \
  -t tei-cuda86-fa2-runtime .

I cannot run this

Could you try building and running the A10G image with

docker build -f Dockerfile-cuda \
  --build-arg CUDA_COMPUTE_CAP=86 \
  -t tei-cuda86-fa2-runtime .

docker run --rm --name tei-a10g-repro \
  --gpus '"device=0"' \
  -p 18083:80 \
  -v ~/.cache/huggingface:/data \
  tei-cuda86-fa2-runtime \
  --model-id cross-encoder/ettin-reranker-17m-v1 \
  --dtype float16 \
  --max-batch-tokens 32768 \
  --max-batch-requests 64

curl -sS -w '\nHTTP %{http_code}\n' 127.0.0.1:18083/predict \
  -X POST -H 'Content-Type: application/json' \
  -d '{"inputs":[["What is Deep Learning?","Deep Learning is not..."],["What is Machine Learning?","Machine learning is a field of AI."]],"truncate":true,"truncation_direction":"Right","raw_scores":false}'

If this still fails on A10G with
What does this PR do?
This PR adds support for serving modular Sentence Transformers cross-encoder rerankers such as the ettin-reranker family in TEI.

Unlike the existing `*ForSequenceClassification` rerankers, these score with a post-pooling head:

This format is already part of Sentence Transformers, so more rerankers are likely to ship in this shape. On the same GPU, TEI serves them about 1.5x faster than the Sentence Transformers `CrossEncoder` with numerically equivalent scores (benchmark in a comment below).
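As a rough picture of what a post-pooling head does, the sketch below applies a chain of Dense and LayerNorm modules to each pooled embedding and checks that exactly one score comes out per input. It assumes a Dense -> LayerNorm -> Dense chain, as the `2_Dense`, `3_LayerNorm`, `4_Dense` module names in the test commit suggest, omits activation functions, and uses plain vectors instead of Candle tensors. `HeadModule` and `score` are hypothetical names, not the PR's types.

```rust
/// One post-pooling module. Dense weights are row-major `[out][in]`.
enum HeadModule {
    Dense { weight: Vec<Vec<f32>>, bias: Vec<f32> },
    LayerNorm { gamma: Vec<f32>, beta: Vec<f32>, eps: f32 },
}

impl HeadModule {
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        match self {
            // y = W x + b
            HeadModule::Dense { weight, bias } => weight
                .iter()
                .zip(bias)
                .map(|(row, b)| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>() + b)
                .collect(),
            // y = (x - mean) / sqrt(var + eps) * gamma + beta
            HeadModule::LayerNorm { gamma, beta, eps } => {
                let n = x.len() as f32;
                let mean = x.iter().sum::<f32>() / n;
                let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
                let inv_std = 1.0 / (var + eps).sqrt();
                x.iter()
                    .zip(gamma.iter().zip(beta))
                    .map(|(v, (g, b))| (v - mean) * inv_std * g + b)
                    .collect()
            }
        }
    }
}

/// Run the head over pooled embeddings and enforce a `[batch, 1]` result.
fn score(head: &[HeadModule], pooled: &[Vec<f32>]) -> Result<Vec<f32>, String> {
    pooled
        .iter()
        .map(|emb| {
            let out = head.iter().fold(emb.clone(), |x, m| m.forward(&x));
            match out.as_slice() {
                [s] => Ok(*s),
                other => Err(format!("expected 1 score per input, got {}", other.len())),
            }
        })
        .collect()
}
```

The sketch loops over rows for readability; a tensor backend would run the same chain as batched matrix operations.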
Changes
- `BackendOutput::{Predict, Embed}` (split from `ModelType`) so an embedding backbone can be routed to predict: a reranker is `ModelType::Embedding(pool)` + `BackendOutput::Predict`. Existing embedding backbones load unchanged, and `Backend::new`/`CandleBackend::new` keep their signatures.
- `modules.json` + the final Dense config (`out_features == 1`, output `"scores"`); default single-label map when `id2label`/`label2id` are missing; reject unsupported post-pooling modules.
- `Dense`/`LayerNorm`), run it on the pooled embeddings in one batch, validate the `[batch, 1]` score shape. Post-pooling prediction is Candle-only (ORT/Python skip with a clear log).
- `rope_parameters` and nullable `eos/bos_token_id`, extending the pattern from "Set `pad_token_id` as nullable & add support for `rope_parameters`" #832 (which covered GTE/Qwen/Gemma/...) to ModernBert, which ettin requires.

Testing
- `modules.json` parsing (legacy + current module strings), and a Candle integration test with an `insta` snapshot (`cross-encoder/ettin-reranker-17m-v1`).
- `CrossEncoder` (details in a comment below).

Happy to adjust the design/naming (e.g. the `BackendOutput` name) based on feedback.

Before submitting

- `insta` snapshots?

Who can review?
cc @alvarobartt @Narsil — happy to take feedback