OPENNLP-1836: Fix input encoding in SentenceVectorsDL #1072

Open
krickert wants to merge 2 commits into apache:main from ai-pipestream:OPENNLP-1836

@krickert

Contributor

See https://issues.apache.org/jira/browse/OPENNLP-1836

SentenceVectorsDL sent an all-zero attention_mask and all-one token_type_ids to the ONNX model, so the encoder attended to nothing. This change switches to the standard single-segment BERT encoding (attention_mask = 1, token_type_ids = 0), consistent with DocumentCategorizerDL, and additionally:

  • closes the OnnxTensor inputs and OrtSession.Result (native memory leak)
  • replaces the NPE on a vocabulary miss with a descriptive IllegalArgumentException
  • adds a unit test for the encoding (tokenize is now package-private static, no ONNX session needed)
  • updates SentenceVectorsDLEval expectations
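
The encoding and the vocabulary-miss handling described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the class `SingleSegmentEncodingSketch`, the `Encoded` holder, the `encode` method, and the vocabulary ids are all hypothetical, and the input is assumed to be already split into word pieces. It also assumes the missing entry shows up as a `null` map lookup, which is checked explicitly.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and ids are illustrative, not taken from OpenNLP.
final class SingleSegmentEncodingSketch {

    /** The three parallel arrays a BERT-style model takes as input. */
    static final class Encoded {
        final long[] inputIds;
        final long[] attentionMask;
        final long[] tokenTypeIds;

        Encoded(long[] inputIds, long[] attentionMask, long[] tokenTypeIds) {
            this.inputIds = inputIds;
            this.attentionMask = attentionMask;
            this.tokenTypeIds = tokenTypeIds;
        }
    }

    /** Encodes one unpadded sentence as a single segment. */
    static Encoded encode(List<String> wordPieces, Map<String, Integer> vocab) {
        long[] ids = new long[wordPieces.size()];
        for (int i = 0; i < ids.length; i++) {
            String piece = wordPieces.get(i);
            Integer id = vocab.get(piece);
            if (id == null) {
                // Fail with a message that names the token instead of an NPE.
                throw new IllegalArgumentException(
                    "Token '" + piece + "' is not in the vocabulary; "
                        + "check that the vocabulary file matches the model");
            }
            ids[i] = id;
        }

        // Every position holds a real token, so the model should attend to all of them.
        long[] mask = new long[ids.length];
        Arrays.fill(mask, 1L);

        // Single segment: all type ids stay 0, which is the array default.
        long[] types = new long[ids.length];

        return new Encoded(ids, mask, types);
    }

    public static void main(String[] args) {
        Map<String, Integer> vocab = Map.of("[CLS]", 101, "hello", 7592, "[SEP]", 102);
        Encoded e = encode(List.of("[CLS]", "hello", "[SEP]"), vocab);
        System.out.println(Arrays.toString(e.inputIds));      // [101, 7592, 102]
        System.out.println(Arrays.toString(e.attentionMask)); // [1, 1, 1]
        System.out.println(Arrays.toString(e.tokenTypeIds));  // [0, 0, 0]
    }
}
```

Because the sketch needs no ONNX session, it can be exercised directly from a unit test, which is the same property the package-private static `tokenize` is meant to provide.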

Eval values were verified empirically: the unfixed code reproduces the previously pinned values exactly against the public sentence-transformers/all-MiniLM-L6-v2 ONNX export, and the corrected encoding produces the new pinned values (dimension 384).

Note: this is a behavioral fix. Vectors persisted from the old encoding are not comparable with the corrected output and should be re-embedded.

SentenceVectorsDL sent an all-zero attention_mask and all-one
token_type_ids, so the model attended to nothing. Use the standard
single-segment BERT encoding (mask=1, types=0), consistent with
DocumentCategorizerDL. Also close OnnxTensor/Result resources, replace
the NPE on a vocabulary miss with a descriptive exception, add a unit
test for the encoding, and update the eval test expectations (verified
against the same MiniLM ONNX export). Vectors produced by the previous
encoding are not comparable with the corrected output.

Copilot AI left a comment

Pull request overview

Fixes SentenceVectorsDL’s ONNX input encoding so sentence-transformer models receive standard single-segment BERT inputs (attention mask = 1 for real tokens, token type ids = 0), aligning behavior with other DL components and updating expected eval outputs accordingly.

Changes:

  • Corrects SentenceVectorsDL token encoding (mask/types) and improves vocabulary-miss handling with a descriptive exception.
  • Prevents native-memory leaks by closing ONNX tensors and OrtSession.Result.
  • Adds unit tests for tokenization/encoding and updates SentenceVectorsDLEval pinned vector expectations.
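
The resource-closing pattern can be illustrated with try-with-resources. The sketch below deliberately avoids the ONNX Runtime dependency: `NativeHandle` is a hypothetical stand-in for any `AutoCloseable` native-backed object (such as the input tensors and the inference result), and `infer`, `run`, and the `closed` log are invented for the example. It is not the code from this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; NativeHandle stands in for native-backed AutoCloseable objects.
final class CloseNativeResourcesSketch {

    /** Records the order in which handles are closed, for observation only. */
    static final List<String> closed = new ArrayList<>();

    static final class NativeHandle implements AutoCloseable {
        private final String name;

        NativeHandle(String name) {
            this.name = name;
        }

        @Override
        public void close() {
            closed.add(name);
        }
    }

    /** Simulates running inference; optionally fails before a result exists. */
    static NativeHandle run(boolean fail) {
        if (fail) {
            throw new IllegalStateException("simulated inference failure");
        }
        return new NativeHandle("result");
    }

    static float[] infer(boolean failDuringRun) {
        // Resources are closed in reverse declaration order, even if run() throws.
        try (NativeHandle ids = new NativeHandle("input_ids");
             NativeHandle mask = new NativeHandle("attention_mask");
             NativeHandle types = new NativeHandle("token_type_ids");
             NativeHandle result = run(failDuringRun)) {
            // Copy the values out to a plain array before the handles are released.
            return new float[] {0.0f};
        }
    }

    public static void main(String[] args) {
        infer(false);
        System.out.println(closed); // [result, token_type_ids, attention_mask, input_ids]
    }
}
```

Declaring every handle in the resource header means the inputs are released even when inference throws, which manual `close()` calls at the end of a method would not guarantee.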

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| opennlp-eval-tests/src/test/java/opennlp/dl/vectors/SentenceVectorsDLEval.java | Updates pinned expected vector values for the corrected encoding. |
| opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/vectors/SentenceVectorsDLTest.java | Adds unit tests validating single-segment BERT encoding and vocabulary/UNK behavior. |
| opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/vectors/SentenceVectorsDL.java | Fixes mask/types encoding, closes ONNX resources, and improves vocab-mismatch error reporting. |

Comment thread opennlp-eval-tests/src/test/java/opennlp/dl/vectors/SentenceVectorsDLEval.java Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@krickert

Contributor Author

I ran Copilot against this. It didn't do a bad job: its only finding was that the expected and actual values were reversed.

@mawiesne requested a review from jzonthemtn June 10, 2026 18:04
@mawiesne changed the title from "OPENNLP-1836 - Fix input encoding in SentenceVectorsDL" to "OPENNLP-1836: Fix input encoding in SentenceVectorsDL" Jun 10, 2026