OPENNLP-1836: Fix input encoding in SentenceVectorsDL by krickert · Pull Request #1072 · apache/opennlp

krickert · 2026-06-10T11:59:31Z

See https://issues.apache.org/jira/browse/OPENNLP-1836

SentenceVectorsDL sent an all-zero attention_mask and all-one token_type_ids to the ONNX model, so the encoder attended to nothing. This fixes the encoding to the standard single-segment BERT convention (mask=1, types=0), consistent with DocumentCategorizerDL, and additionally:

- closes the OnnxTensor inputs and OrtSession.Result (native memory leak)
- replaces the NPE on a vocabulary miss with a descriptive IllegalArgumentException
- adds a unit test for the encoding (tokenize is now package-private static, no ONNX session needed)
- updates SentenceVectorsDLEval expectations
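
The encoding described above can be sketched in plain Java. This is a minimal illustration and not the code from this PR: the class name `SingleSegmentEncoding`, the `Encoded` holder, the `encode` method, and the tiny vocabulary with its ids are all hypothetical. It only shows the single-segment convention (mask=1, types=0) and a descriptive exception on a vocabulary miss.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; names and vocabulary ids are hypothetical. */
public class SingleSegmentEncoding {

    /** Holds the three input arrays a BERT-style model expects for one sentence. */
    public static final class Encoded {
        public final long[] inputIds;
        public final long[] attentionMask;
        public final long[] tokenTypeIds;

        Encoded(long[] inputIds, long[] attentionMask, long[] tokenTypeIds) {
            this.inputIds = inputIds;
            this.attentionMask = attentionMask;
            this.tokenTypeIds = tokenTypeIds;
        }
    }

    /** Maps word pieces to ids and builds mask=1 / types=0 arrays of the same length. */
    public static Encoded encode(List<String> wordPieces, Map<String, Integer> vocab) {
        long[] ids = new long[wordPieces.size()];
        for (int i = 0; i < ids.length; i++) {
            String piece = wordPieces.get(i);
            Integer id = vocab.get(piece);
            if (id == null) {
                // Fail with a message instead of a NullPointerException on unboxing.
                throw new IllegalArgumentException(
                    "Token '" + piece + "' is not in the vocabulary; "
                        + "check that the vocabulary file matches the model");
            }
            ids[i] = id;
        }
        long[] mask = new long[ids.length];
        Arrays.fill(mask, 1L);                 // every real token is attended to
        long[] types = new long[ids.length];   // Java zero-initializes: single segment
        return new Encoded(ids, mask, types);
    }

    public static void main(String[] args) {
        // Made-up vocabulary, for illustration only.
        Map<String, Integer> vocab = Map.of("[CLS]", 1, "hello", 2, "[SEP]", 3);
        Encoded e = encode(List.of("[CLS]", "hello", "[SEP]"), vocab);
        System.out.println(Arrays.toString(e.attentionMask)); // prints [1, 1, 1]
        System.out.println(Arrays.toString(e.tokenTypeIds));  // prints [0, 0, 0]
    }
}
```

Keeping the encoding in a static method that needs no model session is what makes it easy to unit test, which matches the PR's note that tokenize is now package-private static.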

Eval values were verified empirically: the unfixed code reproduces the previously pinned values exactly against the public sentence-transformers/all-MiniLM-L6-v2 ONNX export, and the corrected encoding produces the new pinned values (dimension 384).

Note: this is a behavioral fix - vectors persisted from the old encoding are not comparable with the corrected output and should be re-embedded.

SentenceVectorsDL sent an all-zero attention_mask and all-one token_type_ids, so the model attended to nothing. Use the standard single-segment BERT encoding (mask=1, types=0), consistent with DocumentCategorizerDL. Also close OnnxTensor/Result resources, replace the NPE on a vocabulary miss with a descriptive exception, add a unit test for the encoding, and update the eval test expectations (verified against the same MiniLM ONNX export). Vectors produced by the previous encoding are not comparable with the corrected output.

Copilot

Pull request overview

Fixes SentenceVectorsDL’s ONNX input encoding so sentence-transformer models receive standard single-segment BERT inputs (attention mask = 1 for real tokens, token type ids = 0), aligning behavior with other DL components and updating expected eval outputs accordingly.

Changes:

- Corrects SentenceVectorsDL token encoding (mask/types) and improves vocabulary-miss handling with a descriptive exception.
- Prevents native-memory leaks by closing ONNX tensors and OrtSession.Result.
- Adds unit tests for tokenization/encoding and updates SentenceVectorsDLEval pinned vector expectations.
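
The resource-closing change can be illustrated with try-with-resources. The sketch below does not use the ONNX Runtime API: `FakeNative` is a hypothetical stand-in for any `AutoCloseable` that owns native memory, and `runInference` is an invented method name. It only demonstrates that every resource declared in the `try` header is closed, in reverse order, whether the body returns normally or throws.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; FakeNative stands in for native-backed resources. */
public class CloseNativeResources {

    /** Records the order in which resources were closed. */
    public static final List<String> closed = new ArrayList<>();

    /** Hypothetical stand-in for an object holding native memory. */
    public static final class FakeNative implements AutoCloseable {
        private final String name;

        FakeNative(String name) {
            this.name = name;
        }

        @Override
        public void close() {
            closed.add(name);
        }
    }

    /** Opens three inputs and a result; all four are closed on every exit path. */
    public static int runInference(boolean fail) {
        try (FakeNative ids = new FakeNative("input_ids");
             FakeNative mask = new FakeNative("attention_mask");
             FakeNative types = new FakeNative("token_type_ids");
             FakeNative result = new FakeNative("result")) {
            if (fail) {
                throw new IllegalStateException("simulated inference failure");
            }
            return 4; // number of resources opened in this call
        }
    }

    public static void main(String[] args) {
        runInference(false);
        System.out.println(closed); // prints [result, token_type_ids, attention_mask, input_ids]
    }
}
```

Without the `try` header, an exception thrown between creating a tensor and closing it would skip the close call, which is the kind of leak the change above is meant to prevent.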

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| opennlp-eval-tests/src/test/java/opennlp/dl/vectors/SentenceVectorsDLEval.java | Updates pinned expected vector values for the corrected encoding. |
| opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/vectors/SentenceVectorsDLTest.java | Adds unit tests validating single-segment BERT encoding and vocabulary/UNK behavior. |
| opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/vectors/SentenceVectorsDL.java | Fixes mask/types encoding, closes ONNX resources, and improves vocab-mismatch error reporting. |

krickert · 2026-06-10T15:50:57Z

Ran Copilot against this. It did a decent job: its only finding was that the expected and actual values were reversed.
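
Why the argument order matters can be shown with a small sketch. The helper `assertClose` below is hypothetical; it only mirrors the (expected, actual, delta) argument order and failure wording of JUnit's `assertEquals`, and the two numbers are made up rather than taken from the eval test. A reversed call still passes or fails correctly, but the failure message then labels the pinned value as the model output.

```java
/** Illustrative sketch only; assertClose is a hypothetical helper, not JUnit. */
public class ExpectedActualOrder {

    /** Fails when the two values differ by more than delta. */
    public static void assertClose(double expected, double actual, double delta) {
        if (Math.abs(expected - actual) > delta) {
            throw new AssertionError(
                "expected: <" + expected + "> but was: <" + actual + ">");
        }
    }

    public static void main(String[] args) {
        double pinned = 0.25;   // made-up pinned value
        double computed = 0.5;  // made-up model output
        try {
            assertClose(pinned, computed, 1e-6); // correct order
        } catch (AssertionError e) {
            System.out.println(e.getMessage()); // prints expected: <0.25> but was: <0.5>
        }
        try {
            assertClose(computed, pinned, 1e-6); // reversed order
        } catch (AssertionError e) {
            System.out.println(e.getMessage()); // prints expected: <0.5> but was: <0.25>
        }
    }
}
```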

krickert requested review from Copilot, mawiesne and rzo1 June 10, 2026 12:02

Copilot started reviewing on behalf of krickert June 10, 2026 13:27

Copilot AI reviewed Jun 10, 2026

Comment thread opennlp-eval-tests/src/test/java/opennlp/dl/vectors/SentenceVectorsDLEval.java Outdated

Potential fix for pull request finding

15fc495

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

mawiesne requested a review from jzonthemtn June 10, 2026 18:04

mawiesne changed the title ~~OPENNLP-1836 - Fix input encoding in SentenceVectorsDL~~ OPENNLP-1836: Fix input encoding in SentenceVectorsDL Jun 10, 2026

krickert wants to merge 2 commits into apache:main from ai-pipestream:OPENNLP-1836
