Fatal JVM crash (Access Violation) on Windows during StackBatchifier.batchify with OnnxRuntime #3859

Description

@HaibaraAi2517

DJL OnnxRuntime Native Crash on Windows

| Severity | Reproducibility | Discovered |
| -- | -- | -- |
| Critical | Deterministic | 2026-05-28 |

Controlled Diagnostics (How we isolated it)

To rule out a broken model or a misconfigured environment, we ran a staged isolation test on the exact same Windows machine:

  1. Stage 1: Tokenizer Only -> SUCCESS. HuggingFace Tokenizer encodes text into tokens perfectly.

  2. Stage 2: Native ONNX Runtime Java API -> SUCCESS. Bypassing DJL and feeding the token IDs directly into Microsoft's official ai.onnxruntime.OrtSession produces the 512-dimensional embedding vectors. This indicates that the model.onnx file and the core ORT native binaries work correctly on Windows.

  3. Stage 3: DJL Predictor (Single/Batch Inference) -> FATAL CRASH. The moment Predictor.predict() is called, the JVM dies immediately.
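
Because a native access violation kills the whole JVM, a staged test like the one above is easier to automate when each stage runs in its own child process. The following is a minimal sketch of that idea, not the harness we actually used: the class `StageIsolation` and the stage entry points `TokenizerOnly`, `OrtDirect`, and `DjlPredictor` are hypothetical names, and it assumes a `java` executable on the PATH.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class StageIsolation {

    /** Runs one stage as a separate process and returns its exit code. */
    static int runInChild(List<String> command) {
        try {
            Process child = new ProcessBuilder(command)
                    .redirectErrorStream(true)                        // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // avoid blocking on unread output
                    .start();
            return child.waitFor();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for stage", e);
        }
    }

    /** Returns the name of the first stage whose exit code is non-zero, if any. */
    static Optional<String> firstFailingStage(Map<String, Integer> exitCodes) {
        for (Map.Entry<String, Integer> entry : exitCodes.entrySet()) {
            if (entry.getValue() != 0) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical main classes, one per stage; replace with real entry points.
        String[] stages = {"TokenizerOnly", "OrtDirect", "DjlPredictor"};
        String classpath = System.getProperty("java.class.path");
        Map<String, Integer> results = new LinkedHashMap<>(); // keeps stage order
        for (String stage : stages) {
            results.put(stage, runInChild(List.of("java", "-cp", classpath, stage)));
        }
        System.out.println(firstFailingStage(results)
                .map(name -> "first failing stage: " + name)
                .orElse("all stages passed"));
    }
}
```

With this layout a crash in one stage only shows up as a non-zero exit code of that child process, so the harness itself survives and can report which stage failed.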

Fatal Linkage Stacktrace

The generated hs_err_pid.log consistently highlights the following native call stack:

    [tokenizers.dll + 0xXXXXXX] (or similar native library boundaries)
    ai.djl.engine.rust.RustLibrary.tensorOf(...)
    ai.djl.ndarray.NDArrays.stack(...)
    ai.djl.translate.StackBatchifier.batchify(...)
    ai.djl.inference.Predictor.predict(...)
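
For anyone comparing several crash logs, the native frame can be pulled out of an hs_err file programmatically. The sketch below relies on the `# Problematic frame:` header that HotSpot writes near the top of its fatal error logs and returns the line that follows it; the class name `HsErrFrame` is a hypothetical helper, not part of DJL or the JDK.

```java
import java.util.List;
import java.util.Optional;

class HsErrFrame {

    /**
     * Returns the line following "# Problematic frame:" with the leading
     * "#" marker removed, or empty if the header is not present.
     */
    static Optional<String> problematicFrame(List<String> lines) {
        for (int i = 0; i + 1 < lines.size(); i++) {
            if (lines.get(i).trim().equals("# Problematic frame:")) {
                String frame = lines.get(i + 1).replaceFirst("^#\\s*", "").trim();
                return Optional.of(frame);
            }
        }
        return Optional.empty();
    }
}
```

The lines can be loaded with `java.nio.file.Files.readAllLines(java.nio.file.Path.of("hs_err_pid46252.log"))` and passed to `HsErrFrame.problematicFrame`.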

Expected behavior

If tensor stacking hits an unsupported memory alignment or stride on Windows, DJL should validate the pointer and throw a Java-level exception such as TranslateException or IndexOutOfBoundsException. An unmanaged out-of-bounds pointer or access violation should not be able to forcefully terminate the whole JVM process.
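
To illustrate the kind of managed-side check meant here, the sketch below validates the shapes of the arrays about to be stacked and throws a Java exception before anything is handed to native code. It is only an illustration under assumptions: this report does not establish that a shape mismatch is the actual cause, `StackPrecheck` is a hypothetical helper rather than DJL code, and shapes are modeled as plain `long[]` values.

```java
import java.util.Arrays;
import java.util.List;

class StackPrecheck {

    /**
     * Throws IllegalArgumentException unless every shape equals the first one,
     * which is the precondition for stacking arrays along a new axis.
     */
    static void requireSameShape(List<long[]> shapes) {
        if (shapes.isEmpty()) {
            throw new IllegalArgumentException("nothing to stack");
        }
        long[] first = shapes.get(0);
        for (int i = 1; i < shapes.size(); i++) {
            if (!Arrays.equals(first, shapes.get(i))) {
                throw new IllegalArgumentException(
                        "shape mismatch at index " + i
                                + ": expected " + Arrays.toString(first)
                                + " but got " + Arrays.toString(shapes.get(i)));
            }
        }
    }
}
```

A failure of this kind surfaces as an ordinary exception that the caller can catch and log, which is the behavior requested above instead of a process-level crash.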

Additional context

It seems that during StackBatchifier.batchify, when the multi-dimensional token arrays are stacked and passed to the Rust-side NDArray via RustLibrary.tensorOf, an unaligned memory access or invalid pointer offset triggers an OS protection fault (EXCEPTION_ACCESS_VIOLATION) on Windows. This might be related to how Windows handles memory strides for primitive types compared with Linux/macOS.

hs_err_pid46252.log
