Support DELTA_BINARY_PACKED integer encoding#186
Conversation
Signed-off-by: Puneet Dixit <236133619+puneetdixit200@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds support for Parquet’s DELTA_BINARY_PACKED encoding and updates the reference fixtures/tests accordingly.
Changes:
- Implement
DELTA_BINARY_PACKEDencode/decode codec and export it from the codec index. - Expand reference “first record” fixtures and adjust unsupported reference-test expectations.
- Add unit tests covering core
DELTA_BINARY_PACKEDencode/decode scenarios.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| test/reference-test/read-all.test.ts | Updates unsupported reference files list to reflect newly supported encodings. |
| test/reference-test/first-record.ts | Adds expected first-record fixtures for additional reference parquet files. |
| test/codec_delta_binary_packed.test.js | Adds unit tests for DELTA_BINARY_PACKED codec behavior. |
| lib/declare.ts | Extends ParquetCodec union type to include DELTA_BINARY_PACKED. |
| lib/codec/index.ts | Exports the new DELTA_BINARY_PACKED codec module. |
| lib/codec/delta_binary_packed.ts | Introduces DELTA_BINARY_PACKED implementation (varints, zigzag, bit-packing). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
This is an ok first pass, but it cannot be merged as-is.
tl;dr:
- Block layout validation is too permissive relative to the spec.
- Decoder does not validate mini-block bit widths.
- Potential denial-of-service risks from attacker-controlled block sizes / mini-block counts / bit widths.
- INT32 delta arithmetic appears to wrap deltas to 32-bit signed values during encoding, which is likely non-canonical and may harm interoperability.
- Bit-packing byte/bit order should be verified carefully against the Parquet spec and independent test vectors.
- The feature is not reflected in README supported encodings.
- Tests are helpful but insufficient for malformed input, boundary values, and interoperability.
In detail:
Correctness findings
- blockSize validation does not enforce Parquet’s layout requirements
The implementation validates only:
blockSize > 0
miniBlockCount > 0
blockSize % miniBlockCount === 0But the Parquet Delta Binary Packed encoding specification states that:
- the block size must be a multiple of 128
- the block is subdivided into mini-blocks
- each mini-block should contain a number of values that is a multiple of 32
- The current validateBlockLayout() would accept layouts such as:
blockSize = 8
miniBlockCount = 2
valuesPerMiniBlock = 4That is even used in the added test:
const opts = {
deltaBinaryPackedBlockSize: 8,
deltaBinaryPackedMiniBlockCount: 2,
};This is convenient for small tests, but it produces non-standard Parquet delta binary packed streams. Other readers may reject those files. This PR should enforce spec-valid layouts for production encoding/decoding. If small test layouts are desired, keep them behind a private/internal test-only helper, not public options. Suggested validation rules:
blockSize % 128 === 0
blockSize % miniBlockCount === 0
(blockSize / miniBlockCount) % 32 === 0- Mini-block bit widths are not validated
In decoding, each bit width byte is read directly from the data page:
const bitWidths = cursor.buffer.subarray(cursor.offset, cursor.offset + miniBlockCount);Then passed to:
unpackMiniBlock(cursor, valuesPerMiniBlock, bitWidth)There is no check that the bit width is valid for the physical type.
For Delta Binary Packed:
INT32 adjusted deltas should not require more than roughly 32 bits in normal canonical encoding.
INT64 adjusted deltas should not require more than 64 bits.
A bit width byte like 255 is malformed and should be rejected immediately.
Currently, a malformed page can cause large BigInt shifts and large computed byte lengths:
const byteLength = Math.ceil((count * bitWidth) / 8);This is a correctness issue and also a robustness/security issue.
The PR should reject bit widths greater than the physical type’s maximum usable width.
Example logic:
const maxBitWidth = type === 'INT32' ? 32 : 64;
for (const bitWidth of bitWidths) {
if (bitWidth > maxBitWidth) {
throw new Error('invalid DELTA_BINARY_PACKED encoding');
}
}- Decoder accepts non-spec block layouts from untrusted files.
Because the decoder uses the same permissive layout validation as the encoder, a malicious or malformed Parquet file can declare very large or strange layouts, for example:
- enormous blockSize
- enormous miniBlockCount
- valuesPerMiniBlock values that are not spec-valid
- huge mini-block count causing a large bit-width table read
Some size limits are indirectly constrained by the input buffer length, but not all resource usage is bounded early. Please add explicit upper bounds and spec checks before allocating or looping. At minimum this must:
- reject non-spec block layout
- reject miniBlockCount > blockSize
- reject valuesPerMiniBlock <= 0
- reject valuesPerMiniBlock that is not an integer multiple of 32
- consider practical maximums for blockSize and miniBlockCount
- Encoder wraps INT32 deltas to signed 32-bit values
During encoding:
deltas.push(toSignedInteger(normalizedValues[i] - normalizedValues[i - 1], typeBitWidth));For INT32, this wraps the delta into signed 32-bit range.
That means a transition like:
2147483647 -> -2147483648
has a mathematical delta of:
-4294967295
but the encoder will turn it into:
+1
The local decoder then also wraps accumulated values, so round-trips may pass inside this implementation. However, this is likely not canonical Delta Binary Packed encoding and may produce files that other Parquet readers decode differently or reject.
The spec describes storing deltas between consecutive values, then storing the minimum delta and adjusted deltas. It does not describe computing deltas modulo the physical type width.
Instead: compute deltas as mathematical bigint differences, not wrapped differences. Only validate final output values against the physical type range.
For example:
deltas.push(normalizedValues[i] - normalizedValues[i - 1]);Then ensure bit-width calculation supports the possible adjusted delta range. For INT32 values, adjusted deltas may need more than 32 bits in edge cases if using mathematical deltas.
- Bit-packing order needs independent verification
The implementation packs bits MSB-first within the output byte stream:
accumulator = (accumulator << BigInt(bitWidth)) | value;and unpacks similarly.
Delta Binary Packed uses bit-packed mini-block values. The exact bit order must match Parquet’s bit-packing definition. The added codec tests mostly verify that the encoder and decoder agree with each other, which is not enough to prove spec compatibility.
The included hand-written test vector is useful, but it would be safer to compare against:
- Apache Parquet Java-generated data
- parquet-mr decoding
- parquet-cpp / Arrow decoding
- official examples if available
- The reference test addition for delta_binary_packed.parquet helps, but I would still add explicit low-level vectors for bit widths such as 1, 2, 3, 7, 8, 9, 31, 32, 63, and 64.
- Possible bug: bitWidths[miniBlockIndex] = bitWidth can silently truncate
bitWidths is a Buffer, so assignments are byte assignments:
bitWidths[miniBlockIndex] = bitWidth;If bitWidth were greater than 255, Node will coerce/truncate. In normal valid cases that should not happen, but given mathematical deltas it could exceed expected widths if layout or type assumptions are wrong.
Explicitly validate before assignment, and include a test for same:
if (bitWidth > 64) {
throw new Error('invalid DELTA_BINARY_PACKED bit width');
}or type-specific maximums.
- The decoder requires totalValueCount === count
The decoder reads the total value count from the Delta Binary Packed stream and requires it to match the externally requested count:
if (totalValueCount !== count) {
throw new Error(...)
}This is probably fine if the surrounding page decoder always passes the exact page value count. However, if this codec is ever used for partial reads, page slicing, or repeated/definition level handling in a way where the caller’s count is not exactly the encoded count, this strict check could become a compatibility issue.
You should confirm the call sites always pass the physical encoded value count for the page.
Security / robustness findings
- Malformed files can trigger expensive BigInt work
Because bitWidth is attacker-controlled and unbounded, a malicious page can force large BigInt shifts and masking:
bitMask(bitWidth)with values up to 255 from a single byte. That is not catastrophic by itself, but combined with large block sizes and many values, it can amplify CPU usage.
Severity: medium for untrusted Parquet files.
Fix: validate bit widths early.
- Malformed files can force large allocations / loops
The decoder trusts:
blockSize
miniBlockCount
totalValueCountfrom the encoded stream, subject only to Number.MAX_SAFE_INTEGER and basic divisibility.
A malicious file could specify large values that lead to:
- large arrays from new Array(count).fill(0n) when bitWidth === 0
- many loop iterations
- large bit-width metadata reads
- large output arrays
Some of this is inherent when decoding a page with many values, but the implementation should reject impossible or non-spec layouts before doing work.
Severity: medium for applications that read untrusted Parquet files.
Fix: enforce spec layout and practical bounds.
- Varint length cap may be too loose but is directionally good
readUnsignedVarint() rejects after:
if (shift > 70n)This prevents unbounded varints.
For Parquet values here, tighter limits could be applied depending on field:
block size and mini-block count should be bounded much lower
value count should be bounded by page metadata / requested count
first value and min delta should be bounded according to expected numeric width or a reasonable signed delta range
Test coverage gaps
The following should be added.
- Recommended additional tests
- Spec-valid default layout only
- Avoid promoting invalid tiny layouts as supported public behavior.
- Boundary INT32 values
- INT32_MIN
- INT32_MAX
- transition INT32_MAX -> INT32_MIN
- transition INT32_MIN -> INT32_MAX
- Boundary INT64 values
- INT64_MIN
- INT64_MAX
- large positive/negative deltas
- transitions across extremes
- Malformed bit width
- INT32 stream with bit width 33
- INT64 stream with bit width 65
- bit width 255
- Malformed block layout
- block size not multiple of 128
- mini-block value count not multiple of 32
- mini-block count larger than block size
- zero mini-block count
- huge mini-block count
- Truncated streams
- truncated header
- truncated first value
- truncated min delta
- truncated bit-width table
- truncated mini-block payload
- Independent interoperability vectors
- encode using this library, decode with Apache Parquet Java / Arrow
- encode with Apache Parquet Java / Arrow, decode with this library
- [ ]
Documentation
Please also update the README to include DELTA_BINARY_PACKED as a supported encoding.
Summary of Required changes:
- Enforce Parquet block-layout constraints.
- Validate decoded mini-block bit widths.
- Add malformed-input tests.
- Confirm/fix INT32/INT64 delta arithmetic semantics.
- Verify bit-packing order against an independent implementation.
- Add practical resource limits for decoded block metadata.
- Add README support table updates.
- Add interoperability tests using known-good Parquet files.
- Avoid test-only invalid block sizes in public codec options.
Fixes #100
Summary
Testing
npm_config_cache=$PWD/.npm-cache npm cinpm run test:only -- test/codec_delta_binary_packed.test.js test/reference-test/read-all.test.tsnpm run typenpm run lintgit diff --checkNote:
npm testreached 306 passing / 2 pending, then timed out intest/int96_timestamp.test.js; that same INT96 timeout reproduces when runningnpm run test:only -- test/int96_timestamp.test.jsby itself, outside this DELTA_BINARY_PACKED change.