Skip to content

Support DELTA_BINARY_PACKED integer encoding#186

Open
puneetdixit200 wants to merge 2 commits into
frequency-chain:mainfrom
puneetdixit200:100-delta-binary-packed
Open

Support DELTA_BINARY_PACKED integer encoding#186
puneetdixit200 wants to merge 2 commits into
frequency-chain:mainfrom
puneetdixit200:100-delta-binary-packed

Conversation

@puneetdixit200

Copy link
Copy Markdown

Fixes #100

Summary

  • add DELTA_BINARY_PACKED codec support for INT32 and INT64 values
  • wire the codec into schema/read/write encoding validation
  • add codec coverage and enable reference reads for delta-binary-packed fixtures

Testing

  • npm_config_cache=$PWD/.npm-cache npm ci
  • npm run test:only -- test/codec_delta_binary_packed.test.js test/reference-test/read-all.test.ts
  • npm run type
  • npm run lint
  • git diff --check

Note: npm test reached 306 passing / 2 pending, then timed out in test/int96_timestamp.test.js; that same INT96 timeout reproduces when running npm run test:only -- test/int96_timestamp.test.js by itself, outside this DELTA_BINARY_PACKED change.

Signed-off-by: Puneet Dixit <236133619+puneetdixit200@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 21, 2026 14:57

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds support for Parquet’s DELTA_BINARY_PACKED encoding and updates the reference fixtures/tests accordingly.

Changes:

  • Implement DELTA_BINARY_PACKED encode/decode codec and export it from the codec index.
  • Expand reference “first record” fixtures and adjust unsupported reference-test expectations.
  • Add unit tests covering core DELTA_BINARY_PACKED encode/decode scenarios.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
test/reference-test/read-all.test.ts Updates unsupported reference files list to reflect newly supported encodings.
test/reference-test/first-record.ts Adds expected first-record fixtures for additional reference parquet files.
test/codec_delta_binary_packed.test.js Adds unit tests for DELTA_BINARY_PACKED codec behavior.
lib/declare.ts Extends ParquetCodec union type to include DELTA_BINARY_PACKED.
lib/codec/index.ts Exports the new DELTA_BINARY_PACKED codec module.
lib/codec/delta_binary_packed.ts Introduces DELTA_BINARY_PACKED implementation (varints, zigzag, bit-packing).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/codec/delta_binary_packed.ts
Comment thread lib/codec/delta_binary_packed.ts Outdated
Comment thread lib/codec/delta_binary_packed.ts
Comment thread lib/codec/delta_binary_packed.ts Outdated
Comment thread lib/codec/delta_binary_packed.ts Outdated
Comment thread test/codec_delta_binary_packed.test.js

@shannonwells shannonwells left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an ok first pass, but it cannot be merged as-is.

tl;dr:

  • Block layout validation is too permissive relative to the spec.
  • Decoder does not validate mini-block bit widths.
  • Potential denial-of-service risks from attacker-controlled block sizes / mini-block counts / bit widths.
  • INT32 delta arithmetic appears to wrap deltas to 32-bit signed values during encoding, which is likely non-canonical and may harm interoperability.
  • Bit-packing byte/bit order should be verified carefully against the Parquet spec and independent test vectors.
  • The feature is not reflected in README supported encodings.
  • Tests are helpful but insufficient for malformed input, boundary values, and interoperability.

In detail:

Correctness findings

  1. blockSize validation does not enforce Parquet’s layout requirements
    The implementation validates only:
blockSize > 0
miniBlockCount > 0
blockSize % miniBlockCount === 0

But the Parquet Delta Binary Packed encoding specification states that:

  • the block size must be a multiple of 128
  • the block is subdivided into mini-blocks
  • each mini-block should contain a number of values that is a multiple of 32
  • The current validateBlockLayout() would accept layouts such as:
blockSize = 8
miniBlockCount = 2
valuesPerMiniBlock = 4

That is even used in the added test:

const opts = {
  deltaBinaryPackedBlockSize: 8,
  deltaBinaryPackedMiniBlockCount: 2,
};

This is convenient for small tests, but it produces non-standard Parquet delta binary packed streams. Other readers may reject those files. This PR should enforce spec-valid layouts for production encoding/decoding. If small test layouts are desired, keep them behind a private/internal test-only helper, not public options. Suggested validation rules:

blockSize % 128 === 0
blockSize % miniBlockCount === 0
(blockSize / miniBlockCount) % 32 === 0
  1. Mini-block bit widths are not validated
    In decoding, each bit width byte is read directly from the data page:
const bitWidths = cursor.buffer.subarray(cursor.offset, cursor.offset + miniBlockCount);

Then passed to:

unpackMiniBlock(cursor, valuesPerMiniBlock, bitWidth)

There is no check that the bit width is valid for the physical type.
For Delta Binary Packed:
INT32 adjusted deltas should not require more than roughly 32 bits in normal canonical encoding.
INT64 adjusted deltas should not require more than 64 bits.
A bit width byte like 255 is malformed and should be rejected immediately.
Currently, a malformed page can cause large BigInt shifts and large computed byte lengths:

const byteLength = Math.ceil((count * bitWidth) / 8);

This is a correctness issue and also a robustness/security issue.
The PR should reject bit widths greater than the physical type’s maximum usable width.
Example logic:

const maxBitWidth = type === 'INT32' ? 32 : 64;

for (const bitWidth of bitWidths) {
  if (bitWidth > maxBitWidth) {
    throw new Error('invalid DELTA_BINARY_PACKED encoding');
  }
}
  1. Decoder accepts non-spec block layouts from untrusted files.
    Because the decoder uses the same permissive layout validation as the encoder, a malicious or malformed Parquet file can declare very large or strange layouts, for example:
  • enormous blockSize
  • enormous miniBlockCount
  • valuesPerMiniBlock values that are not spec-valid
  • huge mini-block count causing a large bit-width table read

Some size limits are indirectly constrained by the input buffer length, but not all resource usage is bounded early. Please add explicit upper bounds and spec checks before allocating or looping. At minimum this must:

  • reject non-spec block layout
  • reject miniBlockCount > blockSize
  • reject valuesPerMiniBlock <= 0
  • reject valuesPerMiniBlock that is not an integer multiple of 32
  • consider practical maximums for blockSize and miniBlockCount
  1. Encoder wraps INT32 deltas to signed 32-bit values
    During encoding:
deltas.push(toSignedInteger(normalizedValues[i] - normalizedValues[i - 1], typeBitWidth));

For INT32, this wraps the delta into signed 32-bit range.
That means a transition like:

2147483647 -> -2147483648

has a mathematical delta of:

-4294967295

but the encoder will turn it into:

+1

The local decoder then also wraps accumulated values, so round-trips may pass inside this implementation. However, this is likely not canonical Delta Binary Packed encoding and may produce files that other Parquet readers decode differently or reject.

The spec describes storing deltas between consecutive values, then storing the minimum delta and adjusted deltas. It does not describe computing deltas modulo the physical type width.

Instead: compute deltas as mathematical bigint differences, not wrapped differences. Only validate final output values against the physical type range.
For example:

deltas.push(normalizedValues[i] - normalizedValues[i - 1]);

Then ensure bit-width calculation supports the possible adjusted delta range. For INT32 values, adjusted deltas may need more than 32 bits in edge cases if using mathematical deltas.

  1. Bit-packing order needs independent verification
    The implementation packs bits MSB-first within the output byte stream:
accumulator = (accumulator << BigInt(bitWidth)) | value;

and unpacks similarly.

Delta Binary Packed uses bit-packed mini-block values. The exact bit order must match Parquet’s bit-packing definition. The added codec tests mostly verify that the encoder and decoder agree with each other, which is not enough to prove spec compatibility.
The included hand-written test vector is useful, but it would be safer to compare against:

  • Apache Parquet Java-generated data
  • parquet-mr decoding
  • parquet-cpp / Arrow decoding
  • official examples if available
  • The reference test addition for delta_binary_packed.parquet helps, but I would still add explicit low-level vectors for bit widths such as 1, 2, 3, 7, 8, 9, 31, 32, 63, and 64.
  1. Possible bug: bitWidths[miniBlockIndex] = bitWidth can silently truncate
    bitWidths is a Buffer, so assignments are byte assignments:
bitWidths[miniBlockIndex] = bitWidth;

If bitWidth were greater than 255, Node will coerce/truncate. In normal valid cases that should not happen, but given mathematical deltas it could exceed expected widths if layout or type assumptions are wrong.

Explicitly validate before assignment, and include a test for same:

if (bitWidth > 64) {
  throw new Error('invalid DELTA_BINARY_PACKED bit width');
}

or type-specific maximums.

  1. The decoder requires totalValueCount === count
    The decoder reads the total value count from the Delta Binary Packed stream and requires it to match the externally requested count:
if (totalValueCount !== count) {
  throw new Error(...)
}

This is probably fine if the surrounding page decoder always passes the exact page value count. However, if this codec is ever used for partial reads, page slicing, or repeated/definition level handling in a way where the caller’s count is not exactly the encoded count, this strict check could become a compatibility issue.

You should confirm the call sites always pass the physical encoded value count for the page.

Security / robustness findings

  1. Malformed files can trigger expensive BigInt work
    Because bitWidth is attacker-controlled and unbounded, a malicious page can force large BigInt shifts and masking:
bitMask(bitWidth)

with values up to 255 from a single byte. That is not catastrophic by itself, but combined with large block sizes and many values, it can amplify CPU usage.
Severity: medium for untrusted Parquet files.
Fix: validate bit widths early.

  1. Malformed files can force large allocations / loops
    The decoder trusts:
blockSize
miniBlockCount
totalValueCount

from the encoded stream, subject only to Number.MAX_SAFE_INTEGER and basic divisibility.
A malicious file could specify large values that lead to:

  • large arrays from new Array(count).fill(0n) when bitWidth === 0
  • many loop iterations
  • large bit-width metadata reads
  • large output arrays

Some of this is inherent when decoding a page with many values, but the implementation should reject impossible or non-spec layouts before doing work.
Severity: medium for applications that read untrusted Parquet files.
Fix: enforce spec layout and practical bounds.

  1. Varint length cap may be too loose but is directionally good
    readUnsignedVarint() rejects after:
if (shift > 70n)

This prevents unbounded varints.

For Parquet values here, tighter limits could be applied depending on field:
block size and mini-block count should be bounded much lower
value count should be bounded by page metadata / requested count
first value and min delta should be bounded according to expected numeric width or a reasonable signed delta range

Test coverage gaps
The following should be added.

  • Recommended additional tests
  • Spec-valid default layout only
  • Avoid promoting invalid tiny layouts as supported public behavior.
  • Boundary INT32 values
  • INT32_MIN
  • INT32_MAX
  • transition INT32_MAX -> INT32_MIN
  • transition INT32_MIN -> INT32_MAX
  • Boundary INT64 values
  • INT64_MIN
  • INT64_MAX
  • large positive/negative deltas
  • transitions across extremes
  • Malformed bit width
  • INT32 stream with bit width 33
  • INT64 stream with bit width 65
  • bit width 255
  • Malformed block layout
  • block size not multiple of 128
  • mini-block value count not multiple of 32
  • mini-block count larger than block size
  • zero mini-block count
  • huge mini-block count
  • Truncated streams
  • truncated header
  • truncated first value
  • truncated min delta
  • truncated bit-width table
  • truncated mini-block payload
  • Independent interoperability vectors
  • encode using this library, decode with Apache Parquet Java / Arrow
  • encode with Apache Parquet Java / Arrow, decode with this library
  • [ ]

Documentation

Please also update the README to include DELTA_BINARY_PACKED as a supported encoding.

Summary of Required changes:

  • Enforce Parquet block-layout constraints.
  • Validate decoded mini-block bit widths.
  • Add malformed-input tests.
  • Confirm/fix INT32/INT64 delta arithmetic semantics.
  • Verify bit-packing order against an independent implementation.
  • Add practical resource limits for decoded block metadata.
  • Add README support table updates.
  • Add interoperability tests using known-good Parquet files.
  • Avoid test-only invalid block sizes in public codec options.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support DELTA_BINARY_PACKED

4 participants