
Optimize chunked validation by budgeting per-call token load instead of fixed entity count #134

@lipikaramaswamy

Description

Priority Level

Medium

Task Summary

Chunked validation currently partitions candidates into fixed-size groups of at most validation_max_entities_per_call entities. This is
simple and works, but it does not account for how much prompt text each chunk actually sends to the validator. In practice, adjacent chunks
can resend overlapping excerpt context, and some entities contribute far more text than others, so chunks of equal entity count can still be
badly unbalanced in total token load.

We should explore replacing or augmenting the current entity-count heuristic with a token-budget-aware chunking strategy.

Current behavior

Today we:

  • order validation candidates by position
  • split them into chunks of at most validation_max_entities_per_call
  • build a bounded excerpt around each chunk
  • render one validator prompt per chunk

This gives predictable behavior and keeps the implementation straightforward (a sketch of the current flow follows the list below), but it
has two inefficiencies:

  • prompt size can vary widely across chunks even when candidate count is the same
  • neighboring chunks may resend overlapping context, increasing total input tokens
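
For reference, a minimal sketch of the current fixed-count flow. The candidate objects and their .start position attribute are assumptions
about the surrounding code, not confirmed APIs:

```python
from typing import List, Sequence

def chunk_by_count(candidates: Sequence, max_per_call: int) -> List[list]:
    """Current behavior: sort candidates by position, then split them
    into fixed-size chunks of at most max_per_call entities."""
    ordered = sorted(candidates, key=lambda c: c.start)  # assumed position attribute
    return [
        list(ordered[i : i + max_per_call])
        for i in range(0, len(ordered), max_per_call)
    ]
```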

Problem

Entity count is only a proxy for prompt cost.

Actual per-call cost depends on both:

  • input size
    • tagged excerpt size
    • validation skeleton size
    • prompt-template overhead
  • output size
    • number of decisions returned
    • any per-decision reasoning/reclassification fields

A token-budget-aware chunker could produce fewer, better-balanced validation calls and reduce total spend, especially on long documents with
dense nearby entities.
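
To make that concrete, a rough per-call estimate might combine those input and output components. Everything below (the ~4 characters/token
heuristic, the overhead and per-decision constants) is an illustrative assumption, not measured data:

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def estimated_call_cost(excerpt: str, skeleton: str,
                        template_overhead_tokens: int = 200,
                        n_candidates: int = 1,
                        output_tokens_per_decision: int = 40) -> int:
    """Estimated total tokens for one validator call: input (tagged
    excerpt + validation skeleton + template overhead) plus the
    expected output (one decision per candidate)."""
    input_tokens = (estimate_tokens(excerpt)
                    + estimate_tokens(skeleton)
                    + template_overhead_tokens)
    return input_tokens + n_candidates * output_tokens_per_decision
```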

Proposed direction

Investigate a chunking strategy that targets an estimated per-call token budget rather than only a fixed entity count.

Possible shape (a minimal sketch follows this list):

  • estimate prompt cost for adding the next candidate to a chunk
  • account for:
    • excerpt growth
    • skeleton growth
    • fixed template overhead
    • expected response size per candidate
  • close the chunk when adding another candidate would exceed budget
  • preserve position order
  • keep current fixed-count behavior as the default until the new approach is validated
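
A greedy sketch of that shape, reusing the assumptions above (candidates expose .start/.end character offsets, and excerpt growth is
approximated by the span the chunk covers); this is illustrative, not the proposed implementation:

```python
from typing import List, Optional, Sequence

def estimated_chunk_cost(cands: list) -> int:
    # Approximate the excerpt by the character span the chunk covers,
    # plus assumed skeleton/template overhead and per-decision output.
    span_chars = cands[-1].end - cands[0].start
    return max(1, span_chars // 4) + 200 + 40 * len(cands)

def chunk_by_token_budget(candidates: Sequence, budget_tokens: int,
                          hard_cap: Optional[int] = None) -> List[list]:
    """Greedy chunker: extend the current chunk until adding the next
    candidate would exceed budget_tokens (or an optional hard cap on
    candidate count). Position order is preserved."""
    ordered = sorted(candidates, key=lambda c: c.start)
    chunks: List[list] = []
    current: list = []
    for cand in ordered:
        tentative = current + [cand]
        over_budget = estimated_chunk_cost(tentative) > budget_tokens
        over_cap = hard_cap is not None and len(tentative) > hard_cap
        if current and (over_budget or over_cap):
            chunks.append(current)  # close the chunk before it overflows
            current = [cand]
        else:
            current = tentative
    if current:
        chunks.append(current)
    return chunks
```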

Open design questions:

  • should this replace validation_max_entities_per_call, or be an additional knob? (one option is sketched after this list)
  • how should we estimate output-token budget per candidate?
  • how should we handle overlapping excerpt windows between neighboring chunks?
  • do we want a hard cap on candidate count even with token budgeting?
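
On the first question, one low-risk option is an additional knob that defaults to off, keeping fixed-count chunking as the baseline. A
hypothetical config shape (field names and defaults are invented for illustration):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ValidationChunkingConfig:
    # Existing knob; stays authoritative when budgeting is off.
    validation_max_entities_per_call: int = 10
    # New optional knob: when set, chunks target this token budget and
    # the entity count above becomes a hard cap rather than the chunk size.
    validation_target_tokens_per_call: Optional[int] = None
```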

Acceptance criteria

  • prototype a token-budget-aware chunking strategy for validator calls
  • benchmark against the current fixed-count chunking on representative long-text inputs
  • compare:
    • total validator calls
    • estimated input/output tokens
    • runtime
    • merged decision parity / behavioral regressions
  • add tests that cover (a pytest-style sketch follows this list):
    • chunk-size balancing
    • edge cases around excerpt growth
    • stable ordering / deterministic chunk boundaries
    • parity with current behavior where appropriate
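
For the determinism and parity bullets, a pytest-style sketch built on the hypothetical chunkers above (make_candidates is an assumed test
fixture producing position-ordered spans):

```python
def test_chunk_boundaries_are_deterministic():
    candidates = make_candidates(n=50)  # assumed fixture
    first = chunk_by_token_budget(candidates, budget_tokens=1500)
    second = chunk_by_token_budget(candidates, budget_tokens=1500)
    assert first == second  # same input -> same chunk boundaries

def test_parity_with_fixed_count_under_large_budget():
    # With an effectively unlimited budget and a hard cap equal to the
    # current knob, budgeting should reproduce fixed-count chunking.
    candidates = make_candidates(n=50)
    budgeted = chunk_by_token_budget(candidates, budget_tokens=10**9,
                                     hard_cap=10)
    fixed = chunk_by_count(candidates, max_per_call=10)
    assert budgeted == fixed
```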

Notes

This is an optimization / follow-up, not a blocker. Fixed entity-count chunking is a reasonable first implementation and should remain the
baseline until the token-budget approach is clearly better and well tested.

Technical Details & Implementation Plan

No response

Dependencies

No response
