Priority Level
Medium
Task Summary
Summary
Chunked validation currently partitions candidates by a fixed validation_max_entities_per_call count. This is simple and works, but it does
not account for how much prompt text each chunk actually sends to the validator. In practice, adjacent chunks can resend overlapping excerpt
context, and some entities contribute much more text than others. That means equal-size chunks by entity count can still be badly unbalanced by
total token load.
We should explore replacing or augmenting the current entity-count heuristic with a token-budget-aware chunking strategy.
Current behavior
Today we:
- order validation candidates by position
- split them into chunks of at most validation_max_entities_per_call
- build a bounded excerpt around each chunk
- render one validator prompt per chunk
This gives predictable behavior and keeps the implementation straightforward (see the sketch below), but it has two inefficiencies:
- prompt size can vary widely across chunks even when candidate count is the same
- neighboring chunks may resend overlapping context, increasing total input tokens
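For reference, a minimal sketch of today's fixed-count flow. The helper name `chunk_by_count` and the `position` attribute are illustrative assumptions, not existing APIs; only `validation_max_entities_per_call` is a real knob:

```python
from typing import List, Sequence

def chunk_by_count(candidates: Sequence, max_per_call: int) -> List[list]:
    """Split position-ordered candidates into chunks of at most
    `validation_max_entities_per_call` candidates each (hypothetical sketch)."""
    ordered = sorted(candidates, key=lambda c: c.position)  # stable position order
    return [list(ordered[i:i + max_per_call])
            for i in range(0, len(ordered), max_per_call)]
```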
Problem
Entity count is only a proxy for prompt cost.
Actual per-call cost depends on both:
- input size:
  - tagged excerpt size
  - validation skeleton size
  - prompt-template overhead
- output size:
  - number of decisions returned
  - any per-decision reasoning/reclassification fields
A token-budget-aware chunker could produce fewer, better-balanced validation calls and reduce total spend, especially on long documents with
dense nearby entities.
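As a rough illustration of that cost model (every name here is an assumption for the sketch, not an existing API; a real estimator would count tokens with the validator model's tokenizer):

```python
from dataclasses import dataclass

@dataclass
class CallCostEstimate:
    # Hypothetical breakdown mirroring the input/output split above.
    excerpt_tokens: int            # tagged excerpt sent with the chunk
    skeleton_tokens: int           # validation skeleton for the chunk
    template_overhead_tokens: int  # fixed prompt-template cost
    expected_output_tokens: int    # decisions plus any reasoning fields

    @property
    def total(self) -> int:
        # input side + expected output side
        return (self.excerpt_tokens + self.skeleton_tokens
                + self.template_overhead_tokens + self.expected_output_tokens)
```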
Proposed direction
Investigate a chunking strategy that targets an estimated per-call token budget rather than only a fixed entity count.
Possible shape (a greedy sketch follows this list):
- estimate prompt cost for adding the next candidate to a chunk
- account for:
  - excerpt growth
  - skeleton growth
  - fixed template overhead
  - expected response size per candidate
- close the chunk when adding another candidate would exceed the budget
- preserve position order
- keep current fixed-count behavior as the default until the new approach is validated
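A minimal greedy sketch of that shape, assuming a hypothetical `estimate_added_tokens(chunk, candidate)` helper that prices excerpt growth, skeleton growth, template overhead, and expected response size; none of these names exist in the codebase yet:

```python
from typing import Callable, List, Sequence

def chunk_by_token_budget(
    candidates: Sequence,             # already position-ordered
    budget: int,                      # estimated per-call token budget
    estimate_added_tokens: Callable,  # hypothetical: (chunk, candidate) -> int
    hard_cap: int | None = None,      # optional entity-count ceiling
) -> List[list]:
    chunks: List[list] = []
    current: list = []
    current_cost = 0
    for cand in candidates:
        added = estimate_added_tokens(current, cand)
        over_budget = current and current_cost + added > budget
        over_cap = hard_cap is not None and len(current) >= hard_cap
        if over_budget or over_cap:
            chunks.append(current)                          # close the chunk
            current, current_cost = [], 0
            added = estimate_added_tokens(current, cand)    # re-price vs. empty chunk
        current.append(cand)
        current_cost += added
    if current:
        chunks.append(current)
    return chunks
```

Note the `current and` guard: a single oversized candidate still gets its own chunk rather than stalling the loop, and `hard_cap` keeps an optional entity-count ceiling alongside the token budget, which is one of the open questions below.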
Open design questions:
- should this replace validation_max_entities_per_call, or be an additional knob?
- how should we estimate output-token budget per candidate?
- how should we handle overlapping excerpt windows between neighboring chunks?
- do we want a hard cap on candidate count even with token budgeting?
Acceptance criteria
- prototype a token-budget-aware chunking strategy for validator calls
- benchmark against the current fixed-count chunking on representative long-text inputs
- compare:
  - total validator calls
  - estimated input/output tokens
  - runtime
  - merged decision parity / behavioral regressions
- add tests that cover (see the test sketch after this list):
  - chunk-size balancing
  - edge cases around excerpt growth
  - stable ordering / deterministic chunk boundaries
  - parity with current behavior where appropriate
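A hedged sketch of what the determinism and parity tests might look like, reusing the hypothetical `chunk_by_token_budget` above with a flat cost function (pytest-style; none of this is existing test code):

```python
def test_deterministic_boundaries_and_order():
    candidates = list(range(10))   # integer stand-ins for candidates
    flat = lambda chunk, cand: 5   # every candidate "costs" 5 tokens
    a = chunk_by_token_budget(candidates, budget=12, estimate_added_tokens=flat)
    b = chunk_by_token_budget(candidates, budget=12, estimate_added_tokens=flat)
    assert a == b                  # same input -> same chunk boundaries
    assert [c for chunk in a for c in chunk] == candidates  # order preserved

def test_reduces_to_fixed_count_under_uniform_cost():
    # With uniform per-candidate cost, budget chunking should behave like
    # fixed-count chunking with max size budget // cost (parity check).
    flat = lambda chunk, cand: 4
    chunks = chunk_by_token_budget(list(range(9)), budget=12,
                                   estimate_added_tokens=flat)
    assert all(len(chunk) <= 3 for chunk in chunks)
```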
Notes
This is an optimization / follow-up, not a blocker. Fixed entity-count chunking is a reasonable first implementation and should remain the
baseline until the token-budget approach is clearly better and well tested.
Technical Details & Implementation Plan
No response
Dependencies
No response