SAI-6231: Bucketed segment flush handling on only single doc iterator by patsonluk · Pull Request #41 · cowpaths/lucene

patsonluk · 2026-06-09T18:51:39Z

Description

Built on top of #40, with changes to only support delegating to a bucketed DocumentsWriterPerThread (DWPT) if the input to DocumentsWriter#updateDocuments is a single doc. This simplifies the logic flow.

The bucket key is the value of the input doc's temporal field (defined by the sysprop lucene.temporalField.name), mapped to a bucket boundary. The default boundaries are (in days) [-9, 3, 9, 32, 94, 184]. For example, with -Dlucene.temporalField.name=EventStart,SessionStart,UserLastIndexedEventStart,UserTipLastEventStart, a doc whose EventStart is a day ago is mapped to bucket 3 (up to 3 days ago).
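To make the mapping concrete, here is a minimal sketch of one way a timestamp could be mapped to a boundary. It is not the PR's code: the class name, the OVERFLOW_BUCKET constant, the inclusive comparison, and the "first boundary at or above the doc's age" rule are all assumptions chosen to agree with the example above (a doc one day old lands in bucket 3).

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration only; not the PR's SegmentRoutingUtil. */
final class BucketMappingSketch {
  // Default boundaries in days, as quoted in the PR description.
  static final long[] DEFAULT_BOUNDARIES_DAYS = {-9, 3, 9, 32, 94, 184};

  // Assumed catch-all key for docs older than the last boundary.
  static final long OVERFLOW_BUCKET = Long.MAX_VALUE;

  /**
   * Returns the first boundary (in days) that is >= the doc's age,
   * or OVERFLOW_BUCKET if the doc is older than every boundary.
   * The age is negative for docs whose timestamp lies in the future.
   */
  static long mapToBucket(long docTimeMillis, long nowMillis, long[] boundariesDays) {
    long ageMillis = nowMillis - docTimeMillis;
    for (long boundary : boundariesDays) {
      if (ageMillis <= TimeUnit.DAYS.toMillis(boundary)) {
        return boundary;
      }
    }
    return OVERFLOW_BUCKET;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    long oneDayAgo = now - TimeUnit.DAYS.toMillis(1);
    // A doc from a day ago falls into the "up to 3 days ago" bucket.
    System.out.println(mapToBucket(oneDayAgo, now, DEFAULT_BOUNDARIES_DAYS));
  }
}
```

Under these assumptions a negative boundary such as -9 would only catch docs dated more than 9 days into the future; whether the PR treats it that way is not stated in the description.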

This is related to the TemporalMergePolicy work, which groups segments of the same bucket (using the same boundaries as defined here) for merges.

Solution

Mostly work by @magibney, from this PR:

Logic to read the sysprop lucene.temporalField.name and map the input doc to a bucket key
Changes to DocumentsWriterFlushControl and DocumentsWriterPerThreadPool to support creating/managing DWPT instances with a bucket key
PeekIterable to peek at the first (and only) doc and delegate to the corresponding bucket DWPT
Change in MergePolicy to expose the bucket mapping logic to higher layers (Solr TemporalMergePolicy etc.)
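The peek step in the list above can be sketched roughly as follows. This is an illustrative wrapper, not the PeekIterable from the PR: the class name, the isSingle() and peek() methods, and the single-pass restriction are assumptions. It reads the first element up front, records whether anything follows it, and then replays the full sequence to the consumer.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical single-pass wrapper that exposes the first element before iteration. */
final class PeekIterableSketch<T> implements Iterable<T> {
  private final Iterator<? extends T> rest;
  private final T first;          // null when the source is empty
  private final boolean hasFirst;
  private final boolean single;   // true when the source held exactly one element
  private boolean consumed;

  PeekIterableSketch(Iterable<? extends T> source) {
    this.rest = source.iterator();
    this.hasFirst = rest.hasNext();
    this.first = hasFirst ? rest.next() : null;
    this.single = hasFirst && !rest.hasNext();
  }

  /** Whether routing would apply: only a lone document qualifies. */
  boolean isSingle() {
    return single;
  }

  /** Returns the first element without consuming it from the caller's view. */
  T peek() {
    if (!hasFirst) {
      throw new NoSuchElementException("source is empty");
    }
    return first;
  }

  @Override
  public Iterator<T> iterator() {
    if (consumed) {
      throw new IllegalStateException("this sketch supports a single pass only");
    }
    consumed = true;
    return new Iterator<T>() {
      private boolean firstPending = hasFirst;

      @Override
      public boolean hasNext() {
        return firstPending || rest.hasNext();
      }

      @Override
      public T next() {
        if (firstPending) {
          firstPending = false;
          return first; // replay the peeked element first
        }
        return rest.next();
      }
    };
  }
}
```

A caller would check isSingle(), derive the bucket from peek(), and then hand the wrapper itself to the downstream indexing code so the peeked doc is not lost.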

Note that for now we only support bucket mapping for DocumentsWriter#updateDocuments with a single doc, since supporting multiple documents gets complicated, as demonstrated in the original PR:

Need to delegate to multiple DWPTs while maintaining execution order across all the instances
Extra threads to support multiple DWPTs per updateDocuments call

Note

Medium Risk
Changes the hot indexing path (DWPT acquisition and pooling) when routing is on, though behavior is opt-in via system properties and limited to single-document updates without a parent field.

Overview
Adds time-bucket routing at flush time so single-document updateDocuments paths can pin in-RAM segments to a bucket derived from a document’s temporal numeric field (primary + optional fallbacks), driven by JVM properties such as lucene.temporalField.name, lucene.temporalField.boundaries, and lucene.temporalField.adjustNow. Multi-document batches and indexes with a parent field skip routing and use the default bucket.

DocumentsWriter peeks the lone document (via PeekIterable) when routing is enabled, maps it with SegmentRoutingUtil, and calls obtainAndLock(bucket). DocumentsWriterPerThreadPool keeps separate free queues per bucket; each DocumentsWriterPerThread carries its bucket for checkout/return. MergePolicy.mapToBucket re-exports the boundary logic for Solr-style temporal merge policies.
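As a rough picture of "separate free queues per bucket", the sketch below keeps one queue of idle writers per bucket key. It is not the DocumentsWriterPerThreadPool change itself: the class and method names are hypothetical, the Writer type is a stand-in for DocumentsWriterPerThread, and locking and flush control (which obtainAndLock implies) are deliberately left out.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical pool that checks writers out and back in by bucket key. */
final class BucketedPoolSketch {
  /** Stand-in for a DocumentsWriterPerThread; it only remembers its bucket. */
  static final class Writer {
    final long bucket;

    Writer(long bucket) {
      this.bucket = bucket;
    }
  }

  // One free queue per bucket, created lazily on first use.
  private final Map<Long, Queue<Writer>> freeByBucket = new ConcurrentHashMap<>();

  /** Reuses an idle writer of the same bucket if one exists, else creates a new one. */
  Writer obtain(long bucket) {
    Queue<Writer> free =
        freeByBucket.computeIfAbsent(bucket, b -> new ConcurrentLinkedQueue<>());
    Writer writer = free.poll();
    return writer != null ? writer : new Writer(bucket);
  }

  /** Returns a writer to the free queue of the bucket it was created for. */
  void release(Writer writer) {
    freeByBucket
        .computeIfAbsent(writer.bucket, b -> new ConcurrentLinkedQueue<>())
        .add(writer);
  }
}
```

Because each writer carries its bucket, a writer released after indexing a bucket-3 doc can never be handed to a request for bucket 9, which is what keeps each in-RAM segment within one bucket.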

A concurrent testUpdateDocumentRouting checks that live docs land in one temporal bucket per leaf after adds/updates; merge-policy reflection tests ignore the new static helper.

Reviewed by Cursor Bugbot for commit 5d28831.

magibney

Looking good I think!

I wonder if it'd be easy/worth pulling out as much as possible into its own class, as opposed to static fields/methods on DWPT (and PeekIterable from DocumentsWriter as well). As it stands it's kinda obvious where the changes are, but putting in a dedicated file would make it a bit clearer/cleaner?

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 39a3420.

patsonluk · 2026-06-10T22:17:32Z

Thanks! Did a quick refactor here. For now I'm keeping PeekIterable where it is, since the coupling is quite strong? 😁 Let me know!

magibney · 2026-06-16T20:55:46Z

+  }
+
+  static long mapToBucket(Iterable<? extends IndexableField> doc) {
+    return mapToBucket(doc, TEMPORAL_ADJUST_NOW != null ? TEMPORAL_ADJUST_NOW : System.currentTimeMillis(), defaultBucket());


We shouldn't really need to repeatedly invoke System.currentTimeMillis(). Ideally we could set a TEMPORAL_ADJUST_NANOS based on the diff between System.nanoTime() and the configured sysprop lucene.temporalField.adjustNow?

patsonluk

Are you thinking of the optimization of calling System.nanoTime() instead of System.currentTimeMillis() every time? That is, calculate the base diff at init time (against System.currentTimeMillis() or TEMPORAL_ADJUST_NOW), then later on just use base diff + System.nanoTime() (i.e. cheaper)?

magibney

Exactly that, yes.

patsonluk

I have made the changes here, can you please 👀? e9809bc

The handling of lucene.temporalField.adjustNow is slightly different. In that case I think we don't want the time to advance at all, since a static now time is the most consistent for testing.

patsonluk · 2026-06-17T15:21:59Z

Ping @hiteshk25 before we merge (I think it should be close to approval from @magibney); please let us know if there are any concerns. A TL;DR of the changes:

The new routing logic is all in the new class SegmentRoutingUtil
Changes to existing classes (DocumentsWriter...) are mainly:
1. Allow lookups of DocumentsWriterPerThread using a "bucket key", which is mapped from the hard-coded day range boundaries
2. The routing logic is only triggered if the sysprop lucene.temporalField.name is defined. If undefined, the old logic runs, using the "default" bucket key for everything and without any routing.
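The opt-in behaviour in point 2 could look something like the sketch below. It is an assumption-laden illustration, not SegmentRoutingUtil: the class and method names are hypothetical, the doc is modelled as a plain Map of field name to timestamp, and the comma-separated list is read as "primary field first, then fallbacks" as the overview describes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;
import java.util.stream.Collectors;

/** Hypothetical reading of the routing sysprop; routing is off when it is unset. */
final class TemporalFieldConfigSketch {
  static final String NAME_PROP = "lucene.temporalField.name";

  /** Parses a comma-separated list of field names; an empty list means routing is disabled. */
  static List<String> parseFieldNames(String raw) {
    if (raw == null || raw.trim().isEmpty()) {
      return List.of();
    }
    return Arrays.stream(raw.split(","))
        .map(String::trim)
        .filter(name -> !name.isEmpty())
        .collect(Collectors.toList());
  }

  /** Returns the value of the first configured field present in the doc, if any. */
  static OptionalLong temporalValue(Map<String, Long> doc, List<String> fieldNames) {
    for (String name : fieldNames) {
      Long value = doc.get(name);
      if (value != null) {
        return OptionalLong.of(value);
      }
    }
    return OptionalLong.empty(); // caller falls back to the default bucket
  }

  public static void main(String[] args) {
    List<String> names = parseFieldNames(System.getProperty(NAME_PROP));
    System.out.println(names.isEmpty() ? "routing disabled" : "routing on " + names);
  }
}
```

When the list is empty or no configured field is present, a caller would simply use the default bucket, which matches the "old logic" path described above.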

hiteshk25 · 2026-06-17T16:52:02Z

👁️

magibney · 2026-06-17T17:00:47Z

+    if (ADJUST_NOW != null) { //explicitly defined a static now time. Use it for all calls
+      return ADJUST_NOW;
+    } else {
+      return NOW_BASE_IN_MILLI_SEC + TimeUnit.NANOSECONDS.toMillis(System.nanoTime());


I think this is not formally correct. System.nanoTime() can roll over and be negative. I think if we're going to do this, we have to convert fully to nanos, e.g.

NOW_BASE_MILLIS = ADJUST_NOW == null ? System.currentTimeMillis() : ADJUST_NOW;
NOW_BASE_NANOS = System.nanoTime();
getNow() { return NOW_BASE_MILLIS + TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - NOW_BASE_NANOS); }

patsonluk

I was trying to save one arithmetic operation, but good point on the wrap-around issue!

I'm not going to worry about ADJUST_NOW here, since if it's defined we will NOT do any computation and will use the value directly.
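The outcome of this thread can be sketched as a small clock class. This is a sketch under assumptions rather than the merged code: the class name and the injectable nano source (added here only so the wrap-around case can be exercised) are hypothetical. The key point is that only the difference between two nanoTime() readings is used, which stays correct even when the raw value overflows or is negative.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical "now" provider: fixed if adjustNow is set, else wall-clock base plus elapsed nanos. */
final class RoutingClockSketch {
  private final Long adjustNow;          // static "now" in millis, or null if not configured
  private final long nowBaseMillis;      // wall-clock reading taken once at init
  private final long nowBaseNanos;       // nanoTime reading taken at the same moment
  private final LongSupplier nanoSource; // System::nanoTime in production

  RoutingClockSketch(Long adjustNow, long wallClockMillis, LongSupplier nanoSource) {
    this.adjustNow = adjustNow;
    this.nowBaseMillis = wallClockMillis;
    this.nanoSource = nanoSource;
    this.nowBaseNanos = nanoSource.getAsLong();
  }

  static RoutingClockSketch system(Long adjustNow) {
    return new RoutingClockSketch(adjustNow, System.currentTimeMillis(), System::nanoTime);
  }

  long getNow() {
    if (adjustNow != null) {
      return adjustNow; // explicitly configured static time: never advances
    }
    // Subtract first, then convert: the difference is valid across numerical overflow.
    long elapsedNanos = nanoSource.getAsLong() - nowBaseNanos;
    return nowBaseMillis + TimeUnit.NANOSECONDS.toMillis(elapsedNanos);
  }

  public static void main(String[] args) {
    System.out.println(RoutingClockSketch.system(null).getNow());
  }
}
```

Converting the raw nanoTime() value to millis before subtracting, as in the snippet under review, is what breaks when the raw value is negative; subtracting first avoids that.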

magibney

LGTM

magibney and others added 14 commits May 12, 2026 18:07

POC temporal segment flushing

7d74ab6

cant lazily determine bucket

2e4f769

fix edge case where all docs in batch get re-routed

f7f46a5

expose mapToBucket()

e2fb5bf

only re-route once; cleaner

043413b

Revert "fix edge case where all docs in batch get re-routed"

947da5a

This reverts commit f7f46a5. no longer necessary now that we peek the first doc and pull a corresponding DWPT.

allow dynamically specified boundaries, and "pathological future" bucket

56c83d1

accept fallback field names if primary temporal field not present

b5a43c0

Added documentations

f2b46f9

Added documentations

830f9b6

Adjusted documentations

0eaf39c

Simplified code flow based on the assumption (and check) of single do…

f5fdd5e

…c per updateDocuments call

Simplified code flow based on the assumption (and check) of single do…

c444bd0

…c per updateDocuments call

Removed unused imports

305c535

patsonluk marked this pull request as ready for review June 9, 2026 19:27

cursor Bot reviewed Jun 9, 2026

Comment thread lucene/core/src/java/org/apache/lucene/index/DocumentsWriter.java

Comment thread lucene/core/src/java/org/apache/lucene/index/DocumentsWriter.java Outdated

Fixed after AI review

dea00b7

patsonluk changed the title ~~Patsonluk/single doc bucketed flush~~ SAI-6231: Bucketed segment flush handling on only single doc iterator Jun 9, 2026

Replaced lucene.temporalField.adjust with lucene.temporalField.adjust…

d62770f

…Now, which now takes a date string and use that directly as now

magibney reviewed Jun 10, 2026

patsonluk added 3 commits June 10, 2026 13:43

Changes after code review

6becddf

Factor out bucket segment routing code to SegmentRoutingUtil

a82be6a

Fixed test case to skip checking overrides on static methods

39a3420

cursor Bot reviewed Jun 10, 2026

Comment thread lucene/core/src/java/org/apache/lucene/index/DocumentsWriter.java

patsonluk added 2 commits June 10, 2026 16:00

Expose SegmentRoutingUtil.TEMPORAL_ADJUST_NOW

6b811e6

Removed unused imports

abec5b7

magibney requested changes Jun 12, 2026

patsonluk added 2 commits June 12, 2026 09:36

Reverted unnecessary change

d4b9eab

Reverted unnecessary change

58e8814

1. Renamed ENABLE_REROUTING to ENABLE_ROUTING

d9c02b6

2. Enable routing for unit testing 3. Added comments on parent field handling

magibney requested changes Jun 12, 2026

Comment thread lucene/core/src/test/org/apache/lucene/search/TestLRUQueryCache.java Outdated

patsonluk added 4 commits June 16, 2026 10:26

(temp) return default bucket when there's no TEMPORAL_FIELD_NAME

6feb350

Added test case testUpdateDocumentRouting for bucket routing and adju…

0ff0225

…stments to SegmentRoutingUtil to allow setting some constants for tests

Removed unused imports

4dea4e5

Use default bucket as fallback instead of -1

1c25596

magibney reviewed Jun 16, 2026

patsonluk added 4 commits June 16, 2026 14:11

Reverted unnecessary change

3f3b5b6

Removed irrelevant comment

2ba26df

Changed the now time code after review

e9809bc

Changed the now time code after review

4b88356

magibney reviewed Jun 17, 2026

patsonluk added 2 commits June 17, 2026 10:41

Address possible wrap around for nanoTime()

e036b59

Reverted unnecessary change

5d28831

magibney approved these changes Jun 17, 2026

patsonluk merged commit 928bce3 into fs/branch_9_11 Jun 25, 2026
2 checks passed
