feat: DiskSINDI sparse index with batched and merged rerank IO by Roxanne0321 · Pull Request #2129 · antgroup/vsag

Roxanne0321 · 2026-06-02T11:58:07Z

Summary

Introduce DiskSINDI, a new sparse index family that stores posting lists on disk through a DiskSparseTermListDataCell, while keeping the term dictionary and (optional) rerank flat datacell in memory.

This PR includes three incremental commits:

Initial DiskSINDI implementation — on-disk posting list layout, parameter handling, factory registration, and end-to-end tests.
Stage3 Phase A — collapse per-candidate rerank IO into a single batched GetCodesByIdsBatch call, eliminating N serialized io_submit/io_getevents round-trips.
Stage3 Phase B — sort candidate inner_ids before issuing the batched IO, then merge adjacent/nearby disk reads (gap ≤ 4 KiB, merged segment ≤ 1 MiB) to further reduce syscall count on DirectIO backends.

Key Changes

src/algorithm/disksindi/: new index implementation (disksindi.{h,cpp}, parameters, tests)
src/datacell/disk_sparse_term_list_datacell.{h,cpp,inl}: on-disk term list datacell
src/datacell/sparse_vector_datacell.{h,inl}: GetCodesByIdsBatch with IO merge logic
src/datacell/flatten_interface.h: BatchCodesResult and virtual GetCodesByIdsBatch
Factory registration, constants, and shared test fixtures

Testing

All unit tests pass (DiskSINDI end-to-end, SparseDataCell batch correctness, IO merge correctness)
Cross-checked rerank distances against CalcDistanceById single-id path

Closes: #1957

Keep legacy SINDI rerank deserialization compatible while moving rerank storage to SparseVectorDataCell. Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: GitHub Copilot:GPT-5.4

Grow sparse datacell storage based on actual encoded vector size instead of using dense dim as a hard upper bound, and address follow-up review feedback in SINDI and sparse vector retrieval. Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: GitHub Copilot:GPT-5.4

Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: GitHub Copilot:GPT-5.4

Resolve the upstream/main merge while keeping SparseIndex removed from this branch. Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: GitHub Copilot:GPT-5.4

Add DiskSINDI as a new sparse index family. Compared to SINDI, the posting lists are stored on disk through a DiskSparseTermListDataCell, while the term dictionary and (optional) rerank flat datacell remain in memory. The index integrates with the existing IO abstractions (memory / mmap / buffer / async / reader) so the same code path can be benchmarked under different IO backends.

Main pieces:

- src/algorithm/disksindi/: new index implementation, parameter handling and unit tests (disksindi.{h,cpp}, disksindi_parameter.{h,cpp}, disksindi_parameter_test.cpp, CMakeLists.txt).
- src/datacell/disk_sparse_term_list_datacell.{h,cpp,inl} and matching unit test: term list datacell that owns the on-disk posting layout, with serialization and search-time scan helpers.
- src/datacell/sparse_vector_datacell.{h,inl} and its test: share rerank flat layout with SINDI and surface helpers that DiskSINDI uses for the rerank path.
- src/datacell/flatten_interface.h: tidy includes to support the new call sites; no functional change to existing flatten datacells.
- src/algorithm/sindi/sindi.cpp: align computer / search-impl helpers with the refactored sparse pieces shared with DiskSINDI.
- src/quantization/sparse_quantization/sparse_term_computer.h: expose a small accessor needed by DiskSINDI search.
- src/io/mmap_io.cpp: minor change to support DiskSINDI's read pattern.
- include/vsag/constants.h, include/vsag/index.h, src/constants.cpp, src/factory/index_creators.cpp, src/algorithm/CMakeLists.txt: register the new INDEX_DISKSINDI type and wire up the factory.
- tests/fixtures/unittest.h: include catch2 matchers headers needed by the new DiskSINDI / disk sparse term list tests.

Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: CodeFuse:claude-sonnet-4.5

Stage3 phase A: collapse the per-candidate rerank IO into a single batched read. This addresses the dominant cost on the DiskSINDI async-IO rerank path, where each candidate previously triggered an independent io_submit / io_getevents round-trip.

Code changes:

- src/datacell/flatten_interface.h: introduce BatchCodesResult and a new virtual GetCodesByIdsBatch on FlattenInterface. The default implementation falls back to a loop over GetCodesById, copying each fixed-size code blob into a contiguous buffer; it is safe for any backend whose code length equals code_size_.
- src/datacell/sparse_vector_datacell.{h,inl}: override GetCodesByIdsBatch. The override walks offset_io_ (in-memory) once to collect every DocLocation, computes per-id sizes and in-buffer offsets, then issues a single io_->MultiRead for all payloads. The rest of the access path (locking, encoding) is unchanged. Phase A does not sort or merge requests; that is left to phase B.
- src/algorithm/disksindi/disksindi.cpp: extract compute_distance_from_codes so the existing single-candidate path (cal_distance_by_id_unsafe) and the new batched path share the same scoring routine. Rewrite the rerank loop in search_impl to (1) pop the heap into a flat Vector<InnerIdType> preserving the original pop order, (2) call rerank_flat_->GetCodesByIdsBatch once, (3) iterate with const uint8_t* codes = batch.buffer.data() + batch.in_buffer_offsets[i] and feed compute_distance_from_codes.

Tests:

- src/datacell/sparse_vector_datacell_test.cpp: add a "SparseDataCell Batch Codes Matches Single" test that compares GetCodesByIdsBatch against GetCodesById byte-for-byte across ascending ids, shuffled ids, subsets with duplicates, and the empty input, over the memory_io and block_memory_io backends.
- src/algorithm/disksindi/disksindi_test.cpp: new end-to-end test that builds a DiskSINDI index with buffer_io term_io and memory_io rerank_io, runs a top-k query, and asserts that the result dim matches k, distances are non-decreasing, and the query itself appears in its own top-k.

Behavior:

- Serialization format is unchanged; this is a runtime-only change.
- Recall is unaffected (the batched path produces the same codes as the per-id path, only the IO scheduling differs).
- The async-IO rerank backend is the primary beneficiary; memory and mmap rerank backends degrade to a contiguous memcpy / sequential read and stay within noise.

Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: CodeFuse:claude-sonnet-4.5
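
The batched layout described in this commit (collect every location, compute per-id sizes and in-buffer offsets, then fetch all payloads at once) can be sketched as below. This is a minimal illustration, not the PR's code: `DocLoc`, `BatchResult`, and `gather_batch` are hypothetical names, and an in-memory `storage` vector stands in for the single `io_->MultiRead` call.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-ins for the PR's DocLocation and BatchCodesResult.
struct DocLoc {
    uint64_t offset;
    uint32_t size;
};

struct BatchResult {
    std::vector<uint8_t> buffer;              // all payloads, back to back
    std::vector<uint64_t> in_buffer_offsets;  // start of payload i in buffer
    std::vector<uint32_t> sizes;              // length of payload i
};

// Lay the payloads out with a running sum, then copy them in one pass.
// A real backend would replace the copy loop with one multi-read submission.
BatchResult
gather_batch(const std::vector<uint8_t>& storage,
             const std::vector<DocLoc>& table,
             const std::vector<uint32_t>& ids) {
    BatchResult result;
    result.in_buffer_offsets.reserve(ids.size());
    result.sizes.reserve(ids.size());
    uint64_t total = 0;
    for (uint32_t id : ids) {
        const DocLoc& loc = table[id];
        result.in_buffer_offsets.push_back(total);
        result.sizes.push_back(loc.size);
        total += loc.size;
    }
    result.buffer.resize(total);
    for (size_t i = 0; i < ids.size(); ++i) {
        const DocLoc& loc = table[ids[i]];
        if (loc.size != 0) {
            std::memcpy(result.buffer.data() + result.in_buffer_offsets[i],
                        storage.data() + loc.offset,
                        loc.size);
        }
    }
    return result;
}
```

Because the offsets are assigned in input order, duplicate ids simply receive their own slot, which is one way to satisfy the "subsets with duplicates" case the test above exercises.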

Stage3 phase B: sort candidate inner_ids before the batched rerank IO and merge adjacent/nearby disk reads into larger segments. This builds on the Phase A batched IO to further reduce syscall count when the IO backend is DirectIO-based (async_io).

Code changes:

- src/algorithm/disksindi/disksindi.cpp: after popping candidates from the heap into a flat vector, apply std::sort so that disk offsets are monotonically increasing before calling GetCodesByIdsBatch.
- src/datacell/sparse_vector_datacell.inl: rewrite GetCodesByIdsBatch to build merged IO ranges. After gathering (offset, size) tuples, the function scans them in order and coalesces pairs whose gap is within MERGE_GAP_LIMIT (DirectIO alignment, typically 4 KiB) and whose merged length stays under MAX_MERGED_IO_LEN (1 MiB). A scratch buffer receives the merged reads via MultiRead, then a scatter loop copies each candidate's payload into its final slot in result.buffer. Correctness is independent of input ordering: unsorted ids simply produce fewer merges, degenerating to the Phase A behavior.

Tests:

- src/algorithm/disksindi/disksindi_test.cpp: add "DiskSINDI Sorted Merge Rerank End-To-End" which cross-checks every result distance against CalcDistanceById to ensure the sorted+merged code path produces identical distance computations.
- src/datacell/sparse_vector_datacell_test.cpp: add "SparseDataCell Batch IO Merge Correctness" covering seven sections (unsorted, ascending contiguous, strided, large range, single, two adjacent, two distant) over both memory_io and block_memory_io backends.

Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: CodeFuse:claude-sonnet-4
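
The coalescing step can be illustrated with a small standalone sketch. It assumes the limits quoted in this PR (4 KiB gap, 1 MiB merged length, treated here as an inclusive cap); `ReadRange`, `merge_ranges`, and the two constants are hypothetical names, and the actual GetCodesByIdsBatch implementation may differ in its details.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ReadRange {
    uint64_t offset;
    uint64_t size;
};

// Illustrative limits taken from the PR description.
constexpr uint64_t kMergeGapLimit = 4096;      // 4 KiB
constexpr uint64_t kMaxMergedLen = 1ULL << 20;  // 1 MiB

// Coalesce reads in input order. A read joins the current segment when it
// does not start before the segment, the gap to the segment end is small,
// and the merged length stays within the cap. Anything else opens a new
// segment, so unsorted input still works and just merges less.
std::vector<ReadRange>
merge_ranges(const std::vector<ReadRange>& reads) {
    std::vector<ReadRange> merged;
    for (const ReadRange& r : reads) {
        if (!merged.empty()) {
            ReadRange& last = merged.back();
            const uint64_t last_end = last.offset + last.size;
            const uint64_t new_end = std::max(last_end, r.offset + r.size);
            const bool in_order = r.offset >= last.offset;
            const bool near_enough =
                in_order && (r.offset <= last_end || r.offset - last_end <= kMergeGapLimit);
            if (near_enough && new_end - last.offset <= kMaxMergedLen) {
                last.size = new_end - last.offset;
                continue;
            }
        }
        merged.push_back(r);
    }
    return merged;
}
```

A scatter pass would additionally need, for each original read, the index of its segment and its byte offset inside that segment; that bookkeeping is omitted here to keep the merge rule itself visible.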

mergify · 2026-06-02T11:58:56Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Require kind label

Wonderful, this rule succeeded.

label~=^kind/

🟢 Require version label

Wonderful, this rule succeeded.

label~=^version/

🟢 Require linked issue for feature/bug PRs

Wonderful, this rule succeeded.

body~=(?im)(?:^|[\s\-\*])(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s*:?\s+(?:#\d+|[\w.\-]+/[\w.\-]+#\d+|https?://github\.com/[\w.\-]+/[\w.\-]+/issues/\d+)

gemini-code-assist

Code Review

This pull request introduces the DiskSINDI index type to support disk-backed sparse vector indexing, refactors the existing SINDI index to use FlattenInterface for reranking, and upgrades SparseVectorDataCell to support 64-bit offsets with backward compatibility. The review feedback highlights several critical bug fixes and optimization opportunities, including guarding against undefined behavior and crashes when handling zero-length allocations or empty heaps, avoiding heap allocation overhead for small padding writes, and correcting a minor include path typo.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR removes the legacy SparseIndex implementation and introduces DiskSINDI plus supporting data-cells, while also upgrading sparse vector storage for 64-bit offsets and batched IO reads.

Changes:

Added DiskSINDI (new index type) and a disk-backed sparse term list data-cell with serialization/deserialization support.
Updated SparseVectorDataCell to use 64-bit offsets, add a v2 serialization sentinel/version, and implement GetCodesByIdsBatch with IO-merge logic.
Updated tests/docs and refactored SINDI’s rerank-flat storage to use FlattenInterface instead of SparseIndex.

Reviewed changes

Copilot reviewed 38 out of 38 changed files in this pull request and generated 5 comments.

File	Description
tests/test_sparse_index.cpp	Removed functional tests for deleted `SparseIndex`.
tests/fixtures/unittest.h	Added Catch2 matcher includes for new `REQUIRE_THROWS_WITH` usage.
src/quantization/sparse_quantization/sparse_term_computer.h	Added DiskSINDI search-param constructor.
src/io/mmap_io.cpp	Populate mmap size from existing file size.
src/factory/index_creators.cpp	Register `DiskSINDI` creator; remove `SparseIndex` creator.
src/datacell/sparse_vector_datacell_test.cpp	Added tests for v1/v2 serialization compatibility, batched reads, and sparse vector fetch by inner id.
src/datacell/sparse_vector_datacell.inl	Implemented v2 serialization + legacy fallback; added batched code fetch with IO merging; added locking and new inner-id sparse fetch.
src/datacell/sparse_vector_datacell.h	Switched to 64-bit offsets with packed `DocLocation`; added batch API and serialization format constants.
src/datacell/flatten_interface.h	Introduced `BatchCodesResult` + default `GetCodesByIdsBatch` and sparse fetch hook.
src/datacell/disk_sparse_term_list_datacell_test.cpp	Added unit tests for disk term list data-cell IO restore and IO type validation.
src/datacell/disk_sparse_term_list_datacell.inl	Added heap insertion helpers / window scan logic implementation.
src/datacell/disk_sparse_term_list_datacell.h	Added disk sparse term list data-cell interface + template implementation.
src/datacell/disk_sparse_term_list_datacell.cpp	Implemented disk sparse term list data-cell build, IO, and query helpers.
src/constants.cpp	Removed `INDEX_SPARSE`; added `INDEX_DISKSINDI`.
src/algorithm/sparse_index_parameters.h	Deleted legacy `SparseIndex` parameters.
src/algorithm/sparse_index_parameters.cpp	Deleted legacy `SparseIndex` parameters impl.
src/algorithm/sparse_index.h	Deleted legacy `SparseIndex`.
src/algorithm/sparse_index.cpp	Deleted legacy `SparseIndex` implementation.
src/algorithm/sparse_distance.h	Introduced shared sparse sort + distance utilities for SINDI/DiskSINDI.
src/algorithm/sindi/sindi_test.cpp	Switched ground-truth from `SparseIndex` to an “exact SINDI” configuration; adjusted tolerances.
src/algorithm/sindi/sindi.h	Replaced rerank `SparseIndex` with `FlattenInterface`.
src/algorithm/sindi/sindi.cpp	Implemented rerank flat as a datacell, added legacy rerank deserialization path, and reused new sparse distance helpers.
src/algorithm/inner_index_interface.h	Updated docs to remove `SparseIndex` references.
src/algorithm/inner_index_interface.cpp	Removed SPARSE handling branch in `GetVectorByIds`.
src/algorithm/disksindi/disksindi_test.cpp	Added end-to-end tests for DiskSINDI batched rerank behavior.
src/algorithm/disksindi/disksindi_parameter_test.cpp	Added DiskSINDI parameter parsing and compatibility tests.
src/algorithm/disksindi/disksindi_parameter.h	Added DiskSINDI index/search parameters.
src/algorithm/disksindi/disksindi_parameter.cpp	Implemented DiskSINDI parameter parsing and compatibility rules.
src/algorithm/disksindi/disksindi.h	Added DiskSINDI index interface.
src/algorithm/disksindi/disksindi.cpp	Implemented DiskSINDI build/search/serialize/deserialize and batched rerank.
src/algorithm/disksindi/CMakeLists.txt	Added build target for DiskSINDI sources.
src/algorithm/CMakeLists.txt	Wired DiskSINDI subdirectory and object library.
include/vsag/index.h	Public API: removed `SPARSE`, added `DISKSINDI`; updated docs text.
include/vsag/constants.h	Public constants: removed `INDEX_SPARSE`, added `INDEX_DISKSINDI`.
docs/docs/zh/src/advanced/search_allocator.md	Removed SparseIndex mention.
docs/docs/zh/src/advanced/introspection.md	Removed SparseIndex mention.
docs/docs/en/src/advanced/search_allocator.md	Removed SparseIndex mention.
docs/docs/en/src/advanced/introspection.md	Removed SparseIndex mention.

Comments suppressed due to low confidence (4)

src/datacell/sparse_vector_datacell.inl:1

max_code_size_ is now initialized to sizeof(uint32_t), which makes Resize() (and any pre-allocation strategy that depends on max_code_size_) severely under-estimate IO storage needs before the first insert. Since InsertVector() resizes io_ to required_size incrementally, this can devolve into frequent small resizes and copying. Consider initializing max_code_size_ to a more realistic bound/estimate (e.g., based on configured expected avg sparse length, or a conservative heuristic), and/or grow io_ capacity with an amortized strategy (e.g., geometric growth) instead of resizing to exactly required_size.
src/datacell/sparse_vector_datacell.inl:1
In the legacy deserialization path, legacy_offset_io_size is divided by sizeof(LegacyDocLocation) without validating exact divisibility. If the stream is corrupted/misaligned, truncation will leave unread bytes in the stream, which can shift quantizer_->Deserialize(reader) and cause hard-to-debug failures. Add a strict check that legacy_offset_io_size % sizeof(LegacyDocLocation) == 0 (and ideally that the resulting doc_count matches total_count_ / expected count if that invariant exists) and throw a descriptive exception on mismatch.
src/datacell/sparse_vector_datacell_test.cpp:1
This test sorts expected but not actual, yet compares them for equality, which makes the assertion order-dependent on the internal encoding/storage order of GetSparseVectorByInnerId. If ordering is not strictly guaranteed, sort actual as well (or compare via an order-insensitive matcher) to ensure the test validates content rather than incidental ordering.
src/datacell/sparse_vector_datacell.h:1
__attribute__((packed)) is compiler-specific (GCC/Clang) and may break portability to toolchains like MSVC. If this project supports multiple compilers, consider wrapping packing behind a project-wide portability macro (or using #pragma pack(push, 1) / #pragma pack(pop) guarded per compiler) to keep the on-disk layout guarantees while remaining portable.

Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: Codex:GPT-5

Copilot

Pull request overview

Copilot reviewed 38 out of 38 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (3)

src/io/mmap_io.cpp:1

std::filesystem::file_size() throws if filepath_ exists but is not a regular file (e.g., directory, symlink edge cases, permission issues). Since this runs in a constructor, it can unexpectedly terminate callers. Fix: guard with is_regular_file() (or call file_size(path, ec) and handle ec) and surface a consistent VsagException with context.
src/datacell/sparse_vector_datacell_test.cpp:1
The test sorts expected but not actual, making the assertion order-dependent on the internal encoding/decoding order of GetSparseVectorByInnerId. To avoid flaky failures when internal ordering changes, sort actual as well (or compare as multisets) before equality.
src/datacell/sparse_vector_datacell.inl:1
Legacy deserialization computes doc_count via integer division without validating that legacy_offset_io_size is a multiple of sizeof(LegacyDocLocation). For corrupted/truncated inputs this can desync the stream cursor and cause hard-to-debug downstream failures. Add an explicit validation (e.g., legacy_offset_io_size % sizeof(LegacyDocLocation) == 0) and throw a clear VsagException on mismatch.
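
A strict divisibility check of the kind requested here might be sketched as follows. `checked_entry_count` is a hypothetical helper, and `std::invalid_argument` stands in for the project's VsagException so the snippet stays self-contained.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Returns the number of legacy entries, or throws when the byte count is not
// an exact multiple of the entry size (a sign of a truncated or corrupt stream).
inline uint64_t
checked_entry_count(uint64_t table_bytes, uint64_t entry_bytes) {
    if (entry_bytes == 0) {
        throw std::invalid_argument("entry size must be non-zero");
    }
    if (table_bytes % entry_bytes != 0) {
        throw std::invalid_argument("legacy offset table size " + std::to_string(table_bytes) +
                                    " is not a multiple of entry size " +
                                    std::to_string(entry_bytes));
    }
    return table_bytes / entry_bytes;
}
```

Failing before any entry is read keeps the stream cursor from drifting into the data that follows the offset table.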

Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: Codex:GPT-5

Copilot

Pull request overview

Copilot reviewed 34 out of 34 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

src/io/mmap_io.cpp:1

std::filesystem::file_size() can throw (e.g., permission issues, broken symlink, transient FS errors), which would escape the constructor and bypass your existing error wrapping style. Consider using the file_size(path, std::error_code&) overload and convert failures into the same VsagException pattern used above, so callers get a consistent error type/message.
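
The non-throwing overload mentioned in this comment can be used roughly as below. `try_file_size` is a hypothetical helper written for illustration; in the PR the failure branch would presumably be turned into a VsagException rather than an empty optional.

```cpp
#include <cstdint>
#include <filesystem>
#include <optional>
#include <system_error>

// Returns the size of a regular file, or std::nullopt on any failure
// (missing path, directory, permission error, and so on).
inline std::optional<uint64_t>
try_file_size(const std::filesystem::path& path) {
    std::error_code ec;
    if (!std::filesystem::is_regular_file(path, ec) || ec) {
        return std::nullopt;
    }
    const std::uintmax_t size = std::filesystem::file_size(path, ec);
    if (ec) {
        return std::nullopt;
    }
    return static_cast<uint64_t>(size);
}
```

Both calls use the `std::error_code` overloads, so nothing can propagate out of a constructor that calls this helper.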


    DatasetPtr vectors = Dataset::Make();
-    if (GetIndexType() == IndexType::SINDI or GetIndexType() == IndexType::SPARSE) {
+    if (GetIndexType() == IndexType::SINDI || GetIndexType() == IndexType::DISKSINDI) {


+struct DiskTermEntry {
+    uint64_t posting_payload_offset{0};
+    uint32_t posting_payload_size{0};
+    uint32_t term_num{0};
+};


+        entry.posting_payload_size =
+            static_cast<uint32_t>(current_offset - entry.posting_payload_offset);


+    struct __attribute__((packed)) DocLocation {
+        uint64_t offset{0};
+        uint32_t size{0};
+    };


    if (not deserialize_without_footer_) {
        JsonType jsonify_basic_info;
        if (not read_index_footer(reader, jsonify_basic_info)) {
-            throw VsagException(ErrorType::READ_ERROR, "failed to read index footer");
-        }
-        // Check if the index parameter is compatible
-        {
-            auto param = jsonify_basic_info[INDEX_PARAM].GetString();
-            SINDIParameterPtr index_param = std::make_shared<SINDIParameter>();
-            index_param->FromString(param);
-            if (not this->create_param_ptr_->CheckCompatibility(index_param)) {
-                auto message = fmt::format("SINDI index parameter not match, current: {}, new: {}",
-                                           this->create_param_ptr_->ToString(),
-                                           index_param->ToString());
-                logger::error(message);
-                throw VsagException(ErrorType::INVALID_ARGUMENT, message);
+            logger::debug("SINDI footer not found, fallback to legacy deserialize path");
+        } else {


 auto r = index->CalDistanceById(query_ptr, ids, count, /*calculate_precise_distance=*/true);

-// Sparse vector indexes (SINDI, SparseIndex) — wrap the query in a Dataset
+// Sparse vector indexes (SINDI) — wrap the query in a Dataset


    this->offset_io_ =
        std::make_shared<MemoryBlockIO>(Options::Instance().block_size_limit(), allocator_);
-    this->max_code_size_ = (this->quantizer_->GetDim() * 2 + 1) * sizeof(uint32_t);
+    this->max_code_size_ = sizeof(uint32_t);
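
One of the suppressed review comments suggests growing the IO capacity geometrically instead of resizing to exactly the required size on each insert. A minimal sketch of such a policy follows; `grow_capacity` is a hypothetical helper and is not part of this PR.

```cpp
#include <algorithm>
#include <cstdint>

// Next capacity for a buffer that must hold `required` bytes: at least
// double the current capacity, so a long run of appends triggers only a
// logarithmic number of resizes.
inline uint64_t
grow_capacity(uint64_t current, uint64_t required) {
    if (required <= current) {
        return current;
    }
    const uint64_t doubled = current > UINT64_MAX / 2 ? UINT64_MAX : current * 2;
    return std::max(doubled, required);
}
```

With this policy the caller has to track logical size and capacity separately, which matters if the IO size is also used to decide how many bytes to serialize.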


Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: Codex:GPT-5

Copilot

Pull request overview

Copilot reviewed 34 out of 34 changed files in this pull request and generated 7 comments.

+    if (GetIndexType() == IndexType::SINDI || GetIndexType() == IndexType::DISKSINDI) {
        auto* sparse_vectors =
            static_cast<SparseVector*>(allocator->Allocate(sizeof(SparseVector) * count));


+        uint64_t legacy_offset_io_size = 0;
+        StreamReader::ReadObj(reader, legacy_offset_io_size);
+        const uint64_t legacy_entry_size = sizeof(LegacyDocLocation);
+        const uint64_t doc_count =
+            legacy_entry_size == 0 ? 0 : legacy_offset_io_size / legacy_entry_size;
+        this->offset_io_->Resize(doc_count * sizeof(DocLocation));


    auto sparse_vector = (const SparseVector*)vector;
    uint64_t code_size = (sparse_vector->len_ * 2 + 1) * sizeof(uint32_t);
-    if (code_size > max_code_size_) {
-        throw VsagException(ErrorType::INVALID_ARGUMENT, fmt::format("code size ({}) of sparse vector more than max code size ({})", code_size, max_code_size_));
-    }
    auto* codes = reinterpret_cast<uint8_t*>(allocator_->Allocate(code_size));
    quantizer_->EncodeOne((const float*)vector, codes);
-    uint32_t old_offset = 0;
+    DocLocation location;
    {
-        std::lock_guard lock(current_offset_mutex_);
-        old_offset = current_offset_;
+        std::scoped_lock lock(mutex_, current_offset_mutex_);
+        total_count_ = std::max(total_count_, idx + 1);
+        max_code_size_ = std::max(max_code_size_, code_size);
+        const auto required_size = current_offset_ + code_size;
+        if (required_size > this->io_->size_) {
+            this->io_->Resize(required_size);
+        }
+        location.offset = current_offset_;
+        location.size = static_cast<uint32_t>(code_size);
        current_offset_ += code_size;
+        offset_io_->Write(reinterpret_cast<uint8_t*>(&location),
+                          sizeof(location),
+                          static_cast<uint64_t>(idx) * sizeof(location));
+        io_->Write(codes, code_size, location.offset);
    }
-    offset_io_->Write(
-        (uint8_t*)&old_offset, sizeof(current_offset_), idx * sizeof(current_offset_));
-    io_->Write(codes, code_size, old_offset);
    allocator_->Deallocate(codes);


 const uint8_t*
 SparseVectorDataCell<QuantTmpl, IOTmpl>::GetCodesById(InnerIdType id, bool& need_release) const {
-    uint32_t offset;
-    offset_io_->Read(sizeof(offset), id * sizeof(offset), (uint8_t*)&offset);
-    uint32_t length;
-    io_->Read(sizeof(length), offset, (uint8_t*)&length);
-    need_release = true;
-    uint64_t read_size = sizeof(uint32_t) * (2 * length + 1);
-    auto* codes = (uint8_t*)allocator_->Allocate(read_size);
-    io_->Read(read_size, offset, codes);
-    return codes;
+    DocLocation location;
+    offset_io_->Read(sizeof(location),
+                     static_cast<uint64_t>(id) * sizeof(location),
+                     reinterpret_cast<uint8_t*>(&location));
+    return io_->Read(location.size, location.offset, need_release);
+}


+    // Packed so each entry is exactly 12 bytes on disk and in the offset_io_
+    // buffer. The unpacked layout would round sizeof up to 16 due to the
+    // uint64 alignment requirement, wasting 33% of the offset table.
+    struct __attribute__((packed)) DocLocation {
+        uint64_t offset{0};
+        uint32_t size{0};
+    };
+    static_assert(sizeof(DocLocation) == 12, "DocLocation must be 12 bytes on disk");


+float
+compute_distance_from_codes(const uint8_t* codes,
+                            const Vector<uint32_t>& sorted_ids,
+                            const Vector<float>& sorted_vals) {
+    auto len = *reinterpret_cast<const uint32_t*>(codes);
+    const auto* entries = reinterpret_cast<const BufferEntry*>(codes + sizeof(uint32_t));
+    float sum = 0.0F;
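
The hunk above is truncated, so the following is only a guess at the general shape of such a routine, not a reconstruction of it. It assumes the code blob is a `uint32_t` length followed by that many `(uint32_t id, float val)` pairs with ascending ids, and it returns a plain inner product; `Entry` and `sparse_dot_from_codes` are hypothetical names, and the real function may use a different layout, ordering, or sign convention.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed layout: [uint32 len][len x (uint32 id, float val)], ids ascending.
struct Entry {
    uint32_t id;
    float val;
};

// Two-pointer walk over the stored entries and the sorted query terms.
float
sparse_dot_from_codes(const uint8_t* codes,
                      const std::vector<uint32_t>& query_ids,
                      const std::vector<float>& query_vals) {
    uint32_t len = 0;
    std::memcpy(&len, codes, sizeof(len));  // memcpy avoids unaligned loads
    const uint8_t* cursor = codes + sizeof(len);
    float sum = 0.0F;
    size_t q = 0;
    for (uint32_t i = 0; i < len && q < query_ids.size(); ++i) {
        Entry entry;
        std::memcpy(&entry, cursor + static_cast<size_t>(i) * sizeof(Entry), sizeof(Entry));
        while (q < query_ids.size() && query_ids[q] < entry.id) {
            ++q;
        }
        if (q < query_ids.size() && query_ids[q] == entry.id) {
            sum += query_vals[q] * entry.val;
        }
    }
    return sum;
}
```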


+    // Kept for source compatibility with SparseIndex callers; new sparse workloads should
+    // prefer SINDI or DISKSINDI.
+    SPARSE = 6,
+    SINDI = 7,
+    WARP = 8,
+    DISKSINDI = 9,


Roxanne0321 added 9 commits May 14, 2026 16:32

feat(sindi): use sparse vector datacell and remove sparse index

e2ca69c

Keep legacy SINDI rerank deserialization compatible while moving rerank storage to SparseVectorDataCell. Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: GitHub Copilot:GPT-5.4

fix(sindi): account for rerank datacell memory blocks

979d53c

Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: GitHub Copilot:GPT-5.4

fix(datacell): serialize sparse insert races

c09720c

Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: GitHub Copilot:GPT-5.4

fix(sindi): detect footer before legacy fallback

3546917

Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: GitHub Copilot:GPT-5.4

merge: sync upstream main

de4c5a7

Resolve the upstream/main merge while keeping SparseIndex removed from this branch. Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: GitHub Copilot:GPT-5.4

Copilot AI review requested due to automatic review settings June 2, 2026 11:58

Roxanne0321 requested review from LHT129, inabao, jiaweizone and wxyucs as code owners June 2, 2026 11:58

pull-request-size Bot added the size/XXL label Jun 2, 2026

Roxanne0321 added kind/feature New feature version/1.0 labels Jun 2, 2026

mergify Bot added module/docs module/api module/datacell module/testing labels Jun 2, 2026

gemini-code-assist Bot reviewed Jun 2, 2026

Copilot AI reviewed Jun 2, 2026

wxyucs self-assigned this Jun 3, 2026

Roxanne0321 added 2 commits June 7, 2026 22:48

fix: address DiskSINDI review comments

32a44fb

Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: Codex:GPT-5

fix: handle sparse review edge cases

abc7ec7

Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: Codex:GPT-5

Copilot AI review requested due to automatic review settings June 7, 2026 14:52

Copilot AI reviewed Jun 7, 2026

Comment thread src/algorithm/inner_index_interface.cpp Outdated

Comment thread src/datacell/disk_sparse_term_list_datacell.cpp

Comment thread include/vsag/index.h

Comment thread src/algorithm/disksindi/disksindi_test.cpp Outdated

Comment thread include/vsag/index.h Outdated

fix: refine DiskSINDI review followups

ffe344c

Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: Codex:GPT-5

chore: merge upstream main into DiskSINDI branch

a9ac368

Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: Codex:GPT-5

Copilot AI review requested due to automatic review settings June 8, 2026 02:09

Copilot AI reviewed Jun 8, 2026

Roxanne0321 added 2 commits June 8, 2026 11:08

fix: detect SINDI rerank datacell format without footer

611cd93

Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: Codex:GPT-5

fix: satisfy DiskSINDI clang-tidy checks

857cf0f

Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: Codex:GPT-5

Copilot AI review requested due to automatic review settings June 8, 2026 06:26

Copilot AI reviewed Jun 8, 2026

3 participants