Skip to content

feat(storage): mmap-backed sidecar readers#116

Open
poyrazK wants to merge 6 commits into
mainfrom
feature/autocomplete-suggest
Open

feat(storage): mmap-backed sidecar readers#116
poyrazK wants to merge 6 commits into
mainfrom
feature/autocomplete-suggest

Conversation

@poyrazK

@poyrazK poyrazK commented Jun 13, 2026

Copy link
Copy Markdown
Owner

Summary

  • Add memmap2 dependency for memory-mapped file support
  • PositionsReader: add read_mmap() and from_mmap() using Arc to keep mapping alive
  • SuggestReader: add from_mmap() using Arc pattern
  • Both readers fall back to fs::read + from_bytes if mmap fails
  • Update load functions in IndexHandle to use mmap-based loading with fallback to heap read
  • Backwards-compatible: from_bytes sets mmap/Arc fields to None

Summary by CodeRabbit

  • New Features
    • Added memory-mapped loading for position and suggest sidecar readers, reducing reload overhead during index persistence/restoration.
  • Refactor
    • Updated persistence/restore to prefer memory-mapped I/O and fall back to in-memory loading when mmap isn’t available or yields no data.
  • Bug Fixes
    • Improved handling and reporting of malformed/truncated position data with a dedicated error type.
  • Tests
    • Extended unit tests to cover mmap parsing and updated error expectations.

- Add memmap2 dependency for memory-mapped file support
- PositionsReader: add read_mmap() and from_mmap() using Arc<Mmap> to keep mapping alive
- SuggestReader: add from_mmap() using Arc<Mmap> pattern
- Update load functions in IndexHandle to use mmap-based loading with fallback to heap read
- Both readers fall back to fs::read + from_bytes if mmap fails
- PositionsReader stores mmap: Option<Arc<Mmap>> to own the mapping, body_data Vec is still used as the parsed data slice
- SuggestReader stores _mmap: Option<Arc<Mmap>> to keep mapping alive
- Backwards-compatible: from_bytes sets mmap/Arc fields to None
@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@poyrazK, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 52 minutes and 24 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d8d91c94-9bb1-4b90-a3dc-986c4c264c80

📥 Commits

Reviewing files that changed from the base of the PR and between 96efa10 and 588e62d.

📒 Files selected for processing (2)
  • rust/crates/cloudsearch-storage/src/inverted_index.rs
  • rust/crates/cloudsearch-storage/src/suggest_index.rs
📝 Walkthrough

Walkthrough

This PR introduces memory-mapped IO support to the cloudsearch storage layer and integrates it into sidecar loading paths. The workspace now declares memmap2 0.4, and PositionsReader and SuggestReader gain optional mmap retention fields and constructors to load from memory-mapped files. IndexHandle sidecar loading now prefers mmap-backed IO with heap-read fallback for performance optimization.

Changes

Memory-mapped IO Integration

Layer / File(s) Summary
Workspace and crate dependencies
rust/Cargo.toml, rust/crates/cloudsearch-index/Cargo.toml, rust/crates/cloudsearch-storage/Cargo.toml
Declare memmap2 0.4 as a workspace dependency and add it to cloudsearch-index and cloudsearch-storage crates.
PositionsReader memory-mapped support
rust/crates/cloudsearch-storage/src/inverted_index.rs
Introduce PositionsReaderError enum for structured error handling, extend PositionsReader with optional mmap field and read_mmap/from_mmap constructors, refactor from_bytes to use the new error type, and update unit tests for mmap-backed reading.
SuggestReader memory-mapped support
rust/crates/cloudsearch-storage/src/suggest_index.rs
Add optional _mmap field to SuggestReader, implement from_mmap constructor, and refactor tests to use a make_reader helper for consistent reader construction.
IndexHandle sidecar loading integration
rust/crates/cloudsearch-index/src/lib.rs
Update load_single_suggest_reader to attempt mmap-first loading with heap-read fallback, switch load_single_positions_reader to mmap-based loading, and update post-merge and post-flush sidecar reloads to use the new mmap methods.

Sequence Diagrams

sequenceDiagram
  participant Caller
  participant PositionsReader
  participant memmap2
  participant File
  Caller->>PositionsReader: read_mmap(path)
  PositionsReader->>File: open_async(path)
  PositionsReader->>File: sync_all()
  PositionsReader->>memmap2: Mmap::map(file)
  memmap2->>PositionsReader: mmap
  PositionsReader->>PositionsReader: from_mmap(mmap)
  PositionsReader->>Caller: Self with mmap retained
Loading
sequenceDiagram
  participant IndexHandle
  participant load_single_suggest_reader
  participant load_single_positions_reader
  participant SuggestReader
  participant PositionsReader
  IndexHandle->>load_single_suggest_reader: load segment
  load_single_suggest_reader->>SuggestReader: from_mmap(path)
  alt mmap succeeds with fields
    SuggestReader->>load_single_suggest_reader: reader
  else fallback to heap
    load_single_suggest_reader->>SuggestReader: from_bytes(data)
  end
  load_single_suggest_reader->>IndexHandle: reader
  IndexHandle->>load_single_positions_reader: load segment
  load_single_positions_reader->>PositionsReader: read_mmap(path)
  PositionsReader->>load_single_positions_reader: reader
  load_single_positions_reader->>IndexHandle: reader
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Memory maps flutter through files like whispers in the wind,
No heap reads needed when mmap's data lives within,
Fast sidecars load with grace, or tumble back to bytes,
Positions dance on mapped pages, suggest fields take their flights,
The index now remembers through mmap's magic light! 🗺️

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(storage): mmap-backed sidecar readers' clearly and accurately summarizes the main change: introducing memory-mapped file support for sidecar readers in the storage module.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/autocomplete-suggest

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
rust/crates/cloudsearch-storage/src/inverted_index.rs (1)

329-385: ⚡ Quick win

Add direct tests for mmap loading paths.

This suite validates from_bytes, but the new read_mmap / from_mmap constructors are not directly exercised. Add at least one tempfile-backed mmap round-trip assertion to catch mmap-specific regressions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/cloudsearch-storage/src/inverted_index.rs` around lines 329 -
385, Add a test that exercises the mmap constructors (read_mmap/from_mmap) by
writing the same minimal positions file bytes to a tempfile, syncing/flush the
file, memory-mapping it (read-only) and calling PositionsReader::read_mmap or
PositionsReader::from_mmap to construct the reader, then assert term_count(),
get("test") and postings match the existing from_bytes assertions; use
tempfile::NamedTempFile (or tempfile::tempfile) and the project's mmap helper
(or memmap2) to create the mmap region and ensure the file is persisted before
mapping so mmap-specific parsing is validated.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rust/crates/cloudsearch-index/src/lib.rs`:
- Around line 214-217: At both affected sites
(rust/crates/cloudsearch-index/src/lib.rs:214-217 and
rust/crates/cloudsearch-index/src/lib.rs:1611-1614) the call
std_file.sync_all().ok()? must not short-circuit the mmap/heap fallback; change
the behavior so sync_all() errors are handled locally and do not return None —
either remove the sync_all() call or catch its Result and continue (e.g., log or
ignore the error) before attempting memmap/fs::read; apply the same change at
the flush-time helper site so a failing sync does not prevent the fallback path.

In `@rust/crates/cloudsearch-storage/src/inverted_index.rs`:
- Around line 238-242: The from_mmap implementation currently calls from_bytes,
which copies the body into body_data (via to_vec) and defeats the mmap memory
advantage; instead modify from_mmap to parse the header and establish references
directly into the mmap backing buffer and then set reader.mmap =
Some(Arc::new(mmap)) so body_data is a borrowed view (or an owned Arc-backed
slice) into that mmap rather than a cloned Vec. Concretely: avoid invoking
from_bytes from from_mmap; reuse the same parsing logic used by from_bytes to
locate the header/body ranges on the mmap byte slice, set
PositionsReader::body_data to a zero-copy reference into the mmap bytes (or an
Arc<[u8]> view), and then set reader.mmap to the Arc-wrapped mmap so the bytes
remain valid.

In `@rust/crates/cloudsearch-storage/src/suggest_index.rs`:
- Around line 154-156: from_mmap currently calls from_bytes which can panic on
corrupt/truncated input due to unchecked slicing/unwraps; change the parsing in
from_bytes to perform bounds-checked parsing and return Result<T,
std::io::Error> (use std::io::Error::new(InvalidData, ...)) instead of
panicking, replacing all unwraps/unchecked slice accesses with safe checks that
return Err on invalid input, and then update from_mmap to propagate that
io::Result from the revised from_bytes so corrupt mmap data yields an Err rather
than a panic (refactor functions: from_bytes and from_mmap accordingly).

---

Nitpick comments:
In `@rust/crates/cloudsearch-storage/src/inverted_index.rs`:
- Around line 329-385: Add a test that exercises the mmap constructors
(read_mmap/from_mmap) by writing the same minimal positions file bytes to a
tempfile, syncing/flush the file, memory-mapping it (read-only) and calling
PositionsReader::read_mmap or PositionsReader::from_mmap to construct the
reader, then assert term_count(), get("test") and postings match the existing
from_bytes assertions; use tempfile::NamedTempFile (or tempfile::tempfile) and
the project's mmap helper (or memmap2) to create the mmap region and ensure the
file is persisted before mapping so mmap-specific parsing is validated.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3ab73f35-d4bb-40c4-a43f-57660035ba01

📥 Commits

Reviewing files that changed from the base of the PR and between aece867 and bad2221.

⛔ Files ignored due to path filters (1)
  • rust/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • rust/Cargo.toml
  • rust/crates/cloudsearch-index/Cargo.toml
  • rust/crates/cloudsearch-index/src/lib.rs
  • rust/crates/cloudsearch-storage/Cargo.toml
  • rust/crates/cloudsearch-storage/src/inverted_index.rs
  • rust/crates/cloudsearch-storage/src/suggest_index.rs

Comment thread rust/crates/cloudsearch-index/src/lib.rs
Comment thread rust/crates/cloudsearch-storage/src/inverted_index.rs Outdated
Comment on lines +154 to +156
pub fn from_mmap(mmap: memmap2::Mmap) -> std::io::Result<Self> {
let result = Self::from_bytes(&mmap)?;
Ok(Self {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

from_mmap currently preserves panic-on-corrupt-input behavior.

Line 155 delegates to from_bytes, which still uses unchecked slicing/unwrap in parse steps. Malformed/truncated files can panic the process instead of returning InvalidData, despite the io::Result contract here.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/cloudsearch-storage/src/suggest_index.rs` around lines 154 - 156,
from_mmap currently calls from_bytes which can panic on corrupt/truncated input
due to unchecked slicing/unwraps; change the parsing in from_bytes to perform
bounds-checked parsing and return Result<T, std::io::Error> (use
std::io::Error::new(InvalidData, ...)) instead of panicking, replacing all
unwraps/unchecked slice accesses with safe checks that return Err on invalid
input, and then update from_mmap to propagate that io::Result from the revised
from_bytes so corrupt mmap data yields an Err rather than a panic (refactor
functions: from_bytes and from_mmap accordingly).

poyrazK added 5 commits June 13, 2026 17:00
… doc fixes

- Add backticks around body_data in PositionsReader doc comment
- Add # Errors section to SuggestReader::from_mmap
- Apply cargo fmt to suggest_index.rs
- lib.rs: change sync_all().ok()? to best-effort (let _ =) so sync
  errors don't prevent mmap fallback to heap read
- inverted_index.rs: change body_data to Arc<[u8]>, from_mmap now parses
  directly into mmap bytes without to_vec() copy (zero-copy body access)
- suggest_index.rs: remove incorrect # Panics doc on from_bytes since
  it already returns errors for malformed data
- Add async tempfile test for PositionsReader::read_mmap
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant