feat(storage): mmap-backed sidecar readers#116
Conversation
- Add memmap2 dependency for memory-mapped file support - PositionsReader: add read_mmap() and from_mmap() using Arc<Mmap> to keep mapping alive - SuggestReader: add from_mmap() using Arc<Mmap> pattern - Update load functions in IndexHandle to use mmap-based loading with fallback to heap read - Both readers fall back to fs::read + from_bytes if mmap fails - PositionsReader stores mmap: Option<Arc<Mmap>> to own the mapping, body_data Vec is still used as the parsed data slice - SuggestReader stores _mmap: Option<Arc<Mmap>> to keep mapping alive - Backwards-compatible: from_bytes sets mmap/Arc fields to None
|
Warning Review limit reached
More reviews will be available in 52 minutes and 24 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThis PR introduces memory-mapped IO support to the cloudsearch storage layer and integrates it into sidecar loading paths. The workspace now declares ChangesMemory-mapped IO Integration
Sequence DiagramssequenceDiagram
participant Caller
participant PositionsReader
participant memmap2
participant File
Caller->>PositionsReader: read_mmap(path)
PositionsReader->>File: open_async(path)
PositionsReader->>File: sync_all()
PositionsReader->>memmap2: Mmap::map(file)
memmap2->>PositionsReader: mmap
PositionsReader->>PositionsReader: from_mmap(mmap)
PositionsReader->>Caller: Self with mmap retained
sequenceDiagram
participant IndexHandle
participant load_single_suggest_reader
participant load_single_positions_reader
participant SuggestReader
participant PositionsReader
IndexHandle->>load_single_suggest_reader: load segment
load_single_suggest_reader->>SuggestReader: from_mmap(path)
alt mmap succeeds with fields
SuggestReader->>load_single_suggest_reader: reader
else fallback to heap
load_single_suggest_reader->>SuggestReader: from_bytes(data)
end
load_single_suggest_reader->>IndexHandle: reader
IndexHandle->>load_single_positions_reader: load segment
load_single_positions_reader->>PositionsReader: read_mmap(path)
PositionsReader->>load_single_positions_reader: reader
load_single_positions_reader->>IndexHandle: reader
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
rust/crates/cloudsearch-storage/src/inverted_index.rs (1)
329-385: ⚡ Quick winAdd direct tests for mmap loading paths.
This suite validates
from_bytes, but the newread_mmap/from_mmapconstructors are not directly exercised. Add at least one tempfile-backed mmap round-trip assertion to catch mmap-specific regressions.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rust/crates/cloudsearch-storage/src/inverted_index.rs` around lines 329 - 385, Add a test that exercises the mmap constructors (read_mmap/from_mmap) by writing the same minimal positions file bytes to a tempfile, syncing/flush the file, memory-mapping it (read-only) and calling PositionsReader::read_mmap or PositionsReader::from_mmap to construct the reader, then assert term_count(), get("test") and postings match the existing from_bytes assertions; use tempfile::NamedTempFile (or tempfile::tempfile) and the project's mmap helper (or memmap2) to create the mmap region and ensure the file is persisted before mapping so mmap-specific parsing is validated.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@rust/crates/cloudsearch-index/src/lib.rs`:
- Around line 214-217: At both affected sites
(rust/crates/cloudsearch-index/src/lib.rs:214-217 and
rust/crates/cloudsearch-index/src/lib.rs:1611-1614) the call
std_file.sync_all().ok()? must not short-circuit the mmap/heap fallback; change
the behavior so sync_all() errors are handled locally and do not return None —
either remove the sync_all() call or catch its Result and continue (e.g., log or
ignore the error) before attempting memmap/fs::read; apply the same change at
the flush-time helper site so a failing sync does not prevent the fallback path.
In `@rust/crates/cloudsearch-storage/src/inverted_index.rs`:
- Around line 238-242: The from_mmap implementation currently calls from_bytes,
which copies the body into body_data (via to_vec) and defeats the mmap memory
advantage; instead modify from_mmap to parse the header and establish references
directly into the mmap backing buffer and then set reader.mmap =
Some(Arc::new(mmap)) so body_data is a borrowed view (or an owned Arc-backed
slice) into that mmap rather than a cloned Vec. Concretely: avoid invoking
from_bytes from from_mmap; reuse the same parsing logic used by from_bytes to
locate the header/body ranges on the mmap byte slice, set
PositionsReader::body_data to a zero-copy reference into the mmap bytes (or an
Arc<[u8]> view), and then set reader.mmap to the Arc-wrapped mmap so the bytes
remain valid.
In `@rust/crates/cloudsearch-storage/src/suggest_index.rs`:
- Around line 154-156: from_mmap currently calls from_bytes which can panic on
corrupt/truncated input due to unchecked slicing/unwraps; change the parsing in
from_bytes to perform bounds-checked parsing and return Result<T,
std::io::Error> (use std::io::Error::new(InvalidData, ...)) instead of
panicking, replacing all unwraps/unchecked slice accesses with safe checks that
return Err on invalid input, and then update from_mmap to propagate that
io::Result from the revised from_bytes so corrupt mmap data yields an Err rather
than a panic (refactor functions: from_bytes and from_mmap accordingly).
---
Nitpick comments:
In `@rust/crates/cloudsearch-storage/src/inverted_index.rs`:
- Around line 329-385: Add a test that exercises the mmap constructors
(read_mmap/from_mmap) by writing the same minimal positions file bytes to a
tempfile, syncing/flush the file, memory-mapping it (read-only) and calling
PositionsReader::read_mmap or PositionsReader::from_mmap to construct the
reader, then assert term_count(), get("test") and postings match the existing
from_bytes assertions; use tempfile::NamedTempFile (or tempfile::tempfile) and
the project's mmap helper (or memmap2) to create the mmap region and ensure the
file is persisted before mapping so mmap-specific parsing is validated.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3ab73f35-d4bb-40c4-a43f-57660035ba01
⛔ Files ignored due to path filters (1)
rust/Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (6)
rust/Cargo.tomlrust/crates/cloudsearch-index/Cargo.tomlrust/crates/cloudsearch-index/src/lib.rsrust/crates/cloudsearch-storage/Cargo.tomlrust/crates/cloudsearch-storage/src/inverted_index.rsrust/crates/cloudsearch-storage/src/suggest_index.rs
| pub fn from_mmap(mmap: memmap2::Mmap) -> std::io::Result<Self> { | ||
| let result = Self::from_bytes(&mmap)?; | ||
| Ok(Self { |
There was a problem hiding this comment.
from_mmap currently preserves panic-on-corrupt-input behavior.
Line 155 delegates to from_bytes, which still uses unchecked slicing/unwrap in parse steps. Malformed/truncated files can panic the process instead of returning InvalidData, despite the io::Result contract here.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@rust/crates/cloudsearch-storage/src/suggest_index.rs` around lines 154 - 156,
from_mmap currently calls from_bytes which can panic on corrupt/truncated input
due to unchecked slicing/unwraps; change the parsing in from_bytes to perform
bounds-checked parsing and return Result<T, std::io::Error> (use
std::io::Error::new(InvalidData, ...)) instead of panicking, replacing all
unwraps/unchecked slice accesses with safe checks that return Err on invalid
input, and then update from_mmap to propagate that io::Result from the revised
from_bytes so corrupt mmap data yields an Err rather than a panic (refactor
functions: from_bytes and from_mmap accordingly).
… doc fixes - Add backticks around body_data in PositionsReader doc comment - Add # Errors section to SuggestReader::from_mmap - Apply cargo fmt to suggest_index.rs
- lib.rs: change sync_all().ok()? to best-effort (let _ =) so sync errors don't prevent mmap fallback to heap read - inverted_index.rs: change body_data to Arc<[u8]>, from_mmap now parses directly into mmap bytes without to_vec() copy (zero-copy body access) - suggest_index.rs: remove incorrect # Panics doc on from_bytes since it already returns errors for malformed data - Add async tempfile test for PositionsReader::read_mmap
Summary
Summary by CodeRabbit