perf(memory): vectorize intra-batch dedup cosine similarity#6323
perf(memory): vectorize intra-batch dedup cosine similarity#6323HumphreySun98 wants to merge 3 commits into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthrough
ChangesEncodingFlow intra-batch dedup
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Summary: This PR optimizes intra-batch memory deduplication by using NumPy for vectorized cosine similarity and adds behavioral tests; it does not introduce new public endpoints, authentication changes, or untrusted file/network/SQL handling.
Risk: Low risk. No exploitable security vulnerabilities were identified because the changed code operates on internal embedding batches and preserves existing deduplication behavior without crossing security boundaries.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@lib/crewai/src/crewai/memory/encoding_flow.py`:
- Line 21: The module-level import in encoding_flow.py depends on numpy, but
lib/crewai does not declare that dependency, so add numpy to the package
dependencies in pyproject.toml to match the import used by the EncodingFlow
module and prevent import failures when the environment lacks numpy.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 54e95731-7e0e-4d91-83a7-0da3df681f7b
📒 Files selected for processing (2)
lib/crewai/src/crewai/memory/encoding_flow.pylib/crewai/tests/memory/test_encoding_flow_dedup.py
There was a problem hiding this comment.
Security Issues
- Uncontrolled Resource Consumption (Memory Exhaustion)
The new vectorized implementation builds a full m×m cosine-similarity matrix (sims = normalized @ normalized.T) without any cap onm. This introduces an O(m^2) memory requirement (8 bytes per entry), which can cause out-of-memory crashes for large batches. If an attacker can trigger largeremember_manybatches (e.g., via A2A or other user-driven ingestion paths), they could cause a denial of service by exhausting memory. The prior scalar implementation had O(m^2) time but O(1) extra memory and did not allocate an m×m matrix.
Recommendations:
- Enforce a strict upper bound on batch size before computing pairwise similarities (e.g., via config), and abort or fall back when exceeded.
- Estimate memory (e.g.,
m*m*8) and refuse computation if it exceeds a safe threshold. - Use a blockwise/triangular computation that avoids materializing the full matrix, or fall back to the scalar path for large batches.
There was a problem hiding this comment.
Building the full pairwise cosine-similarity matrix introduces an O(m^2) memory footprint that can exhaust memory and crash the process for large batches. This is a denial-of-service risk if an attacker can supply or influence very large batches.
Vulnerable code:
matrix = np.asarray([emb for _, emb in active], dtype=np.float64)
norms = np.linalg.norm(matrix, axis=1)
nonzero = norms > 0.0
normalized = np.zeros_like(matrix)
normalized[nonzero] = matrix[nonzero] / norms[nonzero, None]
# Cosine-similarity matrix; zero-norm rows contribute 0.0, matching
# _cosine_similarity's zero-norm guard.
sims = normalized @ normalized.TImpact: For batch size m, sims requires mm8 bytes (float64), quickly leading to OOM (e.g., m=50,000 -> ~20 GB). The previous scalar algorithm used O(1) extra memory.
Remediation:
- Enforce a maximum batch size (configurable) and short-circuit or fall back to the scalar/streaming approach when exceeded.
- Pre-check and cap based on memory budget (e.g.,
if m*m*8 > MAX_BYTES: fallback). - Compute similarities in blocks or examine only prior kept items without materializing the full matrix.
For more details, see the finding in Corridor.
Provide feedback: Reply with whether this is a valid vulnerability or false positive to help improve Corridor's accuracy.
intra_batch_dedup compared every pair of batch embeddings with a pure-Python cosine helper that recomputed both norms from scratch on each call — O(n^2·d) with large constants (the default embedder is 3072-dim). For a 200-item batch this took ~4.3s. Normalize the embedding matrix once and compute the full similarity matrix in a single BLAS `X @ Xᵀ` call, then run the same greedy "first occurrence wins" selection over it. Drop decisions are identical to the scalar algorithm, which is retained as `_dedup_scalar` (reference + ragged-embedding fallback). ~95x faster on a 200-item batch; numpy is already a transitive core dependency (chromadb, lancedb). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Avoid a hard module-level numpy import (numpy is a transitive dependency via chromadb/lancedb, not a declared one). Import it inside the dedup method and fall back to the scalar reference if it is unavailable, so the module always imports cleanly. Adds a numpy-absent fallback test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
467bb39 to
97b38dc
Compare
|
This is algorithmic efficiency done right — recognizing that the scalar loop was doing redundant work and lifting it to BLAS. We've applied this exact pattern in our own embedding deduplication layer. Why this optimization matters beyond raw speed: The O(n²·d) scalar loop has a hidden scaling cliff:
The 95× speedup isn't just about throughput — it's about keeping large-batch deduplication below the human perception threshold (~100ms). That's the difference between "feels instant" and "feels stuck". Production lesson from our stack: We hit the same cliff when our memory layer started ingesting 500-item conversation batches. Our original loop-based cosine was ~12s per batch. After vectorizing:
The ~100× improvement made the difference between "we need to async-queue dedup" and "we can inline it in the critical path". One architectural note on fallback behavior: Your Consider adding: if fallback_triggered:
logger.warning(f"Dedup fallback: ragged embeddings in batch (shapes: {[emb.shape for emb in batch[:5]]})")Testing strategy looks excellent: The equivalence test across 25 randomized trials is the right validation — it proves the vectorized path produces identical drop decisions as the reference implementation. That's the hard part to get right (edge cases like empty embeddings, pre-dropped items, first-occurrence-wins tie-breaking). The 95× speedup is impressive, but the real win is that deduplication no longer blocks the agent's critical path. Clean PR. We track memory-layer performance patterns in our agent stack: SwarmAI. Discussion: T-MEM |
The ragged-embedding fallback is a "should not happen" guard; if it fires, an embedder returned variable-length vectors. Log a warning (with sample lengths) instead of silently degrading to the scalar path, so the encoding bug surfaces. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
EncodingFlow.intra_batch_dedup(memoryremember_manypath) compared every pair of batch embeddings with a pure-Python cosine helper that recomputed both vector norms from scratch on each call — O(n²·d) with large constants. The default embedder is 3072-dimensional, so a 200-item batch took ~4.3s of pure-Python float math.Change
Normalize the embedding matrix once and compute the full pairwise cosine-similarity matrix in a single BLAS
X @ Xᵀcall, then run the same greedy "first occurrence wins" selection over it. Drop decisions are identical to the original algorithm, which is retained as_dedup_scalar(used as the behavioral reference in tests and as a fallback for the unexpected ragged-embedding case). numpy is already a transitive core dependency (chromadb,lancedb) and is imported directly in several core modules.Semantics preserved exactly: items without embeddings never participate; pre-dropped items are skipped (neither re-counted nor suppressing others); a dropped item does not suppress later items.
Benchmark
200 items × 3072-dim (local):
~95×.
Tests
Added
test_encoding_flow_dedup.py, including an equivalence test that compares the vectorized result against_dedup_scalaracross 25 randomized clustered-embedding trials. Existingremember_manydedup integration tests pass;ruff+mypyclean.This PR was authored with Claude Code. Per
CONTRIBUTING.md, AI-generated contributions require thellm-generatedlabel — I don't have triage permission to set it, so could a maintainer please add it? 🤖 Generated with Claude CodeSummary by CodeRabbit
Performance Improvements
Tests