perf(memory): vectorize intra-batch dedup cosine similarity#6323

Open
HumphreySun98 wants to merge 3 commits into
crewAIInc:mainfrom
HumphreySun98:perf/vectorize-intra-batch-dedup

@HumphreySun98 HumphreySun98 commented Jun 24, 2026

Summary

EncodingFlow.intra_batch_dedup (memory remember_many path) compared every pair of batch embeddings with a pure-Python cosine helper that recomputed both vector norms from scratch on each call — O(n²·d) with large constants. The default embedder is 3072-dimensional, so a 200-item batch took ~4.3s of pure-Python float math.
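To make the cost concrete, here is a minimal sketch of what a scalar pairwise dedup of this kind looks like. It is not the PR's `_dedup_scalar`; the function names, the `>=` comparison, and the default `threshold` are assumptions made for illustration. The point it shows is that both norms are recomputed inside every pairwise call, so each of the O(n²) comparisons costs O(d) three times over.

```python
import math


def cosine_similarity(a, b):
    """Pure-Python cosine; recomputes both norms on every call."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # zero-norm guard
    return dot / (norm_a * norm_b)


def dedup_scalar(embeddings, threshold=0.98):
    """Return the indices to drop (hypothetical helper, assumed threshold).

    Greedy "first occurrence wins": item i is dropped if any earlier item
    that was kept is at least `threshold` similar to it.
    """
    dropped = set()
    for i, emb in enumerate(embeddings):
        if not emb:
            continue  # items without embeddings never participate
        for j in range(i):
            if j in dropped or not embeddings[j]:
                continue  # a dropped item does not suppress later items
            if cosine_similarity(embeddings[j], emb) >= threshold:
                dropped.add(i)
                break
    return dropped
```

For example, `dedup_scalar([[1.0, 0.0], [1.0, 0.0], [], [0.0, 1.0]])` returns `{1}`: the second item duplicates the first, the empty embedding is ignored, and the orthogonal vector is kept.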

Change

Normalize the embedding matrix once and compute the full pairwise cosine-similarity matrix in a single BLAS X @ Xᵀ call, then run the same greedy "first occurrence wins" selection over it. Drop decisions are identical to the original algorithm, which is retained as _dedup_scalar (used as the behavioral reference in tests and as a fallback for the unexpected ragged-embedding case). numpy is already a transitive core dependency (chromadb, lancedb) and is imported directly in several core modules.

Semantics preserved exactly: items without embeddings never participate; pre-dropped items are skipped (neither re-counted nor suppressing others); a dropped item does not suppress later items.
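The approach described above can be sketched roughly as follows. This is an illustrative reconstruction, not the code in `encoding_flow.py`: the function name `greedy_dedup`, the plain-list input, the `>=` comparison, and the default `threshold` are all assumptions, and handling of already-dropped items and ragged embeddings is left out for brevity.

```python
import numpy as np


def greedy_dedup(embeddings, threshold=0.98):
    """Return indices to drop (hypothetical helper, assumed threshold)."""
    # Items without embeddings never participate.
    active = [(i, e) for i, e in enumerate(embeddings) if e]
    if len(active) < 2:
        return set()

    matrix = np.asarray([e for _, e in active], dtype=np.float64)
    norms = np.linalg.norm(matrix, axis=1)
    nonzero = norms > 0.0
    normalized = np.zeros_like(matrix)
    normalized[nonzero] = matrix[nonzero] / norms[nonzero, None]

    # One matrix product yields every pairwise cosine similarity;
    # zero-norm rows stay all-zero and therefore score 0.0.
    sims = normalized @ normalized.T

    kept = []       # positions (within `active`) of items kept so far
    dropped = set()  # original indices of dropped items
    for pos, (idx, _) in enumerate(active):
        # Compare only against earlier *kept* items, so a dropped item
        # never suppresses a later one.
        if kept and np.any(sims[pos, kept] >= threshold):
            dropped.add(idx)
        else:
            kept.append(pos)
    return dropped
```

The selection loop is still O(n²) in the worst case, but each step is a NumPy comparison over a precomputed row rather than a fresh O(d) dot product in Python.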

Benchmark

200 items × 3072-dim (local):

| | time |
|---|---|
| scalar O(n²·d) | ~4350 ms |
| vectorized | ~46 ms |

~95×.

Tests

Added test_encoding_flow_dedup.py, including an equivalence test that compares the vectorized result against _dedup_scalar across 25 randomized clustered-embedding trials. Existing remember_many dedup integration tests pass; ruff + mypy clean.


This PR was authored with Claude Code. Per CONTRIBUTING.md, AI-generated contributions require the llm-generated label — I don't have triage permission to set it, so could a maintainer please add it? 🤖 Generated with Claude Code

Summary by CodeRabbit

  • Performance Improvements

    • Improved intra-batch deduplication by using vectorized NumPy cosine-similarity for faster large-batch processing.
    • Preserved dedup behavior, including “first occurrence wins,” correct handling of empty embeddings, and skipping already-dropped items.
    • Added a safe fallback to the prior scalar approach when NumPy isn’t available or embeddings can’t be processed consistently.
  • Tests

    • Added comprehensive deduplication tests to verify correctness, fallback behavior, and equivalence with the scalar reference across randomized cases.

coderabbitai Bot commented Jun 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: bd33b286-fd62-4701-8251-802302eddf8c

📥 Commits

Reviewing files that changed from the base of the PR and between 47d299f and 467bb39.

📒 Files selected for processing (2)
  • lib/crewai/src/crewai/memory/encoding_flow.py
  • lib/crewai/tests/memory/test_encoding_flow_dedup.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • lib/crewai/tests/memory/test_encoding_flow_dedup.py
  • lib/crewai/src/crewai/memory/encoding_flow.py

📝 Walkthrough

Walkthrough

intra_batch_dedup() now uses a NumPy cosine-similarity matrix, skips pre-dropped and empty-embedding items, and falls back to _dedup_scalar for ragged embeddings. New tests cover duplicate handling, ordering, skipped inputs, and scalar-reference equivalence.

Changes

EncodingFlow intra-batch dedup

| Layer / File(s) | Summary |
|---|---|
| Vectorized dedup path: lib/crewai/src/crewai/memory/encoding_flow.py | intra_batch_dedup() now normalizes active embeddings, computes a cosine-similarity matrix with NumPy, applies the greedy threshold rule, and falls back to _dedup_scalar for ragged embeddings. |
| Behavioral dedup tests: lib/crewai/tests/memory/test_encoding_flow_dedup.py | Tests cover identical embeddings, first-occurrence wins, empty embeddings, pre-dropped inputs, and NumPy-unavailable fallback behavior for intra_batch_dedup(). |
| Scalar reference comparison: lib/crewai/tests/memory/test_encoding_flow_dedup.py | A randomized clustered-embedding test compares vectorized drop decisions with _dedup_scalar across repeated trials. |
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: vectorizing intra-batch cosine-similarity dedup in memory for performance. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@corridor-security corridor-security Bot left a comment

Summary: This PR optimizes intra-batch memory deduplication by using NumPy for vectorized cosine similarity and adds behavioral tests; it does not introduce new public endpoints, authentication changes, or untrusted file/network/SQL handling.

Risk: Low risk. No exploitable security vulnerabilities were identified because the changed code operates on internal embedding batches and preserves existing deduplication behavior without crossing security boundaries.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@lib/crewai/src/crewai/memory/encoding_flow.py`:
- Line 21: The module-level import in encoding_flow.py depends on numpy, but
lib/crewai does not declare that dependency, so add numpy to the package
dependencies in pyproject.toml to match the import used by the EncodingFlow
module and prevent import failures when the environment lacks numpy.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 54e95731-7e0e-4d91-83a7-0da3df681f7b

📥 Commits

Reviewing files that changed from the base of the PR and between 5827abb and 47d299f.

📒 Files selected for processing (2)
  • lib/crewai/src/crewai/memory/encoding_flow.py
  • lib/crewai/tests/memory/test_encoding_flow_dedup.py

Comment thread lib/crewai/src/crewai/memory/encoding_flow.py Outdated

@corridor-security corridor-security Bot left a comment

Security Issues

  • Uncontrolled Resource Consumption (Memory Exhaustion)
    The new vectorized implementation builds a full m×m cosine-similarity matrix (sims = normalized @ normalized.T) without any cap on m. This introduces an O(m^2) memory requirement (8 bytes per float64 entry), which can cause out-of-memory crashes for large batches. If an attacker can trigger large remember_many batches (e.g., via A2A or other user-driven ingestion paths), they could cause a denial of service by exhausting memory. The prior scalar implementation took O(m^2·d) time but only O(1) extra memory, and did not allocate an m×m matrix.

Recommendations:

  • Enforce a strict upper bound on batch size before computing pairwise similarities (e.g., via config), and abort or fall back when exceeded.
  • Estimate memory (e.g., m*m*8) and refuse computation if it exceeds a safe threshold.
  • Use a blockwise/triangular computation that avoids materializing the full matrix, or fall back to the scalar path for large batches.
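A memory-budget pre-check along the lines of the second recommendation could look like the sketch below. The helper names and the 256 MiB default budget are hypothetical values chosen for illustration; nothing in the PR defines them.

```python
def similarity_matrix_bytes(m, itemsize=8):
    """Bytes needed for a dense m x m matrix (float64 is 8 bytes per entry)."""
    return m * m * itemsize


def choose_dedup_path(m, max_bytes=256 * 1024 * 1024):
    """Pick a path from the estimated matrix size (assumed budget)."""
    if similarity_matrix_bytes(m) <= max_bytes:
        return "vectorized"
    return "fallback"
```

With these definitions a 200-item batch needs 320,000 bytes and takes the vectorized path, while `similarity_matrix_bytes(50_000)` is 20,000,000,000 bytes and would be routed to the fallback.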

Building the full pairwise cosine-similarity matrix introduces an O(m^2) memory footprint that can exhaust memory and crash the process for large batches. This is a denial-of-service risk if an attacker can supply or influence very large batches.

Vulnerable code:

matrix = np.asarray([emb for _, emb in active], dtype=np.float64)
norms = np.linalg.norm(matrix, axis=1)
nonzero = norms > 0.0
normalized = np.zeros_like(matrix)
normalized[nonzero] = matrix[nonzero] / norms[nonzero, None]
# Cosine-similarity matrix; zero-norm rows contribute 0.0, matching
# _cosine_similarity's zero-norm guard.
sims = normalized @ normalized.T

Impact: For batch size m, sims requires m*m*8 bytes (float64), quickly leading to OOM (e.g., m=50,000 -> ~20 GB). The previous scalar algorithm used O(1) extra memory.

Remediation:

  • Enforce a maximum batch size (configurable) and short-circuit or fall back to the scalar/streaming approach when exceeded.
  • Pre-check and cap based on memory budget (e.g., if m*m*8 > MAX_BYTES: fallback).
  • Compute similarities in blocks or examine only prior kept items without materializing the full matrix.
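The third remediation, computing similarities in blocks, might be sketched as below. This is one possible shape under stated assumptions, not something proposed verbatim in the review or implemented in the PR: `greedy_dedup_blockwise` and `block_size` are hypothetical names, and the input is assumed to be a non-empty batch of equal-length, non-empty embeddings. Each slab holds at most `block_size × m` similarities instead of the full `m × m` matrix.

```python
import numpy as np


def greedy_dedup_blockwise(embeddings, threshold=0.98, block_size=1024):
    """Return indices to drop without materializing the full m x m matrix."""
    matrix = np.asarray(embeddings, dtype=np.float64)
    m = matrix.shape[0]
    norms = np.linalg.norm(matrix, axis=1)
    # Divide zero-norm rows by 1.0 so they stay all-zero (similarity 0.0).
    safe_norms = np.where(norms > 0.0, norms, 1.0)
    normalized = matrix / safe_norms[:, None]

    kept_mask = np.zeros(m, dtype=bool)
    dropped = set()
    for start in range(0, m, block_size):
        stop = min(start + block_size, m)
        # Similarities of this block's rows against every row up to `stop`.
        sims = normalized[start:stop] @ normalized[:stop].T
        for i in range(start, stop):
            earlier = sims[i - start, :i]
            # Only earlier items that were kept can suppress item i.
            if np.any(earlier[kept_mask[:i]] >= threshold):
                dropped.add(i)
            else:
                kept_mask[i] = True
    return dropped
```

The drop decisions follow the same greedy "first occurrence wins" rule; only the way the similarities are produced changes. The trade-off is more, smaller matrix products in exchange for a bounded peak allocation.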


HumphreySun98 and others added 2 commits June 24, 2026 15:15
intra_batch_dedup compared every pair of batch embeddings with a pure-Python
cosine helper that recomputed both norms from scratch on each call — O(n^2·d)
with large constants (the default embedder is 3072-dim). For a 200-item batch
this took ~4.3s.

Normalize the embedding matrix once and compute the full similarity matrix in
a single BLAS `X @ Xᵀ` call, then run the same greedy "first occurrence wins"
selection over it. Drop decisions are identical to the scalar algorithm, which
is retained as `_dedup_scalar` (reference + ragged-embedding fallback). ~95x
faster on a 200-item batch; numpy is already a transitive core dependency
(chromadb, lancedb).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Avoid a hard module-level numpy import (numpy is a transitive dependency via
chromadb/lancedb, not a declared one). Import it inside the dedup method and
fall back to the scalar reference if it is unavailable, so the module always
imports cleanly. Adds a numpy-absent fallback test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
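
The deferred-import pattern described in this commit can be illustrated with a small, self-contained sketch. `pairwise_dot` is a hypothetical helper, not a function from the PR; it only shows the shape of the technique: import NumPy inside the function, and take a pure-Python path when the import fails, so the enclosing module always imports cleanly.

```python
def pairwise_dot(vectors):
    """Pairwise dot products; uses NumPy when available (hypothetical helper)."""
    try:
        import numpy as np  # deferred so the module never hard-depends on it
    except ImportError:
        # Pure-Python fallback with the same result, just slower.
        return [
            [sum(x * y for x, y in zip(a, b)) for b in vectors]
            for a in vectors
        ]
    arr = np.asarray(vectors, dtype=np.float64)
    return (arr @ arr.T).tolist()
```

Either path gives `[[5.0, 11.0], [11.0, 25.0]]` for `pairwise_dot([[1.0, 2.0], [3.0, 4.0]])`.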
@HumphreySun98 HumphreySun98 force-pushed the perf/vectorize-intra-batch-dedup branch from 467bb39 to 97b38dc Compare June 24, 2026 20:16
@xg-gh-25

This is algorithmic efficiency done right — recognizing that the scalar loop was doing redundant work and lifting it to BLAS. We've applied this exact pattern in our own embedding deduplication layer.

Why this optimization matters beyond raw speed:

The O(n²·d) scalar loop has a hidden scaling cliff:

  • Small batches (<50 items): negligible impact, Python overhead dominates
  • Medium batches (50-200): noticeable but tolerable (~1-5s)
  • Large batches (200+): blocks the event loop and makes interactive UX feel broken

The 95× speedup isn't just about throughput — it's about keeping large-batch deduplication below the human perception threshold (~100ms). That's the difference between "feels instant" and "feels stuck".

Production lesson from our stack:

We hit the same cliff when our memory layer started ingesting 500-item conversation batches. Our original loop-based cosine was ~12s per batch. After vectorizing:

  1. Normalize once: X_normed = X / np.linalg.norm(X, axis=1, keepdims=True)
  2. Pairwise in one shot: similarity = X_normed @ X_normed.T
  3. Keep the logic unchanged: same greedy selection, same drop semantics

The ~100× improvement made the difference between "we need to async-queue dedup" and "we can inline it in the critical path".

One architectural note on fallback behavior:

Your _dedup_scalar fallback is solid engineering — it ensures the ragged-embedding edge case doesn't break the pipeline. But in production, we found it useful to log when the fallback fires (with a sample of the problematic batch shape). This surfaces encoding bugs early (e.g., a provider returning variable-length embeddings) rather than silently degrading to O(n²·d).

Consider adding:

if fallback_triggered:
    logger.warning(f"Dedup fallback: ragged embeddings in batch (shapes: {[emb.shape for emb in batch[:5]]})")

Testing strategy looks excellent:

The equivalence test across 25 randomized trials is the right validation — it proves the vectorized path produces identical drop decisions as the reference implementation. That's the hard part to get right (edge cases like empty embeddings, pre-dropped items, first-occurrence-wins tie-breaking).

The 95× speedup is impressive, but the real win is that deduplication no longer blocks the agent's critical path. Clean PR.


We track memory-layer performance patterns in our agent stack: SwarmAI. Discussion: T-MEM

The ragged-embedding fallback is a "should not happen" guard; if it fires, an
embedder returned variable-length vectors. Log a warning (with sample lengths)
instead of silently degrading to the scalar path, so the encoding bug surfaces.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
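
A warning of the kind this commit describes might look like the following sketch. The helper name `warn_if_ragged`, the message text, and the sample size of five are assumptions; embeddings are assumed to be plain sequences, so lengths are reported with `len()` rather than a `.shape` attribute.

```python
import logging

logger = logging.getLogger(__name__)


def warn_if_ragged(embeddings, sample=5):
    """Log a warning and return True if embeddings differ in length."""
    lengths = {len(e) for e in embeddings}
    if len(lengths) <= 1:
        return False
    # Lazy %-style formatting: the message is built only if it is emitted.
    logger.warning(
        "Dedup fallback: ragged embeddings in batch (sample lengths: %s)",
        [len(e) for e in embeddings[:sample]],
    )
    return True
```

A caller could check the return value to decide whether to take the scalar path, which keeps the "should not happen" case visible in logs.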
2 participants