perf(memory): vectorize intra-batch dedup cosine similarity by HumphreySun98 · Pull Request #6323 · crewAIInc/crewAI

HumphreySun98 · 2026-06-24T19:18:04Z

Summary

EncodingFlow.intra_batch_dedup (memory remember_many path) compared every pair of batch embeddings with a pure-Python cosine helper that recomputed both vector norms from scratch on each call — O(n²·d) with large constants. The default embedder is 3072-dimensional, so a 200-item batch took ~4.3s of pure-Python float math.

Change

Normalize the embedding matrix once and compute the full pairwise cosine-similarity matrix in a single BLAS X @ Xᵀ call, then run the same greedy "first occurrence wins" selection over it. Drop decisions are identical to the original algorithm, which is retained as _dedup_scalar (used as the behavioral reference in tests and as a fallback for the unexpected ragged-embedding case). numpy is already a transitive core dependency (chromadb, lancedb) and is imported directly in several core modules.

Semantics preserved exactly: items without embeddings never participate; pre-dropped items are skipped (neither re-counted nor suppressing others); a dropped item does not suppress later items.

Benchmark

200 items × 3072-dim (local):

	time
scalar O(n²·d)	~4350 ms
vectorized	~46 ms

~95×.

Tests

Added test_encoding_flow_dedup.py, including an equivalence test that compares the vectorized result against _dedup_scalar across 25 randomized clustered-embedding trials. Existing remember_many dedup integration tests pass; ruff + mypy clean.

This PR was authored with Claude Code. Per CONTRIBUTING.md, AI-generated contributions require the llm-generated label — I don't have triage permission to set it, so could a maintainer please add it? 🤖 Generated with Claude Code

Summary by CodeRabbit

Performance Improvements
- Improved intra-batch deduplication by using vectorized NumPy cosine-similarity for faster large-batch processing.
- Preserved dedup behavior, including “first occurrence wins,” correct handling of empty embeddings, and skipping already-dropped items.
- Added a safe fallback to the prior scalar approach when NumPy isn’t available or embeddings can’t be processed consistently.
Tests
- Added comprehensive deduplication tests to verify correctness, fallback behavior, and equivalence with the scalar reference across randomized cases.

coderabbitai · 2026-06-24T19:18:27Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: bd33b286-fd62-4701-8251-802302eddf8c

📥 Commits

Reviewing files that changed from the base of the PR and between 47d299f and 467bb39.

📒 Files selected for processing (2)

lib/crewai/src/crewai/memory/encoding_flow.py
lib/crewai/tests/memory/test_encoding_flow_dedup.py

🚧 Files skipped from review as they are similar to previous changes (2)

lib/crewai/tests/memory/test_encoding_flow_dedup.py
lib/crewai/src/crewai/memory/encoding_flow.py

📝 Walkthrough

Walkthrough

intra_batch_dedup() now uses a NumPy cosine-similarity matrix, skips pre-dropped and empty-embedding items, and falls back to _dedup_scalar for ragged embeddings. New tests cover duplicate handling, ordering, skipped inputs, and scalar-reference equivalence.

Changes

EncodingFlow intra-batch dedup

Layer / File(s)	Summary
Vectorized dedup path `lib/crewai/src/crewai/memory/encoding_flow.py`	`intra_batch_dedup()` now normalizes active embeddings, computes a cosine-similarity matrix with NumPy, applies the greedy threshold rule, and falls back to `_dedup_scalar` for ragged embeddings.
Behavioral dedup tests `lib/crewai/tests/memory/test_encoding_flow_dedup.py`	Tests cover identical embeddings, first-occurrence wins, empty embeddings, pre-dropped inputs, and NumPy-unavailable fallback behavior for `intra_batch_dedup()`.
Scalar reference comparison `lib/crewai/tests/memory/test_encoding_flow_dedup.py`	A randomized clustered-embedding test compares vectorized drop decisions with `_dedup_scalar` across repeated trials.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: vectorizing intra-batch cosine-similarity dedup in memory for performance.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

corridor-security

Summary: This PR optimizes intra-batch memory deduplication by using NumPy for vectorized cosine similarity and adds behavioral tests; it does not introduce new public endpoints, authentication changes, or untrusted file/network/SQL handling.

Risk: Low risk. No exploitable security vulnerabilities were identified because the changed code operates on internal embedding batches and preserves existing deduplication behavior without crossing security boundaries.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@lib/crewai/src/crewai/memory/encoding_flow.py`:
- Line 21: The module-level import in encoding_flow.py depends on numpy, but
lib/crewai does not declare that dependency, so add numpy to the package
dependencies in pyproject.toml to match the import used by the EncodingFlow
module and prevent import failures when the environment lacks numpy.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 54e95731-7e0e-4d91-83a7-0da3df681f7b

📥 Commits

Reviewing files that changed from the base of the PR and between 5827abb and 47d299f.

📒 Files selected for processing (2)

lib/crewai/src/crewai/memory/encoding_flow.py
lib/crewai/tests/memory/test_encoding_flow_dedup.py

corridor-security

Security Issues

Uncontrolled Resource Consumption (Memory Exhaustion)
The new vectorized implementation builds a full m×m cosine-similarity matrix (sims = normalized @ normalized.T) without any cap on m. This introduces an O(m^2) memory requirement (8 bytes per entry), which can cause out-of-memory crashes for large batches. If an attacker can trigger large remember_many batches (e.g., via A2A or other user-driven ingestion paths), they could cause a denial of service by exhausting memory. The prior scalar implementation had O(m^2) time but O(1) extra memory and did not allocate an m×m matrix.

Recommendations:

Enforce a strict upper bound on batch size before computing pairwise similarities (e.g., via config), and abort or fall back when exceeded.
Estimate memory (e.g., m*m*8) and refuse computation if it exceeds a safe threshold.
Use a blockwise/triangular computation that avoids materializing the full matrix, or fall back to the scalar path for large batches.

corridor-security · 2026-06-24T19:39:25Z

Building the full pairwise cosine-similarity matrix introduces an O(m^2) memory footprint that can exhaust memory and crash the process for large batches. This is a denial-of-service risk if an attacker can supply or influence very large batches.

Vulnerable code:

matrix = np.asarray([emb for _, emb in active], dtype=np.float64) norms = np.linalg.norm(matrix, axis=1) nonzero = norms > 0.0 normalized = np.zeros_like(matrix) normalized[nonzero] = matrix[nonzero] / norms[nonzero, None] # Cosine-similarity matrix; zero-norm rows contribute 0.0, matching # _cosine_similarity's zero-norm guard. sims = normalized @ normalized.T

Impact: For batch size m, sims requires mm8 bytes (float64), quickly leading to OOM (e.g., m=50,000 -> ~20 GB). The previous scalar algorithm used O(1) extra memory.

Remediation:

Enforce a maximum batch size (configurable) and short-circuit or fall back to the scalar/streaming approach when exceeded.

Pre-check and cap based on memory budget (e.g., if m*m*8 > MAX_BYTES: fallback).

Compute similarities in blocks or examine only prior kept items without materializing the full matrix.

For more details, see the finding in Corridor.

Provide feedback: Reply with whether this is a valid vulnerability or false positive to help improve Corridor's accuracy.

intra_batch_dedup compared every pair of batch embeddings with a pure-Python cosine helper that recomputed both norms from scratch on each call — O(n^2·d) with large constants (the default embedder is 3072-dim). For a 200-item batch this took ~4.3s. Normalize the embedding matrix once and compute the full similarity matrix in a single BLAS `X @ Xᵀ` call, then run the same greedy "first occurrence wins" selection over it. Drop decisions are identical to the scalar algorithm, which is retained as `_dedup_scalar` (reference + ragged-embedding fallback). ~95x faster on a 200-item batch; numpy is already a transitive core dependency (chromadb, lancedb). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Avoid a hard module-level numpy import (numpy is a transitive dependency via chromadb/lancedb, not a declared one). Import it inside the dedup method and fall back to the scalar reference if it is unavailable, so the module always imports cleanly. Adds a numpy-absent fallback test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

xg-gh-25 · 2026-06-25T02:44:18Z

This is algorithmic efficiency done right — recognizing that the scalar loop was doing redundant work and lifting it to BLAS. We've applied this exact pattern in our own embedding deduplication layer.

Why this optimization matters beyond raw speed:

The O(n²·d) scalar loop has a hidden scaling cliff:

Small batches (<50 items): negligible impact, Python overhead dominates
Medium batches (50-200): noticeable but tolerable (~1-5s)
Large batches (200+): blocks the event loop and makes interactive UX feel broken

The 95× speedup isn't just about throughput — it's about keeping large-batch deduplication below the human perception threshold (~100ms). That's the difference between "feels instant" and "feels stuck".

Production lesson from our stack:

We hit the same cliff when our memory layer started ingesting 500-item conversation batches. Our original loop-based cosine was ~12s per batch. After vectorizing:

Normalize once: X_normed = X / np.linalg.norm(X, axis=1, keepdims=True)
Pairwise in one shot: similarity = X_normed @ X_normed.T
Keep the logic unchanged: same greedy selection, same drop semantics

The ~100× improvement made the difference between "we need to async-queue dedup" and "we can inline it in the critical path".

One architectural note on fallback behavior:

Your _dedup_scalar fallback is solid engineering — it ensures the ragged-embedding edge case doesn't break the pipeline. But in production, we found it useful to log when the fallback fires (with a sample of the problematic batch shape). This surfaces encoding bugs early (e.g., a provider returning variable-length embeddings) rather than silently degrading to O(n²·d).

Consider adding:

if fallback_triggered:
    logger.warning(f"Dedup fallback: ragged embeddings in batch (shapes: {[emb.shape for emb in batch[:5]]})")

Testing strategy looks excellent:

The equivalence test across 25 randomized trials is the right validation — it proves the vectorized path produces identical drop decisions as the reference implementation. That's the hard part to get right (edge cases like empty embeddings, pre-dropped items, first-occurrence-wins tie-breaking).

The 95× speedup is impressive, but the real win is that deduplication no longer blocks the agent's critical path. Clean PR.

We track memory-layer performance patterns in our agent stack: SwarmAI. Discussion: T-MEM

The ragged-embedding fallback is a "should not happen" guard; if it fires, an embedder returned variable-length vectors. Log a warning (with sample lengths) instead of silently degrading to the scalar path, so the encoding bug surfaces. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

corridor-security Bot reviewed Jun 24, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread lib/crewai/src/crewai/memory/encoding_flow.py Outdated

corridor-security Bot reviewed Jun 24, 2026

View reviewed changes

HumphreySun98 and others added 2 commits June 24, 2026 15:15

HumphreySun98 force-pushed the perf/vectorize-intra-batch-dedup branch from 467bb39 to 97b38dc Compare June 24, 2026 20:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(memory): vectorize intra-batch dedup cosine similarity#6323

perf(memory): vectorize intra-batch dedup cosine similarity#6323
HumphreySun98 wants to merge 3 commits into
crewAIInc:mainfrom
HumphreySun98:perf/vectorize-intra-batch-dedup

HumphreySun98 commented Jun 24, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 24, 2026 •

edited

Loading

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

corridor-security Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

corridor-security Bot left a comment

Uh oh!

corridor-security Bot Jun 24, 2026

Uh oh!

xg-gh-25 commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

HumphreySun98 commented Jun 24, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Change

Benchmark

Tests

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

corridor-security Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

corridor-security Bot left a comment

Choose a reason for hiding this comment

Uh oh!

corridor-security Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

xg-gh-25 commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

HumphreySun98 commented Jun 24, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 24, 2026 •

edited

Loading