Improve session transcript search by VamsiKrishna0101 · Pull Request #2330 · different-ai/openwork

VamsiKrishna0101 · 2026-06-21T14:40:44Z

Summary

Improves transcript search to match remembered terms instead of only exact phrases.
Adds ranked token matching for exact, prefix, fuzzy, and nearby matches.
Adds a visible sidebar entry point so users can discover session search without knowing the shortcut.
Adds unit coverage for loose matching, typo tolerance, user-message preference, and non-ASCII query terms.

Why

Session search previously required the query to appear as an exact lowercase substring in a message.
Users usually remember past work by concepts or partial phrases, not exact transcript wording.
Search was also difficult to discover because it was primarily exposed through Ctrl+Shift+F and the command palette.
This makes a search like "auth redirect" find transcript text such as "authentication failed after OAuth redirect", and makes the feature easier to reach from the sidebar.

Issue

N/A

Scope

Updates the local transcript search matcher in apps/app/src/react-app/domains/session/search/session-search.ts.
Tokenizes queries into meaningful terms and ignores short/common stop words.
Matches tokens by exact word, prefix, substring, and bounded edit distance for small typos.
Supports non-ASCII query terms with Unicode-aware tokenization.
Requires all important query tokens to match before returning a transcript result.
Scores matches by token strength and proximity.
Prefers user-authored messages when multiple transcript messages match.
Builds snippets around the matched token cluster.
Adds a visible Search sessions action in the sidebar footer.
Reuses the existing SessionSearchDialog.
Adds apps/app/tests/session-search.test.ts.
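
The matching steps listed above could look roughly like the following. This is a minimal sketch, not the code in session-search.ts: the stop-word list, the score weights (100/70/40), and the names queryTokens, scoreWord, and matchText are assumptions made for illustration, and the edit-distance tier is left out to keep it short.

```typescript
// Illustrative sketch only; names, weights, and stop words are assumptions.
type WordRange = { word: string; start: number; end: number };

const STOP_WORDS = new Set(["the", "a", "an", "and", "or", "of", "to", "in"]);

// Unicode-aware tokenization: runs of letters (\p{L}) and digits (\p{N}).
// Offsets refer to the original text; only the stored word is lowercased.
function wordRanges(text: string): WordRange[] {
  const pattern = /[\p{L}\p{N}]+/gu; // fresh instance, so no shared lastIndex
  const ranges: WordRange[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    ranges.push({ word: m[0].toLowerCase(), start: m.index, end: m.index + m[0].length });
  }
  return ranges;
}

// Keep only meaningful query terms: drop one-character and stop words.
function queryTokens(query: string): string[] {
  return wordRanges(query)
    .map((r) => r.word)
    .filter((w) => w.length > 1 && !STOP_WORDS.has(w));
}

// Rank a single word against a single token: exact > prefix > substring.
function scoreWord(word: string, token: string): number {
  if (word === token) return 100;
  if (word.startsWith(token)) return 70;
  if (word.includes(token)) return 40;
  return 0;
}

// Returns a total score, or null when any meaningful token fails to match.
function matchText(text: string, query: string): number | null {
  const tokens = queryTokens(query);
  if (tokens.length === 0) return null;
  const words = wordRanges(text);
  let total = 0;
  for (const token of tokens) {
    let best = 0;
    for (const w of words) best = Math.max(best, scoreWord(w.word, token));
    if (best === 0) return null; // all-token (AND) matching
    total += best;
  }
  return total;
}
```

With these assumed weights, matchText("authentication failed after OAuth redirect", "auth redirect") scores a prefix hit for "auth" plus an exact hit for "redirect", while a message that lacks "redirect" returns null.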

Out of scope

No new dependencies.
No backend indexing or storage changes.
No semantic embeddings or model-backed search.
No result UI redesign.
No changes to title search behavior.

Testing

Ran

pnpm --filter @openwork/app exec bun test tests/session-search.test.ts
pnpm --filter @openwork/app typecheck

Result

pass:
- tests/session-search.test.ts: 4 passed
- tsc -p tsconfig.json --noEmit: passed
if fail, exact files/errors:
- N/A

CI status

pass:
- Not run locally beyond the targeted test and app typecheck.
code-related failures:
- N/A
external/env/auth blockers:
- N/A

Manual verification

Started the OpenWork dev app locally.
Opened session search with Ctrl+Shift+F.
Verified the search dialog opens and returns matching session results.
Verified transcript matching behavior with targeted unit coverage for loose query terms such as "auth redirect".
Verified the sidebar footer shows a visible Search sessions entry point.
Clicked Search sessions and confirmed it opens the existing search dialog.

Evidence

Screenshot from local desktop search verification:

Risk

Low-to-medium.
Transcript matching behavior changes from exact substring matching to token-based matching.
Search is stricter in one way: all meaningful query tokens must match for transcript results.
Short/common words are ignored, so queries made up mostly of short or common words may match fewer transcripts than before.
Fuzzy matching is intentionally bounded to avoid broad or expensive matches.
The sidebar change only adds a new entry point to an existing dialog.

Rollback

Revert this PR to restore the previous exact lowercase substring transcript search behavior and remove the sidebar search entry point.

vercel · 2026-06-21T14:40:51Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
openwork-landing	Ready	Preview, Comment, Open in v0	Jun 24, 2026 4:19am

vercel · 2026-06-21T14:40:52Z

@VamsiKrishna0101 is attempting to deploy a commit to the Different AI Team on Vercel.

A member of the Team first needs to authorize it.

cubic-dev-ai

1 issue found across 2 files

VamsiKrishna0101 · 2026-06-22T00:39:28Z

@benjaminshafii addressed the Cubic feedback by making transcript search tokenization Unicode-aware and adding regression coverage for a non-ASCII query term.

Also added a visible sidebar entry point for session search so the feature is easier to discover without knowing Ctrl+Shift+F.

Validation:

pnpm --filter @openwork/app exec bun test tests/session-search.test.ts — 4 passed
pnpm --filter @openwork/app typecheck — passed
CI is green

Ready for review when you have a chance.

evanklem · 2026-06-23T03:48:37Z

Pulled this branch and checked it out locally at 82f7846 with 3 commits on base dev.

The 4 tests pass and typecheck is clean, and I confirmed cubic's point is handled: the tokenizer is now Unicode-aware. WORD_PATTERN uses \p{L} and \p{N}.

The core idea is a real improvement. Matching query tokens independently means auth redirect now finds ...redirect...authentication..., which the old indexOf(query) could not.

A few issues before this is mergeable.

1. The snippet highlights the entire span when query terms are far apart

buildTokenSnippet in session-search.ts appears to set match to:

text.slice(first.start, last.end)

That means it returns the whole span from the first matched token to the last matched token, with no cap.

For example:

const text = "deploy " + "lorem ipsum ".repeat(80) + " vercel";

// search "deploy vercel"
// snippet.match becomes the whole span from "deploy" through "vercel"

The result row is truncated (it uses truncate) and renders snippet.match as the highlight in session-search-dialog.tsx:81,85, so the UI shows one long highlight with no useful context, and the second term can be clipped.

A quick test against two sessions, one with words far apart and one with words adjacent, shows this directly.

Capping match to the matched token and putting context in before and after would fix it.
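
One way to read that suggestion, as a hedged sketch: highlight a single matched range and move everything else into before and after, each bounded by a fixed window. The Snippet shape, the 40-character window, and the name buildCappedSnippet are assumptions for illustration; the real buildTokenSnippet may be structured differently.

```typescript
// Illustrative sketch only; the window size and names are assumptions.
type Range = { start: number; end: number };
type Snippet = { before: string; match: string; after: string };

const CONTEXT_CHARS = 40; // assumed context window on each side

function buildCappedSnippet(text: string, best: Range, context: number = CONTEXT_CHARS): Snippet {
  const from = Math.max(0, best.start - context);
  const to = Math.min(text.length, best.end + context);
  return {
    // Ellipses mark where the surrounding text was cut off.
    before: (from > 0 ? "…" : "") + text.slice(from, best.start),
    // The highlight is exactly one matched token, never the span between tokens.
    match: text.slice(best.start, best.end),
    after: text.slice(best.end, to) + (to < text.length ? "…" : ""),
  };
}

// The far-apart example from above: the highlight stays one token long.
const farApart = "deploy " + "lorem ipsum ".repeat(80) + " vercel";
const capped = buildCappedSnippet(farApart, { start: 0, end: 6 });
console.log(capped.match, capped.after.length);
```

The trade-off is that only one query term is highlighted per row; the other terms explain the match through ranking rather than through the highlight.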

2. Fuzzy matching returns sessions it probably should not

scoreWordForToken in session-search.ts:166 allows edit distance 1 for tokens shorter than 7 characters, and 2 for longer tokens, with no minimum score.

That means single-substitution neighbors can match unexpectedly. For example, 2332 can match a session whose only candidate is 2331, and code can match one containing only node.

The real word still wins when present, so the cost is precision, especially for the exact things users search for in this app: IDs and short identifiers.

It also runs against the description's "bounded fuzzy matching." Worth questioning whether this layer earns its place. I disabled just the Levenshtein fallback and 3 of the 4 tests still pass, so the felt improvement appears to be the tokenizer, not fuzzy matching.

I would drop it, or gate it with a score floor and skip numeric or short tokens.
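
If the fuzzy layer were kept, the gate could look something like this sketch. The minimum length of 7, the maximum distance of 1, and the names allowsFuzzy and fuzzyMatches are assumptions for illustration; they are not the thresholds in scoreWordForToken.

```typescript
// Illustrative sketch only; thresholds and names are assumptions.

// Classic Levenshtein distance using two rolling rows.
function editDistance(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

const MIN_FUZZY_LENGTH = 7; // assumed: shorter tokens must match without typos
const MAX_FUZZY_DISTANCE = 1; // assumed: tolerate a single edit

// Skip fuzzy matching for IDs (anything containing a digit) and short tokens.
function allowsFuzzy(token: string): boolean {
  if (/\p{N}/u.test(token)) return false;
  return token.length >= MIN_FUZZY_LENGTH;
}

function fuzzyMatches(word: string, token: string): boolean {
  if (!allowsFuzzy(token)) return false;
  return editDistance(word, token) <= MAX_FUZZY_DISTANCE;
}
```

Under these assumptions, 2332 no longer matches 2331 and code no longer matches node, while a one-letter typo in a long word such as authentication is still tolerated.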

3. The sidebar entry shows the wrong shortcut on macOS

The sidebar button hardcodes (Ctrl+Shift+F) in both aria-label and title in app-sidebar.tsx:770-771.

The binding itself is platform-aware:

isMac ? event.metaKey : event.ctrlKey

So on a Mac, the actual shortcut is Cmd+Shift+F.

The app already has isMacPlatform() for this in utils/index.ts:181, and it is used in composer.tsx:1723. The rest of the wiring looks correct.
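
A small sketch of the label fix, assuming the platform check is passed in as a boolean. The helper names searchShortcutLabel and searchSessionsTitle are hypothetical, and the exact signature of isMacPlatform() is not shown in this thread, so it is not called here.

```typescript
// Illustrative sketch only; helper names are hypothetical.
function searchShortcutLabel(isMac: boolean): string {
  // Mirrors the binding: metaKey on macOS, ctrlKey elsewhere.
  return isMac ? "Cmd+Shift+F" : "Ctrl+Shift+F";
}

// One string that can feed both aria-label and title.
function searchSessionsTitle(isMac: boolean): string {
  return `Search sessions (${searchShortcutLabel(isMac)})`;
}

console.log(searchSessionsTitle(true));
console.log(searchSessionsTitle(false));
```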

Smaller notes

buildSnippet in session-search.ts appears unused. Ripgrep only finds its own definition. Remove it and its export.
The snippet ellipsis was switched to ASCII ... in session-search.ts:96,97,211,213, while the rest of the app uses the Unicode ellipsis in session-surface.tsx:1010,1249. This is just consistency.
wordRanges() re-runs the regex on each keystroke. The cache holds the lowercased text but not the tokenized ranges. Caching those ranges would help keep long-transcript search responsive.
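
For the caching note, a minimal sketch of what storing the tokenized ranges next to the text could look like. The cache keyed by message id, the CacheEntry shape, and the tokenizeCalls counter are assumptions for illustration; the existing cache in session-search.ts may be organized differently.

```typescript
// Illustrative sketch only; cache shape and names are assumptions.
type TokenRange = { word: string; start: number; end: number };
type CacheEntry = { text: string; ranges: TokenRange[] };

const rangeCache = new Map<string, CacheEntry>();
let tokenizeCalls = 0; // demo counter showing how often the regex actually runs

function tokenize(text: string): TokenRange[] {
  tokenizeCalls += 1;
  const pattern = /[\p{L}\p{N}]+/gu;
  const ranges: TokenRange[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    ranges.push({ word: m[0].toLowerCase(), start: m.index, end: m.index + m[0].length });
  }
  return ranges;
}

// Tokenize once per message; re-tokenize only if the message text changed.
function cachedWordRanges(messageId: string, text: string): TokenRange[] {
  const hit = rangeCache.get(messageId);
  if (hit && hit.text === text) return hit.ranges;
  const ranges = tokenize(text);
  rangeCache.set(messageId, { text, ranges });
  return ranges;
}
```

Each keystroke then only re-scores the cached ranges against the new query tokens instead of re-running the regex over every message.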

PR metadata

The PR title and commits do not follow the repo convention:

type(scope): summary

Something like this would match the history:

feat(app): order-independent session search

Pablosinyores

The ranked token matching (exact/prefix/substring/edit-distance) with stop-word filtering reads well, and the tests cover the interesting cases.

One concern in buildTokenSnippet: the match slice is text.slice(first.start, last.end), bounded only by the first and last matched token positions. With the AND matching across a long transcript entry, two tokens can land far apart (token A near the start, token B thousands of chars later), so the rendered match becomes the entire span between them. matchTokenizedQuery penalizes large spans in scoring (proximityBonus = max(0, 120 - span)), but the snippet itself is not capped — a low-scoring distant match still renders a huge highlight. Worth clamping the snippet to a window around the highest-scoring range (or capping last.end - first.start).
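
To make the scoring-versus-rendering gap concrete, here is the quoted bonus worked through for an adjacent and a distant pair. It assumes span means last.end - first.start, which the comment above suggests but does not state outright; the function name proximityBonus is taken from the quoted expression.

```typescript
// Illustrative sketch only; the definition of span is an assumption.
type MatchRange = { start: number; end: number };

function proximityBonus(first: MatchRange, last: MatchRange): number {
  const span = last.end - first.start;
  return Math.max(0, 120 - span);
}

// "deploy vercel" adjacent: span = 13 - 0 = 13, so the bonus is 107.
const adjacent = proximityBonus({ start: 0, end: 6 }, { start: 7, end: 13 });

// Tokens thousands of characters apart: the bonus bottoms out at 0, but the
// entry still matches, so an uncapped snippet would highlight the whole span.
const distant = proximityBonus({ start: 0, end: 6 }, { start: 5000, end: 5006 });

console.log(adjacent, distant);
```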

Minor: matchTokenizedQuery requires every token to match (if (!range) return null). That is stricter than the "match remembered terms" goal — a single typo beyond editDistanceWithin drops the whole entry. If partial matching is intended, scoring on matched-token count rather than all-or-nothing would track the stated behavior more closely.
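
For comparison, scoring on matched-token count could be sketched as below. This is a hypothetical alternative, not behavior in this PR: the name coverageScore and the plain substring test are assumptions chosen to keep the example short.

```typescript
// Illustrative sketch only; not the PR's behavior.
// Returns the fraction of query tokens found in the message's words (0 to 1).
function coverageScore(words: string[], tokens: string[]): number {
  if (tokens.length === 0) return 0;
  const matched = tokens.filter((t) => words.some((w) => w.includes(t))).length;
  return matched / tokens.length;
}

// One of two tokens matches, so the entry is ranked lower instead of dropped.
console.log(coverageScore(["authentication", "failed"], ["auth", "redirect"]));
```

A caller could then require a minimum coverage rather than full coverage, at the cost of noisier results on long messages.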

VamsiKrishna0101 · 2026-06-24T04:23:25Z

Thanks, this was fair feedback. I updated the PR to tighten the search behavior:

changed transcript snippets so widely separated query terms no longer produce one huge highlighted span
removed the broad Levenshtein fallback to avoid false positives for IDs and short identifiers
fixed the sidebar shortcut label to use the platform-specific modifier on macOS
removed the unused buildSnippet
cached token ranges with transcript text so repeated searches do not retokenize every message
updated tests to cover far-apart query terms, Unicode terms, user-message preference, and precision around short identifiers

The multi-token/Unicode matching behavior is still preserved.

VamsiKrishna0101 · 2026-06-24T04:23:45Z

Yep, agreed on the snippet issue. I updated buildTokenSnippet so it no longer highlights the full span between far-apart matched terms. It now builds the snippet around the highest-scoring matched range, keeping the context compact while still showing why the result matched.

On partial matching: I kept transcript matching as all-token matching intentionally for this PR. Session titles already provide looser/fuzzy discovery, while transcript matches can get noisy quickly if long messages match only one remembered term. I’d rather keep this first change precise and revisit partial-token transcript ranking separately if we want broader recall.

Improve session transcript search

25d16ea

cubic-dev-ai Bot reviewed Jun 21, 2026

Comment thread apps/app/src/react-app/domains/session/search/session-search.ts Outdated

Support Unicode terms in session search

476376b

Pablosinyores reviewed Jun 23, 2026

Add visible session search entry point

59899c5

VamsiKrishna0101 force-pushed the improve-session-search branch from 82f7846 to 59899c5 Compare June 24, 2026 04:18

VamsiKrishna0101 wants to merge 3 commits into different-ai:dev from VamsiKrishna0101:improve-session-search
