Improve session transcript search #2330

Open
VamsiKrishna0101 wants to merge 3 commits into different-ai:dev from VamsiKrishna0101:improve-session-search

VamsiKrishna0101 commented Jun 21, 2026

Summary

  • Improves transcript search to match remembered terms instead of only exact phrases.
  • Adds ranked token matching for exact, prefix, fuzzy, and nearby matches.
  • Adds a visible sidebar entry point so users can discover session search without knowing the shortcut.
  • Adds unit coverage for loose matching, typo tolerance, user-message preference, and non-ASCII query terms.

Why

  • Session search previously required the query to appear as an exact lowercase substring in a message.
  • Users usually remember past work by concepts or partial phrases, not exact transcript wording.
  • Search was also difficult to discover because it was primarily exposed through Ctrl+Shift+F and the command palette.
  • This makes searches like auth redirect find transcript text such as authentication failed after OAuth redirect, and makes the feature easier to access from the sidebar.

Issue

  • N/A

Scope

  • Updates the local transcript search matcher in apps/app/src/react-app/domains/session/search/session-search.ts.
  • Tokenizes queries into meaningful terms and ignores short/common stop words.
  • Matches tokens by exact word, prefix, substring, and bounded edit distance for small typos.
  • Supports non-ASCII query terms with Unicode-aware tokenization.
  • Requires all important query tokens to match before returning a transcript result.
  • Scores matches by token strength and proximity.
  • Prefers user-authored messages when multiple transcript messages match.
  • Builds snippets around the matched token cluster.
  • Adds a visible Search sessions action in the sidebar footer.
  • Reuses the existing SessionSearchDialog.
  • Adds apps/app/tests/session-search.test.ts.
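
The matching steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the stop-word list, the score weights (100/70/40), and the function names `wordRanges`, `scoreWord`, `tokenize`, and `matchAllTokens` are assumptions chosen for the example, and proximity scoring is left out for brevity.

```typescript
// Hypothetical sketch of Unicode-aware, all-token transcript matching.
// Weights, stop words, and names are illustrative assumptions.
const WORD_PATTERN = /[\p{L}\p{N}]+/gu;
const STOP_WORDS = new Set(["the", "a", "an", "and", "or", "to", "of", "in"]);

type WordRange = { word: string; start: number; end: number };

// Split text into lowercased words; offsets refer to the lowercased text.
function wordRanges(text: string): WordRange[] {
  const ranges: WordRange[] = [];
  for (const m of text.toLowerCase().matchAll(WORD_PATTERN)) {
    const start = m.index ?? 0;
    ranges.push({ word: m[0], start, end: start + m[0].length });
  }
  return ranges;
}

// Rank a candidate word against one query token: exact > prefix > substring.
function scoreWord(word: string, token: string): number {
  if (word === token) return 100;
  if (word.startsWith(token)) return 70;
  if (word.includes(token)) return 40;
  return 0;
}

// Keep meaningful query terms; drop one-character and stop words.
function tokenize(query: string): string[] {
  return wordRanges(query)
    .map((r) => r.word)
    .filter((w) => w.length > 1 && !STOP_WORDS.has(w));
}

// Every token must match some word; otherwise the message is not a result.
function matchAllTokens(
  text: string,
  query: string,
): { score: number; ranges: WordRange[] } | null {
  const tokens = tokenize(query);
  if (tokens.length === 0) return null;
  const words = wordRanges(text);
  const ranges: WordRange[] = [];
  let score = 0;
  for (const token of tokens) {
    let best: WordRange | null = null;
    let bestScore = 0;
    for (const w of words) {
      const s = scoreWord(w.word, token);
      if (s > bestScore) {
        bestScore = s;
        best = w;
      }
    }
    if (!best) return null;
    ranges.push(best);
    score += bestScore;
  }
  return { score, ranges };
}
```

With these assumed weights, the query auth redirect matches "Authentication failed after OAuth redirect" through a prefix hit on authentication and an exact hit on redirect, while a message containing only one of the two terms is rejected.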

Out of scope

  • No new dependencies.
  • No backend indexing or storage changes.
  • No semantic embeddings or model-backed search.
  • No result UI redesign.
  • No changes to title search behavior.

Testing

Ran

  • pnpm --filter @openwork/app exec bun test tests/session-search.test.ts
  • pnpm --filter @openwork/app typecheck

Result

  • pass:
    • tests/session-search.test.ts: 4 passed
    • tsc -p tsconfig.json --noEmit: passed
  • if fail, exact files/errors:
    • N/A

CI status

  • pass:
    • Not run locally beyond the targeted test and app typecheck.
  • code-related failures:
    • N/A
  • external/env/auth blockers:
    • N/A

Manual verification

  1. Started the OpenWork dev app locally.
  2. Opened session search with Ctrl+Shift+F.
  3. Verified the search dialog opens and returns matching session results.
  4. Verified transcript matching behavior with targeted unit coverage for loose query terms such as auth redirect.
  5. Verified the sidebar footer shows a visible Search sessions entry point.
  6. Clicked Search sessions and confirmed it opens the existing search dialog.

Evidence

  • Screenshot from local desktop search verification:
[Screenshots: "OpenWork - Dev 22-06-2026 06_44_05" and "OpenWork - Dev 22-06-2026 06_43_48"]

Risk

  • Low-to-medium.
  • Transcript matching behavior changes from exact substring matching to token-based matching.
  • Search is stricter in one way: all meaningful query tokens must match for transcript results.
  • Short/common words are ignored, so very short queries may return fewer transcript matches.
  • Fuzzy matching is intentionally bounded to avoid broad or expensive matches.
  • The sidebar change only adds a new entry point to an existing dialog.

Rollback

  • Revert this PR to restore the previous exact lowercase substring transcript search behavior and remove the sidebar search entry point.

vercel Bot commented Jun 21, 2026

The latest updates on your projects.

Project: openwork-landing | Deployment: Ready | Updated (UTC): Jun 24, 2026 4:19am

vercel Bot commented Jun 21, 2026

@VamsiKrishna0101 is attempting to deploy a commit to the Different AI Team on Vercel.

A member of the Team first needs to authorize it.

cubic-dev-ai Bot left a comment

1 issue found across 2 files

Comment thread apps/app/src/react-app/domains/session/search/session-search.ts (outdated)

VamsiKrishna0101 (Author) commented Jun 22, 2026

@benjaminshafii addressed the Cubic feedback by making transcript search tokenization Unicode-aware and adding regression coverage for a non-ASCII query term.

Also added a visible sidebar entry point for session search so the feature is easier to discover without knowing Ctrl+Shift+F.

Validation:

  • pnpm --filter @openwork/app exec bun test tests/session-search.test.ts — 4 passed
  • pnpm --filter @openwork/app typecheck — passed
  • CI is green

Ready for review when you have a chance.

@evanklem

Pulled this branch and checked it out locally at 82f7846 with 3 commits on base dev.

The 4 tests pass and typecheck is clean, and I confirmed cubic's point is handled: the tokenizer is now Unicode-aware. WORD_PATTERN uses \p{L} and \p{N}.

The core idea is a real improvement. Matching query tokens independently means auth redirect now finds ...redirect...authentication..., which the old indexOf(query) could not.

A few issues before this is mergeable.

1. The snippet highlights the entire span when query terms are far apart

buildTokenSnippet in session-search.ts appears to set match to:

text.slice(first.start, last.end)

That means it returns the whole span from the first matched token to the last matched token, with no cap.

For example:

const text = "deploy " + "lorem ipsum ".repeat(80) + " vercel";

// search "deploy vercel"
// snippet.match becomes the whole span from "deploy" through "vercel"

The row is styled with truncate and renders snippet.match as the highlight (session-search-dialog.tsx:81,85), so the UI shows one long highlight with no useful context, and the second term can be clipped.

A quick test against two sessions, one with words far apart and one with words adjacent, shows this directly:

[Screenshot: search results for the two test sessions]

Capping match to the matched token and putting context in before and after would fix it.
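
A minimal sketch of the suggested fix, assuming a snippet shape of before/match/after as described above. The name `buildCappedSnippet`, the `MatchRange` type, and the 40-character context window are hypothetical, not taken from the PR.

```typescript
// Hypothetical sketch: highlight only one matched token and move the
// surrounding text into `before` and `after`, each capped in length.
type MatchRange = { start: number; end: number };
type Snippet = { before: string; match: string; after: string };

const CONTEXT_CHARS = 40; // assumed window size on each side

function buildCappedSnippet(
  text: string,
  range: MatchRange,
  context: number = CONTEXT_CHARS,
): Snippet {
  const from = Math.max(0, range.start - context);
  const to = Math.min(text.length, range.end + context);
  return {
    // Mark truncation with the Unicode ellipsis used elsewhere in the app.
    before: (from > 0 ? "…" : "") + text.slice(from, range.start),
    match: text.slice(range.start, range.end),
    after: text.slice(range.end, to) + (to < text.length ? "…" : ""),
  };
}
```

For the far-apart example above, passing the range of "deploy" yields a six-character highlight with at most 40 characters of trailing context, regardless of how far away "vercel" is.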

2. Fuzzy matching returns sessions it probably should not

scoreWordForToken in session-search.ts:166 allows edit distance 1 for tokens shorter than 7 characters, and 2 for longer tokens, with no minimum score.

That means single-substitution neighbors can match unexpectedly. For example, 2332 can match a session whose only candidate is 2331, and code can match one containing only node.

The real word still wins when present, so the cost is precision, especially for the exact things users search for in this app: IDs and short identifiers.

It also runs against the description's "bounded fuzzy matching." Worth questioning whether this layer earns its place. I disabled just the Levenshtein fallback and 3 of the 4 tests still pass, so the felt improvement appears to be the tokenizer, not fuzzy matching.

I would drop it, or gate it with a score floor and skip numeric or short tokens.
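
If the fuzzy layer were kept, the gating option could look something like the sketch below. The edit-distance limits (1 below 7 characters, 2 otherwise) follow the description above; the minimum fuzzy length of 5, the helper names `allowsFuzzy` and `fuzzyMatches`, and the signature given to `editDistanceWithin` are assumptions for illustration only.

```typescript
// Hypothetical sketch: bounded Levenshtein distance that is skipped
// entirely for short tokens and tokens containing digits.
const MIN_FUZZY_LENGTH = 5; // assumed threshold

// True when the edit distance between a and b is at most `max`.
function editDistanceWithin(a: string, b: string, max: number): boolean {
  if (Math.abs(a.length - b.length) > max) return false;
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    let rowMin = i;
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      const v = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
      curr.push(v);
      if (v < rowMin) rowMin = v;
    }
    // Row minima never decrease, so the final distance cannot recover.
    if (rowMin > max) return false;
    prev = curr;
  }
  return prev[b.length] <= max;
}

// IDs and short identifiers must match exactly.
function allowsFuzzy(token: string): boolean {
  return token.length >= MIN_FUZZY_LENGTH && !/\p{N}/u.test(token);
}

function fuzzyMatches(word: string, token: string): boolean {
  if (!allowsFuzzy(token)) return false;
  const max = token.length < 7 ? 1 : 2;
  return editDistanceWithin(word, token, max);
}
```

Under these assumptions, 2332 no longer matches 2331 and code no longer matches node, while a longer typo such as redirct still matches redirect.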

3. The sidebar entry shows the wrong shortcut on macOS

The sidebar button hardcodes (Ctrl+Shift+F) in both aria-label and title in app-sidebar.tsx:770-771.

The binding itself is platform-aware:

isMac ? event.metaKey : event.ctrlKey

So on a Mac, the actual shortcut is Cmd+Shift+F.

The app already has isMacPlatform() for this in utils/index.ts:181, and it is used in composer.tsx:1723. The rest of the wiring looks correct.
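
A small sketch of a platform-aware label, assuming the caller passes in the result of the existing platform check. The function names `sessionSearchShortcutLabel` and `searchSessionsTitle` are hypothetical.

```typescript
// Hypothetical sketch: derive the displayed shortcut from the platform
// instead of hardcoding Ctrl+Shift+F.
function sessionSearchShortcutLabel(isMac: boolean): string {
  const modifier = isMac ? "Cmd" : "Ctrl";
  return `${modifier}+Shift+F`;
}

// Text that could be shared by both the aria-label and the title.
function searchSessionsTitle(isMac: boolean): string {
  return `Search sessions (${sessionSearchShortcutLabel(isMac)})`;
}
```

Using one helper for both attributes keeps the label in step with the binding's `isMac ? event.metaKey : event.ctrlKey` check.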

Smaller notes

  • buildSnippet in session-search.ts appears unused. Ripgrep only finds its own definition. Remove it and its export.
  • The snippet ellipsis was switched to ASCII ... in session-search.ts:96,97,211,213, while the rest of the app uses the Unicode ellipsis in session-surface.tsx:1010,1249. This is just consistency.
  • wordRanges() re-runs the regex on each keystroke. The cache holds the lowercased text but not the tokenized ranges. Caching those ranges would help keep long-transcript search responsive.
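
The caching note could be addressed along these lines. This is a sketch under assumptions: messages are assumed to have a stable string id, and `createRangeCache`, `TokenRange`, and `Tokenizer` are illustrative names rather than anything in the PR.

```typescript
// Hypothetical sketch: memoize tokenized ranges per message so that
// repeated keystrokes do not re-run the tokenizer regex.
type TokenRange = { word: string; start: number; end: number };
type Tokenizer = (text: string) => TokenRange[];

function createRangeCache(tokenizer: Tokenizer) {
  const cache = new Map<string, { text: string; ranges: TokenRange[] }>();
  return {
    // Return cached ranges unless the message text has changed.
    get(messageId: string, text: string): TokenRange[] {
      const hit = cache.get(messageId);
      if (hit && hit.text === text) return hit.ranges;
      const ranges = tokenizer(text);
      cache.set(messageId, { text, ranges });
      return ranges;
    },
    clear(): void {
      cache.clear();
    },
  };
}
```

Comparing the stored text on lookup means an edited or streamed message is retokenized once, and unchanged messages reuse their ranges across keystrokes.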

PR metadata

The PR title and commits do not follow the repo convention:

type(scope): summary

Something like this would match the history:

feat(app): order-independent session search

Pablosinyores left a comment

The ranked token matching (exact/prefix/substring/edit-distance) with stop-word filtering reads well, and the tests cover the interesting cases.

One concern in buildTokenSnippet: the match slice is text.slice(first.start, last.end), bounded only by the first and last matched token positions. With the AND matching across a long transcript entry, two tokens can land far apart (token A near the start, token B thousands of chars later), so the rendered match becomes the entire span between them. matchTokenizedQuery penalizes large spans in scoring (proximityBonus = max(0, 120 - span)), but the snippet itself is not capped — a low-scoring distant match still renders a huge highlight. Worth clamping the snippet to a window around the highest-scoring range (or capping last.end - first.start).

Minor: matchTokenizedQuery requires every token to match (if (!range) return null). That is stricter than the "match remembered terms" goal — a single typo beyond editDistanceWithin drops the whole entry. If partial matching is intended, scoring on matched-token count rather than all-or-nothing would track the stated behavior more closely.
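
For comparison, scoring on matched-token count rather than all-or-nothing might look like the sketch below. It is illustrative only: `scoreByMatchedTokens` and the coverage ratio are assumptions, and word matching is reduced to a substring test.

```typescript
// Hypothetical sketch: count how many query tokens match instead of
// rejecting the entry when any single token is missing.
function scoreByMatchedTokens(
  words: string[],
  tokens: string[],
): { matched: number; coverage: number } | null {
  if (tokens.length === 0) return null;
  let matched = 0;
  for (const token of tokens) {
    if (words.some((w) => w.includes(token))) matched++;
  }
  // Entries with no matching token are still excluded.
  if (matched === 0) return null;
  return { matched, coverage: matched / tokens.length };
}
```

A ranking could then multiply the per-token score by `coverage`, so entries matching all tokens still sort first while near misses remain visible.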

@VamsiKrishna0101 (Author), in reply to @evanklem's review above:

Thanks, this was fair feedback. I updated the PR to tighten the search behavior:

  • changed transcript snippets so widely separated query terms no longer produce one huge highlighted span
  • removed the broad Levenshtein fallback to avoid false positives for IDs and short identifiers
  • fixed the sidebar shortcut label to use the platform-specific modifier on macOS
  • removed the unused buildSnippet
  • cached token ranges with transcript text so repeated searches do not retokenize every message
  • updated tests to cover far-apart query terms, Unicode terms, user-message preference, and precision around short identifiers

The multi-token/Unicode matching behavior is still preserved.

@VamsiKrishna0101 (Author), in reply to @Pablosinyores's review above:

Yep, agreed on the snippet issue. I updated buildTokenSnippet so it no longer highlights the full span between far-apart matched terms. It now builds the snippet around the highest-scoring matched range, keeping the context compact while still showing why the result matched.

On partial matching: I kept transcript matching as all-token matching intentionally for this PR. Session titles already provide looser/fuzzy discovery, while transcript matches can get noisy quickly if long messages match only one remembered term. I’d rather keep this first change precise and revisit partial-token transcript ranking separately if we want broader recall.
