PIGS-809: Adopt regex query contract for search_text_in_pdf #82
The platform's extract-text-bounding-boxes method now takes a `queries`
array (each query is literal text or a regex) instead of a flat `texts`
array, and returns results under `textBoxes` as per-query results with
`query`/`matches` rather than flat text boxes.
- platformHandler.extractTextBoundingBoxes now builds `queries` from the
texts plus optional isRegex/regexFlags and sends `{ queries }`.
- search_text_in_pdf exposes optional isRegex + regexFlags (ignore-case,
multiline, dot-all), rejects flags without isRegex, and computes match
counts from the new query-results shape.
- Tests updated for the new request/response contract.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the flat texts[] + shared isRegex/regexFlags inputs with a
queries[] of {text, isRegex?, regexFlags?} objects, matching the backend
contract one-to-one. This lets a single call mix literal and regex
queries and assign distinct flags per query, rather than applying one
regex mode uniformly to every term.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
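As a rough illustration of the per-query shape described in these commits, the sketch below builds a `{ queries }` request body and leaves `isRegex`/`regexFlags` off the wire unless they were set. The names `TextQuery`, `RegexFlag`, and `buildQueriesPayload` are hypothetical and not taken from the PR; the actual handler may be structured differently.

```typescript
// Hypothetical types mirroring the query shape named in the commit message.
type RegexFlag = "ignore-case" | "multiline" | "dot-all";

interface TextQuery {
  text: string;
  isRegex?: boolean | undefined;
  regexFlags?: RegexFlag[] | undefined;
}

// Build the `{ queries }` body, omitting optional fields that were not provided.
function buildQueriesPayload(queries: TextQuery[]): { queries: Record<string, unknown>[] } {
  return {
    queries: queries.map((q) => {
      const wire: Record<string, unknown> = { text: q.text };
      if (q.isRegex !== undefined) wire["isRegex"] = q.isRegex;
      if (q.regexFlags !== undefined) wire["regexFlags"] = q.regexFlags;
      return wire;
    }),
  };
}

// One call mixing a literal query with a regex query that carries its own flags.
const examplePayload = buildQueriesPayload([
  { text: "Invoice" },
  { text: "INV-\\d+", isRegex: true, regexFlags: ["ignore-case"] },
]);
console.log(JSON.stringify(examplePayload));
```

The literal query serializes with only `text`, so a caller that never uses regex sends the smallest possible body.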
CI node-check fails at `npm audit --audit-level=moderate` due to a high-severity advisory (GHSA-gv7w-rqvm-qjhr / GHSA-g7r4-m6w7-qqqr) covering esbuild 0.17.0-0.28.0. Bump esbuild to ^0.28.1 and tsx to ^4.22.4 so its nested esbuild dedupes to the patched top-level copy.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The previous in-place install left the lockfile missing some @emnapi/* optional deps, breaking `npm ci` on Linux CI. Regenerate cleanly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pull request overview
Updates the Node/TypeScript search_text_in_pdf tool and PlatformHandler.extractTextBoundingBoxes wrapper to align with the platform’s new extract-text-bounding-boxes contract that supports per-query regex searches (PIGS-809).
Changes:
- Replaces `texts: string[]` with `queries: { text, isRegex?, regexFlags? }[]` on both the tool input schema and platform request payload.
- Updates result parsing to use the new per-query `textBoxes[].matches[]` response shape and recomputes summary counts.
- Adjusts and expands Vitest coverage for the new request/response contract and regex behavior; bumps a couple of dev dependencies.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| node-version/src/tools/extractions.ts | Updates search_text_in_pdf tool schema + logic to send per-query literal/regex searches and parse new per-query results. |
| node-version/src/handlers/platformHandler.ts | Updates extractTextBoundingBoxes to send { queries } payload and optionally include regex fields. |
| node-version/tests/extractions.test.ts | Updates search_text_in_pdf tests for new input/output shape and adds regex-related scenarios. |
| node-version/tests/platformHandler.test.ts | Updates handler tests to assert new { queries } request payload, including regex flags. |
| node-version/package.json | Bumps esbuild and tsx dev dependency versions. |
| node-version/package-lock.json | Lockfile refresh reflecting dependency updates/hoisting. |
Files not reviewed (1)
- node-version/package-lock.json: Generated file
    async extractTextBoundingBoxes(
      fileBytes: Buffer,
      queries: { text: string; isRegex?: boolean | undefined; regexFlags?: string[] | undefined }[],
    ): Promise<Buffer> {
Declining this one: the `| undefined` annotations are intentional. This codebase compiles with `exactOptionalPropertyTypes: true`, and the caller passes a zod-inferred value typed as `{ isRegex?: boolean | undefined; regexFlags?: string[] | undefined }`. Removing `| undefined` from the target makes that argument non-assignable and fails `tsc` (TS2345). Keeping the explicit `| undefined` is what allows the optional properties to line up.
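A minimal sketch of the assignability issue described in this reply, using hypothetical type and function names (`InferredQuery`, `StrictQuery`, `acceptsInferred`, `acceptsStrict`) rather than the project's own. It is written so that it compiles with or without `exactOptionalPropertyTypes`; the commented-out call is the one expected to fail when the option is enabled.

```typescript
// Shape the reply attributes to the zod-inferred value: the key may be absent OR explicitly undefined.
type InferredQuery = { text: string; isRegex?: boolean | undefined };

// Target without `| undefined`: under exactOptionalPropertyTypes the key may be
// absent, but may not be present with the value undefined.
type StrictQuery = { text: string; isRegex?: boolean };

function acceptsInferred(queries: InferredQuery[]): number {
  return queries.length;
}

function acceptsStrict(queries: StrictQuery[]): number {
  return queries.length;
}

const fromSchema: InferredQuery[] = [{ text: "total", isRegex: undefined }];

// Parameter type matches the inferred type, so this compiles in either mode.
console.log(acceptsInferred(fromSchema));

// A value that simply omits the key satisfies the strict target in either mode.
console.log(acceptsStrict([{ text: "total" }]));

// With exactOptionalPropertyTypes: true, the next line is expected to be rejected (TS2345),
// because InferredQuery permits an explicit undefined that StrictQuery does not:
// acceptsStrict(fromSchema);
```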
    "node_modules/glob": {
      "version": "7.2.3",
      "resolved": "https://registry.npmjs.org/glob/-/glob-7.2.3.tgz",
      "integrity": "sha512-nFR0zLpU2YCaRxwoCJvL6UvCH2JFyFVIvwTLsIf21AuHlMskA1hhTdk+LlYJtOlYt9v6dvszD2BGRqBL+iQK9Q==",
      "deprecated": "Old versions of glob are not supported, and contain widely publicized security vulnerabilities, which have been fixed in the current version. Please update. Support for old versions may be purchased (at exorbitant rates) by contacting i@izs.me",
      "license": "ISC",
Declining within the scope of this PR: glob@7.2.3 is not introduced or hoisted by the dependency changes here. It is a long-standing transitive dependency of exceljs (via archiver-utils / rimraf / zip-stream) and is already present on main prior to this branch. The esbuild/tsx bump in this PR does not touch that chain. Replacing it would mean forcing an override on a deep transitive of a production dependency, which is unrelated to the PIGS-809 regex change and better handled separately (e.g. via Dependabot on main).
Address Copilot review: validation previously allowed `{ regexFlags: [] }`
to pass without isRegex. Reject any provided regexFlags regardless of
length, and cover the empty-array case with a test.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
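The tightened check can be sketched as follows. This is an illustration under assumptions, not the PR's code: `UserFacingError` here is a local stand-in for the project's class, `assertFlagsRequireRegex` is a hypothetical name, and treating `isRegex: false` the same as a missing `isRegex` is an assumption.

```typescript
// Local stand-in for the project's UserFacingError; the real class may differ.
class UserFacingError extends Error {}

interface TextQuery {
  text: string;
  isRegex?: boolean | undefined;
  regexFlags?: string[] | undefined;
}

function assertFlagsRequireRegex(queries: TextQuery[]): void {
  for (const q of queries) {
    // Test for presence rather than length, so `regexFlags: []` is rejected as well.
    if (q.regexFlags !== undefined && q.isRegex !== true) {
      throw new UserFacingError(`regexFlags requires isRegex: true (query "${q.text}")`);
    }
  }
}

// Accepted: a literal query without flags, and a regex query with flags.
assertFlagsRequireRegex([
  { text: "Total" },
  { text: "total\\s+due", isRegex: true, regexFlags: ["ignore-case"] },
]);
```

A length-based check such as `regexFlags.length > 0` is what would let the empty-array case slip through, which is the gap this commit closes.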
Summary
Updates the `search_text_in_pdf` tool to match the platform's new `extract-text-bounding-boxes` contract (PIGS-809), which adds regular-expression search support.

The backend now:
- takes a `queries` array (each query is literal text or a regex) instead of a flat `texts` array, and
- returns `textBoxes` as per-query results (`query` + `matches[]` with `matchedText` / `boxes[]` / `groups[]`) instead of flat text boxes.

The old tool sent `{ texts: [...] }` and parsed `textBoxes[].text`, so it would break against the updated platform on both the request and response sides.

Changes
- `platformHandler.extractTextBoundingBoxes` now accepts an array of query objects (`{ text, isRegex?, regexFlags? }`) and sends `{ queries }`, omitting `isRegex`/`regexFlags` on the wire unless set.
- `search_text_in_pdf` input is now a `queries[]` of `{ text, isRegex?, regexFlags? }` objects (rather than `texts[]` + shared flags), giving full backend parity: a single call can mix literal and regex queries and assign distinct flags per query. Allowed `regexFlags`: `ignore-case`, `multiline`, `dot-all`. Flags without `isRegex` are rejected with a `UserFacingError`.
- Summary counts (`totalMatches`, `uniqueTextsFound`) are computed from the new per-query `query`/`matches` shape.

Only `node-version/` is touched (the root Python code is legacy per `CLAUDE.md`).

Testing
- `task n:check`: format, types, lint all pass
- `task n:test`: 164 passed, 1 skipped

🤖 Generated with Claude Code
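For readers unfamiliar with the new response shape, the summary counts named in the description could be derived along these lines. This is only a sketch: the interfaces are inferred from the field names in the description, the element types of `boxes` and `groups` and the type of `query` are left as `unknown`, and the definition of `uniqueTextsFound` as the number of distinct `matchedText` values is an assumption, since the PR does not spell it out.

```typescript
// Hypothetical response types based on the field names listed in the description.
interface Match {
  matchedText: string;
  boxes: unknown[];
  groups: unknown[];
}

interface QueryResult {
  query: unknown; // shape of the echoed query is not specified in the PR text
  matches: Match[];
}

function summarize(textBoxes: QueryResult[]): { totalMatches: number; uniqueTextsFound: number } {
  let totalMatches = 0;
  const distinctTexts = new Set<string>();
  for (const result of textBoxes) {
    totalMatches += result.matches.length;
    for (const match of result.matches) {
      // Assumption: "unique texts" means distinct matched strings across all queries.
      distinctTexts.add(match.matchedText);
    }
  }
  return { totalMatches, uniqueTextsFound: distinctTexts.size };
}

const exampleSummary = summarize([
  { query: "INV-\\d+", matches: [{ matchedText: "INV-1", boxes: [], groups: [] }] },
  { query: "Total", matches: [] },
]);
console.log(exampleSummary);
```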