khawjaahmad/dupscout

DupScout — browser embeddings & reranking

Semantic duplicate-bug detection built on two Qwen3 0.6B models loaded via Transformers.js v4. The models and the corpus load into the browser tab and run on WebGPU, so queries are processed entirely on the user's device:

  1. Recall: onnx-community/Qwen3-Embedding-0.6B-ONNX (feature-extraction, last_token pooling, normalized). Cosine similarity → top candidates.
  2. Precision: onnx-community/Qwen3-Reranker-0.6B-ONNX (CausalLM). For each candidate it builds a yes/no judge prompt, runs one forward pass, and scores P(yes) = softmax([yes_logit, no_logit])[yes]. A threshold on that score decides duplicate vs new.
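
The arithmetic behind the two stages is small enough to sketch directly. This is an illustration only, not the code in src/lib/worker.js: `dot`, `topKByCosine`, and `pYes` are hypothetical names, and the sketch assumes every embedding is already L2-normalized, so cosine similarity reduces to a dot product.

```javascript
// Stage 1 (recall): rank corpus vectors by cosine similarity to the query.
// Assumption: all vectors are L2-normalized, so cosine == dot product.
function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

function topKByCosine(queryVec, corpusVecs, k) {
  return corpusVecs
    .map((vec, index) => ({ index, score: dot(queryVec, vec) }))
    .sort((x, y) => y.score - x.score) // highest similarity first
    .slice(0, k);
}

// Stage 2 (precision): two-way softmax over the "yes" and "no" logits.
// Subtracting the larger logit first keeps Math.exp from overflowing.
function pYes(yesLogit, noLogit) {
  const m = Math.max(yesLogit, noLogit);
  const expYes = Math.exp(yesLogit - m);
  const expNo = Math.exp(noLogit - m);
  return expYes / (expYes + expNo);
}
```

With equal logits `pYes` returns 0.5, which is also the default duplicate threshold.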

Everything runs in a Web Worker so the UI stays responsive during load and inference.

Why this approach

  • Static deployment. Ships as static files, so hosting it is as simple as serving a folder.
  • Local processing. The corpus and each query are handled in the browser tab, which suits cases where reports should not leave the device.
  • Transparent pipeline. Recall and rerank are shown side by side, so each stage's contribution to the result is visible.

Run locally

npm install
npm run dev            # http://localhost:5173
npm run build          # production build → dist/
npm run preview        # serve the production build locally

Install note: the repo ships an .npmrc with ignore-scripts=true. @huggingface/transformers pulls in onnxruntime-node, whose postinstall downloads a native binary that a browser build doesn't need — skipping scripts avoids that. The browser runtime is onnxruntime-web, which is bundled at build time regardless.

Configuration

Everything tunable lives in plain constants:

| What | Where | Default |
| --- | --- | --- |
| Duplicate threshold | src/App.jsx (useState(0.5)); also adjustable live via the UI slider | 0.5 |
| Candidates sent to the reranker | src/constants.js → CANDIDATE_COUNT | 8 |
| Embedding dtype | src/hooks/useDuplicateScanner.js → embedDtype | fp16 (q4f16 on mobile) |
| Reranker dtype | src/hooks/useDuplicateScanner.js → rerankDtype | q4 |
| Model IDs | src/lib/worker.js → EMBED_ID / RERANK_ID | Qwen3 0.6B ONNX (embedding + reranker) |
| Embed batch size / rerank token cap | src/lib/worker.js → EMBED_BATCH, RERANK_MAX_TOKENS | 8 / 2048 |
| Retrieval/judge prompts | src/lib/worker.js → EMBED_TASK, RERANK_INSTRUCTION | bug-report specific |
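
To make the batch-size setting concrete, here is a rough sketch of embedding a corpus in groups of EMBED_BATCH. It assumes `extractor` is a Transformers.js feature-extraction pipeline that returns a tensor with a `tolist()` method. `chunk` and `embedInBatches` are hypothetical names and do not come from src/lib/worker.js.

```javascript
// Split an array into consecutive groups of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const groups = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

// Embed texts batch by batch so a single call never holds the whole corpus.
// Assumption: extractor(texts, options) resolves to a tensor whose tolist()
// yields one vector per input text.
async function embedInBatches(extractor, texts, batchSize = 8) {
  const vectors = [];
  for (const batch of chunk(texts, batchSize)) {
    const tensor = await extractor(batch, {
      pooling: "last_token",
      normalize: true,
    });
    vectors.push(...tensor.tolist());
  }
  return vectors;
}
```

At the default batch size of 8, a 180-report corpus is embedded in 23 calls, the last of which holds 4 reports.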

Using your own corpus: replace the BUGS array in src/data/bugs.js. Each entry needs { id, title, text } — that's the only contract. SAMPLE_QUERIES (same file) feeds the sample chips in the UI.
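
As an illustration of that contract, a replacement corpus could look like the sketch below. The entries are invented for the example, and `validateCorpus` is a hypothetical helper that is not part of the repo. The sketch also assumes ids may be numbers or strings, and it omits the export that src/data/bugs.js would need.

```javascript
// Made-up entries that follow the { id, title, text } contract.
const BUGS = [
  {
    id: 101,
    title: "Save button does nothing on first click",
    text: "After opening the settings page, the first click on Save is ignored.",
  },
  {
    id: 102,
    title: "Settings are not persisted until the second save",
    text: "Changes only stick when Save is pressed twice in a row.",
  },
];

// Hypothetical check: returns a list of human-readable problems,
// or an empty list when every entry satisfies the contract.
function validateCorpus(bugs) {
  const problems = [];
  const seenIds = new Set();
  bugs.forEach((bug, i) => {
    for (const key of ["id", "title", "text"]) {
      const value = bug[key];
      if (value === undefined || value === null || value === "") {
        problems.push(`entry ${i}: missing ${key}`);
      }
    }
    if (bug.id !== undefined && bug.id !== null) {
      if (seenIds.has(bug.id)) problems.push(`entry ${i}: duplicate id ${bug.id}`);
      seenIds.add(bug.id);
    }
  });
  return problems;
}
```

Running such a check before loading the models can surface a malformed entry early, before the download starts.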

Deploy to Vercel (static)

  1. Push to a Git repo, import in Vercel — framework auto-detects as Vite.
  2. If the build fails on the onnxruntime-node postinstall, set Settings → General → Install Command to npm install --ignore-scripts.
  3. Deploy. Output is a static site; model weights stream from the Hugging Face CDN on first load and are cached by the browser afterwards.

The build output is a plain static bundle that any static host can serve as-is.

The one real tradeoff: first-load size

Weights download once (then cached). Defaults (set in src/hooks/useDuplicateScanner.js):

  • embedding: fp16 (q4f16 in mobile lite mode)
  • reranker: q4 — measured ~33× faster than q8 on WebGPU (optimized MatMulNBits shaders) with near-identical scores; fp16/fp32 don't exist for this model (they need external data files unsupported by ORT Web)

Combined first load is roughly 1 GB. For a screen-recorded walkthrough this is a one-time cost on your machine and a non-issue. For a public link, every new visitor downloads it. That is fine for a portfolio piece, but say so on the page or expect bounces.

Browser support

WebGPU is required (Chrome/Edge desktop, Android Chrome with WebGPU, recent Safari Tech Preview). Browsers without WebGPU see an unsupported notice. Mobile devices get a lite mode (embedding-only recall) to stay within memory limits.
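
A minimal sketch of how that decision could be made is shown below. It relies on the standard `navigator.gpu.requestAdapter()` call. The helper names and the shape of the returned object are assumptions for illustration and are not taken from src/lib/environment.js.

```javascript
// Synchronous feature check: supporting browsers expose navigator.gpu.
function hasWebGPU(nav) {
  return typeof nav === "object" && nav !== null && Boolean(nav.gpu);
}

// requestAdapter() resolves to null when no usable adapter exists,
// so the presence of navigator.gpu alone is not sufficient.
async function canUseWebGPU(nav) {
  if (!hasWebGPU(nav)) return false;
  try {
    const adapter = await nav.gpu.requestAdapter();
    return Boolean(adapter);
  } catch {
    return false;
  }
}

// Hypothetical mode selection following the behaviour described above:
// no WebGPU -> unsupported notice, mobile -> embedding-only lite mode.
function pickMode({ webgpu, isMobile }) {
  if (!webgpu) return { mode: "unsupported" };
  if (isMobile) return { mode: "lite", embedDtype: "q4f16", rerankDtype: null };
  return { mode: "full", embedDtype: "fp16", rerankDtype: "q4" };
}
```

In a browser this would be called as `pickMode({ webgpu: await canUseWebGPU(navigator), isMobile })`, where the source of `isMobile` is left open.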

Walkthrough (≈40s)

The corpus is a frozen snapshot of 180 real facebook/react issues containing genuine duplicate clusters filed by different users.

  1. Load models — progress bar fills, "ready · 180 reports" appears.
  2. Sample 1 (screen reader doesn't announce emoji-picker buttons) → scan. Recall surfaces the accessibility neighbourhood; the reranker lifts the true duplicate pair (#36421/#36422) to the top with CANDIDATE badges (P(yes) ≈ 1.0).
  3. Sample 2 (SSR "text content does not match" — never says "hydration") → scan. The embedding bridges the wording gap to the hydration-mismatch cluster; the reranker confirms #36241.
  4. Sample 4 (near-miss: date-input locale bug, no true duplicate exists) → scan. Recall still returns 8 candidates, but every reranker score stays near 0.0 → "NO STRONG DUPLICATE". That contrast is the story: embeddings find the neighbourhood, the reranker makes the call.

Sample 3 (eslint-plugin-react-hooks 7.0.1 resolution failure) finds its true duplicate (#35045) at #1, and also illustrates the model's known limit: same-package cousins score above threshold too — which is why flagged rows are labeled CANDIDATE rather than DUP.
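
The flagging rule in the walkthrough can be sketched as a small pure function. `judge` is a hypothetical name, and comparing with `>=` rather than `>` is an assumption. The label strings follow the UI text quoted above.

```javascript
// Sort reranked candidates by score, flag rows at or above the threshold,
// and report a banner only when nothing clears it.
function judge(candidates, threshold = 0.5) {
  const rows = [...candidates]
    .sort((a, b) => b.score - a.score)
    .map((c) => ({ ...c, flag: c.score >= threshold ? "CANDIDATE" : null }));
  const flagged = rows.filter((r) => r.flag !== null);
  return {
    rows,
    top: flagged.length > 0 ? flagged[0] : null,
    banner: flagged.length > 0 ? null : "NO STRONG DUPLICATE",
  };
}
```

Because every row is judged independently against the threshold, several related reports can be flagged at once, which is the behaviour Sample 3 shows.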

File structure

index.html                         page shell, loads fonts + src/main.jsx
vite.config.js                     Vite + React, excludes transformers from pre-bundling
.npmrc                             ignore-scripts=true (skips onnxruntime-node postinstall)
src/main.jsx                       React entry point
src/App.jsx                        thin shell — wires the hook to the views
src/constants.js                   PHASE states + candidate count
src/hooks/useDuplicateScanner.js   worker lifecycle, load/scan actions, all app state
src/components/Header.jsx          brand + status pill
src/components/BootScreen.jsx      WebGPU check + load button
src/components/LoadingView.jsx     download progress
src/components/QueryPanel.jsx      bug-report input + sample chips
src/components/Pipeline.jsx        three-stage progress strip
src/components/Results.jsx         verdict banner + recall/precision columns
src/lib/worker.js                  loads both models, runs embed → cosine → rerank off-thread
src/lib/environment.js             WebGPU + mobile detection
src/data/bugs.js                   frozen snapshot of 180 real facebook/react issues
src/styles.css                     Spark design system (dark) — violet accent, Plus Jakarta Sans

Verified

The two-stage pipeline has been run end-to-end against both real models (embedding fp16 + reranker q4, WebGPU) on the 180-issue corpus:

  • Samples 1–3 each surface their true duplicate at #1 of PRECISION with P(yes) 0.999–1.000; whole scan completes in ~2 s (embedding ~80 ms + rerank ~1.7 s for 8 candidates).
  • Sample 4 (near-miss, no true duplicate) stays "NO STRONG DUPLICATE" — every reranker score ≤ 0.016.
  • Known limitation: when all candidates share a package/component (sample 3), the 0.6B reranker scores related-but-different defects above threshold as well. The top-1 match has been correct in every test run; per-row flags are candidates, not verdicts.
