An offline-first Progressive Web App that answers questions exclusively from the World Programme for the Census of Agriculture 2030 (WCA 2030) guidelines. Every answer is verbatim extracted text from the guidelines, accompanied by a precise section title and page number.
The WCA 2030 Explorer is a retrieval tool — not a generative AI. It enforces the following hard constraints:
- Answers are extracted text only. The retrieved chunk is the answer; no paraphrasing or generation occurs.
- No external API calls at runtime. After the first load, the app works with zero internet access. All model inference runs in-browser via WebAssembly.
- Guardrail is mandatory. When no chunk exceeds the confidence threshold and keyword fallback also fails, the app returns: "This question could not be answered from the WCA 2030 guidelines. Sections searched: [list]."
- Every answer cites the source chunk's section title and page number.
- No generative model at runtime. The embedding model (all-MiniLM-L6-v2) is used only to encode queries; it never generates text.
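The constraints above can be pictured as a result type with only two outcomes: a verbatim chunk with its citation, or the fixed guardrail message. The following is a minimal sketch, not the app's actual code; the `Chunk` and `Result` shapes, the function names, the `p.` citation format, and the comma-separated section list are all assumptions made for illustration.

```typescript
// Hypothetical shapes; the field names are assumptions, not the app's real types.
interface Chunk {
  text: string;         // verbatim text extracted from the guidelines
  sectionTitle: string; // section the chunk belongs to
  page: number;         // page number in the source PDF
}

type Result =
  | { kind: "answer"; text: string; citation: string }
  | { kind: "not-found"; message: string };

// An answer is the chunk text itself plus its citation; nothing is generated.
function toAnswer(chunk: Chunk): Result {
  return {
    kind: "answer",
    text: chunk.text,
    citation: `${chunk.sectionTitle}, p. ${chunk.page}`, // assumed citation format
  };
}

// The guardrail message lists each searched section once, in first-seen order.
function toNotFound(sectionsSearched: string[]): Result {
  const unique = Array.from(new Set(sectionsSearched));
  return {
    kind: "not-found",
    message:
      "This question could not be answered from the WCA 2030 guidelines. " +
      `Sections searched: ${unique.join(", ")}.`,
  };
}
```

Because `Result` has no variant that carries generated text, a caller can only ever display a chunk as stored or the not-found message.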
The app is intended for FAO staff and national census bureaux who need authoritative, citable answers from the WCA 2030 methodology document without network access in the field.
- Node.js ≥ 18 (for native fetch, structuredClone, and WASM support)
- npm ≥ 9
- The WCA 2030 source PDF at ./source/Census-2030_EN-DTP-9.pdf
Windows users: run all commands in Git Bash or PowerShell. Do not use cmd.exe: the npx tsx calls require a POSIX-compatible shell or PowerShell for proper path handling.
# Extract text from the PDF, chunk it, and embed every chunk
npm run build-index
This runs scripts/chunk.ts (PDF extraction + chunking) and then scripts/embed.ts (downloads Xenova/all-MiniLM-L6-v2 on first run and embeds ~876 chunks).
Outputs (the final artifacts are written to the canonical source locations under public/):
- src/data/chunks-raw.json: intermediate raw chunks (no embeddings; gitignored)
- public/data/chunks.json: chunks with 384-dimensional embeddings (~10 MB)
- public/models/: ONNX model weights + WASM runtime files
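After a rebuild it can be useful to sanity-check the generated index before committing it. The sketch below assumes each record carries an `id`, a `text`, and an `embedding` array; those field names and the `validateChunks` helper are hypothetical, and only the 384-dimension figure comes from the text above.

```typescript
// Hypothetical record shape; the real chunks.json fields may differ.
interface EmbeddedChunk {
  id: string;
  text: string;
  embedding: number[];
}

const EXPECTED_DIM = 384; // embedding width stated above

// Returns human-readable problems; an empty list means the index looks sane.
function validateChunks(chunks: EmbeddedChunk[], dim: number = EXPECTED_DIM): string[] {
  const problems: string[] = [];
  for (const chunk of chunks) {
    if (chunk.text.trim().length === 0) {
      problems.push(`${chunk.id}: empty text`);
    }
    if (chunk.embedding.length !== dim) {
      problems.push(`${chunk.id}: expected ${dim} dimensions, got ${chunk.embedding.length}`);
    } else if (!chunk.embedding.every((v) => Number.isFinite(v))) {
      problems.push(`${chunk.id}: embedding contains a non-finite value`);
    }
  }
  return problems;
}
```

A script could load public/data/chunks.json, pass the parsed array to `validateChunks`, and fail the build when the returned list is non-empty.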
The model download (~23 MB) requires internet access on the first run only. Subsequent runs use the local .cache/ directory and are fully offline.
No manual copy step is needed. public/data/ is the canonical source;
npm run build (Step 2) copies everything from public/ into docs/ automatically.
npm run build
Compiles TypeScript, bundles the app, and generates:
- docs/: the production bundle
- docs/sw.js: the Workbox service worker with a 16-entry pre-cache manifest (~71 MB total)
npm run preview
# → open http://localhost:4173
On the first visit the service worker installs and pre-caches all 16 assets (JS bundle, CSS, chunks.json, ONNX model, WASM runtime files, icons). After that the app is fully offline-capable.
npm run dev
# → http://localhost:5173 (no service worker; hot-module reload active)
npm test
# 20 tests across 3 files: chunking, retrieval, guardrail
The confidence threshold controls how selective the semantic search is before falling back to keyword search or returning a not-found response.
Default value: 0.42
Live tuning via DevTools (no rebuild needed):
// Lower the threshold to surface more borderline results
localStorage.setItem('wca_threshold', '0.38')
// Raise it to be more selective
localStorage.setItem('wca_threshold', '0.60')
// Restore default
localStorage.removeItem('wca_threshold')
Then reload the page; the new threshold takes effect on the next query.
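One way the app might read this override is sketched below. The `wca_threshold` key and the 0.42 default come from the text above; the `readThreshold` function, the `StorageLike` interface, and the decision to ignore non-numeric or out-of-range values are assumptions, not the app's confirmed behaviour.

```typescript
const DEFAULT_THRESHOLD = 0.42;        // default stated above
const THRESHOLD_KEY = "wca_threshold"; // key used in the DevTools snippet

// Minimal subset of the Web Storage API, so the function also runs outside a browser.
interface StorageLike {
  getItem(key: string): string | null;
}

// Assumption: non-numeric or out-of-range overrides fall back to the default.
function readThreshold(storage: StorageLike): number {
  const raw = storage.getItem(THRESHOLD_KEY);
  if (raw === null) return DEFAULT_THRESHOLD;
  const value = Number.parseFloat(raw);
  const valid = Number.isFinite(value) && value >= 0 && value <= 1;
  return valid ? value : DEFAULT_THRESHOLD;
}
```

In the browser this would be called as `readThreshold(localStorage)`; taking the storage object as a parameter keeps the function testable with a stub.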
- Semantic search — if the top cosine-similarity score (with 1.15× boost for high-priority chunks) meets the threshold, those results are returned.
- Lexical fallback — if semantic fails, MiniSearch BM25 search runs. Only results scoring ≥ 8 BM25 points are accepted (prevents false positives from incidental word overlap on off-topic queries).
- Not-found response — if both fail, the guardrail card is shown with a deduplicated list of sections searched.
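The three tiers above can be sketched as a single function over pre-scored candidates. This is an illustration under stated assumptions, not the code in src/engine/retrieval.ts: the `ScoredChunk` and `Outcome` types, the `resolve` name, returning the whole boosted list on a semantic hit, and treating the candidates' sections as the "sections searched" are all hypothetical. Only the 1.15 boost and the minimum BM25 score of 8 come from the text.

```typescript
const HIGH_PRIORITY_BOOST = 1.15; // boost stated above
const MIN_LEXICAL_SCORE = 8;      // minimum BM25 score stated above

interface ScoredChunk {
  section: string;
  score: number;          // raw cosine similarity or BM25 score
  highPriority?: boolean; // only meaningful for semantic candidates
}

type Outcome =
  | { tier: "semantic" | "lexical"; results: ScoredChunk[] }
  | { tier: "not-found"; sectionsSearched: string[] };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

function resolve(semantic: ScoredChunk[], lexical: ScoredChunk[], threshold: number): Outcome {
  // Tier 1: apply the boost, then require the best boosted score to meet the threshold.
  const boosted = semantic
    .map((c) => ({ ...c, score: c.highPriority ? c.score * HIGH_PRIORITY_BOOST : c.score }))
    .sort((x, y) => y.score - x.score);
  if (boosted.length > 0 && boosted[0].score >= threshold) {
    return { tier: "semantic", results: boosted };
  }

  // Tier 2: keep only lexical hits at or above the minimum BM25 score.
  const accepted = lexical.filter((c) => c.score >= MIN_LEXICAL_SCORE);
  if (accepted.length > 0) {
    return { tier: "lexical", results: accepted };
  }

  // Tier 3: nothing passed, so report each candidate section once.
  const sections = Array.from(new Set([...semantic, ...lexical].map((c) => c.section)));
  return { tier: "not-found", sectionsSearched: sections };
}
```

In this sketch a semantic candidate with a raw score of 0.40 passes the default threshold only if it is high-priority, since 0.40 × 1.15 = 0.46.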
Note: raising the threshold for an on-topic query (e.g. '0.99' for "agricultural holding") will cause the semantic pass to fail, but the lexical fallback will still answer it, since "agricultural" and "holding" score ≫ 8 in BM25. To force the not-found card, the query must also fail the keyword search; for example, "What is the capital of France?" fails both at any threshold.
When a new edition of the WCA guidelines is released:
- Replace ./source/Census-2030_EN-DTP-9.pdf with the new PDF. Update every filename reference in scripts/chunk.ts if it differs.
- Verify the high-priority page ranges in scripts/chunk.ts still match the new document's chapter layout, and update the STATIC_HIGH_RANGES array if needed.
- Re-run the full pipeline:
  npm run build-index
  This writes the updated public/data/chunks.json directly; no copy step is needed.
- Bump the version field in src/data/model-meta.json:
  { "model": "Xenova/all-MiniLM-L6-v2", "dim": 384, "version": "wca2030-v2" }
- Rebuild and redeploy:
  npm run build
  The Workbox revision hashes will change, triggering a service-worker update on existing installs. Users will see the "Guidelines index updated. Reload to apply." banner.
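When checking the high-priority page ranges against a new edition, a small helper can flag ranges that no longer fit the document. The sketch below assumes the ranges are inclusive `[firstPage, lastPage]` pairs; the real structure of STATIC_HIGH_RANGES is not shown here, and the example values and function names are made up for illustration.

```typescript
// Hypothetical shape and values: inclusive [firstPage, lastPage] pairs.
// The real STATIC_HIGH_RANGES in scripts/chunk.ts may be structured differently.
type PageRange = readonly [number, number];

const EXAMPLE_HIGH_RANGES: readonly PageRange[] = [
  [10, 25],
  [40, 58],
];

// True when the page falls inside any high-priority range.
function isHighPriority(page: number, ranges: readonly PageRange[]): boolean {
  return ranges.some(([first, last]) => page >= first && page <= last);
}

// Flags ranges that are inverted or run past the end of the new PDF.
function findStaleRanges(ranges: readonly PageRange[], pageCount: number): PageRange[] {
  return ranges.filter(([first, last]) => first < 1 || first > last || last > pageCount);
}
```

Running `findStaleRanges` with the page count of the new PDF gives a quick first check; it cannot tell whether a range still covers the intended chapter, so a manual review of the chapter layout is still needed.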
On the very first visit the service worker pre-caches ~71 MB of assets:
| Asset | Size |
|---|---|
| chunks.json (content index) | ~10 MB |
| model_quantized.onnx (ONNX weights) | ~22 MB |
| WASM runtime (4 files) | ~36 MB |
| JS bundle, CSS, HTML, icons | ~3 MB |
Subsequent loads use the cache entirely — no network traffic.
Deploy the docs/ directory to any static host:
# Netlify
netlify deploy --prod --dir docs
# Any static host supporting HTTPS
Users visit the URL, the service worker installs on first load, and the app can then be installed to the home screen and used fully offline.
This repo is configured to deploy from the committed docs/ folder on the
main branch. The Vite build uses base: '/WCA_2030_Explorer/' so all asset
URLs resolve correctly under the GitHub Pages subdirectory.
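The effect of the base setting can be illustrated with the standard `URL` constructor. This sketch does not reproduce how Vite rewrites asset URLs internally; it only shows where a relative asset path ends up with and without the base. The `resolveAsset` helper and the example asset path are assumptions.

```typescript
// Resolve an asset path the way a browser resolves a relative URL against the
// deployed base. The origin and base come from the text above; the asset path
// is only an example.
function resolveAsset(origin: string, base: string, assetPath: string): string {
  return new URL(assetPath, origin + base).href;
}

const ORIGIN = "https://bakodramane.github.io";

// With the configured base, the asset stays inside the project subdirectory.
const withBase = resolveAsset(ORIGIN, "/WCA_2030_Explorer/", "data/chunks.json");

// With a base of "/", the same path resolves to the domain root, where
// GitHub Pages does not serve this project.
const withoutBase = resolveAsset(ORIGIN, "/", "data/chunks.json");

console.log(withBase);
console.log(withoutBase);
```

The first URL is under /WCA_2030_Explorer/ and the second is not, which is why a missing or wrong base shows up as 404 errors for assets after deployment.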
One-time setup (do this once in the GitHub web UI):
- Go to Settings → Pages in the repository.
- Under Source, choose Deploy from a branch.
- Set the branch to main and the folder to /docs.
- Click Save. GitHub Pages will publish from docs/ on every push to main.
Expected URL: https://bakodramane.github.io/WCA_2030_Explorer/
Re-deploying after a guidelines update:
# 1. Regenerate the content index (run once after replacing the source PDF)
npm run build-index
# → writes public/data/chunks.json and public/models/ directly; no copy needed
# 2. Rebuild with the correct base path
npm run build
# → Vite copies public/ into docs/ automatically
# 3. Commit and push — GitHub Pages updates automatically
git add public/data/chunks.json docs/
git commit -m "chore: rebuild docs for updated guidelines"
git push
Note: docs/ is intentionally committed to this repo (not gitignored) and is the build output that GitHub Pages serves. public/data/ and public/models/ are the canonical sources, also committed so that npm run build is always self-contained and requires no manual file-copying.
For environments with no internet access at all:
# Serve locally (Node.js must be installed on the target machine)
npx serve docs
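# → http://localhost:3000
Or zip docs/ and serve it on any local web server (Nginx, Apache, Python's http.server). Service-worker installation requires a secure context: HTTPS, or plain HTTP on localhost. For internal networks a self-signed certificate works only if the client devices are configured to trust it, because browsers refuse to register a service worker over a connection with certificate errors.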
All search processing is local to the device:
- The PDF text and its embeddings never leave the machine that runs npm run build-index.
- At query time, the user's question is encoded in-browser by the ONNX model running in WebAssembly; the query is never sent to any server.
- No analytics, telemetry, cookies, or tracking of any kind are present.
- The only outbound network request ever made is the one-time model download from Hugging Face during npm run embed (controlled by env.allowRemoteModels). In production, env.allowRemoteModels = false is set in src/engine/retrieval.ts, making the runtime fully air-gapped.
Query log download: The Share query log button in the footer downloads the user's query history (timestamps, query text, result tier, and relevance scores) as a CSV file to the user's own device. The log is never transmitted to any server, automatically or otherwise.
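Serialising the log entirely on the device could look like the sketch below. The `QueryLogEntry` shape, the column names, and the three-decimal score formatting are assumptions based on the fields listed above; the quoting follows common CSV practice (fields containing commas, quotes, or line breaks are quoted, and embedded quotes are doubled).

```typescript
// Hypothetical log entry; the column names are assumptions based on the fields listed above.
interface QueryLogEntry {
  timestamp: string; // ISO 8601
  query: string;
  tier: "semantic" | "lexical" | "not-found";
  score: number | null; // null when nothing was returned
}

// Quote a field when it contains a comma, a quote, or a line break.
function escapeCsv(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

function toCsv(entries: QueryLogEntry[]): string {
  const header = "timestamp,query,tier,score";
  const rows = entries.map((e) =>
    [e.timestamp, e.query, e.tier, e.score === null ? "" : e.score.toFixed(3)]
      .map(escapeCsv)
      .join(","),
  );
  return [header, ...rows].join("\r\n");
}
```

In a browser the resulting string could be wrapped in a `Blob` and offered through a temporary object URL, which keeps the whole export on the device with no network request.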
During Phase 6 manual testing, "What is the capital of France?" triggered the lexical
fallback and returned results scored on the words the, is, what — common English
function words that appear in virtually every chunk. Two fixes were applied in
src/engine/retrieval.ts:
- Stop-word filter: processTerm is configured to drop ~60 common English function words from both indexing and search. This reduced the off-topic BM25 score from 79 to 6.9.
- Minimum BM25 score (MIN_LEXICAL_SCORE = 8): even after stop-word filtering, "capital" and "france" appeared in the WCA 2030 references section, scoring 6.9. A minimum score of 8 was chosen because off-topic queries scored ≤ 6.9 while genuine domain-term matches (e.g. "agricultural holding") score ≫ 8.
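The two fixes can be sketched without the search library itself. MiniSearch accepts a `processTerm` callback that returns the normalised term or a falsy value to discard it; the function below has that shape. The stop-word list here is a short illustrative subset (the real list has about 60 entries), and `searchableTerms` and `acceptLexical` are hypothetical helpers with a deliberately naive tokenizer.

```typescript
// A small illustrative subset; the real list has about 60 entries.
const STOP_WORDS = new Set(["a", "an", "the", "is", "are", "what", "of", "in", "to", "and"]);

const MIN_LEXICAL_SCORE = 8; // minimum BM25 score stated above

// Same shape as a MiniSearch processTerm callback: return the normalised term,
// or null to drop it from both the index and the query.
function processTerm(term: string): string | null {
  const normalised = term.toLowerCase();
  return STOP_WORDS.has(normalised) ? null : normalised;
}

// Naive tokenizer for the demo only: split on anything that is not a letter or digit.
function searchableTerms(query: string): string[] {
  return query
    .split(/[^A-Za-z0-9]+/)
    .filter((t) => t.length > 0)
    .map((t) => processTerm(t))
    .filter((t): t is string => t !== null);
}

// Second fix: a lexical hit counts only at or above the minimum score.
function acceptLexical(score: number): boolean {
  return score >= MIN_LEXICAL_SCORE;
}

console.log(searchableTerms("What is the capital of France?")); // ["capital", "france"]
console.log(acceptLexical(6.9)); // false
```

After filtering, the off-topic query is left with only "capital" and "france", and the 6.9 score those two terms produced is rejected by the minimum-score check.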
The original vite.config.ts glob was ['**/*.{js,css,html,json,wasm}']. The ONNX
model file (model_quantized.onnx, 22 MB) has the .onnx extension and was therefore
excluded from the Workbox pre-cache manifest, meaning the model would be fetched from
the network on every cold start instead of being served from the service-worker cache.
Adding '**/*.onnx' to the glob array ensures the model is pre-cached on install and
the app works fully offline from the first reload.
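Why the model was skipped can be shown with a simplified stand-in for glob matching that looks only at the file extension. This is not how Workbox evaluates glob patterns; `isPrecached` and the example file paths are assumptions used to make the before-and-after difference concrete.

```typescript
// Simplified stand-in for glob matching: only the file extension is checked.
function isPrecached(file: string, extensions: string[]): boolean {
  const ext = file.slice(file.lastIndexOf(".") + 1).toLowerCase();
  return extensions.indexOf(ext) !== -1;
}

// Extensions covered by the original glob, and by the glob after the fix.
const ORIGINAL_EXTENSIONS = ["js", "css", "html", "json", "wasm"];
const FIXED_EXTENSIONS = [...ORIGINAL_EXTENSIONS, "onnx"];

// Example paths only; the real layout under docs/ may differ.
const MODEL_FILE = "models/model_quantized.onnx";
const INDEX_FILE = "data/chunks.json";

console.log(isPrecached(INDEX_FILE, ORIGINAL_EXTENSIONS)); // true
console.log(isPrecached(MODEL_FILE, ORIGINAL_EXTENSIONS)); // false
console.log(isPrecached(MODEL_FILE, FIXED_EXTENSIONS));    // true
```

A check of this kind against the generated pre-cache manifest would catch a newly introduced asset type before it silently breaks offline use.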