WCA 2030 Explorer

An offline-first Progressive Web App that answers questions exclusively from the World Programme for the Census of Agriculture 2030 (WCA 2030) guidelines. Every answer is verbatim extracted text from the guidelines, accompanied by a precise section title and page number.

1. Purpose & constraints

The WCA 2030 Explorer is a retrieval tool — not a generative AI. It enforces the following hard constraints:

Answers are extracted text only. The retrieved chunk is the answer; no paraphrasing or generation occurs.
No external API calls at runtime. After the first load, the app works with zero internet access. All model inference runs in-browser via WebAssembly.
Guardrail is mandatory. When no chunk exceeds the confidence threshold and keyword fallback also fails, the app returns: "This question could not be answered from the WCA 2030 guidelines. Sections searched: [list]."
Every answer cites the source chunk's section title and page number.
No generative model at runtime. The embedding model (all-MiniLM-L6-v2) is used only to encode queries; it never generates text.

The app is intended for FAO staff and national census bureaux who need authoritative, citable answers from the WCA 2030 methodology document without network access in the field.

2. Build instructions

Prerequisites

Node.js ≥ 18 (for native fetch, structuredClone, WASM support)
npm ≥ 9
The WCA 2030 source PDF at ./source/Census-2030_EN-DTP-9.pdf

Windows users: run all commands in Git Bash or PowerShell. Do not use cmd.exe — the npx tsx calls require a POSIX-compatible shell or PowerShell for proper path handling.

Step 1 — Generate the content index (run once; takes 5–20 min)

# Extract text from the PDF, chunk it, and embed every chunk
npm run build-index

This runs scripts/chunk.ts (PDF extraction + chunking) and then scripts/embed.ts (downloads Xenova/all-MiniLM-L6-v2 on first run and embeds ~876 chunks). Outputs — all written to the canonical source locations under public/:

src/data/chunks-raw.json — intermediate raw chunks (no embeddings; gitignored)
public/data/chunks.json — chunks with 384-dimensional embeddings (~10 MB)
public/models/ — ONNX model weights + WASM runtime files

The model download (~23 MB) requires internet access on the first run only. Subsequent runs use the local .cache/ directory and are fully offline.

No manual copy step is needed. public/data/ is the canonical source; npm run build (Step 2) copies everything from public/ into docs/ automatically.

Step 2 — Build the PWA

npm run build

Compiles TypeScript, bundles the app, and generates:

docs/ — the production bundle
docs/sw.js — the Workbox service worker with a 16-entry pre-cache manifest (~71 MB total)

Step 3 — Preview locally

npm run preview
# → open http://localhost:4173

On the first visit the service worker installs and pre-caches all 16 assets (JS bundle, CSS, chunks.json, ONNX model, WASM runtime files, icons). After that the app is fully offline-capable.

Development server

npm run dev
# → http://localhost:5173  (no service worker; hot-module reload active)

Running tests

npm test
# 20 tests across 3 files — chunking, retrieval, guardrail

3. Threshold tuning

The confidence threshold controls how selective the semantic search is before falling back to keyword search or returning a not-found response.

Default value: 0.42

Live tuning via DevTools (no rebuild needed):

// Lower the threshold to surface more borderline results
localStorage.setItem('wca_threshold', '0.38')

// Raise it to be more selective
localStorage.setItem('wca_threshold', '0.60')

// Restore default
localStorage.removeItem('wca_threshold')

Then reload the page — the new threshold takes effect on the next query.

How the cascade works

Semantic search — if the top cosine-similarity score (with 1.15× boost for high-priority chunks) meets the threshold, those results are returned.
Lexical fallback — if semantic fails, MiniSearch BM25 search runs. Only results scoring ≥ 8 BM25 points are accepted (prevents false positives from incidental word overlap on off-topic queries).
Not-found response — if both fail, the guardrail card is shown with a deduplicated list of sections searched.

Note: raising the threshold for an on-topic query (e.g. '0.99' for "agricultural holding") will cause the semantic pass to fail but the lexical fallback will still answer it, since "agricultural" and "holding" score ≫ 8 in BM25. To force the not-found card, the query must also fail the keyword search — for example "What is the capital of France?" fails both at any threshold.

4. Updating guidelines

When a new edition of the WCA guidelines is released:

Replace ./source/Census-2030_EN-DTP-9.pdf with the new PDF. Update every filename reference in scripts/chunk.ts if it differs.
Verify the high-priority page ranges in scripts/chunk.ts still match the new document's chapter layout, and update the STATIC_HIGH_RANGES array if needed.
Re-run the full pipeline:
```
npm run build-index
```
This writes the updated public/data/chunks.json directly — no copy step needed.

Bump the version field in src/data/model-meta.json:

{ "model": "Xenova/all-MiniLM-L6-v2", "dim": 384, "version": "wca2030-v2" }

Rebuild and redeploy:
```
npm run build
```
The Workbox revision hashes will change, triggering a service-worker update on existing installs. Users will see the "Guidelines index updated. Reload to apply." banner.

5. Distribution options

First-load download size

On the very first visit the service worker pre-caches ~71 MB of assets:

Asset	Size
`chunks.json` (content index)	~10 MB
`model_quantized.onnx` (ONNX weights)	~22 MB
WASM runtime (4 files)	~36 MB
JS bundle, CSS, HTML, icons	~3 MB

Subsequent loads use the cache entirely — no network traffic.

Hosted PWA

Deploy the docs/ directory to any static host:

# Netlify
netlify deploy --prod --dir docs

# Any static host supporting HTTPS

Users visit the URL, the service worker installs on first load, and the app can then be installed to the home screen and used fully offline.

GitHub Pages (Option A — commit `docs/` directly)

This repo is configured to deploy from the committed docs/ folder on the main branch. The Vite build uses base: '/WCA_2030_Explorer/' so all asset URLs resolve correctly under the GitHub Pages subdirectory.

One-time setup (do this once in the GitHub web UI):

Go to Settings → Pages in the repository.
Under Source, choose Deploy from a branch.
Set the branch to main and the folder to /docs.
Click Save. GitHub Pages will publish from docs/ on every push to main.

Expected URL: https://bakodramane.github.io/WCA_2030_Explorer/

Re-deploying after a guidelines update:

# 1. Regenerate the content index (run once after replacing the source PDF)
npm run build-index
# → writes public/data/chunks.json and public/models/ directly; no copy needed

# 2. Rebuild with the correct base path
npm run build
# → Vite copies public/ into docs/ automatically

# 3. Commit and push — GitHub Pages updates automatically
git add public/data/chunks.json docs/
git commit -m "chore: rebuild docs for updated guidelines"
git push

Note: docs/ is intentionally committed to this repo (not gitignored) and is the build output that GitHub Pages serves. public/data/ and public/models/ are the canonical sources — also committed so that npm run build is always self-contained and requires no manual file-copying.

Air-gapped / offline-only use

For environments with no internet access at all:

# Serve locally (Node.js must be installed on the target machine)
npx serve docs
# → http://localhost:3000

Or zip docs/ and serve it on any local web server (Nginx, Apache, Python's http.server). A proper HTTPS origin is required for service-worker installation; for internal networks a self-signed certificate is sufficient.

6. Privacy note

All search processing is local to the device:

The PDF text and its embeddings never leave the machine that runs npm run build-index.
At query time, the user's question is encoded in-browser by the ONNX model running in WebAssembly — the query is never sent to any server.
No analytics, telemetry, cookies, or tracking of any kind are present.
The only outbound network request ever made is the one-time model download from Hugging Face during npm run embed (controlled by env.allowRemoteModels). In production, env.allowRemoteModels = false is set in src/engine/retrieval.ts, making the runtime fully air-gapped.

Query log download: The Share query log button in the footer downloads the user's query history (timestamps, query text, result tier, and relevance scores) as a local CSV file to the user's own device. No data is transmitted to any external server — the log is only ever downloaded locally and never sent anywhere automatically or otherwise.

Implementation notes

MiniSearch stop-word filter and `MIN_LEXICAL_SCORE = 8`

During Phase 6 manual testing, "What is the capital of France?" triggered the lexical fallback and returned results scored on the words the, is, what — common English function words that appear in virtually every chunk. Two fixes were applied in src/engine/retrieval.ts:

Stop-word filter — processTerm is configured to drop ~60 common English function words from both indexing and search. This reduced the off-topic BM25 score from 79 → 6.9.
Minimum BM25 score (MIN_LEXICAL_SCORE = 8) — even after stop-word filtering, "capital" and "france" appeared in the WCA 2030 references section, scoring 6.9. A minimum score of 8 was chosen because off-topic queries scored ≤ 6.9 while genuine domain-term matches (e.g. "agricultural holding") score ≫ 8.

`.onnx` added to Workbox `globPatterns`

The original vite.config.ts glob was ['**/*.{js,css,html,json,wasm}']. The ONNX model file (model_quantized.onnx, 22 MB) has the .onnx extension and was therefore excluded from the Workbox pre-cache manifest, meaning the model would be fetched from the network on every cold start instead of being served from the service-worker cache. Adding '**/*.onnx' to the glob array ensures the model is pre-cached on install and the app works fully offline from the first reload.

Name		Name	Last commit message	Last commit date
Latest commit History 48 Commits
.claude		.claude
.netlify		.netlify
data		data
docs		docs
public		public
scripts		scripts
source		source
src		src
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
index.html		index.html
package-lock.json		package-lock.json
package.json		package.json
tsconfig.json		tsconfig.json
vite.config.ts		vite.config.ts
vitest.config.ts		vitest.config.ts

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

WCA 2030 Explorer

1. Purpose & constraints

2. Build instructions

Prerequisites

Step 1 — Generate the content index (run once; takes 5–20 min)

Step 2 — Build the PWA

Step 3 — Preview locally

Development server

Running tests

3. Threshold tuning

How the cascade works

4. Updating guidelines

5. Distribution options

First-load download size

Hosted PWA

GitHub Pages (Option A — commit `docs/` directly)

Air-gapped / offline-only use

6. Privacy note

Implementation notes

MiniSearch stop-word filter and `MIN_LEXICAL_SCORE = 8`

`.onnx` added to Workbox `globPatterns`

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

WCA 2030 Explorer

1. Purpose & constraints

2. Build instructions

Prerequisites

Step 1 — Generate the content index (run once; takes 5–20 min)

Step 2 — Build the PWA

Step 3 — Preview locally

Development server

Running tests

3. Threshold tuning

How the cascade works

4. Updating guidelines

5. Distribution options

First-load download size

Hosted PWA

GitHub Pages (Option A — commit docs/ directly)

Air-gapped / offline-only use

6. Privacy note

Implementation notes

MiniSearch stop-word filter and MIN_LEXICAL_SCORE = 8

.onnx added to Workbox globPatterns

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

GitHub Pages (Option A — commit `docs/` directly)

MiniSearch stop-word filter and `MIN_LEXICAL_SCORE = 8`

`.onnx` added to Workbox `globPatterns`

Packages