feat(model): hybrid text+fusion content classifier with structural sensor #15

Merged: Babyhamsta merged 10 commits into main from feat/training-data-boost on Jun 17, 2026

@Babyhamsta

Owner

What

Replaces the text-only Tier-3 content classifier with a hybrid: the lexical logistic model plus a Stage-2 gradient-boosted tree model over page structure. Text alone can't tell a page that IS a blocked category from one ABOUT it: a Wikipedia "Proxy server" article and a working web proxy look the same to a bag-of-words model. The fusion model reads the structure that separates them.

How it works

  • Shared extractor (structural-features.js, 62 signals): URL/host lexical, DOM tag histogram + depth, script ratios, payment/credential fields, and resource fingerprints (adult-ad / gambling-affiliate / crypto hosts, CGI-proxy markers, gambling license seals). The same function runs live on-device and offline at scrape time, so train/infer vectors are identical by construction.
  • Fusion GBDT (fusion.json): a gradient-boosted tree ensemble over [5 text scores + ~60 structural scalars], exported from sklearn to a portable array walked by extension/lib/fusion.js.
  • Hybrid decide(): fusion is the primary call (it learned is-vs-about); the text model is a high-recall backstop for true positives the trees miss; a structural article-guard (prose-rescue) suppresses the backstop on genuine articles so text's vocabulary false-positives don't leak. The SERP exemption is applied upstream.
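The tree walk and the hybrid decision can be sketched roughly as follows. This is a minimal illustration, not the code in extension/lib/fusion.js: the flat-array layout (`FlatTree`), the binary sigmoid link, and both thresholds are assumptions, since the PR does not show the fusion.json schema or the tuned cut-offs.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FlatTree:
    """Hypothetical portable layout: parallel arrays, one entry per node.

    A node whose feature index is -1 is a leaf.
    """
    feature: Sequence[int]
    threshold: Sequence[float]
    left: Sequence[int]
    right: Sequence[int]
    value: Sequence[float]


def walk_tree(tree: FlatTree, x: Sequence[float]) -> float:
    """Descend from the root; go left when x[feature] <= threshold."""
    node = 0
    while tree.feature[node] != -1:
        if x[tree.feature[node]] <= tree.threshold[node]:
            node = tree.left[node]
        else:
            node = tree.right[node]
    return tree.value[node]


def fusion_probability(trees, x, init_score=0.0, learning_rate=0.1) -> float:
    """Binary GBDT score: sigmoid of the init score plus scaled leaf sums."""
    raw = init_score + learning_rate * sum(walk_tree(t, x) for t in trees)
    return 1.0 / (1.0 + math.exp(-raw))


def decide(fusion_p, text_p, looks_like_article,
           fusion_threshold=0.5, backstop_threshold=0.95) -> str:
    """Fusion is primary; text is a backstop gated by the article-guard."""
    if fusion_p >= fusion_threshold:
        return "block"
    if text_p >= backstop_threshold and not looks_like_article:
        return "block"
    return "allow"
```

The ordering is the point of the design: the text score can only add a block when the structural guard does not see an article, so vocabulary alone never overrides the fusion call on genuine prose pages.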

Trustworthy by construction

  • export_fusion.py asserts the exported trees reproduce sklearn predict_proba exactly (0.0 diff over the test set).
  • test_fusion_parity.mjs asserts the JS interpreter matches the Python reference — closing the chain sklearn ≡ Python ≡ JS.
  • Degrades to text-only if the tree fails to load; pulled via OTA with a SHA-256 check; baseline bundled so a fresh install is never unprotected.
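A minimal sketch of the integrity check and fallback described in the last bullet. The function name, the way the expected digest is delivered, and the exact fallback are assumptions; the PR only states that the OTA payload is checked with SHA-256 and that a baseline is bundled.

```python
import hashlib
import json


def load_verified_model(payload: bytes, expected_sha256: str,
                        bundled_baseline: dict) -> dict:
    """Return the OTA model only if its SHA-256 matches; else the baseline."""
    digest = hashlib.sha256(payload).hexdigest()
    if digest != expected_sha256.lower():
        return bundled_baseline  # hash mismatch: ignore the download
    try:
        return json.loads(payload)
    except ValueError:  # malformed JSON or undecodable bytes
        return bundled_baseline
```

Falling back rather than raising keeps the classifier running with a known-good model when a download is corrupted or tampered with.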

Validation

Held-out fusion recall beats text-only by about 0.11 at a matched clean false-positive rate of about 1.5%. On a live adversarial suite the hybrid blocks real proxies, VPN vendors, casinos, game portals, and adult sites, and keeps Wikipedia articles, news, sex-ed, and educational tools clean, including held-out sites never trained on.
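"Recall at matched clean-FP" means each model's threshold is set so that the same share of clean pages is blocked, and recall is compared at that threshold. A small sketch of that measurement, assuming plain score lists; this is not the evaluation code from the PR, and the function name is hypothetical.

```python
def recall_at_fp_rate(pos_scores, neg_scores, target_fp_rate):
    """Recall at the lowest threshold whose clean-FP rate is <= target.

    A page is blocked when its score is strictly above the threshold.
    """
    ranked = sorted(neg_scores, reverse=True)
    # Number of clean pages we are allowed to block at this FP budget.
    allowed = int(target_fp_rate * len(ranked))
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    recall = sum(s > threshold for s in pos_scores) / len(pos_scores)
    return recall, threshold
```

Running this once per model with the same `target_fp_rate` (for example 0.015) gives two recall figures that can be compared directly.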

Notes

  • Training pipeline (scraper, hard-negative mining, train/eval/export) + curated seed lists included for reproducibility. Scraped corpus is not redistributed.
  • README "content model" section rewritten to document the hybrid.
  • Tests: npm test (selftest/popup/detect), node classifier/tests/test_parity.mjs, node classifier/tests/test_fusion_parity.mjs, pytest classifier/tests/ — all green; eslint . clean.

Boost Layer-3 training data and tighten train/score parity:

- Gate (filtering.py + scan.js): keep/scan when body has >=80 non-space
  chars OR title+meta >=6 tokens, mirrored byte-for-byte, so thin interior
  pages (a canvas game under a loud title) are scored instead of dropped.
- Interior pages: sample up to 4 random same-eTLD+1 links per usable
  homepage (render.py links, frontier.sample_interior_links), label-inherited.
- Dedup over the full doc (title+meta+text) so distinct thin pages survive.
- Force-label seed lists (frontier.force_label) and add 14 current proxies
  harvested from unblokkked.web.app (scrape_unblokkked.py).
- Raise per-class target to 10k; set block_threshold to 0.8.
- Drop bot/consent-wall interstitials (Google 'unusual traffic' et al.).
- Retrain and sync the on-device model on the expanded corpus.
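The keep/scan gate from the first bullet can be sketched as below. Whitespace tokenization and the function and parameter names are assumptions; the real filtering.py and scan.js may tokenize differently.

```python
def passes_gate(body: str, title: str = "", meta: str = "",
                min_body_chars: int = 80, min_head_tokens: int = 6) -> bool:
    """Keep a page if the body is long enough OR title+meta is rich enough."""
    non_space = sum(1 for ch in body if not ch.isspace())
    head_tokens = len(f"{title} {meta}".split())
    return non_space >= min_body_chars or head_tokens >= min_head_tokens
```

The OR is what lets a thin canvas-game page through: its body is nearly empty, but a title such as "Play Slope Unblocked Games Online Free" already carries six tokens.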
… exemption

Stage 1 of the is-vs-about rework (docs/model-rework-plan.md). The lexical model
scores on topic vocabulary, so it cannot separate a page that IS X from one that
is ABOUT X. Add the signal vocabulary can't fake — what the page IS and DOES —
and two hard post-rules on top of the score.

- extension/content/structural-features.js: one shared pure-DOM extractor
  (~20 scalars: link_density, paragraph_count, internal_link_ratio,
  has_url_like_input, url_embeds_url, has_dominant_canvas, video/iframe surfaces,
  script_host_entropy, ...). All feature math lives in JS so the training corpus
  and the device compute identical vectors by construction.
- render.py injects that same source into Playwright evaluate and captures the
  dict at scrape time; adds a realistic Chromebook UA/viewport + webdriver mask
  to avoid capturing headless-only bot walls. extract.py stores the full typed
  dict (back-compatible with old 3-key records).
- prose-rescue (detect/prose-rescue.js + decision.py): force clean when a
  block on proxy/adult/gambling is clearly an article (low link-density, real
  paragraphs, no dominant canvas) AND the category's functional element is
  absent. croxyproxy (has url input) is NOT rescued — regression-locked.
- search-engine exemption (detect/search-engine.js + decision.py): skip the
  content model on the ~8 search engines' SERP/home paths only; exact-host +
  path scoped so translate./cache. and Layer 4 are unaffected.
- model.js decide() takes structural + applies prose-rescue; sw.js passes it and
  applies the SERP exemption before scoring; evaluate.py mirrors both rules so
  offline metrics match the device.
- scan.js forwards structural; manifest loads the extractor before scan.js.
- Tests: test/detect.mjs + classifier/tests/test_decision.py cover the measured
  hard cases (Wikipedia article/list, cherrion, croxy, sex-ed, gambling news).
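The exact-host plus path scoping of the search-engine exemption can be illustrated like this. The host and path table is hypothetical (the commit says about 8 engines but does not list them), and the real check lives in detect/search-engine.js and decision.py.

```python
from urllib.parse import urlsplit

# Hypothetical table: exact hostnames mapped to the only exempt paths.
SERP_PATHS = {
    "www.google.com": ("/", "/search"),
    "www.bing.com": ("/", "/search"),
    "duckduckgo.com": ("/",),
}


def is_search_engine_page(url: str) -> bool:
    """True only for an exact host match AND a listed home/SERP path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    paths = SERP_PATHS.get(host)
    if paths is None:
        return False  # subdomains such as translate. never match
    return (parts.path or "/") in paths
```

Matching the full hostname rather than a suffix is what keeps translate. and cache. subdomains, which can act as proxies, inside the content model.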

One dead chromium aborted the entire overnight sweep: the periodic recycle
called browser.new_context unguarded, and asyncio.gather had no
return_exceptions, so a single TargetClosedError propagated out and cancelled
every other category. Root cause was memory growth — the recycle only reopened
the context, never the browser, so a long 16-worker run leaked until a chromium
OOMed.

- Recycle the whole browser (close + relaunch) every RECYCLE_EVERY pages to
  bound memory, via _new_session/_close_session helpers.
- Fast-recycle after FAIL_RECYCLE consecutive failed renders so a browser that
  already died is replaced within ~12 domains instead of burning the queue.
- gather(return_exceptions=True): an unexpected worker death is logged, siblings
  finish, and the run continues to the next category.
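The effect of `return_exceptions=True` can be seen in a small standalone sketch. The worker here is a stand-in that raises `RuntimeError` in place of Playwright's `TargetClosedError`, and the function names are hypothetical rather than taken from the scraper.

```python
import asyncio


async def run_category(name: str) -> str:
    """Stand-in worker: one category dies, the others finish."""
    if name == "gambling":
        raise RuntimeError("browser target closed")
    await asyncio.sleep(0)
    return f"{name}: done"


async def run_sweep(categories):
    # With return_exceptions=True a failure is returned as a value
    # instead of propagating out of gather and cancelling siblings.
    results = await asyncio.gather(
        *(run_category(c) for c in categories), return_exceptions=True
    )
    finished, failed = [], []
    for name, result in zip(categories, results):
        if isinstance(result, Exception):
            failed.append((name, repr(result)))  # log and move on
        else:
            finished.append(result)
    return finished, failed


finished, failed = asyncio.run(run_sweep(["proxy", "gambling", "adult"]))
```

Results come back in the same order as the awaitables, so zipping them with the category names is enough to attribute each failure.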
…rmful)

Held-out evaluation on the enriched corpus showed the prose-rescue hard rule
rescued genuine blocked pages, not just articles: gambling recall 0.74->0.15 and
proxy 0.59->0.25 for only a 0.005 clean-FP gain. The premise (low link-density +
paragraphs + no functional element => article) does not generalize past the
Wikipedia case — gambling marketing pages and proxy landing pages match it too.

Unwire it from decide() and evaluate._predict. Keep the structural capture and
the (neutral, safe) search-engine exemption. prose-rescue.js is retained as a
candidate SOFT feature for the Stage-2 fusion model, documented + unit-tested.

Replace the text-only Tier-3 classifier with a hybrid of the lexical
logistic model and a Stage-2 gradient-boosted tree over page structure,
so a page that IS a blocked category is separated from one that is ABOUT
it (Wikipedia "Proxy server" vs a working proxy — identical to a
bag-of-words).

Model:
- structural-features.js expanded to 62 signals (URL/host lexical, DOM
  tag histogram + depth, script ratios, payment/credential fields,
  resource fingerprints: adult-ad/gambling-affiliate/crypto hosts,
  CGI-proxy markers, gambling license seals). One shared extractor, so
  train/infer vectors are identical by construction.
- fusion.json: GBDT over [5 text scores + ~60 structural scalars],
  exported from sklearn to a portable tree array.
- decide() is now the hybrid: fusion is primary (learned is-vs-about),
  text is a high-recall backstop, gated by the structural article-guard
  (prose-rescue) so text's vocabulary false-positives don't leak.

Trustworthy by construction: export_fusion.py asserts the exported trees
reproduce sklearn predict_proba exactly; test_fusion_parity.mjs asserts
the JS interpreter matches the Python reference — chain sklearn == Python
== JS. Degrades to text-only if the tree fails to load; pulled via OTA
with a SHA-256 check.

Training pipeline (scraper, hard-negative mining, train/eval/export) and
curated seed lists included for reproducibility. Corpus not redistributed.

Babyhamsta merged commit 78713d9 into main on Jun 17, 2026
3 checks passed
Babyhamsta deleted the feat/training-data-boost branch on June 17, 2026 06:22