feat(model): hybrid text+fusion content classifier with structural sensor by Babyhamsta · Pull Request #15 · Babyhamsta/Fenceline

Babyhamsta · 2026-06-17T05:27:36Z

What

Replaces the text-only Tier-3 content classifier with a hybrid: the lexical logistic model + a Stage-2 gradient-boosted tree over page structure. Text alone can't tell a page that IS a blocked category from one ABOUT it (a Wikipedia "Proxy server" article and a working web proxy are identical to a bag-of-words). The fusion model reads the structure that separates them.

How it works

Shared extractor (structural-features.js, 62 signals): URL/host lexical, DOM tag histogram + depth, script ratios, payment/credential fields, and resource fingerprints (adult-ad / gambling-affiliate / crypto hosts, CGI-proxy markers, gambling license seals). The same function runs live on-device and offline at scrape time, so train/infer vectors are identical by construction.
Fusion GBDT (fusion.json): a tree over [5 text scores + ~60 structural scalars], exported from sklearn to a portable array walked by extension/lib/fusion.js.
Hybrid decide(): fusion is the primary call (it learned is-vs-about); the text model is a high-recall backstop for true positives the tree misses; a structural article-guard (prose-rescue) suppresses the backstop on genuine articles so text's vocabulary false-positives don't leak. SERP exemption applied upstream.
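The decision order described above can be sketched roughly as follows. This is a minimal illustration, not the code in model.js or decision.py: the threshold values, the `Verdict` type, and the exact article-guard conditions are assumptions. Only the feature names (link_density, paragraph_count, has_dominant_canvas, has_url_like_input) come from the commit notes.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
FUSION_BLOCK = 0.8     # fusion probability at or above this blocks
BACKSTOP_BLOCK = 0.95  # text score needed before the backstop may fire


@dataclass
class Verdict:
    block: bool
    reason: str


def is_article(structural: dict) -> bool:
    """Assumed article-guard: prose-heavy page with no functional element."""
    return (
        structural.get("link_density", 1.0) < 0.3
        and structural.get("paragraph_count", 0) >= 5
        and not structural.get("has_dominant_canvas", False)
        and not structural.get("has_url_like_input", False)
    )


def decide(fusion_p, text_p, structural) -> Verdict:
    """fusion_p is None when the tree failed to load (text-only fallback)."""
    if fusion_p is None:
        return Verdict(text_p >= FUSION_BLOCK, "text-only")
    if fusion_p >= FUSION_BLOCK:
        return Verdict(True, "fusion")
    # Backstop: a very confident text score still blocks, unless the page
    # looks like a genuine article.
    if text_p >= BACKSTOP_BLOCK and not is_article(structural):
        return Verdict(True, "text-backstop")
    return Verdict(False, "clean")
```

With these assumed values, an encyclopedia-style page with a high text score and a low fusion score stays clean, while the same scores on a page with a URL input field are blocked by the backstop.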

Trustworthy by construction

export_fusion.py asserts the exported trees reproduce sklearn predict_proba exactly (0.0 diff over the test set).
test_fusion_parity.mjs asserts the JS interpreter matches the Python reference — closing the chain sklearn ≡ Python ≡ JS.
Degrades to text-only if the tree fails to load; pulled via OTA with a SHA-256 check; baseline bundled so a fresh install is never unprotected.
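A parity check of this kind can be illustrated with a small tree walker. The array layout below (parallel `feature`, `threshold`, `left`, `right`, `value` arrays with `-1` marking a leaf) is a hypothetical stand-in for whatever fusion.json actually stores, and the sketch assumes a binary model with a sigmoid link. The real export may differ.

```python
import math

# Hypothetical portable layout: one dict of parallel arrays per tree.
# feature[i] == -1 marks a leaf; otherwise go left when x[feature] <= threshold.
MODEL = {
    "init": 0.0,
    "learning_rate": 0.5,
    "trees": [
        {
            "feature": [0, -1, -1],
            "threshold": [0.5, 0.0, 0.0],
            "left": [1, -1, -1],
            "right": [2, -1, -1],
            "value": [0.0, -2.0, 2.0],
        }
    ],
}


def walk_tree(tree, x):
    """Follow one tree from the root to a leaf and return the leaf value."""
    node = 0
    while tree["feature"][node] != -1:
        if x[tree["feature"][node]] <= tree["threshold"][node]:
            node = tree["left"][node]
        else:
            node = tree["right"][node]
    return tree["value"][node]


def predict_proba(model, x):
    """Sum the scaled leaf values and squash with a sigmoid (binary case)."""
    raw = model["init"] + model["learning_rate"] * sum(
        walk_tree(t, x) for t in model["trees"]
    )
    return 1.0 / (1.0 + math.exp(-raw))


def max_abs_diff(model, rows, reference):
    """Largest gap between the walker and reference probabilities."""
    return max(abs(predict_proba(model, r) - ref) for r, ref in zip(rows, reference))
```

A parity test would feed the same rows to the reference implementation and to the walker, then assert that `max_abs_diff` stays at or below a chosen tolerance.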

Validation

Held-out fusion recall beats text-only by about 0.11 at a matched clean false-positive rate of roughly 1.5%. On a live adversarial suite the hybrid blocks real proxies, VPN vendors, casinos, game portals, and adult sites, and keeps Wikipedia articles, news, sex-ed, and educational tools clean, including held-out sites never trained on.
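"Recall at matched clean-FP" means each model gets its own threshold, chosen so that both flag the same share of clean pages, and recall is compared at those thresholds. The sketch below shows one simple way to compute that. The function names are illustrative and the scores in the usage note are toy values, not results from this PR.

```python
def fp_rate(clean_scores, t):
    """Share of clean pages scored at or above threshold t."""
    return sum(s >= t for s in clean_scores) / len(clean_scores)


def recall(blocked_scores, t):
    """Share of truly blocked pages scored at or above threshold t."""
    return sum(s >= t for s in blocked_scores) / len(blocked_scores)


def threshold_at_fp(clean_scores, blocked_scores, max_fp):
    """Lowest observed score usable as a threshold within the FP budget."""
    for t in sorted(set(clean_scores) | set(blocked_scores)):
        if fp_rate(clean_scores, t) <= max_fp:
            return t
    return float("inf")  # no threshold fits the budget: block nothing


def recall_at_fp(clean_scores, blocked_scores, max_fp):
    t = threshold_at_fp(clean_scores, blocked_scores, max_fp)
    return recall(blocked_scores, t)
```

For example, with toy clean scores `[0.1, 0.2, 0.3, 0.9]` and blocked scores `[0.5, 0.95, 0.4, 0.85]`, `recall_at_fp(..., max_fp=0.25)` picks the threshold 0.4. Running the same call for two models with `max_fp=0.015` would give the matched comparison described above.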

Notes

Training pipeline (scraper, hard-negative mining, train/eval/export) + curated seed lists included for reproducibility. Scraped corpus is not redistributed.
README "content model" section rewritten to document the hybrid.
Tests: npm test (selftest/popup/detect), node classifier/tests/test_parity.mjs, node classifier/tests/test_fusion_parity.mjs, pytest classifier/tests/ — all green; eslint . clean.

Boost Layer-3 training data and tighten train/score parity:

- Gate (filtering.py + scan.js): keep/scan when body has >=80 non-space chars OR title+meta >=6 tokens, mirrored byte-for-byte, so thin interior pages (a canvas game under a loud title) are scored instead of dropped.
- Interior pages: sample up to 4 random same-eTLD+1 links per usable homepage (render.py links, frontier.sample_interior_links), label-inherited.
- Dedup over the full doc (title+meta+text) so distinct thin pages survive.
- Force-label seed lists (frontier.force_label) and add 14 current proxies harvested from unblokkked.web.app (scrape_unblokkked.py).
- Raise per-class target to 10k; set block_threshold to 0.8.
- Drop bot/consent-wall interstitials (Google 'unusual traffic' et al.).
- Retrain and sync the on-device model on the expanded corpus.
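The gate rule from the first bullet can be written in a few lines. This is a sketch under assumptions: the constant and function names are made up, and whitespace splitting stands in for whatever tokenizer filtering.py and scan.js actually share.

```python
MIN_BODY_CHARS = 80   # from the commit note: >=80 non-space chars
MIN_HEAD_TOKENS = 6   # from the commit note: title+meta >=6 tokens


def passes_gate(body: str, title: str = "", meta: str = "") -> bool:
    """Keep a page if the body is substantial OR the title+meta is."""
    body_chars = sum(1 for ch in body if not ch.isspace())
    # Assumption: tokens are whitespace-separated words.
    head_tokens = len(f"{title} {meta}".split())
    return body_chars >= MIN_BODY_CHARS or head_tokens >= MIN_HEAD_TOKENS
```

A canvas game with an empty body but a six-word title passes, which is the thin-page case the commit targets. For train/score parity the JavaScript side would need to count characters and tokens the same way.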

… exemption

Stage 1 of the is-vs-about rework (docs/model-rework-plan.md). The lexical model scores on topic vocabulary, so it cannot separate a page that IS X from one that is ABOUT X. Add the signal vocabulary can't fake (what the page IS and DOES) and two hard post-rules on top of the score.

- extension/content/structural-features.js: one shared pure-DOM extractor (~20 scalars: link_density, paragraph_count, internal_link_ratio, has_url_like_input, url_embeds_url, has_dominant_canvas, video/iframe surfaces, script_host_entropy, ...). All feature math lives in JS so the training corpus and the device compute identical vectors by construction.
- render.py injects that same source into Playwright evaluate and captures the dict at scrape time; adds a realistic Chromebook UA/viewport + webdriver mask to avoid capturing headless-only bot walls. extract.py stores the full typed dict (back-compatible with old 3-key records).
- prose-rescue (detect/prose-rescue.js + decision.py): force clean when a block on proxy/adult/gambling is clearly an article (low link-density, real paragraphs, no dominant canvas) AND the category's functional element is absent. croxyproxy (has url input) is NOT rescued; this case is regression-locked.
- search-engine exemption (detect/search-engine.js + decision.py): skip the content model on the ~8 search engines' SERP/home paths only; exact-host + path scoped so translate./cache. and Layer 4 are unaffected.
- model.js decide() takes structural + applies prose-rescue; sw.js passes it and applies the SERP exemption before scoring; evaluate.py mirrors both rules so offline metrics match the device.
- scan.js forwards structural; manifest loads the extractor before scan.js.
- Tests: test/detect.mjs + classifier/tests/test_decision.py cover the measured hard cases (Wikipedia article/list, cherrion, croxy, sex-ed, gambling news).
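An exact-host, path-scoped exemption like the one in the search-engine bullet might look like the sketch below. The host table is hypothetical (the commit only says "~8 search engines"), and exact path matching is an assumption about how "path scoped" is implemented.

```python
from urllib.parse import urlsplit

# Hypothetical table: exact host -> exempt paths (home and results page).
SERP_PATHS = {
    "www.google.com": ("/", "/search"),
    "www.bing.com": ("/", "/search"),
    "duckduckgo.com": ("/",),
}


def is_serp_exempt(url: str) -> bool:
    """True only for a listed host AND a listed path on that host."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    paths = SERP_PATHS.get(host)  # exact match: subdomains are not exempt
    if paths is None:
        return False
    return (parts.path or "/") in paths
```

Because the lookup is an exact host match, a subdomain such as translate.google.com falls through to the content model, which matches the scoping described in the bullet.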

One dead chromium aborted the entire overnight sweep: the periodic recycle called browser.new_context unguarded, and asyncio.gather had no return_exceptions, so a single TargetClosedError propagated out and cancelled every other category. Root cause was memory growth: the recycle only reopened the context, never the browser, so a long 16-worker run leaked until a chromium OOMed.

- Recycle the whole browser (close + relaunch) every RECYCLE_EVERY pages to bound memory, via _new_session/_close_session helpers.
- Fast-recycle after FAIL_RECYCLE consecutive failed renders so a browser that already died is replaced within ~12 domains instead of burning the queue.
- gather(return_exceptions=True): an unexpected worker death is logged, siblings finish, and the run continues to the next category.
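The effect of `return_exceptions=True` can be seen in a small standalone example. The coroutine names and the `RuntimeError` (standing in for Playwright's TargetClosedError) are illustrative, not taken from the scraper.

```python
import asyncio


async def scrape_category(name: str) -> str:
    """Toy worker: one category fails the way a dead browser would."""
    if name == "gambling":
        raise RuntimeError("browser closed")  # stand-in for TargetClosedError
    await asyncio.sleep(0)
    return f"{name}: ok"


async def run_sweep(categories):
    # With return_exceptions=True a failure comes back as a result value
    # instead of propagating out of gather and aborting the sweep.
    results = await asyncio.gather(
        *(scrape_category(c) for c in categories), return_exceptions=True
    )
    report = {}
    for name, result in zip(categories, results):
        if isinstance(result, BaseException):
            report[name] = f"failed: {result}"  # log it and carry on
        else:
            report[name] = result
    return report


if __name__ == "__main__":
    print(asyncio.run(run_sweep(["proxy", "gambling", "adult"])))
```

Results come back in the same order as the awaitables, so each one can be zipped with its category name. Without the flag, the first exception would surface from `gather` immediately and the caller would lose the other results.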

…rmful)

Held-out evaluation on the enriched corpus showed the prose-rescue hard rule rescued genuine blocked pages, not just articles: gambling recall 0.74->0.15 and proxy 0.59->0.25 for only a 0.005 clean-FP gain. The premise (low link-density + paragraphs + no functional element => article) does not generalize past the Wikipedia case: gambling marketing pages and proxy landing pages match it too.

Unwire it from decide() and evaluate._predict. Keep the structural capture and the (neutral, safe) search-engine exemption. prose-rescue.js is retained as a candidate SOFT feature for the Stage-2 fusion model, documented + unit-tested.

Replace the text-only Tier-3 classifier with a hybrid of the lexical logistic model and a Stage-2 gradient-boosted tree over page structure, so a page that IS a blocked category is separated from one that is ABOUT it (Wikipedia "Proxy server" vs a working proxy, which are identical to a bag-of-words).

Model:

- structural-features.js expanded to 62 signals (URL/host lexical, DOM tag histogram + depth, script ratios, payment/credential fields, resource fingerprints: adult-ad/gambling-affiliate/crypto hosts, CGI-proxy markers, gambling license seals). One shared extractor, so train/infer vectors are identical by construction.
- fusion.json: GBDT over [5 text scores + ~60 structural scalars], exported from sklearn to a portable tree array.
- decide() is now the hybrid: fusion is primary (learned is-vs-about), text is a high-recall backstop, gated by the structural article-guard (prose-rescue) so text's vocabulary false-positives don't leak.

Trustworthy by construction: export_fusion.py asserts the exported trees reproduce sklearn predict_proba exactly; test_fusion_parity.mjs asserts the JS interpreter matches the Python reference, giving the chain sklearn == Python == JS. Degrades to text-only if the tree fails to load; pulled via OTA with a SHA-256 check.

Training pipeline (scraper, hard-negative mining, train/eval/export) and curated seed lists included for reproducibility. Corpus not redistributed.
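The "SHA-256 check, else degrade" behaviour can be sketched as a loader that returns nothing when verification or parsing fails. The function name and the convention of returning `None` to signal text-only mode are assumptions made for illustration.

```python
import hashlib
import json


def load_fusion(payload: bytes, expected_sha256: str):
    """Return the parsed model, or None so the caller falls back to text-only."""
    digest = hashlib.sha256(payload).hexdigest()
    if digest != expected_sha256.lower():
        return None  # tampered or truncated download: do not trust it
    try:
        return json.loads(payload)
    except ValueError:  # covers JSONDecodeError and bad encodings
        return None
```

A caller would treat `None` as "keep using the text model (or the bundled baseline)" rather than as an error, so a failed update never leaves the device unprotected.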

Babyhamsta added 10 commits June 13, 2026 02:20

feat(classifier): BUILD_RAW env override to build from a fresh corpus

5200944

feat(classifier): DECONTAM_RAW env override for decontaminating a fresh corpus

542f611

style(model): prettier-format hybrid model files

0139adf

style: prettier-format scan.js and detect.mjs

02472cf

style(classifier): ruff check + format new python files

76096dc

Babyhamsta merged commit 78713d9 into main Jun 17, 2026
3 checks passed

Babyhamsta deleted the feat/training-data-boost branch June 17, 2026 06:22

Babyhamsta merged 10 commits into main from feat/training-data-boost
