feat(model): hybrid text+fusion content classifier with structural sensor #15
Merged
Boost Layer-3 training data and tighten train/score parity:
- Gate (filtering.py + scan.js): keep/scan when the body has >=80 non-space chars OR title+meta has >=6 tokens, mirrored byte-for-byte, so thin interior pages (a canvas game under a loud title) are scored instead of dropped.
- Interior pages: sample up to 4 random same-eTLD+1 links per usable homepage (render.py links, frontier.sample_interior_links), label-inherited.
- Dedup over the full doc (title+meta+text) so distinct thin pages survive.
- Force-label seed lists (frontier.force_label) and add 14 current proxies harvested from unblokkked.web.app (scrape_unblokkked.py).
- Raise per-class target to 10k; set block_threshold to 0.8.
- Drop bot/consent-wall interstitials (Google 'unusual traffic' et al.).
- Retrain and sync the on-device model on the expanded corpus.
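The gate rule in the commit above can be sketched in a few lines. This is an illustration only, not the code in filtering.py: the function name `passes_gate`, the whitespace-split tokenizer, and the use of `str.isspace` are assumptions. The two thresholds come from the commit message.

```python
MIN_BODY_CHARS = 80   # threshold stated in the commit message
MIN_HEAD_TOKENS = 6   # threshold stated in the commit message


def passes_gate(body: str, title: str = "", meta: str = "") -> bool:
    """Keep a page if its body is long enough OR its title+meta is wordy enough.

    Hypothetical helper: tokenization by whitespace is an assumption.
    """
    # Count only non-whitespace characters in the body.
    body_chars = sum(1 for ch in body if not ch.isspace())
    # Count whitespace-separated tokens across title and meta description.
    head_tokens = len(f"{title} {meta}".split())
    return body_chars >= MIN_BODY_CHARS or head_tokens >= MIN_HEAD_TOKENS
```

A JavaScript mirror would need the same definition of whitespace and the same tokenizer to stay in parity; Python's `str.isspace` and a JS `\s` regex do not agree on every Unicode character, so a real implementation would have to pin that down explicitly.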
… exemption
Stage 1 of the is-vs-about rework (docs/model-rework-plan.md). The lexical model scores on topic vocabulary, so it cannot separate a page that IS X from one that is ABOUT X. Add the signal vocabulary can't fake (what the page IS and DOES) and two hard post-rules on top of the score.
- extension/content/structural-features.js: one shared pure-DOM extractor (~20 scalars: link_density, paragraph_count, internal_link_ratio, has_url_like_input, url_embeds_url, has_dominant_canvas, video/iframe surfaces, script_host_entropy, ...). All feature math lives in JS so the training corpus and the device compute identical vectors by construction.
- render.py injects that same source into Playwright evaluate and captures the dict at scrape time; adds a realistic Chromebook UA/viewport + webdriver mask to avoid capturing headless-only bot walls. extract.py stores the full typed dict (back-compatible with old 3-key records).
- prose-rescue (detect/prose-rescue.js + decision.py): force clean when a block on proxy/adult/gambling is clearly an article (low link-density, real paragraphs, no dominant canvas) AND the category's functional element is absent. croxyproxy (has url input) is NOT rescued; this is regression-locked.
- search-engine exemption (detect/search-engine.js + decision.py): skip the content model on the ~8 search engines' SERP/home paths only; exact-host + path scoped so translate./cache. and Layer 4 are unaffected.
- model.js decide() takes structural + applies prose-rescue; sw.js passes it and applies the SERP exemption before scoring; evaluate.py mirrors both rules so offline metrics match the device.
- scan.js forwards structural; manifest loads the extractor before scan.js.
- Tests: test/detect.mjs + classifier/tests/test_decision.py cover the measured hard cases (Wikipedia article/list, cherrion, croxy, sex-ed, gambling news).
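A minimal sketch of what "exact-host + path scoped" could look like, assuming a lookup table keyed by full hostname. The hosts and paths below are placeholders: the commit says "~8 search engines" without naming them, and `is_search_engine_page` is a hypothetical name, not the function in decision.py.

```python
from urllib.parse import urlsplit

# Hypothetical table: the real list of engines and paths is not given in the source.
SERP_PATHS = {
    "www.google.com": ("/", "/search"),
    "www.bing.com": ("/", "/search"),
    "duckduckgo.com": ("/",),
}


def is_search_engine_page(url: str) -> bool:
    """True only for an exact host match AND an allowlisted path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Dictionary lookup means no suffix matching, so a sibling
    # subdomain such as translate.<engine> never qualifies.
    paths = SERP_PATHS.get(host)
    if paths is None:
        return False
    path = parts.path or "/"  # treat a bare origin as the home path
    return path in paths
```

Matching on the full hostname rather than the registrable domain is what keeps translation and cache subdomains outside the exemption.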
One dead chromium aborted the entire overnight sweep: the periodic recycle called browser.new_context unguarded, and asyncio.gather had no return_exceptions, so a single TargetClosedError propagated out and cancelled every other category. Root cause was memory growth: the recycle only reopened the context, never the browser, so a long 16-worker run leaked until a chromium OOMed.
- Recycle the whole browser (close + relaunch) every RECYCLE_EVERY pages to bound memory, via _new_session/_close_session helpers.
- Fast-recycle after FAIL_RECYCLE consecutive failed renders so a browser that already died is replaced within ~12 domains instead of burning the queue.
- gather(return_exceptions=True): an unexpected worker death is logged, siblings finish, and the run continues to the next category.
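The effect of `return_exceptions=True` can be shown without a browser. In this sketch `scrape_category` and `run_sweep` are stand-ins invented for illustration, and a plain `RuntimeError` simulates the dead-browser error; only `asyncio.gather(..., return_exceptions=True)` is taken from the commit.

```python
import asyncio


async def scrape_category(name: str) -> str:
    """Stand-in worker; the 'gambling' category simulates a browser that died."""
    await asyncio.sleep(0)
    if name == "gambling":
        raise RuntimeError("Target closed")
    return f"{name}: ok"


async def run_sweep(categories: list[str]) -> dict[str, str]:
    # With return_exceptions=True a failing worker yields its exception
    # as a result instead of propagating out of gather().
    results = await asyncio.gather(
        *(scrape_category(c) for c in categories),
        return_exceptions=True,
    )
    report = {}
    for name, result in zip(categories, results):
        if isinstance(result, BaseException):
            report[name] = f"failed: {result!r}"  # record it and move on
        else:
            report[name] = result
    return report
```

Calling `asyncio.run(run_sweep(["proxy", "gambling", "adult"]))` returns a report in which the two healthy categories completed and the failed one is recorded rather than raised.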
…rmful)
Held-out evaluation on the enriched corpus showed the prose-rescue hard rule rescued genuine blocked pages, not just articles: gambling recall 0.74->0.15 and proxy 0.59->0.25 for only a 0.005 clean-FP gain. The premise (low link-density + paragraphs + no functional element => article) does not generalize past the Wikipedia case: gambling marketing pages and proxy landing pages match it too.
Unwire it from decide() and evaluate._predict. Keep the structural capture and the (neutral, safe) search-engine exemption. prose-rescue.js is retained as a candidate SOFT feature for the Stage-2 fusion model, documented + unit-tested.
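The two quantities traded off in that evaluation, per-class recall and the clean false-positive rate, can be computed as below. This is a generic sketch, not the code in evaluate.py; the function names and the label string "clean" are assumptions.

```python
def recall(y_true: list[str], y_pred: list[str], label: str) -> float:
    """Fraction of pages truly in `label` that were predicted as `label`."""
    total = sum(1 for t in y_true if t == label)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    return hits / total if total else 0.0


def clean_fp_rate(y_true: list[str], y_pred: list[str], clean: str = "clean") -> float:
    """Fraction of truly clean pages that were flagged as any blocked class."""
    total = sum(1 for t in y_true if t == clean)
    flagged = sum(1 for t, p in zip(y_true, y_pred) if t == clean and p != clean)
    return flagged / total if total else 0.0
```

Comparing these numbers with the rule switched on and off, on the same held-out split, is the kind of check that would surface a rule that buys a small clean-FP gain at a large recall cost.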
Replace the text-only Tier-3 classifier with a hybrid of the lexical logistic model and a Stage-2 gradient-boosted tree over page structure, so a page that IS a blocked category is separated from one that is ABOUT it (Wikipedia "Proxy server" vs a working proxy: identical to a bag-of-words).
Model:
- structural-features.js expanded to 62 signals (URL/host lexical, DOM tag histogram + depth, script ratios, payment/credential fields, resource fingerprints: adult-ad/gambling-affiliate/crypto hosts, CGI-proxy markers, gambling license seals). One shared extractor, so train/infer vectors are identical by construction.
- fusion.json: GBDT over [5 text scores + ~60 structural scalars], exported from sklearn to a portable tree array.
- decide() is now the hybrid: fusion is primary (learned is-vs-about), text is a high-recall backstop, gated by the structural article-guard (prose-rescue) so text's vocabulary false-positives don't leak.
Trustworthy by construction: export_fusion.py asserts the exported trees reproduce sklearn predict_proba exactly; test_fusion_parity.mjs asserts the JS interpreter matches the Python reference, giving the chain sklearn == Python == JS. Degrades to text-only if the tree fails to load; pulled via OTA with a SHA-256 check.
Training pipeline (scraper, hard-negative mining, train/eval/export) and curated seed lists included for reproducibility. Corpus not redistributed.
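To make "portable tree array" concrete, here is one way such an export could be walked. The layout is an assumption, not the actual fusion.json schema: parallel `feature`/`threshold`/`left`/`right`/`value` arrays with `-1` marking a leaf, which mirrors how sklearn stores its trees internally. The binary sigmoid link is also an assumption; the real model may be multiclass.

```python
import math


def walk_tree(tree: dict, x: list[float]) -> float:
    """Descend from the root until a leaf (left child == -1) and return its value."""
    node = 0
    while tree["left"][node] != -1:
        # sklearn convention: go left when the feature is <= the threshold.
        if x[tree["feature"][node]] <= tree["threshold"][node]:
            node = tree["left"][node]
        else:
            node = tree["right"][node]
    return tree["value"][node]


def predict_proba(model: dict, x: list[float]) -> float:
    """Binary GBDT: sigmoid of init + learning_rate * sum of leaf values."""
    raw = model["init"] + model["learning_rate"] * sum(
        walk_tree(t, x) for t in model["trees"]
    )
    return 1.0 / (1.0 + math.exp(-raw))


# Hypothetical one-stump model that splits on feature 0 at 0.5.
EXAMPLE_MODEL = {
    "init": 0.0,
    "learning_rate": 1.0,
    "trees": [{
        "feature": [0, -1, -1],
        "threshold": [0.5, 0.0, 0.0],
        "left": [1, -1, -1],
        "right": [2, -1, -1],
        "value": [0.0, -2.0, 2.0],
    }],
}
```

Because the walker uses only array indexing and comparisons, the same logic ports line for line to JavaScript, which is what makes an exact parity assertion between the two interpreters feasible.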
What
Replaces the text-only Tier-3 content classifier with a hybrid: the lexical logistic model + a Stage-2 gradient-boosted tree over page structure. Text alone can't tell a page that IS a blocked category from one ABOUT it (a Wikipedia "Proxy server" article and a working web proxy are identical to a bag-of-words). The fusion model reads the structure that separates them.
How it works
- structural-features.js (62 signals): URL/host lexical, DOM tag histogram + depth, script ratios, payment/credential fields, and resource fingerprints (adult-ad / gambling-affiliate / crypto hosts, CGI-proxy markers, gambling license seals). The same function runs live on-device and offline at scrape time, so train/infer vectors are identical by construction.
- fusion.json: a gradient-boosted tree over [5 text scores + ~60 structural scalars], exported from sklearn to a portable array walked by extension/lib/fusion.js.
- decide(): fusion is the primary call (it learned is-vs-about); the text model is a high-recall backstop for true positives the tree misses; a structural article-guard (prose-rescue) suppresses the backstop on genuine articles so text's vocabulary false-positives don't leak. The SERP exemption is applied upstream.
Trustworthy by construction
- export_fusion.py asserts the exported trees reproduce sklearn predict_proba exactly (0.0 diff over the test set).
- test_fusion_parity.mjs asserts the JS interpreter matches the Python reference, closing the chain sklearn ≡ Python ≡ JS.
Validation
Held-out fusion recall beats text-only by about +0.11 at a matched clean-FP rate of about 1.5%. On a live adversarial suite the hybrid correctly blocks real proxies, VPN vendors, casinos, game portals, and adult sites, and keeps Wikipedia articles, news, sex-ed, and educational tools clean, including held-out sites never trained on.
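The hybrid decision order described under "How it works" can be sketched as follows. This is not the code in model.js: the function signature and the 0.5 fusion threshold are assumptions, the 0.8 text threshold borrows the block_threshold value from an earlier commit, and applying the article guard in text-only fallback mode is a guess.

```python
from typing import Optional


def decide(
    fusion_p: Optional[float],
    text_p: float,
    is_article: bool,
    fusion_threshold: float = 0.5,  # assumed value
    text_threshold: float = 0.8,    # block_threshold from an earlier commit
) -> str:
    """Return 'block' or 'clean' for one page.

    fusion_p is None when the tree failed to load (text-only fallback).
    is_article is the output of the structural article guard.
    """
    # 1. Fusion is the primary call.
    if fusion_p is not None and fusion_p >= fusion_threshold:
        return "block"
    # 2. Text backstop, suppressed on pages the guard considers articles.
    if text_p >= text_threshold and not is_article:
        return "block"
    return "clean"
```

For example, `decide(0.1, 0.95, True)` stays clean because the guard suppresses the text backstop, while `decide(None, 0.95, False)` still blocks via the text path when the tree is unavailable.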
Notes
npm test (selftest/popup/detect), node classifier/tests/test_parity.mjs, node classifier/tests/test_fusion_parity.mjs, pytest classifier/tests/: all green; eslint . clean.