GSoC 2026 Module B — Week 2: Stage 1 regex filter + Stage 1.5 sanitize#928

Open
manshusainishab wants to merge 12 commits into OWASP:main from manshusainishab:module_b_w2

Conversation

@manshusainishab


Summary

  • Adds Stage 1 (path-based regex/glob noise filter) and Stage 1.5 (defensive text sanitization vendored from TRACT) of the Module B pipeline.
  • Addresses two of the six CodeRabbit comments on the Week 1 PR (#4 and #5: chunker <pre> tracking and hard-split cursor arithmetic).

Builds on Week 1 (module_b_w1, in review). Once Week 1 lands, this PR will rebase against main automatically.

Part of the GSoC 2026 OpenCRE Scraper & Indexer (Project OIE) Module B (Noise/Relevance Filter).

What this PR adds

Stage 1.5 — text sanitization (commit 77ad228)

| File | Purpose |
| --- | --- |
| application/utils/noise_filter/sanitize.py | sanitize_text(text) + strip_html(text). Defensive cleanup: PDF ligatures (ﬀ → ff), zero-width chars, HTML entities/tags, broken hyphenation, NFC normalization. Idempotent on clean Module A output. |
| application/tests/noise_filter/sanitize_test.py | 26 unittest cases: pipeline rules in isolation, idempotency across input classes, edge cases, public helper. |

Vendored from rocklambros/TRACT (CC0 1.0 Universal), with one deliberate deviation: we do NOT collapse interior whitespace. TRACT's pipeline ends with re.sub(r"\s+", " ", text) which flattens newlines too — that's right for embedding-similarity use cases but destroys structure Module A's contract explicitly preserves (rule 6: whitespace inside code fences). Module B's LLM benefits from paragraph breaks and code-fence layout, so we keep them intact. Also dropped TRACT's max_length/return_full machinery and sanitize_control(dict) helper — both belong in their own domain, not B's.
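
As a rough illustration of the cleanup steps listed above, here is a minimal sketch that keeps interior newlines and spacing intact. It is not the vendored TRACT code: the function name `sanitize_text_sketch`, the ligature table, the step order, and the regexes are all assumptions made for this example, and the tag-stripping regex is deliberately naive.

```python
import html
import re
import unicodedata

# Hypothetical tables and patterns; the vendored module may differ.
_LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
# Explicit escapes keep the invisible characters readable in source.
_ZERO_WIDTH_RE = re.compile(r"[\u200b\u200c\u200d\ufeff]")
# Naive tag matcher: anything between "<" and the next ">".
_TAG_RE = re.compile(r"<[^>]+>")
# A word broken as "exam-\nple" is rejoined as "example".
_HYPHEN_BREAK_RE = re.compile(r"(\w)-\n(\w)")


def sanitize_text_sketch(text: str) -> str:
    """Clean extraction debris without collapsing interior whitespace."""
    text = text.replace("\x00", "")
    text = _ZERO_WIDTH_RE.sub("", text)
    text = html.unescape(text)
    text = _TAG_RE.sub("", text)
    for ligature, plain in _LIGATURES.items():
        text = text.replace(ligature, plain)
    text = _HYPHEN_BREAK_RE.sub(r"\1\2", text)
    text = unicodedata.normalize("NFC", text)
    # Trim only the ends; paragraph breaks and double spaces stay as they are.
    return text.strip()
```

Note that there is no trailing `re.sub(r"\s+", " ", text)` step, which is the deviation from TRACT described above.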

Stage 1 — path-based regex filter (commit 8f5169b)

| File | Purpose |
| --- | --- |
| application/utils/noise_filter/regex_filter.py | RegexFilter class. Public surface: is_noise_path(path), is_noise_record(rec), filter_records(records) lazy generator. |
| application/utils/noise_filter/noise_patterns.yaml | Data-driven blocklist: extensions, filenames, path globs, allow-overrides. Contributors edit this file (not code) to extend coverage. |
| application/tests/noise_filter/regex_filter_test.py | 15 unittest cases including the plan's acceptance criteria: ≥90% rejection on a table of known junk paths, 0% false positives on a table of known security doc paths. |

Conservative-by-design under the recall-first labeling rule (agreed with the maintainer on 2026-06-01, see the Week 1 PR thread). Stage 1 only blocks paths where we're highly confident no security content lives there. We deliberately do NOT block **/blog/**, Website/content/**, or **/sponsors.md even though Week 1's labeling found them ~100% organizational — recall-first says let Stage 2 LLM judge content rather than blocking at the path level. We DO block **/Supporting Resources/meetings/** and **/Supporting Resources/enterprise metrics/** because Week 1 found those 100% organizational across all 25 SAMM samples.
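
The allow-then-deny behaviour can be sketched roughly as follows. This is not the `RegexFilter` implementation: the `RULES` dict stands in for noise_patterns.yaml, its keys and most of its entries are hypothetical, and the helper names are invented for this example. Only the `**/Supporting Resources/meetings/**` glob comes from the description above.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Iterable, Iterator

# Hypothetical rule set; the real rules live in noise_patterns.yaml.
RULES = {
    "deny_extensions": [".png", ".lock"],
    "deny_filenames": ["CODEOWNERS"],
    "deny_paths": ["**/Supporting Resources/meetings/**"],
    "allow_overrides": ["docs/**"],
}


def _glob_match(path: str, pattern: str) -> bool:
    # fnmatch's "*" also matches "/", so "**" spans directories. Trying the
    # path with a leading "/" lets "**/x/**" match a top-level "x/" as well.
    return fnmatchcase(path, pattern) or fnmatchcase("/" + path, pattern)


def is_noise_path(path: str, rules: dict = RULES) -> bool:
    """Return True only when a deny rule matches and no allow rule does."""
    if any(_glob_match(path, p) for p in rules["allow_overrides"]):
        return False
    pure = PurePosixPath(path)
    if pure.suffix.lower() in rules["deny_extensions"]:
        return True
    if pure.name in rules["deny_filenames"]:
        return True
    return any(_glob_match(path, p) for p in rules["deny_paths"])


def filter_paths(paths: Iterable[str], rules: dict = RULES) -> Iterator[str]:
    """Lazily yield the paths that survive the filter."""
    for path in paths:
        if not is_noise_path(path, rules):
            yield path
```

Under these assumed rules, `blog/post.md` passes through to later stages while `Supporting Resources/meetings/2024.md` is dropped, which mirrors the recall-first stance: anything not explicitly denied is kept.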

CodeRabbit Week 1 deferrals (commit 806b0e5)

Addresses two CodeRabbit comments on the Week 1 PR that were deferred to Week 2 to land alongside other harvester work:

  • #4: chunker now tracks <pre> block depth in addition to ``` fences, so a heading-like line inside <pre>...</pre> no longer triggers a false split.
  • #5: _split_chunk_by_size cursor arithmetic now records per-entry separator width (0 for hard-split fragments, 2 for normal \n\n boundaries) instead of unconditionally adding +2 between every consecutive sub-chunk.
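
To make the first fix concrete, here is a small sketch of heading detection that ignores heading-like lines inside both ``` fences and <pre> blocks. It is an illustration only, not the harvester's chunker: `heading_line_numbers` and the regexes are hypothetical names and patterns chosen for this example.

```python
import re

_HEADING_RE = re.compile(r"^#{1,6}\s+\S")
_PRE_OPEN_RE = re.compile(r"<pre\b", re.IGNORECASE)
_PRE_CLOSE_RE = re.compile(r"</pre\s*>", re.IGNORECASE)


def heading_line_numbers(text: str) -> list[int]:
    """Return 0-based indexes of lines that are real markdown headings."""
    headings = []
    in_fence = False
    pre_depth = 0
    for number, line in enumerate(text.split("\n")):
        # A fence marker toggles fence state, unless we are inside <pre>.
        if pre_depth == 0 and line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        was_inside = pre_depth > 0
        opens = len(_PRE_OPEN_RE.findall(line))
        closes = len(_PRE_CLOSE_RE.findall(line))
        pre_depth = max(0, pre_depth + opens - closes)
        # Skip lines that start inside, open, or close a <pre> block.
        if was_inside or pre_depth > 0:
            continue
        if _HEADING_RE.match(line):
            headings.append(number)
    return headings
```

Tracking a depth counter rather than a boolean lets the sketch tolerate nested or repeated <pre> tags on one line.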

Empirical check on the existing Week 1 dataset: 0/100 chunks contain <pre> tags and 0/100 are near the 4000-char hard-split ceiling. So the existing labeled_data.json and candidate_commits.json are unchanged — both old and new chunker produce bit-identical output on our sample. The fixes are forward-looking, protecting future harvests from corner-case files we didn't happen to sample.

Test plan

  • make test: 312 passing, 1 skip, 0 failures, 0 errors (was 271 after Week 1; +41 from Week 2: 15 regex_filter + 26 sanitize).
  • black --check . — clean across the whole repo (164 files unchanged, 0 to reformat). Pinned 24.4.2, matching superlinter.
  • Plan acceptance criteria for Stage 1: ≥90% rejection on table of known junk, 0% false positives on table of known security docs.
  • Smoke tests on the chunker fixes confirm <pre> tracking prevents false splits and hard-split sub-chunks are contiguous (no phantom +2).
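
The "no phantom +2" property can be illustrated with a simplified sketch. It assumes each sub-chunk carries its own separator width, as the fix describes, but the `Piece` type, `split_by_size`, and `offsets` are hypothetical stand-ins and do not reproduce `_split_chunk_by_size` (for instance, this version never merges short paragraphs).

```python
from typing import NamedTuple


class Piece(NamedTuple):
    text: str
    sep_after: int  # 2 for a "\n\n" boundary, 0 after a hard-split fragment


def split_by_size(chunk: str, max_chars: int) -> list[Piece]:
    """Split on blank lines, hard-splitting any paragraph over max_chars."""
    pieces: list[Piece] = []
    paragraphs = chunk.split("\n\n")
    for p_index, paragraph in enumerate(paragraphs):
        fragments = [
            paragraph[i : i + max_chars]
            for i in range(0, len(paragraph), max_chars)
        ] or [""]
        for f_index, fragment in enumerate(fragments):
            last_fragment = f_index == len(fragments) - 1
            last_paragraph = p_index == len(paragraphs) - 1
            # Only a real paragraph boundary consumes the two newline chars.
            sep = 2 if last_fragment and not last_paragraph else 0
            pieces.append(Piece(fragment, sep))
    return pieces


def offsets(pieces: list[Piece], base: int = 0) -> list[tuple[int, int]]:
    """Compute (start, end) character offsets using per-piece separators."""
    spans = []
    cursor = base
    for piece in pieces:
        start = cursor
        end = start + len(piece.text)
        spans.append((start, end))
        cursor = end + piece.sep_after
    return spans
```

With per-piece widths, slicing the original chunk at each computed span returns exactly that piece's text, which is the contiguity the smoke tests check for.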

Notes for reviewers

  • Out of this PR (intentional):
    • Stage 2 LLM classifier wrapping PromptHandler (LiteLLM-backed) — Week 3 deliverable.
    • End-to-end pipeline wiring (Stage 1 → Stage 1.5 → Stage 2) — Week 5 deliverable.
    • KnowledgeQueueItem SQLAlchemy model + Alembic migration — Week 5 deliverable.
  • Branch base: branched off module_b_w1 so that ChangeRecord from Week 1 is available. PR targets main directly; GitHub will rebase once Week 1 lands.
  • Labeling rule (recall-first): under maintainer agreement on 2026-06-01, KNOWLEDGE = any chunk with a security signal; NOISE = only organizational/admin content. Week 2's Stage 1 patterns are aligned with this rule — they err on the side of letting through.

Note: Claude was used for some parts of this PR and its file changes.

manshusainishab and others added 9 commits June 2, 2026 00:47
…contract

Establishes the data contract Module B consumes from Module A. ChangeRecord
is a Pydantic v2 model matching A's actual emission shape: nested source
(discriminated union on type for github/rss), span (chunk position +
heading_path + char/line offsets), and locator (addressing scheme). Internal
models ClassifyResult and QueuePayload prep for later stages.

hashing.py provides normalize_text + compute_content_hash since Module A
does not emit content_hash; B computes its own (SHA-256 of normalized text)
for use as the knowledge_queue dedup key.
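
A minimal sketch of the idea, hashing normalized text with SHA-256, is shown below. The exact normalization rules in hashing.py are not spelled out in this message, so the steps here (NFC, newline unification, collapsing runs of spaces and tabs outside code fences) and the `_sketch` function names are assumptions.

```python
import hashlib
import re
import unicodedata


def normalize_text_sketch(text: str) -> str:
    """Normalize text so cosmetic differences do not change the hash."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    out = []
    in_fence = False
    for line in text.split("\n"):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            out.append(line.rstrip())
        elif in_fence:
            out.append(line)  # keep code-fence whitespace untouched
        else:
            out.append(re.sub(r"[ \t]+", " ", line).strip())
    return "\n".join(out).strip("\n")


def compute_content_hash_sketch(text: str) -> str:
    """Return the hex SHA-256 digest of the normalized text."""
    normalized = normalize_text_sketch(text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Two chunks that differ only in line endings or prose spacing then share a dedup key, while a whitespace change inside a code fence still produces a different hash.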

22 unittest cases cover the round-trip, the discriminated union, hash
determinism, normalization rules, code-fence preservation, and idempotency.
Full make test: 271 passing, no regressions.

Part of GSoC 2026 OpenCRE Scraper & Indexer (Project OIE) Module B.
…ifact

module_a_mock.jsonl: Module A's canonical 20-record mock shared 2026-05-29,
saved as JSONL (one record per line per the contract). Becomes a permanent
integration-test fixture for B's parser and a reference shape for the
Module A contributor.

module_a_contract.schema.json: JSON Schema generated from B's Pydantic
ChangeRecord model via model_json_schema(). 246 lines covering all four
nested types (ChangeRecord, GithubSource, RssSource, Span, Locator).
Source of truth for cross-module CI validation.

Part of GSoC 2026 OpenCRE Scraper & Indexer (Project OIE) Module B.
build_labeled_dataset.py: PyGithub-based harvester that acts as Module A's
stand-in for producing benchmark data. Fetches recent commits from 4 OWASP
repos (WSTG, ASVS, CheatSheetSeries, SAMM), applies the contract's
normalization rules, splits into chunks at markdown heading boundaries
with a fence-aware stack-based walker that tracks heading_path + char/line
offsets, and emits records in Module A's actual nested shape. Pluggable
via GITHUB_TOKEN env var. Reproducible: python scripts/build_labeled_dataset.py
regenerates the candidate set.

label_dataset.py: resumable interactive TUI for manual classification.
Atomic-writes labeled_data.json after every keystroke; lookup by chunk_id
for resume. Embeds the recall-first definition (agreed with maintainer
2026-06-01) so labelers see the rule front-of-mind: KNOWLEDGE for any
chunk with security signal, NOISE only for pure organizational content.
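
One common way to get the atomic-write behaviour described here is to write to a temporary file in the same directory and then swap it into place. The sketch below assumes that approach; `atomic_write_json` is a hypothetical helper and may not match what label_dataset.py does.

```python
import json
import os
import tempfile


def atomic_write_json(path: str, payload: object) -> None:
    """Write JSON so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must share a filesystem with the target for os.replace.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2, ensure_ascii=False)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file, then re-raise the original error.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the process is interrupted mid-write, the previous labeled_data.json is left untouched, which is what makes resuming a labeling session safe.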

candidate_commits.json: 100 records, 25 per repo, all Pydantic-valid
against ChangeRecord. 90/100 have non-empty heading_path; 10 multi-chunk
artifacts captured.

labeled_data.json: 100/100 labeled by hand under the recall-first rule.
Distribution 55 KNOWLEDGE / 40 NOISE / 5 UNCERTAIN. Per-repo skew is
visible: CheatSheetSeries 92% K, SAMM 0% K (the SAMM commits sampled
landed entirely on Website/Sponsorship/meetings paths -- empirical input
for Week 2's noise_patterns.yaml).

Part of GSoC 2026 OpenCRE Scraper & Indexer (Project OIE) Module B.
Super-Linter (Black 24.4.2) flagged 4 files in the previous push.
Applied `black` (same pinned version) to bring them in line with the
repo's formatting standard. Cosmetic changes only: blank lines around
section-separator comments, one multi-line dict join. No behavior or
test changes -- `make test` remains 271 passing, 1 skip.
- Sort __all__ lists in hashing.py and schemas.py to satisfy
  Ruff RUF022.
- Declare JSON Schema dialect ($schema = draft 2020-12, which is
  what Pydantic v2 model_json_schema() emits) on the contract artifact.
- Wrap load_labeled() in scripts/label_dataset.py with try/except so a
  corrupted labeled_data.json prints an actionable hint instead of a
  raw JSONDecodeError stack trace.

Deferred to Week 2 (will be addressed when we touch the harvester):
- chunker should also track <pre> open/close, not just ``` fences
- _split_chunk_by_size cursor arithmetic assumes \\n\\n separator even
  on hard-split sub-chunks

Tests: 271 passing, 1 skip (unchanged). Black: clean.
Defensive text cleanup (PDF ligatures, zero-width chars, HTML, hyphenation).
Vendored from rocklambros/TRACT under CC0; drops their whitespace-collapse
step so structure (newlines, paragraphs) is preserved for Module B's LLM.

26 unit tests, all passing.
Path-based filter with extension/filename/glob deny rules and allow_overrides.
Patterns are deliberately conservative under the recall-first labeling rule.

15 unit tests including >=90% rejection / 0% false-positive acceptance criteria.
… math

Addresses CodeRabbit comments OWASP#4 and OWASP#5 on the Week 1 PR.
@coderabbitai

coderabbitai Bot commented Jun 12, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 9a0e6607-dca7-44b1-9afd-4f8a4a443a9a

📥 Commits

Reviewing files that changed from the base of the PR and between 806b0e5 and 762cbce.

📒 Files selected for processing (4)
  • application/tests/noise_filter/fixtures/module_a_mock.jsonl
  • application/tests/noise_filter/sanitize_test.py
  • application/utils/noise_filter/sanitize.py
  • application/utils/noise_filter/schemas.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • application/utils/noise_filter/sanitize.py
  • application/tests/noise_filter/sanitize_test.py
  • application/tests/noise_filter/fixtures/module_a_mock.jsonl
  • application/utils/noise_filter/schemas.py

Summary by CodeRabbit

Release Notes

  • New Features
    • Added conservative path-based noise filtering with deny/allow precedence and configurable pattern rules.
    • Added text sanitization (HTML/ligature/line-break cleanup) and deterministic content hashing for deduplication.
  • Tests
    • Introduced end-to-end unit tests covering filtering behavior, sanitization edge cases, schema validation, and hashing normalization/determinism.
  • Documentation
    • Added a JSON schema for the Module A data contract used in the pipeline.

Walkthrough

This PR introduces Module B, a noise/relevance filtering system for OpenCRE's scraper pipeline. It defines data contracts (ChangeRecord, Source union), implements text normalization and sanitization, provides path-based filtering via YAML rules, delivers comprehensive test coverage with fixtures, and includes tooling to harvest GitHub documentation and interactively label training datasets.

Changes

Module B: Noise Filtering Pipeline

Layer / File(s) Summary
Data contracts and module setup
application/utils/noise_filter/__init__.py, application/utils/noise_filter/schemas.py
Module-level docstring and Pydantic v2 models define the ChangeRecord contract (chunk_id, artifact_id, text, span, source, locator), discriminated Source union (GithubSource/RssSource keyed by type), Span/Locator metadata, and internal ClassifyResult/QueuePayload models with forward-compatible extra-field handling.
Text normalization and sanitization
application/utils/noise_filter/hashing.py, application/utils/noise_filter/sanitize.py
normalize_text implements Unicode NFC, newline normalization, and selective whitespace collapsing (preserving fenced-code internals) for deterministic SHA-256 content hashing; sanitize_text pipeline chains null-byte removal, Unicode normalization, zero-width stripping, HTML entity decoding/tag stripping, PDF ligature replacement, hyphenation rejoining, and structure-preserving trim.
JSON Schema contract documentation
docs/gsoc_2026_module_b/module_a_contract.schema.json
JSON Schema (draft 2020-12) formally specifies the ChangeRecord contract with discriminated source variants, required fields (schema_version, chunk_id, artifact_id, pipeline_run_id, text, span, source, locator), and structured definitions for all sub-objects including regex patterns for GitHub repo identifiers.
Path-based noise filtering rules and implementation
application/utils/noise_filter/noise_patterns.yaml, application/utils/noise_filter/regex_filter.py
YAML defines tiered deny rules (file extensions, filenames, path globs including test/CI/build directories) and allow-overrides for docs/**; RegexFilter loads patterns at construction, applies deterministic precedence (allow first, then deny by extension/filename/path), returns audit-friendly reason strings, and provides both record-level and lazy generator filtering.
Path filtering behavior tests
application/tests/noise_filter/regex_filter_test.py
Validates RegexFilter construction (default/custom YAML loading, empty-file tolerance), precedence rules (extension/filename/glob precedence and ** multi-depth matching), realistic path acceptance (≥90% junk rejection, zero false positives on known docs), and ChangeRecord-level filtering including lazy generator semantics.
Test fixtures and schema/sanitization validation
application/tests/noise_filter/fixtures/module_a_mock.jsonl, application/tests/noise_filter/schemas_test.py, application/tests/noise_filter/sanitize_test.py
20-record JSONL fixture with OWASP ASVS metadata; schema tests validate ChangeRecord parsing for GitHub/RSS sources, required-field enforcement, forward-compat via extra-field ignoring, discriminated-union construction, and content hash determinism/normalization/code-fence-preservation; sanitization tests cover pipeline stages, idempotency, and edge cases (empty input, blank-only, HTML collapse).
GitHub documentation harvesting script
scripts/build_labeled_dataset.py
Harvests markdown docs from configured OWASP GitHub repos by commit/file, normalizes text (NFC, newlines, fence-aware whitespace), chunks by heading with character/line offsets, deduplicates by chunk_id, emits Module A-shaped records with repo/commit source and span metadata to atomic JSON output with stats.
Interactive dataset labeling tool
scripts/label_dataset.py
CLI/TUI for labeling candidate records by chunk_id as KNOWLEDGE/NOISE/UNCERTAIN/SKIP with optional rationale, displaying formatted source metadata (GitHub/RSS fields, commit/feed URLs), record preview, and running label counts, persisting to atomic JSON with save/quit/skip/re-display keybindings.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 38.55% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title clearly identifies the main deliverables: Stage 1 regex filter and Stage 1.5 sanitize component for Module B Week 2, directly summarizing the primary changes. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, covering the purpose, implementation details, test coverage, and design decisions for both stages being added. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (2)
application/utils/noise_filter/regex_filter.py (1)

40-60: 💤 Low value

Class name uses "Regex" but implementation uses fnmatch glob patterns.

The class is named RegexFilter but the implementation relies on fnmatch for glob-style matching (not regular expressions). The behavior is correctly documented in both the module docstring and noise_patterns.yaml, so this is purely a naming inconsistency. Consider renaming to PatternFilter or GlobFilter in a future refactor if it causes confusion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/utils/noise_filter/regex_filter.py` around lines 40 - 60, The
class name RegexFilter is misleading because the implementation uses glob-style
matching (fnmatch); rename the class to a clearer name (e.g., PatternFilter or
GlobFilter) and update all references/imports and tests accordingly: change the
class declaration (RegexFilter -> PatternFilter), update instantiation sites and
any type annotations that reference RegexFilter (including functions, tests, and
other modules that import it), and ensure exported names (if any) and the
attribute patterns_path, deny_extensions, deny_filenames, deny_paths, and
allow_overrides remain unchanged so behavior and YAML wiring continue to work.
scripts/build_labeled_dataset.py (1)

520-528: 💤 Low value

Consider logging exception type for better debuggability.

The bare Exception catch is pragmatic for this non-critical rate limit display, but logging the exception class name would help diagnose unexpected failures without changing the control flow.

🔧 Proposed improvement
     except Exception as e:
-        print(f"(could not read rate limit: {e})")
+        print(f"(could not read rate limit: {type(e).__name__}: {e})")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/build_labeled_dataset.py` around lines 520 - 528, The except block
that catches Exception when reading the GitHub rate limit (in the try/except
around gh.get_rate_limit(), rl, core, and the print of rate limits) should
include the exception class name in the log message for better debuggability;
update the except clause to format the message with both the exception type
(e.g., using e.__class__.__name__ or type(e).__name__) and the exception message
so the printed line shows the error class and details without changing control
flow.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@application/tests/noise_filter/sanitize_test.py`:
- Line 35: The test contains raw U+200B characters in string literals (e.g., the
variable text in sanitize_test.py) which trigger Ruff PLE2515; replace each raw
zero-width-space character with an explicit Unicode escape (use \u200B) in those
string literals (including the other occurrence around line 105) so the runtime
value is unchanged but the source contains no literal U+200B characters, then
run make lint to verify the file passes.

In `@application/utils/noise_filter/sanitize.py`:
- Line 56: Replace the invisible literal zero‑width characters in the
_ZERO_WIDTH_RE regex with explicit Unicode escape sequences so the pattern is
readable and maintainable; update the _ZERO_WIDTH_RE definition (the re.compile
call for _ZERO_WIDTH_RE in sanitize.py) to use a raw string containing the
specific escapes (e.g. \u200B, \u200C, \u200D, \uFEFF or other needed zero‑width
codepoints) inside the character class, preserving the re.Pattern[str]
annotation and behavior.

In `@application/utils/noise_filter/schemas.py`:
- Around line 84-97: Locator currently requires a path for every kind which
breaks non-path schemes; change Locator into a discriminated union using the
existing discriminator field "kind" (keep the base class Locator as a BaseModel
with model_config and common fields like kind and id but make path
optional/absent there), then add concrete subclasses such as RepoPathLocator
(kind="repo_path") that require path: str and id: str, and other scheme-specific
classes like FeedPostLocator with their own fields; update any type
annotations/usages referring to Locator (e.g., RegexFilter.is_noise_record) to
accept the union type and regenerate
docs/gsoc_2026_module_b/module_a_contract.schema.json so the public schema
reflects the discriminated models.
- Around line 72-78: The Span model currently allows inconsistent values (e.g.,
index == total or end < start); add validation to the model (use Pydantic
`@root_validator` or field `@validator` in the Span class) to enforce: index < total
(0 <= index and index must be strictly less than total), if either
start_char_idx or end_char_idx is set then both must be set and end_char_idx >=
start_char_idx, and similarly for start_line/end_line (both present or both None
and end_line >= start_line). Implement these checks in a single root validator
(e.g., validate_span) that raises ValueError with a clear message when
invariants fail and reference the fields index, total, start_char_idx,
end_char_idx, start_line, end_line.
- Around line 111-117: The model-level setting ChangeRecord.model_config
currently sets str_strip_whitespace=True which trims all str fields (including
text) during validation; remove or disable str_strip_whitespace from the
model_config and instead apply trimming only to specific fields if needed (e.g.,
add a field-specific validator or use a constrained field for any fields that
must be trimmed), ensuring ChangeRecord.text remains unmodified during
validation so offsets/spans stay consistent; update model_config and add a
targeted trim implementation referencing ChangeRecord, model_config, and the
text Field.
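
The Span invariants requested above can be written as a plain function, sketched here without Pydantic so it runs standalone. The field names come from the review comment; `check_span` itself is a hypothetical helper. In a Pydantic v2 model, a check like this would more likely be attached with `@model_validator(mode="after")` than with the v1-style `@root_validator` the comment mentions, though how the PR wires it in is not shown here.

```python
from typing import Optional


def check_span(
    index: int,
    total: int,
    start_char_idx: Optional[int] = None,
    end_char_idx: Optional[int] = None,
    start_line: Optional[int] = None,
    end_line: Optional[int] = None,
) -> None:
    """Raise ValueError when the span fields contradict each other."""
    if not 0 <= index < total:
        raise ValueError(f"index {index} must satisfy 0 <= index < total ({total})")
    pairs = (
        ("char_idx", start_char_idx, end_char_idx),
        ("line", start_line, end_line),
    )
    for name, start, end in pairs:
        # Both ends of a range must be given together or omitted together.
        if (start is None) != (end is None):
            raise ValueError(f"start_{name} and end_{name} must be set together")
        if start is not None and end is not None and end < start:
            raise ValueError(f"end_{name} ({end}) is before start_{name} ({start})")
```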

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 9c395721-0b73-4b11-940a-981425b83662

📥 Commits

Reviewing files that changed from the base of the PR and between bc3a8a3 and 806b0e5.

📒 Files selected for processing (16)
  • application/tests/noise_filter/__init__.py
  • application/tests/noise_filter/fixtures/candidate_commits.json
  • application/tests/noise_filter/fixtures/labeled_data.json
  • application/tests/noise_filter/fixtures/module_a_mock.jsonl
  • application/tests/noise_filter/regex_filter_test.py
  • application/tests/noise_filter/sanitize_test.py
  • application/tests/noise_filter/schemas_test.py
  • application/utils/noise_filter/__init__.py
  • application/utils/noise_filter/hashing.py
  • application/utils/noise_filter/noise_patterns.yaml
  • application/utils/noise_filter/regex_filter.py
  • application/utils/noise_filter/sanitize.py
  • application/utils/noise_filter/schemas.py
  • docs/gsoc_2026_module_b/module_a_contract.schema.json
  • scripts/build_labeled_dataset.py
  • scripts/label_dataset.py

Comment thread application/tests/noise_filter/sanitize_test.py Outdated
Comment thread application/utils/noise_filter/sanitize.py Outdated
Comment thread application/utils/noise_filter/schemas.py
Comment thread application/utils/noise_filter/schemas.py
Comment thread application/utils/noise_filter/schemas.py Outdated