GSoC 2026 Module B — Week 2: Stage 1 regex filter + Stage 1.5 sanitize by manshusainishab · Pull Request #928 · OWASP/OpenCRE

manshusainishab · 2026-06-12T01:16:15Z

Summary

Adds Stage 1 (path-based regex/glob noise filter) and Stage 1.5 (defensive text sanitization vendored from TRACT) of the Module B pipeline.
Addresses two of the six CodeRabbit comments on the Week 1 PR (#4 and #5: chunker `<pre>` tracking and hard-split cursor arithmetic).

Builds on Week 1 (module_b_w1, in review). Once Week 1 lands, this PR will rebase against main automatically.

Part of the GSoC 2026 OpenCRE Scraper & Indexer (Project OIE) Module B (Noise/Relevance Filter).

What this PR adds

Stage 1.5 — text sanitization (commit `77ad228`)

| File | Purpose |
| --- | --- |
| `application/utils/noise_filter/sanitize.py` | `sanitize_text(text)` + `strip_html(text)`. Defensive cleanup: PDF ligatures (`ﬀ → ff`), zero-width chars, HTML entities/tags, broken hyphenation, NFC normalization. Idempotent on clean Module A output. |
| `application/tests/noise_filter/sanitize_test.py` | 26 unittest cases: pipeline rules in isolation, idempotency across input classes, edge cases, public helper. |

Vendored from rocklambros/TRACT (CC0 1.0 Universal), with one deliberate deviation: we do NOT collapse interior whitespace. TRACT's pipeline ends with `re.sub(r"\s+", " ", text)`, which flattens newlines too. That is right for embedding-similarity use cases, but it destroys structure that Module A's contract explicitly preserves (rule 6: whitespace inside code fences). Module B's LLM benefits from paragraph breaks and code-fence layout, so we keep them intact. We also dropped TRACT's `max_length`/`return_full` machinery and the `sanitize_control(dict)` helper; both belong in their own domain, not B's.
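To make the whitespace deviation concrete, here is a minimal sketch of a sanitization pass that follows the stage order described in this PR but skips the final whitespace collapse. It is not the vendored code: `sanitize_sketch`, the ligature table, and the regexes are illustrative assumptions, and the real `sanitize_text` may handle more cases.

```python
import html
import re
import unicodedata

# Hypothetical subset of PDF ligatures; the real table may be larger.
_LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
_ZERO_WIDTH_RE = re.compile(r"[\u200b\u200c\u200d\ufeff]")
_TAG_RE = re.compile(r"<[^>]+>")
# Rejoins a word broken as "sanitiza-\ntion" (lowercase continuation only).
_HYPHEN_BREAK_RE = re.compile(r"(\w)-\n([a-z])")


def sanitize_sketch(text: str) -> str:
    """Clean text without collapsing interior whitespace or newlines."""
    text = text.replace("\x00", "")
    text = unicodedata.normalize("NFC", text)
    text = _ZERO_WIDTH_RE.sub("", text)
    text = _TAG_RE.sub("", html.unescape(text))
    for ligature, plain in _LIGATURES.items():
        text = text.replace(ligature, plain)
    text = _HYPHEN_BREAK_RE.sub(r"\1\2", text)
    # Trim only the outer edges; paragraph breaks and indentation survive.
    return text.strip()


raw = "e\ufb00ect\u200b &amp; <b>sanitiza-\ntion</b>\n\nnext   para"
cleaned = sanitize_sketch(raw)
# A TRACT-style final collapse would flatten the paragraph break.
collapsed = re.sub(r"\s+", " ", cleaned)
```

In this sketch `cleaned` keeps the blank line between the two paragraphs, while `collapsed` joins them into a single line, which is the behavior this PR chooses to avoid.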

Stage 1 — path-based regex filter (commit `8f5169b`)

| File | Purpose |
| --- | --- |
| `application/utils/noise_filter/regex_filter.py` | `RegexFilter` class. Public surface: `is_noise_path(path)`, `is_noise_record(rec)`, `filter_records(records)` lazy generator. |
| `application/utils/noise_filter/noise_patterns.yaml` | Data-driven blocklist: extensions, filenames, path globs, allow-overrides. Contributors edit this file (not code) to extend coverage. |
| `application/tests/noise_filter/regex_filter_test.py` | 15 unittest cases including the plan's acceptance criteria: ≥90% rejection on a table of known junk paths, 0% false positives on a table of known security doc paths. |

Conservative by design under the recall-first labeling rule (agreed with the maintainer on 2026-06-01, see the Week 1 PR thread). Stage 1 only blocks paths where we are highly confident no security content lives. We deliberately do NOT block `**/blog/**`, `Website/content/**`, or `**/sponsors.md`, even though Week 1's labeling found them ~100% organizational: recall-first says to let the Stage 2 LLM judge content rather than block at the path level. We DO block `**/Supporting Resources/meetings/**` and `**/Supporting Resources/enterprise metrics/**` because Week 1 found those 100% organizational across all 25 SAMM samples.
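The precedence described for Stage 1 (allow-overrides first, then deny by extension, filename, and path glob) can be sketched with the standard-library `fnmatch` module. This is an illustration, not the `RegexFilter` implementation: `noise_reason` and the extension and filename lists are hypothetical, and it assumes plain `fnmatch` semantics, where `*` (and therefore `**`) also matches `/`.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical rule set; the real lists live in noise_patterns.yaml.
ALLOW_OVERRIDES = ["docs/**"]
DENY_EXTENSIONS = {".png", ".lock"}
DENY_FILENAMES = {"LICENSE", "CODEOWNERS"}
DENY_PATHS = [
    "**/Supporting Resources/meetings/**",
    "**/Supporting Resources/enterprise metrics/**",
]


def noise_reason(path: str) -> Optional[str]:
    """Return an audit-friendly reason if the path is noise, else None."""
    # Allow-overrides win over every deny rule.
    if any(fnmatchcase(path, pattern) for pattern in ALLOW_OVERRIDES):
        return None
    parsed = PurePosixPath(path)
    suffix = parsed.suffix.lower()
    if suffix in DENY_EXTENSIONS:
        return f"extension:{suffix}"
    if parsed.name in DENY_FILENAMES:
        return f"filename:{parsed.name}"
    for pattern in DENY_PATHS:
        if fnmatchcase(path, pattern):
            return f"path:{pattern}"
    # Anything not confidently noise is passed on to Stage 2.
    return None
```

Under these assumed rules, a blog path such as `Website/content/blog/post.md` returns `None` and reaches the LLM stage, which matches the recall-first intent stated above.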

CodeRabbit Week 1 deferrals (commit `806b0e5`)

Addresses two CodeRabbit comments on the Week 1 PR that were deferred to Week 2 to land alongside other harvester work:

- #4: the chunker now tracks `<pre>` block depth in addition to ``` fences, so a heading-like line inside `<pre>...</pre>` no longer triggers a false split.
- #5: the `_split_chunk_by_size` cursor arithmetic now records a per-entry separator width (0 for hard-split fragments, 2 for normal `\n\n` boundaries) instead of unconditionally adding +2 between every consecutive sub-chunk.
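The `<pre>` tracking can be pictured with a small line walker. The sketch below is an assumption about the general approach, not the harvester's code: `heading_line_numbers` and its regexes are hypothetical, and it ignores edge cases such as a ``` line appearing inside a `<pre>` block.

```python
import re
from typing import List

_HEADING_RE = re.compile(r"^#{1,6}\s+\S")
_PRE_OPEN_RE = re.compile(r"<pre\b", re.IGNORECASE)
_PRE_CLOSE_RE = re.compile(r"</pre\s*>", re.IGNORECASE)


def heading_line_numbers(text: str) -> List[int]:
    """Return 0-based line numbers of headings outside fences and <pre> blocks."""
    in_fence = False
    pre_depth = 0
    headings: List[int] = []
    for number, line in enumerate(text.split("\n")):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        # Decide on the heading before updating depth, so a line is judged
        # by the state it started in.
        if pre_depth == 0 and _HEADING_RE.match(line):
            headings.append(number)
        pre_depth += len(_PRE_OPEN_RE.findall(line))
        pre_depth = max(0, pre_depth - len(_PRE_CLOSE_RE.findall(line)))
    return headings


sample = "# Real\n\n<pre>\n# not a heading\n</pre>\n\n```\n## also not\n```\n\n## Second"
split_points = heading_line_numbers(sample)
```

Only the first and last lines of `sample` are reported as split points; the heading-like lines inside the `<pre>` block and the fence are skipped.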

Empirical check on the existing Week 1 dataset: 0/100 chunks contain `<pre>` tags and 0/100 are near the 4000-char hard-split ceiling, so the existing labeled_data.json and candidate_commits.json are unchanged: the old and new chunkers produce bit-identical output on our sample. The fixes are forward-looking, protecting future harvests from corner-case files we didn't happen to sample.

Test plan

make test — 312 passing, 1 skip, 0 failures, 0 errors (was 271 after Week 1; +41 from Week 2: 15 regex_filter + 26 sanitize).
black --check . — clean across the whole repo (164 files unchanged, 0 to reformat). Pinned 24.4.2, matching superlinter.
Plan acceptance criteria for Stage 1: ≥90% rejection on table of known junk, 0% false positives on table of known security docs.
Smoke tests on the chunker fixes confirm <pre> tracking prevents false splits and hard-split sub-chunks are contiguous (no phantom +2).
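The "no phantom +2" property can be checked with a toy splitter. This sketch is a simplified stand-in for `_split_chunk_by_size`, not the actual function: `split_with_offsets` is a hypothetical name, and it emits one piece per paragraph rather than packing paragraphs up to the size limit.

```python
from typing import List, Tuple


def split_with_offsets(text: str, max_chars: int) -> List[Tuple[int, int]]:
    """Split on blank lines, hard-split oversized paragraphs, return (start, end) spans."""
    # Each entry is (piece, width of the separator that follows it).
    entries: List[Tuple[str, int]] = []
    for paragraph in text.split("\n\n"):
        fragments = [
            paragraph[i : i + max_chars]
            for i in range(0, len(paragraph), max_chars)
        ] or [""]
        for position, fragment in enumerate(fragments):
            is_last = position == len(fragments) - 1
            # Hard-split fragments are contiguous (0); paragraphs are
            # separated by "\n\n" (2).
            entries.append((fragment, 2 if is_last else 0))

    spans: List[Tuple[int, int]] = []
    cursor = 0
    for piece, separator_width in entries:
        spans.append((cursor, cursor + len(piece)))
        cursor += len(piece) + separator_width
    return spans


text = "aaaa\n\n" + "b" * 10 + "\n\ncc"
spans = split_with_offsets(text, max_chars=4)
pieces = [text[start:end] for start, end in spans]
```

Because the separator width is stored per entry, slicing the original text with each span reproduces the piece exactly, including the fragments of the hard-split middle paragraph.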

Notes for reviewers

Out of this PR (intentional):
- Stage 2 LLM classifier wrapping PromptHandler (LiteLLM-backed) — Week 3 deliverable.
- End-to-end pipeline wiring (Stage 1 → Stage 1.5 → Stage 2) — Week 5 deliverable.
- KnowledgeQueueItem SQLAlchemy model + Alembic migration — Week 5 deliverable.
Branch base: branched off module_b_w1 so that ChangeRecord from Week 1 is available. PR targets main directly; GitHub will rebase once Week 1 lands.
Labeling rule (recall-first): under maintainer agreement on 2026-06-01, KNOWLEDGE = any chunk with a security signal; NOISE = only organizational/admin content. Week 2's Stage 1 patterns are aligned with this rule — they err on the side of letting through.

Note: Claude was used for some parts of this PR and its file changes.

…contract Establishes the data contract Module B consumes from Module A. ChangeRecord is a Pydantic v2 model matching A's actual emission shape: nested source (discriminated union on type for github/rss), span (chunk position + heading_path + char/line offsets), and locator (addressing scheme). Internal models ClassifyResult and QueuePayload prep for later stages. hashing.py provides normalize_text + compute_content_hash since Module A does not emit content_hash; B computes its own (SHA-256 of normalized text) for use as the knowledge_queue dedup key. 22 unittest cases cover the round-trip, the discriminated union, hash determinism, normalization rules, code-fence preservation, and idempotency. Full make test: 271 passing, no regressions. Part of GSoC 2026 OpenCRE Scraper & Indexer (Project OIE) Module B.
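As a rough illustration of the dedup key described in this commit message, the sketch below hashes normalized text with SHA-256. The function names and the exact normalization steps are assumptions based on the description (NFC, newline normalization, whitespace collapsing outside code fences); the real `normalize_text` and `compute_content_hash` may differ.

```python
import hashlib
import re
import unicodedata

# The capturing group makes re.split keep the fenced blocks in the result.
_FENCE_SPLIT_RE = re.compile(r"(```.*?```)", re.DOTALL)


def normalize_sketch(text: str) -> str:
    """NFC, unify newlines, collapse spaces/tabs outside fenced code."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    parts = _FENCE_SPLIT_RE.split(text)
    # Even indices are prose; odd indices are fenced blocks left untouched.
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(r"[ \t]+", " ", parts[i])
    return "".join(parts).strip()


def content_hash_sketch(text: str) -> str:
    """SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize_sketch(text).encode("utf-8")).hexdigest()


windows_copy = "Use  TLS\r\n\r\n```\nx  =  1\n```"
unix_copy = "Use TLS\n\n```\nx  =  1\n```"
edited_code = "Use TLS\n\n```\nx = 1\n```"
```

With this normalization, `windows_copy` and `unix_copy` hash to the same key, while `edited_code` gets a different one because whitespace inside the fence is preserved.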

…ifact module_a_mock.jsonl: Module A's canonical 20-record mock shared 2026-05-29, saved as JSONL (one record per line per the contract). Becomes a permanent integration-test fixture for B's parser and a reference shape for the Module A contributor. module_a_contract.schema.json: JSON Schema generated from B's Pydantic ChangeRecord model via model_json_schema(). 246 lines covering ChangeRecord and its four nested types (GithubSource, RssSource, Span, Locator). Source of truth for cross-module CI validation. Part of GSoC 2026 OpenCRE Scraper & Indexer (Project OIE) Module B.

build_labeled_dataset.py: PyGithub-based harvester that acts as Module A's stand-in for producing benchmark data. Fetches recent commits from 4 OWASP repos (WSTG, ASVS, CheatSheetSeries, SAMM), applies the contract's normalization rules, splits into chunks at markdown heading boundaries with a fence-aware stack-based walker that tracks heading_path + char/line offsets, and emits records in Module A's actual nested shape. Pluggable via GITHUB_TOKEN env var. Reproducible: python scripts/build_labeled_dataset.py regenerates the candidate set.

label_dataset.py: resumable interactive TUI for manual classification. Atomic-writes labeled_data.json after every keystroke; lookup by chunk_id for resume. Embeds the recall-first definition (agreed with maintainer 2026-06-01) so labelers see the rule front-of-mind: KNOWLEDGE for any chunk with security signal, NOISE only for pure organizational content.

candidate_commits.json: 100 records, 25 per repo, all Pydantic-valid against ChangeRecord. 90/100 have non-empty heading_path; 10 multi-chunk artifacts captured.

labeled_data.json: 100/100 labeled by hand under the recall-first rule. Distribution 55 KNOWLEDGE / 40 NOISE / 5 UNCERTAIN. Per-repo skew is visible: CheatSheetSeries 92% K, SAMM 0% K (the SAMM commits sampled landed entirely on Website/Sponsorship/meetings paths -- empirical input for Week 2's noise_patterns.yaml).

Part of GSoC 2026 OpenCRE Scraper & Indexer (Project OIE) Module B.
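The atomic write that label_dataset.py relies on is commonly done with a temporary file plus `os.replace`. The helper below is a generic sketch of that pattern under that assumption; `atomic_write_json` is a hypothetical name and the script's actual implementation may differ.

```python
import json
import os
import tempfile
from typing import Any


def atomic_write_json(path: str, data: Any) -> None:
    """Write JSON so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem for os.replace to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2, ensure_ascii=False)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the process is interrupted between keystrokes, the previous labeled_data.json stays intact, which is what makes resuming by chunk_id safe.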

Super-Linter (Black 24.4.2) flagged 4 files in the previous push. Applied `black` (same pinned version) to bring them in line with the repo's formatting standard. Cosmetic changes only: blank lines around section-separator comments, one multi-line dict join. No behavior or test changes -- `make test` remains 271 passing, 1 skip.

- Sort __all__ lists in hashing.py and schemas.py to satisfy Ruff RUF022.
- Declare JSON Schema dialect ($schema = draft 2020-12, which is what Pydantic v2 model_json_schema() emits) on the contract artifact.
- Wrap load_labeled() in scripts/label_dataset.py with try/except so a corrupted labeled_data.json prints an actionable hint instead of a raw JSONDecodeError stack trace.

Deferred to Week 2 (will be addressed when we touch the harvester):
- chunker should also track <pre> open/close, not just ``` fences
- _split_chunk_by_size cursor arithmetic assumes \n\n separator even on hard-split sub-chunks

Tests: 271 passing, 1 skip (unchanged). Black: clean.

Defensive text cleanup (PDF ligatures, zero-width chars, HTML, hyphenation). Vendored from rocklambros/TRACT under CC0; drops their whitespace-collapse step so structure (newlines, paragraphs) is preserved for Module B's LLM. 26 unit tests, all passing.

Path-based filter with extension/filename/glob deny rules and allow_overrides. Patterns are deliberately conservative under the recall-first labeling rule. 15 unit tests including >=90% rejection / 0% false-positive acceptance criteria.

… math Addresses CodeRabbit comments OWASP#4 and OWASP#5 on the Week 1 PR.

coderabbitai · 2026-06-12T01:16:27Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 9a0e6607-dca7-44b1-9afd-4f8a4a443a9a

📥 Commits

Reviewing files that changed from the base of the PR and between 806b0e5 and 762cbce.

📒 Files selected for processing (4)

application/tests/noise_filter/fixtures/module_a_mock.jsonl
application/tests/noise_filter/sanitize_test.py
application/utils/noise_filter/sanitize.py
application/utils/noise_filter/schemas.py

🚧 Files skipped from review as they are similar to previous changes (4)

application/utils/noise_filter/sanitize.py
application/tests/noise_filter/sanitize_test.py
application/tests/noise_filter/fixtures/module_a_mock.jsonl
application/utils/noise_filter/schemas.py

Summary by CodeRabbit

Release Notes

New Features
- Added conservative path-based noise filtering with deny/allow precedence and configurable pattern rules.
- Added text sanitization (HTML/ligature/line-break cleanup) and deterministic content hashing for deduplication.
Tests
- Introduced end-to-end unit tests covering filtering behavior, sanitization edge cases, schema validation, and hashing normalization/determinism.
Documentation
- Added a JSON schema for the Module A data contract used in the pipeline.

Walkthrough

This PR introduces Module B, a noise/relevance filtering system for OpenCRE's scraper pipeline. It defines data contracts (ChangeRecord, Source union), implements text normalization and sanitization, provides path-based filtering via YAML rules, delivers comprehensive test coverage with fixtures, and includes tooling to harvest GitHub documentation and interactively label training datasets.

Changes

Module B: Noise Filtering Pipeline

| Layer / File(s) | Summary |
| --- | --- |
| Data contracts and module setup: `application/utils/noise_filter/__init__.py`, `application/utils/noise_filter/schemas.py` | Module-level docstring and Pydantic v2 models define the ChangeRecord contract (chunk_id, artifact_id, text, span, source, locator), discriminated Source union (GithubSource/RssSource keyed by type), Span/Locator metadata, and internal ClassifyResult/QueuePayload models with forward-compatible extra-field handling. |
| Text normalization and sanitization: `application/utils/noise_filter/hashing.py`, `application/utils/noise_filter/sanitize.py` | `normalize_text` implements Unicode NFC, newline normalization, and selective whitespace collapsing (preserving fenced-code internals) for deterministic SHA-256 content hashing; `sanitize_text` pipeline chains null-byte removal, Unicode normalization, zero-width stripping, HTML entity decoding/tag stripping, PDF ligature replacement, hyphenation rejoining, and structure-preserving trim. |
| JSON Schema contract documentation: `docs/gsoc_2026_module_b/module_a_contract.schema.json` | JSON Schema (draft 2020-12) formally specifies the ChangeRecord contract with discriminated source variants, required fields (schema_version, chunk_id, artifact_id, pipeline_run_id, text, span, source, locator), and structured definitions for all sub-objects including regex patterns for GitHub repo identifiers. |
| Path-based noise filtering rules and implementation: `application/utils/noise_filter/noise_patterns.yaml`, `application/utils/noise_filter/regex_filter.py` | YAML defines tiered deny rules (file extensions, filenames, path globs including test/CI/build directories) and allow-overrides for `docs/**`; RegexFilter loads patterns at construction, applies deterministic precedence (allow first, then deny by extension/filename/path), returns audit-friendly reason strings, and provides both record-level and lazy generator filtering. |
| Path filtering behavior tests: `application/tests/noise_filter/regex_filter_test.py` | Validates RegexFilter construction (default/custom YAML loading, empty-file tolerance), precedence rules (extension/filename/glob precedence and `**` multi-depth matching), realistic path acceptance (≥90% junk rejection, zero false positives on known docs), and ChangeRecord-level filtering including lazy generator semantics. |
| Test fixtures and schema/sanitization validation: `application/tests/noise_filter/fixtures/module_a_mock.jsonl`, `application/tests/noise_filter/schemas_test.py`, `application/tests/noise_filter/sanitize_test.py` | 20-record JSONL fixture with OWASP ASVS metadata; schema tests validate ChangeRecord parsing for GitHub/RSS sources, required-field enforcement, forward-compat via extra-field ignoring, discriminated-union construction, and content hash determinism/normalization/code-fence-preservation; sanitization tests cover pipeline stages, idempotency, and edge cases (empty input, blank-only, HTML collapse). |
| GitHub documentation harvesting script: `scripts/build_labeled_dataset.py` | Harvests markdown docs from configured OWASP GitHub repos by commit/file, normalizes text (NFC, newlines, fence-aware whitespace), chunks by heading with character/line offsets, deduplicates by chunk_id, emits Module A-shaped records with repo/commit source and span metadata to atomic JSON output with stats. |
| Interactive dataset labeling tool: `scripts/label_dataset.py` | CLI/TUI for labeling candidate records by chunk_id as KNOWLEDGE/NOISE/UNCERTAIN/SKIP with optional rationale, displaying formatted source metadata (GitHub/RSS fields, commit/feed URLs), record preview, and running label counts, persisting to atomic JSON with save/quit/skip/re-display keybindings. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 38.55% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title clearly identifies the main deliverables: Stage 1 regex filter and Stage 1.5 sanitize component for Module B Week 2, directly summarizing the primary changes. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, covering the purpose, implementation details, test coverage, and design decisions for both stages being added. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (2)

application/utils/noise_filter/regex_filter.py (1)
40-60: 💤 Low value

Class name uses "Regex" but implementation uses fnmatch glob patterns.

The class is named RegexFilter but the implementation relies on fnmatch for glob-style matching (not regular expressions). The behavior is correctly documented in both the module docstring and noise_patterns.yaml, so this is purely a naming inconsistency. Consider renaming to PatternFilter or GlobFilter in a future refactor if it causes confusion.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/utils/noise_filter/regex_filter.py` around lines 40 - 60, The
class name RegexFilter is misleading because the implementation uses glob-style
matching (fnmatch); rename the class to a clearer name (e.g., PatternFilter or
GlobFilter) and update all references/imports and tests accordingly: change the
class declaration (RegexFilter -> PatternFilter), update instantiation sites and
any type annotations that reference RegexFilter (including functions, tests, and
other modules that import it), and ensure exported names (if any) and the
attribute patterns_path, deny_extensions, deny_filenames, deny_paths, and
allow_overrides remain unchanged so behavior and YAML wiring continue to work.
scripts/build_labeled_dataset.py (1)
520-528: 💤 Low value

Consider logging exception type for better debuggability.

The bare Exception catch is pragmatic for this non-critical rate limit display, but logging the exception class name would help diagnose unexpected failures without changing the control flow.
🔧 Proposed improvement
     except Exception as e:
-        print(f"(could not read rate limit: {e})")
+        print(f"(could not read rate limit: {type(e).__name__}: {e})")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/build_labeled_dataset.py` around lines 520 - 528, The except block
that catches Exception when reading the GitHub rate limit (in the try/except
around gh.get_rate_limit(), rl, core, and the print of rate limits) should
include the exception class name in the log message for better debuggability;
update the except clause to format the message with both the exception type
(e.g., using e.__class__.__name__ or type(e).__name__) and the exception message
so the printed line shows the error class and details without changing control
flow.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@application/tests/noise_filter/sanitize_test.py`:
- Line 35: The test contains raw U+200B characters in string literals (e.g., the
variable text in sanitize_test.py) which trigger Ruff PLE2515; replace each raw
zero-width-space character with an explicit Unicode escape (use \u200B) in those
string literals (including the other occurrence around line 105) so the runtime
value is unchanged but the source contains no literal U+200B characters, then
run make lint to verify the file passes.

In `@application/utils/noise_filter/sanitize.py`:
- Line 56: Replace the invisible literal zero‑width characters in the
_ZERO_WIDTH_RE regex with explicit Unicode escape sequences so the pattern is
readable and maintainable; update the _ZERO_WIDTH_RE definition (the re.compile
call for _ZERO_WIDTH_RE in sanitize.py) to use a raw string containing the
specific escapes (e.g. \u200B, \u200C, \u200D, \uFEFF or other needed zero‑width
codepoints) inside the character class, preserving the re.Pattern[str]
annotation and behavior.
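For reference, a character class written with explicit escapes might look like the sketch below. The set of code points is taken from the examples in the comment above; whether the real `_ZERO_WIDTH_RE` covers more code points is not shown here, and `strip_zero_width` is a hypothetical helper.

```python
import re

# Explicit escapes keep the character class visible in source and in diffs.
_ZERO_WIDTH_RE: re.Pattern[str] = re.compile(r"[\u200B\u200C\u200D\uFEFF]")


def strip_zero_width(text: str) -> str:
    """Remove zero-width space, non-joiner, joiner, and BOM characters."""
    return _ZERO_WIDTH_RE.sub("", text)


# Test inputs can use the same escapes; the runtime value is unchanged.
sample = "a\u200Bb\uFEFFc\u200Dd\u200Ce"
```

The escaped literal and the raw character are the same string at runtime (`"\u200B" == chr(0x200B)`), so switching to escapes changes only how the source reads.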

In `@application/utils/noise_filter/schemas.py`:
- Around line 84-97: Locator currently requires a path for every kind which
breaks non-path schemes; change Locator into a discriminated union using the
existing discriminator field "kind" (keep the base class Locator as a BaseModel
with model_config and common fields like kind and id but make path
optional/absent there), then add concrete subclasses such as RepoPathLocator
(kind="repo_path") that require path: str and id: str, and other scheme-specific
classes like FeedPostLocator with their own fields; update any type
annotations/usages referring to Locator (e.g., RegexFilter.is_noise_record) to
accept the union type and regenerate
docs/gsoc_2026_module_b/module_a_contract.schema.json so the public schema
reflects the discriminated models.
- Around line 72-78: The Span model currently allows inconsistent values (e.g.,
index == total or end < start); add validation to the model (use Pydantic
`@root_validator` or field `@validator` in the Span class) to enforce: index < total
(0 <= index and index must be strictly less than total), if either
start_char_idx or end_char_idx is set then both must be set and end_char_idx >=
start_char_idx, and similarly for start_line/end_line (both present or both None
and end_line >= start_line). Implement these checks in a single root validator
(e.g., validate_span) that raises ValueError with a clear message when
invariants fail and reference the fields index, total, start_char_idx,
end_char_idx, start_line, end_line.
- Around line 111-117: The model-level setting ChangeRecord.model_config
currently sets str_strip_whitespace=True which trims all str fields (including
text) during validation; remove or disable str_strip_whitespace from the
model_config and instead apply trimming only to specific fields if needed (e.g.,
add a field-specific validator or use a constrained field for any fields that
must be trimmed), ensuring ChangeRecord.text remains unmodified during
validation so offsets/spans stay consistent; update model_config and add a
targeted trim implementation referencing ChangeRecord, model_config, and the
text Field.
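The Span invariants listed above can be sketched without any dependency using a dataclass. This is an illustration only: `SpanSketch` and `_check_pair` are hypothetical names. In the actual Pydantic v2 model the same checks would more likely live in a `@model_validator(mode="after")` method, since `root_validator` and `validator` are the deprecated v1-style decorators.

```python
from dataclasses import dataclass
from typing import Optional


def _check_pair(name: str, start: Optional[int], end: Optional[int]) -> None:
    """Both values must be set or both None, and end must not precede start."""
    if (start is None) != (end is None):
        raise ValueError(f"start_{name} and end_{name} must be set together")
    if start is not None and end is not None and end < start:
        raise ValueError(f"end_{name} ({end}) is before start_{name} ({start})")


@dataclass(frozen=True)
class SpanSketch:
    index: int
    total: int
    start_char_idx: Optional[int] = None
    end_char_idx: Optional[int] = None
    start_line: Optional[int] = None
    end_line: Optional[int] = None

    def __post_init__(self) -> None:
        if not 0 <= self.index < self.total:
            raise ValueError(
                f"index must satisfy 0 <= index < total, got {self.index}/{self.total}"
            )
        _check_pair("char_idx", self.start_char_idx, self.end_char_idx)
        _check_pair("line", self.start_line, self.end_line)
```

A span such as `SpanSketch(index=1, total=1)` or one with `end_char_idx` smaller than `start_char_idx` raises `ValueError` at construction, so inconsistent offsets never reach later stages.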

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 9c395721-0b73-4b11-940a-981425b83662

📥 Commits

Reviewing files that changed from the base of the PR and between bc3a8a3 and 806b0e5.

📒 Files selected for processing (16)

application/tests/noise_filter/__init__.py
application/tests/noise_filter/fixtures/candidate_commits.json
application/tests/noise_filter/fixtures/labeled_data.json
application/tests/noise_filter/fixtures/module_a_mock.jsonl
application/tests/noise_filter/regex_filter_test.py
application/tests/noise_filter/sanitize_test.py
application/tests/noise_filter/schemas_test.py
application/utils/noise_filter/__init__.py
application/utils/noise_filter/hashing.py
application/utils/noise_filter/noise_patterns.yaml
application/utils/noise_filter/regex_filter.py
application/utils/noise_filter/sanitize.py
application/utils/noise_filter/schemas.py
docs/gsoc_2026_module_b/module_a_contract.schema.json
scripts/build_labeled_dataset.py
scripts/label_dataset.py

manshusainishab and others added 9 commits June 2, 2026 00:47

Merge branch 'main' into module_b_w1

8cf4dd1

fix(module-b): chunker tracks <pre> blocks; correct hard-split cursor…

806b0e5

… math Addresses CodeRabbit comments OWASP#4 and OWASP#5 on the Week 1 PR.

Merge branch 'main' into module_b_w2

85a4380

coderabbitai Bot reviewed Jun 12, 2026

manshusainishab and others added 2 commits June 15, 2026 15:48

chore(module-b): address CodeRabbit Week 2 review comments

762cbce

Merge branch 'main' into module_b_w2

d5398fa

manshusainishab wants to merge 12 commits into OWASP:main from manshusainishab:module_b_w2
