From bffb5e450499b9308b785f26257b24134094fe5a Mon Sep 17 00:00:00 2001
From: Arnesh Banerjee
Date: Sat, 6 Jun 2026 20:30:46 +0530
Subject: [PATCH 1/2] docs: cover GitHub, Stack Overflow, identifier extraction, link-following, and CI

The README still described the pipeline as Reddit + HN only and didn't
mention the deterministic email/handle extractor, the website crawler,
the commit author email path, or the CI workflow.

Acquisition section now lists all four sources and the shallow website
follower. Feature extraction documents both the LLM pass and the
deterministic regex pass (12 social platforms + email deobfuscation).
Output properties calls out the new direct-identifiers block. Usage
examples cover --github, --so, the four-platform combined run, and the
claude-code provider. Limitations honestly notes the 90-day events cap,
the SPA scraping gap, and the noreply email filter. New CI section
points at .github/workflows/ci.yml.
---
 README.md | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 63 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 6b111ad..d6c6059 100644
--- a/README.md
+++ b/README.md
@@ -37,22 +37,50 @@ containing:
 1. Acquisition
    - Reddit artifacts from [Arctic Shift API](https://arctic-shift.photon-reddit.com)
    - Hacker News artifacts from [HN Algolia Search API](https://hn.algolia.com/api)
+   - GitHub profile fields + public events (commits, issues, PRs, review
+     comments) via the [GitHub REST API](https://docs.github.com/en/rest);
+     commit author name and email from `PushEvent` payloads are folded in
+     inline. Optional `GITHUB_TOKEN` raises the rate limit.
+   - Stack Overflow answers, questions, comments, and profile fields via
+     the [Stack Exchange API v2.3](https://api.stackexchange.com)
+   - Shallow link-follower for any external website declared in a GitHub
+     or Stack Overflow profile: fetches the root page, then up to 5
+     same-origin sub-paths prioritized by identity-shaped routes
+     (`/about`, `/cv`, `/resume`, `/contact`, `/bio`, `/me`,
+     `/portfolio`, …). Preserves `mailto:` and `http(s)://` href values
+     before HTML stripping so contact emails behind a link survive.
 2. Canonicalization
    - Heterogeneous source records mapped into a unified item schema
    - Temporal and textual normalization for bounded-context inference
 3. Feature extraction and attribution
-   - Detection of location, affiliation, temporal routine, self-disclosed
-     demographics, cross-platform handles, external URLs, and stylometric cues
-   - Attribution binding from claim to quote-level evidence and permalink
+   - LLM pass: detection of location, affiliation, temporal routine,
+     self-disclosed demographics, cross-platform handles, external URLs,
+     and stylometric cues, with attribution binding from each claim to
+     quote-level evidence and permalink
+   - Deterministic regex pass that runs in parallel and bypasses the
+     model: extracts emails (with `[at]` / `[dot]` obfuscation handling)
+     and cross-platform social handles for LinkedIn, Twitter/X, GitHub,
+     YouTube, Instagram, Bluesky, Reddit, Hacker News, Telegram, GitLab,
+     Stack Overflow, and Mastodon from URL patterns in the corpus.
+     False-positive paths like `twitter.com/home` are filtered and the
+     audited account itself is excluded.
 4. Risk synthesis
    - Confidence-calibrated findings: low, medium, high
    - Explicit exact-user section and public proof URL set
+   - Direct-identifier block (emails + discovered handles) rendered
+     before the LLM findings, so concrete leaks always appear regardless
+     of how the model chose to summarize them
    - Finding-level remediation recommendations
 
 ## Output properties
 
-- Human-readable report with ranked findings and rationale
-- JSON serialization for longitudinal tracking and downstream analytics
+- Human-readable report with ranked findings and rationale, grouped by
+  confidence (high → medium → low)
+- Dedicated `direct identifiers extracted` block surfacing emails and
+  cross-platform handles found by the deterministic regex pass
+- JSON serialization for longitudinal tracking and downstream analytics;
+  `AuditResult.directIdentifiers` exposes the raw email + social handle
+  hits alongside the model findings
 - Optional strict validation: fail if no external proof URL exists beyond
   audited platform profile endpoints
 
@@ -130,6 +158,19 @@ npm run audit -- my_reddit_handle --hn my_hn_handle
 # Hacker News only
 npm run audit -- --hn my_hn_handle
 
+# GitHub only (also follows the linked website + sub-pages)
+npm run audit -- --github my_gh_handle
+
+# Stack Overflow only (accepts numeric user_id or profile URL)
+npm run audit -- --so 1234567
+
+# All four platforms at once — cross-platform handle correlation is the
+# strongest signal the analyzer can flag
+npm run audit -- my_reddit_handle --hn my_hn_handle --github my_gh_handle --so 1234567
+
+# Audit through the Claude Code CLI (no API key needed)
+npm run audit -- my_reddit_handle --provider claude-code
+
 # JSON output
 npm run audit -- my_reddit_handle --json -o report.json
 
@@ -178,9 +219,26 @@ npm run audit -- my_reddit_handle --provider openai --model gpt-4o-mini
 npm run build
 ```
 
+## Continuous integration
+
+A GitHub Actions workflow at `.github/workflows/ci.yml` runs `npm run
+build`, `npm test`, `npm run format:check`, and `npm run lint` on every
+push and pull request, against the latest LTS Node release.
+
 ## Limitations
 
 - Findings are probabilistic and should not be interpreted as identity proof
 - Recall is upper-bounded by source completeness and truncation constraints
 - Stylometric separability is population- and domain-dependent
 - Confidence calibration depends on evidence density and artifact quality
+- GitHub's public events feed is capped at roughly 300 events from the
+  last 90 days, so commit author emails that only appear in older
+  history won't be picked up unless you supply `GITHUB_TOKEN` and walk
+  repos directly (not yet implemented)
+- The website link-follower is single-hop with same-origin sub-page
+  expansion; JavaScript-rendered SPAs (Next.js client-rendered, Notion
+  exports, etc.) return mostly empty bodies because there is no headless
+  browser in the pipeline
+- `@users.noreply.github.com` addresses are filtered out of the direct
+  identifier extractor since they are the privacy-preserving default
+  rather than a leak
From ecc6d4c3d57c24313d98ae0a8e7f8230a630d7fc Mon Sep 17 00:00:00 2001
From: Arnesh Banerjee
Date: Sat, 6 Jun 2026 20:32:17 +0530
Subject: [PATCH 2/2] docs: correct CI section (matrix + typecheck step)

---
 README.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index d6c6059..2e1db53 100644
--- a/README.md
+++ b/README.md
@@ -222,8 +222,9 @@ npm run build
 ## Continuous integration
 
 A GitHub Actions workflow at `.github/workflows/ci.yml` runs `npm run
-build`, `npm test`, `npm run format:check`, and `npm run lint` on every
-push and pull request, against the latest LTS Node release.
+lint`, `npm run format:check`, `tsc --noEmit`, `npm test`, and `npm run
+build` on every push and pull request against `main`, across a Node 20 /
+22 / 24 matrix.
 
 ## Limitations
 
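
Reviewer note: the README text in this series describes a deterministic regex pass that handles `[at]` / `[dot]` obfuscation and drops `@users.noreply.github.com` addresses, but the patches do not show the extractor itself. The TypeScript below is only a minimal sketch of what such a pass might look like. The function names (`deobfuscate`, `extractEmails`) and the regular expressions are assumptions made for illustration, not the project's actual code.

```typescript
// Hypothetical sketch: names and patterns are illustrative assumptions,
// not taken from the audited project's source.
const NOREPLY_SUFFIX = "@users.noreply.github.com";

// Rewrite bracketed "[at]" / "(at)" and "[dot]" / "(dot)" tokens, plus any
// surrounding whitespace, into plain "@" and "." characters.
function deobfuscate(text: string): string {
  return text
    .replace(/\s*[\[(]\s*at\s*[\])]\s*/gi, "@")
    .replace(/\s*[\[(]\s*dot\s*[\])]\s*/gi, ".");
}

// Return the unique, lowercased email addresses found in the text,
// excluding GitHub's privacy-preserving noreply addresses.
function extractEmails(text: string): string[] {
  const pattern = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
  const hits = deobfuscate(text).match(pattern) ?? [];
  const unique = Array.from(new Set(hits.map((e) => e.toLowerCase())));
  return unique.filter((e) => !e.endsWith(NOREPLY_SUFFIX));
}

const sample =
  "Reach me at jane [at] example [dot] com or 123+jane@users.noreply.github.com";
console.log(extractEmails(sample));
```

Only bracketed tokens are rewritten, so an ordinary word such as "at" in running prose is left untouched. A production extractor would presumably need more obfuscation variants and stricter domain validation than this sketch attempts.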