Merged
69 changes: 64 additions & 5 deletions README.md
@@ -37,22 +37,50 @@ containing:
1. Acquisition
- Reddit artifacts from [Arctic Shift API](https://arctic-shift.photon-reddit.com)
- Hacker News artifacts from [HN Algolia Search API](https://hn.algolia.com/api)
- GitHub profile fields + public events (commits, issues, PRs, review
comments) via the [GitHub REST API](https://docs.github.com/en/rest);
commit author name and email from `PushEvent` payloads are folded in
inline. Optional `GITHUB_TOKEN` raises the rate limit.
- Stack Overflow answers, questions, comments, and profile fields via
the [Stack Exchange API v2.3](https://api.stackexchange.com)
- Shallow link-follower for any external website declared in a GitHub
or Stack Overflow profile: fetches the root page, then up to 5
same-origin sub-paths prioritized by identity-shaped routes
(`/about`, `/cv`, `/resume`, `/contact`, `/bio`, `/me`,
`/portfolio`, …). Preserves `mailto:` and `http(s)://` href values
before HTML stripping so contact emails behind a link survive.
2. Canonicalization
- Heterogeneous source records mapped into a unified item schema
- Temporal and textual normalization for bounded-context inference
3. Feature extraction and attribution
- LLM pass: detection of location, affiliation, temporal routine,
self-disclosed demographics, cross-platform handles, external URLs,
and stylometric cues, with attribution binding from each claim to
quote-level evidence and permalink
- Deterministic regex pass that runs in parallel and bypasses the
model: extracts emails (with `[at]` / `[dot]` obfuscation handling)
and cross-platform social handles for LinkedIn, Twitter/X, GitHub,
YouTube, Instagram, Bluesky, Reddit, Hacker News, Telegram, GitLab,
Stack Overflow, and Mastodon from URL patterns in the corpus.
False-positive paths like `twitter.com/home` are filtered and the
audited account itself is excluded.
4. Risk synthesis
- Confidence-calibrated findings: low, medium, high
- Explicit exact-user section and public proof URL set
- Direct-identifier block (emails + discovered handles) rendered
before the LLM findings, so concrete leaks always appear regardless
of how the model chose to summarize them
- Finding-level remediation recommendations
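
The deterministic regex pass described in step 3 can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `deobfuscate`, `extractDirectIdentifiers`, `HANDLE_PATTERNS`, and `RESERVED` are hypothetical, only two of the listed platforms are shown, and the patterns are deliberately simplified.

```typescript
// Hypothetical sketch of a deterministic identifier pass. All names and
// patterns here are illustrative assumptions, not the project's real code.

interface DirectIdentifiers {
  emails: string[];
  handles: { platform: string; handle: string }[];
}

// Rewrite "[at]" / "(at)" and "[dot]" / "(dot)" back into "@" and ".".
function deobfuscate(text: string): string {
  return text
    .replace(/\s*[\[(]\s*at\s*[\])]\s*/gi, "@")
    .replace(/\s*[\[(]\s*dot\s*[\])]\s*/gi, ".");
}

const EMAIL_RE = /[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+/gi;

// Two platforms only; a fuller table would cover the rest of the list.
const HANDLE_PATTERNS: { platform: string; re: RegExp }[] = [
  { platform: "twitter", re: /\b(?:twitter|x)\.com\/([A-Za-z0-9_]{1,15})\b/g },
  { platform: "github", re: /\bgithub\.com\/([A-Za-z0-9-]{1,39})\b/g },
];

// Path segments that look like handles but are site routes (assumed list).
const RESERVED = new Set(["home", "share", "intent", "search", "login"]);

function extractDirectIdentifiers(
  corpus: string,
  auditedHandle: string,
): DirectIdentifiers {
  const emails = new Set<string>();
  for (const m of deobfuscate(corpus).matchAll(EMAIL_RE)) {
    const email = m[0].toLowerCase();
    // GitHub noreply addresses are the privacy-preserving default, not a leak.
    if (!email.endsWith("@users.noreply.github.com")) emails.add(email);
  }

  const seen = new Set<string>();
  const handles: { platform: string; handle: string }[] = [];
  for (const { platform, re } of HANDLE_PATTERNS) {
    for (const m of corpus.matchAll(re)) {
      const handle = m[1] ?? "";
      const lower = handle.toLowerCase();
      if (lower === "" || RESERVED.has(lower)) continue;
      // Simplification: drop the audited handle on every platform.
      if (lower === auditedHandle.toLowerCase()) continue;
      const key = `${platform}:${lower}`;
      if (seen.has(key)) continue;
      seen.add(key);
      handles.push({ platform, handle });
    }
  }
  return { emails: Array.from(emails), handles };
}
```

Because this pass never calls the model, its output is reproducible for a given corpus, which is what lets the direct-identifier block be rendered independently of the LLM findings.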

## Output properties

- Human-readable report with ranked findings and rationale, grouped by
confidence (high → medium → low)
- Dedicated `direct identifiers extracted` block surfacing emails and
cross-platform handles found by the deterministic regex pass
- JSON serialization for longitudinal tracking and downstream analytics;
`AuditResult.directIdentifiers` exposes the raw email + social handle
hits alongside the model findings
- Optional strict validation: fail if no external proof URL exists beyond
audited platform profile endpoints
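
As a rough illustration of the ordering and strict-validation properties above, the sketch below sorts findings high → medium → low and checks for at least one proof URL outside the audited platforms. The `Finding` shape, the `PLATFORM_HOSTS` list, and both function names are assumptions made for this example; the README does not specify them, and treating every URL on a platform host as non-external is only one possible reading of "audited platform profile endpoints".

```typescript
// Hypothetical types and helpers; not the project's actual API.
type Confidence = "high" | "medium" | "low";

interface Finding {
  claim: string;
  confidence: Confidence;
  proofUrls: string[];
}

const RANK: Record<Confidence, number> = { high: 0, medium: 1, low: 2 };

// Return a copy ordered high -> medium -> low (sort is stable in ES2019+).
function orderByConfidence(findings: Finding[]): Finding[] {
  return [...findings].sort((a, b) => RANK[a.confidence] - RANK[b.confidence]);
}

// Assumed set of hosts belonging to the audited platforms.
const PLATFORM_HOSTS = [
  "reddit.com",
  "news.ycombinator.com",
  "github.com",
  "stackoverflow.com",
];

function isExternalUrl(url: string): boolean {
  let host: string;
  try {
    host = new URL(url).hostname.toLowerCase();
  } catch {
    return false; // unparseable URLs never count as proof
  }
  return !PLATFORM_HOSTS.some((p) => host === p || host.endsWith(`.${p}`));
}

// Strict mode: true only if some finding cites a URL off the audited platforms.
function hasExternalProof(findings: Finding[]): boolean {
  return findings.some((f) => f.proofUrls.some(isExternalUrl));
}
```

A caller in strict mode could exit non-zero when `hasExternalProof` returns `false`, so that a report backed only by the audited profiles is treated as a failure.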

@@ -130,6 +158,19 @@ npm run audit -- my_reddit_handle --hn my_hn_handle
# Hacker News only
npm run audit -- --hn my_hn_handle

# GitHub only (also follows the linked website + sub-pages)
npm run audit -- --github my_gh_handle

# Stack Overflow only (accepts numeric user_id or profile URL)
npm run audit -- --so 1234567

# All four platforms at once — cross-platform handle correlation is the
# strongest signal the analyzer can flag
npm run audit -- my_reddit_handle --hn my_hn_handle --github my_gh_handle --so 1234567

# Audit through the Claude Code CLI (no API key needed)
npm run audit -- my_reddit_handle --provider claude-code

# JSON output
npm run audit -- my_reddit_handle --json -o report.json

@@ -178,9 +219,27 @@ npm run audit -- my_reddit_handle --provider openai --model gpt-4o-mini
npm run build
```

## Continuous integration

A GitHub Actions workflow at `.github/workflows/ci.yml` runs `npm run
lint`, `npm run format:check`, `tsc --noEmit`, `npm test`, and `npm run
build` on every push and pull request against `main`, across a Node 20 /
22 / 24 matrix.

## Limitations

- Findings are probabilistic and should not be interpreted as identity proof
- Recall is upper-bounded by source completeness and truncation constraints
- Stylometric separability is population- and domain-dependent
- Confidence calibration depends on evidence density and artifact quality
- GitHub's public events feed is capped at roughly 300 events from the
last 90 days, so commit author emails that appear only in older
history are not picked up. Recovering them would require walking
repositories directly with `GITHUB_TOKEN`, which is not yet implemented
- The website link-follower is single-hop with same-origin sub-page
expansion; JavaScript-rendered SPAs (Next.js client-rendered, Notion
exports, etc.) return mostly empty bodies because there is no headless
browser in the pipeline
- `@users.noreply.github.com` addresses are filtered out of the direct
identifier extractor since they are the privacy-preserving default
rather than a leak
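
To make the "single-hop with same-origin sub-page expansion" behaviour concrete, here is a minimal sketch of how sub-paths might be selected once the root page's `href` values have been collected. The function name `pickSubPaths`, the constants, and the tie-breaking order are assumptions for illustration; only the route list and the limit of 5 come from the README. No fetching is done here, so the JavaScript-rendering limitation is unaffected.

```typescript
// Hypothetical selection step for the link-follower; illustrative only.
const IDENTITY_ROUTES = [
  "/about", "/cv", "/resume", "/contact", "/bio", "/me", "/portfolio",
];
const MAX_SUBPATHS = 5;

function normalizePath(pathname: string): string {
  // Drop trailing slashes so "/about/" and "/about" are the same page.
  return pathname.replace(/\/+$/, "") || "/";
}

function pickSubPaths(rootUrl: string, hrefs: string[]): string[] {
  const root = new URL(rootUrl);
  const rootPath = normalizePath(root.pathname);
  const candidates = new Map<string, number>(); // absolute URL -> priority

  for (const href of hrefs) {
    let u: URL;
    try {
      u = new URL(href, root); // resolves relative links against the root
    } catch {
      continue;
    }
    // Same-origin only; this also drops mailto: links (origin "null").
    if (u.origin !== root.origin) continue;
    const path = normalizePath(u.pathname);
    if (path === rootPath) continue; // root page was already fetched

    const lower = path.toLowerCase();
    const idx = IDENTITY_ROUTES.findIndex(
      (r) => lower === r || lower.startsWith(r + "/"),
    );
    // Non-identity routes sort after every identity-shaped route.
    const priority = idx === -1 ? IDENTITY_ROUTES.length : idx;
    const key = u.origin + path;
    if (!candidates.has(key)) candidates.set(key, priority);
  }

  return Array.from(candidates.entries())
    .sort((a, b) => a[1] - b[1])
    .slice(0, MAX_SUBPATHS)
    .map(([url]) => url);
}
```

Since selection stops at one hop, links found on the chosen sub-pages are not followed further in this sketch.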