docs: cover all sources and the deterministic extractor in the README #21
Merged
added 2 commits
June 6, 2026 20:30
…wing, and CI
The README still described the pipeline as Reddit + HN only and didn't mention the deterministic email/handle extractor, the website crawler, the commit author email path, or the CI workflow. Acquisition section now lists all four sources and the shallow website follower. Feature extraction documents both the LLM pass and the deterministic regex pass (12 social platforms + email deobfuscation). Output properties calls out the new direct-identifiers block. Usage examples cover --github, --so, the four-platform combined run, and the claude-code provider. Limitations honestly notes the 90-day events cap, the SPA scraping gap, and the noreply email filter. New CI section points at .github/workflows/ci.yml.
@ni5arga Noticed the README didn't cover a few things I added in yesterday's PR. Specifically, it was missing documentation on the GitHub source, email fetching, and the shallow website crawler I implemented.
The README still described the pipeline as Reddit and Hacker News only, even though main now ships the GitHub and Stack Overflow sources, the shallow website crawler, the deterministic email/handle extractor, the commit author email path, the claude-code provider, and the CI workflow. This update walks through each of those.
Acquisition step now lists all four sources plus the shallow website follower, with notes on the GITHUB_TOKEN rate limit and the mailto preservation that lets contact emails behind link tags survive HTML stripping. Feature extraction documents both the LLM pass and the parallel regex pass (LinkedIn, Twitter/X, GitHub, YouTube, Instagram, Bluesky, Reddit, Hacker News, Telegram, GitLab, Stack Overflow, Mastodon) with the obfuscation handling for emails. Output properties calls out the new direct-identifiers block. Usage examples now cover --github, --so, the four-platform combined run, and the claude-code provider. A short CI section points at the workflow added in df0fa26.
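For reviewers who want a feel for what the deterministic pass does, here is a minimal sketch, not the code in this repo. The pattern table, the `deobfuscate` helper, and the `extract_identifiers` function are all hypothetical names, and only two of the twelve platforms are shown. It assumes obfuscated emails use bracketed `[at]` / `(dot)` style markers; the real extractor may handle more variants.

```python
import re

# Hypothetical pattern table; the real extractor covers 12 platforms.
HANDLE_PATTERNS = {
    "github": re.compile(r"\bgithub\.com/([A-Za-z0-9-]{1,39})", re.I),
    "twitter": re.compile(r"\b(?:twitter|x)\.com/([A-Za-z0-9_]{1,15})", re.I),
}

# Deliberately simple email pattern; not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def deobfuscate(text):
    """Rewrite bracketed 'at' / 'dot' markers back to '@' and '.'."""
    text = re.sub(r"\s*[\[({]\s*at\s*[\])}]\s*", "@", text, flags=re.I)
    text = re.sub(r"\s*[\[({]\s*dot\s*[\])}]\s*", ".", text, flags=re.I)
    return text


def extract_identifiers(text):
    """Return deduplicated emails and per-platform handles found in text."""
    clean = deobfuscate(text)
    found = {"emails": sorted({m.lower() for m in EMAIL_RE.findall(clean)})}
    for platform, pattern in HANDLE_PATTERNS.items():
        found[platform] = sorted(set(pattern.findall(clean)))
    return found


sample = "Reach me at jane [at] example [dot] com or https://github.com/jane-doe"
print(extract_identifiers(sample))
```

Because this pass is plain regex matching, the same input always yields the same identifiers, which is the property that distinguishes it from the LLM pass.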
Limitations honestly mentions the things that bit during testing: the 90-day events cap on GitHub, JS-rendered SPA pages returning empty bodies, and the noreply email filter.
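To illustrate the noreply limitation, the sketch below shows one plausible way such a filter could work. The function names `is_noreply` and `usable_commit_emails` are hypothetical, and the exact rule used in this repo is not shown in this PR; the only fact assumed is that GitHub's privacy addresses live under `users.noreply.github.com`.

```python
def is_noreply(email):
    """Heuristic: True for GitHub privacy addresses and generic noreply mailboxes."""
    local, _, domain = email.lower().rpartition("@")
    return domain.endswith("noreply.github.com") or local in {"noreply", "no-reply"}


def usable_commit_emails(emails):
    """Keep the first occurrence of each non-noreply address, lowercased."""
    kept = []
    for raw in emails:
        email = raw.strip().lower()
        if email and not is_noreply(email) and email not in kept:
            kept.append(email)
    return kept


print(usable_commit_emails([
    "Jane@Example.com",
    "12345+jane@users.noreply.github.com",
    "jane@example.com",
]))
```

The practical consequence is that a user who commits only with GitHub's privacy address yields no email from the commit author path.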
Docs only, no code changes.