docs: cover all sources and the deterministic extractor in the README#21

Merged
ni5arga merged 2 commits into
ni5arga:mainfrom
ArneshBanerjee:docs/cover-all-sources
Jun 6, 2026

@ArneshBanerjee

Contributor

The README still described the pipeline as Reddit and Hacker News only, even though main now ships the GitHub and Stack Overflow sources, the shallow website crawler, the deterministic email/handle extractor, the commit author email path, the claude-code provider, and the CI workflow. This update walks through each of those.

Acquisition step now lists all four sources plus the shallow website follower, with notes on the GITHUB_TOKEN rate limit and the mailto preservation that lets contact emails behind link tags survive HTML stripping. Feature extraction documents both the LLM pass and the parallel regex pass (LinkedIn, Twitter/X, GitHub, YouTube, Instagram, Bluesky, Reddit, Hacker News, Telegram, GitLab, Stack Overflow, Mastodon) with the obfuscation handling for emails. Output properties calls out the new direct-identifiers block. Usage examples now cover --github, --so, the four-platform combined run, and the claude-code provider. A short CI section points at the workflow added in df0fa26.
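To illustrate the mailto preservation and email deobfuscation described above, here is a minimal sketch. It is not the project's actual code: the class and function names (`MailtoPreservingStripper`, `strip_html`, `extract_emails`) and the specific obfuscation patterns handled (`[at]`, `(dot)` and bracket variants) are assumptions made for the example.

```python
import re
from html.parser import HTMLParser


class MailtoPreservingStripper(HTMLParser):
    """Hypothetical stripper: drops tags but keeps addresses from mailto: links."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().startswith("mailto:"):
            # Drop the scheme and any "?subject=..." query string.
            address = href[len("mailto:"):].split("?", 1)[0]
            self.parts.append(" " + address + " ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the visible text of `html`, plus any mailto: addresses."""
    parser = MailtoPreservingStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Assumed obfuscation forms: "[at]", "(at)", "{at}" and the same for "dot".
_AT = re.compile(r"\s*[\[\(\{]\s*at\s*[\]\)\}]\s*", re.IGNORECASE)
_DOT = re.compile(r"\s*[\[\(\{]\s*dot\s*[\]\)\}]\s*", re.IGNORECASE)
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(text):
    """Deobfuscate, then return unique lowercased emails in order of appearance."""
    normalized = _DOT.sub(".", _AT.sub("@", text))
    found = []
    for match in _EMAIL.findall(normalized):
        email = match.lower()
        if email not in found:
            found.append(email)
    return found


html = '<p>Contact <a href="mailto:Jane@Example.com?subject=hi">me</a> or bob [at] example (dot) org</p>'
print(extract_emails(strip_html(html)))
# → ['jane@example.com', 'bob@example.org']
```

Without the `handle_starttag` hook, the first address would be lost, because it appears only in the `href` attribute and not in the visible link text.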

The Limitations section notes the issues that came up during testing: the 90-day events cap on GitHub, JS-rendered SPA pages returning empty bodies, and the noreply email filter.
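As a rough illustration of what a noreply email filter on commit author addresses might look like, here is a short sketch. The function names and the exact matching rules are assumptions for the example, not the project's implementation; the only fact relied on is that GitHub issues privacy addresses under `users.noreply.github.com`.

```python
def is_noreply(email):
    """Hypothetical check: True for addresses that cannot reach a person."""
    local, _, domain = email.lower().rpartition("@")
    # GitHub privacy addresses look like 12345+user@users.noreply.github.com.
    if domain.endswith("noreply.github.com"):
        return True
    # Assumed rule: also treat generic noreply mailboxes as unusable.
    return local in {"noreply", "no-reply"}


def drop_noreply(emails):
    """Keep only the addresses that pass the noreply check, preserving order."""
    return [e for e in emails if not is_noreply(e)]


commit_emails = [
    "12345+octocat@users.noreply.github.com",
    "noreply@github.com",
    "dev@example.com",
]
print(drop_noreply(commit_emails))
# → ['dev@example.com']
```

The limitation follows directly: a user whose commits carry only privacy addresses yields no email at all from the commit author path.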

Docs only, no code changes.

Arnesh Banerjee added 2 commits June 6, 2026 20:30
…wing, and CI

The README still described the pipeline as Reddit + HN only and didn't
mention the deterministic email/handle extractor, the website crawler,
the commit author email path, or the CI workflow.

Acquisition section now lists all four sources and the shallow website
follower. Feature extraction documents both the LLM pass and the
deterministic regex pass (12 social platforms + email deobfuscation).
Output properties calls out the new direct-identifiers block. Usage
examples cover --github, --so, the four-platform combined run, and the
claude-code provider. Limitations honestly notes the 90-day events cap,
the SPA scraping gap, and the noreply email filter. New CI section
points at .github/workflows/ci.yml.
@ArneshBanerjee

Contributor Author

@ni5arga Noticed the README on GitHub didn't cover a few things I added in yesterday's PR. Specifically, it was missing documentation on the GitHub source, email fetching, and the shallow website crawler I implemented.
Added those here. README change only, no code updated.

@ni5arga ni5arga left a comment

Owner

Looks good to me.

@ni5arga ni5arga merged commit c94900b into ni5arga:main Jun 6, 2026
3 checks passed