A local-first ingestion pipeline for collecting Zhihu and WeChat content, normalizing it into Markdown, and storing it inside a local notes library.
This repository intentionally does less than it used to. The package now focuses on ingestion only. Search-oriented Q&A is handled by a lightweight repo-local skill/plugin that uses `rg` (ripgrep) plus direct note reads instead of a dedicated CLI orchestration layer.
What is included:

- A runnable Python package and CLI: `sni`
- Source adapters for feed-driven, manual-seed, and browser-session Zhihu/WeChat ingestion
- Browser-session login persistence via Playwright
- WeChat history discovery through `discover wechat`, using `seed article -> profile_ext/getmsg -> paginated article list`
- WeChat streaming ingestion, so notes appear while the crawl is still running
- WeChat verification-aware resume, so `sni ingest wechat` can recompute remaining URLs from state and retry after a verification wall
- HTML-to-Markdown normalization into a canonical note model
- Markdown library writing with frontmatter, raw HTML archival, asset download, and sync state tracking
- A repo-local `notes-rg-qa` plugin/skill under `plugins/notes-rg-qa/`

What is not included:

- Built-in Q&A commands inside `sni`
- A dedicated model-specific CLI layer for Claude, Codex, or similar agents
- A required editor-specific retrieval backend
- Fully unattended login or captcha solving
- Full-history WeChat extraction for accounts that do not expose a usable `profile_ext/getmsg` path from a reachable seed article

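The verification-aware resume described above (recomputing remaining URLs from state) can be illustrated with a minimal sketch. The state file format (`{"done": [...]}`) and the function names `load_done_urls`, `remaining_urls`, and `mark_done` are assumptions made for illustration; they are not taken from the `sni` codebase.

```python
import json
from pathlib import Path


def load_done_urls(state_file: Path) -> set:
    """Return URLs already recorded as ingested; empty if no state exists yet."""
    if not state_file.exists():
        return set()
    data = json.loads(state_file.read_text(encoding="utf-8"))
    return set(data.get("done", []))


def remaining_urls(all_urls: list, state_file: Path) -> list:
    """Recompute the work list: discovered URLs minus completed ones, order kept."""
    done = load_done_urls(state_file)
    return [url for url in all_urls if url not in done]


def mark_done(url: str, state_file: Path) -> None:
    """Record one finished URL right away so an interrupted crawl keeps its progress."""
    done = load_done_urls(state_file)
    done.add(url)
    state_file.parent.mkdir(parents=True, exist_ok=True)
    # Hypothetical state format: a single JSON object with a sorted "done" list.
    state_file.write_text(json.dumps({"done": sorted(done)}, indent=2), encoding="utf-8")
```

Because the work list is derived from the full URL list and the state file on every run, a crawl that stops at a verification wall can simply be started again and will skip whatever was already written.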
python3 -m venv .venv
source .venv/bin/activate
pip install -e '.[browser]'
python3 -m playwright install chromium

Repository layout:

- `src/source_notes_ingestor/`: ingestion runtime package
- `plugins/notes-rg-qa/`: repo-local search skill/plugin for agents
- `docs/`: architecture and technical notes
- `samples/`: target config examples
- `tests/`: ingestion-focused tests

Pipeline stages:

- `fetch_source(target, auth_ctx, since) -> raw_items[]`
- `normalize(raw_item) -> canonical_note`
- `write_note(canonical_note, library_path) -> note_path`

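To make the three stage signatures concrete, here is a minimal sketch of how they might chain together. Only the function names and argument order come from the signatures above. The `RawItem` and `CanonicalNote` dataclasses, the `items` key in the target, the tag-stripping stand-in for HTML-to-Markdown conversion, and `run_pipeline` are all hypothetical.

```python
import html
import json
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class RawItem:  # hypothetical shape of one fetched item
    url: str
    title: str
    html: str


@dataclass
class CanonicalNote:  # hypothetical canonical note model
    url: str
    title: str
    body_md: str


def fetch_source(target: dict, auth_ctx: Optional[dict], since: Optional[str]) -> list:
    # Stand-in: a real adapter would fetch over the network using auth_ctx and since.
    return [RawItem(**item) for item in target.get("items", [])]


def normalize(raw_item: RawItem) -> CanonicalNote:
    # Placeholder only: strips tags and unescapes entities, not real Markdown conversion.
    text = html.unescape(re.sub(r"<[^>]+>", "", raw_item.html)).strip()
    return CanonicalNote(url=raw_item.url, title=raw_item.title, body_md=text)


def write_note(canonical_note: CanonicalNote, library_path: Path) -> Path:
    slug = re.sub(r"[^\w-]+", "-", canonical_note.title).strip("-").lower() or "untitled"
    note_path = library_path / f"{slug}.md"
    # JSON string quoting is also valid YAML, which keeps the frontmatter title safe.
    title = json.dumps(canonical_note.title, ensure_ascii=False)
    frontmatter = f"---\ntitle: {title}\nsource_url: {canonical_note.url}\n---\n\n"
    note_path.parent.mkdir(parents=True, exist_ok=True)
    note_path.write_text(frontmatter + canonical_note.body_md + "\n", encoding="utf-8")
    return note_path


def run_pipeline(target: dict, library_path: Path, auth_ctx=None, since=None) -> list:
    # Each raw item flows through normalize and write_note independently.
    return [write_note(normalize(raw), library_path)
            for raw in fetch_source(target, auth_ctx, since)]
```

Keeping each stage a plain function of its inputs means an adapter can be swapped without touching normalization or writing.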
sni auth zhihu
sni auth wechat --login-url 'https://mp.weixin.qq.com/s/PgR1HF-b9r7V37iwNNCgrw'

By default, this saves storage state under `./state/browser/<source>.json`.

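For readers unfamiliar with Playwright storage state, the following sketch shows one way login persistence of this kind could work. It uses Playwright's `storage_state` API for saving and restoring cookies and local storage. The helper names `state_path`, `save_login`, and `open_logged_in_context` are hypothetical and are not the actual `sni auth` implementation.

```python
from pathlib import Path


def state_path(source: str, root: Path = Path("state/browser")) -> Path:
    """Map a source name to its storage-state file, mirroring the default layout."""
    return root / f"{source}.json"


def save_login(source: str, login_url: str) -> Path:
    """Open a real browser, wait for a human to log in, then persist the session."""
    # Imported lazily so state_path works without the browser extra installed.
    from playwright.sync_api import sync_playwright

    path = state_path(source)
    path.parent.mkdir(parents=True, exist_ok=True)
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # headed, so a human can interact
        context = browser.new_context()
        page = context.new_page()
        page.goto(login_url)
        input("Finish login or verification in the browser, then press Enter... ")
        context.storage_state(path=str(path))  # writes cookies and local storage as JSON
        browser.close()
    return path


def open_logged_in_context(browser, source: str):
    """Create a new context that reuses a previously saved session."""
    return browser.new_context(storage_state=str(state_path(source)))
```

A later ingest run could call `open_logged_in_context` instead of logging in again, as long as the saved session is still valid.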
- Set `NOTES_LIBRARY_PATH` to your target notes library.
- Copy one of the sample target files and replace the source fields.
- Run `sni auth <source>` once to complete login or human verification in a real browser.
- Run `sni ingest <source> --target ...`.
- Use the `notes-rg-qa` skill/plugin for retrieval and synthesis.

Example:
export NOTES_LIBRARY_PATH=/absolute/path/to/your/library
cp samples/zhihu_target.example.json /tmp/zhihu_target.json
sni auth zhihu
sni ingest zhihu --target /tmp/zhihu_target.json

WeChat history workflow:
export NOTES_LIBRARY_PATH=/absolute/path/to/your/library
sni auth wechat --login-url 'https://mp.weixin.qq.com/s/PgR1HF-b9r7V37iwNNCgrw'
sni discover wechat --target targets/wechat_damowang.json
sni ingest wechat --target targets/wechat_damowang_discovered.json

Progress monitoring:
./scripts/watch_wechat_progress.sh "$NOTES_LIBRARY_PATH"

Library layout:

- `Sources/Zhihu/<author>/answers/*.md`
- `Sources/Zhihu/<author>/thoughts/*.md`
- `Sources/Zhihu/<author>/articles/*.md`
- `Sources/WeChat/<account>/*.md`
- `Sources/_assets/...`
- `Sources/_state/...`

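The `Sources/...` paths above can be expressed as a small path helper. This is a sketch under assumptions: the function name `note_path`, the lowercase source keys, and the use of a caller-supplied `slug` as the file name are hypothetical, and the `_assets` and `_state` subtrees are left out because their internal structure is not specified here.

```python
from pathlib import Path
from typing import Optional

# The three Zhihu content kinds that appear in the layout.
ZHIHU_KINDS = {"answers", "thoughts", "articles"}


def note_path(library: Path, source: str, owner: str, slug: str,
              kind: Optional[str] = None) -> Path:
    """Build the Markdown path for one note under the library's Sources tree."""
    if source == "zhihu":
        if kind not in ZHIHU_KINDS:
            raise ValueError(f"zhihu notes need kind in {sorted(ZHIHU_KINDS)}")
        return library / "Sources" / "Zhihu" / owner / kind / f"{slug}.md"
    if source == "wechat":
        # WeChat notes sit directly under the account directory.
        return library / "Sources" / "WeChat" / owner / f"{slug}.md"
    raise ValueError(f"unknown source: {source}")
```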
The repository includes a small, model-agnostic plugin scaffold for note-library Q&A:
- plugin manifest: `plugins/notes-rg-qa/.codex-plugin/plugin.json`
- skill: `plugins/notes-rg-qa/skills/notes-rg-qa/SKILL.md`
- marketplace entry: `.agents/plugins/marketplace.json`

The skill explicitly tells agents to:
- plan a question in retrieval-friendly terms
- use `rg` for multiple short searches
- open raw note files directly
- answer from opened evidence only

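The "multiple short searches" step could look roughly like the sketch below, which shells out to `rg` once per term and ranks note files by how many terms they matched. The function names `rg_command` and `search_terms` and the ranking rule are assumptions for illustration; the skill itself describes the workflow in prose and does not prescribe this code.

```python
import shutil
import subprocess
from pathlib import Path


def rg_command(term: str, library: Path) -> list:
    """Build one ripgrep invocation that lists Markdown files containing the term."""
    return ["rg", "--files-with-matches", "--ignore-case", "--fixed-strings",
            "--glob", "*.md", "--", term, str(library)]


def search_terms(terms: list, library: Path) -> dict:
    """Run one short search per term and group the matched terms by file."""
    if shutil.which("rg") is None:
        raise RuntimeError("ripgrep (rg) is not installed")
    hits = {}
    for term in terms:
        proc = subprocess.run(rg_command(term, library), capture_output=True, text=True)
        # rg exits with 1 when nothing matched; anything above that is an error.
        if proc.returncode not in (0, 1):
            raise RuntimeError(proc.stderr.strip())
        for path in proc.stdout.splitlines():
            hits.setdefault(path, []).append(term)
    # Files matching more terms come first, as candidates to open and read.
    return dict(sorted(hits.items(), key=lambda kv: -len(kv[1])))
```

The returned paths are only candidates. The answer should still come from opening those files and reading the evidence directly.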
To run the tests:

source .venv/bin/activate
PYTHONPATH=src python -m unittest discover -s tests