source-notes-ingestor

A local-first ingestion pipeline for collecting Zhihu and WeChat content, normalizing it into Markdown, and storing it inside a local notes library.

This repository intentionally does less than it used to. The package focuses on ingestion only. Search-oriented Q&A is now handled by a lightweight repo-local skill/plugin that uses rg plus direct note reads instead of a dedicated CLI orchestration layer.

Docs

What is implemented now

A runnable Python package and CLI: sni
Source adapters for feed-driven, manual-seed, and browser-session Zhihu/WeChat ingestion
Browser-session login persistence via Playwright
WeChat history discovery through discover wechat, using seed article -> profile_ext/getmsg -> paginated article list
WeChat streaming ingestion, so notes appear while the crawl is still running
WeChat verification-aware resume, so sni ingest wechat can recompute remaining URLs from state and retry after a verification wall
HTML-to-Markdown normalization into a canonical note model
Markdown library writing with frontmatter, raw HTML archival, asset download, and sync state tracking
A repo-local notes-rg-qa plugin/skill under plugins/notes-rg-qa/

What is intentionally not implemented

Built-in Q&A commands inside sni
A dedicated model-specific CLI layer for Claude, Codex, or similar agents
A required editor-specific retrieval backend
Fully unattended login or captcha solving
Full-history WeChat extraction for accounts that do not expose a usable profile_ext/getmsg path from a reachable seed article

Install

python3 -m venv .venv
source .venv/bin/activate
pip install -e .[browser]
python3 -m playwright install chromium

Project layout

src/source_notes_ingestor/: ingestion runtime package
plugins/notes-rg-qa/: repo-local search skill/plugin for agents
docs/: architecture and technical notes
samples/: target config examples
tests/: ingestion-focused tests

Stable ingestion interfaces

fetch_source(target, auth_ctx, since) -> raw_items[]
normalize(raw_item) -> canonical_note
write_note(canonical_note, library_path) -> note_path

Login session bootstrap

sni auth zhihu
sni auth wechat --login-url 'https://mp.weixin.qq.com/s/PgR1HF-b9r7V37iwNNCgrw'

By default, this saves storage state under ./state/browser/<source>.json.

Quick start

Set NOTES_LIBRARY_PATH to your target notes library.
Copy one of the sample target files and replace the source fields.
Run sni auth <source> once to complete login or human verification in a real browser.
Run sni ingest <source> --target ....
Use the notes-rg-qa skill/plugin for retrieval and synthesis.

Example:

export NOTES_LIBRARY_PATH=/absolute/path/to/your/library
cp samples/zhihu_target.example.json /tmp/zhihu_target.json
sni auth zhihu
sni ingest zhihu --target /tmp/zhihu_target.json

WeChat history workflow:

export NOTES_LIBRARY_PATH=/absolute/path/to/your/library
sni auth wechat --login-url 'https://mp.weixin.qq.com/s/PgR1HF-b9r7V37iwNNCgrw'
sni discover wechat --target targets/wechat_damowang.json
sni ingest wechat --target targets/wechat_damowang_discovered.json

Progress monitoring:

./scripts/watch_wechat_progress.sh "$NOTES_LIBRARY_PATH"

Library layout

Sources/Zhihu/<author>/answers/*.md
Sources/Zhihu/<author>/thoughts/*.md
Sources/Zhihu/<author>/articles/*.md
Sources/WeChat/<account>/*.md
Sources/_assets/...
Sources/_state/...

notes-rg-qa plugin

The repository includes a small, model-agnostic plugin scaffold for note-library Q&A:

plugin manifest: plugins/notes-rg-qa/.codex-plugin/plugin.json
skill: plugins/notes-rg-qa/skills/notes-rg-qa/SKILL.md
marketplace entry: .agents/plugins/marketplace.json

The skill explicitly tells agents to:

plan a question in retrieval-friendly terms
use rg for multiple short searches
open raw note files directly
answer from opened evidence only

Running tests

source .venv/bin/activate
PYTHONPATH=src python -m unittest discover -s tests

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
.agents/plugins		.agents/plugins
docs		docs
plugins/notes-rg-qa		plugins/notes-rg-qa
samples		samples
scopes		scopes
scripts		scripts
src		src
state		state
targets		targets
tests		tests
.env.example		.env.example
.gitignore		.gitignore
AGENTS.md		AGENTS.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

source-notes-ingestor

Docs

What is implemented now

What is intentionally not implemented

Install

Project layout

Stable ingestion interfaces

Login session bootstrap

Quick start

Library layout

notes-rg-qa plugin

Running tests

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

source-notes-ingestor

Docs

What is implemented now

What is intentionally not implemented

Install

Project layout

Stable ingestion interfaces

Login session bootstrap

Quick start

Library layout

notes-rg-qa plugin

Running tests

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages