Tags Obsidian notes automatically with a local Ollama LLM, reuses your existing tag vocabulary instead of creating near-duplicates, and includes a cleanup tool to consolidate the tag pool over time.
- What it does
- Quick start
- Daily run
- Manual runs
- Cleanup tool
- Configuration
- Model performance
- Design notes
- Repo layout
- Tags new and modified notes automatically. Each note gets a fresh set of frontmatter tags drawn from its content.
- Respects your existing tag vocabulary. When a note mentions something you've tagged before, the script picks the tag you already use instead of creating a near-duplicate.
- Disambiguates with context. When two existing tags look alike (`julie` vs `julieandrews`), the script reads the note and decides which one fits.
- Cleans up after itself. A separate `cleanup_tags.py` tool surfaces duplicates, typos, and junk tags for review. Nothing writes until you approve.
- Designed to run on its own. Point a nightly cron at your vault and forget it.
You need:
- Ollama running locally with `gemma3:12b` and `nomic-embed-text` pulled.
- Python 3.11+ on Linux, or 3.13 specifically on macOS (3.14+ has a Local Network privacy quirk; see Design notes).
git clone https://github.com/undergroundpost/obsidian-auto-tagger.git
cd obsidian-auto-tagger
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
cp config.yaml.example config.yaml
# edit config.yaml: at minimum, set INPUT_FOLDER to your vault path
Sanity-check on five files without writing anything:
.venv/bin/python generate_tags.py --dry-run --limit 5
Drop `--dry-run` once you're happy with the output.
Pick the path that matches your setup.
For a vault that lives on the same machine the cron runs on:
0 1 * * * /path/to/obsidian-auto-tagger/.venv/bin/python /path/to/obsidian-auto-tagger/generate_tags.py
The script writes its own dated log to `logs/generate_tags_YYYY-MM-DD.log`.
If your vault lives in Obsidian Sync and you want this script to run on an always-on server (instead of relying on a laptop being awake at 1am), use Obsidian's official headless client to pull and push around each run.
One-time setup on the server:
npm install -g obsidian-headless
ob sync-setup # interactive: pair to your Obsidian Sync vault
Then schedule the bundled wrapper, which does `ob sync` → tag → `ob sync`:
0 1 * * * /path/to/obsidian-auto-tagger/run-daily.sh
The wrapper logs to `logs/run-daily_YYYY-MM-DD.log`; the Python script logs to `logs/generate_tags_YYYY-MM-DD.log`.
.venv/bin/python generate_tags.py # default: only files modified since their last processed: stamp
.venv/bin/python generate_tags.py --dry-run --limit 5 # preview without writing
.venv/bin/python generate_tags.py --force --limit 10 # re-tag, ignoring the processed timestamp
.venv/bin/python generate_tags.py --untagged-only # backfill files with no tags
Other flags: `--debug`, `--input <folder>`, `--exclude <folder>` (repeatable), `--model <name>`, `--provider {ollama,openai}`, `--api-key <key>`.
Vault tag pools drift over time — typos, near-duplicates, one-off junk. cleanup_tags.py consolidates yours through a review-and-apply workflow:
.venv/bin/python cleanup_tags.py scan # writes cleanup_proposals.json
# open cleanup_proposals.json and edit any actions you want to change
.venv/bin/python cleanup_tags.py apply # rewrites frontmatter + inline #tag references
Proposals are grouped into:
| Section | Default | What it catches |
|---|---|---|
| `format_consolidations` | apply | hyphen / underscore / accent variants |
| `semantic_consolidations_auto` | apply | high-similarity clusters (cosine ≥ 0.90) |
| `semantic_consolidations_review` | review | borderline clusters (0.80–0.90); approve each manually |
| `suspected_junk` | delete | concatenation blobs, garbage strings |
| `one_offs` | review | tags used once; merged if a close neighbor exists |
By default the tool also runs an LLM judge over each cluster to verify cohesion and pick the canonical form (e.g. claud + claudai → claude). Pass --no-llm to skip the judge and use heuristics only.
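The two semantic sections amount to a threshold rule on cluster similarity. The sketch below illustrates only that cut, assuming each cluster has already been reduced to a single cosine score; how `cleanup_tags.py` actually scores a cluster is not stated here, and `bucket_for_cluster` is a hypothetical name.

```python
from typing import Optional

# Thresholds copied from the proposals table. The function name and the idea
# of one score per cluster are assumptions made for illustration.
AUTO_MIN = 0.90
REVIEW_MIN = 0.80


def bucket_for_cluster(cluster_sim: float) -> Optional[str]:
    """Map a cluster's cosine score to a proposal section, or None if too loose."""
    if cluster_sim >= AUTO_MIN:
        return "semantic_consolidations_auto"
    if cluster_sim >= REVIEW_MIN:
        return "semantic_consolidations_review"
    return None
```

The LLM judge pass described above would run on top of this cut; the sketch does not model it.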
| Key | Default | Notes |
|---|---|---|
| `INPUT_FOLDER` | `~/Documents/Notes` | Vault root |
| `EXCLUDE_FOLDERS` | `[]` | Skipped during scan and tagging |
| `LLM_PROVIDER` | `ollama` | `ollama` or `openai` |
| `OLLAMA_MODEL` | `gemma3:12b` | Tag-extraction model (also used by cleanup LLM passes) |
| `OLLAMA_SERVER_ADDRESS` | `http://localhost:11434` | Ollama endpoint |
| `OLLAMA_CONTEXT_WINDOW` | `32000` | `num_ctx` for tag extraction |
| `EMBEDDING_MODEL` | `nomic-embed-text` | Embedding model for the matcher |
| `EMBEDDING_BATCH_SIZE` | `100` | Tags per `/api/embed` call |
| `SIMILARITY_THRESHOLD_HIGH` | `0.95` | Cosine sim ≥ this → auto-consolidate |
| `SIMILARITY_THRESHOLD_LOW` | `0.70` | Cosine sim < this → new tag (no judge call) |
| `LLM_JUDGE_ENABLED` | `true` | Gray-zone candidates use the LLM judge with note context |
| `MAX_SPECIFIC_TAG_WORDS` | `3` | `specific_tags` with more words are dropped (anti-SKU) |
| `MAX_NORMALIZED_TAG_LEN` | `30` | Tags whose normalized form exceeds this length are dropped |
| `REQUIRE_SPECIFIC_TAG_IN_BODY` | `true` | Drop `specific_tags` not present in the note body (anti-leakage) |
| `EMPTY_NOTE_BODY_MIN_CHARS` | `1` | Skip notes whose body is shorter than this |
| `OPT_OUT_FRONTMATTER_KEY` | `"auto_tag"` | Written as `<key>: false` on hard LLM failure; remove to retry |
| `OPENAI_API_KEY` | `""` | Required if provider is `openai` |
| `OPENAI_MODEL` | `gpt-3.5-turbo` | |
| `OPENAI_MAX_TOKENS` | `4000` | |
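To make the interplay of `SIMILARITY_THRESHOLD_HIGH`, `SIMILARITY_THRESHOLD_LOW`, and `LLM_JUDGE_ENABLED` concrete, here is a minimal sketch of the routing those three keys describe. The names `Route`, `cosine`, and `route_candidate` are hypothetical, and falling back to a new tag when the judge is disabled is an assumption, not documented behavior.

```python
import math
from enum import Enum


class Route(Enum):
    REUSE = "reuse the existing tag"
    JUDGE = "ask the LLM judge"
    NEW = "create a new tag"


def cosine(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route_candidate(sim: float, high: float = 0.95, low: float = 0.70,
                    judge_enabled: bool = True) -> Route:
    """Decide what happens to a generated tag given its best match score."""
    if sim >= high:
        return Route.REUSE   # at or above the high threshold: auto-consolidate
    if sim < low:
        return Route.NEW     # below the low threshold: new tag, no judge call
    # Gray zone. Treating it as a new tag when the judge is off is an assumption.
    return Route.JUDGE if judge_enabled else Route.NEW
```

Raising `high` or lowering `low` widens the gray zone, which trades more judge calls for fewer automatic decisions.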
Tag quality is sensitive to model size. From measured runs on a mixed personal vault:
- `gemma3:12b` (recommended). Reliably follows the granularity rule: emits brand-level tags like `Hoka`, `Darn Tough` instead of SKU strings. Doesn't hallucinate acronym expansions. ~6–10s per file on a 3060.
- `qwen2.5:7b` (works, with caveats). Functions, but ignores the granularity rule on list-heavy notes (emits `"Darn Tough T4050 Heavyweight Tactical Full Cushion"` verbatim). Hallucinated `"Reformer Autoencoder Generator"` as an expansion of `RAG` in one test. ~1–4s per file.
- Models below ~7B: not tested, but the pattern suggests they'll be worse at following the granularity rule.
The post-processing filters (MAX_SPECIFIC_TAG_WORDS, MAX_NORMALIZED_TAG_LEN) catch the worst-case SKU bloat regardless of model, but they're a safety net, not a substitute for picking a capable model. If you change models, re-evaluate MAX_SPECIFIC_TAG_WORDS — different models concatenate differently.
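As a rough illustration of these filters, together with the `REQUIRE_SPECIFIC_TAG_IN_BODY` check from the configuration, the sketch below drops tags by word count, normalized length, and absence from the note body. The normalization rule (strip accents, lowercase, remove non-alphanumerics), the case-insensitive substring match, and both function names are assumptions; the script's real rules may differ.

```python
import re
import unicodedata


def normalize_tag(tag: str) -> str:
    """Hypothetical normalizer: strip accents, lowercase, keep only a-z and 0-9."""
    decomposed = unicodedata.normalize("NFKD", tag)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]", "", no_accents.lower())


def filter_specific_tags(tags: list[str], body: str, max_words: int = 3,
                         max_len: int = 30, require_in_body: bool = True) -> list[str]:
    """Apply the three deterministic filters; defaults mirror the config table."""
    body_lower = body.lower()
    kept = []
    for tag in tags:
        if len(tag.split()) > max_words:
            continue  # too many words: likely a SKU string
        if len(normalize_tag(tag)) > max_len:
            continue  # normalized form too long
        if require_in_body and tag.lower() not in body_lower:
            continue  # not grounded in the note text (case-insensitive here by assumption)
        kept.append(tag)
    return kept
```

With a body of `"Bought Darn Tough socks today."`, the seven-word SKU string quoted above fails the word-count check and `Hoka` fails the grounding check, while `Darn Tough` passes all three.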
- Two-stage matcher with LLM judge, fresh per note. Pure cosine-similarity matching conflates lexical proximity with referent identity: it cannot tell "your coworker Julie" from "Julie Andrews", because both embed similarly. The matcher uses cosine as a coarse filter: above `SIMILARITY_THRESHOLD_HIGH` auto-consolidates (typos, trivial morphology); below `SIMILARITY_THRESHOLD_LOW` is auto-rejected as a new tag; the gray zone in between is sent to the LLM judge with note context, which has the world knowledge needed to disambiguate. Judge verdicts are not cached: a verdict made in one note's context is not globally true (a later fan note that uses "Julie" to mean Julie Andrews deserves the opposite verdict from a coworker note). Every judge call is appended to `tag_decisions.log` (JSONL) along with the file path, body excerpt, sim, verdict, and reason, providing a full audit trail without freezing decisions.
- Grounding check defeats in-prompt example leakage. LLMs given specific in-context examples can regurgitate those examples on weakly-grounded notes, e.g. emitting `Hoka` on a software-UX note because clothing brands appeared in the prompt's worked examples. The substring-against-body check (`REQUIRE_SPECIFIC_TAG_IN_BODY`) is the deterministic backstop: any `specific_tag` that doesn't actually appear in the note is dropped before reaching the matcher. `general_tags` are exempt because they're conceptual.
- Schema-enforced structured output is non-negotiable. Single-array JSON (`{"tags": [...]}`) lets the model paraphrase or ignore output rules. The dual-array schema (`specific_tags` + `general_tags`) forces dual-level tagging via grammar-constrained decoding: the model cannot return without filling both fields.
- The matcher is the source of consolidation truth, not the prompt. The model has no knowledge of which existing tags to reuse; it generates freely. Cosine similarity over `nomic-embed-text` embeddings handles the consolidation deterministically. This avoids prompt-bloat and keeps tag reuse consistent across runs.
- Junk detection uses vault tags plus `/usr/share/dict/words`. Vault-vocabulary-only decomposition produces too many false positives because the established vocab is small relative to natural English. Combining the two gives clean results: vault tags handle niche technical terms, the dict handles normal English.
- Cleanup judge: classify-then-decide, not free-form reason-then-decide. The cluster-cohesion LLM call returns a structured `relationship` field with an enum value (`typo`, `morphological`, `synonym`, `related_distinct`, `named_vs_common`) that deterministically maps to `same_entity`. This was rebuilt from a free-form "are these the same?" prompt that consistently failed in one direction: the model used "they're related" as evidence of sameness. Forcing a relationship label first, with explicit anti-patterns (broad-vs-narrow, device-vs-OS, adjective-vs-noun, sibling-concepts), eliminated that failure mode. The model writes a `counterexample` only when the relationship is `related_distinct` or `named_vs_common`; gating the counterexample to ambiguous cases prevents it from over-firing on trivial singular/plural pairs.
- Cluster extension uses lexical similarity, not embedding similarity. Misspelled non-words are essentially random in `nomic-embed-text` space: `sim("claud", "claude") = 0.54`, well below any useful cluster threshold. So the cleanup tool's external-canonical lookup uses Levenshtein ratio (≥ 0.80) to find vault tags lex-close to cluster members, then drops candidates whose embedding sim to the cluster centroid is below 0.50 (filters out coincidental lookalikes like `cloud` for `claud` + `claudai`). The lookup is typo-gated: it only runs when at least one cluster member fails to appear in `/usr/share/dict/words` and isn't an established vault tag, because for clusters of real-word pairs (e.g. `prophetic` + `prophecy`) the external candidates introduce noise that shifts the cohesion verdict.
- macOS Local Network privacy is per-binary. Each Homebrew Python minor version is treated as a separate binary by the macOS permission system. New venvs that need LAN access (e.g. talking to Ollama on a different host) must be created with `python@3.13` if that's the binary that has the Local Network grant. A new minor version will silently fail with `EHOSTUNREACH` and no prompt. This quirk only applies on macOS; on Linux any Python 3.11+ works.
- The `processed:` frontmatter timestamp is the only source of truth for incremental work. No external state file. Touching the note in any way (including a tag rewrite from the cleanup tool) intentionally does not update `processed:`, because tag-level edits aren't semantic changes.
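The cluster-extension lookup from the list above can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's code: "Levenshtein ratio" is taken here as `1 - distance / max(len)`, one of several common definitions; `centroid_sim` stands in for precomputed cosine scores against the cluster centroid; all function names are hypothetical; and the typo gate is left out.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lex_ratio(a: str, b: str) -> float:
    """Assumed ratio definition: 1.0 for identical strings, lower as edits grow."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def external_candidates(cluster: list[str], vault_tags: list[str],
                        centroid_sim: dict[str, float],
                        lex_min: float = 0.80, sim_min: float = 0.50) -> list[str]:
    """Vault tags lexically close to any cluster member and not embedding-distant."""
    members = set(cluster)
    found = []
    for tag in vault_tags:
        if tag in members:
            continue
        if not any(lex_ratio(tag, member) >= lex_min for member in cluster):
            continue  # not a near-spelling of any cluster member
        if centroid_sim.get(tag, 0.0) < sim_min:
            continue  # lexical lookalike that is semantically far from the cluster
        found.append(tag)
    return found
```

Under this definition `claud` and `claude` are one edit apart over six characters, a ratio of about 0.83, so `claude` clears the lexical cut even though its embedding similarity to `claud` is low; the centroid check is what separates it from a lookalike such as `cloud`.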
generate_tags.py # daily tagger (entry point for cron)
generate_tags.md # prompt template (read at runtime by generate_tags.py)
cleanup_tags.py # vault-wide tag cleanup with scan/apply workflow
config.yaml # all configuration
requirements.txt # Python deps
run-daily.sh # cron wrapper: ob sync → python → ob sync
.venv/ # repo-local virtualenv (created during setup)
tag_embeddings.json # embedding cache, regenerated when missing or model changes
tag_decisions.log # append-only JSONL audit log of matcher judge calls
cleanup_decisions.json # cleanup tool's LLM-verdict cache (cluster cohesion + junk-judge)
cleanup_proposals.json # output of `cleanup_tags.py scan`; reviewed before apply
logs/ # per-day dated logs
harness/ # replay harness for iterating on the cleanup judge prompt
