Skip to content

undergroundpost/obsidian-auto-tagger

Repository files navigation

obsidian-auto-tagger

Tags Obsidian notes automatically with a local Ollama LLM, reuses your existing tag vocabulary instead of creating near-duplicates, and includes a cleanup tool to consolidate the tag pool over time.

Contents

What it does

  • Tags new and modified notes automatically. Each note gets a fresh set of frontmatter tags drawn from its content.
  • Respects your existing tag vocabulary. When a note mentions something you've tagged before, the script picks the tag you already use instead of creating a near-duplicate.
  • Disambiguates with context. When two existing tags look alike — julie vs julieandrews — the script reads the note and decides which one fits.
  • Cleans up after itself. A separate cleanup_tags.py tool surfaces duplicates, typos, and junk tags for review. Nothing writes until you approve.
  • Designed to run on its own. Point a nightly cron at your vault and forget it.

Quick start

You need:

  • Ollama running locally with gemma3:12b and nomic-embed-text pulled.
  • Python 3.11+ on Linux, or 3.13 specifically on macOS (3.14+ has a Local Network privacy quirk — see Design notes).
git clone https://github.com/undergroundpost/obsidian-auto-tagger.git
cd obsidian-auto-tagger
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
cp config.yaml.example config.yaml
# edit config.yaml — at minimum, set INPUT_FOLDER to your vault path

Sanity-check on five files without writing anything:

.venv/bin/python generate_tags.py --dry-run --limit 5

Drop --dry-run once you're happy with the output.

Daily run

Pick the path that matches your setup.

Local vault

For a vault that lives on the same machine the cron runs on:

0 1 * * * /path/to/obsidian-auto-tagger/.venv/bin/python /path/to/obsidian-auto-tagger/generate_tags.py

The script writes its own dated log to logs/generate_tags_YYYY-MM-DD.log.

Headless server with Obsidian Sync

If your vault lives in Obsidian Sync and you want this script to run on an always-on server (instead of relying on a laptop being awake at 1am), use Obsidian's official headless client to pull and push around each run.

One-time setup on the server:

npm install -g obsidian-headless
ob sync-setup    # interactive — pair to your Obsidian Sync vault

Then schedule the bundled wrapper, which does ob sync → tag → ob sync:

0 1 * * * /path/to/obsidian-auto-tagger/run-daily.sh

The wrapper logs to logs/run-daily_YYYY-MM-DD.log; the Python script logs to logs/generate_tags_YYYY-MM-DD.log.

Manual runs

.venv/bin/python generate_tags.py                       # default: only files modified since their last processed: stamp
.venv/bin/python generate_tags.py --dry-run --limit 5   # preview without writing
.venv/bin/python generate_tags.py --force --limit 10    # re-tag, ignoring the processed timestamp
.venv/bin/python generate_tags.py --untagged-only       # backfill files with no tags

Other flags: --debug, --input <folder>, --exclude <folder> (repeatable), --model <name>, --provider {ollama,openai}, --api-key <key>.

Cleanup tool

Vault tag pools drift over time — typos, near-duplicates, one-off junk. cleanup_tags.py consolidates yours through a review-and-apply workflow:

.venv/bin/python cleanup_tags.py scan      # writes cleanup_proposals.json
# open cleanup_proposals.json and edit any actions you want to change
.venv/bin/python cleanup_tags.py apply     # rewrites frontmatter + inline #tag references

Proposals are grouped into:

Section Default What it catches
format_consolidations apply hyphen / underscore / accent variants
semantic_consolidations_auto apply high-similarity clusters (cosine ≥ 0.90)
semantic_consolidations_review review borderline clusters (0.80–0.90) — approve each manually
suspected_junk delete concatenation blobs, garbage strings
one_offs review tags used once; merged if a close neighbor exists

By default the tool also runs an LLM judge over each cluster to verify cohesion and pick the canonical form (e.g. claud + claudai → claude). Pass --no-llm to skip the judge and use heuristics only.

Configuration

Key Default Notes
INPUT_FOLDER ~/Documents/Notes Vault root
EXCLUDE_FOLDERS [] Skipped during scan and tagging
LLM_PROVIDER ollama ollama or openai
OLLAMA_MODEL gemma3:12b Tag-extraction model (also used by cleanup LLM passes)
OLLAMA_SERVER_ADDRESS http://localhost:11434 Ollama endpoint
OLLAMA_CONTEXT_WINDOW 32000 num_ctx for tag extraction
EMBEDDING_MODEL nomic-embed-text Embedding model for the matcher
EMBEDDING_BATCH_SIZE 100 Tags per /api/embed call
SIMILARITY_THRESHOLD_HIGH 0.95 Cosine sim ≥ this → auto-consolidate
SIMILARITY_THRESHOLD_LOW 0.70 Cosine sim < this → new tag (no judge call)
LLM_JUDGE_ENABLED true Gray-zone candidates use the LLM judge with note context
MAX_SPECIFIC_TAG_WORDS 3 specific_tags with more words are dropped (anti-SKU)
MAX_NORMALIZED_TAG_LEN 30 Tags whose normalized form exceeds this length are dropped
REQUIRE_SPECIFIC_TAG_IN_BODY true Drop specific_tags not present in the note body (anti-leakage)
EMPTY_NOTE_BODY_MIN_CHARS 1 Skip notes whose body is shorter than this
OPT_OUT_FRONTMATTER_KEY "auto_tag" Written as <key>: false on hard LLM failure; remove to retry
OPENAI_API_KEY "" Required if provider is openai
OPENAI_MODEL gpt-3.5-turbo
OPENAI_MAX_TOKENS 4000

Model performance

Tag quality is sensitive to model size. From measured runs on a mixed personal vault:

  • gemma3:12b (recommended). Reliably follows the granularity rule — emits brand-level tags like Hoka, Darn Tough instead of SKU strings. Doesn't hallucinate acronym expansions. ~6–10s per file on a 3060.
  • qwen2.5:7b (works, with caveats). Functions, but ignores the granularity rule on list-heavy notes (emits "Darn Tough T4050 Heavyweight Tactical Full Cushion" verbatim). Hallucinated "Reformer Autoencoder Generator" as an expansion of RAG in one test. ~1–4s per file.
  • Models below ~7B: not tested but the pattern suggests they'll be worse at granularity rule-following.

The post-processing filters (MAX_SPECIFIC_TAG_WORDS, MAX_NORMALIZED_TAG_LEN) catch the worst-case SKU bloat regardless of model, but they're a safety net, not a substitute for picking a capable model. If you change models, re-evaluate MAX_SPECIFIC_TAG_WORDS — different models concatenate differently.

Design notes

  • Two-stage matcher with LLM judge — fresh per note. Pure cosine-similarity matching conflates lexical proximity with referent identity — it cannot tell "your coworker Julie" from "Julie Andrews", because both embed similarly. The matcher uses cosine as a coarse filter: above SIMILARITY_THRESHOLD_HIGH auto-consolidates (typos, trivial morphology); below SIMILARITY_THRESHOLD_LOW is auto-rejected as a new tag; the gray zone in between is sent to the LLM judge with note context, which has the world knowledge needed to disambiguate. Judge verdicts are not cached — a verdict made in one note's context is not globally true (a later fan note that uses "Julie" to mean Julie Andrews deserves the opposite verdict from a coworker note). Every judge call is appended to tag_decisions.log (JSONL) along with the file path, body excerpt, sim, verdict, and reason, providing a full audit trail without freezing decisions.

  • Grounding check defeats in-prompt example leakage. LLMs given specific in-context examples can regurgitate those examples on weakly-grounded notes — e.g. emitting Hoka on a software-UX note because clothing brands appeared in the prompt's worked examples. The substring-against-body check (REQUIRE_SPECIFIC_TAG_IN_BODY) is the deterministic backstop: any specific_tag that doesn't actually appear in the note is dropped before reaching the matcher. general_tags are exempt because they're conceptual.

  • Schema-enforced structured output is non-negotiable. Single-array JSON ({"tags": [...]}) lets the model paraphrase or ignore output rules. The dual-array schema (specific_tags + general_tags) forces dual-level tagging via grammar-constrained decoding — the model cannot return without filling both fields.

  • The matcher is the source of consolidation truth, not the prompt. The model has no knowledge of which existing tags to reuse; it generates freely. Cosine similarity over nomic-embed-text embeddings handles the consolidation deterministically. This avoids prompt-bloat and keeps tag reuse consistent across runs.

  • Junk detection uses vault tags plus /usr/share/dict/words. Vault-vocabulary-only decomposition produces too many false positives because the established vocab is small relative to natural English. Combining the two gives clean results — vault tags handle niche technical terms, the dict handles normal English.

  • Cleanup judge: classify-then-decide, not free-form reason-then-decide. The cluster-cohesion LLM call returns a structured relationship field with an enum value (typo, morphological, synonym, related_distinct, named_vs_common) that deterministically maps to same_entity. This was rebuilt from a free-form "are these the same?" prompt that consistently failed in one direction: the model used "they're related" as evidence of sameness. Forcing a relationship label first, with explicit anti-patterns (broad-vs-narrow, device-vs-OS, adjective-vs-noun, sibling-concepts), eliminated that failure mode. The model writes a counterexample only when the relationship is related_distinct or named_vs_common — gating the counterexample to ambiguous cases prevents it from over-firing on trivial sing/pl pairs.

  • Cluster extension uses lexical similarity, not embedding similarity. Misspelled non-words are essentially random in nomic-embed-text space — sim("claud", "claude") = 0.54, well below any useful cluster threshold. So the cleanup tool's external-canonical lookup uses Levenshtein ratio (≥ 0.80) to find vault tags lex-close to cluster members, then drops candidates whose embedding sim to the cluster centroid is below 0.50 (filters out coincidental lookalikes like cloud for claud+claudai). The lookup is typo-gated: it only runs when at least one cluster member fails to appear in /usr/share/dict/words and isn't an established vault tag, because for clusters of real-word pairs (e.g. prophetic+prophecy) the external candidates introduce noise that shifts the cohesion verdict.

  • macOS Local Network privacy is per-binary. Each Homebrew Python minor version is treated as a separate binary by the macOS permission system. New venvs that need LAN access (e.g. talking to Ollama on a different host) must be created with python@3.13 if that's the binary that has the Local Network grant. A new minor version will silently fail with EHOSTUNREACH and no prompt. This quirk only applies on macOS; on Linux any Python 3.11+ works.

  • The processed: frontmatter timestamp is the only source of truth for incremental work. No external state file. Touching the note in any way (including a tag rewrite from the cleanup tool) intentionally does not update processed: — tag-level edits aren't semantic changes.

Repo layout

generate_tags.py        # daily tagger (entry point for cron)
generate_tags.md        # prompt template (read at runtime by generate_tags.py)
cleanup_tags.py         # vault-wide tag cleanup with scan/apply workflow
config.yaml             # all configuration
requirements.txt        # Python deps
run-daily.sh            # cron wrapper: ob sync → python → ob sync
.venv/                  # repo-local virtualenv (created during setup)
tag_embeddings.json     # embedding cache, regenerated when missing or model changes
tag_decisions.log       # append-only JSONL audit log of matcher judge calls
cleanup_decisions.json  # cleanup tool's LLM-verdict cache (cluster cohesion + junk-judge)
cleanup_proposals.json  # output of `cleanup_tags.py scan`; reviewed before apply
logs/                   # per-day dated logs
harness/                # replay harness for iterating on the cleanup judge prompt

About

Use AI to automatically generate and add relevant tags to your Obsidian notes.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors