feat(zotero): add Zotero connector#61

Open
thiswillbeyourgithub wants to merge 9 commits into open-webui:main from thiswillbeyourgithub:add-zotero
thiswillbeyourgithub commented Jun 22, 2026

Summary

Adds a Zotero connector to oikb. It syncs the text of the PDF attachments of items in a Zotero collection (and its subcollections) into an Open WebUI Knowledge Base, mapping the Zotero collection hierarchy onto KB directories. It is read-only with respect to Zotero: it never modifies, deletes, or adds anything to the library.

This is a Claude Code port of an earlier standalone tool I wrote, https://github.com/thiswillbeyourgithub/openwebui-knowledge-zotero-sync, reworked to fit oikb's connector interface.

Zotero connector

  • New zotero:<hierarchy> source scheme. %% separates collection names
    (e.g. zotero:Research%%Machine Learning); a bare zotero: syncs every top-level collection.
  • Text extraction prefers Zotero's indexed fulltext API, falling back to PyMuPDF on the downloaded PDF.
  • Collection hierarchy maps to KB directories; an item that lives in several subcollections is handled.
  • Change detection via ZOTERO_CHECKSUM: version (cheap, hashes the Zotero item version, no download, default) or content (hashes the extracted text).
  • ZOTERO_EXCLUDE skips subcollections, relative to the synced root.
  • Auth/options via env: ZOTERO_LIBRARY_ID, ZOTERO_API_KEY, ZOTERO_LIBRARY_TYPE, ZOTERO_CHECKSUM, ZOTERO_EXCLUDE.
  • Optional dependency group: pip install oikb[zotero] (pyzotero + pymupdf).
  • README and docs/guide.md updated.
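
The source scheme described above can be illustrated with a small parser. This is only a sketch of how a `zotero:<hierarchy>` string might be split on `%%`; the function name `parse_zotero_source` is hypothetical and is not taken from the PR's code.

```python
def parse_zotero_source(source: str) -> list[str]:
    """Split 'zotero:A%%B' into ['A', 'B'].

    Hypothetical helper: an empty list stands for the bare 'zotero:' form,
    which the PR describes as syncing every top-level collection.
    """
    prefix = "zotero:"
    if not source.startswith(prefix):
        raise ValueError(f"not a zotero source: {source!r}")
    hierarchy = source[len(prefix):]
    if not hierarchy:
        return []
    # '%%' separates collection names, so names may contain spaces or slashes.
    return hierarchy.split("%%")


print(parse_zotero_source("zotero:Research%%Machine Learning"))
print(parse_zotero_source("zotero:"))
```

Using a two-character separator rather than `/` presumably leaves room for collection names that themselves contain a slash, though the PR does not state the motivation.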

Sync change: unavailable source files are warnings, not fatal errors

A file the source advertises in its manifest but cannot provide content for (e.g. a Zotero attachment whose bytes aren't in storage: a web link, a linked file, or a WebDAV-only item) is a source-side data gap, not an oikb failure. Previously any such case landed in result.errors and exited the whole sync with code 1.

This PR adds a SourceFileUnavailable exception to the connector base. run_sync catches it specifically (no retry) and routes the file to a new result.warnings list instead of result.errors, so a sync whose only problems are unavailable source files still succeeds (exit 0). Any other read_file() exception still fails the run exactly as before. The CLI and daemon surface warnings separately from errors; only result.errors affects the exit code / success status.
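
The routing described above could look roughly like the following. It is a simplified sketch, not the PR's actual run_sync: `SyncResult`, `sync_files`, and `exit_code` are hypothetical stand-ins, and only the exception name `SourceFileUnavailable` and the `warnings`/`errors` split come from the description.

```python
from dataclasses import dataclass, field


class SourceFileUnavailable(Exception):
    """The source lists a file but cannot provide its content."""


@dataclass
class SyncResult:  # hypothetical stand-in for oikb's result object
    uploaded: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)


def sync_files(paths, read_file) -> SyncResult:
    """Read each file; unavailable files become warnings, other failures errors."""
    result = SyncResult()
    for path in paths:
        try:
            read_file(path)
        except SourceFileUnavailable as exc:
            # Source-side data gap: report it, but do not fail the run.
            result.warnings.append(f"{path}: {exc}")
        except Exception as exc:
            # Any other read failure still counts as an error.
            result.errors.append(f"{path}: {exc}")
        else:
            result.uploaded.append(path)
    return result


def exit_code(result: SyncResult) -> int:
    """Only errors affect the exit status; warnings never do."""
    return 1 if result.errors else 0
```

Catching the narrow exception before the generic one keeps every other read_file() failure on the existing error path.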

Tests

tests/test_sync_warnings.py covers all three paths: the Zotero missing-file mapping to SourceFileUnavailable, warning routing in run_sync, and that a generic read failure is still an error.

Notes

  • Opening as a draft for early feedback on the connector shape and on whether the warnings-vs-errors sync change is acceptable as written.
  • Implemented with Claude Code.
  • I initially wanted to support configuring everything through the YAML file instead of environment variables to simplify setup, but that required some fundamental changes that I felt went beyond the scope of a single PR. Let me know if you'd like me to streamline this.

thiswillbeyourgithub and others added 9 commits June 22, 2026 16:51
Extracts text from PDF attachments in a Zotero collection (and its
subcollections) and exposes them as .txt files for sync. Collection
hierarchy maps to KB directories. Read-only with respect to Zotero.

Checksum mode is configurable via ZOTERO_CHECKSUM: 'version' (cheap,
default) or 'content' (hashes extracted text).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Personal, untracked scripts and configs (e.g. the Zotero sync cheatsheet)
live in perso/ and should not be tracked.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The diff's mkdir list had no guaranteed order, so a nested directory could
be processed before its parent. Since the parent's id is looked up from
directory_map (populated as dirs are created), an out-of-order child would
get parent_id=None and be created at the wrong level. Sorting lexicographically
puts "a" before "a/b", guaranteeing parent-first creation, and also makes the
dry-run "Dirs to create" output stable and readable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
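
The parent-first ordering argument in the commit message above can be sketched as follows. This is an illustration only: `create_dirs` and the `create_dir` callback are hypothetical, and only the name `directory_map` and the lexicographic sort come from the commit message.

```python
def create_dirs(dirs_to_create, create_dir):
    """Create directories parent-first and return a path -> id map.

    A path is a strict prefix of every one of its descendants, so plain
    lexicographic sorting always places "a" before "a/b".
    """
    directory_map = {}
    for path in sorted(dirs_to_create):
        parent, _, name = path.rpartition("/")
        # Top-level directories have no parent; nested ones look theirs up.
        parent_id = directory_map.get(parent) if parent else None
        directory_map[path] = create_dir(name, parent_id)
    return directory_map


calls = []

def fake_create(name, parent_id):
    # Stand-in for the real API call; returns a made-up id.
    calls.append((name, parent_id))
    return f"id-{len(calls)}"

create_dirs(["a/b/c", "a", "a/b"], fake_create)
print(calls)  # [('a', None), ('b', 'id-1'), ('c', 'id-2')]
```

Without the `sorted` call, the input order above would process "a/b/c" first, find no entry for "a/b" in `directory_map`, and create it with `parent_id=None`, which is the bug the commit describes.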
When a Zotero attachment's bytes aren't in zotero.org storage (file sync
off, storage quota exceeded, WebDAV-only, or a web link), the /file
endpoint returns 404. In 'version' checksum mode this already surfaced
per-file at upload time and the rest of the sync completed. But in
'content' mode the download happens during build_manifest, so a single
404 propagated out of run_sync and aborted the entire run with nothing
uploaded.

Make _checksum degrade to the version checksum when text can't be
retrieved, so the manifest is always built and the failure is reported
through the normal per-file upload error path (sync everything we can,
list what failed at the end). Also wrap the file-download failure in a
clear message explaining the bytes aren't in Zotero storage instead of
dumping a raw HTTP 404.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
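
The fallback described in the commit message above might be sketched like this. The function signature, the `fetch_text` callback, and the choice of SHA-256 are assumptions made for illustration; the PR only states that `_checksum` degrades to the version checksum when text can't be retrieved.

```python
import hashlib


def checksum(item_version: int, mode: str, fetch_text) -> str:
    """Return a checksum for a Zotero attachment.

    'version' mode hashes the item version only (no download).
    'content' mode hashes the extracted text, but falls back to the
    version checksum if the text cannot be retrieved, so building the
    manifest never aborts on a single missing file.
    """
    version_sum = hashlib.sha256(f"v{item_version}".encode()).hexdigest()
    if mode == "version":
        return version_sum
    try:
        text = fetch_text()
    except Exception:
        # The failure is reported later, on the per-file upload path.
        return version_sum
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

With this shape, a 404 during manifest building no longer stops the run; the same file fails again when it is read for upload and is reported there.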
A file the source advertises in its manifest but cannot actually provide
content for (e.g. a Zotero attachment whose bytes aren't in storage: a web
link, linked file, or WebDAV-only item) is a source-side data gap, not an
oikb failure. Previously any such case landed in result.errors and made the
whole sync exit 1, so a library with a few file-less attachments could never
report success (a recurring systemd/daemon run would always look "failed").

Introduce a dedicated SourceFileUnavailable exception in the connectors base.
The Zotero connector raises it (instead of a bare RuntimeError) for the
no-downloadable-file case. run_sync catches it specifically (no retry) and
routes the file to a new result.warnings list instead of result.errors; the
CLI and daemon surface warnings but only result.errors affects the exit code
and success/partial status. Any other read_file() exception still fails the
run exactly as before, so this narrows to that one error class only.

Tests cover all three paths: the Zotero missing-file mapping, warning routing
in run_sync, and that a generic read failure is still an error.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
thiswillbeyourgithub marked this pull request as ready for review June 22, 2026 15:39
thiswillbeyourgithub added a commit to thiswillbeyourgithub/openwebui-knowledge-zotero-sync that referenced this pull request Jun 22, 2026
The Zotero connector port has been opened as a draft PR against the official
open-webui/oikb project (open-webui/oikb#61). Update the
deprecation notice to point at it; this repo will be sunset once it lands.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>