diff --git a/.gitignore b/.gitignore index 9c816af..c969f54 100644 --- a/.gitignore +++ b/.gitignore @@ -5,6 +5,7 @@ __pycache__/ dist/ build/ .venv/ +.env .eggs/ *.egg .pytest_cache/ diff --git a/CHANGELOG.md b/CHANGELOG.md index 2c2ca60..06fa7c9 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,18 @@ All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). +## [0.3.8] - 2026-06-03 + +### Added + +- **Gitea**: wildcard owner/org source via `gitea:owner/*` to sync every repository under an owner into one Knowledge Base. Files are prefixed by repository name to avoid path collisions. + +## [0.3.7] - 2026-06-02 + +### Added + +- **Gitea**: repository source connector via `gitea:owner/repo`, configurable with `GITEA_URL` and optional `GITEA_TOKEN`. Supports branch and subdirectory scoping, uses Gitea Git tree blob SHAs for incremental sync checksums, and syncs files through the raw content API. + ## [0.3.6] - 2026-05-28 ### Added diff --git a/README.md b/README.md index cb74bda..c7fbfca 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # 📚 oikb -Keep your [Open WebUI](https://github.com/open-webui/open-webui) Knowledge Bases in sync. Point it at a local directory, a GitHub repo, a Confluence space, an S3 bucket, or any of 44 supported sources. Only new and modified files are uploaded via incremental SHA-256 diffing. +Keep your [Open WebUI](https://github.com/open-webui/open-webui) Knowledge Bases in sync. Point it at a local directory, a GitHub repo, a Confluence space, an S3 bucket, or any of 45 supported sources. Only new and modified files are uploaded via incremental SHA-256 diffing. 
> [!IMPORTANT] > Requires **Open WebUI 0.9.6+** @@ -134,11 +134,11 @@ services: timeout: 5s ``` -## 44 Connectors +## 45 Connectors | Category | Sources | |---|---| -| **Code Repos** | GitHub, GitLab, Bitbucket | +| **Code Repos** | GitHub, Gitea, GitLab, Bitbucket | | **Cloud Storage** | S3, GCS, Azure Blob, Dropbox, R2, Google Drive, SharePoint, Egnyte, Oracle Cloud | | **Wikis & KBs** | Confluence, Notion, BookStack, Discourse, GitBook, Guru, Outline, Slab, Document360, DokuWiki, Google Sites | | **Ticketing** | Jira, Linear, Zendesk, Freshdesk, Asana, ClickUp, Airtable, ServiceNow, ProductBoard | @@ -150,6 +150,8 @@ services: ```bash oikb sync github:owner/repo --kb-id your-kb-id +GITEA_URL=https://gitea.example.com oikb sync gitea:owner/repo --kb-id your-kb-id +GITEA_URL=https://gitea.example.com oikb sync 'gitea:owner/*' --kb-id your-kb-id oikb sync confluence:ENG --kb-id your-kb-id oikb sync s3://bucket/prefix --kb-id your-kb-id oikb sync servicenow:incident --kb-id your-kb-id diff --git a/docs/guide.md b/docs/guide.md index 7323dc0..4b2b659 100644 --- a/docs/guide.md +++ b/docs/guide.md @@ -18,6 +18,7 @@ A complete guide to syncing content into Open WebUI Knowledge Bases. - [Sources](#sources) - [Local Directories](#local-directories) - [GitHub](#github) + - [Gitea](#gitea) - [GitLab / Bitbucket](#gitlab--bitbucket) - [Confluence](#confluence) - [Cloud Storage (S3 / GCS / Azure)](#cloud-storage-s3--gcs--azure) @@ -232,6 +233,41 @@ sources: path: docs/ ``` +### Gitea + +```bash +export GITEA_URL=https://gitea.example.com +export GITEA_TOKEN=your-token # required for private repos +oikb sync gitea:owner/repo --kb-id your-kb-id +``` + +Gitea requires `GITEA_URL` because instances are self-hosted. Set `GITEA_TOKEN` for private repositories or higher API limits. 
Like GitHub, Gitea syncs the repository's default branch unless `--branch` is given, and supports subdirectory selection with `--path`: + +```bash +oikb sync gitea:owner/repo --branch main --path docs/ +``` + +To sync every repository owned by a Gitea user or organization, use `*` as the repository name (quote the source so the shell does not expand the glob). Files are stored under a repository-name prefix to prevent collisions: + +```bash +oikb sync 'gitea:owner/*' --kb-id your-kb-id +``` + +Or in `.oikb.yaml`: + +```yaml +sources: + - name: gitea-docs + source: gitea:myorg/docs + kb-id: abc123 + branch: main + path: docs/ + + - name: all-gitea-repos + source: gitea:myorg/* + kb-id: abc123 +``` + ### GitLab / Bitbucket ```bash @@ -316,11 +352,11 @@ sources: ### All Connectors -44 connectors available. See the full list: +45 connectors available. See the full list: | Category | Sources | |---|---| -| **Git** | GitHub, GitLab, Bitbucket | +| **Git** | GitHub, Gitea, GitLab, Bitbucket | | **Cloud Storage** | S3, GCS, Azure Blob, Dropbox, R2, Google Drive, SharePoint, Egnyte, Oracle Cloud | | **Wikis & KBs** | Confluence, Notion, BookStack, Discourse, GitBook, Guru, Outline, Slab, Document360, DokuWiki, Google Sites | | **Ticketing** | Jira, Linear, Zendesk, Freshdesk, Asana, ClickUp, Airtable, ServiceNow, ProductBoard | diff --git a/pyproject.toml b/pyproject.toml index 7db85d2..2aa3a7d 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,6 +1,6 @@ [project] name = "oikb" -version = "0.3.6" +version = "0.3.8" description = "Sync anything to Open WebUI Knowledge Bases" readme = "README.md" authors = [ diff --git a/src/oikb/__init__.py b/src/oikb/__init__.py index 28c108d..f1f5742 100644 --- a/src/oikb/__init__.py +++ b/src/oikb/__init__.py @@ -1,3 +1,3 @@ """oikb — Open WebUI Knowledge Base CLI.""" -__version__ = "0.3.5" +__version__ = "0.3.8" diff --git a/src/oikb/cli.py b/src/oikb/cli.py index b7bffbc..9991b50 100644 --- a/src/oikb/cli.py +++ b/src/oikb/cli.py @@ -50,6 +50,11 @@ def _resolve_connector(source: str, branch: str | None = None, path: str | 
None parsed = parse_github_source(source) return GitHubConnector(owner=parsed["owner"], repo=parsed["repo"], branch=branch, path=path or parsed.get("path")) + if source.startswith("gitea:"): + from oikb.connectors.gitea import GiteaConnector, parse_gitea_source + parsed = parse_gitea_source(source) + return GiteaConnector(owner=parsed["owner"], repo=parsed["repo"], branch=branch, path=path or parsed.get("path")) + if source.startswith("gitlab:"): from oikb.connectors.gitlab import GitLabConnector, parse_gitlab_source parsed = parse_gitlab_source(source) @@ -349,7 +354,7 @@ def _build_cli_filter(max_file_size: str | None): @cli.command() @click.argument("source", required=False) @common_options -@click.option("--branch", default=None, help="Branch for GitHub sources.") +@click.option("--branch", default=None, help="Branch for Git repository sources.") @click.option("--path", "source_path", default=None, help="Subdirectory within the source.") @click.option("--dry-run", is_flag=True, help="Preview changes without uploading.") @click.option("-v", "--verbose", is_flag=True, help="Show detailed progress.") @@ -506,7 +511,7 @@ def sync( @cli.command() @click.argument("source") @common_options -@click.option("--branch", default=None, help="Branch for GitHub sources.") +@click.option("--branch", default=None, help="Branch for Git repository sources.") @click.option("--path", "source_path", default=None, help="Subdirectory within the source.") @click.option("-v", "--verbose", is_flag=True, help="Show detailed output.") @click.pass_context @@ -713,7 +718,7 @@ def status(url: str | None, token: str | None, kb: str | None): try: info = client.get_kb(kb) - files = info.get("files", []) + files = info.get("files") or [] except Exception as e: click.echo(click.style(f"Failed: {e}", fg="red"), err=True) sys.exit(1) diff --git a/src/oikb/client.py b/src/oikb/client.py index f40752e..febe757 100644 --- a/src/oikb/client.py +++ b/src/oikb/client.py @@ -135,4 +135,4 @@ def 
list_kb_files(self, kb_id: str) -> list[dict[str, Any]]: resp = self._http.get(f"/knowledge/{kb_id}") resp.raise_for_status() data = resp.json() - return data.get("files", []) + return data.get("files") or [] diff --git a/src/oikb/connectors/gitea.py b/src/oikb/connectors/gitea.py new file mode 100644 index 0000000..8fb3f68 --- /dev/null +++ b/src/oikb/connectors/gitea.py @@ -0,0 +1,217 @@ +"""Gitea connector — sync Gitea repos to a Knowledge Base via the API. + +Uses the Gitea Git Trees API for checksums (blob SHAs) — no local clone needed. +Set GITEA_URL to your instance URL, and GITEA_TOKEN for private repositories. +""" + +from __future__ import annotations + +import os + +import httpx + +from oikb.connectors import BaseConnector, ManifestEntry + + +class GiteaConnector(BaseConnector): + """Sync files from one Gitea repository, or all repos for an owner. + + Args: + owner: Repository owner or organization. + repo: Repository name, or "*" for all repos owned by owner. + branch: Branch to sync from (default: repo default branch). + path: Subdirectory to scope to (e.g. "docs/"). + token: Gitea personal access token (or GITEA_TOKEN env var). + base_url: Gitea instance URL (or GITEA_URL env var). + """ + + def __init__( + self, + owner: str, + repo: str, + branch: str | None = None, + path: str | None = None, + token: str | None = None, + base_url: str | None = None, + ): + self.owner = owner + self.repo = repo + self.branch = branch + self.path = path.strip("/") if path else None + self._token = token or os.environ.get("GITEA_TOKEN") + self._base_url = (base_url or os.environ.get("GITEA_URL") or "").rstrip("/") + self._default_branches: dict[str, str] = {} + + if not self._base_url: + raise ValueError("GITEA_URL is required for gitea: sources (e.g. 
https://gitea.example.com)") + + headers: dict[str, str] = {"Accept": "application/json"} + if self._token: + headers["Authorization"] = f"token {self._token}" + + self._http = httpx.Client( + base_url=f"{self._base_url}/api/v1", + headers=headers, + timeout=60.0, + ) + + def build_manifest(self) -> list[ManifestEntry]: + """Fetch the repo tree and build a manifest. + + Gitea paginates the recursive tree endpoint. Blob SHAs are used as + checksums because they are content-addressable hashes. + """ + if self._all_repos: + entries: list[ManifestEntry] = [] + for repo in self._list_repos(): + entries.extend(self._build_repo_manifest(repo, prefix_repo=True)) + entries.sort(key=lambda e: e.display_path) + return entries + + return self._build_repo_manifest(self.repo) + + def _build_repo_manifest(self, repo: str, prefix_repo: bool = False) -> list[ManifestEntry]: + """Fetch one repo tree and build manifest entries.""" + ref = self.branch or self._get_default_branch(repo) + entries: list[ManifestEntry] = [] + seen_items = 0 + page = 1 + + while True: + resp = self._http.get( + f"/repos/{self.owner}/{repo}/git/trees/{ref}", + params={"recursive": "true", "per_page": 100, "page": page}, + ) + resp.raise_for_status() + tree = resp.json() + items = tree.get("tree", []) + + if not items: + break + + seen_items += len(items) + + for item in items: + if item.get("type") != "blob": + continue + + file_path = item["path"] + + # Filter by path prefix if specified. + if self.path: + if not file_path.startswith(self.path + "/"): + continue + # Strip the prefix so paths are relative to the scoped dir. + file_path = file_path[len(self.path) + 1 :] + + parts = file_path.rsplit("/", 1) + if len(parts) == 2: + dir_path, filename = parts + else: + dir_path, filename = "", parts[0] + + if prefix_repo: + dir_path = f"{repo}/{dir_path}" if dir_path else repo + + entries.append( + ManifestEntry( + filename=filename, + path=dir_path, + checksum=item["sha"], # Git blob SHA — content-addressable. 
+ size=item.get("size", 0), + ) + ) + + total_count = tree.get("total_count") + if total_count is not None: + if seen_items >= total_count: + break + elif len(items) < 100: + break + + page += 1 + + entries.sort(key=lambda e: e.display_path) + return entries + + def read_file(self, path: str, filename: str) -> bytes: + """Download a file's raw content via the Gitea raw file endpoint.""" + file_path = f"{path}/{filename}" if path else filename + repo = self.repo + + if self._all_repos: + repo, _, file_path = file_path.partition("/") + if not repo or not file_path: + raise ValueError(f"Invalid wildcard Gitea path: {path}/{filename}") + + if self.path: + file_path = f"{self.path}/{file_path}" + + ref = self.branch or self._get_default_branch(repo) + + resp = self._http.get( + f"/repos/{self.owner}/{repo}/raw/{file_path}", + params={"ref": ref}, + ) + resp.raise_for_status() + return resp.content + + @property + def _all_repos(self) -> bool: + return self.repo == "*" + + def _get_default_branch(self, repo: str) -> str: + """Fetch and cache the repo's default branch name.""" + if repo not in self._default_branches: + resp = self._http.get(f"/repos/{self.owner}/{repo}") + resp.raise_for_status() + self._default_branches[repo] = resp.json()["default_branch"] + return self._default_branches[repo] + + def _list_repos(self) -> list[str]: + """List repositories for the configured owner or organization.""" + repos: list[str] = [] + page = 1 + + while True: + resp = self._http.get(f"/orgs/{self.owner}/repos", params={"page": page, "limit": 50}) + if resp.status_code == 404: + resp = self._http.get(f"/users/{self.owner}/repos", params={"page": page, "limit": 50}) + resp.raise_for_status() + items = resp.json() + + if not items: + break + + repos.extend(repo["name"] for repo in items) + + if len(items) < 50: + break + page += 1 + + repos.sort() + return repos + + def close(self) -> None: + self._http.close() + + +def parse_gitea_source(source: str) -> dict[str, str | None]: + 
"""Parse a gitea:owner/repo[/path] source string. + + Examples: + gitea:myorg/docs + gitea:myorg/docs/api + gitea:myorg/* + """ + source = source.removeprefix("gitea:") + + parts = source.split("/", 2) + if len(parts) < 2: + raise ValueError(f"Invalid Gitea source: {source}. Expected: gitea:owner/repo") + + owner = parts[0] + repo = parts[1] + path = parts[2] if len(parts) > 2 else None + + return {"owner": owner, "repo": repo, "path": path} \ No newline at end of file diff --git a/tests/test_cli_status.py b/tests/test_cli_status.py new file mode 100644 index 0000000..afde391 --- /dev/null +++ b/tests/test_cli_status.py @@ -0,0 +1,31 @@ +from __future__ import annotations + +from click.testing import CliRunner + +from oikb import cli as cli_module +from oikb.cli import cli + + +class DummyClient: + def get_kb(self, kb_id: str) -> dict: + return { + "id": kb_id, + "name": "Docs", + "description": "", + "files": None, + } + + def close(self) -> None: + pass + + +def test_status_handles_null_files(monkeypatch) -> None: + monkeypatch.setattr(cli_module, "resolve_url", lambda value=None: "https://openwebui.example.com") + monkeypatch.setattr(cli_module, "resolve_token", lambda value=None: "token") + monkeypatch.setattr("oikb.client.OikbClient", lambda *args, **kwargs: DummyClient()) + + result = CliRunner().invoke(cli, ["status", "--kb-id", "kb123"]) + + assert result.exit_code == 0 + assert "Knowledge Base: Docs" in result.output + assert "Files: 0" in result.output diff --git a/tests/test_gitea_connector.py b/tests/test_gitea_connector.py new file mode 100644 index 0000000..e140ecb --- /dev/null +++ b/tests/test_gitea_connector.py @@ -0,0 +1,153 @@ +from __future__ import annotations + +import pytest +import respx +from httpx import Response + +from oikb.cli import _resolve_connector +from oikb.connectors.gitea import GiteaConnector, parse_gitea_source + + +def test_parse_gitea_source_with_path() -> None: + assert parse_gitea_source("gitea:owner/repo/docs/api") == { + 
"owner": "owner", + "repo": "repo", + "path": "docs/api", + } + + +def test_parse_gitea_source_with_wildcard() -> None: + assert parse_gitea_source("gitea:owner/*") == { + "owner": "owner", + "repo": "*", + "path": None, + } + + +def test_parse_gitea_source_rejects_missing_repo() -> None: + with pytest.raises(ValueError, match="Expected: gitea:owner/repo"): + parse_gitea_source("gitea:owner") + + +def test_gitea_connector_requires_base_url(monkeypatch: pytest.MonkeyPatch) -> None: + monkeypatch.delenv("GITEA_URL", raising=False) + + with pytest.raises(ValueError, match="GITEA_URL is required"): + GiteaConnector(owner="owner", repo="repo") + + +def test_resolve_connector_dispatches_gitea(monkeypatch: pytest.MonkeyPatch) -> None: + monkeypatch.setenv("GITEA_URL", "https://gitea.example.com") + + connector = _resolve_connector("gitea:owner/repo/docs", branch="main") + + assert isinstance(connector, GiteaConnector) + assert connector.owner == "owner" + assert connector.repo == "repo" + assert connector.branch == "main" + assert connector.path == "docs" + connector.close() + + +@respx.mock +def test_build_manifest_scopes_path_and_paginates() -> None: + base = "https://gitea.example.com/api/v1" + respx.get(f"{base}/repos/owner/repo/git/trees/main").mock( + side_effect=[ + Response( + 200, + json={ + "total_count": 3, + "tree": [ + {"path": "README.md", "type": "blob", "sha": "root", "size": 10}, + {"path": "docs/a.md", "type": "blob", "sha": "sha-a", "size": 20}, + ], + }, + ), + Response( + 200, + json={ + "total_count": 3, + "tree": [ + {"path": "docs/sub/b.md", "type": "blob", "sha": "sha-b", "size": 30}, + ], + }, + ), + ] + ) + + connector = GiteaConnector( + owner="owner", + repo="repo", + branch="main", + path="docs", + base_url="https://gitea.example.com", + ) + + manifest = connector.build_manifest() + + assert [entry.display_path for entry in manifest] == ["a.md", "sub/b.md"] + assert [entry.checksum for entry in manifest] == ["sha-a", "sha-b"] + assert 
[entry.size for entry in manifest] == [20, 30] + + +@respx.mock +def test_build_manifest_with_wildcard_prefixes_repo_names() -> None: + base = "https://gitea.example.com/api/v1" + respx.get(f"{base}/orgs/owner/repos").mock( + return_value=Response(200, json=[{"name": "repo-a"}, {"name": "repo-b"}]) + ) + respx.get(f"{base}/repos/owner/repo-a").mock(return_value=Response(200, json={"default_branch": "main"})) + respx.get(f"{base}/repos/owner/repo-b").mock(return_value=Response(200, json={"default_branch": "trunk"})) + respx.get(f"{base}/repos/owner/repo-a/git/trees/main").mock( + return_value=Response( + 200, + json={"total_count": 1, "tree": [{"path": "README.md", "type": "blob", "sha": "a", "size": 1}]}, + ) + ) + respx.get(f"{base}/repos/owner/repo-b/git/trees/trunk").mock( + return_value=Response( + 200, + json={"total_count": 1, "tree": [{"path": "docs/index.md", "type": "blob", "sha": "b", "size": 2}]}, + ) + ) + + connector = GiteaConnector(owner="owner", repo="*", base_url="https://gitea.example.com") + + manifest = connector.build_manifest() + + assert [entry.display_path for entry in manifest] == ["repo-a/README.md", "repo-b/docs/index.md"] + assert [entry.checksum for entry in manifest] == ["a", "b"] + + +@respx.mock +def test_read_file_downloads_raw_content_with_path_scope() -> None: + base = "https://gitea.example.com/api/v1" + route = respx.get(f"{base}/repos/owner/repo/raw/docs/sub/file.md").mock( + return_value=Response(200, content=b"hello") + ) + + connector = GiteaConnector( + owner="owner", + repo="repo", + branch="main", + path="docs", + base_url="https://gitea.example.com", + ) + + assert connector.read_file("sub", "file.md") == b"hello" + assert route.calls.last.request.url.params["ref"] == "main" + + +@respx.mock +def test_read_file_with_wildcard_routes_to_repo_from_path() -> None: + base = "https://gitea.example.com/api/v1" + respx.get(f"{base}/repos/owner/repo-a").mock(return_value=Response(200, json={"default_branch": "main"})) + route = 
respx.get(f"{base}/repos/owner/repo-a/raw/docs/file.md").mock( + return_value=Response(200, content=b"hello") + ) + + connector = GiteaConnector(owner="owner", repo="*", base_url="https://gitea.example.com") + + assert connector.read_file("repo-a/docs", "file.md") == b"hello" + assert route.calls.last.request.url.params["ref"] == "main" \ No newline at end of file
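
Review note: the tests above cover path scoping and wildcard mode separately, but not the two combined. The sketch below restates the path mapping from `_build_repo_manifest` and `read_file` as two standalone functions, to check that a manifest entry produced with both a `path` scope and a repository prefix maps back to the original tree path. The names `to_manifest_path` and `to_raw_path` are hypothetical and do not exist in the connector; this approximates the logic in the diff under that assumption and is not the connector's actual code.

```python
from __future__ import annotations


def to_manifest_path(
    repo: str, tree_path: str, scope: str | None, prefix_repo: bool
) -> tuple[str, str] | None:
    """Map a Git tree path to (dir_path, filename); None if outside the scope.

    Hypothetical helper mirroring the loop body of _build_repo_manifest.
    """
    if scope:
        if not tree_path.startswith(scope + "/"):
            return None
        # Strip the scope so paths are relative to the scoped directory.
        tree_path = tree_path[len(scope) + 1:]

    # Split off the final component; dir_path is "" for top-level files.
    dir_path, _, filename = tree_path.rpartition("/")

    if prefix_repo:
        dir_path = f"{repo}/{dir_path}" if dir_path else repo
    return dir_path, filename


def to_raw_path(
    dir_path: str, filename: str, scope: str | None, wildcard: bool, repo: str = ""
) -> tuple[str, str]:
    """Map a manifest entry back to (repo, path-in-repo) for the raw endpoint.

    Hypothetical helper mirroring the path handling in read_file.
    """
    file_path = f"{dir_path}/{filename}" if dir_path else filename

    if wildcard:
        # In wildcard mode the first path component is the repository name.
        repo, _, file_path = file_path.partition("/")
        if not repo or not file_path:
            raise ValueError(f"Invalid wildcard Gitea path: {dir_path}/{filename}")

    if scope:
        # Re-apply the scope that was stripped when the manifest was built.
        file_path = f"{scope}/{file_path}"
    return repo, file_path


# Round trip with both a path scope and a repository prefix.
entry = to_manifest_path("repo-b", "docs/sub/b.md", scope="docs", prefix_repo=True)
restored = to_raw_path(*entry, scope="docs", wildcard=True)
```

Under these assumptions the round trip is lossless: the scope is stripped before the repository prefix is added, and `to_raw_path` undoes the two steps in reverse order. A test along these lines in `tests/test_gitea_connector.py` would cover the combined case.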