Skip to content

v2: full product release — LanceDB search engine + multi-user Team mode#1

Draft
valginer0 wants to merge 192 commits into
mainfrom
dev/v2
Draft

v2: full product release — LanceDB search engine + multi-user Team mode#1
valginer0 wants to merge 192 commits into
mainfrom
dev/v2

Conversation

@valginer0

Copy link
Copy Markdown
Owner

Summary

Full v2 product release replacing the previous version on main: backend LanceDB parent-child search engine, multi-user Team mode with complete access control, and supporting infrastructure. 177 commits, 109 files changed (+15,461 / −667).

Major areas

  • Search: backend-side LanceDB integration with parent-child staging (FTS picks document, vector ranks within), semantic rescue pool auto-sizing, drift detection + self-healing sync with retry guards. Recall validated at real scale: hit@10 0.77 raw / ~0.83 adjusted.
  • Access control (Team mode): per-user document ownership and visibility (shared/private), role permissions, collection grants, visibility-filtered search and read surfaces (listing, tree, statistics, metadata discovery, quarantine, locks, indexing runs), overwrite/takeover protection with replacement-safe rollback.
  • Multi-user infra: users, API keys, SCIM, SAML, licensing, retention policies, quarantine, activity log.

Review history

  • Three independent third-party review cycles (2026-06-11); all P1/P2 findings fixed and verified: c2483ea (/index overwrite guard + replacement-safe rollback), 830e7be + 38a2f20 (visibility filtering on all read surfaces), 6707cca (connection-leak sweep).
  • Full suite: 1867 passed, 32 skipped, 0 failures. CI green through 38a2f20.

Documented scope choices (not blockers)

  • T6 operational-writes permission mapping (later)
  • /index allowed-roots design (later)
  • MCP not exposed from the multi-user backend
  • /documents/locks/check answers for any known path (lock coordination, not enumeration)

Remaining before merge

  • /code-review ultra pass on this PR
  • Two-account manual smoke test in the desktop app

valginer0 added 30 commits June 9, 2026 22:29
Private documents owned by another user are now excluded from search on
BOTH engines (LanceDB parent-child and all Postgres paths) via a reserved
excluded_document_ids filter pushed down from the /search endpoint.
Admins and local/no-auth mode are unfiltered; identity resolution fails
closed on DB errors instead of leaking.

Gap pre-dated the LanceDB integration: visibility_where_clause() existed
but was never applied to any search path.
…ion grants

- /index and /upload-and-index assign owner_id to the uploading user
  (auth mode, best-effort) so private visibility actually takes effect
- Documents tab context menu: Make Private / Make Shared, with an
  explanatory dialog when a doc has no owner yet
- New role_collection_grants (migration 020) + collection_grants.py:
  role with no grants = unrestricted; '*' wildcard; admins exempt;
  empty allowlist fails closed
- Admin endpoints GET /roles/collections, PUT/DELETE
  /roles/{role}/collections/{namespace}
- Search enforces grants via reserved allowed_namespaces filter in all
  four filter builders (LanceDB + 3 Postgres paths); client-supplied
  values are overwritten, never widening access

24 new tests; full safe suite 1808 passed / 0 failed.
Builds the Docker image from the commit, starts the compose stack on
clean volumes with API_REQUIRE_AUTH + API_AUTH_FORCE_ALL, runs
migrations, and proves: 401 without key, authenticated index, search
served (LanceDB), persistence across restart, and self-heal from
induced LanceDB drift. This is the §6 deployment-artifact gate run on
every push — the Linux-server install path customers actually use.

Also adds the 'Validating Team mode' staging checklist to the public
Access Control Guide.
…ility writers

Review findings (3rd-party, 2026-06-10):
- /context now applies the same visibility exclusions and collection
  grants as /search (shared _apply_access_filters helper so the two
  endpoints cannot drift); retriever get_context accepts filters
- /documents/export is admin-only: it exports full text regardless of
  visibility, and silently filtering a backup would be worse
- visibility writers re-raise on DB error instead of returning 0
  (which the API mapped to a misleading 404)
- LanceDB clause builder: escape extensions and per-parent doc ids;
  comment the JSON-metadata LIKE fallback false-positive; whitespace
- restore endpoint docstring documents the destructive rollback edge
- guide enforcement-scope section updated (/context enforced,
  /export admin-only)

Suite: 1812 passed / 0 failed.
Restore overwrites existing documents by document_id with
caller-supplied content and embeddings — including other users'
private documents — so it is an admin operation like export.
Also adds a Security section to the unreleased changelog.
…ermission-gate deletes

Full-route audit (224 routes) against a single enforcement table
(docs/internal/API_ENFORCEMENT_AUDIT_20260611.md in the private repo).
Two gaps fixed:

- PUT /documents/{id}/visibility was open to any authenticated key:
  a user could flip another user's private document to shared, then
  read it via search — bypassing visibility enforcement entirely.
  Now requires documents.visibility + ownership of the document
  (documents.visibility.all for others'/unowned docs); local mode and
  bootstrap unaffected.
- DELETE /documents/{id} and /documents/bulk-delete now enforce the
  existing documents.delete permission (built-in user/researcher roles
  cannot delete; admin/sre can).

Verified non-gaps: /documents/{id} and tree search return names and
metadata only (T5 carve-out); SCIM has its own bearer-token auth.

Suite: 1818 passed / 0 failed.
…uard

Self-review findings (same gap class the external review caught):
- /index and /upload-and-index now require documents.write (was: any
  authenticated key — even the read-only support role could upload)
- upload takeover closed: document_id derives from the display name and
  the upload path deletes+replaces the existing document, with
  auto-ownership then reassigning it to the uploader. Any key could
  overwrite and take ownership of another user's (incl. private)
  document by uploading a same-named file. Replacing an owned document
  now requires the owner or admin; identical-hash re-uploads still skip
  benignly; unowned/local-mode behavior unchanged.

Suite: 1825 passed / 0 failed.
…ew P1s)

- /index now enforces the same overwrite guard as /upload-and-index via a
  may_replace callback checked just before an existing document is replaced;
  denial maps to 403 (ReplacementNotAuthorizedError). Identical-hash skips
  never consult the guard, so no-op re-indexing still works for everyone.
- Replacing a document no longer loses the old version on failure: both
  /index and /upload-and-index back up the existing chunks, defer the delete
  until the replacement is ready to insert, and restore the backup if the
  insert or LanceDB upsert fails. Partial LanceDB replacements are dropped
  best-effort so drift repair resyncs from PostgreSQL.
- Readiness checks that fail with an exception now record a sentinel drift
  signature, so a repair sync that failed for that broken state is not
  relaunched on every readiness check (P2).
- Tests: unit coverage for guard wiring + rollback on both routes, plus a
  real-DB integration test proving old content survives a failed replacement.
Closes the last open review finding: read endpoints no longer expose other
users' private documents (names, paths, metadata, counts).

- New document_visibility helpers: visibility_clause_for_key_record (SQL
  filter for the caller), is_admin_key_record, and
  filter_entries_by_hidden_source (path-keyed auxiliary records).
- Repository queries accept an optional visibility clause: list_documents,
  get_document_by_id, get_statistics, get_indexed_extensions,
  get_metadata_keys, get_metadata_values.
- Document tree (children/stats/search) filters the Postgres source via the
  SQL clause and the LanceDB source via the hidden-document-id list (same
  exclusion mechanism /search already uses).
- Routes wired: /documents, /documents/{id} (hidden docs 404), tree x3,
  /statistics, /extensions, /metadata/keys, /metadata/values.
- Secondary surfaces: lock listings and indexing-run history omit entries
  mapping to hidden documents; encrypted-PDF report is scoped to the
  caller's own uploads unless admin (clear= likewise).
- ACCESS_CONTROL_GUIDE.md: removed the 'listing not yet enforced' caveat,
  documented the new behavior.
- 25 new tests (helpers, route wiring, tree filtering, secondary surfaces,
  plus a DB-backed integration test proving the SQL paths filter).
asyncio.get_event_loop().run_until_complete() implicitly creates a loop on
Linux but raises 'no current event loop in thread MainThread' on the Windows
runner's proactor policy. asyncio.run() works on both.
…ntine visibility gaps

Local ultrareview pass findings (P1 + P2s):

- P1 SQL injection: search_similar/preview_delete/bulk_delete interpolated
  arbitrary user-supplied filter KEYS into the WHERE clause via f-string
  (`f"{key} = %s"`). A crafted key (e.g. "1=1 OR ...") injected SQL and
  could OR past the AND-joined visibility/exclusion clauses, leaking other
  users' private documents. Now restricted to an ALLOWED_FILTER_COLUMNS
  allowlist; unknown keys raise ValueError -> HTTP 400 (fail-closed).

- GET /stats leaked global document/chunk counts (system_api) while the twin
  /statistics was visibility-scoped. Threaded visibility through
  DocumentIndexer.get_statistics and the route.

- POST /documents/bulk-delete preview enumerated hidden documents and the
  delete could remove them. Both preview_delete and bulk_delete now take a
  visibility clause; the route scopes Postgres by the clause and the LanceDB
  dual-delete by the caller's hidden-id exclusion list, so the two stores
  delete the same visible set (no drift). Admins unrestricted.

- POST /quarantine/{uri}/restore and /quarantine/purge were only
  require_api_key while the listing is visibility-filtered, making restore an
  existence oracle + unauthorized mutation for hidden docs. Now require_admin.

Tests: 15 new (injection rejection for all 3 builders, allowed-column pass,
visibility scoping, route wiring for /stats and bulk-delete, admin-gate
assertion for quarantine). Full suite 1883 passed, 32 skipped.
…faced; persist FAILED; drop spike artifacts

Lower-severity ultrareview findings:

- /search resolved the backend (_should_use_lancedb, which runs the readiness
  gate and can raise) up to 3x per request, including after results were
  computed — a concurrent mutation flipping readiness mid-request turned a
  successful search into a 503. Now resolved once, before search.

- The empty-index count (3 aggregate scans over document_chunks, or a LanceDB
  stats call) ran on every search just to craft an 'index is empty' message.
  Now computed only when the search returned no results.

- A hybrid request (use_hybrid/hybrid_mode) served by the default LanceDB
  engine was validated then silently ignored. Now surfaced via a
  diagnostics.engine_override note (non-breaking; pass source=postgres for
  hybrid).

- check_readiness cached dirty=False after a drift FAILED, so later cached
  calls returned generic NOT_READY and the 'manual repair required' state was
  shown only once. Now caches the status string so FAILED persists.

- Removed rejected/superseded spike artifacts from the release: the
  pg_textsearch spike (compare script, probe SQL, docker-compose, PG17
  Dockerfile) and the superseded search_lancedb_compare.py harness. Kept
  experiments/semantic_pool_sizing/ — it is cited by README as the rationale
  for the auto-sizing defaults.

Tests: +5 /search route behavior, +1 FAILED-persistence; full suite 1889 passed.
The note implied the LanceDB engine isn't hybrid ("pass source=postgres to
use hybrid search"). It is — it does lexical+vector retrieval. Reworded to say
the legacy PostgreSQL hybrid_mode parameter doesn't apply to the LanceDB engine
and that source=postgres runs the legacy hybrid modes.
…ross backends)

Advise mapping high-precision filter keys to type/namespace/category (exact on
both backends) rather than arbitrary metadata.* (exact on PostgreSQL, substring
fallback on LanceDB). Clarifies it affects match precision only, not visibility.
scripts/smoke_team_setup.sh + docker-compose.smoke-team.yml automate the manual
prep in ACCESS_CONTROL_GUIDE (§ Validating Team mode): builds the app image from
this branch, brings up an isolated stack (distinct project/container names, ports
8001/5433 so it won't collide with a dev stack), installs a disposable HS256 Team
license, creates an admin user + a regular (researcher/user) account, and prints
both API keys. Local validation tooling only — not CI/production (auth forced for
loopback; throwaway signing secret, never the production RS256 key).
The system_api /stats route read stats['database']['database_size'], a key the
repository never returns (it returns database_size_bytes), so the endpoint
500'd on every call (the desktop app uses the separate /statistics endpoint, so
it went unnoticed). Found while running the Team-mode smoke setup end-to-end.

Now reads database_size_bytes and renders a human-readable string. Also
corrected the stats route test, which had mocked the nonexistent
'database_size' key and therefore masked the bug — it now feeds
database_size_bytes like the real repository and asserts the formatted output.
Backend (tested):
- #1 Skip the startup pg_dump backup when document_chunks is empty, so an
  empty/fresh install never creates a backup that later trips auto-recovery's
  'backup exists + empty table' false 'DATA LOSS DETECTED' popup.
- #4 list_documents (+ /documents DocumentInfo) now return per-document
  visibility and owner_id.

Desktop UI (static-checked; needs visual confirmation):
- #5 (privacy-adjacent) On account/key change, clear+reload the Documents
  views and clear Search results so a switched identity never sees the prior
  user's cached docs. New DocumentsTab.on_account_changed / SearchTab.clear_results
  wired into _on_backend_settings_changed.
- #4 Lock icon + tooltip on private documents in the List view.
- #2 List context menu resolves the row's document from column 0 regardless of
  which column is right-clicked (was column-0 only).
- #3 Make Private/Shared added to the Tree view file context menu.
- #6 Always-visible current-user indicator in the status bar (from /me),
  refreshed on account switch.
- #7 App version shown in the status bar (from the version module), robust to
  the VERSION-file read that the title bar depends on.

Full suite: 1891 passed, 32 skipped.
The desktop health worker polls /health every 3s (20/min) and /health was
subject to the default 60/min rate limit, so sustained polling (plus other
traffic) exhausted the per-minute window and returned 429s, which the client
reads as 'server unreachable' (flapping). Health/liveness probes are infra
polling and must not consume a caller's rate-limit budget — now exempt
(/health, /healthz, /ready, /readiness, /live, /liveness, and /api/v1 variants).

Pre-existing latent issue (health worker, 60/min default, and the missing
exemption all predate recent work); surfaced by the first auth-enforced remote
smoke run. +1 test.
Root cause of the rate-limit flapping that blocked the UI: while the Org tab is
stuck on its loading placeholder, MainWindow.on_health_status_updated calls
org_tab.on_backend_healthy() every health tick (~3s). That method fires
probe_and_refresh() (a multi-request org probe: capabilities/identity/license/
users/roles/permissions) and then _begin_transient_retry_window() re-arms its
own _awaiting_backend_healthy_reprobe flag — so a probe that keeps failing
(e.g. rate-limited) re-fires every 3s (~120 req/min), exhausting the 60/min
limit immediately and continuously, with no user interaction. The flood also
kept the probe failing, so the tab never left the placeholder.

Fix: cap re-probes to once per 30s while stuck on the placeholder. Combined with
the /health rate-limit exemption (1e298e4), this removes both the flapping and
the flood. +3 regression tests.

Surfaced by the two-account smoke test; triggered when the initial org probe
fails (the recent connect-time UI requests made hitting the limit more likely).
On Rancher Desktop + WSL, bind-mounting the license file from the Linux home
into the container is unreliable, so the file-based license silently failed to
load and the smoke backend fell back to Community edition (which then 403s the
Team-gated /me, leaving the desktop current-user indicator blank). Install the
license into server_settings (DB) after migrations — mount-independent, same as
the in-app install flow — so the smoke stack reliably comes up as Team.
save_backend_settings set self.api_client._api_key (the private field) directly,
bypassing the api_key property setter that writes X-API-Key onto the shared
requests.Session. So switching the key in Settings updated the field but the
session kept sending the PREVIOUS user's key — account switching silently didn't
apply (uploads/visibility changes went out as the old user; e.g. a doc uploaded
'as admin' was actually owned by the prior Alice session). Use the property
setter so the session header updates. +1 regression test (uses a real client;
the existing MagicMock-based test couldn't catch it).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant