From 4ee7709e8cd47d8a16971216c37efe876ec0d8bf Mon Sep 17 00:00:00 2001 From: David Khachaturov Date: Wed, 24 Jun 2026 15:16:11 -0700 Subject: [PATCH] docs: add managed cloud roadmap Sequenced roadmap for an optional managed cloud companion modelled on DuckDB/MotherDuck, grounded in the existing generation-addressed commit format and the local invariants. Docs only; no code or dependency changes. Linked from the README enterprise note and the architecture overview. --- README.md | 3 +- docs/architecture.md | 2 + docs/cloud-roadmap.md | 335 ++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 339 insertions(+), 1 deletion(-) create mode 100644 docs/cloud-roadmap.md diff --git a/README.md b/README.md index cf2ed3c..4a0b07d 100644 --- a/README.md +++ b/README.md @@ -56,7 +56,8 @@ incrementally, so a commit stays **sub-millisecond even at 1M vectors**. > 🏢 **Enterprise** The LodeDB core is Apache-2.0 and free to use. Enterprise licensing is > available for commercial support, managed and at-scale serving, and on-prem / BYOC -> deployment. Contact [sales@egoistmachines.com](mailto:sales@egoistmachines.com). +> deployment. Contact [sales@egoistmachines.com](mailto:sales@egoistmachines.com). The +> direction for a managed cloud companion is mapped in [docs/cloud-roadmap.md](docs/cloud-roadmap.md). ## Install diff --git a/docs/architecture.md b/docs/architecture.md index 0a54afa..c426254 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -1,5 +1,7 @@ # Architecture +_For the optional managed cloud companion, see [cloud-roadmap.md](cloud-roadmap.md)._ + ## System overview ```mermaid diff --git a/docs/cloud-roadmap.md b/docs/cloud-roadmap.md new file mode 100644 index 0000000..59fdfd0 --- /dev/null +++ b/docs/cloud-roadmap.md @@ -0,0 +1,335 @@ +# Managed cloud roadmap + +This document maps a path from LodeDB, an embedded local-first vector database, to an +optional managed cloud companion, following the model DuckDB established with MotherDuck: +the local engine stays useful and primary on its own, and the cloud is an opt-in companion +that adds durable sync, managed serving, sharing, and scale. It is a direction, not a +commitment. The near-term, low-risk slice is called out under "Scope of investment", and +the managed-service milestones are explicitly demand-gated. + +## North star + +Lode Cloud is a companion, not a replacement. + +1. Local LodeDB stays offline-first, in-process, and no-auth by default. `import lodedb` + never loads auth or network code. +2. Cloud mode is opt-in through a separate package extra, a separate client class, and a + separate CLI command group. +3. The first managed capabilities are durable sync and backup, then managed read-only + serving of committed indexes. +4. Hybrid local/cloud query routing comes last, and it is retrieval routing plus result + fusion, not relational operator splitting. + +### The MotherDuck model and the honest translation + +MotherDuck is not "DuckDB on a server". It keeps the embedded engine authoritative on the +client and adds, in the cloud, a service layer (identity, authz, sharing), per-user +isolated compute, object-storage-backed durable state with copy-on-write differential +storage (which is what enables zero-copy clone, branch, and time travel), and a hybrid +planner that can split one SQL query across local and cloud. A client opts in with one line +(`ATTACH 'md:'`) authenticated by a token. + +The translation to LodeDB is mostly direct, with one deliberate exception: + +- A local client attaching a cloud database becomes a LodeDB handle syncing a committed + generation to and from a managed remote. +- Differential, copy-on-write storage maps onto a capability LodeDB already has. The + on-disk format is generation-addressed: committed state lives in immutable, + content-checksummed `g.*` artifacts under `.gen/`, pinned by a single atomic + root manifest, `.commit.json`. Cloud storage becomes "ship the checksummed artifacts + to object storage and publish the manifest", not a new storage engine. +- Per-user isolated compute maps onto a per-tenant engine instance serving that tenant's + path. +- The service layer maps onto a new authenticated service, the one component LodeDB + deliberately does not have today. +- Hybrid SQL execution does not map. LodeDB has no relational optimizer and no query plan to + split; a single exact nearest-neighbour scan over one index is not partitionable the way a + SQL join is, so copying bridge operators would be overbuilding. The honest analogue is + routing a whole query to local or cloud, or running it on both and fusing the results. + +## Invariants the cloud must preserve + +LodeDB's identity rests on properties enforced in code today. The cloud companion preserves +all four on the client and surfaces them as deliberate, opt-in choices on the server. + +1. **Loopback or private bind only.** `is_private_bind_host` (`engine/core.py`) rejects any + public bind, and the local HTTP server (`local/server.py`) refuses a non-private host. + The embedded engine is never exposed publicly. The cloud serving tier is a separate, + authenticated service that fronts its own local engine; the bind guard is never relaxed. +2. **Payload boundary.** Redacted artifacts (`.json`, `.jsd`, `.tvim`, `.tvd`) never carry + text. Raw text lives only in the opt-in `.tvtext` base plus `.txd` journal, and lexical + terms in the opt-in `.tvlex` base plus `.lxd` journal (`local/db.py`, + `engine/_commit_manifest.py`). Anything that leaves the machine respects this: text and + lexical sync are explicit per-store opt-ins, off by default, and control-plane telemetry + stays metrics-only. +3. **Single-writer, crash-atomic commits.** One writer per path via an OS advisory lock + (`engine/_filelock.py`); every commit is sealed by the atomic `.commit.json` swap, + which is the only thing that commits a generation; recovery rolls back to the last + committed generation (`engine/core.py`). The cloud extends this to one writer per remote + path via a lease, it does not replace it. +4. **Lean core.** Runtime dependencies are numpy, typer, sentence-transformers, and pyyaml, + and an import-boundary test (`tests/test_import_boundary.py`) guards against growth. All + cloud code lives behind a new optional extra and a new top-level package, importing + nothing into the embedded path. + +## Core idea: reuse the on-disk format as the cloud substrate + +The single most important decision is to treat the existing generation-addressed commit +format as the differential-storage layer rather than inventing a second one. It already +provides what a cloud durable store needs: + +- **Immutability.** Bases are epoch-addressed and never overwritten in place, so every + committed generation is a stable snapshot. +- **Content addressing.** Artifacts and the manifest body are sha256-checksummed + (`body_sha256`), so blobs deduplicate by checksum and integrity is verifiable on restore. +- **Atomic commit.** Publishing one new manifest pointer commits a whole generation; readers + see old-then-new, never torn. +- **O(changed) updates.** Writes append delta journals onto a fixed base epoch, so a cloud + push uploads only new segments or a new base, inheriting the engine's existing + incremental-commit property. + +Clone, branch, and time travel then fall out as catalog operations over immutable +generations rather than new storage machinery. + +One coupling to manage: the on-disk manifest schema (`COMMIT_MANIFEST_SCHEMA_VERSION`) and +the cloud wire protocol must be versioned independently, so a future on-disk bump does not +break deployed clients, and a cloud that retains generations long-term must keep reading +older format versions as the engine evolves. + +## Milestones + +### Milestone 0: Boundaries and this document + +Land this roadmap as `docs/cloud-roadmap.md`, linked from the README enterprise note and the +architecture overview. No cloud runtime dependencies and no code. Fix the cloud user stories +(back up a local index, query a managed copy without holding a local writer, share read-only +snapshots, run GPU batch serving, BYOC later) and the privacy policy (text and lexical +artifacts are payload-bearing and require explicit policy). This is the content of the first +pull request. + +### Milestone 1: Artifact-store abstraction and manifest export/import + +Introduce a storage abstraction for committed artifacts with the local filesystem as the +default implementation, modelled on the existing root pointer and generation artifacts (no +new format). Interface shape: + +```python +class ArtifactStore: + def read_bytes(self, name: str) -> bytes: ... + def write_bytes_if_absent(self, name: str, data: bytes, sha256: str) -> None: ... + def compare_and_swap_pointer(self, key: str, old_generation: int | None, new_body: dict) -> None: ... +``` + +Add a read-only generation-inventory helper that, given a path and index key, lists every +artifact the live `.commit.json` references with its checksum, size, and kind, +distinguishing a new delta segment on an existing base epoch from a new base epoch (so push +stays O(changed)). Add a manifest export/import that packages a committed generation without +reading uncommitted WAL tails, so only committed state ships. + +- Modules: `engine/artifact_store.py` (interface), `engine/local_artifact_store.py` + (default), and a read-only inventory helper. The only engine-side change is read-only. +- Decision: object-storage backends (S3, GCS, Azure) live in the optional cloud extra or a + cloud-service package, never in the open-core runtime dependencies. +- Risk: object stores have no `os.replace`. The root-pointer commit needs a + strongly-consistent conditional write or a metadata-service compare-and-swap, never + eventually-consistent list operations. + +### Milestone 2: Backup and restore (self-hosted or BYOC object store) + +The first real cloud feature, and the de-risking one. Upload the current committed +generation to object storage, and restore a generation into a local directory, verifying +manifest checksums before accepting it. Restore reuses the engine recovery path, so a torn +restore self-heals to the last good generation. + +- Text policy: the default uploads only redacted artifacts; `--include-text` adds + `.tvtext`/`.txd`; `--include-lexical` adds `.tvlex`/`.lxd` (lexical terms are + payload-derived and sensitive). +- CLI: `lodedb cloud push`, `lodedb cloud pull`, `lodedb cloud status` (compare local and + remote generation numbers and counts). +- This proves cloud durability without changing writer semantics. Combined with Milestone 1 + it is shippable as self-hosted or BYOC sync before any managed service exists. + +### Milestone 3: Control plane + +The service layer LodeDB lacks: a separate authenticated service, outside the engine path, +providing identity, orgs, projects, and tokens; a catalog of databases, indexes, snapshots, +owners, and share grants; generation-pointer metadata with the strongly-consistent +compare-and-swap from Milestone 1; a single-writer lease per index; and a metrics-only audit +log. + +- Client: a separate `lodedb.cloud` package and a `LodeDBCloud` client, plus a `lodedb cloud` + CLI group (login, sync, pull, status, share, serve). Cloud auth stays out of `LodeDB(...)`. + A thin ergonomic facade may lazily import the cloud client only when a remote is requested, + so the import boundary holds while integrations still get a single handle. +- Metadata DB: a strongly-transactional store (Postgres class) for the managed service. The + interface stays abstract so a local single-node implementation serves development, tests, + and self-host. +- Risk: this is the first networked auth surface and needs a security review. Tokens and + control-plane metadata must never contain documents, queries, vectors, embeddings, or + credentials. + +### Milestone 4: Managed read-only serving + +A cloud worker materialises a committed generation onto an ephemeral local SSD cache, opens +it with `read_only=True` (which takes no writer lock and loads the exact committed manifest), +and serves search, search_many, get, and stats over authenticated HTTPS. Many concurrent +readers per tenant are safe and lock-free. + +- This is a separate authenticated service, never the loopback dev server with auth bolted + on. +- Raw-text serving is a separate permission: vector search and metadata can be allowed while + get and inline text are denied. + +### Milestone 5: Managed GPU batch serving + +The cloud is where GPUs live, so batched search_many on a CUDA worker is the cloud's +throughput advantage. The engine already has a GPU-resident exact scan +(`engine/gpu_turbovec.py`, lazy and optional), with the CPU scan as the source of truth and +fallback. Optimise for batch and concurrency, not single-query latency; preserve exactness +and CPU fallback; keep a per-index resident GPU cache with eviction and a per-tenant quota so +one large index cannot starve a worker. + +### Milestone 6: Cloud writes with a single-writer lease + +A managed writer per index. The worker acquires a control-plane lease (not a filesystem +lock), opens a writable cached copy, applies add and remove, commits through the existing +generation path, uploads changed artifacts, then advances the cloud root pointer by +compare-and-swap. One active writer per index; the local-writer-plus-push mode (Milestone 2) +and the cloud-writer mode are mutually exclusive per path via the lease. + +- WAL caveat: WAL mode does not expose uncheckpointed writes to lock-free readers, so remote + writes force a generation commit (checkpoint) per request until a cloud reader-invalidation + protocol exists. +- Server-side embedding uses the same presets; arbitrary user embedding code in workers is + out of scope initially. + +### Milestone 7: Snapshot sharing, attach, clone, branch, time travel + +Immutable snapshot IDs for committed generations, and read-only share grants to a snapshot or +to a latest-published pointer. `attach` pulls a shared snapshot into a local read-only path +queried with `open_readonly`; later, remote read-only attach through the cloud client. + +- Clone is a new catalog pointer to an existing manifest; branch forks the pointer and + accepts independent future generations; time travel selects an older generation. All are + catalog operations because generations are immutable and independently loadable. +- Sharing must not expose raw text or lexical terms unless the owner published those + artifacts and the grantee has permission. +- Highest correctness risk: blob garbage collection must become reference-counted across + branches before any remote deletion. The local keep-last-N-epochs rule + (`DEFAULT_EPOCHS_RETAINED`) does not hold once a generation can be referenced by multiple + branches; gate remote GC behind catalog-level reference counting. + +### Milestone 8: Bidirectional sync and conflict model + +Local push and pull tracking base, local head, and remote head. Divergent histories are not +auto-merged: offer explicit `--force-push` and `--force-pull`, with logical add and remove +merge from WAL history as later work. The sync unit is a committed generation, not arbitrary +files. Conservative and explicit by default. + +### Milestone 9: Unified handle and hybrid retrieval routing + +A client that holds local and remote indexes and routes a query: local only, remote only, or +fan-out to both and fuse. Implemented as a separate cloud client with an optional thin facade +over the existing `LodeDB`, so the adapters (LangChain, LlamaIndex, mem0) and the MCP server +can target a cloud-backed handle, with the cloud client lazily imported. + +- The fan-out merge is exact: because scoring is exact cosine and results are + `(score, id, metadata)`, merging two top-k lists is just a re-sort by score, with no + approximation, unlike merging approximate-nearest-neighbour shards. Cross-source fusion + reuses the existing reciprocal-rank-fusion logic. +- Correctness requirement: fused or synced results are only comparable if both ends embed + with the identical model, version, and normalisation. A synced store carries an implicit + embedding-model contract; pin and record it, and refuse cross-model fusion. +- This is routing, not operator splitting: a whole query goes to one side, or to both and + merges. No bridge operators. + +### Milestone 10: Hosted MCP and agent-memory service + +A hosted, authenticated MCP endpoint backed by cloud indexes, mirroring the local stdio tools +(`lodedb_add`, `lodedb_search`, `lodedb_get` only with text permission, `lodedb_remove`, +`lodedb_stats`). The local stdio MCP is unchanged and stays local-first. Text exposure is +deny-by-default for shared indexes. + +### Milestone 11: BYOC and on-prem + +Helm or Terraform for the control plane, workers, object storage, and metadata DB; +customer-managed bucket and KMS; an air-gapped mode with no managed-service callbacks. Keep +BYOC operationally close to the managed service to avoid a forked product. This matches the +enterprise note already in the README. + +## Sequencing + +- Milestone 0, then 1, then 2 form the low-risk core: plumbing over an existing format, + shippable as self-hosted or BYOC sync with no control plane. +- Milestone 3 is the inflection point where real new surface area and security exposure + begin. +- Milestones 4, 5, and 6 build managed serving and then managed writes on top of the control + plane. +- Milestones 7 and 8 add sharing and sync semantics over immutable generations. +- Milestone 9 adds the unified handle and hybrid routing; Milestone 10 the hosted MCP; + Milestone 11 the BYOC packaging. + +## Cross-cutting decisions + +- Reuse the generation-addressed format as the sync substrate; do not invent a second storage + engine. +- Extend, never relax, the four invariants. Cloud code is a separate package and a separate + process. +- Replace local writer locks with cloud leases, not network filesystems (the file lock is + documented as unreliable on NFS and SMB). +- Object storage holds immutable artifacts; a strongly-consistent metadata store holds the + root pointer and performs the commit compare-and-swap. +- The public SDK contract (search, search_many, add, get, the `(score, id, metadata)` shape, + and open_readonly) is the cloud's API contract, so existing surfaces inherit cloud backing. +- Build remote serving before remote writes, and snapshot sharing before multi-writer + collaboration. + +## Risks + +- Atomic root-pointer commit on object storage needs a real compare-and-swap or metadata + transaction, not `os.replace` and not eventually-consistent listing. +- WAL mode is single-writer and does not expose uncheckpointed writes to lock-free readers; + cloud writes must checkpoint. +- Uploading `.tvtext` or `.tvlex` changes the privacy posture; both are opt-in and off by + default. +- Multi-writer merge is not supported by the engine; the lease keeps one writer per index. +- Content addressing deduplicates by sha256; in a multi-tenant store this must be namespaced + and authorised per tenant, so a known checksum cannot probe for or fetch another tenant's + blob. +- Hybrid fusion is only correct across an identical embedding model and version; pin and + record the model with the store. +- Cross-branch blob GC needs reference counting before any deletion. +- Cold starts may be dominated by model loading and index hydration; cache hot generations on + SSD and keep resident workers. + +## Out of scope + +- A SQL or relational query optimizer, or MotherDuck-style bridge operators. Cross-source + result fusion is the analogue. +- Multi-writer or CRDT merge. +- Changes to the TurboVec kernel, the CUDA or MPS scan, or the on-disk format to suit the + cloud. +- Relaxing the loopback bind or the metrics-only telemetry of the embedded engine. +- Adding auth or network code to `LodeDB(...)` or to the local stdio MCP. +- New mandatory runtime dependencies. +- A browser or WASM client. + +## Scope of investment + +The architecture is grounded in what the engine already does, but it describes a multi-quarter +program that turns a library into a multi-tenant service with auth, billing, isolation, and +on-call. Treat it in two parts: + +- **Near-term, low-risk, and on-brand:** Milestones 0, 1, and 2. These ship self-hosted or + BYOC durable sync and backup, reuse the existing format, add no mandatory dependencies, and + match the existing enterprise note without committing to running a managed service. +- **Demand-gated:** Milestone 3 onward (the control plane and everything above it). Build + these against real customer demand, since each adds operational and security surface that + contradicts the no-server positioning if shipped speculatively. + +## First pull request and next step + +This pull request adds this document and links it from the README enterprise note and the +architecture overview. It adds no code and no dependencies. The first implementation pull +request is Milestone 1: the artifact-store interface and manifest export/import over the +existing commit manifest, with the object-storage backend kept in the optional cloud extra.