[BREAKING] refactor(data-plane): route all data access through lumid-data-app#41
Merged
Conversation
…data-app Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
97a2d8a to
48a9757
Compare
…n corners Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
…v dep Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
…lign doctor with FlowMesh Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
393d1f0 to
87de8d4
Compare
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose
Routes every Lumilake data access through lumid-data-app's HTTP API. Direct
psycopgto compute Postgres and directminioto S3 are deleted: everyDataRetrievalOpmode (sql / s3 / agent), the data-profile preflight, the archive layer (job records + runtime artifacts), and server-side input/output resolution all go through one client.LUMID_DATA_URLis required;LUMID_DATA_TOKENfalls back toLUMILAKE_RUNTIME_TOKEN.Breaking surface:
DATABASE_URL,S3_URL,S3_ENDPOINT,S3_ACCESS_KEY,S3_CONNECTION_VALUE,S3_CERT_FILE,S3_WORKER_URL,LUMILAKE_DATA_PROFILE_CONNECT_TIMEOUT_S,LUMILAKE_DATA_PROFILE_STATEMENT_TIMEOUT_S.DBLocationoutputs return 422 at submit (inputs still work via the catalog endpoint).S3Location.connection_stringremoved.utils/delta.py,utils/s3.pydeleted.psycopg,psycopg-pool,minio.Changes
Migration core
utils/lumid_data_client.py(new): HTTP client coveringprofile,retrieve_sample,list_blobs,alist_blob_keys,acatalog_column_exists, blob get/put. URL-encodes keys, validates SQL identifiers, disables redirects, forwardsX-Request-ID, raises on truncated listings.utils/job_storage.py:PersistentJobStorageinlined ontolumid_data_client;BlobNotFound → ArchiveNotFound.runtime/data_profile_utils.py: preflight pre-collects the union of folders the batch's s3 retrievals reference and lists each withlimit=10000+ truncation guard. Decoupled fromS3_DATA_PREFIXsotemplate:is always treated as an absolute blob key (matches the worker's interpretation).utils/data_profile_offload.py::_estimate_plan_variants:POST /profileinstead of psycopgEXPLAIN.runtime/runtime_graph.py:_default_sql_samplerroutes throughlumid_retrieve_sample;_attach_s3_cfg → _attach_lumid_cfg.routes/jobs.py:_validate_db_location_livevia catalog endpoint;_resolve_s3_input_values+_dump_output_locationsvialumid_data_client; DBLocation outputs blocked at submit; sync_job_storagecalls wrapped withasyncio.to_thread.main.py,packages/sdk/.../envs.py,packages/deploy/.../doctor.py: env contract updated; positive-float validation on bothLUMID_DATA_TIMEOUT_SECONDSandLUMILAKE_HTTP_TIMEOUT_SECONDS; token-fallback warning.Migration-adjacent fixes
OutputOpnow accepts bothLLMOpandDataRetrievalOpsources + an optionalpath:selector. The aggregator walks dotted paths (items.table.symbol) and JSON-decodes intermediate string fields so it handles the DataFrame-serialized shape the SQL/agent workers emit. Malformed paths raise; default is mode-derived (sql + agent →items.table, s3 →items.content)._batch_requires_gpupeeks the batch and requestsgpu_group_size=0for CPU-only batches. The check delegates toFlowmeshRuntimeManager._runtime_op_requires_gpuso the scheduler peek and FlowMesh dispatch can't drift onbackendvstask_typerules.BaseJobManagertwo-phase API:reserve_batch/commit_reservation/abort_reservation. Scheduler wraps the reserve → workers → commit window intry/finallyso worker-acquisition timeout (or any exception) returns the batch to the queue instead of dropping it. User-RR rotation deferred to commit so abort truly leaves queue state unchanged.utils/io_locations.py::normalize_s3_literalrejects.., empty inner segments,?,#, NUL.Docs + examples
docs/{ARCHITECTURE,ENV,CLI,WORKFLOWS,E2E_DEMO}.md,README.md,packages/sdk/README.md,compose.yml,.env.exampleupdated to reflect lumid-data-app routing; deleted-env-var migration notes indocs/ENV.md.examples/templates/yaml/data-retrieval.yaml(new): three-mode smoke; uses${NEWS_KEY}placeholder rendered withenvsubstbefore submission.Design
envs.LUMID_DATA_URL/envs.LUMID_DATA_TOKENonto every data_spec at runtime-graph build time, so workflows don't carry these fields and can't drift between submission and dispatch._batch_requires_gpu) before committing the queue mutation, preserving the "selected batches never drop" invariant even with finite poll timeouts.S3_DATA_PREFIX, eliminating the surprise where a workflow's template was outside the listing root.type: listdata_spec carriess3://items, Lumilake stampslumid_cfgand logs a warning naming the FlowMesh-version-pairing requirement (older workers that only reads3_cfgwill fail at retrieval).Test Plan
E2E against a real lumid-data-app:
Test Result
E2E (lk_test_v0 PAT against 192.168.6.181:5101):
req-3SGkzdMG7PFjmkEt5cy6d3req-eB5yrzo54tVyR8c3pmcMYureq-kbet4j9NXniNpKJuxcWmCdAgent retrievals occasionally returned
agent returned no retrieval resultfrom lumid-data-app's LLM backend; retries succeed. Upstream issue, independent of this PR.Follow-up (out of scope)
aiohttp.ClientSessionmodule-level singleton + retry/backoff on idempotent GETs.requests.HTTPError401/413/5xx into typedHTTPExceptionwith actionable hints._list_blobs_for_folderswith bounded concurrency.data_spec.S3_*env +S3Location→Blob*(nothing points to S3 anymore).jobs_index.json/output_index.jsonGET-modify-PUT race (needs ETag/CAS on lumid-data-app).raise_for_statusechoing response body into exception messages.packages/deploy/.../stop.pyCOMPOSE_PROFILES=("postgres","minio","server")and_demo_data.pyminio loader — separate cleanup PR.Pre-submission Checklist
CONTRIBUTING.md.uv run pre-commit run --all-filesand fixed any issues.uv run pytest tests/passes locally.[BREAKING]and described migration steps above.