draft: v0.4.0 snapshots, fork/restore, and observability #34

Draft
tastyeffectco wants to merge 10 commits into console from feat/v0.4-snapshots-observability

@tastyeffectco

Owner

Draft — not ready to merge. Opened for review/visibility of the v0.4.0
line. Built on top of the v0.3 console integration work and targets
console (not main). Source branches feat/snapshots-fork and
feat/observability-events are kept as-is.

What's included

Snapshots / fork / restore (Phase 4)

  • App snapshot history: GET /v1/apps/{id}/snapshots (tenant- and app-scoped); migration 0015 adds snapshot.source_app_id so history survives the ephemeral sandbox.
  • Restore an app from a snapshot — POST /v1/apps/{id}/restore (replaces the app's current sandbox; destructive, console confirms).
  • Fork an app from a snapshot — POST /v1/apps/{id}/fork (new app + its own sandbox; source untouched).
  • Console: Snapshots panel on app detail (history, restore-with-confirm, fork).
  • Built only on the public /v1/snapshots subsystem; the internal /sandbox/{id}/... snapshot API stays unexposed.

Observability (Phase 5)

  • Durable app/task event timeline — one append-only app_events table in the existing control-plane SQLite DB (migration 0016). No ClickHouse/OTEL/Loki, no separate logs DB.
  • Centralized best-effort recorder (internal/events, mirrors audit): never breaks a request; monotonic-ULID ids that double as the page cursor.
  • Tenant-scoped read API — GET /v1/apps/{id}/events and GET /v1/tasks/{id}/events (newest-first, default 50 / max 200, ?before cursor → next_before).
  • Instrumented: app.created/updated, config.created/updated/deleted (key only — never the secret), snapshot.captured/capture.failed/restored/forked, sandbox.create.started/failed, sandbox.started/stopped/deleted, task.started/completed/failed/build.failed, preview.health.ok/failed. Payloads carry structured flags/reasons only — no raw build/dev-server/agent output.
  • Console: read-only Activity timeline on app detail (durable across refresh/restart).
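
To make the "structured flags/reasons only" rule concrete, here is a minimal Go sketch of building a task.failed payload. It assumes the {failure_reason, has_error} shape listed in the follow-up commit of this PR; the TaskResult type and the function name are hypothetical stand-ins, not the code in this branch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TaskResult is a hypothetical stand-in for the task's result record.
// ErrorMessage may echo secrets the app printed, so it is never copied
// into an event payload.
type TaskResult struct {
	FailureReason string
	ErrorMessage  string
}

// taskFailedPayload carries a reason code and a boolean flag only.
type taskFailedPayload struct {
	FailureReason string `json:"failure_reason"`
	HasError      bool   `json:"has_error"`
}

// TaskFailedPayloadJSON builds payload_json for a task.failed event.
// The raw error text is reduced to a presence flag.
func TaskFailedPayloadJSON(r TaskResult) (string, error) {
	b, err := json.Marshal(taskFailedPayload{
		FailureReason: r.FailureReason,
		HasError:      r.ErrorMessage != "",
	})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	p, err := TaskFailedPayloadJSON(TaskResult{
		FailureReason: "build_failed",
		ErrorMessage:  "npm ERR! token=FAKE-SECRET",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(p) // {"failure_reason":"build_failed","has_error":true}
}
```

Because the payload struct has no field that can hold free text, a planted secret in the error message cannot reach payload_json by construction.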

Shared-host preview port fixes

  • Preview URLs include the host-facing port (SANDBOXD_PUBLIC_HTTP_PORT) unless it's the scheme default, so a shared-host deploy (e.g. :18080) returns a reachable URL.
  • v0.4 isolated-host installer + runbook (scripts/dev/install-v04-ubuntu.sh, docs/v0.4.0-test-runbook.md): shared-host-safe defaults (uncommon ports, sslip.io, console basic-auth, API on loopback).

Test status

Unit / integration (CI-equivalent, green locally): gofmt, go vet, go build ./..., go test ./..., the OpenAPI contract test (every /v1 route documented), and the console tsc --noEmit + image build. Notable coverage: snapshot store scoping + restore/fork tenant guards; event recorder writes valid JSON; events tenant scoping; pagination/limit + cursor; a planted fake secret in build/preview/agent error text never reaches payload_json; monotonic event-id ordering; preview-URL port logic (http/https × default/custom).

Local shared-host (real Docker) — verified: create sandbox; app-list badge reflects real status; start/stop; delete scoped to the instance's own container (prod's untouched); snapshot capture stamps source_app_id; per-app history returns it; running-source rejected 409; API preview_url includes :18080. All done with portless test sandboxes so prod Traefik never discovered them.

Not verified (deferred):

  • Live restore/fork preview end-to-end — restore/fork spin port-3000 sandboxes whose sandboxd.managed label prod Traefik would discover on a shared host. Deferred to a real isolated v0.4.0 deploy (or an isolated Traefik). Backend orchestration + tenant scoping are unit-tested; the live sandbox spin/preview is not.
  • Console-after-restart timeline was not run as a live e2e (durability is inherent to SQLite + covered by store round-trip tests).
  • Reaper-initiated sandbox.stopped and browser-wake sandbox.started are not evented yet (fire outside a request).

Explicitly deferred / out of scope

  • Live restore/fork preview verification requires an isolated host.
  • No ClickHouse / OTEL / Loki (SQLite table only; export-ready for later).
  • No broker delivery for secrets (config access_policy is metadata only; agent_access/runtime_access/both are reserved in the UI).
  • No Phase 6 runtime/provider abstraction.
  • Event retention not pruned yet (documented future knob SANDBOXD_EVENT_RETENTION_DAYS).

🤖 Generated with Claude Code

tastyeffectco and others added 10 commits June 23, 2026 10:08
v0.4.0 backend on the public /v1/snapshots subsystem only (internal
/sandbox/{id}/... stays unexposed):
- migration 0015 adds snapshot.source_app_id so per-app history survives the
  ephemeral source sandbox; capture stamps it from the source's app_id.
- GET  /v1/apps/{id}/snapshots  — tenant+app-scoped history.
- POST /v1/apps/{id}/restore    — REPLACE the app's current sandbox from a
  snapshot (purge current, then clone). Destructive; console confirms.
- POST /v1/apps/{id}/fork       — new app + its sandbox spun from a snapshot;
  source app untouched.
Tenant scoping enforced on every path (cross-tenant app/snapshot -> 404).
Sandbox spin reuses the proven create path (template_path + .git reset).
Tests cover store scoping, history, and the restore/fork guard paths; the
Docker-dependent spin is verified on a real host, not CI. OpenAPI + contract
test updated.
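
A rough sketch of the "cross-tenant -> 404" guard, assuming an owner_token-scoped lookup. The in-memory store, GetAppForOwner, and statusFor are hypothetical names for illustration; the real store is SQLite-backed and its method names may differ.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// ErrNotFound is returned for both missing and foreign-tenant apps,
// so a caller cannot tell whether another tenant's app exists.
var ErrNotFound = errors.New("not found")

// App is a trimmed, hypothetical app record.
type App struct {
	ID         string
	OwnerToken string
}

// memStore is an in-memory stand-in for the control-plane store.
type memStore struct{ apps map[string]App }

// GetAppForOwner returns the app only if it belongs to ownerToken.
func (s memStore) GetAppForOwner(ownerToken, id string) (App, error) {
	a, ok := s.apps[id]
	if !ok || a.OwnerToken != ownerToken {
		return App{}, ErrNotFound
	}
	return a, nil
}

// statusFor maps a store error to the HTTP status a handler would send.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, ErrNotFound):
		return http.StatusNotFound
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	s := memStore{apps: map[string]App{
		"app_1": {ID: "app_1", OwnerToken: "tenant-a"},
	}}
	_, err := s.GetAppForOwner("tenant-b", "app_1")
	fmt.Println(statusFor(err)) // 404
}
```

Returning 404 rather than 403 for a foreign app keeps existence private, which is what the restore/fork guard tests appear to rely on.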

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…se 4)

Adds a Snapshots panel to the app detail screen backed by the new
/v1/apps/{id} endpoints:
- history list (name, captured time, size) via GET /v1/apps/{id}/snapshots
- Restore: confirms (replaces the current sandbox, discards un-snapshotted
  work) then POST /v1/apps/{id}/restore and refreshes
- Fork: prompts for a name, POST /v1/apps/{id}/fork into a new app
The capture button now refreshes history and (since v0.4.0 ships these) drops
the 'coming soon' wording. Actions disabled unless the snapshot is ready.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…T verified

Records the release discipline + verification status for the Phase 4 preview
branch: capture/history/tenant-scoping/backend-orchestration are tested, but the
live restore/fork sandbox spin and preview are deliberately deferred to a real
isolated v0.4.0 deploy (they'd otherwise expose port-3000 sandboxes to prod
Traefik on the shared host). Not for merge to console/main; no non-draft PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
scripts/dev/install-v04-ubuntu.sh stands up the Phase 4 stack on a fresh Ubuntu
22.04/24.04 server, reusing the repo's docker-compose (traefik + sandboxd +
console profile) — no parallel deploy system. It installs Docker if missing,
fails if 80/443 are taken, detects the public IPv4, uses sslip.io for preview +
console URLs (HTTP on :80), writes .env + a docker-compose.override.yml, gates
the public console with Traefik basic auth (demo creds), keeps the API on
loopback (disables the edge api.yml router), builds images, starts the stack,
and prints URLs + teardown.

docs/v0.4.0-test-runbook.md: requirements, install, the 14-step create→preview→
snapshot→restore→fork checklist, TLS-as-follow-up, teardown, release discipline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… 80/443)

Default to shared-host mode so the installer is safe next to Coolify/nginx/another
Traefik:
- HTTP_PORT=18080 (uncommon edge port; set HTTP_PORT=80 for dedicated-host mode)
- API_PORT=19090 on loopback only
- only the chosen HTTP_PORT must be free; fail clearly telling the user to set
  HTTP_PORT if taken; do NOT check or require 443 (TLS deferred)
- generated URLs include :<HTTP_PORT> unless it's 80
Runbook documents both modes plus 'behind an existing proxy': keep 18080 and have
the front proxy forward console.<ip>.sslip.io + *.preview.<ip>.sslip.io to
127.0.0.1:18080 (Host preserved); TLS via the front proxy or a real wildcard domain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
On a shared host with HTTP_PORT=18080, the API returned bare preview URLs
(…sslip.io/) that hit whatever owns :80 (Coolify/front proxy) instead of
sandboxd's Traefik on :18080. previewURL() now appends the host port unless it's
the scheme default (80 for http, 443 for https):

  Server.PublicHTTPPort  <- SANDBOXD_PUBLIC_HTTP_PORT (main.go)
  docker-compose.yml     <- SANDBOXD_PUBLIC_HTTP_PORT: ${HTTP_PORT:-80}

The console iframe + open-in-tab link consume sb.preview.url unchanged, and
restore/fork responses use the same previewURL(), so all preview surfaces get the
corrected port. Unit-tested across http/https x default/custom ports; verified
live that GET /v1/sandboxes/{id} returns ...sslip.io:18080. No installer change
needed — it already writes HTTP_PORT, which compose forwards.
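
For reviewers, the port rule can be sketched as below. This is an illustration under assumptions: the function name and signature are hypothetical (the real previewURL() is not shown in this PR text), and treating port 0 as "unset" is my assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// previewURLWithPort appends the host-facing port unless it is the scheme
// default (80 for http, 443 for https). A port of 0 is treated as unset,
// which is an assumption of this sketch.
func previewURLWithPort(scheme, host string, publicPort int) string {
	isDefault := publicPort == 0 ||
		(scheme == "http" && publicPort == 80) ||
		(scheme == "https" && publicPort == 443)
	if isDefault {
		return scheme + "://" + host + "/"
	}
	return scheme + "://" + host + ":" + strconv.Itoa(publicPort) + "/"
}

func main() {
	host := "abc.preview.203.0.113.7.sslip.io" // example host, not a real sandbox
	fmt.Println(previewURLWithPort("http", host, 18080))
	fmt.Println(previewURLWithPort("http", host, 80))
	fmt.Println(previewURLWithPort("https", host, 443))
}
```

Note that a non-matching default still gets a port: https on 80 yields ":80", since 80 is only the default for http.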

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
One append-only app_events table in the existing control-plane SQLite DB (no
ClickHouse/OTEL/Loki/separate DB), a centralized best-effort recorder, a
tenant-scoped paginated read API, and a console Activity timeline.

- migration 0016: app_events(id ULID, owner_token, app/sandbox/task/snapshot ids,
  type, severity, message, payload_json, created_at) + scoped indexes. ULID id
  doubles as the newest-first page cursor.
- internal/events: Recorder.Record (mirrors audit: own Store interface, detached
  ctx, never breaks the request); stable type/severity constants.
- store: InsertAppEvent + ListAppEvents/ListTaskEvents (owner_token-scoped,
  cursor-paginated); owner-agnostic GetApp for the background task path.
- API: GET /v1/apps/{id}/events and /v1/tasks/{id}/events (newest-first,
  default 50 / max 200, ?before cursor, next_before). Cross-tenant -> 404/empty.
- instrumented via the recorder (no scattered SQL): app.created/updated,
  config.created/updated/deleted (key only, never the secret),
  snapshot.captured/capture.failed/restored/forked,
  sandbox.create.started/failed, sandbox.started/stopped/deleted,
  task.started, and on the task terminal point
  task.completed/failed/build.failed + preview.health.ok/failed.
- console: read-only Activity panel on app detail (time/severity/type/message +
  related ids), durable across refresh/restart.
- docs/openapi.yaml + contract test; .env.example notes the future
  SANDBOXD_EVENT_RETENTION_DAYS knob (retention deferred).
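
The "never breaks the request" recorder can be sketched roughly as follows. Assumptions: the Event fields, the timeout, and the probeStore helper are hypothetical; only the idea (own Store interface, detached context, errors logged and swallowed) comes from the description above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"
)

// Event is a trimmed, hypothetical event record.
type Event struct {
	Type    string
	Message string
}

// Store is the narrow interface the recorder owns.
type Store interface {
	InsertAppEvent(ctx context.Context, e Event) error
}

// Recorder writes events best-effort.
type Recorder struct {
	store   Store
	timeout time.Duration
}

// Record returns nothing on purpose: a failed insert is logged and dropped.
// The request context is ignored so a canceled request cannot abort the write.
func (r *Recorder) Record(_ context.Context, e Event) {
	if r == nil || r.store == nil {
		return
	}
	ctx, cancel := context.WithTimeout(context.Background(), r.timeout)
	defer cancel()
	if err := r.store.InsertAppEvent(ctx, e); err != nil {
		log.Printf("events: dropped %s: %v", e.Type, err)
	}
}

// probeStore counts inserts and remembers the context state it saw.
type probeStore struct {
	inserts int
	ctxErr  error
	fail    bool
}

func (s *probeStore) InsertAppEvent(ctx context.Context, _ Event) error {
	s.inserts++
	s.ctxErr = ctx.Err()
	if s.fail {
		return errors.New("disk full")
	}
	return nil
}

// recordAfterCancel records one event with an already-canceled request
// context and reports what the store observed.
func recordAfterCancel(fail bool) (inserts int, ctxErr error) {
	reqCtx, cancel := context.WithCancel(context.Background())
	cancel()
	s := &probeStore{fail: fail}
	r := &Recorder{store: s, timeout: 2 * time.Second}
	r.Record(reqCtx, Event{Type: "app.created", Message: "app created"})
	return s.inserts, s.ctxErr
}

func main() {
	n, err := recordAfterCancel(true)
	fmt.Println(n, err) // 1 <nil>
}
```

Even with a canceled request and a failing store, the caller sees no error and the insert was still attempted with a live context.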

Tests: recorder writes valid JSON events; tenant scoping; pagination/limit;
config event carries the key but never the secret; failed task emits
task.failed + task.build.failed on both feeds. gofmt/vet/build/test + contract
test green.
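
A small sketch of the newest-first cursor paging described above (default 50 / max 200, ?before, next_before). It runs over an in-memory slice; listPage, clampLimit, and the rule that next_before is empty on the last page are assumptions of this example, not confirmed API behaviour.

```go
package main

import (
	"fmt"
	"sort"
)

const (
	defaultLimit = 50
	maxLimit     = 200
)

// Event is a trimmed, hypothetical event; ID sorts lexicographically
// in emission order and doubles as the page cursor.
type Event struct {
	ID   string
	Type string
}

// clampLimit applies the default and the maximum page size.
func clampLimit(n int) int {
	if n <= 0 {
		return defaultLimit
	}
	if n > maxLimit {
		return maxLimit
	}
	return n
}

// listPage returns up to limit events with ID < before, newest first.
// An empty before starts from the newest event. nextBefore is the last
// returned ID when more events remain, else "".
func listPage(all []Event, before string, limit int) (page []Event, nextBefore string) {
	limit = clampLimit(limit)
	sorted := append([]Event(nil), all...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].ID > sorted[j].ID })
	for _, e := range sorted {
		if before != "" && e.ID >= before {
			continue
		}
		if len(page) == limit {
			// One more matching event exists, so hand out a cursor.
			nextBefore = page[len(page)-1].ID
			break
		}
		page = append(page, e)
	}
	return page, nextBefore
}

func main() {
	all := []Event{{ID: "01"}, {ID: "02"}, {ID: "03"}, {ID: "04"}, {ID: "05"}}
	cursor := ""
	for {
		page, next := listPage(all, cursor, 2)
		fmt.Println(len(page), next)
		if next == "" {
			break
		}
		cursor = next
	}
}
```

In the store this would presumably be a `WHERE id < ? ORDER BY id DESC LIMIT ?` query on the scoped index rather than an in-memory sort.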

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ndexes

Follow-up to Phase 5 (3 fixes):
1. No raw output in app_events.payload_json. Task events now carry structured
   flags/reasons only — never BuildErrorMessage/PreviewErrorMessage/ErrorMessage
   text (which can echo secrets the app printed; the full text stays in the
   task's result.json). New payloads:
     task.completed       -> {files_changed, duration_ms, build_ok}
     task.failed          -> {failure_reason, has_error}
     task.build.failed    -> {reason:'build_failed', has_build_error:true}
     preview.health.failed-> {preview_status, has_preview_error}
   Test now plants a fake secret in all three error fields and asserts it never
   appears in any event's payload_json or message.
2. Monotonic ULID event ids (ulid.Monotonic under a mutex), so a same-millisecond
   completion burst sorts in emission order by id (the page cursor). Added a
   monotonic-ordering test.
3. Dropped the unused indexes (owner-only, sandbox-only, type-only) from
   migration 0016 — no endpoint queries them and each is write amplification on
   an append-only table. Kept idx_app_events_app + idx_app_events_task.
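
To illustrate why fix 2 matters, here is a stdlib-only analogue of a monotonic id generator. It is not the ULID library and does not produce ULIDs: it only shows the idea that, within one millisecond, the next id is derived by incrementing the previous one under a mutex, so ids sort in emission order. The Generator type and its id format are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Generator hands out lexicographically increasing ids.
type Generator struct {
	mu     sync.Mutex
	now    func() int64 // clock in Unix milliseconds, injectable for tests
	lastMS int64
	seq    uint32
}

// NewGenerator builds a generator around the given millisecond clock.
func NewGenerator(now func() int64) *Generator {
	return &Generator{now: now}
}

// Next returns an id greater than every id returned before it.
// If the clock has not advanced (or went backwards), the previous
// timestamp is kept and the sequence is incremented instead.
// Sketch limitation: seq wrap-around within one millisecond is not handled.
func (g *Generator) Next() string {
	g.mu.Lock()
	defer g.mu.Unlock()
	ms := g.now()
	if ms > g.lastMS {
		g.lastMS, g.seq = ms, 0
	} else {
		g.seq++
	}
	// Fixed-width fields keep string order equal to numeric order.
	return fmt.Sprintf("%013d-%010d", g.lastMS, g.seq)
}

func main() {
	g := NewGenerator(func() int64 { return time.Now().UnixMilli() })
	a, b, c := g.Next(), g.Next(), g.Next()
	fmt.Println(a < b && b < c) // true
}
```

With purely random per-id entropy, two events in the same millisecond could sort in either order, which would make the ?before cursor skip or reorder entries in a completion burst.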

gofmt/vet/build/test + OpenAPI contract test green. Console untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>