feat: add image_pinned to skip registry rewrite for custom images by zeroasterisk · Pull Request #7 · zeroasterisk/scion

zeroasterisk · 2026-06-14T17:44:43Z

Summary

- Adds image_pinned field to ScionConfig (scion-agent.yaml)
- When image_pinned: true, the image_registry rewrite is skipped, preserving the exact image reference from the template
- No behavior change for existing templates (opt-in only)

Motivation

Templates that specify a fully-qualified custom scion-* image (e.g. ghcr.io/myorg/scion-elixir-image:latest) get their registry prefix rewritten by RewriteImageRegistry() to match the broker's image_registry setting. This makes it impossible to use custom images hosted in external registries without push access to the broker's registry.

The image_pinned flag lets template authors signal that their image reference is intentional and should not be rewritten.

Usage

In a template's scion-agent.yaml:

image: ghcr.io/myorg/scion-custom-image:latest
image_pinned: true

Changes

- pkg/api/types.go: Add ImagePinned bool field to ScionConfig
- pkg/agent/run.go: Check ImagePinned before applying registry rewrite
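
The check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: rewriteImageRegistry is a hypothetical stand-in for RewriteImageRegistry() (whose real signature and rewrite rules are not shown here), resolveImage and the registry.example.com host are made up for the example, and the yaml tags are assumed from the field names used in scion-agent.yaml.

```go
package main

import (
	"fmt"
	"strings"
)

// ScionConfig mirrors only the two fields discussed in this PR.
// The yaml tags are an assumption based on the scion-agent.yaml keys.
type ScionConfig struct {
	Image       string `yaml:"image"`
	ImagePinned bool   `yaml:"image_pinned"`
}

// rewriteImageRegistry is a hypothetical stand-in for RewriteImageRegistry():
// it keeps the final path segment (name:tag) and prefixes the broker registry.
func rewriteImageRegistry(image, registry string) string {
	name := image
	if i := strings.LastIndex(image, "/"); i >= 0 {
		name = image[i+1:]
	}
	return strings.TrimSuffix(registry, "/") + "/" + name
}

// resolveImage (hypothetical helper) applies the registry rewrite unless the
// template pinned its image or no broker registry is configured.
func resolveImage(cfg ScionConfig, brokerRegistry string) string {
	if cfg.ImagePinned || brokerRegistry == "" {
		return cfg.Image
	}
	return rewriteImageRegistry(cfg.Image, brokerRegistry)
}

func main() {
	const registry = "registry.example.com/scion" // made-up broker registry
	pinned := ScionConfig{Image: "ghcr.io/myorg/scion-custom-image:latest", ImagePinned: true}
	unpinned := ScionConfig{Image: "ghcr.io/myorg/scion-custom-image:latest"}

	fmt.Println(resolveImage(pinned, registry))   // image reference left untouched
	fmt.Println(resolveImage(unpinned, registry)) // registry prefix replaced
}
```

Because ImagePinned defaults to false, templates that omit the field keep the existing rewrite behavior, which matches the opt-in claim in the summary.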

Test plan

- go build ./... passes
- go test ./pkg/agent/... ./pkg/api/... ./pkg/config/...: all image-related tests pass
- Manual: create template with image_pinned: true and custom image, verify no rewrite in debug logs

Intended as a PR to googlecloudplatform/scion; created here on the fork first due to token permissions. Please open an upstream PR or grant fork PR access.

…rm#293)

* fix(scion-chat-app): set channel="gchat" on ask_user dialog responses

  handleDialogSubmit was using the simple SendMessage API, which doesn't support structured message fields, so inbound ask_user responses arrived at the hub with no channel set (defaulting to "web"). Switch to SendStructuredMessage with Channel="gchat" to match the pattern already used by cmdMessage.

* fix: channel filtering and thread-id routing for chat channel replies

  Two bugs in the chat channel routing feature:

  1. Channel filtering: broker plugins now check msg.Channel and skip messages targeted at a different channel. The hub injects plugin_name into broker credentials so each plugin knows its own channel identity. This prevents cross-channel delivery (e.g., Telegram replies leaking to Google Chat).
  2. Thread-id routing: the Telegram plugin now passes msg.ThreadID as message_thread_id to the Telegram Bot API when sending outbound messages. Previously, thread-id was captured on inbound messages but never forwarded on outbound, causing replies to land in the wrong forum topic.

  Added SendOption variadic parameter to SendMessage, SendMessageWithKeyboard, and SendQueue.Send for backward-compatible thread-id support.

* feat(scion-chat-app): add Google Chat thread context support

  Propagate thread IDs end-to-end so agents can participate in Google Chat threads:

  - Inbound: auto-set ThreadID on StructuredMessage from the Google Chat event's thread context when no explicit --thread flag is used
  - Inbound: propagate ThreadID on dialog submit (ask_user responses)
  - Outbound: pass ThreadID from StructuredMessage to SendMessageRequest so agent replies land in the correct Google Chat thread

* fix: route outbound messages to chat-app via ChannelID

  The FanOutEventBus matched msg.Channel against the bus Name, but the chat-app plugin is registered as "chat-app" while its messages use channel="gchat". Add a ChannelID field to NamedEventBus and PluginInfo so plugins can declare the channel they handle independently of their registered name. The chat-app now reports ChannelID="gchat" via GetInfo(), and the hub reads it at startup to wire routing correctly.

* design: per-topic /default agent scoping for Telegram forums

  Explores how to let /default set a different default agent per forum topic (message_thread_id) rather than per-chat. Conclusion: ~85 lines of changes across store, commands, callbacks, and routing.

* feat(scion-telegram): per-topic /default agent scoping for forum groups

  Add support for setting a different default agent per Telegram forum topic/thread, with the chat-wide default as fallback.

  - New topic_defaults table keyed on (chat_id, thread_id)
  - /default in a topic sets/shows the topic-level override
  - Callback data extended: dflt:<slug>:<threadID> for topic scope
  - Routing resolves topic default before chat default for both @bot-mention and unaddressed message fallback paths

* fix: address PR GoogleCloudPlatform#293 review feedback

  - Add !no_sqlite build tag to resource_import_handler_test.go to fix CI vet failure (mockRoundTripper undefined when template_bootstrap_test.go is excluded)
  - Guard debug log in broker.go Publish against nil msg to prevent panic
  - Add fitCallback to preserve threadID suffix in Telegram callback_data when the 64-byte limit is exceeded, truncating agentSlug instead
  - Add slog warning to truncateCallback when truncation occurs

* fix: address second round of PR GoogleCloudPlatform#293 review feedback

  - Remove redundant channel filters from chat-app and Telegram Publish() methods: the FanOutEventBus already routes by ChannelID, and comparing against the plugin's registered name would silently drop messages
  - Log errors from GetTopicDefault instead of silently ignoring them
  - Return distinct error messages in chat-app when ResolveOrAutoRegister fails with a real error vs a nil mapping

* fix: address third round of PR GoogleCloudPlatform#293 review feedback

  - Add early return for nil msg at top of Publish() to prevent panics in downstream handlers that dereference msg fields
  - Add thread-safe ChannelName() getter on BrokerServer
  - Use dynamic ChannelName() in GetInfo() instead of hardcoded "gchat"
  - Use dynamic ChannelName() in both commands.go call sites

* fix: use callback_lookups for long callback data instead of truncation

  Replace fitCallback(), which corrupted agent slugs by truncating them to fit Telegram's 64-byte limit. Long callback payloads are now stored in the callback_lookups table with a short cblu:<id> reference. HandleCallback resolves lookup IDs before routing.

  Also add defensive check for empty HubUserEmail in chat-app to prevent constructing invalid "user:" sender strings.

* fix: address fifth round of PR GoogleCloudPlatform#293 review feedback

  - Use local interface instead of concrete *BrokerRPCClient type assertion in pluginChannelID() and isObserverBroker() so in-process brokers and mocks are handled correctly.
  - Add nil guard for msg in fanout channel routing check.

---------

Co-authored-by: Scion <agent@scion.dev>
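
The callback_lookups approach from the commits above (store long payloads, send a short cblu:<id> reference) might look something like the sketch below. It uses an in-memory map in place of the callback_lookups table; the callbackLookups type, its Encode/Resolve methods, and the integer ID scheme are assumptions for illustration and are not taken from the repository.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const (
	maxCallbackBytes = 64      // Telegram's callback_data size limit
	lookupPrefix     = "cblu:" // short-reference prefix named in the commit message
)

// callbackLookups is an in-memory stand-in for the callback_lookups table.
type callbackLookups struct {
	nextID int
	byID   map[string]string
}

func newCallbackLookups() *callbackLookups {
	return &callbackLookups{byID: make(map[string]string)}
}

// Encode returns data unchanged when it fits within the limit; otherwise it
// stores the full payload and returns a cblu:<id> reference instead of
// truncating any part of it.
func (c *callbackLookups) Encode(data string) string {
	if len(data) <= maxCallbackBytes {
		return data
	}
	c.nextID++
	id := strconv.Itoa(c.nextID)
	c.byID[id] = data
	return lookupPrefix + id
}

// Resolve expands a cblu:<id> reference back to the stored payload.
// Data without the prefix is returned as-is.
func (c *callbackLookups) Resolve(data string) (string, bool) {
	if !strings.HasPrefix(data, lookupPrefix) {
		return data, true
	}
	full, found := c.byID[strings.TrimPrefix(data, lookupPrefix)]
	return full, found
}

func main() {
	lookups := newCallbackLookups()
	long := "dflt:an-agent-slug-that-is-far-too-long-for-telegram-callback-data-limits:12345"

	ref := lookups.Encode(long)
	fmt.Println(ref) // short reference sent to Telegram

	full, ok := lookups.Resolve(ref)
	fmt.Println(full == long, ok)
}
```

Resolving before routing keeps the rest of the callback handling unaware of whether the payload was inlined or looked up.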

…eCloudPlatform#296)

* Fix test suite leaking Hub credentials, corrupting agent state (GoogleCloudPlatform#123)

  Tests that spawn sciontool (e.g., TestInitCommand_Integration) inherited live Hub env vars from the agent container, causing the subprocess to talk to the real Hub and reset the agent phase to "starting."

  - Add scrubHubEnv(t) helpers that use t.Setenv to clear Hub env vars (SCION_HUB_ENDPOINT, SCION_HUB_URL, SCION_AUTH_TOKEN, SCION_AGENT_ID, SCION_AGENT_MODE) with automatic restore on test cleanup
  - Filter Hub env vars from subprocess Cmd.Env in TestInitCommand_Integration as belt-and-suspenders protection
  - Convert os.Setenv/os.Unsetenv to t.Setenv throughout hub_test.go and client_test.go for crash-safe env var isolation

* Add project log entry for issue GoogleCloudPlatform#123 fix

* Address PR GoogleCloudPlatform#296 review feedback in init_test.go

  Replace hardcoded /tmp/sciontool-test path with t.TempDir() to avoid permission conflicts and test races. Replace map allocation in filterHubEnv with slices.Contains on the static hubEnvVars slice.
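
As a rough illustration of the subprocess filtering described above, filterHubEnv could be written along these lines. The variable names come from the commit message; the function body and its exact signature are assumptions, since the repository code is not shown here.

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// hubEnvVars lists the Hub variables named in the commit message.
var hubEnvVars = []string{
	"SCION_HUB_ENDPOINT",
	"SCION_HUB_URL",
	"SCION_AUTH_TOKEN",
	"SCION_AGENT_ID",
	"SCION_AGENT_MODE",
}

// filterHubEnv returns env (in "KEY=VALUE" form, as in os.Environ) with every
// Hub variable removed, so a spawned subprocess cannot reach the real Hub.
func filterHubEnv(env []string) []string {
	out := make([]string, 0, len(env))
	for _, kv := range env {
		key, _, _ := strings.Cut(kv, "=")
		if slices.Contains(hubEnvVars, key) {
			continue
		}
		out = append(out, kv)
	}
	return out
}

func main() {
	env := []string{
		"PATH=/usr/bin",
		"SCION_HUB_URL=https://hub.example.com", // made-up value
		"SCION_AUTH_TOKEN=secret",
		"HOME=/root",
	}
	fmt.Println(filterHubEnv(env))
}
```

In a test, the result would typically be assigned to cmd.Env before starting the subprocess, alongside t.Setenv calls that clear the same variables in the test process and restore them at cleanup.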

…po.sh

…cript

…oogleCloudPlatform#299)

Three new documentation pages:

- External Channels: covers Telegram (bidirectional group chat), Discord (outbound webhooks), and A2A protocol bridge in one page. Summarizes concepts and links to detailed READMEs in extras/.
- Hub Setup on GCE: step-by-step walkthrough of deploying a hub using the starter-hub scripts. Covers provisioning, repo setup, TLS, and post-setup next steps.
- Multi-Broker Setup: how to connect multiple machines to a single hub for distributed agent execution. Covers architecture, broker registration, selection, and cross-broker considerations.

Sidebar updated to include all three pages.

* Add sort and filter capabilities to agent list view (GoogleCloudPlatform#71)

  CLI: add --phase, --activity, --template filter flags and --sort, --reverse sort flags to 'scion list'. Validates flag values against known phases/activities. Passes phase filter server-side in hub mode for efficiency.

  Web UI: add phase filter chips (All/Running/Stopped/Suspended/Error), sortable table headers (Name, Status, Updated), and sort dropdown for grid view. Filter and sort state persists to localStorage.

  Closes GoogleCloudPlatform#71

* Address review feedback: input canonicalization and validation

  - CLI: canonicalize --phase/--activity/--sort to lowercase in validateListFlags, remove redundant empty check on filterActivity
  - Web UI: validate localStorage phase filter against known values instead of raw cast
  - Web UI: validate localStorage sort config field/dir values before applying
  - Web UI: handle invalid date strings in formatRelativeTime with isNaN guard
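
The CLI canonicalization and validation step could be sketched as below. The function name validateListFlags comes from the commit message; the listFlags struct, the accepted phase and sort values (borrowed from the web UI chip and header names), and the error wording are assumptions made for this example.

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// Illustrative value sets, assumed from the web UI chips and table headers;
// the CLI's real lists may differ.
var (
	knownPhases     = []string{"running", "stopped", "suspended", "error"}
	knownSortFields = []string{"name", "status", "updated"}
)

// listFlags is a hypothetical container for the 'scion list' flag values.
type listFlags struct {
	Phase   string
	Sort    string
	Reverse bool
}

// validateListFlags lowercases the flag values in place, then rejects any
// non-empty value that is not in the known set.
func validateListFlags(f *listFlags) error {
	f.Phase = strings.ToLower(strings.TrimSpace(f.Phase))
	f.Sort = strings.ToLower(strings.TrimSpace(f.Sort))

	if f.Phase != "" && !slices.Contains(knownPhases, f.Phase) {
		return fmt.Errorf("invalid --phase %q (want one of: %s)", f.Phase, strings.Join(knownPhases, ", "))
	}
	if f.Sort != "" && !slices.Contains(knownSortFields, f.Sort) {
		return fmt.Errorf("invalid --sort %q (want one of: %s)", f.Sort, strings.Join(knownSortFields, ", "))
	}
	return nil
}

func main() {
	ok := listFlags{Phase: "Running", Sort: "NAME"}
	fmt.Println(validateListFlags(&ok), ok.Phase, ok.Sort)

	bad := listFlags{Phase: "sleeping"}
	fmt.Println(validateListFlags(&bad))
}
```

Canonicalizing before validating means "Running" and "running" behave the same, and the lowercase value is what gets passed on as the server-side phase filter.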

…rm#295)

* Add prominent disconnected overlay to web terminal

  When the WebSocket connection drops, a full-terminal overlay now appears with 50% black opacity and large red "DISCONNECTED" text centered on it. The overlay appears immediately on disconnect and disappears when the connection is re-established. The small status indicator in the toolbar remains as a secondary signal.

  Fixes GoogleCloudPlatform#77

* Move disconnected overlay to be a sibling of xterm container

  The overlay was a child of .terminal-container, whose DOM is managed by xterm.js. Lit re-rendering the overlay on connect/disconnect state changes conflicts with xterm's DOM management.

  Fix: introduce .terminal-wrapper as the relative-positioning context, make .terminal-container absolutely positioned inside it, and render the overlay as a sibling, outside xterm's managed subtree.

* Use wasConnected flag instead of terminal ref for overlay reactivity

  Replace the non-reactive `this.terminal` reference in the overlay condition with a new `@state() wasConnected` flag. This fixes two issues:

  1. Lit reactivity: `this.terminal` lacked `@state()` so changes to it didn't trigger re-renders. The new `wasConnected` is properly decorated as reactive state.
  2. Initial connection: using `this.terminal` would flash the overlay during the brief window between terminal init and WebSocket open. `wasConnected` is only set true after the first successful connect, so the overlay only appears after a genuine disconnection.

…tore port, LISTEN/NOTIFY (GoogleCloudPlatform#304)

* P0-1: switch Postgres driver from lib/pq to pgx/v5 stdlib

  - Add github.com/jackc/pgx/v5/stdlib (registers as "pgx")
  - driver_postgres.go: blank import pgx stdlib instead of lib/pq
  - OpenPostgres: open via sql.Open("pgx", dsn) + entsql.OpenDB
  - Introduce PoolConfig (applied to *sql.DB); thread through OpenSQLite/OpenPostgres and update all callers
  - go mod tidy drops lib/pq

* P0-2: add connection pool config to DatabaseConfig

  - DatabaseConfig gains MaxOpenConns / MaxIdleConns / ConnMaxLifetime plus ConnMaxLifetimeDuration() helper
  - DefaultGlobalConfig sets sqlite pool defaults (MaxOpenConns=1, load-bearing for write serialization)
  - applyDatabasePoolDefaults fills postgres defaults (20/5/30m) and forces sqlite MaxOpenConns=1; called in both load paths
  - Mirror fields in V1DatabaseConfig + both conversion directions
  - Wire pool settings into entc.OpenSQLite in initStore

* P0-3/P0-4: CRUD-parity test harness + spec-driven fixture generator

  P0-3: pkg/store/storetest/ is a backend-agnostic, table-driven CRUD oracle. A Factory(t) -> store.Store is injected; generic Domain[T] descriptors drive Create/Read/Update/Delete (+optional soft-delete)/List-paginate/List-filter. Ships group + policy domains and runs green against today's CompositeStore (SQLite base + Ent DB). Ready to accept a postgresFactory for P3-2.

  P0-4: internal/fixturegen/ is a Go-defined spec seeding >=1 row per table across all 30 domain tables, with edge cases (NULL optionals, max-length strings, nested/unicode JSON, soft-deleted agent, BLOB). Deterministic. 'go run ./internal/fixturegen' emits testdata/hub-v46-fixture.db, prints a 30-table coverage report, and caches the blob to the scratchpad mount. CI gate fails if any table has zero rows.
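
The pool-default rules from P0-2 (postgres 20/5/30m, sqlite forced to MaxOpenConns=1) can be illustrated with a small sketch. The field and function names follow the commit message, but the Driver field, the string type of ConnMaxLifetime, and the exact defaulting conditions are assumptions; the repository's DatabaseConfig may be shaped differently.

```go
package main

import (
	"fmt"
	"time"
)

// DatabaseConfig holds only the pool-related fields for this sketch.
type DatabaseConfig struct {
	Driver          string // "sqlite" or "postgres" (assumed values)
	MaxOpenConns    int
	MaxIdleConns    int
	ConnMaxLifetime string // duration string such as "30m" (assumed type)
}

// ConnMaxLifetimeDuration parses ConnMaxLifetime, returning 0 when it is
// empty or malformed.
func (c DatabaseConfig) ConnMaxLifetimeDuration() time.Duration {
	d, err := time.ParseDuration(c.ConnMaxLifetime)
	if err != nil {
		return 0
	}
	return d
}

// applyDatabasePoolDefaults fills unset postgres pool settings and forces a
// single connection for sqlite so writes are serialized.
func applyDatabasePoolDefaults(c *DatabaseConfig) {
	switch c.Driver {
	case "sqlite":
		c.MaxOpenConns = 1
	case "postgres":
		if c.MaxOpenConns == 0 {
			c.MaxOpenConns = 20
		}
		if c.MaxIdleConns == 0 {
			c.MaxIdleConns = 5
		}
		if c.ConnMaxLifetime == "" {
			c.ConnMaxLifetime = "30m"
		}
	}
}

func main() {
	pg := DatabaseConfig{Driver: "postgres"}
	applyDatabasePoolDefaults(&pg)
	fmt.Println(pg.MaxOpenConns, pg.MaxIdleConns, pg.ConnMaxLifetimeDuration())

	lite := DatabaseConfig{Driver: "sqlite", MaxOpenConns: 8}
	applyDatabasePoolDefaults(&lite)
	fmt.Println(lite.MaxOpenConns)
}
```

The resulting values would then be applied to the *sql.DB with SetMaxOpenConns, SetMaxIdleConns, and SetConnMaxLifetime from database/sql.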
* feat(ent): add 23 new Ent schemas for full table parity (P1-2 + P1-3) * feat(observability): add Cloud Monitoring scaffolding for LISTEN/NOTIFY metrics (P0-5) * P2: port notification + gcp/github/token domains to Ent entadapter Add Ent-backed implementations of the notification, GCP service account, GitHub App installation, and user access token store sub-interfaces: - notification_store.go: NotificationStore (subscriptions, notifications, templates). Dispatch uses an atomic conditional update as the multi-replica claim primitive, and an optional NotificationPublisher designs in the LISTEN/NOTIFY fan-out for created/dispatched events. - external_store.go: GCPServiceAccountStore + GitHubInstallationStore + UserAccessTokenStore. GitHub create is idempotent (INSERT OR IGNORE semantics), repositories/scopes are JSON, default_scopes is CSV, and tokens support key-hash lookup. Legacy api_keys is intentionally not surfaced. - storetest: add GCPServiceAccount, SubscriptionTemplate, and NotificationSubscription CRUD-parity domains. Does not modify composite.go. * P2: port schedule, maintenance, message domains to Ent entadapter - schedule_store.go: ScheduleStore + ScheduledEventStore sub-interfaces with dialect-aware SELECT FOR UPDATE SKIP LOCKED claim helper for the ListDueSchedules / ListPendingScheduledEvents job-claim paths (plain SELECT on SQLite, SKIP LOCKED on Postgres). - maintenance_store.go: run-state RMW, AbortRunningMaintenanceOps, Go-side seed (uuid.New) replacing SQLite randomblob() UUID seeds. - message_store.go: CRUD, read flags, PurgeOldMessages, design-in PublishUserMessage hook for Postgres LISTEN/NOTIFY. - pkg/ent/client_driver.go: hand-written Client.Driver() accessor for dialect detection + raw locking queries. * feat(entadapter): port user + allowlist/invite domains to Ent (P2) Implements the Ent-backed store adapters for the user and allowlist/invite domains, plus their CRUD-parity oracle descriptors. 
pkg/store/entadapter/user_store.go (store.UserStore): - CreateUser/GetUser/GetUserByEmail/UpdateUser/UpdateUserLastSeen/ DeleteUser/ListUsers. - Case-insensitive email: emails are normalized to lower case on write (so the plain unique index enforces case-insensitive uniqueness, equivalent to the legacy UNIQUE COLLATE NOCASE) and matched with EmailEqualFold (lower(email)=lower($1)) on read. ent codegen + AutoMigrate cannot emit a real lower(email) functional index across both SQLite (tests) and Postgres, so the invariant is enforced at the port layer. - Offset-based pagination matching the legacy SQLite store. pkg/store/entadapter/allowlist_store.go (store.AllowListStore + store.InviteCodeStore): - Full allow-list + invite-code CRUD. - BulkAddAllowListEntries uses CreateBulk + OnConflictColumns(email). Ignore() for race-safe INSERT-OR-IGNORE; added/skipped counts mirror the legacy per-row semantics (existing + within-batch dups skipped). - IncrementInviteUseCount is a single atomic conditional UPDATE (revoked=false AND not expired AND (max_uses=0 OR use_count<max_uses)), which is race-free on both backends without SELECT...FOR UPDATE. The sql/lock feature is enabled and ForUpdate is available for genuine multi-statement RMW paths. - ListAllowListEntriesWithInvites batch-joins invite codes (invite_id is a plain column, not an Ent edge). Schema: - pkg/ent/schema/user.go: add nillable last_seen field (+ index) needed by UpdateUserLastSeen / lastSeen sort; document the case-insensitive email strategy. - pkg/ent/generate.go: enable --feature sql/upsert,sql/lock (required for OnConflict and ForUpdate). Tests (all passing): - pkg/store/storetest/domains_user.go: UserDomain, AllowListDomain, InviteCodeDomain oracle descriptors (kept in a separate file to avoid contending on domains.go). 
- entadapter oracle test runs the shared CRUD-parity suite directly against the new adapters; behavior tests cover case-insensitivity, bulk idempotency, conditional increment, stats, and the invite join. NOTE: Generated Ent code under pkg/ent/** is intentionally NOT included. This is a shared worktree where sibling port agents concurrently modify schemas and the same feature flags; the generated code must be regenerated at wave integration via: go generate ./pkg/ent/... Verified locally that regeneration + full build + tests pass. Per P2 scope: composite.go wiring and ensureEntUser shadow removal are deferred to P2-collapse. * P2: port secret/env_var + template/harness_config domains to Ent Add Ent-backed store implementations for the secret/env and template/harness domains, mirroring the legacy SQLite semantics: - entadapter/secret_store.go: SecretStore implementing store.SecretStore + store.EnvVarStore. Polymorphic (scope, scope_id) addressing, COALESCE target->key projection, version bump on update, get-then-update upsert, and transitive ListProgenySecrets via a created_by IN-list over the ancestor set (user scope + allow_progeny only; encrypted value withheld). - entadapter/template_store.go: TemplateStore implementing store.TemplateStore + store.HarnessConfigStore. base_template hierarchy, scope/project_id backwards-compat lookups, content_hash, JSON config/files columns, DeleteByScope. Subscription templates are owned by NotificationStore. - Direct Ent unit tests incl. a progeny-inheritance parity test. - storetest: Template/HarnessConfig/Secret/EnvVar domain descriptors wired into RunStoreSuite for cross-backend CRUD parity. * P2: port project/broker + brokersecret domains to Ent Port the project/broker domain (projects, runtime_brokers, project_contributors, project_sync_state) and the broker-auth domain (broker_secrets, broker_join_tokens) from raw SQL to Ent adapters. 
- pkg/store/entadapter/project_store.go: implements ProjectStore, RuntimeBrokerStore, ProjectProviderStore and ProjectSyncStateStore. * provider + sync-state upserts use Ent OnConflict().UpdateNewValues() (sql/upsert) keyed on the (project_id, broker_id) unique index. * runtime broker heartbeat/update use an optimistic version-CAS loop on a new internal lock_version token, serializing concurrent writers portably across SQLite (tests) and Postgres without SELECT ... FOR UPDATE. * slug lookups support case-insensitive matching (EqualFold). * project computed fields (AgentCount, ActiveBrokerCount, ProjectType) are derived via Ent queries, matching the legacy SQLite store. - pkg/store/entadapter/brokersecret_store.go: implements BrokerSecretStore (per-broker HMAC secrets + short-lived join tokens, expiry cleanup). - Project Ent schema: add operational fields for full parity (default_runtime_broker_id, shared_dirs, github_*, git_identity). - RuntimeBroker Ent schema: relax vestigial type column to Optional, add internal lock_version concurrency token. - Regenerate Ent with sql/upsert,sql/lock features. - storetest: add Project, RuntimeBroker, BrokerSecret and BrokerJoinToken CRUD-parity domains. - Unit tests for both adapters. Per the integration plan, composite.go wiring and ensureEntProject shadow removal are deferred to P2-collapse. * P2: port agent domain to Ent entadapter (XL) * chore(ent): regenerate Ent code for all 30 entity schemas Regenerated with --feature sql/upsert,sql/lock to support OnConflict upserts and ForUpdate/SKIP LOCKED job claims. * P2-collapse: collapse dual-DB into single Ent store Wire all Ent-backed sub-stores into CompositeStore via embedding, removing the raw-SQL base store and the User/Agent/Project shadow-sync machinery (ensureEntUser/ensureEntAgent/ensureEntProject). CompositeStore now serves every domain from a single Ent client and implements Close/Ping/Migrate directly. 
Collapse initStore() to open one Ent SQLite DB (no _ent shadow DSN, no MigrateGroveToProjectData, no raw sqlite.New). Register the User, AllowList, and InviteCode domains in the storetest CRUD-parity suite. Update entadapter tests for the single-DB NewCompositeStore(client) signature. go build ./... green; go test ./pkg/store/entadapter/... ./pkg/store/storetest/... green. * P2-delete: remove raw-SQL store implementation Delete the ~6k-LOC raw-SQL store (sqlite.go) and its per-domain sibling files (brokersecret, gcp_service_account, github_installation, maintenance, messages, notification, project_sync_state, schedule, scheduled_event) plus their tests, including the inline schema-migration scaffold. Keep driver.go, which registers the pure-Go SQLite driver used by Ent's SQLite backend. Repoint the two non-test consumers to the Ent-backed store: - cmd/hub_secret_migrate.go now opens an Ent client + CompositeStore. - internal/fixturegen opens via entc and seeds the Ent schema's *sql.DB. go build ./... green; no remaining production references to the raw store. * test: compile-migrate downstream suites to Ent store + fix signing-key PK Replace the removed raw-SQL store in downstream tests with an Ent-backed newTestStore helper (pkg/hub, pkg/secret) and update cmd/server_test.go and internal/fixturegen tests. Port the 8 raw-SQL DB() access sites in hub tests via a new CompositeStore.DB() escape-hatch accessor. Fix a production bug surfaced by the collapse: hub/server.go signingKeySecretID generated a non-UUID secret primary key, which the Ent secret store rejects; it now derives a deterministic UUIDv5. go build ./... green; entadapter and storetest suites green. NOTE: hub/secret/fixturegen suites now COMPILE but many tests still fail because their fixtures seed non-UUID string IDs that the UUID-PK Ent schema rejects; addressed in follow-up commits (tid() helper). 
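
The tid() helper mentioned above maps human-readable fixture names to deterministic UUIDv5 values. A standard-library-only sketch of that idea follows. The choice of namespace (the RFC 4122 DNS namespace) is an assumption, and the repository's helper may well use a UUID library instead of hashing by hand.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// namespace is the RFC 4122 DNS namespace UUID
// (6ba7b810-9dad-11d1-80b4-00c04fd430c8). Which namespace the real tid()
// uses is not stated, so this is an assumption.
var namespace = [16]byte{
	0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// tid derives a version-5 (SHA-1, name-based) UUID from name, so the same
// fixture name always yields the same UUID string.
func tid(name string) string {
	h := sha1.New()
	h.Write(namespace[:])
	h.Write([]byte(name))
	sum := h.Sum(nil)

	var u [16]byte
	copy(u[:], sum[:16])
	u[6] = (u[6] & 0x0f) | 0x50 // set version 5
	u[8] = (u[8] & 0x3f) | 0x80 // set RFC 4122 variant

	s := hex.EncodeToString(u[:])
	return s[0:8] + "-" + s[8:12] + "-" + s[12:16] + "-" + s[16:20] + "-" + s[20:32]
}

func main() {
	fmt.Println(tid("agent-1"))
	fmt.Println(tid("agent-1") == tid("agent-1")) // stable across calls
	fmt.Println(tid("agent-1") == tid("agent-2")) // distinct names differ
}
```

Because the mapping is deterministic, a test can create a row with tid("agent-1") and later build a URL or assertion from the same call without storing the generated ID.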
* test(hub): map non-UUID fixture IDs to UUIDs via tid() helper Wrap human-readable test identifiers in tid() (deterministic UUIDv5) so the UUID-PK Ent store accepts them while preserving cross-reference consistency and ID-equality assertions. Reduces pkg/hub failures from 611 to 79; remaining failures are behavioral, not ID-format, and are addressed separately. # Conflicts: # pkg/hub/handlers_project_test.go # pkg/hub/httpdispatcher_test.go * fix(store): seed maintenance ops in Migrate; initStore uses Migrate Restore raw-SQL parity: CompositeStore.Migrate now runs AutoMigrate and seeds built-in maintenance operations (the raw store seeded these in its migrations). initStore and hub test helpers call s.Migrate() so production and tests seed consistently. Fixes the maintenance-operation hub tests (404 'Operation not found'). pkg/hub failures 79 -> 71. * test(hub): satisfy Ent NotEmpty validators in fixtures Add slugs/broker names to test fixtures that previously relied on the raw store's lenient (no-validator) inserts: project/agent slugs in the logs test helper, broker slugs in embedded/profile/authz fixtures, and BrokerName on envgather ProjectProvider literals. pkg/hub failures 71 -> 57. * test(secret): map non-UUID fixture IDs to UUIDs via tid() Apply the tid() helper to pkg/secret fixtures (including a dynamically built secret ID) so the UUID-PK Ent store accepts them. pkg/secret now fully green. * test(cmd): map non-UUID fixture IDs to UUIDs via tid(); add broker slug/name Wrap broker/grove/agent IDs passed to registerGlobalProjectAndBroker and the dispatcher tests in tid(), and supply RuntimeBroker.slug / ProjectContributor broker_name to satisfy Ent validators. cmd now green except TestDeleteStopped_RequiresGroveContext, which requires the 'docker' binary (absent in this sandbox) and is unrelated to the store migration. 
# Conflicts: # cmd/server_dispatcher_test.go * test(hub): wrap remaining latent non-UUID fixture IDs Catch IDs that surfaced behind earlier failures (stale-agent-*, agent-visible-authz, agent-profile-hb, env-owner-1). No more UUID-parse errors in pkg/hub; the remaining ~56 failures are behavioral (URL paths built from old raw IDs, assertion mismatches), addressed next. * fix(entadapter): Get-by-id returns ErrNotFound for non-UUID identifiers Restore raw-SQL store parity: a malformed identifier cannot match any UUID primary key, so get-by-id lookups now report store.ErrNotFound instead of store.ErrInvalidInput. This matches the raw store (a lookup with a bad id simply returned no row) and is what callers depend on — e.g. resolveTemplate passes a template *name* to GetTemplate and relies on ErrNotFound to fall back to slug-based resolution. New parseGetID helper applied across all 17 get-by-id methods. pkg/hub failures 56 -> 40; entadapter/storetest stay green. * test(hub): fix store-less id wraps and project-route URL paths - controlchannel_client_test: revert tid() wraps (store-less path-builder test; IDs must match the expected literal paths). - github/envgather: project-scoped route handlers resolve the project by UUID id, so build paths with tid(rawID) via fmt.Sprintf instead of the old raw-id literal. pkg/hub failures 40 -> 32. * test(hub): unwrap projectIDFromServiceAccountEmail expectation The tid() sweep over-wrapped a non-ID expected value in a pure-function test; restore the literal GCP project id. * fix(ent): GCPServiceAccount.project_id is a string, not a UUID The GCP service account project_id holds the GCP *cloud project* identifier (e.g. 'my-project-123'), a free-form string — not a UUID. The schema declared it field.UUID, so entadapter CreateGCPServiceAccount/Update did parseUUID(sa.ProjectID) and rejected real GCP project ids, breaking SA mint/create with a 400 in production (storetest masked it by passing a UUID). 
Change the schema field to field.String, regenerate Ent, and store/read project_id as a string in external_store.go. Fixes ~7 hub GCP tests; pkg/hub 31 -> 23. * test(hub): fix GCP SA project-id assertion and project-settings id Unwrap the over-wrapped 'my-project' expectation now that project_id is a string, and wrap the dynamic project-settings project ID with tid(). * test(hub): fix bootstrap sync-to-finalize agent paths and storage keys Build the finalize request path from the agent's tid() UUID and seed mock storage under WorkspaceStoragePath(projectID, agent.ID) — the handler derives the workspace key from the agent's real id, not the old raw name. pkg/hub 23 -> 19. * test(hub): revert tid() over-wraps in store-less events_test events_test exercises the in-memory ChannelEventPublisher directly; its ProjectID/IDs are subject-string components, not stored UUIDs. The tid() sweep wrongly rewrote them so published subjects no longer matched the subscriptions (timeouts). Restore the literal values. pkg/hub 19 -> 12. * test(hub): fix maintenance-run path and notifications agentId queries Use tid() UUIDs in the maintenance run-detail path and the notifications agentId query params; guard list indexing with require.Len so a mismatch fails cleanly instead of panicking (panics truncate the package run). * test(hub): wrap remaining fixture IDs revealed after panic-cascade cleared Panics ([0] on empty lists) had been truncating the package run, hiding many failures and starving the tid() sweep. With those guarded, sweep the newly reached tests: wrap dynamic rune-suffix IDs and the setupProjectWithBroker / seedCreatedAgentForHarnessTest helper IDs, and convert raw query-param project IDs to tid(). No UUID-parse errors remain in pkg/hub. 
* test(hub): unwrap tid() in scheduler_test (mock store, raw ids) scheduler_test uses an in-memory mockScheduledEventStore, not the Ent store, so its ids need no UUIDs; the erroneous tid() wraps broke raw getEvent lookups and caused a nil-pointer panic that truncated the package run. * fix(ent): Template.harness may be empty (raw-store parity) A template imported from a directory that declares no harness type has an empty harness; the raw-SQL store stored it, but the Ent NotEmpty validator made BootstrapTemplatesFromDir silently skip such templates. Drop NotEmpty and regenerate. Removing the [0]-on-empty panics this caused un-truncates the hub package run (true failure count now visible). * test(hub): wrap dynamic fixture IDs in wake/workspace/signing-key tests Wrap tid() around the wake_test, setupWorkspaceProject, and empty-value signing-key secret IDs now reachable after panic removal. No panics in the hub package run. * test(hub): convert raw-id URL path segments to tid() Build GET/PUT/DELETE paths for agents/projects/brokers/templates/harness-configs and workspace sync routes from tid(rawID) so the by-id handlers resolve the entity (raw ids no longer match the UUID PKs). pkg/hub 93 -> 80. * fix(entadapter)+test(hub): FK error mapping + permissions FK fixtures mapError now distinguishes foreign-key violations (-> ErrInvalidInput, a bad reference) from unique-constraint violations (-> ErrAlreadyExists); previously both surfaced as a misleading 'already exists'/409. Seed the users/agents that group memberships and policy bindings reference (the Ent store enforces user/agent FK edges the raw store lacked), wrap remaining raw fixture/URL ids in tid(), and give the AddAgent fixtures slugs. All pkg/hub permissions tests pass. 
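
The mapError distinction described above (foreign-key violations become ErrInvalidInput, unique-constraint violations become ErrAlreadyExists) might be sketched as follows for the Postgres case. The SQLSTATE codes 23503 and 23505 are the standard Postgres codes for those two violations; the sqlStateError type is a hypothetical stand-in for the driver error, and the real mapError presumably also handles SQLite, which this sketch does not.

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel errors standing in for store.ErrInvalidInput and
// store.ErrAlreadyExists.
var (
	ErrInvalidInput  = errors.New("invalid input")
	ErrAlreadyExists = errors.New("already exists")
)

// sqlStateError is a hypothetical driver error carrying a SQLSTATE code.
type sqlStateError struct {
	Code string
	Msg  string
}

func (e *sqlStateError) Error() string { return e.Code + ": " + e.Msg }

const (
	foreignKeyViolation = "23503" // Postgres foreign_key_violation
	uniqueViolation     = "23505" // Postgres unique_violation
)

// mapError translates constraint violations into store-level sentinel
// errors and passes every other error through unchanged.
func mapError(err error) error {
	var se *sqlStateError
	if !errors.As(err, &se) {
		return err
	}
	switch se.Code {
	case foreignKeyViolation:
		return fmt.Errorf("%w: %s", ErrInvalidInput, se.Msg)
	case uniqueViolation:
		return fmt.Errorf("%w: %s", ErrAlreadyExists, se.Msg)
	}
	return err
}

func main() {
	fk := mapError(&sqlStateError{Code: foreignKeyViolation, Msg: "user does not exist"})
	dup := mapError(&sqlStateError{Code: uniqueViolation, Msg: "slug taken"})
	fmt.Println(errors.Is(fk, ErrInvalidInput), errors.Is(dup, ErrAlreadyExists))
}
```

Keeping the two cases separate lets an HTTP layer answer a bad reference with a 400-style error instead of the misleading 409 the commit message describes.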
* fix(hub): seed creator users for agent-created agents; cascade-delete subscriptions on hard agent delete * test(hub): seed broker slug/name in dispatcher and project_cache fixtures (Ent validators) * test(hub): use tid() in principal/agent URL paths; broker slug in template_bootstrap * fix(entadapter): cascade-delete agents on project delete (raw-store parity); test(hub): seed FK users, broker_name, deterministic UUIDs * test(hub): MaxOpenConns=1 for SQLite test store (serialize writes); tid() URLs + FK user seeds in events/stopall * test(hub): unwrap over-wrapped tid() in unit tests (workspace/logfilter/gcp/web); valid-UUID NotFound cases; tid() scheduled-event URLs * fix(ent): allow empty display_name (raw-store NOT NULL parity, email fallback); test(hub): seed FK owner users, UUID policy/broker/agent IDs in authz remediation * feat(migrate): add Migration β tool (Ent-SQLite → Ent-Postgres) Implements 'scion server migrate --from sqlite://... --to postgres://...' per postgres-strategy.md §7.3. - entc.OpenSQLiteReadOnly: opens source with PRAGMA query_only=ON (no WAL write), MaxOpenConns=1 so the source is never mutated. - entc.MigrateData: generic reflection-based, dependency-ordered copy of all 30 Ent entities (FK-ordered core first), idempotent (skips rows whose PK already exists), atomic per entity (txn), chunked CreateBulk, source/dest row-count verification after each entity, plus the Group.child_groups M2M edge. FK columns are plain fields so edges are preserved via setters. - cmd/server migrate: DSN parsing (sqlite://, file:, bare path; postgres URL or keyword form), --keep-source default / --drop-source cutover, progress logging. Verified end-to-end against live CloudSQL Postgres 16 (integration test + real CLI run): full copy, idempotent re-run, FK + M2M + value round-trips, --drop-source removal. 
* feat(concurrency): dialect-aware multi-replica primitives for Postgres (P3-3..6) Add cluster-coordination primitives so N stateless hub processes can share one Postgres, each degrading to a no-op on single-writer SQLite: - store.AdvisoryLocker + entadapter TryAdvisoryLock (pg_try_advisory_lock on a dedicated conn); Scheduler.RegisterRecurringSingleton gates the heartbeat, stalled, purge, schedule-evaluator and github-health sweeps to one replica/tick. - store.ScheduledEventClaimer + ClaimScheduledEvent atomic claim; fireEvent claims one-shot events before side effects (dedup across replica startup recovery). - CompositeStore.RunSerializable: SERIALIZABLE + retry on 40001/40P01 (single run on SQLite) for future multi-row invariants. - dbmetrics.StartPoolSampler feeds DB connection-pool gauges to the P0-5 scaffold; wired into StartBackgroundServices via SetDBMetrics. Verified existing primitives correct (agent StateVersion CAS, FOR UPDATE sweeps, notification atomic dispatch). Found and documented the schedule SKIP LOCKED early-commit gap (lock released before the status transition), closed by the singleton evaluator. Audit + budget docs in scratchpad. Tests: locking_test.go (advisory no-op, serializable, claim exactly-once incl. 8-way concurrent), pool_sampler_test.go. * feat(hub): widen events to EventPublisher interface + Postgres LISTEN/NOTIFY publisher P3-7: Decouple call sites from the concrete *ChannelEventPublisher. - Add Subscribe(patterns...) (<-chan Event, func()) to the EventPublisher interface; implement it on noopEventPublisher (nil channel) — *ChannelEventPublisher already had it. - Factor the Publish* methods into a shared eventBuilder (sink func) so every backend emits identical subjects/payloads; ChannelEventPublisher embeds it. - web.go (field + SetEventPublisher), messagebroker.go and notifications.go (field + constructor) now take EventPublisher; handlers_messages.go gates SSE on "not the no-op publisher" instead of a concrete type assertion. 
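
The retry behavior attributed to CompositeStore.RunSerializable earlier in this commit list (retry on SQLSTATE 40001 and 40P01) can be sketched without a database as below. The runSerializable signature, the attempt limit, and the sqlStateError type are assumptions for illustration; the real method presumably also opens a SERIALIZABLE transaction around fn, which is omitted here.

```go
package main

import (
	"errors"
	"fmt"
)

// sqlStateError is a hypothetical driver error carrying a SQLSTATE code.
type sqlStateError struct{ Code string }

func (e *sqlStateError) Error() string { return "sqlstate " + e.Code }

// isRetryable reports whether err is a serialization failure (40001) or a
// detected deadlock (40P01), the two codes named in the commit message.
func isRetryable(err error) bool {
	var se *sqlStateError
	if errors.As(err, &se) {
		return se.Code == "40001" || se.Code == "40P01"
	}
	return false
}

// runSerializable calls fn until it succeeds, fails with a non-retryable
// error, or maxAttempts is exhausted. A real implementation would run fn
// inside a SERIALIZABLE transaction and likely back off between attempts.
func runSerializable(maxAttempts int, fn func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil || !isRetryable(err) {
			return err
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	err := runSerializable(5, func() error {
		calls++
		if calls < 3 {
			return &sqlStateError{Code: "40001"} // simulated conflict
		}
		return nil
	})
	fmt.Println(err, calls)
}
```

On single-writer SQLite the commit message describes a single run, which corresponds to calling fn once with no retry loop.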
P3-8: PostgresEventPublisher over pgx LISTEN/NOTIFY (cross-replica delivery). - Per-grove channels plus a global channel (flat exact-match); event type in the JSON envelope. Grove-scoped subjects publish to both the grove channel and the global channel; subscriptions group their patterns by resolved channel so an event is matched only against patterns that opted into the arriving channel (no double delivery). - 8 KB NOTIFY limit handled by reference-and-refetch via scion_event_payloads (TTL-swept so every replica can refetch). - PublishTx enrolls the NOTIFY in a caller transaction (atomic write+publish; rollback => no deliver). Delivery flows exclusively through the listener. - Listener goroutine reconnects with backoff and re-LISTENs (resubscribe); dynamic LISTEN/UNLISTEN applied on a poll (WaitForNotification timeout does not invalidate the pgconn connection). - Emits pkg/observability/dbmetrics signals (published/delivered/dropped, payload size, publish->deliver latency, reconnects, pool stats). - cmd: newEventPublisher selects the backend by database driver (postgres => PostgresEventPublisher, else ChannelEventPublisher) with safe fallback. Tests: routing/registry/payload-offload/metrics/transactional-executor unit tests run without a DB; cross-replica delivery, oversized round-trip, transactional rollback, and reconnect+resubscribe are gated behind SCION_TEST_POSTGRES_DSN. go build ./... green; full pkg/hub suite green. Note: server.go's equivalent type-assertion cleanup is left in the working tree (co-edited with concurrent P0-5/scheduler work) and is functionally optional — HEAD server.go already compiles against the widened interface. * test(store): parameterize store suites over {sqlite, postgres} (P3-2) Add pkg/store/enttest: a backend-selecting Ent client factory for the store test suites. 
Default is in-memory SQLite; built with -tags integration and SCION_TEST_POSTGRES_URL set, it provisions a per-package ephemeral Postgres database (created/dropped via TestMain) and isolates each test in its own schema (search_path) so tests never observe each other's rows. Falls back to SQLite when the env var is unset. Route all entadapter and storetest helpers through enttest.NewClient so the same CRUD-parity oracle runs unchanged against either backend. Fix two real Postgres bugs surfaced by the new path: - entadapter/dialect.go ancestryContains: emit the bind parameter via Builder.Arg ($n on Postgres) instead of a literal '?' through ExprP, which was not rebound and produced a syntax error; and use jsonb_array_elements_text (the column is jsonb on Postgres, not json). - schedule_store_test ClaimPath: make the concurrent-claim assertion backend-aware. SQLite serializes (MaxOpenConns=1, no SKIP LOCKED) so every caller sees both due rows; Postgres uses FOR UPDATE SKIP LOCKED so concurrent callers may observe a disjoint subset (0..2) and must only never error or exceed 2. Verified: full SQLite suite green; storetest CRUD parity green on CloudSQL Postgres; entadapter green on Postgres (schedule ClaimPath fix confirmed). * fix(hub): start dispatcher/broker for any subscription-capable EventPublisher Wave C integration: newEventPublisher can now return a PostgresEventPublisher (LISTEN/NOTIFY) in addition to ChannelEventPublisher. The dispatcher/broker startup previously hard-asserted *ChannelEventPublisher, which silently skipped starting them under Postgres. Gate on (not noop and not nil) instead, matching the existing pattern in handlers_messages.go. 
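A minimal sketch of the "not noop and not nil" gate described in the last fix above, assuming simplified stand-in types. `canSubscribe` and `channelPublisher` are hypothetical names for illustration; only `EventPublisher`, `Subscribe`, and `noopEventPublisher` come from the commit messages, and their real definitions may differ.

```go
package main

import "fmt"

// Event and EventPublisher are simplified stand-ins for the hub's types.
type Event struct{ Subject string }

type EventPublisher interface {
	Subscribe(patterns ...string) (<-chan Event, func())
}

// noopEventPublisher mirrors the no-op backend: Subscribe yields a nil channel.
type noopEventPublisher struct{}

func (noopEventPublisher) Subscribe(patterns ...string) (<-chan Event, func()) {
	return nil, func() {}
}

// channelPublisher (hypothetical) stands in for any subscription-capable
// backend, whether in-memory or Postgres LISTEN/NOTIFY.
type channelPublisher struct{}

func (channelPublisher) Subscribe(patterns ...string) (<-chan Event, func()) {
	return make(chan Event), func() {}
}

// canSubscribe (hypothetical) reports whether dispatcher/broker startup should
// proceed. It checks capability rather than asserting one concrete type, so a
// newly added backend is not silently skipped.
func canSubscribe(p EventPublisher) bool {
	if p == nil {
		return false
	}
	_, isNoop := p.(noopEventPublisher)
	return !isNoop
}

func main() {
	fmt.Println(canSubscribe(nil), canSubscribe(noopEventPublisher{}), canSubscribe(channelPublisher{}))
	// prints "false false true"
}
```

The point of the design is that the check excludes the one backend known to be inert instead of allow-listing the one backend known to work.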
* fix(hub): harden Postgres event publish + verify wiring; lower PG pool default Task 1 — LISTEN/NOTIFY publish path: - Add TestPostgresIntegration_HandlerCreateProjectEmitsNotify: drives the real POST /api/v1/projects handler with a PostgresEventPublisher and asserts a pg_notify lands on scion_ev_global via an independent raw LISTEN — the exact capability the multi-replica live test probed. Verified PASSING against live CloudSQL, proving the handler -> s.events -> pg_notify wiring is correct end to end (the four pre-existing SCION_TEST_POSTGRES_DSN integration tests also pass). The multi-hub 'no NOTIFY' symptom was not reproducible against the current tree. - Bound the autocommit publish (Publish* methods) with publishTimeout (5s). These run synchronously on the caller's (request handler) goroutine and acquire from the event pool; on a connection-starved instance that acquire could block indefinitely, stalling CRUD and silently never emitting NOTIFY. The timeout converts that into a logged error + dropped event (publishing is fire-and-forget). PublishTx (transactional path) is unaffected. Task 2 — connection budget: - Lower the default Postgres MaxOpenConns 20 -> 10 so multiple replicas fit a modest connection budget (see CONNECTION-BUDGET.md). CloudSQL instance scion-postgres-test resized db-f1-micro -> db-g1-small and max_connections set to 100 (out of band). * test(store): add Postgres stress/integration suite (contention, isolation, pool, NOTIFY, migration, schema, multi-process) Add pkg/store/integrationtest/: a Postgres-only suite that exercises behavior the SQLite parity suites cannot reach. Gated by //go:build integration and SCION_TEST_POSTGRES_URL; skips cleanly otherwise. Coverage: - Contention: state_version CAS race (no lost updates, >=N-1 retries, final version==1+N), SKIP LOCKED / conditional-UPDATE event claim (single winner + disjoint drain), unique-key races (project slug, user email, agent slug). 
- Isolation: SERIALIZABLE conflict + RunSerializable retry recovery, REPEATABLE READ no-phantom snapshot, READ COMMITTED dirty-read prevention. - Pool: exhaustion + queued recovery, saturated pool honoring context deadline, long txn not starving short queries, healing after pg_terminate_backend. - LISTEN/NOTIFY: ordered burst no-drop, 8000B payload limit, listener reconnect/resume, cross-channel isolation. - Migration: 1000+ row counts + bounded-memory listing, idempotent re-migration. - Schema: NULL semantics, unicode/emoji, nested JSON + special chars, large-text non-truncation, TIMESTAMPTZ microsecond precision. - Multi-process: forks the test binary for cross-process advisory-lock exclusivity and cross-process NOTIFY delivery. Configurable concurrency via SCION_TEST_CONCURRENCY (default 10). Extend pkg/store/enttest with Active() and NewSchemaURL() so tests can open custom-pool clients and share a DSN with forked child processes; non-integration stubs keep the package API stable. * fix(db): recycle stale conns + keepalives; skip singleton tick on lock error Stale-connection pool stalls (CloudSQL drops idle conns after ~10m): - Add ConnMaxIdleTime to DatabaseConfig/PoolConfig (default 5m pg, 0 sqlite) and apply SetConnMaxIdleTime on the database/sql pool. - OpenPostgres now parses the DSN with pgx and opens via stdlib.OpenDB with TCP keepalive GUCs (idle 60s / interval 15s / count 4) and a 10s connect timeout, so a silently-dropped peer is detected instead of the first query after idle hanging on a dead socket. - pgx event pool (events_postgres.go): set keepalives + connect timeout on both the pool's ConnConfig and the dedicated listener connection, plus MaxConnIdleTime 5m / MaxConnLifetime 30m. Advisory-lock leader election (scheduler.go): - A lock-acquisition error no longer falls open to running the handler unguarded (which would duplicate singleton work across replicas); the tick is skipped and retried next interval. Added regression tests. 
Test harness (enttest/integrationtest): - Accept libpq keyword/value DSNs (not just URL form) when deriving the ephemeral db/schema/params; add WithConnParam helper. - Fix migration idempotency test's per-pass row-count expectation. * fix(store): bound advisory-lock conn checkout + unlock with short timeout TryAdvisoryLock checked a connection out of the pool and ran the unlock on the full 55s scheduler-handler context (acquire) and an unbounded context.Background() (release). On a pool that could not promptly serve a healthy connection, db.Conn() blocked for the entire 55s before failing with 'context deadline exceeded' on every tick; with several singleton handlers firing each 60s tick, those long-blocked goroutines and their pending pool connection requests piled up across ticks and kept the pool jammed (checked out client-side, idle server-side). The unbounded unlock was a second leak vector: if the held connection died mid critical-section, ExecContext could hang forever, so conn.Close() never ran and the connection leaked out of the pool permanently. Bind both the acquire (db.Conn + pg_try_advisory_lock) and the release (pg_advisory_unlock) to a 5s timeout so a bad tick fails fast and retries next tick instead of parking a goroutine for ~55s, and so a dead connection can never block release from freeing the conn. Lock semantics are unchanged: cancelling the acquire context tears down only that context, not the checked-out session that holds the lock. * feat(migrate): in-process migration α (legacy raw-SQL hub.db → Ent) Upgrade a legacy raw-SQL Hub database (the ~53-migration, 30-table schema from the removed pkg/store/sqlite store) to the consolidated Ent-backed SQLite schema, in-process on first boot, behind an automatic backup. pkg/ent/entc/migrate_alpha.go: - IsLegacyRawSQLSchema: detect via the schema_migrations sentinel + the legacy-only agents.agent_id column (no-op for an Ent/empty/absent file). 
- MigrateAlphaSQLite: backup (checkpoint WAL + copy to hub.db.bak.<ts>), AutoMigrate a fresh Ent schema, ATTACH the legacy file, copy every table with INSERT…SELECT (foreign_keys OFF), verify per-table row counts, then atomically swap the migrated file into place. - Data-driven column mapping (created_at→created, updated_at→updated, agents.agent_id→slug, policies→access_policies); bespoke SQL for the group_members/policy_bindings polymorphic splits and surrogate ids; groups.parent_id→group_child_groups edge. - Deterministic UUIDv5 remap for legacy non-UUID primary keys (internal signing-key secrets; plugin runtime-broker ids) with consistent rewrite of every foreign-key reference via a TEMP _id_remap table. - Tolerates missing legacy tables (older schema versions). cmd/server_foreground.go: detect + migrate in initStore's sqlite path, with a --no-auto-migrate operator opt-out (cmd/server.go). Validated end-to-end against four production hub.db files (scion-integration, -integration2, -demo, -gteam): exact row-count parity (up to ~19k rows), every entity reads back through the live Ent store, idempotent re-runs, and broker FK references resolve post-remap. Pre-existing dangling agent created_by/owner_id refs are faithfully preserved (loader runs FK-off). * fix(config): apply real Postgres pool size (leaked SQLite default of 1 starved the pool) The struct-level default for Database.MaxOpenConns/MaxIdleConns is 1 — the value SQLite REQUIRES to serialize writes. applyDatabasePoolDefaults only bumped postgres to a real pool when the value was <= 0, but a postgres deployment configured via env/driver override inherits the embedded default of 1, so the guard never fired and the Ent pool ran with a SINGLE connection. 
Effect in production (both integration hubs): every singleton scheduler tick checks out the lone pool connection to hold its advisory lock, then blocks waiting for a second connection to do its work — a self-deadlock that resolves only at the 55s handler context deadline. All API requests serialize behind the one connection, so GET /api/v1/* served in ~55s across the board. Note env overrides could not paper over this: envKeyToConfigKey splits on every underscore, so SCION_SERVER_DATABASE_MAX_OPEN_CONNS maps to database.max.open.conns, not database.max_open_conns — silently ignored. Treat the leaked SQLite default (<= 1) as 'unset' for postgres so the pool default (10) applies; explicit sizing of 2+ is still respected. SQLite remains pinned to 1. Adds regression tests for all three cases. * docs: add multi-node broker dispatch and NFS workspace designs - broker-dispatch.md: DB-as-state-machine + LISTEN/NOTIFY pattern for cross-replica broker command routing and agent lifecycle dispatch - nfs-workspace.md: NFS workspace coordination for VM (host bind-mount) and K8s/Cloud Run (per-pod mount) runtime models * fix(store): address PR GoogleCloudPlatform#304 review — context leaks and DSN parsing Thread the server's cancellable context into initStore and initWebServer instead of using context.Background(), so that: - DB migrations and the health-check ping cancel on Ctrl+C during startup (medium-priority review comment). - The Postgres LISTEN/NOTIFY event publisher goroutine shuts down cleanly when the server exits, preventing connection leaks (high-priority review comment). Also fix parseSQLiteSourceDSN to handle the file:// prefix before the file: prefix, so that file:///var/lib/hub.db correctly resolves to /var/lib/hub.db instead of ///var/lib/hub.db. Add test cases for file:// and file:/// DSN forms. * docs: add project log for PR GoogleCloudPlatform#304 review fixes * fix(store): context leak in legacy migration & double file: prefix 1. 
Thread the server's cancellable context through maybeMigrateLegacySQLite → MigrateAlphaSQLite so that Ctrl+C during first-boot legacy migration aborts it instead of running with an uncancellable context.Background(). 2. Guard against a double "file:" prefix when constructing the SQLite DSN. If the operator's database.url already starts with "file:", we no longer blindly prepend another "file:" prefix. Also correctly appends cache=shared with "&" when the DSN already contains query parameters. * fix(store): rename ProjectTypeHubNative → ProjectTypeHubManaged (rebase fixup) Upstream renamed hub-native to hub-managed while the PR was in flight. Update the two remaining references that the rebase conflict resolution missed. --------- Co-authored-by: Scion <agent@scion.dev>
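The two DSN fixes above (checking `file://` before `file:`, and not prepending a second `file:`) can be sketched roughly as follows. `sqlitePath` and `sqliteDSN` are hypothetical helper names, not the project's `parseSQLiteSourceDSN`; the accepted prefixes and the `cache=shared` parameter are taken from the commit messages, everything else is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// sqlitePath strips a known scheme from a source DSN. The longer "file://"
// prefix is tested before "file:" so that file:///var/lib/hub.db yields
// /var/lib/hub.db rather than ///var/lib/hub.db.
func sqlitePath(dsn string) string {
	for _, prefix := range []string{"sqlite://", "file://", "file:"} {
		if strings.HasPrefix(dsn, prefix) {
			return strings.TrimPrefix(dsn, prefix)
		}
	}
	return dsn // bare path
}

// sqliteDSN builds a driver DSN from an operator-supplied database URL.
// It adds "file:" only when missing and picks "?" or "&" depending on
// whether the URL already carries query parameters.
func sqliteDSN(url string) string {
	dsn := url
	if !strings.HasPrefix(dsn, "file:") {
		dsn = "file:" + dsn
	}
	sep := "?"
	if strings.Contains(dsn, "?") {
		sep = "&"
	}
	return dsn + sep + "cache=shared"
}

func main() {
	fmt.Println(sqlitePath("file:///var/lib/hub.db"))
	fmt.Println(sqliteDSN("hub.db"))
	fmt.Println(sqliteDSN("file:hub.db?mode=rwc"))
}
```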

…t token

TestClient_StartTokenRefresh exercised RefreshToken -> WriteTokenFile without isolating the token home, so running the suite inside a live agent container overwrote the real ~/.scion/scion-token with the test stub "refreshed-token". Every subsequent Hub call then 401'd with "compact JWS format must have three parts" / "unrecognized token format".

- Add SetTokenHome(t.TempDir()) to the test, matching its siblings.
- Guard WriteTokenFile: panic under `go test` unless SetTokenHome was called, so a forgotten isolation can never corrupt live state again. Reads remain unguarded (harmless; return empty when absent).

…ecycle + message routing (GoogleCloudPlatform#305) * Add canonical engineering glossary (GLOSSARY.md) (#102) * Add engineering glossary (GLOSSARY.md) with canonical terms and cleanup tracker Add a root-level GLOSSARY.md capturing canonical Scion terminology in the ubiquitous-language format (preferred term + synonyms to avoid), grouped by domain cluster, plus an Exceptions & Future Cleanup section tracking known naming-convergence work. Link it from agents.md as the canonical engineering glossary. * Revise glossary: broker reframe, Event Bus, Hub-managed, and term refinements Refine entries from review: redefine Message Broker as the pluggable messaging-integration system (add Broker plugin, Built-in broker); add Event Bus for the NATS real-time/event capability; collapse hub-native/Hub Workspace into Hub-managed project/workspace; tighten Template (harness-agnostic, optional default harness-config), Skill (template-only, Agent Skills link), Profile (named runtime-broker settings bundle), Harness/Harness-config; reframe Hub as the control plane in both modes; add Group and Message Group. Expand Exceptions & Future Cleanup to nine tracked items. * Glossary: restructure headings, add cross-refs, modes table, and new terms - Retitle to "Scion Glossary"; drop the "Language" wrapper and promote the thematic categories to top-level sections - Add an Operations section (Attach, Dispatch) and move Profile next to Runtime Broker - Add a Local/Workstation/Hosted comparison table and "See also" cross-refs across the main confusable term clusters - Reframe the intro around the three-way broker collision (incl. 
Event Bus) and defer to the disambiguation rule; sentence-case "Shared directory" - Add canonical entries for Secret, Notification, and Schedule - Add a "Potential Future Additions" section cataloguing candidate terms * Glossary: remove Exceptions & Future Cleanup tracker The cleanup items are now tracked by dedicated agents that open GitHub issues and implementation PRs, so the staged tracker no longer lives in the glossary. Reword the two intro/disambiguation references that pointed at the removed section to point at GitHub issues instead. --------- Co-authored-by: Preston Holmes <ptone@google.com> * P0-1: switch Postgres driver from lib/pq to pgx/v5 stdlib - Add github.com/jackc/pgx/v5/stdlib (registers as "pgx") - driver_postgres.go: blank import pgx stdlib instead of lib/pq - OpenPostgres: open via sql.Open("pgx", dsn) + entsql.OpenDB - Introduce PoolConfig (applied to *sql.DB); thread through OpenSQLite/OpenPostgres and update all callers - go mod tidy drops lib/pq * P0-2: add connection pool config to DatabaseConfig - DatabaseConfig gains MaxOpenConns / MaxIdleConns / ConnMaxLifetime plus ConnMaxLifetimeDuration() helper - DefaultGlobalConfig sets sqlite pool defaults (MaxOpenConns=1, load-bearing for write serialization) - applyDatabasePoolDefaults fills postgres defaults (20/5/30m) and forces sqlite MaxOpenConns=1; called in both load paths - Mirror fields in V1DatabaseConfig + both conversion directions - Wire pool settings into entc.OpenSQLite in initStore * P0-3/P0-4: CRUD-parity test harness + spec-driven fixture generator P0-3: pkg/store/storetest/ — backend-agnostic, table-driven CRUD oracle. A Factory(t) -> store.Store is injected; generic Domain[T] descriptors drive Create/Read/Update/Delete (+optional soft-delete)/List-paginate/List-filter. Ships group + policy domains and runs green against today's CompositeStore (SQLite base + Ent DB). Ready to accept a postgresFactory for P3-2. 
P0-4: internal/fixturegen/ — Go-defined spec seeding >=1 row per table across all 30 domain tables, with edge cases (NULL optionals, max-length strings, nested/unicode JSON, soft-deleted agent, BLOB). Deterministic. 'go run ./internal/fixturegen' emits testdata/hub-v46-fixture.db, prints a 30-table coverage report, and caches the blob to the scratchpad mount. CI gate fails if any table has zero rows. * feat(ent): add 23 new Ent schemas for full table parity (P1-2 + P1-3) * P2: port notification + gcp/github/token domains to Ent entadapter Add Ent-backed implementations of the notification, GCP service account, GitHub App installation, and user access token store sub-interfaces: - notification_store.go: NotificationStore (subscriptions, notifications, templates). Dispatch uses an atomic conditional update as the multi-replica claim primitive, and an optional NotificationPublisher designs in the LISTEN/NOTIFY fan-out for created/dispatched events. - external_store.go: GCPServiceAccountStore + GitHubInstallationStore + UserAccessTokenStore. GitHub create is idempotent (INSERT OR IGNORE semantics), repositories/scopes are JSON, default_scopes is CSV, and tokens support key-hash lookup. Legacy api_keys is intentionally not surfaced. - storetest: add GCPServiceAccount, SubscriptionTemplate, and NotificationSubscription CRUD-parity domains. Does not modify composite.go. * P2: port schedule, maintenance, message domains to Ent entadapter - schedule_store.go: ScheduleStore + ScheduledEventStore sub-interfaces with dialect-aware SELECT FOR UPDATE SKIP LOCKED claim helper for the ListDueSchedules / ListPendingScheduledEvents job-claim paths (plain SELECT on SQLite, SKIP LOCKED on Postgres). - maintenance_store.go: run-state RMW, AbortRunningMaintenanceOps, Go-side seed (uuid.New) replacing SQLite randomblob() UUID seeds. - message_store.go: CRUD, read flags, PurgeOldMessages, design-in PublishUserMessage hook for Postgres LISTEN/NOTIFY. 
- pkg/ent/client_driver.go: hand-written Client.Driver() accessor for dialect detection + raw locking queries. * feat(entadapter): port user + allowlist/invite domains to Ent (P2) Implements the Ent-backed store adapters for the user and allowlist/invite domains, plus their CRUD-parity oracle descriptors. pkg/store/entadapter/user_store.go (store.UserStore): - CreateUser/GetUser/GetUserByEmail/UpdateUser/UpdateUserLastSeen/ DeleteUser/ListUsers. - Case-insensitive email: emails are normalized to lower case on write (so the plain unique index enforces case-insensitive uniqueness, equivalent to the legacy UNIQUE COLLATE NOCASE) and matched with EmailEqualFold (lower(email)=lower($1)) on read. ent codegen + AutoMigrate cannot emit a real lower(email) functional index across both SQLite (tests) and Postgres, so the invariant is enforced at the port layer. - Offset-based pagination matching the legacy SQLite store. pkg/store/entadapter/allowlist_store.go (store.AllowListStore + store.InviteCodeStore): - Full allow-list + invite-code CRUD. - BulkAddAllowListEntries uses CreateBulk + OnConflictColumns(email). Ignore() for race-safe INSERT-OR-IGNORE; added/skipped counts mirror the legacy per-row semantics (existing + within-batch dups skipped). - IncrementInviteUseCount is a single atomic conditional UPDATE (revoked=false AND not expired AND (max_uses=0 OR use_count<max_uses)), which is race-free on both backends without SELECT...FOR UPDATE. The sql/lock feature is enabled and ForUpdate is available for genuine multi-statement RMW paths. - ListAllowListEntriesWithInvites batch-joins invite codes (invite_id is a plain column, not an Ent edge). Schema: - pkg/ent/schema/user.go: add nillable last_seen field (+ index) needed by UpdateUserLastSeen / lastSeen sort; document the case-insensitive email strategy. - pkg/ent/generate.go: enable --feature sql/upsert,sql/lock (required for OnConflict and ForUpdate). 
Tests (all passing): - pkg/store/storetest/domains_user.go: UserDomain, AllowListDomain, InviteCodeDomain oracle descriptors (kept in a separate file to avoid contending on domains.go). - entadapter oracle test runs the shared CRUD-parity suite directly against the new adapters; behavior tests cover case-insensitivity, bulk idempotency, conditional increment, stats, and the invite join. NOTE: Generated Ent code under pkg/ent/** is intentionally NOT included. This is a shared worktree where sibling port agents concurrently modify schemas and the same feature flags; the generated code must be regenerated at wave integration via: go generate ./pkg/ent/... Verified locally that regeneration + full build + tests pass. Per P2 scope: composite.go wiring and ensureEntUser shadow removal are deferred to P2-collapse. * P2: port secret/env_var + template/harness_config domains to Ent Add Ent-backed store implementations for the secret/env and template/harness domains, mirroring the legacy SQLite semantics: - entadapter/secret_store.go: SecretStore implementing store.SecretStore + store.EnvVarStore. Polymorphic (scope, scope_id) addressing, COALESCE target->key projection, version bump on update, get-then-update upsert, and transitive ListProgenySecrets via a created_by IN-list over the ancestor set (user scope + allow_progeny only; encrypted value withheld). - entadapter/template_store.go: TemplateStore implementing store.TemplateStore + store.HarnessConfigStore. base_template hierarchy, scope/project_id backwards-compat lookups, content_hash, JSON config/files columns, DeleteByScope. Subscription templates are owned by NotificationStore. - Direct Ent unit tests incl. a progeny-inheritance parity test. - storetest: Template/HarnessConfig/Secret/EnvVar domain descriptors wired into RunStoreSuite for cross-backend CRUD parity. 
* P2: port project/broker + brokersecret domains to Ent Port the project/broker domain (projects, runtime_brokers, project_contributors, project_sync_state) and the broker-auth domain (broker_secrets, broker_join_tokens) from raw SQL to Ent adapters. - pkg/store/entadapter/project_store.go: implements ProjectStore, RuntimeBrokerStore, ProjectProviderStore and ProjectSyncStateStore. * provider + sync-state upserts use Ent OnConflict().UpdateNewValues() (sql/upsert) keyed on the (project_id, broker_id) unique index. * runtime broker heartbeat/update use an optimistic version-CAS loop on a new internal lock_version token, serializing concurrent writers portably across SQLite (tests) and Postgres without SELECT ... FOR UPDATE. * slug lookups support case-insensitive matching (EqualFold). * project computed fields (AgentCount, ActiveBrokerCount, ProjectType) are derived via Ent queries, matching the legacy SQLite store. - pkg/store/entadapter/brokersecret_store.go: implements BrokerSecretStore (per-broker HMAC secrets + short-lived join tokens, expiry cleanup). - Project Ent schema: add operational fields for full parity (default_runtime_broker_id, shared_dirs, github_*, git_identity). - RuntimeBroker Ent schema: relax vestigial type column to Optional, add internal lock_version concurrency token. - Regenerate Ent with sql/upsert,sql/lock features. - storetest: add Project, RuntimeBroker, BrokerSecret and BrokerJoinToken CRUD-parity domains. - Unit tests for both adapters. Per the integration plan, composite.go wiring and ensureEntProject shadow removal are deferred to P2-collapse. * P2: port agent domain to Ent entadapter (XL) * chore(ent): regenerate Ent code for all 30 entity schemas Regenerated with --feature sql/upsert,sql/lock to support OnConflict upserts and ForUpdate/SKIP LOCKED job claims. 
* P2-collapse: collapse dual-DB into single Ent store Wire all Ent-backed sub-stores into CompositeStore via embedding, removing the raw-SQL base store and the User/Agent/Project shadow-sync machinery (ensureEntUser/ensureEntAgent/ensureEntProject). CompositeStore now serves every domain from a single Ent client and implements Close/Ping/Migrate directly. Collapse initStore() to open one Ent SQLite DB (no _ent shadow DSN, no MigrateGroveToProjectData, no raw sqlite.New). Register the User, AllowList, and InviteCode domains in the storetest CRUD-parity suite. Update entadapter tests for the single-DB NewCompositeStore(client) signature. go build ./... green; go test ./pkg/store/entadapter/... ./pkg/store/storetest/... green. * P2-delete: remove raw-SQL store implementation Delete the ~6k-LOC raw-SQL store (sqlite.go) and its per-domain sibling files (brokersecret, gcp_service_account, github_installation, maintenance, messages, notification, project_sync_state, schedule, scheduled_event) plus their tests, including the inline schema-migration scaffold. Keep driver.go, which registers the pure-Go SQLite driver used by Ent's SQLite backend. Repoint the two non-test consumers to the Ent-backed store: - cmd/hub_secret_migrate.go now opens an Ent client + CompositeStore. - internal/fixturegen opens via entc and seeds the Ent schema's *sql.DB. go build ./... green; no remaining production references to the raw store. * test: compile-migrate downstream suites to Ent store + fix signing-key PK Replace the removed raw-SQL store in downstream tests with an Ent-backed newTestStore helper (pkg/hub, pkg/secret) and update cmd/server_test.go and internal/fixturegen tests. Port the 8 raw-SQL DB() access sites in hub tests via a new CompositeStore.DB() escape-hatch accessor. Fix a production bug surfaced by the collapse: hub/server.go signingKeySecretID generated a non-UUID secret primary key, which the Ent secret store rejects; it now derives a deterministic UUIDv5. go build ./... 
green; entadapter and storetest suites green. NOTE: hub/secret/fixturegen suites now COMPILE but many tests still fail because their fixtures seed non-UUID string IDs that the UUID-PK Ent schema rejects; addressed in follow-up commits (tid() helper). * test(hub): map non-UUID fixture IDs to UUIDs via tid() helper Wrap human-readable test identifiers in tid() (deterministic UUIDv5) so the UUID-PK Ent store accepts them while preserving cross-reference consistency and ID-equality assertions. Reduces pkg/hub failures from 611 to 79; remaining failures are behavioral, not ID-format, and are addressed separately. * fix(store): seed maintenance ops in Migrate; initStore uses Migrate Restore raw-SQL parity: CompositeStore.Migrate now runs AutoMigrate and seeds built-in maintenance operations (the raw store seeded these in its migrations). initStore and hub test helpers call s.Migrate() so production and tests seed consistently. Fixes the maintenance-operation hub tests (404 'Operation not found'). pkg/hub failures 79 -> 71. * test(hub): satisfy Ent NotEmpty validators in fixtures Add slugs/broker names to test fixtures that previously relied on the raw store's lenient (no-validator) inserts: project/agent slugs in the logs test helper, broker slugs in embedded/profile/authz fixtures, and BrokerName on envgather ProjectProvider literals. pkg/hub failures 71 -> 57. * fix(entadapter): Get-by-id returns ErrNotFound for non-UUID identifiers Restore raw-SQL store parity: a malformed identifier cannot match any UUID primary key, so get-by-id lookups now report store.ErrNotFound instead of store.ErrInvalidInput. This matches the raw store (a lookup with a bad id simply returned no row) and is what callers depend on — e.g. resolveTemplate passes a template *name* to GetTemplate and relies on ErrNotFound to fall back to slug-based resolution. New parseGetID helper applied across all 17 get-by-id methods. pkg/hub failures 56 -> 40; entadapter/storetest stay green. 
* test(hub): fix store-less id wraps and project-route URL paths - controlchannel_client_test: revert tid() wraps (store-less path-builder test; IDs must match the expected literal paths). - github/envgather: project-scoped route handlers resolve the project by UUID id, so build paths with tid(rawID) via fmt.Sprintf instead of the old raw-id literal. pkg/hub failures 40 -> 32. * test(hub): unwrap projectIDFromServiceAccountEmail expectation The tid() sweep over-wrapped a non-ID expected value in a pure-function test; restore the literal GCP project id. * fix(ent): GCPServiceAccount.project_id is a string, not a UUID The GCP service account project_id holds the GCP *cloud project* identifier (e.g. 'my-project-123'), a free-form string — not a UUID. The schema declared it field.UUID, so entadapter CreateGCPServiceAccount/Update did parseUUID(sa.ProjectID) and rejected real GCP project ids, breaking SA mint/create with a 400 in production (storetest masked it by passing a UUID). Change the schema field to field.String, regenerate Ent, and store/read project_id as a string in external_store.go. Fixes ~7 hub GCP tests; pkg/hub 31 -> 23. * test(hub): fix GCP SA project-id assertion and project-settings id Unwrap the over-wrapped 'my-project' expectation now that project_id is a string, and wrap the dynamic project-settings project ID with tid(). * test(hub): revert tid() over-wraps in store-less events_test events_test exercises the in-memory ChannelEventPublisher directly; its ProjectID/IDs are subject-string components, not stored UUIDs. The tid() sweep wrongly rewrote them so published subjects no longer matched the subscriptions (timeouts). Restore the literal values. pkg/hub 19 -> 12. 
* test(hub): fix maintenance-run path and notifications agentId queries Use tid() UUIDs in the maintenance run-detail path and the notifications agentId query params; guard list indexing with require.Len so a mismatch fails cleanly instead of panicking (panics truncate the package run). * test(hub): wrap remaining fixture IDs revealed after panic-cascade cleared Panics ([0] on empty lists) had been truncating the package run, hiding many failures and starving the tid() sweep. With those guarded, sweep the newly reached tests: wrap dynamic rune-suffix IDs and the setupProjectWithBroker / seedCreatedAgentForHarnessTest helper IDs, and convert raw query-param project IDs to tid(). No UUID-parse errors remain in pkg/hub. * test(hub): unwrap tid() in scheduler_test (mock store, raw ids) scheduler_test uses an in-memory mockScheduledEventStore, not the Ent store, so its ids need no UUIDs; the erroneous tid() wraps broke raw getEvent lookups and caused a nil-pointer panic that truncated the package run. * fix(ent): Template.harness may be empty (raw-store parity) A template imported from a directory that declares no harness type has an empty harness; the raw-SQL store stored it, but the Ent NotEmpty validator made BootstrapTemplatesFromDir silently skip such templates. Drop NotEmpty and regenerate. Removing the [0]-on-empty panics this caused un-truncates the hub package run (true failure count now visible). * test(hub): wrap dynamic fixture IDs in wake/workspace/signing-key tests Wrap tid() around the wake_test, setupWorkspaceProject, and empty-value signing-key secret IDs now reachable after panic removal. No panics in the hub package run. * test(hub): convert raw-id URL path segments to tid() Build GET/PUT/DELETE paths for agents/projects/brokers/templates/harness-configs and workspace sync routes from tid(rawID) so the by-id handlers resolve the entity (raw ids no longer match the UUID PKs). pkg/hub 93 -> 80. 
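The tid() helper referenced throughout these test commits is described only as a deterministic UUIDv5 over a human-readable id. A minimal standard-library sketch of that idea follows; the namespace choice (the RFC 4122 DNS namespace) and the example URL path are assumptions, and the project's actual helper may use a different namespace or a UUID library.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// namespace is the RFC 4122 DNS namespace UUID
// (6ba7b810-9dad-11d1-80b4-00c04fd430c8). Assumption: the real tid() may
// use a different namespace.
var namespace = [16]byte{
	0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// tid maps a readable test identifier to a stable UUIDv5 string, so the same
// raw id always yields the same UUID across fixtures, URLs, and assertions.
func tid(name string) string {
	h := sha1.New()
	h.Write(namespace[:])
	h.Write([]byte(name))
	var u [16]byte
	copy(u[:], h.Sum(nil)) // first 16 bytes of the SHA-1 digest
	u[6] = (u[6] & 0x0f) | 0x50 // version 5
	u[8] = (u[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}

func main() {
	id := tid("agent-1")
	fmt.Println(id == tid("agent-1")) // same input, same UUID
	// Hypothetical route shape, for illustration only.
	fmt.Println(fmt.Sprintf("/api/v1/agents/%s", id))
}
```

Because the mapping is a pure function of the name, a fixture seeded with tid("agent-1") and a URL built from tid("agent-1") agree without sharing any state, which is what lets ID-equality assertions survive the move to UUID primary keys.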
* fix(hub): seed creator users for agent-created agents; cascade-delete subscriptions on hard agent delete * test(hub): seed broker slug/name in dispatcher and project_cache fixtures (Ent validators) * fix(entadapter): cascade-delete agents on project delete (raw-store parity); test(hub): seed FK users, broker_name, deterministic UUIDs * test(hub): MaxOpenConns=1 for SQLite test store (serialize writes); tid() URLs + FK user seeds in events/stopall * test(hub): unwrap over-wrapped tid() in unit tests (workspace/logfilter/gcp/web); valid-UUID NotFound cases; tid() scheduled-event URLs * fix(ent): allow empty display_name (raw-store NOT NULL parity, email fallback); test(hub): seed FK owner users, UUID policy/broker/agent IDs in authz remediation * feat(migrate): add Migration β tool (Ent-SQLite → Ent-Postgres) Implements 'scion server migrate --from sqlite://... --to postgres://...' per postgres-strategy.md §7.3. - entc.OpenSQLiteReadOnly: opens source with PRAGMA query_only=ON (no WAL write), MaxOpenConns=1 so the source is never mutated. - entc.MigrateData: generic reflection-based, dependency-ordered copy of all 30 Ent entities (FK-ordered core first), idempotent (skips rows whose PK already exists), atomic per entity (txn), chunked CreateBulk, source/dest row-count verification after each entity, plus the Group.child_groups M2M edge. FK columns are plain fields so edges are preserved via setters. - cmd/server migrate: DSN parsing (sqlite://, file:, bare path; postgres URL or keyword form), --keep-source default / --drop-source cutover, progress logging. Verified end-to-end against live CloudSQL Postgres 16 (integration test + real CLI run): full copy, idempotent re-run, FK + M2M + value round-trips, --drop-source removal. 
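The DSN forms accepted by the migrate command (sqlite://, file:, bare path; postgres URL or keyword form) can be illustrated with a small classifier. This is a minimal sketch, not the code in cmd/server migrate: the function name classifyDSN and the "contains =" heuristic for keyword-form Postgres DSNs are assumptions made for illustration. It checks file:// before file: so that file:///path/to/db keeps its absolute path.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyDSN is a hypothetical helper: it returns the driver name and the
// value that would be handed to that driver (a file path for SQLite, the
// untouched DSN for Postgres).
func classifyDSN(dsn string) (driver, target string) {
	switch {
	case strings.HasPrefix(dsn, "postgres://"), strings.HasPrefix(dsn, "postgresql://"):
		return "postgres", dsn
	case strings.HasPrefix(dsn, "sqlite://"):
		return "sqlite", strings.TrimPrefix(dsn, "sqlite://")
	case strings.HasPrefix(dsn, "file://"):
		// Must be tested before "file:" or the path would keep a leading "//".
		return "sqlite", strings.TrimPrefix(dsn, "file://")
	case strings.HasPrefix(dsn, "file:"):
		return "sqlite", strings.TrimPrefix(dsn, "file:")
	case strings.Contains(dsn, "="):
		// Assumed heuristic for libpq keyword/value form, e.g. "host=db dbname=scion".
		return "postgres", dsn
	default:
		// Anything else is treated as a bare SQLite file path.
		return "sqlite", dsn
	}
}

func main() {
	for _, dsn := range []string{
		"sqlite:///var/lib/scion/hub.db",
		"file:///path/to/db",
		"file:hub.db",
		"hub.db",
		"postgres://user@host/scion",
		"host=localhost dbname=scion",
	} {
		driver, target := classifyDSN(dsn)
		fmt.Printf("%-34s driver=%-8s target=%s\n", dsn, driver, target)
	}
}
```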
* feat(concurrency): dialect-aware multi-replica primitives for Postgres (P3-3..6) Add cluster-coordination primitives so N stateless hub processes can share one Postgres, each degrading to a no-op on single-writer SQLite: - store.AdvisoryLocker + entadapter TryAdvisoryLock (pg_try_advisory_lock on a dedicated conn); Scheduler.RegisterRecurringSingleton gates the heartbeat, stalled, purge, schedule-evaluator and github-health sweeps to one replica/tick. - store.ScheduledEventClaimer + ClaimScheduledEvent atomic claim; fireEvent claims one-shot events before side effects (dedup across replica startup recovery). - CompositeStore.RunSerializable: SERIALIZABLE + retry on 40001/40P01 (single run on SQLite) for future multi-row invariants. - dbmetrics.StartPoolSampler feeds DB connection-pool gauges to the P0-5 scaffold; wired into StartBackgroundServices via SetDBMetrics. Verified existing primitives correct (agent StateVersion CAS, FOR UPDATE sweeps, notification atomic dispatch). Found and documented the schedule SKIP LOCKED early-commit gap (lock released before the status transition), closed by the singleton evaluator. Audit + budget docs in scratchpad. Tests: locking_test.go (advisory no-op, serializable, claim exactly-once incl. 8-way concurrent), pool_sampler_test.go. * feat(hub): widen events to EventPublisher interface + Postgres LISTEN/NOTIFY publisher P3-7: Decouple call sites from the concrete *ChannelEventPublisher. - Add Subscribe(patterns...) (<-chan Event, func()) to the EventPublisher interface; implement it on noopEventPublisher (nil channel) — *ChannelEventPublisher already had it. - Factor the Publish* methods into a shared eventBuilder (sink func) so every backend emits identical subjects/payloads; ChannelEventPublisher embeds it. - web.go (field + SetEventPublisher), messagebroker.go and notifications.go (field + constructor) now take EventPublisher; handlers_messages.go gates SSE on "not the no-op publisher" instead of a concrete type assertion. 
P3-8: PostgresEventPublisher over pgx LISTEN/NOTIFY (cross-replica delivery). - Per-grove channels plus a global channel (flat exact-match); event type in the JSON envelope. Grove-scoped subjects publish to both the grove channel and the global channel; subscriptions group their patterns by resolved channel so an event is matched only against patterns that opted into the arriving channel (no double delivery). - 8 KB NOTIFY limit handled by reference-and-refetch via scion_event_payloads (TTL-swept so every replica can refetch). - PublishTx enrolls the NOTIFY in a caller transaction (atomic write+publish; rollback => no deliver). Delivery flows exclusively through the listener. - Listener goroutine reconnects with backoff and re-LISTENs (resubscribe); dynamic LISTEN/UNLISTEN applied on a poll (WaitForNotification timeout does not invalidate the pgconn connection). - Emits pkg/observability/dbmetrics signals (published/delivered/dropped, payload size, publish->deliver latency, reconnects, pool stats). - cmd: newEventPublisher selects the backend by database driver (postgres => PostgresEventPublisher, else ChannelEventPublisher) with safe fallback. Tests: routing/registry/payload-offload/metrics/transactional-executor unit tests run without a DB; cross-replica delivery, oversized round-trip, transactional rollback, and reconnect+resubscribe are gated behind SCION_TEST_POSTGRES_DSN. go build ./... green; full pkg/hub suite green. Note: server.go's equivalent type-assertion cleanup is left in the working tree (co-edited with concurrent P0-5/scheduler work) and is functionally optional — HEAD server.go already compiles against the widened interface. * test(store): parameterize store suites over {sqlite, postgres} (P3-2) Add pkg/store/enttest: a backend-selecting Ent client factory for the store test suites. 
Default is in-memory SQLite; built with -tags integration and SCION_TEST_POSTGRES_URL set, it provisions a per-package ephemeral Postgres database (created/dropped via TestMain) and isolates each test in its own schema (search_path) so tests never observe each other's rows. Falls back to SQLite when the env var is unset. Route all entadapter and storetest helpers through enttest.NewClient so the same CRUD-parity oracle runs unchanged against either backend. Fix two real Postgres bugs surfaced by the new path: - entadapter/dialect.go ancestryContains: emit the bind parameter via Builder.Arg ($n on Postgres) instead of a literal '?' through ExprP, which was not rebound and produced a syntax error; and use jsonb_array_elements_text (the column is jsonb on Postgres, not json). - schedule_store_test ClaimPath: make the concurrent-claim assertion backend-aware. SQLite serializes (MaxOpenConns=1, no SKIP LOCKED) so every caller sees both due rows; Postgres uses FOR UPDATE SKIP LOCKED so concurrent callers may observe a disjoint subset (0..2) and must only never error or exceed 2. Verified: full SQLite suite green; storetest CRUD parity green on CloudSQL Postgres; entadapter green on Postgres (schedule ClaimPath fix confirmed). * fix(hub): harden Postgres event publish + verify wiring; lower PG pool default Task 1 — LISTEN/NOTIFY publish path: - Add TestPostgresIntegration_HandlerCreateProjectEmitsNotify: drives the real POST /api/v1/projects handler with a PostgresEventPublisher and asserts a pg_notify lands on scion_ev_global via an independent raw LISTEN — the exact capability the multi-replica live test probed. Verified PASSING against live CloudSQL, proving the handler -> s.events -> pg_notify wiring is correct end to end (the four pre-existing SCION_TEST_POSTGRES_DSN integration tests also pass). The multi-hub 'no NOTIFY' symptom was not reproducible against the current tree. - Bound the autocommit publish (Publish* methods) with publishTimeout (5s). 
These run synchronously on the caller's (request handler) goroutine and acquire from the event pool; on a connection-starved instance that acquire could block indefinitely, stalling CRUD and silently never emitting NOTIFY. The timeout converts that into a logged error + dropped event (publishing is fire-and-forget). PublishTx (transactional path) is unaffected. Task 2 — connection budget: - Lower the default Postgres MaxOpenConns 20 -> 10 so multiple replicas fit a modest connection budget (see CONNECTION-BUDGET.md). CloudSQL instance scion-postgres-test resized db-f1-micro -> db-g1-small and max_connections set to 100 (out of band). * test(store): add Postgres stress/integration suite (contention, isolation, pool, NOTIFY, migration, schema, multi-process) Add pkg/store/integrationtest/: a Postgres-only suite that exercises behavior the SQLite parity suites cannot reach. Gated by //go:build integration and SCION_TEST_POSTGRES_URL; skips cleanly otherwise. Coverage: - Contention: state_version CAS race (no lost updates, >=N-1 retries, final version==1+N), SKIP LOCKED / conditional-UPDATE event claim (single winner + disjoint drain), unique-key races (project slug, user email, agent slug). - Isolation: SERIALIZABLE conflict + RunSerializable retry recovery, REPEATABLE READ no-phantom snapshot, READ COMMITTED dirty-read prevention. - Pool: exhaustion + queued recovery, saturated pool honoring context deadline, long txn not starving short queries, healing after pg_terminate_backend. - LISTEN/NOTIFY: ordered burst no-drop, 8000B payload limit, listener reconnect/resume, cross-channel isolation. - Migration: 1000+ row counts + bounded-memory listing, idempotent re-migration. - Schema: NULL semantics, unicode/emoji, nested JSON + special chars, large-text non-truncation, TIMESTAMPTZ microsecond precision. - Multi-process: forks the test binary for cross-process advisory-lock exclusivity and cross-process NOTIFY delivery. 
Configurable concurrency via SCION_TEST_CONCURRENCY (default 10). Extend pkg/store/enttest with Active() and NewSchemaURL() so tests can open custom-pool clients and share a DSN with forked child processes; non-integration stubs keep the package API stable. * fix(db): recycle stale conns + keepalives; skip singleton tick on lock error Stale-connection pool stalls (CloudSQL drops idle conns after ~10m): - Add ConnMaxIdleTime to DatabaseConfig/PoolConfig (default 5m pg, 0 sqlite) and apply SetConnMaxIdleTime on the database/sql pool. - OpenPostgres now parses the DSN with pgx and opens via stdlib.OpenDB with TCP keepalive GUCs (idle 60s / interval 15s / count 4) and a 10s connect timeout, so a silently-dropped peer is detected instead of the first query after idle hanging on a dead socket. - pgx event pool (events_postgres.go): set keepalives + connect timeout on both the pool's ConnConfig and the dedicated listener connection, plus MaxConnIdleTime 5m / MaxConnLifetime 30m. Advisory-lock leader election (scheduler.go): - A lock-acquisition error no longer falls open to running the handler unguarded (which would duplicate singleton work across replicas); the tick is skipped and retried next interval. Added regression tests. Test harness (enttest/integrationtest): - Accept libpq keyword/value DSNs (not just URL form) when deriving the ephemeral db/schema/params; add WithConnParam helper. - Fix migration idempotency test's per-pass row-count expectation. * fix(store): bound advisory-lock conn checkout + unlock with short timeout TryAdvisoryLock checked a connection out of the pool and ran the unlock on the full 55s scheduler-handler context (acquire) and an unbounded context.Background() (release). 
On a pool that could not promptly serve a healthy connection, db.Conn() blocked for the entire 55s before failing with 'context deadline exceeded' on every tick; with several singleton handlers firing each 60s tick, those long-blocked goroutines and their pending pool connection requests piled up across ticks and kept the pool jammed (checked out client-side, idle server-side). The unbounded unlock was a second leak vector: if the held connection died mid critical-section, ExecContext could hang forever, so conn.Close() never ran and the connection leaked out of the pool permanently. Bind both the acquire (db.Conn + pg_try_advisory_lock) and the release (pg_advisory_unlock) to a 5s timeout so a bad tick fails fast and retries next tick instead of parking a goroutine for ~55s, and so a dead connection can never block release from freeing the conn. Lock semantics are unchanged: cancelling the acquire context tears down only that context, not the checked-out session that holds the lock. * feat(migrate): in-process migration α (legacy raw-SQL hub.db → Ent) Upgrade a legacy raw-SQL Hub database (the ~53-migration, 30-table schema from the removed pkg/store/sqlite store) to the consolidated Ent-backed SQLite schema, in-process on first boot, behind an automatic backup. pkg/ent/entc/migrate_alpha.go: - IsLegacyRawSQLSchema: detect via the schema_migrations sentinel + the legacy-only agents.agent_id column (no-op for an Ent/empty/absent file). - MigrateAlphaSQLite: backup (checkpoint WAL + copy to hub.db.bak.<ts>), AutoMigrate a fresh Ent schema, ATTACH the legacy file, copy every table with INSERT…SELECT (foreign_keys OFF), verify per-table row counts, then atomically swap the migrated file into place. - Data-driven column mapping (created_at→created, updated_at→updated, agents.agent_id→slug, policies→access_policies); bespoke SQL for the group_members/policy_bindings polymorphic splits and surrogate ids; groups.parent_id→group_child_groups edge. 
- Deterministic UUIDv5 remap for legacy non-UUID primary keys (internal signing-key secrets; plugin runtime-broker ids) with consistent rewrite of every foreign-key reference via a TEMP _id_remap table. - Tolerates missing legacy tables (older schema versions). cmd/server_foreground.go: detect + migrate in initStore's sqlite path, with a --no-auto-migrate operator opt-out (cmd/server.go). Validated end-to-end against four production hub.db files (scion-integration, -integration2, -demo, -gteam): exact row-count parity (up to ~19k rows), every entity reads back through the live Ent store, idempotent re-runs, and broker FK references resolve post-remap. Pre-existing dangling agent created_by/owner_id refs are faithfully preserved (loader runs FK-off). * fix(config): apply real Postgres pool size (leaked SQLite default of 1 starved the pool) The struct-level default for Database.MaxOpenConns/MaxIdleConns is 1 — the value SQLite REQUIRES to serialize writes. applyDatabasePoolDefaults only bumped postgres to a real pool when the value was <= 0, but a postgres deployment configured via env/driver override inherits the embedded default of 1, so the guard never fired and the Ent pool ran with a SINGLE connection. Effect in production (both integration hubs): every singleton scheduler tick checks out the lone pool connection to hold its advisory lock, then blocks waiting for a second connection to do its work — a self-deadlock that resolves only at the 55s handler context deadline. All API requests serialize behind the one connection, so GET /api/v1/* served in ~55s across the board. Note env overrides could not paper over this: envKeyToConfigKey splits on every underscore, so SCION_SERVER_DATABASE_MAX_OPEN_CONNS maps to database.max.open.conns, not database.max_open_conns — silently ignored. Treat the leaked SQLite default (<= 1) as 'unset' for postgres so the pool default (10) applies; explicit sizing of 2+ is still respected. SQLite remains pinned to 1. 
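The pool-size rule and the env-key mapping problem described in this commit can be sketched as follows. Both function names are hypothetical stand-ins (the real logic lives in applyDatabasePoolDefaults and envKeyToConfigKey, whose exact signatures are not shown here), and the prefix handling is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	sqlitePoolSize          = 1  // SQLite needs a single writer connection
	defaultPostgresPoolSize = 10 // pool default named in the commit message
)

// effectiveMaxOpenConns is a hypothetical helper showing the three cases:
// SQLite stays pinned to 1, Postgres treats <= 1 as "unset", and an explicit
// Postgres size of 2 or more is respected.
func effectiveMaxOpenConns(driver string, configured int) int {
	switch driver {
	case "sqlite":
		return sqlitePoolSize
	case "postgres":
		if configured <= 1 {
			return defaultPostgresPoolSize
		}
		return configured
	default:
		return configured
	}
}

// naiveEnvKeyToConfigKey mimics the behaviour the commit describes: every
// underscore becomes a dot, so multi-word keys no longer match their
// snake_case config names.
func naiveEnvKeyToConfigKey(env, prefix string) string {
	trimmed := strings.TrimPrefix(env, prefix)
	return strings.ToLower(strings.ReplaceAll(trimmed, "_", "."))
}

func main() {
	fmt.Println(effectiveMaxOpenConns("postgres", 1)) // leaked SQLite default
	fmt.Println(effectiveMaxOpenConns("postgres", 4)) // explicit sizing
	fmt.Println(effectiveMaxOpenConns("sqlite", 8))   // always pinned
	fmt.Println(naiveEnvKeyToConfigKey("SCION_SERVER_DATABASE_MAX_OPEN_CONNS", "SCION_SERVER_"))
}
```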
Adds regression tests for all three cases. * feat(hub): per-process instanceID on Server (B1-1) Add a unique per-process instanceID to Server, generated at construction via uuid.NewString(). Optionally prefixed with POD_NAME env var for log readability, but uniqueness is always guaranteed by the UUID. This ID serves as the affinity key for broker dispatch (design §4.1) and is intentionally distinct from config.ResolveHubID, which is shareable across replicas. * feat(schema): affinity columns on runtime_brokers (B1-2) Add 3 nullable fields to the runtime_brokers ent schema and store model for tracking which hub instance holds the control-channel socket: - connected_hub_id (TEXT, optional/nullable) - connected_session_id (TEXT, optional/nullable) - connected_at (TIMESTAMPTZ, optional/nullable) Dialect-neutral (no Postgres-only annotations) — AutoMigrate works on both SQLite and CloudSQL Postgres per postgres-strategy.md §6.4. Wire the fields through the ent<->store conversion code in both directions (entBrokerToStore, CreateRuntimeBroker, UpdateRuntimeBroker). Regenerated ent code included. * feat(store): Claim/Release runtime-broker affinity CAS methods (B1-3) Mirrors UpdateRuntimeBrokerHeartbeat's lock_version CAS loop. - ClaimRuntimeBrokerConnection: newest-wins, sets affinity + status=online + heartbeat in one write - ReleaseRuntimeBrokerConnection: compare-and-clear, returns cleared=false (no-op) if affinity moved (disconnect-race fix) Tests cover claim/overwrite/clear/no-op + A->B flap (design 9.4). * fix(hub): thread sessionID through connect + fix onDisconnect clobber race (B1-4, B1-5) B1-4: HandleUpgrade returns sessionID; markBrokerOnline(brokerID, sessionID) now calls ClaimRuntimeBrokerConnection(brokerID, instanceID, sessionID), recording affinity + online + heartbeat in one CAS write. 
B1-5: SetOnDisconnect callback gains sessionID; the handler compare-and-clears via ReleaseRuntimeBrokerConnection and skips the offline stamp when affinity has moved (flap). removeConnection now only removes/fires for the matching session, so an old connection's teardown can't drop a newer live socket. * feat(schema): broker_dispatch intent table + messages dispatch-state (B2-1, B2-2) B2-1: new BrokerDispatch ent entity (table broker_dispatch) — id, broker_id, agent_id(null), agent_slug, project_id(null), op, args(JSON), state, result, claimed_by, attempts, error, created_at/updated_at, deadline_at(null); index (broker_id,state). store.BrokerDispatch model + state constants. B2-2: messages.dispatch_state (default 'pending') + dispatched_at; wired through store.Message + entadapter conversion/create. Dialect-neutral. * feat(hub): PostgresCommandBus LISTEN/NOTIFY signal listener on scion_broker_cmd (B2-4) Introduce a CommandBus interface and PostgresCommandBus implementation that listens on the new global channel scion_broker_cmd for broker dispatch wakeup signals. This is a sibling of PostgresEventPublisher, reusing the same connect/reconnect/keepalive helpers but maintaining its own independent pgx connection and pool (design §5.1). Key components: - PostgresCommandBus: LISTEN loop with backoff-reconnect on its own dedicated connection; filters signals by local broker ownership via an injected ownsLocally func (wired to ControlChannelManager.IsConnected); invokes an injected onSignal reconcile callback (to be wired to the reconcile drain in B2-5). - NotifyBrokerCmd: issues NOTIFY inside the caller's transaction so the signal commits atomically with the durable intent row (mirrors PublishTx). - NoopCommandBus: safe no-op for the SQLite backend (single-process, all brokers are local). - Backend selection in newCommandBus mirrors newEventPublisher: Postgres driver → PostgresCommandBus; otherwise → NoopCommandBus. 
- Server.SetCommandBus/CommandBus() setter/getter; cleanup in both Shutdown and CleanupResources paths. * feat(store): BrokerDispatch store methods + message dispatch CAS (B2-3) BrokerDispatchStore: Insert/Claim(CAS pending->in_progress)/Complete/Fail/ ListPendingDispatch + MarkMessageDispatched(CAS)/ListPendingMessages (via agent runtime_broker_id). Wired into CompositeStore + store.Store. Tests: concurrent claim single-winner (exactly-once), drain pending-only, message CAS dedupe, complete/fail transitions, pending-messages-by-broker-agent. * feat(hub): reconcile-on-connect drain wired to bus + markBrokerOnline (B2-5) Server.reconcileBroker drains pending broker_dispatch rows (CAS-claim -> exec -> done/fail) and pending messages (CAS MarkMessageDispatched -> deliver) for a broker this node owns. Exactly-once via store CAS; idempotent + concurrent-safe. Wired as durability backstop into markBrokerOnline (async on reconnect) and as the command-bus signal handler (SetOnSignal -> ReconcileBroker). Op executors are seams (executeDispatch/deliverMessage) that Phase 3/4 fill with local tunnel ops. * feat(hub): route() decision in HybridBrokerClient (B3-1) routeLocal (IsConnected, unchanged fast path) | routeForward (affinity owner alive) | routeHTTP (broker endpoint set) | routeUndeliverable. Affinity is a hint only (StoreAffinityLookup over connected_hub_id + last_heartbeat freshness), injectable for testing. Not yet wired into dispatch (B3-2 wires message path). Table-driven tests over all branches incl. local-precedence + nil-affinity. * feat(hub): cross-node message dispatch via route()+intent+signal+owner drain (B3-2, B3-3) Route-gate the message send path: HybridBrokerClient.MessageAgent now uses route(brokerID, endpoint) to decide delivery. routeLocal and routeHTTP follow existing paths unchanged. routeForward/routeUndeliverable return ErrMessageDeferred — the message row (already persisted with dispatch_state=pending) is the durable intent. 
All call sites (handleAgentMessage, set[], broadcastDirect, messagebroker, notifications, scheduler) catch the sentinel, emit a best-effort NOTIFY wakeup via SignalBrokerCmd, and return 202 Accepted (or log as deferred). Fill the deliverMessage seam in reconcile.go: resolves the agent from the message's AgentID, obtains the dispatcher, and calls DispatchAgentMessage for local tunnel delivery. reconcileBroker already CAS-marks dispatched before calling this. Wire SetAffinityLookup(StoreAffinityLookup(store, 0)) on the HybridBrokerClient in CreateAuthenticatedDispatcher so route() can return routeForward when another node owns the broker. Add SignalBrokerCmd to the CommandBus interface — a best-effort NOTIFY using the bus's own pool, used by the message path where the durable intent is the message row itself and the NOTIFY is only a wakeup hint. * feat(hub): lifecycle dispatch (rolling-timeout wait + cross-node start/stop/restart) (B4-1, B4-2) B4-1: Rolling-timeout wait helper (dispatch_wait.go) - waitForAgentTransition subscribes to agent.<id>.status events and loops with a rolling window (dispatchRollingTimeout=90s) that resets on ANY AgentStatusEvent (phase/activity/detail change). - Terminal phase → return phase, nil. Window expiry → ErrDispatchFailed. Context cancellation → ctx.Err(). - Caller subscribes BEFORE writing intent, passes the channel + unsub. B4-2: Cross-node start/stop/restart dispatch - Route-gated HybridBrokerClient.StartAgent/StopAgent/RestartAgent exactly like MessageAgent: routeLocal → control-channel tunnel (unchanged fast path), routeHTTP → HTTP fallback, routeForward/routeUndeliverable → ErrLifecycleDeferred. - Dispatch args structs (dispatch_args.go): StartDispatchArgs captures task, resolvedEnv, resolvedSecrets, inlineConfig, sharedDirs, sharedWorkspace, projectPath, projectSlug, harnessConfig. RestartDispatchArgs captures resolvedEnv. StopDispatchArgs is empty. All JSON-serializable for broker_dispatch.args column. 
- Owner-side executeDispatch (reconcile.go): start/stop/restart cases deserialize args, load agent from store, call local DispatchAgentStart/Stop/Restart via the dispatcher. Unknown ops (delete, finalize_env, etc.) still fail cleanly for B4-3/B4-4. Tests: waitForAgentTransition (terminal, error, rolling reset, silence expiry, context cancel, unsub); route-gating of Start/Stop/Restart returns ErrLifecycleDeferred when non-local; executeDispatch lifecycle cases invoke the local dispatcher; args round-trip (serialize→deserialize) is lossless; reconcile end-to-end lifecycle path. * feat(hub): wire originator-side cross-node lifecycle dispatch (B4-2 complete) The originator-side orchestration was missing: ErrLifecycleDeferred was returned by HybridBrokerClient but nothing caught it. Now the full cross-node start/stop/restart flow works transparently to all handler call sites. Originator side (HTTPAgentDispatcher): - DispatchAgentStart/Stop/Restart catch ErrLifecycleDeferred after env/secret resolution and invoke deferredLifecycle: 1. Subscribe("agent.<id>.status") BEFORE writing intent 2. InsertBrokerDispatch{op, agent_id, broker_id, args} 3. Best-effort SignalBrokerCmd (row is durable backstop) 4. waitForAgentTransition with terminal set per op 5. Return nil on success, error on error-phase/timeout - SetCrossNodeDeps(events, commandBus) wired in server.go's getOrCreateDispatcher, so all handler call sites get cross-node for free with synchronous semantics preserved. - Local path (routeLocal) is unchanged at zero added latency — no subscribe, no intent row, no wait. Args decision: owner RE-RESOLVES env/secrets via DispatchAgentStart (all hub instances share the same store + secret backend), so StartDispatchArgs carries only {Task}. RestartDispatchArgs and StopDispatchArgs are empty. This avoids serializing potentially large env/secrets into the DB while remaining correct because all hubs read from the same shared store. 
waitForAgentTransition refactored to a standalone function (no Server receiver) so the dispatcher can call it directly. Tests: - TestDeferredStart_WritesIntentAndWaits: deferred start writes a broker_dispatch row, waits, returns success on "running" event - TestDeferredStart_ReturnsErrorOnErrorPhase: error phase → error - TestLocalStart_SkipsIntentRow: local path calls tunnel directly, no intent row written - All existing tests pass (no regressions) * fix(hub): make web session replica-portable to fix OAuth state_mismatch OAuth login behind the load balancer intermittently failed with state_mismatch: the CSRF state token (and the entire web session) was stored in a gorilla FilesystemStore on the handling replica's local disk, while the browser only carried a session-ID cookie. When the LB routed /auth/login and /auth/callback to different replicas, the callback replica had no matching session file -> empty state -> state_mismatch. It only "worked" when both hops happened to hit the same backend. The same flaw affected the post-login session: sessionToBearerMiddleware reads the Hub access/refresh JWTs from that disk-local store on every API request, so sessions silently dropped whenever a follow-up request landed on a different replica. Replace the FilesystemStore with an encrypted, signed gorilla CookieStore so the whole session lives in the client's cookie and any replica sharing SESSION_SECRET can read it. Keys are derived deterministically from SESSION_SECRET (32-byte HMAC auth key + 32-byte AES-256 encryption key, domain-separated). No DB, no migration; works with N replicas. The original switch to disk was motivated by a "JWT tokens exceed 4096 bytes" concern. Measured against the current compact HS256 tokens the full session (identity + access + refresh) encodes to ~2.6 KB, well under the browser's ~4 KB per-cookie cap, so the securecookie length limit is left in force (oversize would now error+log, not silently drop). 
Tests: replace the obsolete NoMaxLengthLimit test with a cross-replica round-trip regression test (cookie minted by replica A decodes on replica B with the same secret; carries OAuth state + post-login tokens) plus a negative test (a different secret cannot decode the cookie). * feat(hub): cross-node delete + create-time data ops dispatch (B4-3, B4-4) Route-gate HybridBrokerClient.DeleteAgent, CheckAgentPrompt, CreateAgentWithGather, and FinalizeEnv through route() so routeForward/routeUndeliverable return ErrLifecycleDeferred (matching start/stop/restart pattern from B4-2). B4-3 (delete dispatch): - deferredDelete on ErrLifecycleDeferred: subscribe broker.dispatch.<id>.done → InsertBrokerDispatch{op:delete} → SignalBrokerCmd → waitForDispatchDone (reads DB row, authoritative). - Owner executeDispatch case "delete": deserializes DeleteDispatchArgs → local DispatchAgentDelete (idempotent, 404 ok). - DeleteDispatchArgs struct + UnmarshalDeleteArgs for args round-trip. B4-4 (create-time data ops): - deferredDataOp/deferredDataOpResult: common originator flow for ops that return results via the dispatch row (design §6.3). Subscribe to broker.dispatch.<id>.done BEFORE writing intent, insert dispatch, signal, waitForDispatchDone, read result from GetBrokerDispatch. - deferredCheckPrompt: returns bool from CheckPromptResult in row. - deferredFinalizeEnv: fire-and-forget via deferredDataOp. - deferredCreateWithGather: returns envRequirements from row result. - Owner executeDispatch cases: check_prompt, finalize_env, create — run local op, marshal result JSON, return it. - PublishDispatchDone on EventPublisher: slim completion event broker.dispatch.<id>.done emitted by reconcile loop on complete/fail. - waitForDispatchDone: event-driven wait with bounded re-read at rolling timeout (missed event recovery, design §6.3). - GetBrokerDispatch added to BrokerDispatchStore interface + entadapter. Local fast path unchanged (routeLocal → zero added latency). 
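The rolling-timeout waits used by the deferred lifecycle and data-op paths can be sketched as a loop whose timer resets on every non-terminal event. This is an illustration only: waitForTerminal, runDemo, and the string-typed events are hypothetical simplifications of waitForAgentTransition and its AgentStatusEvent stream, and the caller is assumed to have subscribed before writing the intent row.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errWindowExpired = errors.New("no status event within the rolling window")

// waitForTerminal returns the first terminal phase seen on events. Any other
// event resets the window; silence for a full window is an error.
func waitForTerminal(ctx context.Context, events <-chan string, window time.Duration, terminal map[string]bool) (string, error) {
	timer := time.NewTimer(window)
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case <-timer.C:
			return "", errWindowExpired
		case phase := <-events:
			if terminal[phase] {
				return phase, nil
			}
			// Progress was observed: drain the timer if it already fired, then restart it.
			if !timer.Stop() {
				select {
				case <-timer.C:
				default:
				}
			}
			timer.Reset(window)
		}
	}
}

// runDemo preloads a channel with phases and reports the terminal phase and
// whether the wait succeeded.
func runDemo(phases []string, windowMillis int) (string, bool) {
	events := make(chan string, len(phases))
	for _, p := range phases {
		events <- p
	}
	terminal := map[string]bool{"running": true, "error": true}
	window := time.Duration(windowMillis) * time.Millisecond
	phase, err := waitForTerminal(context.Background(), events, window, terminal)
	return phase, err == nil
}

func main() {
	fmt.Println(runDemo([]string{"provisioning", "starting", "running"}, 50))
	fmt.Println(runDemo([]string{"provisioning"}, 20))
}
```

Subscribing before the intent row is written matters because an owner node may finish the work before a late subscriber starts listening, in which case only the window expiry (or a re-read of the row) would reveal the outcome.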
* feat(hub): stale-affinity + stuck-dispatch reaper singleton (B5-1)

* feat(hub): pending-message sweep + dispatch metrics (B5-2)
Add observability for the multi-node broker dispatch pipeline:

Sweep:
- CountStuckPendingMessages store method (messages pending > threshold)
- brokerMessageSweepHandler registered as RecurringSingleton with LockBrokerMessageSweep (0x5C100007), runs every 1m

Metrics (pkg/observability/dispatchmetrics):
- Counters: dispatch published/claimed/done/failed, message dispatched
- Gauge: message stuck (pending beyond 5m threshold)
- Histograms: intent-to-done latency, reconcile drain duration
- Counter: command bus reconnects

Emit sites:
- InsertBrokerDispatch → IncPublished (httpdispatcher.go)
- ClaimBrokerDispatch → IncClaimed (reconcile.go)
- CompleteBrokerDispatch → IncDone + RecordDispatchLatency (reconcile.go)
- FailBrokerDispatch → IncFailed (reconcile.go)
- MarkMessageDispatched → IncMessageDispatched (reconcile.go)
- reconcileBroker → RecordReconcileDrainDuration (reconcile.go)
- command bus reconnect → IncCmdBusReconnects (command_bus.go)
- sweep handler → ObserveMessageStuck (sweep.go)

* fix(hub): derive JWT signing keys from shared SESSION_SECRET to fix cross-replica login loop
The cookie-store fix (0515e2a8) made the web session replica-portable, but the Hub JWT *inside* the cookie is still signed with a per-replica key: ensureSigningKey scopes signing keys to (scope=hub, scope_id=hubID) and hubID = sha256(hostname)[:12]. The integration env runs two replicas of one logical hub behind a single LB, sharing one Postgres DB and one SESSION_SECRET but with different hostnames -> different hubIDs -> different HS256 signing keys.
So a user JWT minted on replica A failed signature verification on replica B (go-jose: error in cryptographic primitive); refresh failed too (refresh token signed with the same foreign key), so sessionToBearerMiddleware declared the session irrecoverably invalid, DELETED the cookie (MaxAge=-1) and returned session_expired. The cookie deletion turns it into a redirect loop: dashboard flashes, then /login?error=session_expired. Fix: extend the 0515e2a8 approach (replica-portable via the shared secret) from the cookie to the keys inside it. Add ServerConfig.SharedSigningSecret; when set, ensureSigningKey derives the agent and user signing keys deterministically from it (domain-separated by key name) and bypasses per-host secret-backend storage. cmd feeds the same --session-secret / SESSION_SECRET value into both the web cookie store and the hub config via a new resolveSessionSecret() helper. Empty secret keeps the existing per-hub behavior (no regression for single-node/local dev). Tests: cross-replica round trip (different hubID + same secret -> identical keys, token minted on A validates on B; different secret cannot) plus pre-configured-key precedence. Note: rollout rotates the signing keys (now derived from SESSION_SECRET), so existing web/CLI tokens are invalidated once and users re-login. 
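Deriving per-purpose keys from one shared secret, domain-separated by key name, can be sketched as below. The commit message does not say which derivation function ensureSigningKey uses, so HMAC-SHA256 over a labelled key name is an assumption here, as are the function names and the "signing-key:" label.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// deriveKey returns a 32-byte key that depends on both the shared secret and
// the key name, so the "agent" and "user" keys differ even though they come
// from the same secret. (Assumed construction, not the project's code.)
func deriveKey(secret, name string) []byte {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte("signing-key:" + name))
	return mac.Sum(nil)
}

// deriveKeyHex is a convenience wrapper for printing and comparing keys.
func deriveKeyHex(secret, name string) string {
	return hex.EncodeToString(deriveKey(secret, name))
}

func main() {
	// Two replicas with different hostnames but the same secret agree on keys.
	replicaA := deriveKeyHex("example-session-secret", "user")
	replicaB := deriveKeyHex("example-session-secret", "user")
	fmt.Println("replicas agree:", replicaA == replicaB)

	// Different key names, or a different secret, give unrelated keys.
	fmt.Println("agent differs:", deriveKeyHex("example-session-secret", "agent") != replicaA)
	fmt.Println("other secret differs:", deriveKeyHex("another-secret", "user") != replicaA)
}
```

Because the derivation is deterministic, no replica needs to store or exchange the keys; rotating the shared secret rotates every derived key at once.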
* docs: project log for B5-3 chaos gate - GB5 PASSED (GA gate for broker dispatch)

* fix(hub): align fakeHTTPClient.CleanupProject with interface (3 params, not 4)

* fix(hub): address PR #305 review feedback
- server_migrate.go: use nil-checked deferred close for src DB, and explicitly close src before dropSQLiteFile to prevent Windows sharing violations
- server_migrate.go: handle file:// prefix before file: to correctly parse file:///path/to/db URLs
- server_foreground.go: evaluate GetControlChannelManager() inside the ownsLocally closure to avoid capturing a stale nil value
- server_migrate_test.go: add test case for file:/// URL format
- server_test.go: sanitize t.Name() slashes in newTestStore to prevent SQLite path errors in subtests

* docs: add project log for PR #305 review feedback fixes

* fix(hub): prevent duplicate message delivery, guard dispatch state transitions
C1: Call MarkMessageDispatched after successful local dispatch in messagebroker.go and handlers.go (single-recipient, set[], broadcast). Without this, successfully dispatched messages remained dispatch_state=pending and were re-delivered on every broker reconnect via reconcileBroker.
C2: Return immediately in messagebroker.go deliverToAgent when CreateMessage fails: without a durable row, a deferred signal has nothing for the owning node to reconcile.
C3: Guard CompleteBrokerDispatch and FailBrokerDispatch with state=in_progress CAS predicate so a done dispatch cannot be flipped to failed or vice versa. Update tests to claim before completing/failing to match the new CAS guard.
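The C3 guard amounts to making every state change a compare-and-set on the current state. The sketch below uses an in-memory map behind a mutex as a stand-in for the conditional UPDATE a real store would issue; the type and method names are hypothetical and only the state names (pending, in_progress, done, failed) come from the commits above.

```go
package main

import (
	"fmt"
	"sync"
)

// dispatchStore is a toy stand-in for the broker_dispatch table.
type dispatchStore struct {
	mu    sync.Mutex
	state map[string]string
}

func newDispatchStore() *dispatchStore {
	return &dispatchStore{state: make(map[string]string)}
}

// insert records a new dispatch in the pending state.
func (s *dispatchStore) insert(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.state[id] = "pending"
}

// transition moves id from one state to another only if it is currently in
// the expected state, and reports whether the write happened.
func (s *dispatchStore) transition(id, from, to string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.state[id] != from {
		return false
	}
	s.state[id] = to
	return true
}

func (s *dispatchStore) claim(id string) bool    { return s.transition(id, "pending", "in_progress") }
func (s *dispatchStore) complete(id string) bool { return s.transition(id, "in_progress", "done") }
func (s *dispatchStore) fail(id string) bool     { return s.transition(id, "in_progress", "failed") }

func main() {
	s := newDispatchStore()
	s.insert("d1")
	fmt.Println("complete before claim:", s.complete("d1"))
	fmt.Println("first claim:", s.claim("d1"))
	fmt.Println("second claim:", s.claim("d1"))
	fmt.Println("complete:", s.complete("d1"))
	fmt.Println("fail after done:", s.fail("d1"))
}
```

With this shape, a test has to claim a dispatch before completing or failing it, which matches the test update mentioned in C3.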
* fix(hub): reconcile broker→eventbus and hub-native→hub-managed renames after rebase Post-rebase fixups to align the feature branch with main's refactoring: - broker package → eventbus package rename (types, imports, methods) - SetRecipient → GroupRecipient, SetMessageResponse → GroupMessageResponse - hubNativeProjectPath → hubManagedProjectPath - ProjectTypeHubNative → ProjectTypeHubManaged - populateAgentConfig gains ctx parameter - Add missing handleResourcesImport and handleMessageChannels handlers - Add ListChannels method to MessageBrokerProxy - Wire newCommandBus in server_foreground.go - Restore main's test fixtures for renamed APIs --------- Co-authored-by: scion-gteam[bot] <271067763+scion-gteam[bot]@users.noreply.github.com> Co-authored-by: Scion <agent@scion.dev>
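The C3 guard described above (complete or fail only from state=in_progress) can be sketched as a compare-and-set over an in-memory map. This is a minimal illustration, not the hub's store code: the type and method names below are hypothetical, and the real change presumably applies the predicate in a database write rather than under a mutex.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// dispatchState and its values mirror the states named in the commit
// message; the Go identifiers themselves are illustrative assumptions.
type dispatchState string

const (
	statePending    dispatchState = "pending"
	stateInProgress dispatchState = "in_progress"
	stateDone       dispatchState = "done"
	stateFailed     dispatchState = "failed"
)

var errStateConflict = errors.New("dispatch is not in the expected state")

// dispatchStore is a hypothetical in-memory stand-in for the durable store.
type dispatchStore struct {
	mu     sync.Mutex
	states map[string]dispatchState
}

func newDispatchStore() *dispatchStore {
	return &dispatchStore{states: make(map[string]dispatchState)}
}

func (s *dispatchStore) enqueue(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.states[id] = statePending
}

// compareAndSet moves id from one state to another only if the current
// state still equals from. Unknown ids never match and are rejected.
func (s *dispatchStore) compareAndSet(id string, from, to dispatchState) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.states[id] != from {
		return errStateConflict
	}
	s.states[id] = to
	return nil
}

func (s *dispatchStore) claim(id string) error {
	return s.compareAndSet(id, statePending, stateInProgress)
}

func (s *dispatchStore) complete(id string) error {
	return s.compareAndSet(id, stateInProgress, stateDone)
}

func (s *dispatchStore) fail(id string) error {
	return s.compareAndSet(id, stateInProgress, stateFailed)
}

func main() {
	s := newDispatchStore()
	s.enqueue("msg-1")
	fmt.Println("complete before claim:", s.complete("msg-1"))
	fmt.Println("claim:", s.claim("msg-1"))
	fmt.Println("complete:", s.complete("msg-1"))
	fmt.Println("fail after done:", s.fail("msg-1"))
}
```

Because every terminal transition requires in_progress as its starting state, callers (and tests) have to claim a dispatch before completing or failing it, which matches the test update mentioned in the commit.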

…A Docker + Model B GKE) (GoogleCloudPlatform#306)

…GoogleCloudPlatform#303) * fix: atomic session-guarded broker disconnect to prevent reconnect race (GoogleCloudPlatform#131) The onDisconnect callback previously used separate ReleaseRuntimeBrokerConnection and UpdateRuntimeBrokerHeartbeat calls. When a broker disconnects and reconnects rapidly, the stale disconnect's offline stamp can clobber the new connection's online status because UpdateRuntimeBrokerHeartbeat has no session guard — it unconditionally overwrites status. Provider statuses are also clobbered and never restored by heartbeats, leaving the broker permanently invisible until hub restart. Add ReleaseAndMarkBrokerOffline which atomically clears affinity AND stamps status=offline in a single CAS write. If a concurrent reconnect has already claimed the broker with a new session, the compare fails and the callback is a no-op. Also add a re-check guard before updating provider statuses. * docs: add project log for broker disconnect race fix unification
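A rough sketch of the session guard described above, assuming an in-memory registry in place of the real store. The names brokerRegistry, connect, and releaseAndMarkOffline are hypothetical stand-ins; only the idea (clear affinity and stamp offline in one compare-and-set keyed on the session) comes from the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// brokerRecord is a hypothetical, simplified view of a broker row.
type brokerRecord struct {
	sessionID string
	status    string
}

type brokerRegistry struct {
	mu      sync.Mutex
	brokers map[string]*brokerRecord
}

func newBrokerRegistry() *brokerRegistry {
	return &brokerRegistry{brokers: make(map[string]*brokerRecord)}
}

// connect claims the broker for a new session and marks it online.
func (r *brokerRegistry) connect(brokerID, sessionID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.brokers[brokerID] = &brokerRecord{sessionID: sessionID, status: "online"}
}

// releaseAndMarkOffline clears the session and stamps offline in one step,
// but only if the broker is still owned by the disconnecting session.
// It reports whether the write was applied.
func (r *brokerRegistry) releaseAndMarkOffline(brokerID, sessionID string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	rec, ok := r.brokers[brokerID]
	if !ok || rec.sessionID != sessionID {
		return false // a newer session owns the broker: stale disconnect is a no-op
	}
	rec.sessionID = ""
	rec.status = "offline"
	return true
}

func (r *brokerRegistry) status(brokerID string) string {
	r.mu.Lock()
	defer r.mu.Unlock()
	if rec, ok := r.brokers[brokerID]; ok {
		return rec.status
	}
	return ""
}

func main() {
	r := newBrokerRegistry()
	r.connect("broker-1", "session-a")
	r.connect("broker-1", "session-b") // rapid reconnect

	// The disconnect callback for the old session fires late.
	fmt.Println("stale disconnect applied:", r.releaseAndMarkOffline("broker-1", "session-a"))
	fmt.Println("status:", r.status("broker-1"))

	fmt.Println("current disconnect applied:", r.releaseAndMarkOffline("broker-1", "session-b"))
	fmt.Println("status:", r.status("broker-1"))
}
```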

…rm#301) * docs(design): reduced resource clone/delete design (resolved review) * refactor: remove dead Locked field from Template and HarnessConfig models Remove the Locked bool field, all 16 enforcement sites across 6 handler files, the force query parameter from delete endpoints, 3 locked-template tests, and add a DB migration to drop the column. No production code ever set Locked=true — this simplifies the handlers for the upcoming clone/delete feature. * feat: add harness-config clone endpoint, authz hardening, and slug uniqueness - Add handleHarnessConfigClone mirroring template clone - Add CheckAccess authz to deleteTemplateV2, handleTemplateClone, deleteHarnessConfig, handleHarnessConfigClone - Add DB migration V55: UNIQUE constraint on (slug, scope, scope_id) - Return 409 Conflict on slug collision during clone - Add clone failure cleanup - Add tests for clone, authz, and slug collision * feat(web): add Clone/Delete row actions and clone-from-global to resource list - Add Clone and Delete action menu to shared resource-list component - Add delete confirmation dialog with deleteFiles checkbox (default on) - Add clone dialog with name input and 409 collision handling - Add clone-from-global picker in project settings view - Unify on resource-changed event (migrate resource-imported) - Gate actions on capabilities (canClone, canDelete properties) * fix: address PR review — cleanup orphaned files on DB create failure, remove redundant clone method - Add stor.DeletePrefix cleanup when CreateTemplate/CreateHarnessConfig fails after files were already copied (prevents orphaned storage files) - Remove redundant confirmCloneFromGlobal method — confirmClone already handles cross-scope clone via the component's scope/scopeId properties * fix: adapt Locked removal and slug constraint to Ent-based schema Remove Locked references from entadapter, remove stale sqlite.go (replaced by Ent ORM upstream), add UNIQUE(slug, scope, scope_id) to Ent schema indexes, and 
regenerate Ent code. * fix: adapt tests and entadapter for Ent-based store (UUID IDs, no Locked) - Use api.NewUUID() for all test entity IDs (Ent enforces UUID format) - Remove Locked field from entadapter create/update calls - Remove stale sqlite.go (replaced by Ent ORM upstream) - Add UNIQUE(slug, scope, scope_id) to Ent schema indexes

…form#309) * fix(hub): make web session replica-portable to fix OAuth state_mismatch OAuth login behind the load balancer intermittently failed with state_mismatch: the CSRF state token (and the entire web session) was stored in a gorilla FilesystemStore on the handling replica's local disk, while the browser only carried a session-ID cookie. When the LB routed /auth/login and /auth/callback to different replicas, the callback replica had no matching session file -> empty state -> state_mismatch. It only "worked" when both hops happened to hit the same backend. The same flaw affected the post-login session: sessionToBearerMiddleware reads the Hub access/refresh JWTs from that disk-local store on every API request, so sessions silently dropped whenever a follow-up request landed on a different replica. Replace the FilesystemStore with an encrypted, signed gorilla CookieStore so the whole session lives in the client's cookie and any replica sharing SESSION_SECRET can read it. Keys are derived deterministically from SESSION_SECRET (32-byte HMAC auth key + 32-byte AES-256 encryption key, domain-separated). No DB, no migration; works with N replicas. The original switch to disk was motivated by a "JWT tokens exceed 4096 bytes" concern. Measured against the current compact HS256 tokens the full session (identity + access + refresh) encodes to ~2.6 KB, well under the browser's ~4 KB per-cookie cap, so the securecookie length limit is left in force (oversize would now error+log, not silently drop). Tests: replace the obsolete NoMaxLengthLimit test with a cross-replica round-trip regression test (cookie minted by replica A decodes on replica B with the same secret; carries OAuth state + post-login tokens) plus a negative test (a different secret cannot decode the cookie). 
* fix(hub): derive JWT signing keys from shared SESSION_SECRET to fix cross-replica login loop The cookie-store fix (0515e2a) made the web session replica-portable, but the Hub JWT *inside* the cookie is still signed with a per-replica key: ensureSigningKey scopes signing keys to (scope=hub, scope_id=hubID) and hubID = sha256(hostname)[:12]. The integration env runs two replicas of one logical hub behind a single LB, sharing one Postgres DB and one SESSION_SECRET but with different hostnames -> different hubIDs -> different HS256 signing keys. So a user JWT minted on replica A failed signature verification on replica B (go-jose: error in cryptographic primitive); refresh failed too (refresh token signed with the same foreign key), so sessionToBearerMiddleware declared the session irrecoverably invalid, DELETED the cookie (MaxAge=-1) and returned session_expired. The cookie deletion turns it into a redirect loop: dashboard flashes, then /login?error=session_expired. Fix: extend the 0515e2a approach (replica-portable via the shared secret) from the cookie to the keys inside it. Add ServerConfig.SharedSigningSecret; when set, ensureSigningKey derives the agent and user signing keys deterministically from it (domain-separated by key name) and bypasses per-host secret-backend storage. cmd feeds the same --session-secret / SESSION_SECRET value into both the web cookie store and the hub config via a new resolveSessionSecret() helper. Empty secret keeps the existing per-hub behavior (no regression for single-node/local dev). Tests: cross-replica round trip (different hubID + same secret -> identical keys, token minted on A validates on B; different secret cannot) plus pre-configured-key precedence. Note: rollout rotates the signing keys (now derived from SESSION_SECRET), so existing web/CLI tokens are invalidated once and users re-login. --------- Co-authored-by: Scion <agent@scion.dev>

…events (GoogleCloudPlatform#312)

A rapid session.start → session.end sequence from a spurious sciontool could permanently reset an agent's phase even while the agent works normally. This adds two guards:

1. Phase regression guard: rejects transitions that would move an agent backward in its forward-progress lifecycle (e.g. running → starting) in both the status update handler and broker heartbeat handler.
2. Activity-driven phase auto-correction: when an activity that implies the agent is running (working, thinking, executing, etc.) arrives but the phase is pre-running, auto-promotes the phase to running.

Fixes GoogleCloudPlatform#124
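The two guards can be sketched roughly as follows. The phase ladder here is an assumption: only "starting" and "running" are named in the commit, "created" is added for illustration, and the function names are hypothetical. Phases outside the ladder (for example stopped or error) are left unguarded in this sketch.

```go
package main

import "fmt"

// phaseRank orders the forward-progress phases. "created" is an assumed
// placeholder for whatever precedes "starting" in the real lifecycle.
var phaseRank = map[string]int{
	"created":  0,
	"starting": 1,
	"running":  2,
}

// runningActivities lists activities that imply the agent is running.
// The commit names these three and says "etc."; the set is not exhaustive.
var runningActivities = map[string]bool{
	"working":   true,
	"thinking":  true,
	"executing": true,
}

// allowTransition rejects moves backward along the forward-progress ladder.
func allowTransition(current, next string) bool {
	cur, okCur := phaseRank[current]
	nxt, okNext := phaseRank[next]
	if !okCur || !okNext {
		return true // not a forward-progress phase: not guarded here
	}
	return nxt >= cur
}

// phaseAfterActivity promotes a pre-running phase to running when the
// reported activity implies the agent is already doing work.
func phaseAfterActivity(phase, activity string) string {
	rank, ok := phaseRank[phase]
	if ok && rank < phaseRank["running"] && runningActivities[activity] {
		return "running"
	}
	return phase
}

func main() {
	fmt.Println("running -> starting allowed:", allowTransition("running", "starting"))
	fmt.Println("starting -> running allowed:", allowTransition("starting", "running"))
	fmt.Println("starting + thinking =>", phaseAfterActivity("starting", "thinking"))
	fmt.Println("running + thinking =>", phaseAfterActivity("running", "thinking"))
}
```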

…GoogleCloudPlatform#313) Also unset SCION_PROJECT_ID when clearing hub context env vars, since IsHubContext() checks all four env vars and a leftover SCION_PROJECT_ID causes FindProjectRoot() to return a synthetic path instead of failing.

…tform#311) * Fix agent list task overflow and unify action buttons Task cell in list view used inline span styling that silently ignored max-width/overflow constraints, allowing long task text to push action buttons off-screen. Switch to display:-webkit-box with line-clamp:2 so text wraps to at most two lines with ellipsis. Card view action buttons now render icon-only (matching list view), with sl-tooltip and aria-label for accessibility. Both views share a single renderActionButtons helper, eliminating the duplicated button logic. Color-coded hover effects added to action buttons in both views: red for stop/delete, amber for suspend, green for resume/start. Closes GoogleCloudPlatform#134 Closes GoogleCloudPlatform#135 * Fix agent list task overflow and unify action buttons Task cell in list view used inline span styling that silently ignored max-width/overflow constraints, allowing long task text to push action buttons off-screen. Switch to display:-webkit-box with line-clamp:2 so text wraps to at most two lines with ellipsis. Card view action buttons now render icon-only (matching list view), with sl-tooltip and aria-label for accessibility. Both views share a single renderActionButtons helper, eliminating the duplicated button logic. Color-coded hover effects use translucent rgba backgrounds that work in both light and dark mode: red for stop/delete, amber for suspend, green for resume/start. Closes GoogleCloudPlatform#134 Closes GoogleCloudPlatform#135 * Add before/after screenshots for PR review Screenshots captured from the real running app (Vite dev server + fetch mock for agent data). Shows before/after for both issues in light mode and dark mode. * Fix hover on disabled buttons and tooltip on disabled terminal Add :not([disabled]) to hover CSS selectors so color-coded hover effects don't apply to disabled action buttons. 
Wrap the Terminal button in an inline-flex span inside sl-tooltip so the tooltip remains accessible even when the button has pointer-events:none.

* docs(design): auth proxy mode (Google IAP) architecture Add design for an exclusive proxy human-auth mode that derives the user from a verified Google IAP signed header (X-Goog-IAP-JWT-Assertion), reusing the existing domain/allowlist/admin provisioning controls. Also specifies a hub-minted transport-auth layer (dedicated SA, generalizing PR GoogleCloudPlatform#307) so agents can traverse the IAP / Cloud Run-invoker front door, with a generalized array-based token refresh. * refactor(hub): extract provisionUser, dedupe OAuth find-or-create Extract the duplicated find-or-create-user block from four OAuth handlers (handleAuthLogin, handleAuthToken, handleCLIAuthToken, completeOAuthLogin) into a single provisionUser method on Server. The new method encapsulates: 1. Authorization check (isUserAuthorized) with audit logging 2. GetUserByEmail / CreateUser (find-or-create) 3. Profile backfill (DisplayName, AvatarURL when empty) 4. Admin promotion (when admin list changes) 5. Hub membership enrollment (ensureHubMembership) Introduces ExternalUserInfo struct (decoupled from OAuthUserInfo) and ErrAccessDenied sentinel error for caller-side HTTP response mapping. This is Phase 0 of the auth-proxy-mode feature — pure refactor with no behavior change. The proxy middleware (Phase 1) will call the same provisionUser method. NOTE: No suspended-user check is added. The existing OAuth flow does not check user.Status == "suspended" either; adding it here would change behavior. This gap is documented for Phase 1. * docs(project-log): record provisionUser extraction findings * feat(auth): implement proxy auth mode with IAP JWT verification (Phase 1) Add exclusive proxy auth mode for Google IAP signed-header authentication: - pkg/hub/proxyauth.go (NEW): ProxyAuthenticator interface, IAPAuthenticator with ES256 JWT verification via go-jose/v4, JWKS lazy-fetch cache with periodic refresh + on-miss refresh for unknown kids + transient failure tolerance (last-good keys). 
- pkg/config: auth.mode selector (oauth|proxy|dev), auth.proxy section with provider/iap.audience/overrides in both DevAuthConfig (GlobalConfig) and V1AuthConfig (settings.yaml). Wire conversion in both directions. - pkg/hub/auth.go: Replace IP-only extractProxyUser branch with ProxyAuthenticator path. Add 60s resolution cache (ProxyUserCache) wrapping provisionUser — signature verification runs every request, only the store lookup is cached. Legacy extractProxyUser preserved when no authenticator is configured. - pkg/hub/handlers_auth.go: Add suspended-user gate to provisionUser — rejects Status=="suspended" with ErrUserSuspended. This is an intentional behavior change sanctioned by the design doc, closing the pre-existing OAuth suspended-login gap documented in Phase 0. - pkg/hub/web.go: In proxy mode, handleAuthProviders returns no OAuth providers; handleLogout redirects to IAP's clear_login_cookie endpoint. - cmd/server_foreground.go: Construct IAPAuthenticator when mode==proxy && provider==iap, wire into ServerConfig.ProxyAuth. Security: audience binding is mandatory; only the signed JWT assertion is authoritative (X-Goog-Authenticated-User-* headers ignored); clock skew ±30s; JWKS cache handles key rotation and transient fetch failures. 
* test(auth): add comprehensive IAPAuthenticator unit tests Tests using self-generated ES256 key pair + httptest JWKS server: - Valid assertion -> correct ProxyUserInfo (subject/email stripped, lowercased) - Bad signature -> error - Wrong audience -> error (mandatory binding) - Wrong issuer -> error - Expired token (past 30s skew) -> error - Missing header -> (nil, nil) fall-through - Unknown kid triggers JWKS refresh and succeeds - Custom issuer override for testing - HD (hosted domain) claim extraction - Email lowercasing - JWKS cache transient failure tolerance (serves last-good keys) * style: fix gofmt formatting in proxyauth_test.go and settings_v1.go * docs(project-log): record auth-proxy-mode Phase 1 implementation * config: add auth.transport config for outbound transport auth Add TransportAuthConfig (hub_config.go) and V1TransportConfig (settings_v1.go) for the transport-layer auth that lets agents traverse IAP / Cloud Run invoker front doors. Config supports mode (none|cloudrun_invoker|iap), oidcAudience, and platformAuthSA fields. Wire into V1↔GlobalConfig conversion and env key mapping. Phase 2 item 6 of auth-proxy-mode. * hub: add TransportTokenMinter interface and implementations Introduce the TransportTokenMinter interface for minting Google OIDC ID tokens that let agents traverse platform guards (IAP / Cloud Run invoker). Three implementations: - gcpTransportMinter: production impl using IAM Credentials API (generateIdToken) to impersonate a dedicated platform-auth SA. Uses already-vendored google.golang.org/api/iamcredentials/v1. - noopTransportMinter: returns error when transport auth is disabled. - FakeTransportMinter: exported test double for other packages. Also adds RefreshTokenEntry type for the generalized tokens[] array and parseJWTExpiry for extracting expiry from ID tokens. All tests pass with no live GCP dependency (httptest fakes). Phase 2 item 6 of auth-proxy-mode. 
* hub: wire transport token minter into ServerConfig and dispatch Add TransportMode, TransportAudience, TransportMinter fields to ServerConfig and wire them through to the Server struct and HTTPAgentDispatcher. Transport tokens are injected as env vars (SCION_TRANSPORT_TOKEN, SCION_TRANSPORT_AUDIENCE, SCION_TRANSPORT_TOKEN_EXPIRY) into agent dispatch payloads in all three dispatch paths (Create, Start, Restart). server_foreground.go constructs a gcpTransportMinter from auth.transport config, deriving audience from hubEndpoint for cloudrun_invoker mode. When transport mode is "none" or unset, no minter is created and no transport tokens are injected — zero impact on existing deployments. Phase 2 item 6 of auth-proxy-mode. * hub: extend token refresh response with generalized tokens[] array The agent token refresh handler now returns a tokens[] array alongside the existing token/expires_at fields for backward compatibility. Old clients ignore tokens[]; new clients use it to apply both app-layer and transport-layer tokens. When transport auth is configured (transportMinter != nil), the response includes a google_oidc transport token entry with the configured audience. When disabled, only the app scion_access entry appears. Transport token minting errors are logged but don't fail the refresh — the app token is always returned. Phase 2 item 7 of auth-proxy-mode. * sciontool: add pluggable OIDC transport for agent outbound auth Implement the agent-side transport-layer auth with two pluggable token sources: - injectedTokenSource: uses the hub-provided SCION_TRANSPORT_TOKEN env var (cold start), then refreshed via the tokens[] array on subsequent refresh calls. - metadataTokenSource: fetches OIDC from the GCE metadata server (passthrough/on-GCE mode, the PR GoogleCloudPlatform#307 pattern). Selection logic: SCION_TRANSPORT_TOKEN env → injected mode; else if on GCE → metadata mode; else → no OIDC transport. 
The oidcTransport RoundTripper injects Authorization: Bearer on outbound hub requests. Graceful degradation: if token fetch fails, the request proceeds without the header (the hub can still auth via X-Scion-Agent-Token). Client changes: - Add oidcSource field and configureOIDCTransport() in NewClient() - Update RefreshTokenResponse with tokens[] array (backward compat) - RefreshToken() applies transport tokens via applyRefreshTokens() - Refresh scheduling uses shortest-lived entry (5-min margin for transport tokens vs 2h for scion tokens) 23 new tests covering both sources, transport, configuration, end-to-end dual-header, and refresh token application. Phase 2 item 8 of auth-proxy-mode. * docs(project-log): record auth-proxy-mode Phase 2 implementation * docs: add IAP proxy auth deployment guide (Phase 3) Add comprehensive deployment documentation for the IAP + Cloud Run invoker topology, covering inbound human IAP authentication, outbound agent transport auth (dual-layer OIDC + scion token), security considerations, and an end-to-end GCP setup checklist. All config keys and env vars verified against shipped code. * fix: prevent JWKS cache stampede and add HTTP client timeout - resolveHTTPClient() now returns a client with 10s timeout instead of http.DefaultClient (which has no timeout), preventing hangs on JWKS fetches. Tests that inject their own HTTPClient are unaffected. - JWKS cache refresh now debounces on lastAttempted (set at the start of every attempt, success or failure) instead of lastFetched (success only). This prevents stampedes during persistent JWKS outages where every cache-miss would trigger an unbounded refresh. - Added a refreshing guard to prevent concurrent in-flight refreshes (proactive background refresh + synchronous miss-refresh could race). - Network I/O is now performed outside the write lock to avoid holding the mutex across HTTP requests. 
- Added TestJWKSCache_StampedePreventionDuringOutage to verify that repeated misses during an outage do not cause repeated fetches within the debounce window. * fix: replace custom splitJWT with strings.Split and cache IAM service - Replace the hand-rolled splitJWT function with strings.Split(token, "."). Behavior is identical for well-formed JWTs; the custom function is deleted. - Cache the IAM credentials service client in gcpTransportMinter using sync.Once so it is created once and reused across MintIDToken calls instead of creating a new HTTP client/service on every invocation. Uses context.Background() for the long-lived client construction; per-call ctx continues to be passed to .Context(ctx).Do(). FakeTransportMinter is unaffected.

…oogleCloudPlatform#302) * fix: resolve workspace file browser to groves/ instead of projects/ The Hub UI file browser was showing the wrong directory contents. The hubManagedProjectPath() function resolved workspace paths to ~/.scion/projects/<slug>/ (project metadata) instead of ~/.scion/groves/<slug>/ (the actual git checkout mounted as /workspace in agents). Reverse the lookup priority: check groves/ first, fall back to projects/, and default to groves/ when neither has content. Fixes GoogleCloudPlatform#130 * docs: add project log for issue GoogleCloudPlatform#130 workspace path fix * fix: guard hubManagedProjectPath against empty slug Prevent hubManagedProjectPath from resolving to the parent directory when called with an empty slug. Add unit test for this case.

…by/owner_id) The Agent Ent schema modeled created_by/owner_id as foreign keys to the users table. When an agent creates a sub-agent, those columns hold the *creating agent's* ID, which has no users-table row, so Postgres rejected the insert with a foreign-key violation. mapError maps that to ErrInvalidInput, surfacing as a detail-free "validation_error: Invalid input (status: 400)" on every agent-initiated `scion start`. User-created agents were unaffected, masking the regression (introduced when GoogleCloudPlatform#304 ported the agent store onto Ent). created_by/owner_id are polymorphic principal references (user OR agent), like ancestry. Drop the User-typed edges and keep them as plain principal UUID fields; resolve the delegation creator by ID and tolerate "no such user". Atlas AutoMigrate drops the two FK constraints on existing DBs at next boot. Tests: the sole sub-agent creation test only passed because it seeded a fake user row sharing the agent's ID — an impossible production state. Remove that workaround so it exercises the real path, and add store/ent regression tests asserting a non-user principal ID is accepted.

…o agent containers (GoogleCloudPlatform#322) * Add sciontool doctor and agent auth reset infrastructure When an agent's hub JWT expires and the refresh loop fails (e.g. hub signing key rotation), the agent becomes a zombie: running locally but invisible to the hub. This adds two features to diagnose and recover: 1. `sciontool doctor` command — runs inside the agent container to check env vars, token validity/expiry, hub connectivity, auth status, and GCP metadata/GitHub token health. Prints actionable remediation. 2. Auth reset mechanism — allows pushing a fresh token into a running agent without restarting. The flow is: - Hub generates a new agent JWT via DispatchAgentResetAuth - Broker's /reset-auth endpoint writes the token file via exec - Broker sends SIGUSR2 to sciontool init (PID 1) - Init re-reads the token, updates the hub client, restarts the token refresh loop, and sends an immediate heartbeat Also adds Client.SetToken() for in-memory token updates. * Add scion reset-auth CLI command and hub API endpoint Adds the user-facing `scion reset-auth <agent>` command that triggers an auth reset on a running agent via the Hub. Also adds: - Hub handler for POST /api/v1/agents/{id}/reset-auth - hubclient AgentService.ResetAuth() method --------- Co-authored-by: Scion Agent (eng-manager) <agent@scion.dev>

Adds a "Reset Auth" button in the agent detail header actions area, visible when the agent is running. Clicking it calls the hub's POST /api/v1/agents/{id}/reset-auth endpoint, which generates a fresh JWT and pushes it into the running container without restart.

GoogleCloudPlatform#323) * Make SIGUSR2 signal best-effort in reset-auth handler The kill -USR2 step can fail (e.g. PID 1 is not sciontool init, or the process doesn't handle the signal). Since the token file write already succeeded and the refresh loop will pick up the new token without the signal, treat signal failure as a warning rather than returning a 500 error. * Add admin bulk reset-auth endpoint POST /api/v1/admin/agents/reset-auth-all lists all running agents and dispatches an auth reset for each, returning a per-agent success/failure summary. Admin role required. * Add Reset Auth All button to admin maintenance page Adds a Quick Actions section with a "Reset Auth — All Running Agents" button that calls POST /api/v1/admin/agents/reset-auth-all and displays a per-agent success/failure summary inline. --------- Co-authored-by: Scion Agent (eng-manager) <agent@scion.dev>

…ration (GoogleCloudPlatform#320)

…loudPlatform#319)

…metrics) (GoogleCloudPlatform#407)

Clarify the two distinct metric families in Scion:

- Infrastructure metrics (scion.hub.*, scion.db.*, scion.dispatch.*) for platform health, produced by the Hub process
- Agent metrics (gen_ai.*, agent.*) for harness/model telemetry, produced inside agent containers via the telemetry pipeline

Also defines the Telemetry pipeline term.

Co-authored-by: Scion Agent (metrics-architect) <agent@scion.dev>

…tform#410) Co-authored-by: Scion Agent (harness-build-blocker-fix) <agent@scion.dev>

@Version

* skill-bank M5a: add RoutingSkillResolver and scheme detection Introduce RoutingSkillResolver that groups SkillReferences by URI scheme and dispatches each group to a registered scheme-specific resolver. The hub resolver serves as the fallback for skill:// URIs and bare names. Includes detectScheme() which routes gh://, gcp-skill://, and GitHub full URLs to their respective resolvers, with comprehensive tests covering fallback routing, scheme dispatch, mixed batches, unsupported schemes, nil fallback safety, and error propagation. * skill-bank M5a: wire routing resolver at CLI and broker call sites Replace direct HubSkillResolver construction with RoutingSkillResolver wrapping the hub resolver at both CLI (cmd/create.go) and broker (pkg/runtimebroker/handlers.go) call sites. CachingSkillResolver wraps the routing resolver so content-hash caching applies to all source types. Add SkillURIScheme() utility to pkg/api/skill_uri.go for extracting the scheme portion of a skill URI without full parsing. * skill-bank M5c: add SkillRegistry schema, store, and models Add the SkillRegistry Ent schema with fields for name, endpoint, type (hub/gcp), trust_level (trusted/pinned), auth_token, resolve_path, pinned_hashes, and status. Define the SkillRegistryStore interface and its Ent adapter implementation with CRUD, pinned hash management, and list operations. Embed in the composite store. * skill-bank M5c: add skill registry CRUD handlers Add admin-only HTTP handlers for skill registry CRUD operations: create, list, get, update, delete, and pin hash. Register routes at /api/v1/skill-registries. Enforce HTTPS-only endpoints, validate registry names, and never expose auth tokens in API responses. * skill-bank M5c: add federation proxy and trust enforcement Add federateResolve to proxy skill resolution requests to external registries. The resolve endpoint now detects non-scion registry URIs and delegates to the configured external registry instead of local resolution. 
Supports trusted (pass-through) and pinned (hash verification) trust levels. * skill-bank M5c: add hub client and CLI for skill registries Add SkillRegistryService to the hub client with List, Get, Create, Update, Delete, and Pin operations. Add CLI subcommands under 'scion skills registries' for list, add, show, update, remove, and pin operations with table and JSON output support. * skill-bank M5c: add federation and registry tests Add 16 tests covering federation proxy (trusted/pinned happy paths, hash mismatch, missing pin, unknown/disabled/wrong-type registry, external registry down, auth token forwarding, custom resolve path) and registry CRUD (lifecycle, duplicate name rejection, HTTPS-only enforcement, non-admin rejection, auth token not in responses, pin). * skill-bank M5c: fix federation security issues (H1, H3, L1) H1: Add 10MB body size limit on federation success path to prevent OOM. H3: Disable redirect following on federation HTTP client to prevent credential leakage via Authorization header on cross-origin redirects. L1: Create federation HTTP client once on Server struct instead of per-call, enabling connection pooling and proper test injection. * skill-bank M5d: add gcp-skill:// URI parser and tests Add ParseGCPSkillURI which extracts alias, skill ID, and optional version from gcp-skill://alias/SKILL_ID[@Version] URIs. This is the first building block for the GCP Vertex AI skill resolver. * skill-bank M5d: add GCPSkillResolver with Vertex AI API integration Implements gcp-skill:// resolution via GCP Vertex AI Skills API. The resolver uses ADC for authentication, looks up registry aliases via an injected RegistryLookup function, fetches skill metadata and files from the GCP API, and computes content hashes for verification. * skill-bank M5d: wire GCP resolver at broker and add tests Register the GCPSkillResolver in the broker's skill resolver chain. The registry alias lookup uses the Hub API (which accepts name-based lookups). 
Add comprehensive tests covering happy path, error cases (unknown alias, disabled registry, wrong type, GCP 404/403, ADC failure, empty files), and alias forwarding. * skill-bank M5d: fix version validation, SSRF defense, and response size limit F1: Validate that if a version is requested via @Version in the URI, the GCP API response version must match — reject with a clear error otherwise. F2: Validate file download URLs before fetching: must use HTTPS (except localhost), must share the same host as the registry endpoint, and must not target link-local (169.254.x.x) or RFC 1918 addresses. F6: Wrap fetchSkillMetadata response body with io.LimitReader (1MB) to prevent OOM from oversized API responses. * skill-bank M5b: add gh:// URI parser and tests * skill-bank M5b: add GitHubSkillResolver with Contents API integration * skill-bank M5b: add full GitHub URL parser * skill-bank M5b: wire GitHubSkillResolver at CLI and broker * skill-bank M5b: add GitHub resolver integration tests * skill-bank M5b: fix input sanitization and response size limit * skill-bank M5: fix PR review findings (nil checks, SSRF IPv6, resolvePath, ADC caching) * skill-bank M5: fix SSRF redirect bypass, ADC context, and PinSkillHash race * skill-bank M5: fix federation URI translation, CLI GCP wiring, and path escaping * skill-bank M5: fix CI — gofmt and missing mock method --------- Co-authored-by: Scion Agent (skill-bank-m5a-dev3) <agent@scion.dev>
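For the gcp-skill://alias/SKILL_ID[@Version] shape mentioned in the M5d commits, a parser could be sketched as follows. This is not the repository's ParseGCPSkillURI: the struct, the lowercase function name, and the specific validation rules (no empty parts, no slash in the skill ID) are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// gcpSkillRef is a hypothetical result type for this sketch.
type gcpSkillRef struct {
	Alias   string
	SkillID string
	Version string // empty when no @Version suffix is present
}

// parseGCPSkillURI splits gcp-skill://alias/SKILL_ID[@Version] into parts.
func parseGCPSkillURI(uri string) (gcpSkillRef, error) {
	const scheme = "gcp-skill://"
	if !strings.HasPrefix(uri, scheme) {
		return gcpSkillRef{}, fmt.Errorf("not a gcp-skill URI: %q", uri)
	}
	rest := strings.TrimPrefix(uri, scheme)

	alias, tail, found := strings.Cut(rest, "/")
	if !found || alias == "" || tail == "" {
		return gcpSkillRef{}, fmt.Errorf("expected gcp-skill://alias/SKILL_ID: %q", uri)
	}

	skillID, version, hasVersion := strings.Cut(tail, "@")
	if skillID == "" || strings.Contains(skillID, "/") {
		return gcpSkillRef{}, fmt.Errorf("invalid skill ID in %q", uri)
	}
	if hasVersion && version == "" {
		return gcpSkillRef{}, fmt.Errorf("empty version in %q", uri)
	}
	return gcpSkillRef{Alias: alias, SkillID: skillID, Version: version}, nil
}

func main() {
	for _, uri := range []string{
		"gcp-skill://prod/summarize@1.2.0",
		"gcp-skill://prod/summarize",
		"gh://owner/repo",
	} {
		ref, err := parseGCPSkillURI(uri)
		fmt.Printf("%s -> %+v, err=%v\n", uri, ref, err)
	}
}
```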

* fix: error contracts, integration feedback, outbound errors, and wake audit Stream B — Non-existent agent error contract: - Move agent lookup before message persistence in deliverToAgent() to prevent orphan message rows for deleted agents - Add DELIVERY_FAILED notification type dispatched to agent senders when broker-path delivery targets a non-existent agent - Enhance Hub API 404 responses with agent slug and project context - Mark scheduled events targeting deleted agents with status=failed Stream I — Outbound agent-to-user error feedback: - Persistence failure returns 500 (was silent 200 OK) - Missing recipient returns 400 (removed silent creator fallback) - Broker dispatch failure returns 502 with clear message - Successful sends return message_id, status, recipient, recipient_id Stream K — Wake audit and test coverage: - Add TestHandleAgentMessage_WakeSuspended (primary use case was untested) - Add wake failure scenario tests (start fails, delivery fails) - Add test for messaging suspended agent without --wake - Bump wake timeout from 15s to 30s to match broker retry deadline - Add distinct error for wake-success-delivery-failure - Reject messages to suspended agents without --wake with clear error Stream C — Integration error feedback: - Add ActionAttach permission check for user: senders in handleBrokerInbound - Validate default agents against agent cache before routing in Telegram - Report Hub delivery errors back to originating Telegram chat - Add error cooldown (max 1 per 5 min per chat+thread+error-type) - Include remediation suggestions in error responses * fix: address review findings M1, M2, L1, L2 M1: Fix misleading "Message persisted but delivery failed" error message to "Message delivery failed" — the broker path doesn't persist before dispatch, so the old message was incorrect. M2: Add lazy eviction to errorCooldown map in shouldSuppressError() when map exceeds 1000 entries, preventing unbounded growth in long-running Telegram plugin instances. 
L1: Fix gofmt alignment on ErrCodeAgentNotFound and ErrCodeDeliveryFailed constants.
L2: Inline responseStatus and deliveryStatus variables that were never reassigned; every error path returns early, so the scaffolding added no value.
* feat(messaging): broadcast partial-failure reporting and CLI sender feedback
Stream H: CLI Sender Feedback Improvements
- Add agent phase pre-check in handleAgentMessage: non-running agents return 409 Conflict with guidance (suspended: use --wake, stopped/error: use scion start, other: wait for running state).
- Extend 200 OK response with message_id, status, agent, agent_phase.
- Update hubclient SendStructuredMessage to return *MessageResponse.
- CLI differentiates "delivered" (200) from "deferred" (202) output.
Stream G: Broadcast Partial-Failure Reporting
- Broadcasts return 202 Accepted with targeting info: total agents, targeted (running) count, skipped count with phase breakdown.
- broadcastDirect publishes DELIVERY_FAILED notifications for per-agent delivery failures.
- Message broker fan-out publishes DELIVERY_FAILED on dispatch failures.
- CLI grove-scoped broadcast uses Hub broadcast endpoint and prints acceptance summary with targeted/skipped breakdown.
- Update hubclient BroadcastMessage to return *BroadcastResponse.
* fix: address review findings M1, M2, M3, M4
M1: Eliminate double ListAgents TOCTOU in direct-broadcast path by passing pre-classified running agents from the handler's single query to broadcastDirect.
M2: Add TODO noting --all path needs P3 upgrade when a global broadcast endpoint is added.
M3: Restore zero-targeted guard: print "No running agents" when targeted count is 0 instead of misleading acceptance message.
M4: Sort skipped breakdown phases alphabetically for deterministic CLI output.
* style: fix gofmt formatting in broker_v2.go and agents.go
* feat: channel validation, group[] rename, and scheduled event cleanup (GoogleCloudPlatform#213)
Stream A: Channel/flag validation
- Validate --channel names against registered channels at send time in CLI (sendMessageViaHub, sendOutboundMessageViaHub) and Hub (handleAgentOutboundMessage).
- Return actionable error naming available channels.
Stream F: set[] to group[] rename
- Accept both group[ and set[ prefixes in IsGroupRecipient/ParseGroupRecipient for backward compatibility.
- FormatGroupRecipients now emits group[...] as the canonical syntax.
- CLI help text updated to show group[...] as primary syntax.
- Deprecation warning logged when set[...] is used.
Stream J: Scheduled event cleanup on agent deletion
- Cancel all pending scheduled events targeting a deleted agent in performAgentDelete, before the agent record is removed.
- Match events by parsing payload for agent ID/name/slug.
- Mark cancelled events with status "cancelled" and reason "target agent deleted".
- Cancel corresponding in-memory scheduler timers.
* feat(messaging): no-queuing delivery policy with synchronous broker retry
Replace implicit fire-and-forget queuing with synchronous-or-reject semantics. Messages are now retried against the broker for up to 30s with exponential backoff before failing with 502 (non-transient error) or 504 (timeout). Messages are persisted with dispatch_state=dispatched optimistically and marked as failed on delivery failure.
- Add dispatchWithBrokerRetry() helper with exponential backoff
- Add ErrBrokerTimeout sentinel and broker_timeout error code
- Add MarkMessageFailed() to store interface
- Update all 7 dispatch call sites to use sync retry
- Remove signalDeferredMessage, pending message scan in reconcileBroker
- Remove signalDeferred wiring from MessageBrokerProxy, NotificationDispatcher
- Remove dead "deferred" branch from CLI message output
* fix(messaging): address Phase 4 review findings
F1: MarkMessageFailed now persists the failure reason via a new dispatch_failure_reason column, and removes redundant control flow.
F2: Update stale ErrMessageDeferred comment to reflect retry semantics.
F3: broadcastDirect persists before dispatch, matching other handlers.
F4: Document sequential retry O(N×30s) risk in handleGroupMessage.
F5: Note shared 30s context in deliverToAgent.
F6: Document that post-Phase-4 pending rows indicate a bug.
* fix(messaging): address Phase 1 review findings
- M1: Use FormatGroupRecipients (not deprecated FormatSetRecipients) in handleGroupMessage
- M2: Fail closed when broker proxy is nil during channel validation in outbound handler
- M4: Add unit tests for eventTargetsAgent (6 tests) and validateChannel (3 tests)
- L1: Fix FormatGroupRecipients docstring (set[...] -> group[...])
- Fix ListChannels using CheckResponse which closes body before decode; use DecodeResponse instead
* fix: resolve CI lint and gofmt failures
- Fix gofmt trailing newline in reconcile.go
- Fix errcheck: check CancelEvent return value in handlers.go
- Fix errcheck: discard json.Encode return in handlers.go and test files
- Fix staticcheck: use tagged switch in message_channel_test.go
* fix(messaging): address PR GoogleCloudPlatform#409 review comments
- Add deliveryErr parameter to publishDeliveryFailed for accurate error messages
- Distinguish ErrNotFound from transient errors in agent lookup (messagebroker.go)
- Distinguish ErrNotFound from transient errors in broker inbound handler (403 vs 500)
- Throttle errorCooldown map cleanup to every 100 calls instead of every call
---------
Co-authored-by: Scion Agent (message-improvements-p2) <agent@scion.dev>

…loudPlatform#411) Co-authored-by: Scion Agent (dev-followup-pr) <agent@scion.dev>

Removes [esbuild](https://github.com/evanw/esbuild). It's no longer used after updating ancestor dependency [vite](https://github.com/vitejs/vite/tree/HEAD/packages/vite). These dependencies need to be updated together.
Removes `esbuild`
Updates `vite` from 7.3.2 to 8.0.16
- [Release notes](https://github.com/vitejs/vite/releases)
- [Changelog](https://github.com/vitejs/vite/blob/main/packages/vite/CHANGELOG.md)
- [Commits](https://github.com/vitejs/vite/commits/v8.0.16/packages/vite)
---
updated-dependencies:
- dependency-name: esbuild
  dependency-version:
  dependency-type: indirect
- dependency-name: vite
  dependency-version: 8.0.16
  dependency-type: direct:development
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…GoogleCloudPlatform#415)

…CloudPlatform#412)

…udPlatform#416) Vite 7 deprecated the bundled transformWithEsbuild and now requires esbuild to be installed separately. This fixes the CI build failure: "Failed to load transformWithEsbuild. It is deprecated and it now requires esbuild to be installed separately." Co-authored-by: Scion Agent (ci-vite-esbuild-fix) <agent@scion.dev>

Co-authored-by: Scion Agent (ci-gofmt-fix) <agent@scion.dev>

…(issue GoogleCloudPlatform#256) (GoogleCloudPlatform#419) Co-authored-by: Scion Agent (issue-256-fix) <agent@scion.dev>

…latform#418)
- Use BadRequest() helper for base64 validation in all 4 secret handlers
- Store decoded plaintext (string(decoded)) instead of base64-encoded req.Value, preventing double-encoding when secrets are injected as environment variables or written to secrets.json
- Add MaxBytesReader (128KB) to setSecret, handleProjectSecretByKey, and handleBrokerSecretByKey to match handleAgentSecrets
- Encode secret values as base64 in the frontend using TextEncoder before sending to the API
Co-authored-by: Scion Agent (secret-400-pr418-fix) <agent@scion.dev>
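
The secret-handling fix above (cap the request body, validate the base64 value, store the decoded plaintext) can be sketched as follows. The names `handleSetSecret`, `decodeSecretValue`, `setSecretRequest`, and the `value` JSON key are hypothetical and not taken from the repository, and standard (padded) base64 is an assumption; only `http.MaxBytesReader` and `base64.StdEncoding` are real standard-library APIs.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"net/http"
)

// maxSecretBody mirrors the 128KB limit mentioned in the commit message.
const maxSecretBody = 128 << 10

// setSecretRequest is a hypothetical request shape.
type setSecretRequest struct {
	Value string `json:"value"` // base64-encoded by the client
}

// decodeSecretValue validates the base64 payload and returns the plaintext,
// so the stored value is not encoded a second time when injected later.
func decodeSecretValue(encoded string) (string, error) {
	decoded, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return "", fmt.Errorf("value must be base64: %w", err)
	}
	return string(decoded), nil
}

// handleSetSecret shows where the size limit and the 400 responses would sit.
func handleSetSecret(w http.ResponseWriter, r *http.Request) {
	r.Body = http.MaxBytesReader(w, r.Body, maxSecretBody)
	var req setSecretRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "invalid request body", http.StatusBadRequest)
		return
	}
	plaintext, err := decodeSecretValue(req.Value)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	_ = plaintext // a real handler would persist the plaintext here
	w.WriteHeader(http.StatusNoContent)
}
```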

…rm#420)
Restores the build feature code that was accidentally removed by PR GoogleCloudPlatform#412. This includes:
- BuildHarnessConfigImageExecutor in maintenance_executors.go
- build-harness-config-image seeded operation
- Executor wiring in admin_maintenance.go
- Build Image button, dialog, and log streaming UI on harness-config detail page
Originally shipped in PRs GoogleCloudPlatform#406 and GoogleCloudPlatform#410.
Co-authored-by: Scion Agent (harness-local-build) <agent@scion.dev>

Stop auto-closing tasks on the first content message from an agent. Previously, any non-state-change message immediately marked the task as completed and closed all subscriptions (the MVP single-turn limitation documented in the TODO at bridge.go:633). Now content messages are broadcast to streaming and push subscribers with state=working and Final=false, keeping the task alive.
Task lifecycle is driven solely by agent state-change messages:
- working/thinking/executing → working (non-terminal)
- waiting_for_input → input-required (non-terminal)
- completed → completed (terminal, closes task)
- error/stalled → failed (terminal, closes task)
This enables multi-turn conversations where agents ask clarifying questions, send progress updates, or emit interim artifacts before completing.
Design doc: .design/a2a-multi-turn-lifecycle.md
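
The state mapping listed in this commit message can be written as a single switch. The sketch below is illustrative only: `taskState`, the constant names, and `mapAgentState` are hypothetical and not taken from bridge.go; the state strings are the ones named in the commit message.

```go
package main

// taskState holds the task-facing states named in the commit message.
type taskState string

const (
	stateWorking       taskState = "working"
	stateInputRequired taskState = "input-required"
	stateCompleted     taskState = "completed"
	stateFailed        taskState = "failed"
)

// mapAgentState translates an agent state-change into a task state and
// reports whether that state is terminal (closes the task). ok is false
// for agent states the mapping does not know about.
func mapAgentState(agentState string) (state taskState, terminal bool, ok bool) {
	switch agentState {
	case "working", "thinking", "executing":
		return stateWorking, false, true
	case "waiting_for_input":
		return stateInputRequired, false, true
	case "completed":
		return stateCompleted, true, true
	case "error", "stalled":
		return stateFailed, true, true
	default:
		return "", false, false
	}
}
```

Content messages never pass through this mapping; under the new lifecycle they are broadcast with state=working and Final=false, so only the two terminal branches can close a task.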

20 tests covering the multi-turn task lifecycle:
- Content messages don't complete tasks
- Content broadcasts with state=working, Final=false
- Multiple content messages keep task alive
- State-change to completed/failed closes task properly
- State-change to input-required keeps task alive
- Blocking SendMessage returns working (not completed)
- Blocking timeout/error/cancel cleans up activeTask
- Full multi-turn lifecycle integration test
- Slug-based fallback correlation with content
- Metrics not incremented on content messages

Fixes from code review:
1. Terminal state-changes dropped during blocking calls: dispatchToWaiter skipped state-change messages entirely, even terminal ones. The task's DB state was never updated to completed/failed. Fix: update DB state for terminal state-changes even when a waiter is active.
2. Janitor reaping active multi-turn tasks: content messages didn't refresh the task's UpdatedAt timestamp, so long conversations could be reaped as stale. Fix: call UpdateTaskState(working) on content messages to refresh the timestamp.
Added/updated tests for both scenarios.

…review Debug/refactor cycle findings: - Refactored dispatchToActiveTask for clarity - Added test coverage for edge cases in state-change handling - All tests pass

Enable multi-turn conversations by routing message/send with a taskID to the same agent, continuing the conversation instead of creating a new task.
When SendMessageParams includes a taskID:
1. Look up the existing task and verify ownership
2. Reject if task is in a terminal state (completed/failed/canceled)
3. Resolve the agent from stored task metadata
4. Send the follow-up message to the agent
5. Return the existing task (not a new one)
This works with both blocking and non-blocking modes. Combined with the multi-turn lifecycle change (PR 1), this enables the full A2A multi-turn flow: client sends initial message → agent responds or asks for input → client sends follow-up → agent continues.
Design doc: .design/a2a-task-followup.md

22 tests covering follow-up message routing:
- Valid/unknown/terminal/wrong-project/wrong-agent task ID handling
- Task state transitions (input-required → working)
- Blocking timeout/error/cancel/success cleanup paths
- Non-blocking registration and send-failure cleanup
- Concurrent follow-ups on same task
- Message content translation
- Server-level TaskID passthrough and error handling
Bugs fixed during review:
- Blocking success path leaked activeTask (added defer unregister)
- Non-blocking send failure didn't mark task failed or unregister

Fixes from code review:
- Blocking success: refresh task timestamp with UpdateTaskState(working)
- Send failure: mark task as failed + unregister activeTask
- Timeout/cancel: mark task as failed
- Added tests verifying DB state after each path

Found during 12-cycle debug/refactor: - Fixed edge cases in follow-up routing and state management - Added test coverage for discovered paths - 3 consecutive clean cycles after fixes

Update agent cards to advertise streaming and push notification support now that multi-turn conversations are implemented.
- Registry card: streaming=true, pushNotifications=true
- Per-agent cards: streaming=true, pushNotifications=true
- Remove MVP streaming warning from handleStreamMessage
- Update README: remove single-turn limitation, update known limitations to reflect current state (no gRPC/REST transport)

4 tests verifying multi-turn capability advertisement:
- Registry card advertises streaming=true, pushNotifications=true
- Per-agent card matches registry capabilities
- Direct unit test of GenerateAgentCard capability values
- Drift prevention test ensuring registry and per-agent cards stay in sync

…d use topic helpers (GoogleCloudPlatform#421)
Address review feedback from merged A2A PRs GoogleCloudPlatform#314 and GoogleCloudPlatform#315:
- Add TouchTask store method to refresh timestamps without changing state
- Guard dispatchToActiveTask so content messages read and preserve the current task state instead of unconditionally resetting to working
- Replace hardcoded fmt.Sprintf topic patterns in sendFollowUp and SendStreamingMessage with projectcompat.UserTopic/LegacyUserTopic
- Fix SendStructuredMessage call sites missing second return value
- Update followup_test.go mocks to match current hubclient interfaces
Co-authored-by: Scion Agent (a2a-review-followup-dev) <agent@scion.dev>

Co-authored-by: Scion Agent (broker-shutdown-inv) <agent@scion.dev>

Templates that specify a fully-qualified custom image (e.g. ghcr.io/myorg/scion-myimage:latest) currently get their registry prefix rewritten by the broker's image_registry setting. This makes it impossible to use custom scion-* images hosted in external registries without push access to the broker's registry. Add an `image_pinned` field to ScionConfig. When set to true in a template's scion-agent.yaml, the image is used as-is without registry rewriting.
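
A minimal sketch of the opt-in check this commit describes, assuming a simplified rewrite that keeps only the final path segment of the image. The `scionConfig` struct, `rewriteImageRegistry`, and `resolveImage` below are hypothetical stand-ins; the real ScionConfig and RewriteImageRegistry() in pkg/api and pkg/agent may behave differently.

```go
package main

import "strings"

// scionConfig is a hypothetical subset of the template configuration.
type scionConfig struct {
	Image       string
	ImagePinned bool
}

// rewriteImageRegistry is a simplified stand-in for the broker's rewrite:
// it replaces everything before the last "/" with the configured registry.
func rewriteImageRegistry(image, registry string) string {
	name := image
	if i := strings.LastIndex(image, "/"); i >= 0 {
		name = image[i+1:]
	}
	return registry + "/" + name
}

// resolveImage returns the image to run. A pinned image, or an empty
// registry setting, leaves the template's reference untouched.
func resolveImage(cfg scionConfig, registry string) string {
	if cfg.ImagePinned || registry == "" {
		return cfg.Image
	}
	return rewriteImageRegistry(cfg.Image, registry)
}
```

Because the flag defaults to false, templates that do not set it keep the existing rewrite behavior, which matches the opt-in intent stated in the summary.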

zeroasterisk · 2026-06-15T01:45:19Z

Closing — superseded by format-based detection (#8 / ptone#266). image_pinned approach deprecated.

ptone and others added 30 commits June 2, 2026 05:46

docs: document REGION and ZONE overrides in starter-hub README

e64a1a7

fix: use sudo to check repository path existence in gce-demo-setup-repo.sh

b2eaa59

cmd: fix nil-pointer panic in harness-config when Hub is disabled

5d21b25

scripts/starter-hub: add MACHINE_TYPE override support to provision script

26caeb9

Restore contents of .scion as before the recent pull

b1f08a0

Remove scratchpad markdown files as requested

d1a01c7

Organize developer tools into hack and fix build config

afaae9a

feat(runtime): NFS-coordinated workspace sharing across nodes (Model A Docker + Model B GKE) (GoogleCloudPlatform#306)

02efd44

fix(hub): use deterministic UUID for plugin broker IDs to match α migration (GoogleCloudPlatform#320)

87c0487

fix(hub): address PR GoogleCloudPlatform#319 review feedback (GoogleCloudPlatform#319)

a3a7530

ptone and others added 28 commits June 12, 2026 04:41

fix: correct runId JSON key mismatch in build polling (GoogleCloudPlatform#410)

5da221d

Co-authored-by: Scion Agent (harness-build-blocker-fix) <agent@scion.dev>

fix: use net/mail.ParseAddress for stricter email validation (GoogleCloudPlatform#411)

1edca56

Co-authored-by: Scion Agent (dev-followup-pr) <agent@scion.dev>

skill-bank M5: fix SQLite pin compatibility and skill name validation (GoogleCloudPlatform#415)

d6e1ba1

Add filter/sort to project detail agent list (GoogleCloudPlatform#414)

2afa0e4

Fix duplicate no_auth keys and missing field schema attribute (GoogleCloudPlatform#412)

fb67191

style: gofmt all unformatted Go source files (GoogleCloudPlatform#417)

ff87bb3

Co-authored-by: Scion Agent (ci-gofmt-fix) <agent@scion.dev>

fix: regenerate ent client to remove stale discordpendinglink import (issue GoogleCloudPlatform#256) (GoogleCloudPlatform#419)

4b7a775

Co-authored-by: Scion Agent (issue-256-fix) <agent@scion.dev>

fix(a2a-bridge): refactor multi-turn dispatch + add tests from debug review

e077366

Debug/refactor cycle findings:
- Refactored dispatchToActiveTask for clarity
- Added test coverage for edge cases in state-change handling
- All tests pass

fix(a2a-bridge): 4 fixes in sendFollowUp from debug review

487a4d8

Found during 12-cycle debug/refactor: - Fixed edge cases in follow-up routing and state management - Added test coverage for discovered paths - 3 consecutive clean cycles after fixes

fix: add ResetAuth to mock after upstream interface change

508bcce

Protect metadata shutdown endpoint (GoogleCloudPlatform#422)

326e497

Co-authored-by: Scion Agent (broker-shutdown-inv) <agent@scion.dev>

zeroasterisk closed this Jun 15, 2026

zeroasterisk wants to merge 155 commits into main from feat/image-pinned


3 participants
