feat: add image_pinned to skip registry rewrite for custom images#7

Closed
zeroasterisk wants to merge 155 commits into main from feat/image-pinned

@zeroasterisk

Owner

Summary

  • Adds image_pinned field to ScionConfig (scion-agent.yaml)
  • When image_pinned: true, the image_registry rewrite is skipped, preserving the exact image reference from the template
  • No behavior change for existing templates (opt-in only)

Motivation

Templates that specify a fully-qualified custom scion-* image (e.g. ghcr.io/myorg/scion-elixir-image:latest) get their registry prefix rewritten by RewriteImageRegistry() to match the broker's image_registry setting. This makes it impossible to use custom images hosted in external registries without push access to the broker's registry.

The image_pinned flag lets template authors signal that their image reference is intentional and should not be rewritten.

Usage

In a template's scion-agent.yaml:

image: ghcr.io/myorg/scion-custom-image:latest
image_pinned: true

Changes

  • pkg/api/types.go: Add ImagePinned bool field to ScionConfig
  • pkg/agent/run.go: Check ImagePinned before applying registry rewrite
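
A minimal sketch of the check described above, under stated assumptions: the ScionConfig here carries only the two relevant fields, the yaml tags are guesses, and rewriteImageRegistry is a hypothetical stand-in for RewriteImageRegistry() whose real signature and rewrite rules are not shown in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// ScionConfig mirrors only the two fields this PR discusses. The real struct
// in pkg/api/types.go has more fields, and these yaml tags are assumptions.
type ScionConfig struct {
	Image       string `yaml:"image"`
	ImagePinned bool   `yaml:"image_pinned"`
}

// rewriteImageRegistry is a hypothetical stand-in for RewriteImageRegistry():
// it replaces everything before the final path segment with the broker registry.
func rewriteImageRegistry(image, registry string) string {
	if registry == "" {
		return image
	}
	name := image
	if i := strings.LastIndex(image, "/"); i >= 0 {
		name = image[i+1:]
	}
	return strings.TrimSuffix(registry, "/") + "/" + name
}

// resolveImage skips the rewrite when the template pinned its image.
func resolveImage(cfg ScionConfig, registry string) string {
	if cfg.ImagePinned {
		return cfg.Image
	}
	return rewriteImageRegistry(cfg.Image, registry)
}

func main() {
	cfg := ScionConfig{Image: "ghcr.io/myorg/scion-custom-image:latest", ImagePinned: true}
	fmt.Println(resolveImage(cfg, "registry.example.com/scion"))
	cfg.ImagePinned = false
	fmt.Println(resolveImage(cfg, "registry.example.com/scion"))
}
```

Because the zero value of ImagePinned is false, templates that omit the field keep the existing rewrite behavior.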

Test plan

  • go build ./... passes
  • go test ./pkg/agent/... ./pkg/api/... ./pkg/config/... — all image-related tests pass
  • Manual: create template with image_pinned: true and custom image, verify no rewrite in debug logs

Intended as a PR to googlecloudplatform/scion — created here on fork first due to token permissions. Please open upstream PR or grant fork PR access.

ptone and others added 30 commits June 2, 2026 05:46
…rm#293)

* fix(scion-chat-app): set channel="gchat" on ask_user dialog responses

handleDialogSubmit was using the simple SendMessage API which doesn't
support structured message fields, so inbound ask_user responses arrived
at the hub with no channel set (defaulting to "web"). Switch to
SendStructuredMessage with Channel="gchat" to match the pattern already
used by cmdMessage.

* fix: channel filtering and thread-id routing for chat channel replies

Two bugs in the chat channel routing feature:

1. Channel filtering: broker plugins now check msg.Channel and skip
   messages targeted at a different channel. The hub injects plugin_name
   into broker credentials so each plugin knows its own channel identity.
   This prevents cross-channel delivery (e.g., Telegram replies leaking
   to Google Chat).

2. Thread-id routing: the Telegram plugin now passes msg.ThreadID as
   message_thread_id to the Telegram Bot API when sending outbound
   messages. Previously, thread-id was captured on inbound messages but
   never forwarded on outbound, causing replies to land in the wrong
   forum topic. Added SendOption variadic parameter to SendMessage,
   SendMessageWithKeyboard, and SendQueue.Send for backward-compatible
   thread-id support.
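
The backward-compatible variadic parameter mentioned in item 2 can be sketched with Go's functional-options pattern. Everything below is illustrative: sendParams, WithThreadID, and buildPayload are hypothetical names, and only the message_thread_id field name comes from the Telegram Bot API.

```go
package main

import "fmt"

// sendParams collects optional outbound settings (hypothetical shape).
type sendParams struct {
	threadID int64 // 0 means "no forum topic"
}

// SendOption mutates sendParams. Existing callers pass none, so adding the
// variadic parameter does not break them.
type SendOption func(*sendParams)

// WithThreadID is a hypothetical option that targets a forum topic.
func WithThreadID(id int64) SendOption {
	return func(p *sendParams) { p.threadID = id }
}

// buildPayload stands in for the request-building part of SendMessage.
func buildPayload(chatID int64, text string, opts ...SendOption) map[string]any {
	var p sendParams
	for _, opt := range opts {
		opt(&p)
	}
	payload := map[string]any{"chat_id": chatID, "text": text}
	if p.threadID != 0 {
		// Only set the field when a thread was requested.
		payload["message_thread_id"] = p.threadID
	}
	return payload
}

func main() {
	fmt.Println(buildPayload(100, "hello"))
	fmt.Println(buildPayload(100, "hello", WithThreadID(42)))
}
```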

* feat(scion-chat-app): add Google Chat thread context support

Propagate thread IDs end-to-end so agents can participate in
Google Chat threads:

- Inbound: auto-set ThreadID on StructuredMessage from the Google Chat
  event's thread context when no explicit --thread flag is used
- Inbound: propagate ThreadID on dialog submit (ask_user responses)
- Outbound: pass ThreadID from StructuredMessage to SendMessageRequest
  so agent replies land in the correct Google Chat thread

* fix: route outbound messages to chat-app via ChannelID

The FanOutEventBus matched msg.Channel against the bus Name, but the
chat-app plugin is registered as "chat-app" while its messages use
channel="gchat". Add a ChannelID field to NamedEventBus and PluginInfo
so plugins can declare the channel they handle independently of their
registered name. The chat-app now reports ChannelID="gchat" via
GetInfo(), and the hub reads it at startup to wire routing correctly.

* design: per-topic /default agent scoping for Telegram forums

Explores how to let /default set a different default agent per
forum topic (message_thread_id) rather than per-chat. Conclusion:
~85 lines of changes across store, commands, callbacks, and routing.

* feat(scion-telegram): per-topic /default agent scoping for forum groups

Add support for setting a different default agent per Telegram forum
topic/thread, with the chat-wide default as fallback.

- New topic_defaults table keyed on (chat_id, thread_id)
- /default in a topic sets/shows the topic-level override
- Callback data extended: dflt:<slug>:<threadID> for topic scope
- Routing resolves topic default before chat default for both
  @bot-mention and unaddressed message fallback paths

* fix: address PR GoogleCloudPlatform#293 review feedback

- Add !no_sqlite build tag to resource_import_handler_test.go to fix CI
  vet failure (mockRoundTripper undefined when template_bootstrap_test.go
  is excluded)
- Guard debug log in broker.go Publish against nil msg to prevent panic
- Add fitCallback to preserve threadID suffix in Telegram callback_data
  when the 64-byte limit is exceeded, truncating agentSlug instead
- Add slog warning to truncateCallback when truncation occurs

* fix: address second round of PR GoogleCloudPlatform#293 review feedback

- Remove redundant channel filters from chat-app and Telegram Publish()
  methods — the FanOutEventBus already routes by ChannelID, and comparing
  against the plugin's registered name would silently drop messages
- Log errors from GetTopicDefault instead of silently ignoring them
- Return distinct error messages in chat-app when ResolveOrAutoRegister
  fails with a real error vs a nil mapping

* fix: address third round of PR GoogleCloudPlatform#293 review feedback

- Add early return for nil msg at top of Publish() to prevent panics
  in downstream handlers that dereference msg fields
- Add thread-safe ChannelName() getter on BrokerServer
- Use dynamic ChannelName() in GetInfo() instead of hardcoded "gchat"
- Use dynamic ChannelName() in both commands.go call sites

* fix: use callback_lookups for long callback data instead of truncation

Replace fitCallback() which corrupted agent slugs by truncating them
to fit Telegram's 64-byte limit. Long callback payloads are now stored
in the callback_lookups table with a short cblu:<id> reference.
HandleCallback resolves lookup IDs before routing.

Also add defensive check for empty HubUserEmail in chat-app to prevent
constructing invalid "user:" sender strings.

* fix: address fifth round of PR GoogleCloudPlatform#293 review feedback

- Use local interface instead of concrete *BrokerRPCClient type assertion
  in pluginChannelID() and isObserverBroker() so in-process brokers and
  mocks are handled correctly.
- Add nil guard for msg in fanout channel routing check.

---------

Co-authored-by: Scion <agent@scion.dev>
…eCloudPlatform#296)

* Fix test suite leaking Hub credentials, corrupting agent state (GoogleCloudPlatform#123)

Tests that spawn sciontool (e.g., TestInitCommand_Integration) inherited
live Hub env vars from the agent container, causing the subprocess to
talk to the real Hub and reset the agent phase to "starting."

- Add scrubHubEnv(t) helpers that use t.Setenv to clear Hub env vars
  (SCION_HUB_ENDPOINT, SCION_HUB_URL, SCION_AUTH_TOKEN, SCION_AGENT_ID,
  SCION_AGENT_MODE) with automatic restore on test cleanup
- Filter Hub env vars from subprocess Cmd.Env in TestInitCommand_Integration
  as belt-and-suspenders protection
- Convert os.Setenv/os.Unsetenv to t.Setenv throughout hub_test.go and
  client_test.go for crash-safe env var isolation

* Add project log entry for issue GoogleCloudPlatform#123 fix

* Address PR GoogleCloudPlatform#296 review feedback in init_test.go

Replace hardcoded /tmp/sciontool-test path with t.TempDir() to avoid
permission conflicts and test races. Replace map allocation in
filterHubEnv with slices.Contains on the static hubEnvVars slice.
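
For illustration, a filter along these lines could look like the following. The variable names are taken from the commit message above, but the function body is a guess at the shape rather than the code in init_test.go.

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// hubEnvVars lists the Hub variables named in the commit message.
var hubEnvVars = []string{
	"SCION_HUB_ENDPOINT",
	"SCION_HUB_URL",
	"SCION_AUTH_TOKEN",
	"SCION_AGENT_ID",
	"SCION_AGENT_MODE",
}

// filterHubEnv returns env (KEY=VALUE entries) without the Hub variables,
// suitable for assigning to a subprocess's Cmd.Env.
func filterHubEnv(env []string) []string {
	out := make([]string, 0, len(env))
	for _, kv := range env {
		name, _, _ := strings.Cut(kv, "=")
		if slices.Contains(hubEnvVars, name) {
			continue
		}
		out = append(out, kv)
	}
	return out
}

func main() {
	env := []string{"PATH=/usr/bin", "SCION_AUTH_TOKEN=secret", "HOME=/root"}
	fmt.Println(filterHubEnv(env))
}
```

A linear scan over a five-element slice is cheap enough that the per-call map allocation it replaces buys nothing.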
…oogleCloudPlatform#299)

Three new documentation pages:

- External Channels: covers Telegram (bidirectional group chat),
  Discord (outbound webhooks), and A2A protocol bridge in one page.
  Summarizes concepts and links to detailed READMEs in extras/.

- Hub Setup on GCE: step-by-step walkthrough of deploying a hub
  using the starter-hub scripts. Covers provisioning, repo setup,
  TLS, and post-setup next steps.

- Multi-Broker Setup: how to connect multiple machines to a single
  hub for distributed agent execution. Covers architecture, broker
  registration, selection, and cross-broker considerations.

Sidebar updated to include all three pages.
* Add sort and filter capabilities to agent list view (GoogleCloudPlatform#71)

CLI: add --phase, --activity, --template filter flags and --sort,
--reverse sort flags to 'scion list'. Validates flag values against
known phases/activities. Passes phase filter server-side in hub mode
for efficiency.

Web UI: add phase filter chips (All/Running/Stopped/Suspended/Error),
sortable table headers (Name, Status, Updated), and sort dropdown for
grid view. Filter and sort state persists to localStorage.

Closes GoogleCloudPlatform#71

* Address review feedback: input canonicalization and validation

- CLI: canonicalize --phase/--activity/--sort to lowercase in
  validateListFlags, remove redundant empty check on filterActivity
- Web UI: validate localStorage phase filter against known values
  instead of raw cast
- Web UI: validate localStorage sort config field/dir values before
  applying
- Web UI: handle invalid date strings in formatRelativeTime with
  isNaN guard
…rm#295)

* Add prominent disconnected overlay to web terminal

When the WebSocket connection drops, a full-terminal overlay now appears
with 50% black opacity and large red "DISCONNECTED" text centered on it.
The overlay appears immediately on disconnect and disappears when the
connection is re-established. The small status indicator in the toolbar
remains as a secondary signal.

Fixes GoogleCloudPlatform#77

* Move disconnected overlay to be a sibling of xterm container

The overlay was a child of .terminal-container, whose DOM is managed by
xterm.js. Lit re-rendering the overlay on connect/disconnect state
changes conflicts with xterm's DOM management.

Fix: introduce .terminal-wrapper as the relative-positioning context,
make .terminal-container absolutely positioned inside it, and render
the overlay as a sibling — outside xterm's managed subtree.

* Use wasConnected flag instead of terminal ref for overlay reactivity

Replace the non-reactive `this.terminal` reference in the overlay
condition with a new `@state() wasConnected` flag. This fixes two issues:

1. Lit reactivity: `this.terminal` lacked `@state()` so changes to it
   didn't trigger re-renders. The new `wasConnected` is properly
   decorated as reactive state.

2. Initial connection: using `this.terminal` would flash the overlay
   during the brief window between terminal init and WebSocket open.
   `wasConnected` is only set true after the first successful connect,
   so the overlay only appears after a genuine disconnection.
…tore port, LISTEN/NOTIFY (GoogleCloudPlatform#304)

* P0-1: switch Postgres driver from lib/pq to pgx/v5 stdlib

- Add github.com/jackc/pgx/v5/stdlib (registers as "pgx")
- driver_postgres.go: blank import pgx stdlib instead of lib/pq
- OpenPostgres: open via sql.Open("pgx", dsn) + entsql.OpenDB
- Introduce PoolConfig (applied to *sql.DB); thread through
  OpenSQLite/OpenPostgres and update all callers
- go mod tidy drops lib/pq

* P0-2: add connection pool config to DatabaseConfig

- DatabaseConfig gains MaxOpenConns / MaxIdleConns / ConnMaxLifetime
  plus ConnMaxLifetimeDuration() helper
- DefaultGlobalConfig sets sqlite pool defaults (MaxOpenConns=1,
  load-bearing for write serialization)
- applyDatabasePoolDefaults fills postgres defaults (20/5/30m) and
  forces sqlite MaxOpenConns=1; called in both load paths
- Mirror fields in V1DatabaseConfig + both conversion directions
- Wire pool settings into entc.OpenSQLite in initStore

* P0-3/P0-4: CRUD-parity test harness + spec-driven fixture generator

P0-3: pkg/store/storetest/ — backend-agnostic, table-driven CRUD oracle.
A Factory(t) -> store.Store is injected; generic Domain[T] descriptors drive
Create/Read/Update/Delete (+optional soft-delete)/List-paginate/List-filter.
Ships group + policy domains and runs green against today's CompositeStore
(SQLite base + Ent DB). Ready to accept a postgresFactory for P3-2.

P0-4: internal/fixturegen/ — Go-defined spec seeding >=1 row per table across
all 30 domain tables, with edge cases (NULL optionals, max-length strings,
nested/unicode JSON, soft-deleted agent, BLOB). Deterministic. 'go run
./internal/fixturegen' emits testdata/hub-v46-fixture.db, prints a 30-table
coverage report, and caches the blob to the scratchpad mount. CI gate fails if
any table has zero rows.

* feat(ent): add 23 new Ent schemas for full table parity (P1-2 + P1-3)

* feat(observability): add Cloud Monitoring scaffolding for LISTEN/NOTIFY metrics (P0-5)

* P2: port notification + gcp/github/token domains to Ent entadapter

Add Ent-backed implementations of the notification, GCP service account,
GitHub App installation, and user access token store sub-interfaces:

- notification_store.go: NotificationStore (subscriptions, notifications,
  templates). Dispatch uses an atomic conditional update as the multi-replica
  claim primitive, and an optional NotificationPublisher reserves a hook for
  the LISTEN/NOTIFY fan-out of created/dispatched events.
- external_store.go: GCPServiceAccountStore + GitHubInstallationStore +
  UserAccessTokenStore. GitHub create is idempotent (INSERT OR IGNORE
  semantics), repositories/scopes are JSON, default_scopes is CSV, and tokens
  support key-hash lookup. Legacy api_keys is intentionally not surfaced.
- storetest: add GCPServiceAccount, SubscriptionTemplate, and
  NotificationSubscription CRUD-parity domains.

Does not modify composite.go.

* P2: port schedule, maintenance, message domains to Ent entadapter

- schedule_store.go: ScheduleStore + ScheduledEventStore sub-interfaces with
  dialect-aware SELECT FOR UPDATE SKIP LOCKED claim helper for the
  ListDueSchedules / ListPendingScheduledEvents job-claim paths (plain SELECT
  on SQLite, SKIP LOCKED on Postgres).
- maintenance_store.go: run-state RMW, AbortRunningMaintenanceOps, Go-side
  seed (uuid.New) replacing SQLite randomblob() UUID seeds.
- message_store.go: CRUD, read flags, PurgeOldMessages, design-in
  PublishUserMessage hook for Postgres LISTEN/NOTIFY.
- pkg/ent/client_driver.go: hand-written Client.Driver() accessor for
  dialect detection + raw locking queries.

* feat(entadapter): port user + allowlist/invite domains to Ent (P2)

Implements the Ent-backed store adapters for the user and
allowlist/invite domains, plus their CRUD-parity oracle descriptors.

pkg/store/entadapter/user_store.go (store.UserStore):
- CreateUser/GetUser/GetUserByEmail/UpdateUser/UpdateUserLastSeen/
  DeleteUser/ListUsers.
- Case-insensitive email: emails are normalized to lower case on write
  (so the plain unique index enforces case-insensitive uniqueness,
  equivalent to the legacy UNIQUE COLLATE NOCASE) and matched with
  EmailEqualFold (lower(email)=lower($1)) on read. ent codegen +
  AutoMigrate cannot emit a real lower(email) functional index across
  both SQLite (tests) and Postgres, so the invariant is enforced at the
  port layer.
- Offset-based pagination matching the legacy SQLite store.

pkg/store/entadapter/allowlist_store.go (store.AllowListStore +
store.InviteCodeStore):
- Full allow-list + invite-code CRUD.
- BulkAddAllowListEntries uses CreateBulk + OnConflictColumns(email).Ignore()
  for race-safe INSERT-OR-IGNORE; added/skipped counts mirror
  the legacy per-row semantics (existing + within-batch dups skipped).
- IncrementInviteUseCount is a single atomic conditional UPDATE
  (revoked=false AND not expired AND (max_uses=0 OR use_count<max_uses)),
  which is race-free on both backends without SELECT...FOR UPDATE. The
  sql/lock feature is enabled and ForUpdate is available for genuine
  multi-statement RMW paths.
- ListAllowListEntriesWithInvites batch-joins invite codes (invite_id is
  a plain column, not an Ent edge).

Schema:
- pkg/ent/schema/user.go: add nillable last_seen field (+ index) needed
  by UpdateUserLastSeen / lastSeen sort; document the case-insensitive
  email strategy.
- pkg/ent/generate.go: enable --feature sql/upsert,sql/lock (required for
  OnConflict and ForUpdate).

Tests (all passing):
- pkg/store/storetest/domains_user.go: UserDomain, AllowListDomain,
  InviteCodeDomain oracle descriptors (kept in a separate file to avoid
  contending on domains.go).
- entadapter oracle test runs the shared CRUD-parity suite directly
  against the new adapters; behavior tests cover case-insensitivity,
  bulk idempotency, conditional increment, stats, and the invite join.

NOTE: Generated Ent code under pkg/ent/** is intentionally NOT included.
This is a shared worktree where sibling port agents concurrently modify
schemas and the same feature flags; the generated code must be
regenerated at wave integration via:
    go generate ./pkg/ent/...
Verified locally that regeneration + full build + tests pass.

Per P2 scope: composite.go wiring and ensureEntUser shadow removal are
deferred to P2-collapse.

* P2: port secret/env_var + template/harness_config domains to Ent

Add Ent-backed store implementations for the secret/env and
template/harness domains, mirroring the legacy SQLite semantics:

- entadapter/secret_store.go: SecretStore implementing store.SecretStore
  + store.EnvVarStore. Polymorphic (scope, scope_id) addressing, COALESCE
  target->key projection, version bump on update, get-then-update upsert,
  and transitive ListProgenySecrets via a created_by IN-list over the
  ancestor set (user scope + allow_progeny only; encrypted value withheld).
- entadapter/template_store.go: TemplateStore implementing
  store.TemplateStore + store.HarnessConfigStore. base_template hierarchy,
  scope/project_id backwards-compat lookups, content_hash, JSON config/files
  columns, DeleteByScope. Subscription templates are owned by NotificationStore.
- Direct Ent unit tests incl. a progeny-inheritance parity test.
- storetest: Template/HarnessConfig/Secret/EnvVar domain descriptors wired
  into RunStoreSuite for cross-backend CRUD parity.

* P2: port project/broker + brokersecret domains to Ent

Port the project/broker domain (projects, runtime_brokers, project_contributors,
project_sync_state) and the broker-auth domain (broker_secrets,
broker_join_tokens) from raw SQL to Ent adapters.

- pkg/store/entadapter/project_store.go: implements ProjectStore,
  RuntimeBrokerStore, ProjectProviderStore and ProjectSyncStateStore.
  * provider + sync-state upserts use Ent OnConflict().UpdateNewValues()
    (sql/upsert) keyed on the (project_id, broker_id) unique index.
  * runtime broker heartbeat/update use an optimistic version-CAS loop on a
    new internal lock_version token, serializing concurrent writers portably
    across SQLite (tests) and Postgres without SELECT ... FOR UPDATE.
  * slug lookups support case-insensitive matching (EqualFold).
  * project computed fields (AgentCount, ActiveBrokerCount, ProjectType) are
    derived via Ent queries, matching the legacy SQLite store.
- pkg/store/entadapter/brokersecret_store.go: implements BrokerSecretStore
  (per-broker HMAC secrets + short-lived join tokens, expiry cleanup).
- Project Ent schema: add operational fields for full parity
  (default_runtime_broker_id, shared_dirs, github_*, git_identity).
- RuntimeBroker Ent schema: relax vestigial type column to Optional, add
  internal lock_version concurrency token.
- Regenerate Ent with sql/upsert,sql/lock features.
- storetest: add Project, RuntimeBroker, BrokerSecret and BrokerJoinToken
  CRUD-parity domains.
- Unit tests for both adapters.

Per the integration plan, composite.go wiring and ensureEntProject shadow
removal are deferred to P2-collapse.

* P2: port agent domain to Ent entadapter (XL)

* chore(ent): regenerate Ent code for all 30 entity schemas

Regenerated with --feature sql/upsert,sql/lock to support
OnConflict upserts and ForUpdate/SKIP LOCKED job claims.

* P2-collapse: collapse dual-DB into single Ent store

Wire all Ent-backed sub-stores into CompositeStore via embedding, removing
the raw-SQL base store and the User/Agent/Project shadow-sync machinery
(ensureEntUser/ensureEntAgent/ensureEntProject). CompositeStore now serves
every domain from a single Ent client and implements Close/Ping/Migrate
directly.

Collapse initStore() to open one Ent SQLite DB (no _ent shadow DSN, no
MigrateGroveToProjectData, no raw sqlite.New). Register the User, AllowList,
and InviteCode domains in the storetest CRUD-parity suite. Update entadapter
tests for the single-DB NewCompositeStore(client) signature.

go build ./... green; go test ./pkg/store/entadapter/... ./pkg/store/storetest/... green.

* P2-delete: remove raw-SQL store implementation

Delete the ~6k-LOC raw-SQL store (sqlite.go) and its per-domain sibling
files (brokersecret, gcp_service_account, github_installation, maintenance,
messages, notification, project_sync_state, schedule, scheduled_event) plus
their tests, including the inline schema-migration scaffold. Keep driver.go,
which registers the pure-Go SQLite driver used by Ent's SQLite backend.

Repoint the two non-test consumers to the Ent-backed store:
  - cmd/hub_secret_migrate.go now opens an Ent client + CompositeStore.
  - internal/fixturegen opens via entc and seeds the Ent schema's *sql.DB.

go build ./... green; no remaining production references to the raw store.

* test: compile-migrate downstream suites to Ent store + fix signing-key PK

Replace the removed raw-SQL store in downstream tests with an Ent-backed
newTestStore helper (pkg/hub, pkg/secret) and update cmd/server_test.go and
internal/fixturegen tests. Port the 8 raw-SQL DB() access sites in hub tests
via a new CompositeStore.DB() escape-hatch accessor.

Fix a production bug surfaced by the collapse: hub/server.go signingKeySecretID
generated a non-UUID secret primary key, which the Ent secret store rejects;
it now derives a deterministic UUIDv5. go build ./... green; entadapter and
storetest suites green.

NOTE: hub/secret/fixturegen suites now COMPILE but many tests still fail
because their fixtures seed non-UUID string IDs that the UUID-PK Ent schema
rejects; addressed in follow-up commits (tid() helper).

* test(hub): map non-UUID fixture IDs to UUIDs via tid() helper

Wrap human-readable test identifiers in tid() (deterministic UUIDv5) so the
UUID-PK Ent store accepts them while preserving cross-reference consistency and
ID-equality assertions. Reduces pkg/hub failures from 611 to 79; remaining
failures are behavioral, not ID-format, and are addressed separately.
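
One way such a helper could work, shown with only the standard library. The real tid() is not included in this log, so the namespace choice (the RFC 4122 DNS namespace, picked arbitrarily here) and the exact signature are assumptions.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// testNamespace is the RFC 4122 DNS namespace UUID, used here only as an
// arbitrary fixed namespace; the real helper may use a different one.
var testNamespace = [16]byte{
	0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// tid maps a human-readable test identifier to a deterministic UUIDv5:
// SHA-1 over namespace+name, truncated to 16 bytes, with the version and
// variant bits set as RFC 4122 requires.
func tid(name string) string {
	h := sha1.New()
	h.Write(testNamespace[:])
	h.Write([]byte(name))
	sum := h.Sum(nil)

	var u [16]byte
	copy(u[:], sum[:16])
	u[6] = (u[6] & 0x0f) | 0x50 // version 5
	u[8] = (u[8] & 0x3f) | 0x80 // RFC 4122 variant

	s := hex.EncodeToString(u[:])
	return s[0:8] + "-" + s[8:12] + "-" + s[12:16] + "-" + s[16:20] + "-" + s[20:32]
}

func main() {
	fmt.Println(tid("agent-1"))
	fmt.Println(tid("agent-1") == tid("agent-1")) // same input, same UUID
	fmt.Println(tid("agent-1") == tid("agent-2")) // different input, different UUID
}
```

Determinism is what preserves cross-reference consistency: every fixture that mentions the same readable name resolves to the same primary key.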

# Conflicts:
#	pkg/hub/handlers_project_test.go
#	pkg/hub/httpdispatcher_test.go

* fix(store): seed maintenance ops in Migrate; initStore uses Migrate

Restore raw-SQL parity: CompositeStore.Migrate now runs AutoMigrate and seeds
built-in maintenance operations (the raw store seeded these in its migrations).
initStore and hub test helpers call s.Migrate() so production and tests seed
consistently. Fixes the maintenance-operation hub tests (404 'Operation not
found'). pkg/hub failures 79 -> 71.

* test(hub): satisfy Ent NotEmpty validators in fixtures

Add slugs/broker names to test fixtures that previously relied on the raw
store's lenient (no-validator) inserts: project/agent slugs in the logs test
helper, broker slugs in embedded/profile/authz fixtures, and BrokerName on
envgather ProjectProvider literals. pkg/hub failures 71 -> 57.

* test(secret): map non-UUID fixture IDs to UUIDs via tid()

Apply the tid() helper to pkg/secret fixtures (including a dynamically built
secret ID) so the UUID-PK Ent store accepts them. pkg/secret now fully green.

* test(cmd): map non-UUID fixture IDs to UUIDs via tid(); add broker slug/name

Wrap broker/grove/agent IDs passed to registerGlobalProjectAndBroker and the
dispatcher tests in tid(), and supply RuntimeBroker.slug / ProjectContributor
broker_name to satisfy Ent validators. cmd now green except
TestDeleteStopped_RequiresGroveContext, which requires the 'docker' binary
(absent in this sandbox) and is unrelated to the store migration.

# Conflicts:
#	cmd/server_dispatcher_test.go

* test(hub): wrap remaining latent non-UUID fixture IDs

Catch IDs that surfaced behind earlier failures (stale-agent-*, agent-visible-authz,
agent-profile-hb, env-owner-1). No more UUID-parse errors in pkg/hub; the
remaining ~56 failures are behavioral (URL paths built from old raw IDs,
assertion mismatches), addressed next.

* fix(entadapter): Get-by-id returns ErrNotFound for non-UUID identifiers

Restore raw-SQL store parity: a malformed identifier cannot match any UUID
primary key, so get-by-id lookups now report store.ErrNotFound instead of
store.ErrInvalidInput. This matches the raw store (a lookup with a bad id simply
returned no row) and is what callers depend on — e.g. resolveTemplate passes a
template *name* to GetTemplate and relies on ErrNotFound to fall back to
slug-based resolution. New parseGetID helper applied across all 17 get-by-id
methods. pkg/hub failures 56 -> 40; entadapter/storetest stay green.

* test(hub): fix store-less id wraps and project-route URL paths

- controlchannel_client_test: revert tid() wraps (store-less path-builder test;
  IDs must match the expected literal paths).
- github/envgather: project-scoped route handlers resolve the project by UUID id,
  so build paths with tid(rawID) via fmt.Sprintf instead of the old raw-id
  literal. pkg/hub failures 40 -> 32.

* test(hub): unwrap projectIDFromServiceAccountEmail expectation

The tid() sweep over-wrapped a non-ID expected value in a pure-function test;
restore the literal GCP project id.

* fix(ent): GCPServiceAccount.project_id is a string, not a UUID

The GCP service account project_id holds the GCP *cloud project* identifier
(e.g. 'my-project-123'), a free-form string — not a UUID. The schema declared
it field.UUID, so entadapter CreateGCPServiceAccount/Update did
parseUUID(sa.ProjectID) and rejected real GCP project ids, breaking SA
mint/create with a 400 in production (storetest masked it by passing a UUID).

Change the schema field to field.String, regenerate Ent, and store/read
project_id as a string in external_store.go. Fixes ~7 hub GCP tests; pkg/hub
31 -> 23.

* test(hub): fix GCP SA project-id assertion and project-settings id

Unwrap the over-wrapped 'my-project' expectation now that project_id is a
string, and wrap the dynamic project-settings project ID with tid().

* test(hub): fix bootstrap sync-to-finalize agent paths and storage keys

Build the finalize request path from the agent's tid() UUID and seed mock
storage under WorkspaceStoragePath(projectID, agent.ID) — the handler derives
the workspace key from the agent's real id, not the old raw name. pkg/hub
23 -> 19.

* test(hub): revert tid() over-wraps in store-less events_test

events_test exercises the in-memory ChannelEventPublisher directly; its
ProjectID/IDs are subject-string components, not stored UUIDs. The tid() sweep
wrongly rewrote them so published subjects no longer matched the subscriptions
(timeouts). Restore the literal values. pkg/hub 19 -> 12.

* test(hub): fix maintenance-run path and notifications agentId queries

Use tid() UUIDs in the maintenance run-detail path and the notifications
agentId query params; guard list indexing with require.Len so a mismatch fails
cleanly instead of panicking (panics truncate the package run).

* test(hub): wrap remaining fixture IDs revealed after panic-cascade cleared

Panics ([0] on empty lists) had been truncating the package run, hiding many
failures and starving the tid() sweep. With those guarded, sweep the newly
reached tests: wrap dynamic rune-suffix IDs and the setupProjectWithBroker /
seedCreatedAgentForHarnessTest helper IDs, and convert raw query-param project
IDs to tid(). No UUID-parse errors remain in pkg/hub.

* test(hub): unwrap tid() in scheduler_test (mock store, raw ids)

scheduler_test uses an in-memory mockScheduledEventStore, not the Ent store, so
its ids need no UUIDs; the erroneous tid() wraps broke raw getEvent lookups and
caused a nil-pointer panic that truncated the package run.

* fix(ent): Template.harness may be empty (raw-store parity)

A template imported from a directory that declares no harness type has an empty
harness; the raw-SQL store stored it, but the Ent NotEmpty validator made
BootstrapTemplatesFromDir silently skip such templates. Drop NotEmpty and
regenerate. Removing the [0]-on-empty panics this caused un-truncates the hub
package run (true failure count now visible).

* test(hub): wrap dynamic fixture IDs in wake/workspace/signing-key tests

Wrap tid() around the wake_test, setupWorkspaceProject, and empty-value
signing-key secret IDs now reachable after panic removal. No panics in the hub
package run.

* test(hub): convert raw-id URL path segments to tid()

Build GET/PUT/DELETE paths for agents/projects/brokers/templates/harness-configs
and workspace sync routes from tid(rawID) so the by-id handlers resolve the
entity (raw ids no longer match the UUID PKs). pkg/hub 93 -> 80.

* fix(entadapter)+test(hub): FK error mapping + permissions FK fixtures

mapError now distinguishes foreign-key violations (-> ErrInvalidInput, a bad
reference) from unique-constraint violations (-> ErrAlreadyExists); previously
both surfaced as a misleading 'already exists'/409.

Seed the users/agents that group memberships and policy bindings reference
(the Ent store enforces user/agent FK edges the raw store lacked), wrap
remaining raw fixture/URL ids in tid(), and give the AddAgent fixtures slugs.
All pkg/hub permissions tests pass.

* fix(hub): seed creator users for agent-created agents; cascade-delete subscriptions on hard agent delete

* test(hub): seed broker slug/name in dispatcher and project_cache fixtures (Ent validators)

* test(hub): use tid() in principal/agent URL paths; broker slug in template_bootstrap

* fix(entadapter): cascade-delete agents on project delete (raw-store parity); test(hub): seed FK users, broker_name, deterministic UUIDs

* test(hub): MaxOpenConns=1 for SQLite test store (serialize writes); tid() URLs + FK user seeds in events/stopall

* test(hub): unwrap over-wrapped tid() in unit tests (workspace/logfilter/gcp/web); valid-UUID NotFound cases; tid() scheduled-event URLs

* fix(ent): allow empty display_name (raw-store NOT NULL parity, email fallback); test(hub): seed FK owner users, UUID policy/broker/agent IDs in authz remediation

* feat(migrate): add Migration β tool (Ent-SQLite → Ent-Postgres)

Implements 'scion server migrate --from sqlite://... --to postgres://...'
per postgres-strategy.md §7.3.

- entc.OpenSQLiteReadOnly: opens source with PRAGMA query_only=ON (no WAL
  write), MaxOpenConns=1 so the source is never mutated.
- entc.MigrateData: generic reflection-based, dependency-ordered copy of all
  30 Ent entities (FK-ordered core first), idempotent (skips rows whose PK
  already exists), atomic per entity (txn), chunked CreateBulk, source/dest
  row-count verification after each entity, plus the Group.child_groups M2M
  edge. FK columns are plain fields so edges are preserved via setters.
- cmd/server migrate: DSN parsing (sqlite://, file:, bare path; postgres URL
  or keyword form), --keep-source default / --drop-source cutover, progress
  logging.
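
The DSN forms listed in the last bullet could be told apart roughly as follows. This is a heuristic sketch only: classifyDSN is a hypothetical name, and the keyword-form detection (looking for host= or dbname=) is an assumption, not the parsing the migrate command performs.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// classifyDSN guesses the backend from the DSN forms named in the commit
// message: sqlite://, file:, bare path; postgres URL or keyword form.
func classifyDSN(dsn string) (string, error) {
	dsn = strings.TrimSpace(dsn)
	switch {
	case dsn == "":
		return "", errors.New("empty DSN")
	case strings.HasPrefix(dsn, "sqlite://"), strings.HasPrefix(dsn, "file:"):
		return "sqlite", nil
	case strings.HasPrefix(dsn, "postgres://"), strings.HasPrefix(dsn, "postgresql://"):
		return "postgres", nil
	case strings.Contains(dsn, "host="), strings.Contains(dsn, "dbname="):
		// Keyword form, e.g. "host=db user=scion dbname=hub".
		return "postgres", nil
	default:
		// Anything else is treated as a bare SQLite file path.
		return "sqlite", nil
	}
}

func main() {
	for _, dsn := range []string{
		"sqlite:///var/lib/scion/hub.db",
		"file:hub.db?mode=ro",
		"/var/lib/scion/hub.db",
		"postgres://scion@db:5432/hub",
		"host=db user=scion dbname=hub",
	} {
		kind, err := classifyDSN(dsn)
		fmt.Println(kind, err)
	}
}
```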

Verified end-to-end against live CloudSQL Postgres 16 (integration test +
real CLI run): full copy, idempotent re-run, FK + M2M + value round-trips,
--drop-source removal.

* feat(concurrency): dialect-aware multi-replica primitives for Postgres (P3-3..6)

Add cluster-coordination primitives so N stateless hub processes can share one
Postgres, each degrading to a no-op on single-writer SQLite:

- store.AdvisoryLocker + entadapter TryAdvisoryLock (pg_try_advisory_lock on a
  dedicated conn); Scheduler.RegisterRecurringSingleton gates the heartbeat,
  stalled, purge, schedule-evaluator and github-health sweeps to one
  replica/tick.
- store.ScheduledEventClaimer + ClaimScheduledEvent atomic claim; fireEvent
  claims one-shot events before side effects (dedup across replica startup
  recovery).
- CompositeStore.RunSerializable: SERIALIZABLE + retry on 40001/40P01 (single
  run on SQLite) for future multi-row invariants.
- dbmetrics.StartPoolSampler feeds DB connection-pool gauges to the P0-5
  scaffold; wired into StartBackgroundServices via SetDBMetrics.

Verified existing primitives correct (agent StateVersion CAS, FOR UPDATE sweeps,
notification atomic dispatch). Found and documented the schedule SKIP LOCKED
early-commit gap (lock released before the status transition), closed by the
singleton evaluator. Audit + budget docs in scratchpad.

Tests: locking_test.go (advisory no-op, serializable, claim exactly-once incl.
8-way concurrent), pool_sampler_test.go.
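
The retry behaviour described for RunSerializable can be sketched roughly as follows. This is a minimal illustration, not the actual CompositeStore code: `sqlStateError`, `isRetryable` and `runSerializable` are hypothetical names, and the callback stands in for work that would run inside a SERIALIZABLE transaction.

```go
package main

import (
	"errors"
	"fmt"
)

// sqlStateError is a hypothetical stand-in for a driver error that exposes a
// Postgres SQLSTATE code.
type sqlStateError struct{ code string }

func (e *sqlStateError) Error() string { return "sqlstate " + e.code }

// isRetryable reports whether err is a serialization failure (40001) or a
// detected deadlock (40P01), the two codes named in the commit message.
func isRetryable(err error) bool {
	var se *sqlStateError
	if !errors.As(err, &se) {
		return false
	}
	return se.code == "40001" || se.code == "40P01"
}

// runSerializable calls fn until it succeeds, fails with a non-retryable
// error, or maxAttempts is exhausted. In a real store fn would run inside a
// SERIALIZABLE transaction; here it is only a callback.
func runSerializable(maxAttempts int, fn func(attempt int) error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(attempt); err == nil || !isRetryable(err) {
			return err
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}

func main() {
	err := runSerializable(3, func(attempt int) error {
		if attempt < 3 {
			return &sqlStateError{code: "40001"} // simulated conflict
		}
		return nil
	})
	fmt.Println("result:", err)
}
```

On SQLite the same entry point would simply run the callback once, since a single writer cannot produce these conflicts.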

* feat(hub): widen events to EventPublisher interface + Postgres LISTEN/NOTIFY publisher

P3-7: Decouple call sites from the concrete *ChannelEventPublisher.
- Add Subscribe(patterns...) (<-chan Event, func()) to the EventPublisher
  interface; implement it on noopEventPublisher (nil channel) — *ChannelEventPublisher
  already had it.
- Factor the Publish* methods into a shared eventBuilder (sink func) so every
  backend emits identical subjects/payloads; ChannelEventPublisher embeds it.
- web.go (field + SetEventPublisher), messagebroker.go and notifications.go
  (field + constructor) now take EventPublisher; handlers_messages.go gates SSE
  on "not the no-op publisher" instead of a concrete type assertion.

P3-8: PostgresEventPublisher over pgx LISTEN/NOTIFY (cross-replica delivery).
- Per-grove channels plus a global channel (flat exact-match); event type in the
  JSON envelope. Grove-scoped subjects publish to both the grove channel and the
  global channel; subscriptions group their patterns by resolved channel so an
  event is matched only against patterns that opted into the arriving channel
  (no double delivery).
- 8 KB NOTIFY limit handled by reference-and-refetch via scion_event_payloads
  (TTL-swept so every replica can refetch).
- PublishTx enrolls the NOTIFY in a caller transaction (atomic write+publish;
  rollback => no deliver). Delivery flows exclusively through the listener.
- Listener goroutine reconnects with backoff and re-LISTENs (resubscribe);
  dynamic LISTEN/UNLISTEN applied on a poll (WaitForNotification timeout does
  not invalidate the pgconn connection).
- Emits pkg/observability/dbmetrics signals (published/delivered/dropped,
  payload size, publish->deliver latency, reconnects, pool stats).
- cmd: newEventPublisher selects the backend by database driver (postgres =>
  PostgresEventPublisher, else ChannelEventPublisher) with safe fallback.

Tests: routing/registry/payload-offload/metrics/transactional-executor unit
tests run without a DB; cross-replica delivery, oversized round-trip,
transactional rollback, and reconnect+resubscribe are gated behind
SCION_TEST_POSTGRES_DSN. go build ./... green; full pkg/hub suite green.

Note: server.go's equivalent type-assertion cleanup is left in the working tree
(co-edited with concurrent P0-5/scheduler work) and is functionally optional —
HEAD server.go already compiles against the widened interface.
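
The reference-and-refetch handling of the NOTIFY size limit might look roughly like this. It is a sketch under stated assumptions: the `envelope` layout, the `payloadTable` map (standing in for the scion_event_payloads table) and both function names are invented for illustration, and a 8000-byte ceiling is assumed for the encoded payload.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// notifyLimit is the assumed ceiling: the encoded payload must stay shorter
// than this many bytes to be sent inline.
const notifyLimit = 8000

// envelope is a hypothetical wire format: either Data is inline or Ref
// points at a stored payload.
type envelope struct {
	Subject string `json:"subject"`
	Data    string `json:"data,omitempty"`
	Ref     string `json:"ref,omitempty"`
}

// payloadTable stands in for the scion_event_payloads table.
type payloadTable map[string]string

// encodeNotify returns the string to pass to NOTIFY. Oversized payloads are
// stored under ref and only the reference travels in the notification.
func encodeNotify(table payloadTable, ref, subject, data string) (string, error) {
	inline, err := json.Marshal(envelope{Subject: subject, Data: data})
	if err != nil {
		return "", err
	}
	if len(inline) < notifyLimit {
		return string(inline), nil
	}
	table[ref] = data
	byRef, err := json.Marshal(envelope{Subject: subject, Ref: ref})
	if err != nil {
		return "", err
	}
	return string(byRef), nil
}

// decodeNotify is the listener side: it refetches the payload when the
// notification carries only a reference.
func decodeNotify(table payloadTable, raw string) (string, string, error) {
	var env envelope
	if err := json.Unmarshal([]byte(raw), &env); err != nil {
		return "", "", err
	}
	if env.Ref == "" {
		return env.Subject, env.Data, nil
	}
	stored, ok := table[env.Ref]
	if !ok {
		return "", "", fmt.Errorf("payload %q is missing (possibly TTL-swept)", env.Ref)
	}
	return env.Subject, stored, nil
}

func main() {
	table := payloadTable{}
	big := strings.Repeat("x", 20000)
	raw, _ := encodeNotify(table, "ref-1", "project.created", big)
	_, data, err := decodeNotify(table, raw)
	fmt.Println(len(raw) < notifyLimit, len(data) == len(big), err)
}
```

The size check is applied to the encoded envelope rather than the raw data, because JSON escaping can grow the payload.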

* test(store): parameterize store suites over {sqlite, postgres} (P3-2)

Add pkg/store/enttest: a backend-selecting Ent client factory for the store
test suites. Default is in-memory SQLite; built with -tags integration and
SCION_TEST_POSTGRES_URL set, it provisions a per-package ephemeral Postgres
database (created/dropped via TestMain) and isolates each test in its own
schema (search_path) so tests never observe each other's rows. Falls back to
SQLite when the env var is unset.

Route all entadapter and storetest helpers through enttest.NewClient so the
same CRUD-parity oracle runs unchanged against either backend.

Fix two real Postgres bugs surfaced by the new path:
- entadapter/dialect.go ancestryContains: emit the bind parameter via
  Builder.Arg ($n on Postgres) instead of a literal '?' through ExprP, which
  was not rebound and produced a syntax error; and use jsonb_array_elements_text
  (the column is jsonb on Postgres, not json).
- schedule_store_test ClaimPath: make the concurrent-claim assertion
  backend-aware. SQLite serializes (MaxOpenConns=1, no SKIP LOCKED) so every
  caller sees both due rows; Postgres uses FOR UPDATE SKIP LOCKED so concurrent
  callers may observe a disjoint subset (0..2 rows), so the test asserts only
  that no caller errors or sees more than 2.

Verified: full SQLite suite green; storetest CRUD parity green on CloudSQL
Postgres; entadapter green on Postgres (schedule ClaimPath fix confirmed).

* fix(hub): start dispatcher/broker for any subscription-capable EventPublisher

Wave C integration: newEventPublisher can now return a PostgresEventPublisher
(LISTEN/NOTIFY) in addition to ChannelEventPublisher. The dispatcher/broker
startup previously hard-asserted *ChannelEventPublisher, which silently skipped
starting them under Postgres. Gate on (not noop and not nil) instead, matching
the existing pattern in handlers_messages.go.

* fix(hub): harden Postgres event publish + verify wiring; lower PG pool default

Task 1 — LISTEN/NOTIFY publish path:
- Add TestPostgresIntegration_HandlerCreateProjectEmitsNotify: drives the real
  POST /api/v1/projects handler with a PostgresEventPublisher and asserts a
  pg_notify lands on scion_ev_global via an independent raw LISTEN — the exact
  capability the multi-replica live test probed. Verified PASSING against live
  CloudSQL, proving the handler -> s.events -> pg_notify wiring is correct end
  to end (the four pre-existing SCION_TEST_POSTGRES_DSN integration tests also
  pass). The multi-hub 'no NOTIFY' symptom was not reproducible against the
  current tree.
- Bound the autocommit publish (Publish* methods) with publishTimeout (5s).
  These run synchronously on the caller's (request handler) goroutine and
  acquire from the event pool; on a connection-starved instance that acquire
  could block indefinitely, stalling CRUD and silently never emitting NOTIFY.
  The timeout converts that into a logged error + dropped event (publishing is
  fire-and-forget). PublishTx (transactional path) is unaffected.

Task 2 — connection budget:
- Lower the default Postgres MaxOpenConns 20 -> 10 so multiple replicas fit a
  modest connection budget (see CONNECTION-BUDGET.md). CloudSQL instance
  scion-postgres-test resized db-f1-micro -> db-g1-small and max_connections
  set to 100 (out of band).
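
The bounded autocommit publish from Task 1 can be illustrated with a small sketch. The names here (`publishBounded`, `starvedPublish`, the demo helpers) are hypothetical, and the starved pool is simulated by a function that blocks until its context ends; the commit uses a 5s publishTimeout, while the demo uses a much shorter one so it finishes quickly.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// publishFunc is a hypothetical stand-in for "acquire a pooled connection
// and issue the NOTIFY".
type publishFunc func(ctx context.Context) error

// publishBounded runs publish under a deadline. It reports true when the
// event was sent and false when it was dropped; publishing is treated as
// fire-and-forget, so the error is only logged.
func publishBounded(parent context.Context, timeout time.Duration, publish publishFunc) bool {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	if err := publish(ctx); err != nil {
		fmt.Println("event dropped:", err)
		return false
	}
	return true
}

// starvedPublish simulates a pool with no free connections: the acquire
// blocks until the context is done.
func starvedPublish(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// demoStarved and demoHealthy exercise both outcomes with a short timeout.
func demoStarved() bool {
	return publishBounded(context.Background(), 50*time.Millisecond, starvedPublish)
}

func demoHealthy() bool {
	return publishBounded(context.Background(), 50*time.Millisecond,
		func(context.Context) error { return nil })
}

func main() {
	fmt.Println("starved pool delivered:", demoStarved())
	fmt.Println("healthy pool delivered:", demoHealthy())
}
```

Without the deadline, the starved case would block the calling request handler indefinitely.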

* test(store): add Postgres stress/integration suite (contention, isolation, pool, NOTIFY, migration, schema, multi-process)

Add pkg/store/integrationtest/: a Postgres-only suite that exercises behavior
the SQLite parity suites cannot reach. Gated by //go:build integration and
SCION_TEST_POSTGRES_URL; skips cleanly otherwise.

Coverage:
- Contention: state_version CAS race (no lost updates, >=N-1 retries, final
  version==1+N), SKIP LOCKED / conditional-UPDATE event claim (single winner +
  disjoint drain), unique-key races (project slug, user email, agent slug).
- Isolation: SERIALIZABLE conflict + RunSerializable retry recovery, REPEATABLE
  READ no-phantom snapshot, READ COMMITTED dirty-read prevention.
- Pool: exhaustion + queued recovery, saturated pool honoring context deadline,
  long txn not starving short queries, healing after pg_terminate_backend.
- LISTEN/NOTIFY: ordered burst no-drop, 8000B payload limit, listener
  reconnect/resume, cross-channel isolation.
- Migration: 1000+ row counts + bounded-memory listing, idempotent re-migration.
- Schema: NULL semantics, unicode/emoji, nested JSON + special chars, large-text
  non-truncation, TIMESTAMPTZ microsecond precision.
- Multi-process: forks the test binary for cross-process advisory-lock
  exclusivity and cross-process NOTIFY delivery.

Configurable concurrency via SCION_TEST_CONCURRENCY (default 10).

Extend pkg/store/enttest with Active() and NewSchemaURL() so tests can open
custom-pool clients and share a DSN with forked child processes; non-integration
stubs keep the package API stable.

* fix(db): recycle stale conns + keepalives; skip singleton tick on lock error

Stale-connection pool stalls (CloudSQL drops idle conns after ~10m):
- Add ConnMaxIdleTime to DatabaseConfig/PoolConfig (default 5m pg, 0 sqlite)
  and apply SetConnMaxIdleTime on the database/sql pool.
- OpenPostgres now parses the DSN with pgx and opens via stdlib.OpenDB with
  TCP keepalive GUCs (idle 60s / interval 15s / count 4) and a 10s connect
  timeout, so a silently-dropped peer is detected instead of the first query
  after idle hanging on a dead socket.
- pgx event pool (events_postgres.go): set keepalives + connect timeout on
  both the pool's ConnConfig and the dedicated listener connection, plus
  MaxConnIdleTime 5m / MaxConnLifetime 30m.

Advisory-lock leader election (scheduler.go):
- A lock-acquisition error no longer falls open to running the handler
  unguarded (which would duplicate singleton work across replicas); the tick
  is skipped and retried next interval. Added regression tests.

Test harness (enttest/integrationtest):
- Accept libpq keyword/value DSNs (not just URL form) when deriving the
  ephemeral db/schema/params; add WithConnParam helper.
- Fix migration idempotency test's per-pass row-count expectation.
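
The fail-closed rule for the singleton tick could be sketched like this. It assumes a lock function returning (acquired, release, err); `tryLockFunc`, `runSingletonTick` and the three sample lock functions are illustrative names, not the scheduler's real API.

```go
package main

import (
	"errors"
	"fmt"
)

// tryLockFunc is a hypothetical shape for an advisory-lock attempt.
type tryLockFunc func() (acquired bool, release func(), err error)

// runSingletonTick runs handler only when the lock was definitely acquired.
// A lock error skips the tick instead of running the handler unguarded.
func runSingletonTick(tryLock tryLockFunc, handler func()) bool {
	acquired, release, err := tryLock()
	if err != nil {
		fmt.Println("skipping tick, lock error:", err)
		return false
	}
	if !acquired {
		return false // another replica holds the lock for this tick
	}
	defer release()
	handler()
	return true
}

// Sample lock outcomes used by the demo.
func lockFree() (bool, func(), error)          { return true, func() {}, nil }
func lockHeldElsewhere() (bool, func(), error) { return false, nil, nil }
func lockBroken() (bool, func(), error) {
	return false, nil, errors.New("context deadline exceeded")
}

func main() {
	work := func() { fmt.Println("sweep ran") }
	fmt.Println(runSingletonTick(lockFree, work))
	fmt.Println(runSingletonTick(lockHeldElsewhere, work))
	fmt.Println(runSingletonTick(lockBroken, work))
}
```

Skipping on error trades a delayed sweep for the guarantee that two replicas never run the same singleton work in one tick.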

* fix(store): bound advisory-lock conn checkout + unlock with short timeout

TryAdvisoryLock checked a connection out of the pool on the full 55s
scheduler-handler context (acquire) and ran the unlock on an unbounded
context.Background() (release). On a pool that could not promptly serve a
healthy connection, db.Conn() blocked for the entire 55s before failing
with 'context deadline exceeded' on every tick; with several singleton
handlers firing each 60s tick, those long-blocked goroutines and their
pending pool connection requests piled up across ticks and kept the pool
jammed (checked out client-side, idle server-side).

The unbounded unlock was a second leak vector: if the held connection
died mid critical-section, ExecContext could hang forever, so conn.Close()
never ran and the connection leaked out of the pool permanently.

Bind both the acquire (db.Conn + pg_try_advisory_lock) and the release
(pg_advisory_unlock) to a 5s timeout so a bad tick fails fast and retries
next tick instead of parking a goroutine for ~55s, and so a dead
connection can never block release from freeing the conn. Lock semantics
are unchanged: cancelling the acquire context tears down only that
context, not the checked-out session that holds the lock.

* feat(migrate): in-process migration α (legacy raw-SQL hub.db → Ent)

Upgrade a legacy raw-SQL Hub database (the ~53-migration, 30-table schema
from the removed pkg/store/sqlite store) to the consolidated Ent-backed
SQLite schema, in-process on first boot, behind an automatic backup.

pkg/ent/entc/migrate_alpha.go:
- IsLegacyRawSQLSchema: detect via the schema_migrations sentinel + the
  legacy-only agents.agent_id column (no-op for an Ent/empty/absent file).
- MigrateAlphaSQLite: backup (checkpoint WAL + copy to hub.db.bak.<ts>),
  AutoMigrate a fresh Ent schema, ATTACH the legacy file, copy every table
  with INSERT…SELECT (foreign_keys OFF), verify per-table row counts, then
  atomically swap the migrated file into place.
- Data-driven column mapping (created_at→created, updated_at→updated,
  agents.agent_id→slug, policies→access_policies); bespoke SQL for the
  group_members/policy_bindings polymorphic splits and surrogate ids;
  groups.parent_id→group_child_groups edge.
- Deterministic UUIDv5 remap for legacy non-UUID primary keys (internal
  signing-key secrets; plugin runtime-broker ids) with consistent rewrite
  of every foreign-key reference via a TEMP _id_remap table.
- Tolerates missing legacy tables (older schema versions).

cmd/server_foreground.go: detect + migrate in initStore's sqlite path,
with a --no-auto-migrate operator opt-out (cmd/server.go).

Validated end-to-end against four production hub.db files (scion-integration,
-integration2, -demo, -gteam): exact row-count parity (up to ~19k rows),
every entity reads back through the live Ent store, idempotent re-runs, and
broker FK references resolve post-remap. Pre-existing dangling agent
created_by/owner_id refs are faithfully preserved (loader runs FK-off).

* fix(config): apply real Postgres pool size (leaked SQLite default of 1 starved the pool)

The struct-level default for Database.MaxOpenConns/MaxIdleConns is 1 — the
value SQLite REQUIRES to serialize writes. applyDatabasePoolDefaults only
bumped postgres to a real pool when the value was <= 0, but a postgres
deployment configured via env/driver override inherits the embedded default
of 1, so the guard never fired and the Ent pool ran with a SINGLE connection.

Effect in production (both integration hubs): every singleton scheduler tick
checks out the lone pool connection to hold its advisory lock, then blocks
waiting for a second connection to do its work — a self-deadlock that resolves
only at the 55s handler context deadline. All API requests serialize behind
the one connection, so GET /api/v1/* served in ~55s across the board.

Note env overrides could not paper over this: envKeyToConfigKey splits on
every underscore, so SCION_SERVER_DATABASE_MAX_OPEN_CONNS maps to
database.max.open.conns, not database.max_open_conns — silently ignored.

Treat the leaked SQLite default (<= 1) as 'unset' for postgres so the pool
default (10) applies; explicit sizing of 2+ is still respected. SQLite remains
pinned to 1. Adds regression tests for all three cases.
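
A minimal sketch of the sizing rule and of the env-key pitfall, assuming the behaviour exactly as described in this message. `effectiveMaxOpenConns` and `naiveEnvKeyToConfigKey` are hypothetical names, and the SCION_SERVER_ prefix handling is inferred from the example key above.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultPostgresMaxOpenConns is the pool default named in the commit.
const defaultPostgresMaxOpenConns = 10

// effectiveMaxOpenConns treats a value <= 1 as "unset" for postgres, keeps
// explicit sizes of 2 or more, and pins sqlite to a single connection.
func effectiveMaxOpenConns(driver string, configured int) int {
	switch driver {
	case "sqlite":
		return 1
	case "postgres":
		if configured <= 1 {
			return defaultPostgresMaxOpenConns
		}
		return configured
	default:
		return configured
	}
}

// naiveEnvKeyToConfigKey mimics a mapping that turns every underscore into
// a dot, which is why the env override never reached max_open_conns.
func naiveEnvKeyToConfigKey(env string) string {
	key := strings.TrimPrefix(env, "SCION_SERVER_")
	return strings.ToLower(strings.ReplaceAll(key, "_", "."))
}

func main() {
	fmt.Println(effectiveMaxOpenConns("postgres", 1))
	fmt.Println(effectiveMaxOpenConns("postgres", 4))
	fmt.Println(effectiveMaxOpenConns("sqlite", 20))
	fmt.Println(naiveEnvKeyToConfigKey("SCION_SERVER_DATABASE_MAX_OPEN_CONNS"))
}
```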

* docs: add multi-node broker dispatch and NFS workspace designs

- broker-dispatch.md: DB-as-state-machine + LISTEN/NOTIFY pattern for
  cross-replica broker command routing and agent lifecycle dispatch
- nfs-workspace.md: NFS workspace coordination for VM (host bind-mount)
  and K8s/Cloud Run (per-pod mount) runtime models

* fix(store): address PR GoogleCloudPlatform#304 review — context leaks and DSN parsing

Thread the server's cancellable context into initStore and
initWebServer instead of using context.Background(), so that:

- DB migrations and the health-check ping cancel on Ctrl+C during
  startup (medium-priority review comment).
- The Postgres LISTEN/NOTIFY event publisher goroutine shuts down
  cleanly when the server exits, preventing connection leaks
  (high-priority review comment).

Also fix parseSQLiteSourceDSN to handle the file:// prefix before
the file: prefix, so that file:///var/lib/hub.db correctly resolves
to /var/lib/hub.db instead of ///var/lib/hub.db. Add test cases for
file:// and file:/// DSN forms.
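
The prefix-ordering fix can be illustrated as follows. This is only a sketch: `sqlitePathFromDSN` is a hypothetical stand-in for parseSQLiteSourceDSN, whose real code is not shown here, and query parameters are ignored.

```go
package main

import (
	"fmt"
	"strings"
)

// sqlitePathFromDSN strips a known scheme prefix and returns the file path.
// "file://" is listed before "file:" so the longer prefix wins; checking
// "file:" first would leave the leading slashes of "file:///..." in place.
func sqlitePathFromDSN(dsn string) string {
	for _, prefix := range []string{"sqlite://", "file://", "file:"} {
		if strings.HasPrefix(dsn, prefix) {
			return strings.TrimPrefix(dsn, prefix)
		}
	}
	return dsn // bare path
}

func main() {
	for _, dsn := range []string{
		"file:///var/lib/hub.db",
		"file:hub.db",
		"sqlite:///var/lib/hub.db",
		"/var/lib/hub.db",
	} {
		fmt.Printf("%-26s -> %s\n", dsn, sqlitePathFromDSN(dsn))
	}
}
```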

* docs: add project log for PR GoogleCloudPlatform#304 review fixes

* fix(store): context leak in legacy migration & double file: prefix

1. Thread the server's cancellable context through
   maybeMigrateLegacySQLite → MigrateAlphaSQLite so that Ctrl+C
   during first-boot legacy migration aborts it instead of running
   with an uncancellable context.Background().

2. Guard against a double "file:" prefix when constructing the
   SQLite DSN. If the operator's database.url already starts with
   "file:", we no longer blindly prepend another "file:" prefix.
   Also correctly appends cache=shared with "&" when the DSN
   already contains query parameters.
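
The DSN construction in item 2 might be sketched like this. `buildSQLiteDSN` is an illustrative name; only the prefix guard and the "?" versus "&" choice are taken from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// buildSQLiteDSN prepends "file:" only when it is missing and appends
// cache=shared with the separator that matches the existing DSN.
func buildSQLiteDSN(databaseURL string) string {
	dsn := databaseURL
	if !strings.HasPrefix(dsn, "file:") {
		dsn = "file:" + dsn
	}
	sep := "?"
	if strings.Contains(dsn, "?") {
		sep = "&" // the DSN already carries query parameters
	}
	return dsn + sep + "cache=shared"
}

func main() {
	fmt.Println(buildSQLiteDSN("hub.db"))
	fmt.Println(buildSQLiteDSN("file:hub.db"))
	fmt.Println(buildSQLiteDSN("file:hub.db?mode=rwc"))
}
```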

* fix(store): rename ProjectTypeHubNative → ProjectTypeHubManaged (rebase fixup)

Upstream renamed hub-native to hub-managed while the PR was in
flight. Update the two remaining references that the rebase
conflict resolution missed.

---------

Co-authored-by: Scion <agent@scion.dev>
…t token

TestClient_StartTokenRefresh exercised RefreshToken -> WriteTokenFile
without isolating the token home, so running the suite inside a live
agent container overwrote the real ~/.scion/scion-token with the test
stub "refreshed-token". Every subsequent Hub call then 401'd with
"compact JWS format must have three parts" / "unrecognized token format".

- Add SetTokenHome(t.TempDir()) to the test, matching its siblings.
- Guard WriteTokenFile: panic under `go test` unless SetTokenHome was
  called, so a forgotten isolation can never corrupt live state again.
  Reads remain unguarded (harmless; return empty when absent).
…ecycle + message routing (GoogleCloudPlatform#305)

* Add canonical engineering glossary (GLOSSARY.md) (#102)

* Add engineering glossary (GLOSSARY.md) with canonical terms and cleanup tracker

Add a root-level GLOSSARY.md capturing canonical Scion terminology in the
ubiquitous-language format (preferred term + synonyms to avoid), grouped by
domain cluster, plus an Exceptions & Future Cleanup section tracking known
naming-convergence work. Link it from agents.md as the canonical engineering
glossary.

* Revise glossary: broker reframe, Event Bus, Hub-managed, and term refinements

Refine entries from review: redefine Message Broker as the pluggable
messaging-integration system (add Broker plugin, Built-in broker); add Event
Bus for the NATS real-time/event capability; collapse hub-native/Hub Workspace
into Hub-managed project/workspace; tighten Template (harness-agnostic, optional
default harness-config), Skill (template-only, Agent Skills link), Profile
(named runtime-broker settings bundle), Harness/Harness-config; reframe Hub as
the control plane in both modes; add Group and Message Group. Expand Exceptions
& Future Cleanup to nine tracked items.

* Glossary: restructure headings, add cross-refs, modes table, and new terms

- Retitle to "Scion Glossary"; drop the "Language" wrapper and promote
  the thematic categories to top-level sections
- Add an Operations section (Attach, Dispatch) and move Profile next to
  Runtime Broker
- Add a Local/Workstation/Hosted comparison table and "See also"
  cross-refs across the main confusable term clusters
- Reframe the intro around the three-way broker collision (incl. Event
  Bus) and defer to the disambiguation rule; sentence-case "Shared
  directory"
- Add canonical entries for Secret, Notification, and Schedule
- Add a "Potential Future Additions" section cataloguing candidate terms

* Glossary: remove Exceptions & Future Cleanup tracker

The cleanup items are now tracked by dedicated agents that open GitHub
issues and implementation PRs, so the staged tracker no longer lives in
the glossary. Reword the two intro/disambiguation references that pointed
at the removed section to point at GitHub issues instead.

---------

Co-authored-by: Preston Holmes <ptone@google.com>

* P0-1: switch Postgres driver from lib/pq to pgx/v5 stdlib

- Add github.com/jackc/pgx/v5/stdlib (registers as "pgx")
- driver_postgres.go: blank import pgx stdlib instead of lib/pq
- OpenPostgres: open via sql.Open("pgx", dsn) + entsql.OpenDB
- Introduce PoolConfig (applied to *sql.DB); thread through
  OpenSQLite/OpenPostgres and update all callers
- go mod tidy drops lib/pq

* P0-2: add connection pool config to DatabaseConfig

- DatabaseConfig gains MaxOpenConns / MaxIdleConns / ConnMaxLifetime
  plus ConnMaxLifetimeDuration() helper
- DefaultGlobalConfig sets sqlite pool defaults (MaxOpenConns=1,
  load-bearing for write serialization)
- applyDatabasePoolDefaults fills postgres defaults (20/5/30m) and
  forces sqlite MaxOpenConns=1; called in both load paths
- Mirror fields in V1DatabaseConfig + both conversion directions
- Wire pool settings into entc.OpenSQLite in initStore

* P0-3/P0-4: CRUD-parity test harness + spec-driven fixture generator

P0-3: pkg/store/storetest/ — backend-agnostic, table-driven CRUD oracle.
A Factory(t) -> store.Store is injected; generic Domain[T] descriptors drive
Create/Read/Update/Delete (+optional soft-delete)/List-paginate/List-filter.
Ships group + policy domains and runs green against today's CompositeStore
(SQLite base + Ent DB). Ready to accept a postgresFactory for P3-2.

P0-4: internal/fixturegen/ — Go-defined spec seeding >=1 row per table across
all 30 domain tables, with edge cases (NULL optionals, max-length strings,
nested/unicode JSON, soft-deleted agent, BLOB). Deterministic. 'go run
./internal/fixturegen' emits testdata/hub-v46-fixture.db, prints a 30-table
coverage report, and caches the blob to the scratchpad mount. CI gate fails if
any table has zero rows.

* feat(ent): add 23 new Ent schemas for full table parity (P1-2 + P1-3)

* P2: port notification + gcp/github/token domains to Ent entadapter

Add Ent-backed implementations of the notification, GCP service account,
GitHub App installation, and user access token store sub-interfaces:

- notification_store.go: NotificationStore (subscriptions, notifications,
  templates). Dispatch uses an atomic conditional update as the multi-replica
  claim primitive, and an optional NotificationPublisher hook is designed in
  for the LISTEN/NOTIFY fan-out of created/dispatched events.
- external_store.go: GCPServiceAccountStore + GitHubInstallationStore +
  UserAccessTokenStore. GitHub create is idempotent (INSERT OR IGNORE
  semantics), repositories/scopes are JSON, default_scopes is CSV, and tokens
  support key-hash lookup. Legacy api_keys is intentionally not surfaced.
- storetest: add GCPServiceAccount, SubscriptionTemplate, and
  NotificationSubscription CRUD-parity domains.

Does not modify composite.go.

* P2: port schedule, maintenance, message domains to Ent entadapter

- schedule_store.go: ScheduleStore + ScheduledEventStore sub-interfaces with
  dialect-aware SELECT FOR UPDATE SKIP LOCKED claim helper for the
  ListDueSchedules / ListPendingScheduledEvents job-claim paths (plain SELECT
  on SQLite, SKIP LOCKED on Postgres).
- maintenance_store.go: run-state RMW, AbortRunningMaintenanceOps, Go-side
  seed (uuid.New) replacing SQLite randomblob() UUID seeds.
- message_store.go: CRUD, read flags, PurgeOldMessages, design-in
  PublishUserMessage hook for Postgres LISTEN/NOTIFY.
- pkg/ent/client_driver.go: hand-written Client.Driver() accessor for
  dialect detection + raw locking queries.

* feat(entadapter): port user + allowlist/invite domains to Ent (P2)

Implements the Ent-backed store adapters for the user and
allowlist/invite domains, plus their CRUD-parity oracle descriptors.

pkg/store/entadapter/user_store.go (store.UserStore):
- CreateUser/GetUser/GetUserByEmail/UpdateUser/UpdateUserLastSeen/
  DeleteUser/ListUsers.
- Case-insensitive email: emails are normalized to lower case on write
  (so the plain unique index enforces case-insensitive uniqueness,
  equivalent to the legacy UNIQUE COLLATE NOCASE) and matched with
  EmailEqualFold (lower(email)=lower($1)) on read. ent codegen +
  AutoMigrate cannot emit a real lower(email) functional index across
  both SQLite (tests) and Postgres, so the invariant is enforced at the
  port layer.
- Offset-based pagination matching the legacy SQLite store.

pkg/store/entadapter/allowlist_store.go (store.AllowListStore +
store.InviteCodeStore):
- Full allow-list + invite-code CRUD.
- BulkAddAllowListEntries uses CreateBulk +
  OnConflictColumns(email).Ignore() for race-safe INSERT-OR-IGNORE;
  added/skipped counts mirror the legacy per-row semantics (existing +
  within-batch dups skipped).
- IncrementInviteUseCount is a single atomic conditional UPDATE
  (revoked=false AND not expired AND (max_uses=0 OR use_count<max_uses)),
  which is race-free on both backends without SELECT...FOR UPDATE. The
  sql/lock feature is enabled and ForUpdate is available for genuine
  multi-statement RMW paths.
- ListAllowListEntriesWithInvites batch-joins invite codes (invite_id is
  a plain column, not an Ent edge).
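
The conditional increment described for IncrementInviteUseCount can be modelled in memory to show why a single guarded update is race-free. This is a sketch, not the Ent adapter: the `invite` and `inviteTable` types and `raceForInvite` are hypothetical, and a mutex plays the role of the database's atomic UPDATE.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type invite struct {
	revoked   bool
	expiresAt time.Time
	maxUses   int // 0 means unlimited
	useCount  int
}

type inviteTable struct {
	mu   sync.Mutex
	rows map[string]*invite
}

// incrementUseCount mirrors one conditional UPDATE: the check and the
// increment happen as a unit, and the result says whether a row changed.
func (t *inviteTable) incrementUseCount(code string, now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	inv, ok := t.rows[code]
	if !ok || inv.revoked || !now.Before(inv.expiresAt) {
		return false
	}
	if inv.maxUses != 0 && inv.useCount >= inv.maxUses {
		return false
	}
	inv.useCount++
	return true
}

// raceForInvite lets `attempts` goroutines redeem the same code at once and
// returns how many succeeded.
func raceForInvite(maxUses, attempts int) int {
	now := time.Now()
	t := &inviteTable{rows: map[string]*invite{
		"welcome": {expiresAt: now.Add(time.Hour), maxUses: maxUses},
	}}
	var wg sync.WaitGroup
	var mu sync.Mutex
	wins := 0
	for i := 0; i < attempts; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if t.incrementUseCount("welcome", now) {
				mu.Lock()
				wins++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return wins
}

func main() {
	fmt.Println("limited to 3:", raceForInvite(3, 20))
	fmt.Println("unlimited:   ", raceForInvite(0, 20))
}
```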

Schema:
- pkg/ent/schema/user.go: add nillable last_seen field (+ index) needed
  by UpdateUserLastSeen / lastSeen sort; document the case-insensitive
  email strategy.
- pkg/ent/generate.go: enable --feature sql/upsert,sql/lock (required for
  OnConflict and ForUpdate).

Tests (all passing):
- pkg/store/storetest/domains_user.go: UserDomain, AllowListDomain,
  InviteCodeDomain oracle descriptors (kept in a separate file to avoid
  contending on domains.go).
- entadapter oracle test runs the shared CRUD-parity suite directly
  against the new adapters; behavior tests cover case-insensitivity,
  bulk idempotency, conditional increment, stats, and the invite join.

NOTE: Generated Ent code under pkg/ent/** is intentionally NOT included.
This is a shared worktree where sibling port agents concurrently modify
schemas and the same feature flags; the generated code must be
regenerated at wave integration via:
    go generate ./pkg/ent/...
Verified locally that regeneration + full build + tests pass.

Per P2 scope: composite.go wiring and ensureEntUser shadow removal are
deferred to P2-collapse.

* P2: port secret/env_var + template/harness_config domains to Ent

Add Ent-backed store implementations for the secret/env and
template/harness domains, mirroring the legacy SQLite semantics:

- entadapter/secret_store.go: SecretStore implementing store.SecretStore
  + store.EnvVarStore. Polymorphic (scope, scope_id) addressing, COALESCE
  target->key projection, version bump on update, get-then-update upsert,
  and transitive ListProgenySecrets via a created_by IN-list over the
  ancestor set (user scope + allow_progeny only; encrypted value withheld).
- entadapter/template_store.go: TemplateStore implementing
  store.TemplateStore + store.HarnessConfigStore. base_template hierarchy,
  scope/project_id backwards-compat lookups, content_hash, JSON config/files
  columns, DeleteByScope. Subscription templates are owned by NotificationStore.
- Direct Ent unit tests incl. a progeny-inheritance parity test.
- storetest: Template/HarnessConfig/Secret/EnvVar domain descriptors wired
  into RunStoreSuite for cross-backend CRUD parity.

* P2: port project/broker + brokersecret domains to Ent

Port the project/broker domain (projects, runtime_brokers, project_contributors,
project_sync_state) and the broker-auth domain (broker_secrets,
broker_join_tokens) from raw SQL to Ent adapters.

- pkg/store/entadapter/project_store.go: implements ProjectStore,
  RuntimeBrokerStore, ProjectProviderStore and ProjectSyncStateStore.
  * provider + sync-state upserts use Ent OnConflict().UpdateNewValues()
    (sql/upsert) keyed on the (project_id, broker_id) unique index.
  * runtime broker heartbeat/update use an optimistic version-CAS loop on a
    new internal lock_version token, serializing concurrent writers portably
    across SQLite (tests) and Postgres without SELECT ... FOR UPDATE.
  * slug lookups support case-insensitive matching (EqualFold).
  * project computed fields (AgentCount, ActiveBrokerCount, ProjectType) are
    derived via Ent queries, matching the legacy SQLite store.
- pkg/store/entadapter/brokersecret_store.go: implements BrokerSecretStore
  (per-broker HMAC secrets + short-lived join tokens, expiry cleanup).
- Project Ent schema: add operational fields for full parity
  (default_runtime_broker_id, shared_dirs, github_*, git_identity).
- RuntimeBroker Ent schema: relax vestigial type column to Optional, add
  internal lock_version concurrency token.
- Regenerate Ent with sql/upsert,sql/lock features.
- storetest: add Project, RuntimeBroker, BrokerSecret and BrokerJoinToken
  CRUD-parity domains.
- Unit tests for both adapters.

Per the integration plan, composite.go wiring and ensureEntProject shadow
removal are deferred to P2-collapse.
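
The optimistic version-CAS loop used for broker heartbeats can be sketched with an in-memory row. All names below are illustrative; `updateIfVersion` stands in for an UPDATE guarded by `WHERE lock_version = <expected>`.

```go
package main

import (
	"fmt"
	"sync"
)

type brokerRow struct {
	lockVersion int
	heartbeats  int
}

type brokerTable struct {
	mu  sync.Mutex
	row brokerRow
}

func (t *brokerTable) read() brokerRow {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.row
}

// updateIfVersion writes next only if the stored version still equals
// expected, and bumps the version on success.
func (t *brokerTable) updateIfVersion(expected int, next brokerRow) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.row.lockVersion != expected {
		return false
	}
	next.lockVersion = expected + 1
	t.row = next
	return true
}

// updateWithCAS re-reads and retries until its write wins. It returns the
// number of retries that were needed.
func updateWithCAS(t *brokerTable, mutate func(*brokerRow)) int {
	retries := 0
	for {
		next := t.read()
		expected := next.lockVersion
		mutate(&next)
		if t.updateIfVersion(expected, next) {
			return retries
		}
		retries++
	}
}

// concurrentHeartbeats applies n concurrent increments and returns the row.
func concurrentHeartbeats(n int) brokerRow {
	t := &brokerTable{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			updateWithCAS(t, func(r *brokerRow) { r.heartbeats++ })
		}()
	}
	wg.Wait()
	return t.read()
}

func main() {
	final := concurrentHeartbeats(16)
	fmt.Println(final.heartbeats, final.lockVersion)
}
```

Because a losing writer re-reads before retrying, no increment is lost even though no row lock is held between the read and the write.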

* P2: port agent domain to Ent entadapter (XL)

* chore(ent): regenerate Ent code for all 30 entity schemas

Regenerated with --feature sql/upsert,sql/lock to support
OnConflict upserts and ForUpdate/SKIP LOCKED job claims.

* P2-collapse: collapse dual-DB into single Ent store

Wire all Ent-backed sub-stores into CompositeStore via embedding, removing
the raw-SQL base store and the User/Agent/Project shadow-sync machinery
(ensureEntUser/ensureEntAgent/ensureEntProject). CompositeStore now serves
every domain from a single Ent client and implements Close/Ping/Migrate
directly.

Collapse initStore() to open one Ent SQLite DB (no _ent shadow DSN, no
MigrateGroveToProjectData, no raw sqlite.New). Register the User, AllowList,
and InviteCode domains in the storetest CRUD-parity suite. Update entadapter
tests for the single-DB NewCompositeStore(client) signature.

go build ./... green; go test ./pkg/store/entadapter/... ./pkg/store/storetest/... green.

* P2-delete: remove raw-SQL store implementation

Delete the ~6k-LOC raw-SQL store (sqlite.go) and its per-domain sibling
files (brokersecret, gcp_service_account, github_installation, maintenance,
messages, notification, project_sync_state, schedule, scheduled_event) plus
their tests, including the inline schema-migration scaffold. Keep driver.go,
which registers the pure-Go SQLite driver used by Ent's SQLite backend.

Repoint the two non-test consumers to the Ent-backed store:
  - cmd/hub_secret_migrate.go now opens an Ent client + CompositeStore.
  - internal/fixturegen opens via entc and seeds the Ent schema's *sql.DB.

go build ./... green; no remaining production references to the raw store.

* test: compile-migrate downstream suites to Ent store + fix signing-key PK

Replace the removed raw-SQL store in downstream tests with an Ent-backed
newTestStore helper (pkg/hub, pkg/secret) and update cmd/server_test.go and
internal/fixturegen tests. Port the 8 raw-SQL DB() access sites in hub tests
via a new CompositeStore.DB() escape-hatch accessor.

Fix a production bug surfaced by the collapse: hub/server.go signingKeySecretID
generated a non-UUID secret primary key, which the Ent secret store rejects;
it now derives a deterministic UUIDv5. go build ./... green; entadapter and
storetest suites green.

NOTE: hub/secret/fixturegen suites now COMPILE but many tests still fail
because their fixtures seed non-UUID string IDs that the UUID-PK Ent schema
rejects; addressed in follow-up commits (tid() helper).

* test(hub): map non-UUID fixture IDs to UUIDs via tid() helper

Wrap human-readable test identifiers in tid() (deterministic UUIDv5) so the
UUID-PK Ent store accepts them while preserving cross-reference consistency and
ID-equality assertions. Reduces pkg/hub failures from 611 to 79; remaining
failures are behavioral, not ID-format, and are addressed separately.
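
A helper in the spirit of tid() could look like the sketch below, which derives a UUIDv5 using only the standard library. The namespace value is an arbitrary choice for this example (the RFC 4122 DNS namespace); the namespace used by the real helper is not stated in this message.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// namespace is an assumed, fixed namespace for this sketch.
var namespace = [16]byte{
	0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// tid maps a human-readable test identifier to a deterministic UUIDv5:
// SHA-1 over namespace+name, truncated to 16 bytes, with the version and
// variant bits set.
func tid(name string) string {
	h := sha1.New()
	h.Write(namespace[:])
	h.Write([]byte(name))
	sum := h.Sum(nil)

	var u [16]byte
	copy(u[:], sum[:16])
	u[6] = (u[6] & 0x0f) | 0x50 // version 5
	u[8] = (u[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}

func main() {
	fmt.Println(tid("agent-1"))
	fmt.Println(tid("agent-1") == tid("agent-1")) // same input, same UUID
	fmt.Println(tid("agent-1") == tid("agent-2"))
}
```

Determinism is what keeps cross-references and ID-equality assertions consistent across fixtures.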

* fix(store): seed maintenance ops in Migrate; initStore uses Migrate

Restore raw-SQL parity: CompositeStore.Migrate now runs AutoMigrate and seeds
built-in maintenance operations (the raw store seeded these in its migrations).
initStore and hub test helpers call s.Migrate() so production and tests seed
consistently. Fixes the maintenance-operation hub tests (404 'Operation not
found'). pkg/hub failures 79 -> 71.

* test(hub): satisfy Ent NotEmpty validators in fixtures

Add slugs/broker names to test fixtures that previously relied on the raw
store's lenient (no-validator) inserts: project/agent slugs in the logs test
helper, broker slugs in embedded/profile/authz fixtures, and BrokerName on
envgather ProjectProvider literals. pkg/hub failures 71 -> 57.

* fix(entadapter): Get-by-id returns ErrNotFound for non-UUID identifiers

Restore raw-SQL store parity: a malformed identifier cannot match any UUID
primary key, so get-by-id lookups now report store.ErrNotFound instead of
store.ErrInvalidInput. This matches the raw store (a lookup with a bad id simply
returned no row) and is what callers depend on — e.g. resolveTemplate passes a
template *name* to GetTemplate and relies on ErrNotFound to fall back to
slug-based resolution. New parseGetID helper applied across all 17 get-by-id
methods. pkg/hub failures 56 -> 40; entadapter/storetest stay green.
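
The parseGetID behaviour and the fallback it enables can be sketched as follows. The UUID check is hand-rolled to keep the example dependency-free, and `getTemplate`, `resolveTemplate` and the two maps are hypothetical stand-ins for the store lookups.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("not found")

// isUUID checks the canonical 8-4-4-4-12 hexadecimal layout.
func isUUID(s string) bool {
	if len(s) != 36 {
		return false
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		switch i {
		case 8, 13, 18, 23:
			if c != '-' {
				return false
			}
		default:
			isHex := (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')
			if !isHex {
				return false
			}
		}
	}
	return true
}

// parseGetID reports errNotFound for malformed ids: a non-UUID cannot match
// any UUID primary key, so "no such row" is the accurate answer.
func parseGetID(id string) (string, error) {
	if !isUUID(id) {
		return "", errNotFound
	}
	return id, nil
}

func getTemplate(byID map[string]string, id string) (string, error) {
	key, err := parseGetID(id)
	if err != nil {
		return "", err
	}
	name, ok := byID[key]
	if !ok {
		return "", errNotFound
	}
	return name, nil
}

// resolveTemplate tries the id lookup first and falls back to the slug
// lookup only on errNotFound.
func resolveTemplate(byID, bySlug map[string]string, ref string) (string, error) {
	name, err := getTemplate(byID, ref)
	if errors.Is(err, errNotFound) {
		if n, ok := bySlug[ref]; ok {
			return n, nil
		}
	}
	return name, err
}

func main() {
	byID := map[string]string{"886313e1-3b8a-5372-9b90-0c9aee199e5d": "Default template"}
	bySlug := map[string]string{"default": "Default template"}
	fmt.Println(resolveTemplate(byID, bySlug, "default"))
	fmt.Println(resolveTemplate(byID, bySlug, "missing"))
}
```

Had the lookup returned an invalid-input error instead, the slug fallback in this sketch would never run for a template name.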

* test(hub): fix store-less id wraps and project-route URL paths

- controlchannel_client_test: revert tid() wraps (store-less path-builder test;
  IDs must match the expected literal paths).
- github/envgather: project-scoped route handlers resolve the project by UUID id,
  so build paths with tid(rawID) via fmt.Sprintf instead of the old raw-id
  literal. pkg/hub failures 40 -> 32.

* test(hub): unwrap projectIDFromServiceAccountEmail expectation

The tid() sweep over-wrapped a non-ID expected value in a pure-function test;
restore the literal GCP project id.

* fix(ent): GCPServiceAccount.project_id is a string, not a UUID

The GCP service account project_id holds the GCP *cloud project* identifier
(e.g. 'my-project-123'), a free-form string — not a UUID. The schema declared
it field.UUID, so entadapter CreateGCPServiceAccount/Update did
parseUUID(sa.ProjectID) and rejected real GCP project ids, breaking SA
mint/create with a 400 in production (storetest masked it by passing a UUID).

Change the schema field to field.String, regenerate Ent, and store/read
project_id as a string in external_store.go. Fixes ~7 hub GCP tests; pkg/hub
31 -> 23.

* test(hub): fix GCP SA project-id assertion and project-settings id

Unwrap the over-wrapped 'my-project' expectation now that project_id is a
string, and wrap the dynamic project-settings project ID with tid().

* test(hub): revert tid() over-wraps in store-less events_test

events_test exercises the in-memory ChannelEventPublisher directly; its
ProjectID/IDs are subject-string components, not stored UUIDs. The tid() sweep
wrongly rewrote them so published subjects no longer matched the subscriptions
(timeouts). Restore the literal values. pkg/hub 19 -> 12.

* test(hub): fix maintenance-run path and notifications agentId queries

Use tid() UUIDs in the maintenance run-detail path and the notifications
agentId query params; guard list indexing with require.Len so a mismatch fails
cleanly instead of panicking (panics truncate the package run).

* test(hub): wrap remaining fixture IDs revealed after panic-cascade cleared

Panics ([0] on empty lists) had been truncating the package run, hiding many
failures and starving the tid() sweep. With those guarded, sweep the newly
reached tests: wrap dynamic rune-suffix IDs and the setupProjectWithBroker /
seedCreatedAgentForHarnessTest helper IDs, and convert raw query-param project
IDs to tid(). No UUID-parse errors remain in pkg/hub.

* test(hub): unwrap tid() in scheduler_test (mock store, raw ids)

scheduler_test uses an in-memory mockScheduledEventStore, not the Ent store, so
its ids need no UUIDs; the erroneous tid() wraps broke raw getEvent lookups and
caused a nil-pointer panic that truncated the package run.

* fix(ent): Template.harness may be empty (raw-store parity)

A template imported from a directory that declares no harness type has an empty
harness; the raw-SQL store stored it, but the Ent NotEmpty validator made
BootstrapTemplatesFromDir silently skip such templates. Drop NotEmpty and
regenerate. Removing the [0]-on-empty panics this caused un-truncates the hub
package run (true failure count now visible).

* test(hub): wrap dynamic fixture IDs in wake/workspace/signing-key tests

Wrap tid() around the wake_test, setupWorkspaceProject, and empty-value
signing-key secret IDs now reachable after panic removal. No panics in the hub
package run.

* test(hub): convert raw-id URL path segments to tid()

Build GET/PUT/DELETE paths for agents/projects/brokers/templates/harness-configs
and workspace sync routes from tid(rawID) so the by-id handlers resolve the
entity (raw ids no longer match the UUID PKs). pkg/hub 93 -> 80.

* fix(hub): seed creator users for agent-created agents; cascade-delete subscriptions on hard agent delete

* test(hub): seed broker slug/name in dispatcher and project_cache fixtures (Ent validators)

* fix(entadapter): cascade-delete agents on project delete (raw-store parity); test(hub): seed FK users, broker_name, deterministic UUIDs

* test(hub): MaxOpenConns=1 for SQLite test store (serialize writes); tid() URLs + FK user seeds in events/stopall

* test(hub): unwrap over-wrapped tid() in unit tests (workspace/logfilter/gcp/web); valid-UUID NotFound cases; tid() scheduled-event URLs

* fix(ent): allow empty display_name (raw-store NOT NULL parity, email fallback); test(hub): seed FK owner users, UUID policy/broker/agent IDs in authz remediation

* feat(migrate): add Migration β tool (Ent-SQLite → Ent-Postgres)

Implements 'scion server migrate --from sqlite://... --to postgres://...'
per postgres-strategy.md §7.3.

- entc.OpenSQLiteReadOnly: opens source with PRAGMA query_only=ON (no WAL
  write), MaxOpenConns=1 so the source is never mutated.
- entc.MigrateData: generic reflection-based, dependency-ordered copy of all
  30 Ent entities (FK-ordered core first), idempotent (skips rows whose PK
  already exists), atomic per entity (txn), chunked CreateBulk, source/dest
  row-count verification after each entity, plus the Group.child_groups M2M
  edge. FK columns are plain fields so edges are preserved via setters.
- cmd/server migrate: DSN parsing (sqlite://, file:, bare path; postgres URL
  or keyword form), --keep-source default / --drop-source cutover, progress
  logging.

Verified end-to-end against live CloudSQL Postgres 16 (integration test +
real CLI run): full copy, idempotent re-run, FK + M2M + value round-trips,
--drop-source removal.

* feat(concurrency): dialect-aware multi-replica primitives for Postgres (P3-3..6)

Add cluster-coordination primitives so N stateless hub processes can share one
Postgres, each degrading to a no-op on single-writer SQLite:

- store.AdvisoryLocker + entadapter TryAdvisoryLock (pg_try_advisory_lock on a
  dedicated conn); Scheduler.RegisterRecurringSingleton gates the heartbeat,
  stalled, purge, schedule-evaluator and github-health sweeps to one
  replica/tick.
- store.ScheduledEventClaimer + ClaimScheduledEvent atomic claim; fireEvent
  claims one-shot events before side effects (dedup across replica startup
  recovery).
- CompositeStore.RunSerializable: SERIALIZABLE + retry on 40001/40P01 (single
  run on SQLite) for future multi-row invariants.
- dbmetrics.StartPoolSampler feeds DB connection-pool gauges to the P0-5
  scaffold; wired into StartBackgroundServices via SetDBMetrics.
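
The RunSerializable retry rule (retry only on 40001/40P01) might look roughly like this. The sqlStateError type and the attempt cap are assumptions made for the sketch; the real method wraps an actual SERIALIZABLE transaction.

```go
package main

import "errors"

// sqlStateError is a hypothetical error type carrying a Postgres SQLSTATE code.
type sqlStateError struct{ Code string }

func (e *sqlStateError) Error() string { return "sqlstate " + e.Code }

// retryable reports whether err is a serialization failure (40001) or a
// detected deadlock (40P01), the two codes named in the commit message.
func retryable(err error) bool {
	var se *sqlStateError
	return errors.As(err, &se) && (se.Code == "40001" || se.Code == "40P01")
}

// runSerializable calls fn up to maxAttempts times. It stops on success or on
// the first non-retryable error; fn stands in for one transaction attempt.
func runSerializable(maxAttempts int, fn func() error) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = fn(); err == nil || !retryable(err) {
			return err
		}
	}
	return err
}
```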

Verified existing primitives correct (agent StateVersion CAS, FOR UPDATE sweeps,
notification atomic dispatch). Found and documented the schedule SKIP LOCKED
early-commit gap (lock released before the status transition), closed by the
singleton evaluator. Audit + budget docs in scratchpad.

Tests: locking_test.go (advisory no-op, serializable, claim exactly-once incl.
8-way concurrent), pool_sampler_test.go.

* feat(hub): widen events to EventPublisher interface + Postgres LISTEN/NOTIFY publisher

P3-7: Decouple call sites from the concrete *ChannelEventPublisher.
- Add Subscribe(patterns...) (<-chan Event, func()) to the EventPublisher
  interface; implement it on noopEventPublisher (nil channel) — *ChannelEventPublisher
  already had it.
- Factor the Publish* methods into a shared eventBuilder (sink func) so every
  backend emits identical subjects/payloads; ChannelEventPublisher embeds it.
- web.go (field + SetEventPublisher), messagebroker.go and notifications.go
  (field + constructor) now take EventPublisher; handlers_messages.go gates SSE
  on "not the no-op publisher" instead of a concrete type assertion.

P3-8: PostgresEventPublisher over pgx LISTEN/NOTIFY (cross-replica delivery).
- Per-grove channels plus a global channel (flat exact-match); event type in the
  JSON envelope. Grove-scoped subjects publish to both the grove channel and the
  global channel; subscriptions group their patterns by resolved channel so an
  event is matched only against patterns that opted into the arriving channel
  (no double delivery).
- 8 KB NOTIFY limit handled by reference-and-refetch via scion_event_payloads
  (TTL-swept so every replica can refetch).
- PublishTx enrolls the NOTIFY in a caller transaction (atomic write+publish;
  rollback => no delivery). Delivery flows exclusively through the listener.
- Listener goroutine reconnects with backoff and re-LISTENs (resubscribe);
  dynamic LISTEN/UNLISTEN applied on a poll (WaitForNotification timeout does
  not invalidate the pgconn connection).
- Emits pkg/observability/dbmetrics signals (published/delivered/dropped,
  payload size, publish->deliver latency, reconnects, pool stats).
- cmd: newEventPublisher selects the backend by database driver (postgres =>
  PostgresEventPublisher, else ChannelEventPublisher) with safe fallback.
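
The reference-and-refetch handling of oversized payloads can be sketched as below. The envelope shape, the 7999-byte budget, and the map standing in for scion_event_payloads are all assumptions for illustration; the real wire format is not shown in this PR text.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// maxNotifyBytes is an assumed budget for the encoded NOTIFY message.
const maxNotifyBytes = 7999

// envelope is a hypothetical wire format: Data inline, or Ref pointing at a
// payload stored out of band.
type envelope struct {
	Type string `json:"type"`
	Data []byte `json:"data,omitempty"`
	Ref  string `json:"ref,omitempty"`
}

// payloadStore stands in for the scion_event_payloads table.
type payloadStore map[string][]byte

// encode inlines the payload when the encoded message fits the budget;
// otherwise it stores the payload and sends only a reference.
func encode(eventType string, payload []byte, store payloadStore) (string, error) {
	b, err := json.Marshal(envelope{Type: eventType, Data: payload})
	if err != nil {
		return "", err
	}
	if len(b) <= maxNotifyBytes {
		return string(b), nil
	}
	ref := fmt.Sprintf("payload-%d", len(store)+1) // toy key; a real store would use a unique id
	store[ref] = payload
	b, err = json.Marshal(envelope{Type: eventType, Ref: ref})
	return string(b), err
}

// decode returns the event type and payload, refetching by reference if needed.
func decode(msg string, store payloadStore) (string, []byte, error) {
	var env envelope
	if err := json.Unmarshal([]byte(msg), &env); err != nil {
		return "", nil, err
	}
	if env.Ref == "" {
		return env.Type, env.Data, nil
	}
	data, ok := store[env.Ref]
	if !ok {
		return "", nil, fmt.Errorf("payload %q not found", env.Ref)
	}
	return env.Type, data, nil
}
```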

Tests: routing/registry/payload-offload/metrics/transactional-executor unit
tests run without a DB; cross-replica delivery, oversized round-trip,
transactional rollback, and reconnect+resubscribe are gated behind
SCION_TEST_POSTGRES_DSN. go build ./... green; full pkg/hub suite green.

Note: server.go's equivalent type-assertion cleanup is left in the working tree
(co-edited with concurrent P0-5/scheduler work) and is functionally optional —
HEAD server.go already compiles against the widened interface.

* test(store): parameterize store suites over {sqlite, postgres} (P3-2)

Add pkg/store/enttest: a backend-selecting Ent client factory for the store
test suites. Default is in-memory SQLite; built with -tags integration and
SCION_TEST_POSTGRES_URL set, it provisions a per-package ephemeral Postgres
database (created/dropped via TestMain) and isolates each test in its own
schema (search_path) so tests never observe each other's rows. Falls back to
SQLite when the env var is unset.

Route all entadapter and storetest helpers through enttest.NewClient so the
same CRUD-parity oracle runs unchanged against either backend.

Fix two real Postgres bugs surfaced by the new path:
- entadapter/dialect.go ancestryContains: emit the bind parameter via
  Builder.Arg ($n on Postgres) instead of a literal '?' through ExprP, which
  was not rebound and produced a syntax error; and use jsonb_array_elements_text
  (the column is jsonb on Postgres, not json).
- schedule_store_test ClaimPath: make the concurrent-claim assertion
  backend-aware. SQLite serializes (MaxOpenConns=1, no SKIP LOCKED) so every
  caller sees both due rows; Postgres uses FOR UPDATE SKIP LOCKED so concurrent
  callers may observe a disjoint subset (0..2) and must only never error or
  exceed 2.

Verified: full SQLite suite green; storetest CRUD parity green on CloudSQL
Postgres; entadapter green on Postgres (schedule ClaimPath fix confirmed).

* fix(hub): harden Postgres event publish + verify wiring; lower PG pool default

Task 1 — LISTEN/NOTIFY publish path:
- Add TestPostgresIntegration_HandlerCreateProjectEmitsNotify: drives the real
  POST /api/v1/projects handler with a PostgresEventPublisher and asserts a
  pg_notify lands on scion_ev_global via an independent raw LISTEN — the exact
  capability the multi-replica live test probed. Verified PASSING against live
  CloudSQL, proving the handler -> s.events -> pg_notify wiring is correct end
  to end (the four pre-existing SCION_TEST_POSTGRES_DSN integration tests also
  pass). The multi-hub 'no NOTIFY' symptom was not reproducible against the
  current tree.
- Bound the autocommit publish (Publish* methods) with publishTimeout (5s).
  These run synchronously on the caller's (request handler) goroutine and
  acquire from the event pool; on a connection-starved instance that acquire
  could block indefinitely, stalling CRUD and silently never emitting NOTIFY.
  The timeout converts that into a logged error + dropped event (publishing is
  fire-and-forget). PublishTx (transactional path) is unaffected.
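
A sketch of the bounded autocommit publish, assuming the send step is a function that honors its context. publishBounded, starvedPool and healthyPool are hypothetical names; only the timeout-then-drop behavior comes from the description above.

```go
package main

import (
	"context"
	"log"
	"time"
)

// publishBounded runs send under a deadline and reports whether the event was
// published. On timeout or error it logs and drops the event.
func publishBounded(timeout time.Duration, send func(context.Context) error) bool {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	if err := send(ctx); err != nil {
		log.Printf("event dropped: %v", err)
		return false
	}
	return true
}

// starvedPool simulates a pool acquire that blocks until the context expires.
func starvedPool(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// healthyPool simulates an immediate, successful publish.
func healthyPool(ctx context.Context) error { return nil }
```

With a 5s timeout in place of the short values used in testing, a starved acquire costs the request handler at most 5s instead of blocking indefinitely.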

Task 2 — connection budget:
- Lower the default Postgres MaxOpenConns 20 -> 10 so multiple replicas fit a
  modest connection budget (see CONNECTION-BUDGET.md). CloudSQL instance
  scion-postgres-test resized db-f1-micro -> db-g1-small and max_connections
  set to 100 (out of band).

* test(store): add Postgres stress/integration suite (contention, isolation, pool, NOTIFY, migration, schema, multi-process)

Add pkg/store/integrationtest/: a Postgres-only suite that exercises behavior
the SQLite parity suites cannot reach. Gated by //go:build integration and
SCION_TEST_POSTGRES_URL; skips cleanly otherwise.

Coverage:
- Contention: state_version CAS race (no lost updates, >=N-1 retries, final
  version==1+N), SKIP LOCKED / conditional-UPDATE event claim (single winner +
  disjoint drain), unique-key races (project slug, user email, agent slug).
- Isolation: SERIALIZABLE conflict + RunSerializable retry recovery, REPEATABLE
  READ no-phantom snapshot, READ COMMITTED dirty-read prevention.
- Pool: exhaustion + queued recovery, saturated pool honoring context deadline,
  long txn not starving short queries, healing after pg_terminate_backend.
- LISTEN/NOTIFY: ordered burst no-drop, 8000B payload limit, listener
  reconnect/resume, cross-channel isolation.
- Migration: 1000+ row counts + bounded-memory listing, idempotent re-migration.
- Schema: NULL semantics, unicode/emoji, nested JSON + special chars, large-text
  non-truncation, TIMESTAMPTZ microsecond precision.
- Multi-process: forks the test binary for cross-process advisory-lock
  exclusivity and cross-process NOTIFY delivery.

Configurable concurrency via SCION_TEST_CONCURRENCY (default 10).

Extend pkg/store/enttest with Active() and NewSchemaURL() so tests can open
custom-pool clients and share a DSN with forked child processes; non-integration
stubs keep the package API stable.

* fix(db): recycle stale conns + keepalives; skip singleton tick on lock error

Stale-connection pool stalls (CloudSQL drops idle conns after ~10m):
- Add ConnMaxIdleTime to DatabaseConfig/PoolConfig (default 5m pg, 0 sqlite)
  and apply SetConnMaxIdleTime on the database/sql pool.
- OpenPostgres now parses the DSN with pgx and opens via stdlib.OpenDB with
  TCP keepalive GUCs (idle 60s / interval 15s / count 4) and a 10s connect
  timeout, so a silently-dropped peer is detected instead of the first query
  after idle hanging on a dead socket.
- pgx event pool (events_postgres.go): set keepalives + connect timeout on
  both the pool's ConnConfig and the dedicated listener connection, plus
  MaxConnIdleTime 5m / MaxConnLifetime 30m.

Advisory-lock leader election (scheduler.go):
- A lock-acquisition error no longer falls open to running the handler
  unguarded (which would duplicate singleton work across replicas); the tick
  is skipped and retried next interval. Added regression tests.
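
The fail-closed behavior can be illustrated with a small sketch. The tryLock signature is a hypothetical stand-in for TryAdvisoryLock, not its real shape.

```go
package main

import "errors"

// tryLock is a hypothetical stand-in for TryAdvisoryLock: it reports whether
// the lock was acquired and returns a release function.
type tryLock func() (acquired bool, release func(), err error)

// runSingletonTick runs handler only when the lock is held. A lock error and
// a lock held by another replica both skip the tick; neither falls open.
func runSingletonTick(lock tryLock, handler func()) (ran bool) {
	acquired, release, err := lock()
	if err != nil || !acquired {
		return false
	}
	defer release()
	handler()
	return true
}

var errLockBackend = errors.New("lock backend unavailable")

// Three lock outcomes used to exercise the gate.
func failingLock() (bool, func(), error)   { return false, nil, errLockBackend }
func heldElsewhere() (bool, func(), error) { return false, func() {}, nil }
func freeLock() (bool, func(), error)      { return true, func() {}, nil }
```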

Test harness (enttest/integrationtest):
- Accept libpq keyword/value DSNs (not just URL form) when deriving the
  ephemeral db/schema/params; add WithConnParam helper.
- Fix migration idempotency test's per-pass row-count expectation.

* fix(store): bound advisory-lock conn checkout + unlock with short timeout

TryAdvisoryLock checked a connection out of the pool on the full 55s
scheduler-handler context (acquire) and ran the unlock on an unbounded
context.Background() (release). On a pool that could not promptly serve a
healthy connection, db.Conn() blocked for the entire 55s before failing
with 'context deadline exceeded' on every tick; with several singleton
handlers firing each 60s tick, those long-blocked goroutines and their
pending pool connection requests piled up across ticks and kept the pool
jammed (checked out client-side, idle server-side).

The unbounded unlock was a second leak vector: if the held connection
died mid critical-section, ExecContext could hang forever, so conn.Close()
never ran and the connection leaked out of the pool permanently.

Bind both the acquire (db.Conn + pg_try_advisory_lock) and the release
(pg_advisory_unlock) to a 5s timeout so a bad tick fails fast and retries
next tick instead of parking a goroutine for ~55s, and so a dead
connection can never block release from freeing the conn. Lock semantics
are unchanged: cancelling the acquire context tears down only that
context, not the checked-out session that holds the lock.

* feat(migrate): in-process migration α (legacy raw-SQL hub.db → Ent)

Upgrade a legacy raw-SQL Hub database (the ~53-migration, 30-table schema
from the removed pkg/store/sqlite store) to the consolidated Ent-backed
SQLite schema, in-process on first boot, behind an automatic backup.

pkg/ent/entc/migrate_alpha.go:
- IsLegacyRawSQLSchema: detect via the schema_migrations sentinel + the
  legacy-only agents.agent_id column (no-op for an Ent/empty/absent file).
- MigrateAlphaSQLite: backup (checkpoint WAL + copy to hub.db.bak.<ts>),
  AutoMigrate a fresh Ent schema, ATTACH the legacy file, copy every table
  with INSERT…SELECT (foreign_keys OFF), verify per-table row counts, then
  atomically swap the migrated file into place.
- Data-driven column mapping (created_at→created, updated_at→updated,
  agents.agent_id→slug, policies→access_policies); bespoke SQL for the
  group_members/policy_bindings polymorphic splits and surrogate ids;
  groups.parent_id→group_child_groups edge.
- Deterministic UUIDv5 remap for legacy non-UUID primary keys (internal
  signing-key secrets; plugin runtime-broker ids) with consistent rewrite
  of every foreign-key reference via a TEMP _id_remap table.
- Tolerates missing legacy tables (older schema versions).

cmd/server_foreground.go: detect + migrate in initStore's sqlite path,
with a --no-auto-migrate operator opt-out (cmd/server.go).

Validated end-to-end against four production hub.db files (scion-integration,
-integration2, -demo, -gteam): exact row-count parity (up to ~19k rows),
every entity reads back through the live Ent store, idempotent re-runs, and
broker FK references resolve post-remap. Pre-existing dangling agent
created_by/owner_id refs are faithfully preserved (loader runs FK-off).

* fix(config): apply real Postgres pool size (leaked SQLite default of 1 starved the pool)

The struct-level default for Database.MaxOpenConns/MaxIdleConns is 1 — the
value SQLite REQUIRES to serialize writes. applyDatabasePoolDefaults only
bumped postgres to a real pool when the value was <= 0, but a postgres
deployment configured via env/driver override inherits the embedded default
of 1, so the guard never fired and the Ent pool ran with a SINGLE connection.

Effect in production (both integration hubs): every singleton scheduler tick
checks out the lone pool connection to hold its advisory lock, then blocks
waiting for a second connection to do its work — a self-deadlock that resolves
only at the 55s handler context deadline. All API requests serialize behind
the one connection, so GET /api/v1/* served in ~55s across the board.

Note env overrides could not paper over this: envKeyToConfigKey splits on
every underscore, so SCION_SERVER_DATABASE_MAX_OPEN_CONNS maps to
database.max.open.conns, not database.max_open_conns — silently ignored.
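
To make the key mapping concrete, here is a hypothetical reimplementation of the described behavior. It assumes the prefix is SCION_SERVER_ and that every remaining underscore becomes a dot; the real envKeyToConfigKey is not shown here and may differ in detail.

```go
package main

import "strings"

// naiveEnvKeyToConfigKey strips the assumed prefix, lowercases the rest, and
// turns every underscore into a dot, which breaks keys that contain
// underscores themselves (such as max_open_conns).
func naiveEnvKeyToConfigKey(env string) string {
	key := strings.TrimPrefix(env, "SCION_SERVER_")
	return strings.ReplaceAll(strings.ToLower(key), "_", ".")
}
```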

Treat the leaked SQLite default (<= 1) as 'unset' for postgres so the pool
default (10) applies; explicit sizing of 2+ is still respected. SQLite remains
pinned to 1. Adds regression tests for all three cases.
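
The three cases reduce to a small decision function. This is a sketch of the rule as described, with a hypothetical name and signature rather than the real applyDatabasePoolDefaults.

```go
package main

// postgresDefaultMaxOpenConns is the Postgres pool default named above.
const postgresDefaultMaxOpenConns = 10

// effectiveMaxOpenConns applies the rule: on Postgres a value <= 1 is treated
// as unset (it is the leaked SQLite default), explicit sizing of 2 or more is
// respected, and any other driver (SQLite) stays pinned to 1.
func effectiveMaxOpenConns(driver string, configured int) int {
	if driver != "postgres" {
		return 1
	}
	if configured <= 1 {
		return postgresDefaultMaxOpenConns
	}
	return configured
}
```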

* feat(hub): per-process instanceID on Server (B1-1)

Add a unique per-process instanceID to Server, generated at construction
via uuid.NewString(). Optionally prefixed with POD_NAME env var for
log readability, but uniqueness is always guaranteed by the UUID.

This ID serves as the affinity key for broker dispatch (design §4.1)
and is intentionally distinct from config.ResolveHubID, which is
shareable across replicas.

* feat(schema): affinity columns on runtime_brokers (B1-2)

Add 3 nullable fields to the runtime_brokers ent schema and store model
for tracking which hub instance holds the control-channel socket:

  - connected_hub_id     (TEXT, optional/nullable)
  - connected_session_id (TEXT, optional/nullable)
  - connected_at         (TIMESTAMPTZ, optional/nullable)

Dialect-neutral (no Postgres-only annotations) — AutoMigrate works on
both SQLite and CloudSQL Postgres per postgres-strategy.md §6.4.

Wire the fields through the ent<->store conversion code in both
directions (entBrokerToStore, CreateRuntimeBroker, UpdateRuntimeBroker).
Regenerated ent code included.

* feat(store): Claim/Release runtime-broker affinity CAS methods (B1-3)

Mirrors UpdateRuntimeBrokerHeartbeat's lock_version CAS loop.
- ClaimRuntimeBrokerConnection: newest-wins, sets affinity + status=online + heartbeat in one write
- ReleaseRuntimeBrokerConnection: compare-and-clear, returns cleared=false (no-op) if affinity moved (disconnect-race fix)
Tests cover claim/overwrite/clear/no-op + A->B flap (design 9.4).
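
The claim/release semantics, including the A->B flap, can be sketched in memory. A mutex replaces the lock_version CAS loop here purely for brevity; the type and method names are hypothetical.

```go
package main

import "sync"

// affinity is an in-memory stand-in for the affinity columns of one
// runtime_brokers row.
type affinity struct {
	mu        sync.Mutex
	hubID     string
	sessionID string
	online    bool
}

// claim is newest-wins: it overwrites any previous owner and marks the
// broker online in the same write.
func (a *affinity) claim(hubID, sessionID string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.hubID, a.sessionID, a.online = hubID, sessionID, true
}

// release is compare-and-clear: it clears affinity only if the caller still
// owns it, and reports cleared=false when affinity has moved.
func (a *affinity) release(hubID, sessionID string) (cleared bool) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.hubID != hubID || a.sessionID != sessionID {
		return false
	}
	a.hubID, a.sessionID, a.online = "", "", false
	return true
}
```

In the flap case, hub A's late disconnect calls release with its old session and gets cleared=false, so it does not stamp the broker offline while hub B holds the live socket.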

* fix(hub): thread sessionID through connect + fix onDisconnect clobber race (B1-4, B1-5)

B1-4: HandleUpgrade returns sessionID; markBrokerOnline(brokerID, sessionID)
  now calls ClaimRuntimeBrokerConnection(brokerID, instanceID, sessionID),
  recording affinity + online + heartbeat in one CAS write.
B1-5: SetOnDisconnect callback gains sessionID; the handler compare-and-clears
  via ReleaseRuntimeBrokerConnection and skips the offline stamp when affinity
  has moved (flap). removeConnection now only removes/fires for the matching
  session, so an old connection's teardown can't drop a newer live socket.

* feat(schema): broker_dispatch intent table + messages dispatch-state (B2-1, B2-2)

B2-1: new BrokerDispatch ent entity (table broker_dispatch) — id, broker_id,
  agent_id(null), agent_slug, project_id(null), op, args(JSON), state, result,
  claimed_by, attempts, error, created_at/updated_at, deadline_at(null);
  index (broker_id,state). store.BrokerDispatch model + state constants.
B2-2: messages.dispatch_state (default 'pending') + dispatched_at; wired through
  store.Message + entadapter conversion/create. Dialect-neutral.

* feat(hub): PostgresCommandBus LISTEN/NOTIFY signal listener on scion_broker_cmd (B2-4)

Introduce a CommandBus interface and PostgresCommandBus implementation
that listens on the new global channel scion_broker_cmd for broker
dispatch wakeup signals. This is a sibling of PostgresEventPublisher,
reusing the same connect/reconnect/keepalive helpers but maintaining
its own independent pgx connection and pool (design §5.1).

Key components:
- PostgresCommandBus: LISTEN loop with backoff-reconnect on its own
  dedicated connection; filters signals by local broker ownership via
  an injected ownsLocally func (wired to ControlChannelManager.IsConnected);
  invokes an injected onSignal reconcile callback (to be wired to the
  reconcile drain in B2-5).
- NotifyBrokerCmd: issues NOTIFY inside the caller's transaction so the
  signal commits atomically with the durable intent row (mirrors PublishTx).
- NoopCommandBus: safe no-op for the SQLite backend (single-process,
  all brokers are local).
- Backend selection in newCommandBus mirrors newEventPublisher: Postgres
  driver → PostgresCommandBus; otherwise → NoopCommandBus.
- Server.SetCommandBus/CommandBus() setter/getter; cleanup in both
  Shutdown and CleanupResources paths.

* feat(store): BrokerDispatch store methods + message dispatch CAS (B2-3)

BrokerDispatchStore: Insert/Claim(CAS pending->in_progress)/Complete/Fail/
ListPendingDispatch + MarkMessageDispatched(CAS)/ListPendingMessages (via agent
runtime_broker_id). Wired into CompositeStore + store.Store. Tests: concurrent
claim single-winner (exactly-once), drain pending-only, message CAS dedupe,
complete/fail transitions, pending-messages-by-broker-agent.
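
The single-winner property of a pending->in_progress CAS can be shown with an atomic in place of the database row. In SQL terms this presumably corresponds to a conditional UPDATE that only matches rows still in the pending state; that mapping is an assumption, and the names below are illustrative.

```go
package main

import (
	"sync"
	"sync/atomic"
)

const (
	statePending int32 = iota
	stateInProgress
)

// dispatch is an in-memory stand-in for one broker_dispatch row.
type dispatch struct{ state atomic.Int32 }

// claim transitions pending -> in_progress and reports whether this caller won.
func (d *dispatch) claim() bool {
	return d.state.CompareAndSwap(statePending, stateInProgress)
}

// raceClaims starts n concurrent claimers and returns how many of them won.
func raceClaims(d *dispatch, n int) int {
	var wg sync.WaitGroup
	var winners atomic.Int32
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if d.claim() {
				winners.Add(1)
			}
		}()
	}
	wg.Wait()
	return int(winners.Load())
}
```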

* feat(hub): reconcile-on-connect drain wired to bus + markBrokerOnline (B2-5)

Server.reconcileBroker drains pending broker_dispatch rows (CAS-claim -> exec ->
done/fail) and pending messages (CAS MarkMessageDispatched -> deliver) for a
broker this node owns. Exactly-once via store CAS; idempotent + concurrent-safe.
Wired as durability backstop into markBrokerOnline (async on reconnect) and as
the command-bus signal handler (SetOnSignal -> ReconcileBroker). Op executors are
seams (executeDispatch/deliverMessage) that Phase 3/4 fill with local tunnel ops.

* feat(hub): route() decision in HybridBrokerClient (B3-1)

routeLocal (IsConnected, unchanged fast path) | routeForward (affinity owner
alive) | routeHTTP (broker endpoint set) | routeUndeliverable. Affinity is a
hint only (StoreAffinityLookup over connected_hub_id + last_heartbeat freshness),
injectable for testing. Not yet wired into dispatch (B3-2 wires message path).
Table-driven tests over all branches incl. local-precedence + nil-affinity.
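
The decision order can be sketched as a pure function. The precedence of routeForward over routeHTTP follows the order the branches are listed in and is an assumption; the boolean and string inputs are simplified stand-ins for IsConnected, the affinity lookup, and the broker endpoint.

```go
package main

type routeKind int

const (
	routeLocal routeKind = iota
	routeForward
	routeHTTP
	routeUndeliverable
)

// route picks a delivery path: a locally connected broker always wins, then a
// live affinity owner on another node, then an HTTP endpoint if one is set.
func route(connectedLocally, ownerAlive bool, endpoint string) routeKind {
	switch {
	case connectedLocally:
		return routeLocal
	case ownerAlive:
		return routeForward
	case endpoint != "":
		return routeHTTP
	default:
		return routeUndeliverable
	}
}
```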

* feat(hub): cross-node message dispatch via route()+intent+signal+owner drain (B3-2, B3-3)

Route-gate the message send path: HybridBrokerClient.MessageAgent now uses
route(brokerID, endpoint) to decide delivery. routeLocal and routeHTTP follow
existing paths unchanged. routeForward/routeUndeliverable return
ErrMessageDeferred — the message row (already persisted with
dispatch_state=pending) is the durable intent. All call sites
(handleAgentMessage, set[], broadcastDirect, messagebroker, notifications,
scheduler) catch the sentinel, emit a best-effort NOTIFY wakeup via
SignalBrokerCmd, and return 202 Accepted (or log as deferred).

Fill the deliverMessage seam in reconcile.go: resolves the agent from the
message's AgentID, obtains the dispatcher, and calls DispatchAgentMessage for
local tunnel delivery. reconcileBroker already CAS-marks dispatched before
calling this.

Wire SetAffinityLookup(StoreAffinityLookup(store, 0)) on the
HybridBrokerClient in CreateAuthenticatedDispatcher so route() can return
routeForward when another node owns the broker.

Add SignalBrokerCmd to the CommandBus interface — a best-effort NOTIFY using
the bus's own pool, used by the message path where the durable intent is the
message row itself and the NOTIFY is only a wakeup hint.

* feat(hub): lifecycle dispatch (rolling-timeout wait + cross-node start/stop/restart) (B4-1, B4-2)

B4-1: Rolling-timeout wait helper (dispatch_wait.go)
- waitForAgentTransition subscribes to agent.<id>.status events and loops
  with a rolling window (dispatchRollingTimeout=90s) that resets on ANY
  AgentStatusEvent (phase/activity/detail change).
- Terminal phase → return phase, nil. Window expiry → ErrDispatchFailed.
  Context cancellation → ctx.Err().
- Caller subscribes BEFORE writing intent, passes the channel + unsub.
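
A minimal sketch of the rolling-window wait, assuming a simplified event type and error value. Each loop iteration arms a fresh timer, so any event restarts the window; waitBackground is a convenience wrapper added only for this example.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// statusEvent is a simplified stand-in for AgentStatusEvent.
type statusEvent struct{ Phase string }

// errDispatchFailed stands in for ErrDispatchFailed.
var errDispatchFailed = errors.New("dispatch failed: no progress within window")

// waitForTransition returns the first terminal phase seen. Every received
// event restarts the window; silence for a full window is a failure.
func waitForTransition(ctx context.Context, events <-chan statusEvent,
	window time.Duration, terminal map[string]bool) (string, error) {
	for {
		select {
		case ev, ok := <-events:
			if !ok {
				return "", errDispatchFailed // subscription closed
			}
			if terminal[ev.Phase] {
				return ev.Phase, nil
			}
			// Non-terminal event: loop again, which starts a new window.
		case <-time.After(window):
			return "", errDispatchFailed
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}

// waitBackground calls waitForTransition with a background context.
func waitBackground(events <-chan statusEvent, window time.Duration,
	terminal map[string]bool) (string, error) {
	return waitForTransition(context.Background(), events, window, terminal)
}
```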

B4-2: Cross-node start/stop/restart dispatch
- Route-gated HybridBrokerClient.StartAgent/StopAgent/RestartAgent exactly
  like MessageAgent: routeLocal → control-channel tunnel (unchanged fast
  path), routeHTTP → HTTP fallback, routeForward/routeUndeliverable →
  ErrLifecycleDeferred.
- Dispatch args structs (dispatch_args.go): StartDispatchArgs captures
  task, resolvedEnv, resolvedSecrets, inlineConfig, sharedDirs,
  sharedWorkspace, projectPath, projectSlug, harnessConfig.
  RestartDispatchArgs captures resolvedEnv. StopDispatchArgs is empty.
  All JSON-serializable for broker_dispatch.args column.
- Owner-side executeDispatch (reconcile.go): start/stop/restart cases
  deserialize args, load agent from store, call local
  DispatchAgentStart/Stop/Restart via the dispatcher. Unknown ops
  (delete, finalize_env, etc.) still fail cleanly for B4-3/B4-4.

Tests: waitForAgentTransition (terminal, error, rolling reset, silence
expiry, context cancel, unsub); route-gating of Start/Stop/Restart
returns ErrLifecycleDeferred when non-local; executeDispatch lifecycle
cases invoke the local dispatcher; args round-trip (serialize→deserialize)
is lossless; reconcile end-to-end lifecycle path.

* feat(hub): wire originator-side cross-node lifecycle dispatch (B4-2 complete)

The originator-side orchestration was missing: ErrLifecycleDeferred was
returned by HybridBrokerClient but nothing caught it. Now the full
cross-node start/stop/restart flow works transparently to all handler
call sites.

Originator side (HTTPAgentDispatcher):
- DispatchAgentStart/Stop/Restart catch ErrLifecycleDeferred after
  env/secret resolution and invoke deferredLifecycle:
  1. Subscribe("agent.<id>.status") BEFORE writing intent
  2. InsertBrokerDispatch{op, agent_id, broker_id, args}
  3. Best-effort SignalBrokerCmd (row is durable backstop)
  4. waitForAgentTransition with terminal set per op
  5. Return nil on success, error on error-phase/timeout
- SetCrossNodeDeps(events, commandBus) wired in server.go's
  getOrCreateDispatcher, so all handler call sites get cross-node
  dispatch for free with synchronous semantics preserved.
- Local path (routeLocal) is unchanged at zero added latency — no
  subscribe, no intent row, no wait.

Args decision: owner RE-RESOLVES env/secrets via DispatchAgentStart
(all hub instances share the same store + secret backend), so
StartDispatchArgs carries only {Task}. RestartDispatchArgs and
StopDispatchArgs are empty. This avoids serializing potentially large
env/secrets into the DB while remaining correct because all hubs read
from the same shared store.

waitForAgentTransition refactored to a standalone function (no Server
receiver) so the dispatcher can call it directly.

Tests:
- TestDeferredStart_WritesIntentAndWaits: deferred start writes a
  broker_dispatch row, waits, returns success on "running" event
- TestDeferredStart_ReturnsErrorOnErrorPhase: error phase → error
- TestLocalStart_SkipsIntentRow: local path calls tunnel directly,
  no intent row written
- All existing tests pass (no regressions)

* fix(hub): make web session replica-portable to fix OAuth state_mismatch

OAuth login behind the load balancer intermittently failed with
state_mismatch: the CSRF state token (and the entire web session) was
stored in a gorilla FilesystemStore on the handling replica's local
disk, while the browser only carried a session-ID cookie. When the LB
routed /auth/login and /auth/callback to different replicas, the
callback replica had no matching session file -> empty state ->
state_mismatch. It only "worked" when both hops happened to hit the
same backend.

The same flaw affected the post-login session: sessionToBearerMiddleware
reads the Hub access/refresh JWTs from that disk-local store on every API
request, so sessions silently dropped whenever a follow-up request
landed on a different replica.

Replace the FilesystemStore with an encrypted, signed gorilla
CookieStore so the whole session lives in the client's cookie and any
replica sharing SESSION_SECRET can read it. Keys are derived
deterministically from SESSION_SECRET (32-byte HMAC auth key + 32-byte
AES-256 encryption key, domain-separated). No DB, no migration; works
with N replicas.
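
One way to derive two domain-separated 32-byte keys from a single secret is HMAC-SHA256 over a purpose label. The commit message does not state the exact derivation, so the construction and the labels below are assumptions for illustration.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
)

// deriveKey returns a 32-byte key bound to a purpose label, so different
// labels give unrelated keys from the same secret.
func deriveKey(secret []byte, label string) []byte {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(label))
	return mac.Sum(nil)
}

// sessionKeys derives the cookie auth key and the AES-256 encryption key.
// The label strings are hypothetical.
func sessionKeys(secret string) (authKey, encKey []byte) {
	s := []byte(secret)
	return deriveKey(s, "session-auth"), deriveKey(s, "session-encrypt")
}
```

Because the derivation is deterministic, every replica configured with the same SESSION_SECRET computes the same key pair and can decode cookies minted by any other replica.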

The original switch to disk was motivated by a "JWT tokens exceed 4096
bytes" concern. Measured against the current compact HS256 tokens the
full session (identity + access + refresh) encodes to ~2.6 KB, well
under the browser's ~4 KB per-cookie cap, so the securecookie length
limit is left in force (oversize would now error+log, not silently drop).

Tests: replace the obsolete NoMaxLengthLimit test with a cross-replica
round-trip regression test (cookie minted by replica A decodes on
replica B with the same secret; carries OAuth state + post-login tokens)
plus a negative test (a different secret cannot decode the cookie).

* feat(hub): cross-node delete + create-time data ops dispatch (B4-3, B4-4)

Route-gate HybridBrokerClient.DeleteAgent, CheckAgentPrompt,
CreateAgentWithGather, and FinalizeEnv through route() so
routeForward/routeUndeliverable return ErrLifecycleDeferred (matching
start/stop/restart pattern from B4-2).

B4-3 (delete dispatch):
- deferredDelete on ErrLifecycleDeferred: subscribe
  broker.dispatch.<id>.done → InsertBrokerDispatch{op:delete} →
  SignalBrokerCmd → waitForDispatchDone (reads DB row, authoritative).
- Owner executeDispatch case "delete": deserializes DeleteDispatchArgs →
  local DispatchAgentDelete (idempotent, 404 ok).
- DeleteDispatchArgs struct + UnmarshalDeleteArgs for args round-trip.

B4-4 (create-time data ops):
- deferredDataOp/deferredDataOpResult: common originator flow for ops
  that return results via the dispatch row (design §6.3). Subscribe to
  broker.dispatch.<id>.done BEFORE writing intent, insert dispatch,
  signal, waitForDispatchDone, read result from GetBrokerDispatch.
- deferredCheckPrompt: returns bool from CheckPromptResult in row.
- deferredFinalizeEnv: fire-and-forget via deferredDataOp.
- deferredCreateWithGather: returns envRequirements from row result.
- Owner executeDispatch cases: check_prompt, finalize_env, create —
  run local op, marshal result JSON, return it.
- PublishDispatchDone on EventPublisher: slim completion event
  broker.dispatch.<id>.done emitted by reconcile loop on complete/fail.
- waitForDispatchDone: event-driven wait with bounded re-read at
  rolling timeout (missed event recovery, design §6.3).
- GetBrokerDispatch added to BrokerDispatchStore interface + entadapter.

Local fast path unchanged (routeLocal → zero added latency).

* feat(hub): stale-affinity + stuck-dispatch reaper singleton (B5-1)

* feat(hub): pending-message sweep + dispatch metrics (B5-2)

Add observability for the multi-node broker dispatch pipeline:

Sweep:
- CountStuckPendingMessages store method (messages pending > threshold)
- brokerMessageSweepHandler registered as RecurringSingleton with
  LockBrokerMessageSweep (0x5C100007), runs every 1m

Metrics (pkg/observability/dispatchmetrics):
- Counters: dispatch published/claimed/done/failed, message dispatched
- Gauge: message stuck (pending beyond 5m threshold)
- Histograms: intent-to-done latency, reconcile drain duration
- Counter: command bus reconnects

Emit sites:
- InsertBrokerDispatch → IncPublished (httpdispatcher.go)
- ClaimBrokerDispatch → IncClaimed (reconcile.go)
- CompleteBrokerDispatch → IncDone + RecordDispatchLatency (reconcile.go)
- FailBrokerDispatch → IncFailed (reconcile.go)
- MarkMessageDispatched → IncMessageDispatched (reconcile.go)
- reconcileBroker → RecordReconcileDrainDuration (reconcile.go)
- command bus reconnect → IncCmdBusReconnects (command_bus.go)
- sweep handler → ObserveMessageStuck (sweep.go)

* fix(hub): derive JWT signing keys from shared SESSION_SECRET to fix cross-replica login loop

The cookie-store fix (0515e2a8) made the web session replica-portable, but
the Hub JWT *inside* the cookie is still signed with a per-replica key:
ensureSigningKey scopes signing keys to (scope=hub, scope_id=hubID) and
hubID = sha256(hostname)[:12]. The integration env runs two replicas of one
logical hub behind a single LB, sharing one Postgres DB and one
SESSION_SECRET but with different hostnames -> different hubIDs -> different
HS256 signing keys.

So a user JWT minted on replica A failed signature verification on replica B
(go-jose: error in cryptographic primitive); refresh failed too (refresh
token signed with the same foreign key), so sessionToBearerMiddleware
declared the session irrecoverably invalid, DELETED the cookie (MaxAge=-1)
and returned session_expired. The cookie deletion turns it into a redirect
loop: dashboard flashes, then /login?error=session_expired.

Fix: extend the 0515e2a8 approach (replica-portable via the shared secret)
from the cookie to the keys inside it. Add ServerConfig.SharedSigningSecret;
when set, ensureSigningKey derives the agent and user signing keys
deterministically from it (domain-separated by key name) and bypasses
per-host secret-backend storage. cmd feeds the same --session-secret /
SESSION_SECRET value into both the web cookie store and the hub config via a
new resolveSessionSecret() helper. Empty secret keeps the existing per-hub
behavior (no regression for single-node/local dev).

Tests: cross-replica round trip (different hubID + same secret -> identical
keys, token minted on A validates on B; different secret cannot) plus
pre-configured-key precedence.

Note: rollout rotates the signing keys (now derived from SESSION_SECRET), so
existing web/CLI tokens are invalidated once and users re-login.

* docs: project log for B5-3 chaos gate — GB5 PASSED (GA gate for broker dispatch)

* fix(hub): align fakeHTTPClient.CleanupProject with interface (3 params, not 4)

* fix(hub): address PR #305 review feedback

- server_migrate.go: use nil-checked deferred close for src DB, and
  explicitly close src before dropSQLiteFile to prevent Windows sharing
  violations
- server_migrate.go: handle file:// prefix before file: to correctly
  parse file:///path/to/db URLs
- server_foreground.go: evaluate GetControlChannelManager() inside the
  ownsLocally closure to avoid capturing a stale nil value
- server_migrate_test.go: add test case for file:/// URL format
- server_test.go: sanitize t.Name() slashes in newTestStore to prevent
  SQLite path errors in subtests
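
Why the file:// prefix has to be checked before file: can be seen in a small sketch. The helper name is hypothetical and query parameters are ignored here; it is not the code from server_migrate.go.

```go
package main

import (
	"fmt"
	"strings"
)

// sqlitePath strips a file:// or file: prefix from a SQLite DSN.
// Hypothetical helper. The longer prefix is tried first: trimming only
// "file:" from "file:///path/to/db" would leave "///path/to/db".
func sqlitePath(dsn string) string {
	for _, prefix := range []string{"file://", "file:"} {
		if strings.HasPrefix(dsn, prefix) {
			return strings.TrimPrefix(dsn, prefix)
		}
	}
	return dsn
}

func main() {
	fmt.Println(sqlitePath("file:///path/to/db")) // prints "/path/to/db"
	fmt.Println(sqlitePath("file:hub.db"))        // prints "hub.db"
	fmt.Println(sqlitePath("hub.db"))             // prints "hub.db"
}
```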

* docs: add project log for PR #305 review feedback fixes

* fix(hub): prevent duplicate message delivery, guard dispatch state transitions

C1: Call MarkMessageDispatched after successful local dispatch in
messagebroker.go and handlers.go (single-recipient, set[], broadcast).
Without this, successfully dispatched messages remained
dispatch_state=pending and were re-delivered on every broker reconnect
via reconcileBroker.

C2: Return immediately in messagebroker.go deliverToAgent when
CreateMessage fails — without a durable row, a deferred signal has
nothing for the owning node to reconcile.

C3: Guard CompleteBrokerDispatch and FailBrokerDispatch with
state=in_progress CAS predicate so a done dispatch cannot be flipped
to failed or vice versa. Update tests to claim before completing/failing
to match the new CAS guard.
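
The C3 guard amounts to a compare-and-swap on the dispatch state. The sketch below uses a hypothetical in-memory store in place of the real database row; the state names other than in_progress, done, failed, and pending are not taken from the commit.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type dispatchState string

const (
	statePending    dispatchState = "pending"
	stateInProgress dispatchState = "in_progress"
	stateDone       dispatchState = "done"
	stateFailed     dispatchState = "failed"
)

var errStateConflict = errors.New("dispatch not in expected state")

// memStore is a hypothetical in-memory stand-in for the dispatch table.
type memStore struct {
	mu   sync.Mutex
	rows map[string]dispatchState
}

// transition applies from -> to only when the row is currently in `from`,
// mirroring an UPDATE ... WHERE state = from.
func (s *memStore) transition(id string, from, to dispatchState) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.rows[id] != from {
		return errStateConflict
	}
	s.rows[id] = to
	return nil
}

func main() {
	s := &memStore{rows: map[string]dispatchState{"d1": statePending}}
	fmt.Println(s.transition("d1", stateInProgress, stateDone))    // rejected: not claimed yet
	fmt.Println(s.transition("d1", statePending, stateInProgress)) // claim
	fmt.Println(s.transition("d1", stateInProgress, stateDone))    // complete
	fmt.Println(s.transition("d1", stateInProgress, stateFailed))  // rejected: already done
}
```

This is also why the tests now have to claim a dispatch before completing or failing it.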

* fix(hub): reconcile broker→eventbus and hub-native→hub-managed renames after rebase

Post-rebase fixups to align the feature branch with main's refactoring:
- broker package → eventbus package rename (types, imports, methods)
- SetRecipient → GroupRecipient, SetMessageResponse → GroupMessageResponse
- hubNativeProjectPath → hubManagedProjectPath
- ProjectTypeHubNative → ProjectTypeHubManaged
- populateAgentConfig gains ctx parameter
- Add missing handleResourcesImport and handleMessageChannels handlers
- Add ListChannels method to MessageBrokerProxy
- Wire newCommandBus in server_foreground.go
- Restore main's test fixtures for renamed APIs

---------

Co-authored-by: scion-gteam[bot] <271067763+scion-gteam[bot]@users.noreply.github.com>
Co-authored-by: Scion <agent@scion.dev>
…GoogleCloudPlatform#303)

* fix: atomic session-guarded broker disconnect to prevent reconnect race (GoogleCloudPlatform#131)

The onDisconnect callback previously used separate ReleaseRuntimeBrokerConnection
and UpdateRuntimeBrokerHeartbeat calls. When a broker disconnects and reconnects
rapidly, the stale disconnect's offline stamp can clobber the new connection's
online status because UpdateRuntimeBrokerHeartbeat has no session guard — it
unconditionally overwrites status. Provider statuses are also clobbered and never
restored by heartbeats, leaving the broker permanently invisible until hub restart.

Add ReleaseAndMarkBrokerOffline which atomically clears affinity AND stamps
status=offline in a single CAS write. If a concurrent reconnect has already
claimed the broker with a new session, the compare fails and the callback is
a no-op. Also add a re-check guard before updating provider statuses.
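
A rough sketch of the session guard, assuming an in-memory registry in place of the store. The type and field names are hypothetical; only the behavior (clear affinity and stamp offline in one guarded write) comes from the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// brokerRow is a hypothetical, simplified view of a runtime broker record.
type brokerRow struct {
	SessionID string
	Status    string
}

type registry struct {
	mu      sync.Mutex
	brokers map[string]*brokerRow
}

// releaseAndMarkOffline clears the session and sets status=offline in one
// step, but only if the row still belongs to the disconnecting session.
func (r *registry) releaseAndMarkOffline(brokerID, sessionID string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	row, ok := r.brokers[brokerID]
	if !ok || row.SessionID != sessionID {
		return false // a newer session owns the broker; the stale callback is a no-op
	}
	row.SessionID = ""
	row.Status = "offline"
	return true
}

func main() {
	r := &registry{brokers: map[string]*brokerRow{"b1": {SessionID: "s1", Status: "online"}}}
	// A reconnect claims the broker with session s2 before the stale
	// disconnect callback for s1 runs.
	r.brokers["b1"].SessionID = "s2"
	fmt.Println(r.releaseAndMarkOffline("b1", "s1"), r.brokers["b1"].Status)
	fmt.Println(r.releaseAndMarkOffline("b1", "s2"), r.brokers["b1"].Status)
}
```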

* docs: add project log for broker disconnect race fix unification
…rm#301)

* docs(design): reduced resource clone/delete design (resolved review)

* refactor: remove dead Locked field from Template and HarnessConfig models

Remove the Locked bool field, all 16 enforcement sites across 6 handler
files, the force query parameter from delete endpoints, and 3
locked-template tests, and add a DB migration to drop the column. No
production code ever set Locked=true, so the field was dead; removing it
simplifies the handlers for the upcoming clone/delete feature.

* feat: add harness-config clone endpoint, authz hardening, and slug uniqueness

- Add handleHarnessConfigClone mirroring template clone
- Add CheckAccess authz to deleteTemplateV2, handleTemplateClone, deleteHarnessConfig, handleHarnessConfigClone
- Add DB migration V55: UNIQUE constraint on (slug, scope, scope_id)
- Return 409 Conflict on slug collision during clone
- Add clone failure cleanup
- Add tests for clone, authz, and slug collision

* feat(web): add Clone/Delete row actions and clone-from-global to resource list

- Add Clone and Delete action menu to shared resource-list component
- Add delete confirmation dialog with deleteFiles checkbox (default on)
- Add clone dialog with name input and 409 collision handling
- Add clone-from-global picker in project settings view
- Unify on resource-changed event (migrate resource-imported)
- Gate actions on capabilities (canClone, canDelete properties)

* fix: address PR review — cleanup orphaned files on DB create failure, remove redundant clone method

- Add stor.DeletePrefix cleanup when CreateTemplate/CreateHarnessConfig fails
  after files were already copied (prevents orphaned storage files)
- Remove redundant confirmCloneFromGlobal method — confirmClone already
  handles cross-scope clone via the component's scope/scopeId properties

* fix: adapt Locked removal and slug constraint to Ent-based schema

Remove Locked references from entadapter, remove stale sqlite.go
(replaced by Ent ORM upstream), add UNIQUE(slug, scope, scope_id)
to Ent schema indexes, and regenerate Ent code.

* fix: adapt tests and entadapter for Ent-based store (UUID IDs, no Locked)

- Use api.NewUUID() for all test entity IDs (Ent enforces UUID format)
- Remove Locked field from entadapter create/update calls
- Remove stale sqlite.go (replaced by Ent ORM upstream)
- Add UNIQUE(slug, scope, scope_id) to Ent schema indexes
…form#309)

* fix(hub): make web session replica-portable to fix OAuth state_mismatch

OAuth login behind the load balancer intermittently failed with
state_mismatch: the CSRF state token (and the entire web session) was
stored in a gorilla FilesystemStore on the handling replica's local
disk, while the browser only carried a session-ID cookie. When the LB
routed /auth/login and /auth/callback to different replicas, the
callback replica had no matching session file -> empty state ->
state_mismatch. It only "worked" when both hops happened to hit the
same backend.

The same flaw affected the post-login session: sessionToBearerMiddleware
reads the Hub access/refresh JWTs from that disk-local store on every API
request, so sessions silently dropped whenever a follow-up request
landed on a different replica.

Replace the FilesystemStore with an encrypted, signed gorilla
CookieStore so the whole session lives in the client's cookie and any
replica sharing SESSION_SECRET can read it. Keys are derived
deterministically from SESSION_SECRET (32-byte HMAC auth key + 32-byte
AES-256 encryption key, domain-separated). No DB, no migration; works
with N replicas.

The original switch to disk was motivated by a "JWT tokens exceed 4096
bytes" concern. Measured against the current compact HS256 tokens the
full session (identity + access + refresh) encodes to ~2.6 KB, well
under the browser's ~4 KB per-cookie cap, so the securecookie length
limit is left in force (an oversize session now returns an error and is
logged instead of being silently dropped).

Tests: replace the obsolete NoMaxLengthLimit test with a cross-replica
round-trip regression test (cookie minted by replica A decodes on
replica B with the same secret; carries OAuth state + post-login tokens)
plus a negative test (a different secret cannot decode the cookie).

* fix(hub): derive JWT signing keys from shared SESSION_SECRET to fix cross-replica login loop

The cookie-store fix (0515e2a) made the web session replica-portable, but
the Hub JWT *inside* the cookie is still signed with a per-replica key:
ensureSigningKey scopes signing keys to (scope=hub, scope_id=hubID) and
hubID = sha256(hostname)[:12]. The integration env runs two replicas of one
logical hub behind a single LB, sharing one Postgres DB and one
SESSION_SECRET but with different hostnames -> different hubIDs -> different
HS256 signing keys.

So a user JWT minted on replica A failed signature verification on replica B
(go-jose: error in cryptographic primitive); refresh failed too (refresh
token signed with the same foreign key), so sessionToBearerMiddleware
declared the session irrecoverably invalid, DELETED the cookie (MaxAge=-1)
and returned session_expired. The cookie deletion turns it into a redirect
loop: dashboard flashes, then /login?error=session_expired.

Fix: extend the 0515e2a approach (replica-portable via the shared secret)
from the cookie to the keys inside it. Add ServerConfig.SharedSigningSecret;
when set, ensureSigningKey derives the agent and user signing keys
deterministically from it (domain-separated by key name) and bypasses
per-host secret-backend storage. cmd feeds the same --session-secret /
SESSION_SECRET value into both the web cookie store and the hub config via a
new resolveSessionSecret() helper. Empty secret keeps the existing per-hub
behavior (no regression for single-node/local dev).

Tests: cross-replica round trip (different hubID + same secret -> identical
keys, token minted on A validates on B; different secret cannot) plus
pre-configured-key precedence.

Note: rollout rotates the signing keys (now derived from SESSION_SECRET), so
existing web/CLI tokens are invalidated once and users re-login.

---------

Co-authored-by: Scion <agent@scion.dev>
…events (GoogleCloudPlatform#312)

A rapid session.start → session.end sequence from a spurious sciontool
could permanently reset an agent's phase even while the agent works
normally. This adds two guards:

1. Phase regression guard: rejects transitions that would move an agent
   backward in its forward-progress lifecycle (e.g. running → starting)
   in both the status update handler and broker heartbeat handler.

2. Activity-driven phase auto-correction: when an activity that implies
   the agent is running (working, thinking, executing, etc.) arrives but
   the phase is pre-running, auto-promotes the phase to running.
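
The two guards can be sketched like this. The phase names, their ordering, and the activity set are assumptions for illustration; the commit names only running, starting, and a few example activities.

```go
package main

import "fmt"

// phaseRank orders phases along the forward-progress lifecycle.
// Assumed names and ordering, for illustration only.
var phaseRank = map[string]int{"created": 0, "starting": 1, "running": 2}

// impliesRunning lists activities that only occur while running (assumed set).
var impliesRunning = map[string]bool{"working": true, "thinking": true, "executing": true}

// allowTransition rejects moves to a lower-ranked phase.
func allowTransition(from, to string) bool {
	fr, okFrom := phaseRank[from]
	tr, okTo := phaseRank[to]
	if !okFrom || !okTo {
		return true // phases outside the forward lifecycle are not guarded in this sketch
	}
	return tr >= fr
}

// correctPhase promotes a pre-running phase to running when the reported
// activity implies the agent is already running.
func correctPhase(phase, activity string) string {
	rank, known := phaseRank[phase]
	if known && impliesRunning[activity] && rank < phaseRank["running"] {
		return "running"
	}
	return phase
}

func main() {
	fmt.Println(allowTransition("running", "starting")) // prints "false"
	fmt.Println(allowTransition("starting", "running")) // prints "true"
	fmt.Println(correctPhase("starting", "thinking"))   // prints "running"
}
```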

Fixes GoogleCloudPlatform#124
…GoogleCloudPlatform#313)

Also unset SCION_PROJECT_ID when clearing hub context env vars, since
IsHubContext() checks all four env vars and a leftover SCION_PROJECT_ID
causes FindProjectRoot() to return a synthetic path instead of failing.
…tform#311)

* Fix agent list task overflow and unify action buttons

Task cell in list view used inline span styling that silently ignored
max-width/overflow constraints, allowing long task text to push action
buttons off-screen. Switch to display:-webkit-box with line-clamp:2
so text wraps to at most two lines with ellipsis.

Card view action buttons now render icon-only (matching list view),
with sl-tooltip and aria-label for accessibility. Both views share a
single renderActionButtons helper, eliminating the duplicated button
logic. Color-coded hover effects added to action buttons in both
views: red for stop/delete, amber for suspend, green for resume/start.

Closes GoogleCloudPlatform#134
Closes GoogleCloudPlatform#135

* Fix agent list task overflow and unify action buttons

Task cell in list view used inline span styling that silently ignored
max-width/overflow constraints, allowing long task text to push action
buttons off-screen. Switch to display:-webkit-box with line-clamp:2
so text wraps to at most two lines with ellipsis.

Card view action buttons now render icon-only (matching list view),
with sl-tooltip and aria-label for accessibility. Both views share a
single renderActionButtons helper, eliminating the duplicated button
logic. Color-coded hover effects use translucent rgba backgrounds
that work in both light and dark mode: red for stop/delete, amber
for suspend, green for resume/start.

Closes GoogleCloudPlatform#134
Closes GoogleCloudPlatform#135

* Add before/after screenshots for PR review

Screenshots captured from the real running app (Vite dev server +
fetch mock for agent data). Shows before/after for both issues in
light mode and dark mode.

* Fix hover on disabled buttons and tooltip on disabled terminal

Add :not([disabled]) to hover CSS selectors so color-coded hover
effects don't apply to disabled action buttons. Wrap the Terminal
button in an inline-flex span inside sl-tooltip so the tooltip
remains accessible even when the button has pointer-events:none.
* docs(design): auth proxy mode (Google IAP) architecture

Add design for an exclusive proxy human-auth mode that derives the user from
a verified Google IAP signed header (X-Goog-IAP-JWT-Assertion), reusing the
existing domain/allowlist/admin provisioning controls. Also specifies a
hub-minted transport-auth layer (dedicated SA, generalizing PR GoogleCloudPlatform#307) so agents
can traverse the IAP / Cloud Run-invoker front door, with a generalized
array-based token refresh.

* refactor(hub): extract provisionUser, dedupe OAuth find-or-create

Extract the duplicated find-or-create-user block from four OAuth
handlers (handleAuthLogin, handleAuthToken, handleCLIAuthToken,
completeOAuthLogin) into a single provisionUser method on Server.

The new method encapsulates:
1. Authorization check (isUserAuthorized) with audit logging
2. GetUserByEmail / CreateUser (find-or-create)
3. Profile backfill (DisplayName, AvatarURL when empty)
4. Admin promotion (when admin list changes)
5. Hub membership enrollment (ensureHubMembership)

Introduces ExternalUserInfo struct (decoupled from OAuthUserInfo) and
ErrAccessDenied sentinel error for caller-side HTTP response mapping.

This is Phase 0 of the auth-proxy-mode feature — pure refactor with
no behavior change. The proxy middleware (Phase 1) will call the same
provisionUser method.

NOTE: No suspended-user check is added. The existing OAuth flow does
not check user.Status == "suspended" either; adding it here would
change behavior. This gap is documented for Phase 1.

* docs(project-log): record provisionUser extraction findings

* feat(auth): implement proxy auth mode with IAP JWT verification (Phase 1)

Add exclusive proxy auth mode for Google IAP signed-header authentication:

- pkg/hub/proxyauth.go (NEW): ProxyAuthenticator interface, IAPAuthenticator
  with ES256 JWT verification via go-jose/v4, JWKS lazy-fetch cache with
  periodic refresh + on-miss refresh for unknown kids + transient failure
  tolerance (last-good keys).

- pkg/config: auth.mode selector (oauth|proxy|dev), auth.proxy section with
  provider/iap.audience/overrides in both DevAuthConfig (GlobalConfig) and
  V1AuthConfig (settings.yaml). Wire conversion in both directions.

- pkg/hub/auth.go: Replace IP-only extractProxyUser branch with
  ProxyAuthenticator path. Add 60s resolution cache (ProxyUserCache) wrapping
  provisionUser — signature verification runs every request, only the store
  lookup is cached. Legacy extractProxyUser preserved when no authenticator
  is configured.

- pkg/hub/handlers_auth.go: Add suspended-user gate to provisionUser —
  rejects Status=="suspended" with ErrUserSuspended. This is an intentional
  behavior change sanctioned by the design doc, closing the pre-existing
  OAuth suspended-login gap documented in Phase 0.

- pkg/hub/web.go: In proxy mode, handleAuthProviders returns no OAuth
  providers; handleLogout redirects to IAP's clear_login_cookie endpoint.

- cmd/server_foreground.go: Construct IAPAuthenticator when mode==proxy &&
  provider==iap, wire into ServerConfig.ProxyAuth.

Security: audience binding is mandatory; only the signed JWT assertion is
authoritative (X-Goog-Authenticated-User-* headers ignored); clock skew
±30s; JWKS cache handles key rotation and transient fetch failures.

* test(auth): add comprehensive IAPAuthenticator unit tests

Tests using self-generated ES256 key pair + httptest JWKS server:
- Valid assertion -> correct ProxyUserInfo (subject/email stripped, lowercased)
- Bad signature -> error
- Wrong audience -> error (mandatory binding)
- Wrong issuer -> error
- Expired token (past 30s skew) -> error
- Missing header -> (nil, nil) fall-through
- Unknown kid triggers JWKS refresh and succeeds
- Custom issuer override for testing
- HD (hosted domain) claim extraction
- Email lowercasing
- JWKS cache transient failure tolerance (serves last-good keys)

* style: fix gofmt formatting in proxyauth_test.go and settings_v1.go

* docs(project-log): record auth-proxy-mode Phase 1 implementation

* config: add auth.transport config for outbound transport auth

Add TransportAuthConfig (hub_config.go) and V1TransportConfig
(settings_v1.go) for the transport-layer auth that lets agents
traverse IAP / Cloud Run invoker front doors. Config supports
mode (none|cloudrun_invoker|iap), oidcAudience, and
platformAuthSA fields. Wire into V1↔GlobalConfig conversion
and env key mapping.

Phase 2 item 6 of auth-proxy-mode.

* hub: add TransportTokenMinter interface and implementations

Introduce the TransportTokenMinter interface for minting Google OIDC
ID tokens that let agents traverse platform guards (IAP / Cloud Run
invoker). Three implementations:

- gcpTransportMinter: production impl using IAM Credentials API
  (generateIdToken) to impersonate a dedicated platform-auth SA.
  Uses already-vendored google.golang.org/api/iamcredentials/v1.
- noopTransportMinter: returns error when transport auth is disabled.
- FakeTransportMinter: exported test double for other packages.

Also adds RefreshTokenEntry type for the generalized tokens[] array
and parseJWTExpiry for extracting expiry from ID tokens.

All tests pass with no live GCP dependency (httptest fakes).

Phase 2 item 6 of auth-proxy-mode.

* hub: wire transport token minter into ServerConfig and dispatch

Add TransportMode, TransportAudience, TransportMinter fields to
ServerConfig and wire them through to the Server struct and
HTTPAgentDispatcher. Transport tokens are injected as env vars
(SCION_TRANSPORT_TOKEN, SCION_TRANSPORT_AUDIENCE,
SCION_TRANSPORT_TOKEN_EXPIRY) into agent dispatch payloads in
all three dispatch paths (Create, Start, Restart).

server_foreground.go constructs a gcpTransportMinter from
auth.transport config, deriving audience from hubEndpoint
for cloudrun_invoker mode.

When transport mode is "none" or unset, no minter is created
and no transport tokens are injected — zero impact on existing
deployments.

Phase 2 item 6 of auth-proxy-mode.

* hub: extend token refresh response with generalized tokens[] array

The agent token refresh handler now returns a tokens[] array
alongside the existing token/expires_at fields for backward
compatibility. Old clients ignore tokens[]; new clients use it
to apply both app-layer and transport-layer tokens.

When transport auth is configured (transportMinter != nil), the
response includes a google_oidc transport token entry with the
configured audience. When disabled, only the app scion_access
entry appears.

Transport token minting errors are logged but don't fail the
refresh — the app token is always returned.

Phase 2 item 7 of auth-proxy-mode.

* sciontool: add pluggable OIDC transport for agent outbound auth

Implement the agent-side transport-layer auth with two pluggable
token sources:

- injectedTokenSource: uses the hub-provided SCION_TRANSPORT_TOKEN
  env var (cold start), then refreshed via the tokens[] array on
  subsequent refresh calls.
- metadataTokenSource: fetches OIDC from the GCE metadata server
  (passthrough/on-GCE mode, the PR GoogleCloudPlatform#307 pattern).

Selection logic: SCION_TRANSPORT_TOKEN env → injected mode;
else if on GCE → metadata mode; else → no OIDC transport.

The oidcTransport RoundTripper injects Authorization: Bearer on
outbound hub requests. Graceful degradation: if token fetch fails,
the request proceeds without the header (the hub can still auth
via X-Scion-Agent-Token).

Client changes:
- Add oidcSource field and configureOIDCTransport() in NewClient()
- Update RefreshTokenResponse with tokens[] array (backward compat)
- RefreshToken() applies transport tokens via applyRefreshTokens()
- Refresh scheduling uses shortest-lived entry (5-min margin for
  transport tokens vs 2h for scion tokens)

23 new tests covering both sources, transport, configuration,
end-to-end dual-header, and refresh token application.

Phase 2 item 8 of auth-proxy-mode.

* docs(project-log): record auth-proxy-mode Phase 2 implementation

* docs: add IAP proxy auth deployment guide (Phase 3)

Add comprehensive deployment documentation for the IAP + Cloud Run
invoker topology, covering inbound human IAP authentication,
outbound agent transport auth (dual-layer OIDC + scion token),
security considerations, and an end-to-end GCP setup checklist.
All config keys and env vars verified against shipped code.

* fix: prevent JWKS cache stampede and add HTTP client timeout

- resolveHTTPClient() now returns a client with 10s timeout instead of
  http.DefaultClient (which has no timeout), preventing hangs on JWKS fetches.
  Tests that inject their own HTTPClient are unaffected.

- JWKS cache refresh now debounces on lastAttempted (set at the start of
  every attempt, success or failure) instead of lastFetched (success only).
  This prevents stampedes during persistent JWKS outages where every
  cache-miss would trigger an unbounded refresh.

- Added a refreshing guard to prevent concurrent in-flight refreshes
  (proactive background refresh + synchronous miss-refresh could race).

- Network I/O is now performed outside the write lock to avoid holding
  the mutex across HTTP requests.

- Added TestJWKSCache_StampedePreventionDuringOutage to verify that
  repeated misses during an outage do not cause repeated fetches within
  the debounce window.

* fix: replace custom splitJWT with strings.Split and cache IAM service

- Replace the hand-rolled splitJWT function with strings.Split(token, ".").
  Behavior is identical for well-formed JWTs; the custom function is deleted.

- Cache the IAM credentials service client in gcpTransportMinter using
  sync.Once so it is created once and reused across MintIDToken calls
  instead of creating a new HTTP client/service on every invocation.
  Uses context.Background() for the long-lived client construction;
  per-call ctx continues to be passed to .Context(ctx).Do().
  FakeTransportMinter is unaffected.
…oogleCloudPlatform#302)

* fix: resolve workspace file browser to groves/ instead of projects/

The Hub UI file browser was showing the wrong directory contents. The
hubManagedProjectPath() function resolved workspace paths to
~/.scion/projects/<slug>/ (project metadata) instead of
~/.scion/groves/<slug>/ (the actual git checkout mounted as /workspace
in agents).

Reverse the lookup priority: check groves/ first, fall back to
projects/, and default to groves/ when neither has content.

Fixes GoogleCloudPlatform#130

* docs: add project log for issue GoogleCloudPlatform#130 workspace path fix

* fix: guard hubManagedProjectPath against empty slug

Prevent hubManagedProjectPath from resolving to the parent directory
when called with an empty slug. Add unit test for this case.
…by/owner_id)

The Agent Ent schema modeled created_by/owner_id as foreign keys to the
users table. When an agent creates a sub-agent, those columns hold the
*creating agent's* ID, which has no users-table row, so Postgres rejected
the insert with a foreign-key violation. mapError maps that to
ErrInvalidInput, surfacing as a detail-free "validation_error: Invalid
input (status: 400)" on every agent-initiated `scion start`. User-created
agents were unaffected, masking the regression (introduced when GoogleCloudPlatform#304
ported the agent store onto Ent).

created_by/owner_id are polymorphic principal references (user OR agent),
like ancestry. Drop the User-typed edges and keep them as plain principal
UUID fields; resolve the delegation creator by ID and tolerate "no such
user". Atlas AutoMigrate drops the two FK constraints on existing DBs at
next boot.

Tests: the sole sub-agent creation test only passed because it seeded a
fake user row sharing the agent's ID — an impossible production state.
Remove that workaround so it exercises the real path, and add store/ent
regression tests asserting a non-user principal ID is accepted.
…o agent containers (GoogleCloudPlatform#322)

* Add sciontool doctor and agent auth reset infrastructure

When an agent's hub JWT expires and the refresh loop fails (e.g. hub
signing key rotation), the agent becomes a zombie: running locally but
invisible to the hub. This adds two features to diagnose and recover:

1. `sciontool doctor` command — runs inside the agent container to check
   env vars, token validity/expiry, hub connectivity, auth status, and
   GCP metadata/GitHub token health. Prints actionable remediation.

2. Auth reset mechanism — allows pushing a fresh token into a running
   agent without restarting. The flow is:
   - Hub generates a new agent JWT via DispatchAgentResetAuth
   - Broker's /reset-auth endpoint writes the token file via exec
   - Broker sends SIGUSR2 to sciontool init (PID 1)
   - Init re-reads the token, updates the hub client, restarts the
     token refresh loop, and sends an immediate heartbeat

Also adds Client.SetToken() for in-memory token updates.

* Add scion reset-auth CLI command and hub API endpoint

Adds the user-facing `scion reset-auth <agent>` command that triggers
an auth reset on a running agent via the Hub. Also adds:
- Hub handler for POST /api/v1/agents/{id}/reset-auth
- hubclient AgentService.ResetAuth() method

---------

Co-authored-by: Scion Agent (eng-manager) <agent@scion.dev>
Adds a "Reset Auth" button in the agent detail header actions area,
visible when the agent is running. Clicking it calls the hub's
POST /api/v1/agents/{id}/reset-auth endpoint, which generates a
fresh JWT and pushes it into the running container without restart.
GoogleCloudPlatform#323)

* Make SIGUSR2 signal best-effort in reset-auth handler

The kill -USR2 step can fail (e.g. PID 1 is not sciontool init, or
the process doesn't handle the signal). Since the token file write
already succeeded and the refresh loop will pick up the new token
without the signal, treat signal failure as a warning rather than
returning a 500 error.

* Add admin bulk reset-auth endpoint

POST /api/v1/admin/agents/reset-auth-all lists all running agents and
dispatches an auth reset for each, returning a per-agent success/failure
summary. Admin role required.

* Add Reset Auth All button to admin maintenance page

Adds a Quick Actions section with a "Reset Auth — All Running Agents"
button that calls POST /api/v1/admin/agents/reset-auth-all and displays
a per-agent success/failure summary inline.

---------

Co-authored-by: Scion Agent (eng-manager) <agent@scion.dev>
ptone and others added 28 commits June 12, 2026 04:41
…metrics) (GoogleCloudPlatform#407)

Clarify the two distinct metric families in Scion:
- Infrastructure metrics (scion.hub.*, scion.db.*, scion.dispatch.*) for
  platform health, produced by the Hub process
- Agent metrics (gen_ai.*, agent.*) for harness/model telemetry, produced
  inside agent containers via the telemetry pipeline

Also defines the Telemetry pipeline term.

Co-authored-by: Scion Agent (metrics-architect) <agent@scion.dev>
…tform#410)

Co-authored-by: Scion Agent (harness-build-blocker-fix) <agent@scion.dev>
* skill-bank M5a: add RoutingSkillResolver and scheme detection

Introduce RoutingSkillResolver that groups SkillReferences by URI scheme
and dispatches each group to a registered scheme-specific resolver. The
hub resolver serves as the fallback for skill:// URIs and bare names.

Includes detectScheme() which routes gh://, gcp-skill://, and GitHub
full URLs to their respective resolvers, with comprehensive tests
covering fallback routing, scheme dispatch, mixed batches, unsupported
schemes, nil fallback safety, and error propagation.
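
Scheme detection and grouping might be sketched as below. The returned scheme keys, the example URIs, and the choice to send GitHub URLs to the gh resolver are assumptions; the real detectScheme and RoutingSkillResolver operate on SkillReference values rather than plain strings.

```go
package main

import (
	"fmt"
	"strings"
)

// detectScheme maps a skill URI to a resolver key. An empty key means the
// fallback (hub) resolver. Simplified sketch with assumed rules.
func detectScheme(uri string) string {
	switch {
	case strings.HasPrefix(uri, "gh://"), strings.HasPrefix(uri, "https://github.com/"):
		return "gh"
	case strings.HasPrefix(uri, "gcp-skill://"):
		return "gcp-skill"
	case strings.HasPrefix(uri, "skill://"), !strings.Contains(uri, "://"):
		return "" // skill:// URIs and bare names go to the fallback
	default:
		// Unknown scheme: returned as-is so the router can reject it.
		return strings.SplitN(uri, "://", 2)[0]
	}
}

// groupByScheme batches URIs so each resolver is called once per group.
func groupByScheme(uris []string) map[string][]string {
	groups := make(map[string][]string)
	for _, u := range uris {
		key := detectScheme(u)
		groups[key] = append(groups[key], u)
	}
	return groups
}

func main() {
	g := groupByScheme([]string{"lint", "skill://core/fmt", "gh://org/repo/skill", "gcp-skill://proj/x"})
	fmt.Println(len(g[""]), len(g["gh"]), len(g["gcp-skill"])) // prints "2 1 1"
}
```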

* skill-bank M5a: wire routing resolver at CLI and broker call sites

Replace direct HubSkillResolver construction with RoutingSkillResolver
wrapping the hub resolver at both CLI (cmd/create.go) and broker
(pkg/runtimebroker/handlers.go) call sites. CachingSkillResolver wraps
the routing resolver so content-hash caching applies to all source types.

Add SkillURIScheme() utility to pkg/api/skill_uri.go for extracting the
scheme portion of a skill URI without full parsing.

* skill-bank M5c: add SkillRegistry schema, store, and models

Add the SkillRegistry Ent schema with fields for name, endpoint,
type (hub/gcp), trust_level (trusted/pinned), auth_token, resolve_path,
pinned_hashes, and status. Define the SkillRegistryStore interface and
its Ent adapter implementation with CRUD, pinned hash management, and
list operations. Embed in the composite store.

* skill-bank M5c: add skill registry CRUD handlers

Add admin-only HTTP handlers for skill registry CRUD operations:
create, list, get, update, delete, and pin hash. Register routes at
/api/v1/skill-registries. Enforce HTTPS-only endpoints, validate
registry names, and never expose auth tokens in API responses.

* skill-bank M5c: add federation proxy and trust enforcement

Add federateResolve to proxy skill resolution requests to external
registries. The resolve endpoint now detects non-scion registry URIs
and delegates to the configured external registry instead of local
resolution. Supports trusted (pass-through) and pinned (hash
verification) trust levels.

* skill-bank M5c: add hub client and CLI for skill registries

Add SkillRegistryService to the hub client with List, Get, Create,
Update, Delete, and Pin operations. Add CLI subcommands under
'scion skills registries' for list, add, show, update, remove, and
pin operations with table and JSON output support.

* skill-bank M5c: add federation and registry tests

Add 16 tests covering federation proxy (trusted/pinned happy paths,
hash mismatch, missing pin, unknown/disabled/wrong-type registry,
external registry down, auth token forwarding, custom resolve path)
and registry CRUD (lifecycle, duplicate name rejection, HTTPS-only
enforcement, non-admin rejection, auth token not in responses, pin).

* skill-bank M5c: fix federation security issues (H1, H3, L1)

H1: Add 10MB body size limit on federation success path to prevent OOM.
H3: Disable redirect following on federation HTTP client to prevent
credential leakage via Authorization header on cross-origin redirects.
L1: Create federation HTTP client once on Server struct instead of
per-call, enabling connection pooling and proper test injection.

* skill-bank M5d: add gcp-skill:// URI parser and tests

Add ParseGCPSkillURI which extracts alias, skill ID, and optional
version from gcp-skill://alias/SKILL_ID[@Version] URIs. This is the
first building block for the GCP Vertex AI skill resolver.
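
A rough sketch of parsing the `gcp-skill://alias/SKILL_ID[@Version]` form. The `GCPSkillRef` struct, its field names, and the validation rules are assumptions. The commit only names `ParseGCPSkillURI` and the three parts it extracts.

```go
package main

import (
	"fmt"
	"strings"
)

// GCPSkillRef is a hypothetical result type for this sketch.
type GCPSkillRef struct {
	Alias   string
	SkillID string
	Version string // empty when no @Version suffix is present
}

// ParseGCPSkillURI splits gcp-skill://alias/SKILL_ID[@Version].
func ParseGCPSkillURI(uri string) (GCPSkillRef, error) {
	rest, ok := strings.CutPrefix(uri, "gcp-skill://")
	if !ok {
		return GCPSkillRef{}, fmt.Errorf("not a gcp-skill URI: %q", uri)
	}
	alias, skill, ok := strings.Cut(rest, "/")
	if !ok || alias == "" || skill == "" {
		return GCPSkillRef{}, fmt.Errorf("expected gcp-skill://alias/SKILL_ID: %q", uri)
	}
	skillID, version, hasVersion := strings.Cut(skill, "@")
	if skillID == "" || strings.Contains(skillID, "/") {
		return GCPSkillRef{}, fmt.Errorf("invalid skill ID in %q", uri)
	}
	if hasVersion && version == "" {
		return GCPSkillRef{}, fmt.Errorf("empty version in %q", uri)
	}
	return GCPSkillRef{Alias: alias, SkillID: skillID, Version: version}, nil
}

func main() {
	ref, err := ParseGCPSkillURI("gcp-skill://prod/abc123@v2")
	fmt.Printf("%+v %v\n", ref, err)
}
```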

* skill-bank M5d: add GCPSkillResolver with Vertex AI API integration

Implements gcp-skill:// resolution via GCP Vertex AI Skills API.
The resolver uses ADC for authentication, looks up registry aliases
via an injected RegistryLookup function, fetches skill metadata and
files from the GCP API, and computes content hashes for verification.

* skill-bank M5d: wire GCP resolver at broker and add tests

Register the GCPSkillResolver in the broker's skill resolver chain.
The registry alias lookup uses the Hub API (which accepts name-based
lookups). Add comprehensive tests covering happy path, error cases
(unknown alias, disabled registry, wrong type, GCP 404/403, ADC
failure, empty files), and alias forwarding.

* skill-bank M5d: fix version validation, SSRF defense, and response size limit

F1: Validate that if a version is requested via @Version in the URI,
the GCP API response version must match — reject with a clear error
otherwise.

F2: Validate file download URLs before fetching: must use HTTPS
(except localhost), must share the same host as the registry endpoint,
and must not target link-local (169.254.x.x) or RFC 1918 addresses.

F6: Wrap fetchSkillMetadata response body with io.LimitReader (1MB)
to prevent OOM from oversized API responses.
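
The F2 checks might look roughly like the sketch below. `validateDownloadURL` is a hypothetical name and the rule ordering is an assumption. Note that this version only inspects literal IP hosts: a hostname that resolves to a private address would need a check at dial time, which is not shown.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// validateDownloadURL applies the three rules from the commit message:
// HTTPS only (plain HTTP tolerated for localhost), same host as the
// registry endpoint, and no link-local or private literal addresses.
func validateDownloadURL(raw, registryEndpoint string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("invalid download URL: %w", err)
	}
	reg, err := url.Parse(registryEndpoint)
	if err != nil {
		return fmt.Errorf("invalid registry endpoint: %w", err)
	}
	host := u.Hostname()
	if u.Scheme != "https" && !(u.Scheme == "http" && host == "localhost") {
		return fmt.Errorf("download URL must use HTTPS: %q", raw)
	}
	if !strings.EqualFold(u.Host, reg.Host) {
		return fmt.Errorf("download host %q does not match registry host %q", u.Host, reg.Host)
	}
	if ip := net.ParseIP(host); ip != nil {
		// IsPrivate covers RFC 1918 (and IPv6 unique local) ranges;
		// IsLinkLocalUnicast covers 169.254.0.0/16 and fe80::/10.
		if ip.IsPrivate() || ip.IsLinkLocalUnicast() {
			return fmt.Errorf("download URL targets a restricted address: %s", ip)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateDownloadURL("https://reg.example.com/f.md", "https://reg.example.com"))
	fmt.Println(validateDownloadURL("https://169.254.169.254/x", "https://169.254.169.254"))
}
```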

* skill-bank M5b: add gh:// URI parser and tests

* skill-bank M5b: add GitHubSkillResolver with Contents API integration

* skill-bank M5b: add full GitHub URL parser

* skill-bank M5b: wire GitHubSkillResolver at CLI and broker

* skill-bank M5b: add GitHub resolver integration tests

* skill-bank M5b: fix input sanitization and response size limit

* skill-bank M5: fix PR review findings (nil checks, SSRF IPv6, resolvePath, ADC caching)

* skill-bank M5: fix SSRF redirect bypass, ADC context, and PinSkillHash race

* skill-bank M5: fix federation URI translation, CLI GCP wiring, and path escaping

* skill-bank M5: fix CI — gofmt and missing mock method

---------

Co-authored-by: Scion Agent (skill-bank-m5a-dev3) <agent@scion.dev>
* fix: error contracts, integration feedback, outbound errors, and wake audit

Stream B — Non-existent agent error contract:
- Move agent lookup before message persistence in deliverToAgent() to
  prevent orphan message rows for deleted agents
- Add DELIVERY_FAILED notification type dispatched to agent senders
  when broker-path delivery targets a non-existent agent
- Enhance Hub API 404 responses with agent slug and project context
- Mark scheduled events targeting deleted agents with status=failed

Stream I — Outbound agent-to-user error feedback:
- Persistence failure returns 500 (was silent 200 OK)
- Missing recipient returns 400 (removed silent creator fallback)
- Broker dispatch failure returns 502 with clear message
- Successful sends return message_id, status, recipient, recipient_id

Stream K — Wake audit and test coverage:
- Add TestHandleAgentMessage_WakeSuspended (primary use case was untested)
- Add wake failure scenario tests (start fails, delivery fails)
- Add test for messaging suspended agent without --wake
- Bump wake timeout from 15s to 30s to match broker retry deadline
- Add distinct error for wake-success-delivery-failure
- Reject messages to suspended agents without --wake with clear error

Stream C — Integration error feedback:
- Add ActionAttach permission check for user: senders in handleBrokerInbound
- Validate default agents against agent cache before routing in Telegram
- Report Hub delivery errors back to originating Telegram chat
- Add error cooldown (max 1 per 5 min per chat+thread+error-type)
- Include remediation suggestions in error responses

* fix: address review findings M1, M2, L1, L2

M1: Fix misleading "Message persisted but delivery failed" error message
    to "Message delivery failed" — the broker path doesn't persist before
    dispatch, so the old message was incorrect.

M2: Add lazy eviction to errorCooldown map in shouldSuppressError() when
    map exceeds 1000 entries, preventing unbounded growth in long-running
    Telegram plugin instances.

L1: Fix gofmt alignment on ErrCodeAgentNotFound and ErrCodeDeliveryFailed
    constants.

L2: Inline responseStatus and deliveryStatus variables that were never
    reassigned — every error path returns early, so the scaffolding added
    no value.

* feat(messaging): broadcast partial-failure reporting and CLI sender feedback

Stream H — CLI Sender Feedback Improvements:
- Add agent phase pre-check in handleAgentMessage: non-running agents
  return 409 Conflict with guidance (suspended: use --wake, stopped/error:
  use scion start, other: wait for running state).
- Extend 200 OK response with message_id, status, agent, agent_phase.
- Update hubclient SendStructuredMessage to return *MessageResponse.
- CLI differentiates "delivered" (200) from "deferred" (202) output.

Stream G — Broadcast Partial-Failure Reporting:
- Broadcasts return 202 Accepted with targeting info: total agents,
  targeted (running) count, skipped count with phase breakdown.
- broadcastDirect publishes DELIVERY_FAILED notifications for per-agent
  delivery failures.
- Message broker fan-out publishes DELIVERY_FAILED on dispatch failures.
- CLI grove-scoped broadcast uses Hub broadcast endpoint and prints
  acceptance summary with targeted/skipped breakdown.
- Update hubclient BroadcastMessage to return *BroadcastResponse.

* fix: address review findings M1, M2, M3, M4

M1: Eliminate double ListAgents TOCTOU in direct-broadcast path by
    passing pre-classified running agents from the handler's single
    query to broadcastDirect.

M2: Add TODO noting --all path needs P3 upgrade when a global
    broadcast endpoint is added.

M3: Restore zero-targeted guard — print "No running agents" when
    targeted count is 0 instead of misleading acceptance message.

M4: Sort skipped breakdown phases alphabetically for deterministic
    CLI output.

* style: fix gofmt formatting in broker_v2.go and agents.go

* feat: channel validation, group[] rename, and scheduled event cleanup (GoogleCloudPlatform#213)

Stream A — Channel/flag validation:
- Validate --channel names against registered channels at send time
  in CLI (sendMessageViaHub, sendOutboundMessageViaHub) and Hub
  (handleAgentOutboundMessage).
- Return actionable error naming available channels.

Stream F — set[] to group[] rename:
- Accept both group[ and set[ prefixes in IsGroupRecipient/ParseGroupRecipient
  for backward compatibility.
- FormatGroupRecipients now emits group[...] as the canonical syntax.
- CLI help text updated to show group[...] as primary syntax.
- Deprecation warning logged when set[...] is used.

Stream J — Scheduled event cleanup on agent deletion:
- Cancel all pending scheduled events targeting a deleted agent in
  performAgentDelete, before the agent record is removed.
- Match events by parsing payload for agent ID/name/slug.
- Mark cancelled events with status "cancelled" and reason
  "target agent deleted".
- Cancel corresponding in-memory scheduler timers.
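
To illustrate the Stream F rename, here is a sketch of accepting both prefixes while emitting only the canonical one. The function names appear in the commit, but these signatures, the comma separator, and the `deprecated` return value are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// IsGroupRecipient accepts both the canonical group[...] form and the
// deprecated set[...] form.
func IsGroupRecipient(s string) bool {
	return strings.HasSuffix(s, "]") &&
		(strings.HasPrefix(s, "group[") || strings.HasPrefix(s, "set["))
}

// ParseGroupRecipient returns the members and whether the deprecated
// set[...] prefix was used, so the caller can log a warning.
func ParseGroupRecipient(s string) (members []string, deprecated bool, err error) {
	if !IsGroupRecipient(s) {
		return nil, false, fmt.Errorf("not a group recipient: %q", s)
	}
	prefix := "group["
	if strings.HasPrefix(s, "set[") {
		prefix, deprecated = "set[", true
	}
	body := s[len(prefix) : len(s)-1]
	for _, m := range strings.Split(body, ",") {
		if m = strings.TrimSpace(m); m != "" {
			members = append(members, m)
		}
	}
	if len(members) == 0 {
		return nil, deprecated, fmt.Errorf("group recipient has no members: %q", s)
	}
	return members, deprecated, nil
}

// FormatGroupRecipients always emits the canonical group[...] syntax.
func FormatGroupRecipients(members []string) string {
	return "group[" + strings.Join(members, ",") + "]"
}

func main() {
	members, deprecated, err := ParseGroupRecipient("set[alice, bob]")
	fmt.Println(members, deprecated, err)
	fmt.Println(FormatGroupRecipients(members))
}
```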

* feat(messaging): no-queuing delivery policy with synchronous broker retry

Replace implicit fire-and-forget queuing with synchronous-or-reject
semantics. Messages are now retried against the broker for up to 30s
with exponential backoff before failing with 502 (non-transient error)
or 504 (timeout). Messages are persisted with dispatch_state=dispatched
optimistically and marked as failed on delivery failure.

- Add dispatchWithBrokerRetry() helper with exponential backoff
- Add ErrBrokerTimeout sentinel and broker_timeout error code
- Add MarkMessageFailed() to store interface
- Update all 7 dispatch call sites to use sync retry
- Remove signalDeferredMessage, pending message scan in reconcileBroker
- Remove signalDeferred wiring from MessageBrokerProxy, NotificationDispatcher
- Remove dead "deferred" branch from CLI message output

* fix(messaging): address Phase 4 review findings

F1: MarkMessageFailed now persists the failure reason via a new
    dispatch_failure_reason column, and removes redundant control flow.
F2: Update stale ErrMessageDeferred comment to reflect retry semantics.
F3: broadcastDirect persists before dispatch, matching other handlers.
F4: Document sequential retry O(N×30s) risk in handleGroupMessage.
F5: Note shared 30s context in deliverToAgent.
F6: Document that post-Phase-4 pending rows indicate a bug.

* fix(messaging): address Phase 1 review findings

- M1: Use FormatGroupRecipients (not deprecated FormatSetRecipients) in handleGroupMessage
- M2: Fail closed when broker proxy is nil during channel validation in outbound handler
- M4: Add unit tests for eventTargetsAgent (6 tests) and validateChannel (3 tests)
- L1: Fix FormatGroupRecipients docstring (set[...] -> group[...])
- Fix ListChannels using CheckResponse which closes body before decode; use DecodeResponse instead

* fix: resolve CI lint and gofmt failures

- Fix gofmt trailing newline in reconcile.go
- Fix errcheck: check CancelEvent return value in handlers.go
- Fix errcheck: discard json.Encode return in handlers.go and test files
- Fix staticcheck: use tagged switch in message_channel_test.go

* fix(messaging): address PR GoogleCloudPlatform#409 review comments

- Add deliveryErr parameter to publishDeliveryFailed for accurate error messages
- Distinguish ErrNotFound from transient errors in agent lookup (messagebroker.go)
- Distinguish ErrNotFound from transient errors in broker inbound handler (403 vs 500)
- Throttle errorCooldown map cleanup to every 100 calls instead of every call

---------

Co-authored-by: Scion Agent (message-improvements-p2) <agent@scion.dev>
…loudPlatform#411)

Co-authored-by: Scion Agent (dev-followup-pr) <agent@scion.dev>
Removes [esbuild](https://github.com/evanw/esbuild). It's no longer used after updating ancestor dependency [vite](https://github.com/vitejs/vite/tree/HEAD/packages/vite). These dependencies need to be updated together.


Removes `esbuild`

Updates `vite` from 7.3.2 to 8.0.16
- [Release notes](https://github.com/vitejs/vite/releases)
- [Changelog](https://github.com/vitejs/vite/blob/main/packages/vite/CHANGELOG.md)
- [Commits](https://github.com/vitejs/vite/commits/v8.0.16/packages/vite)

---
updated-dependencies:
- dependency-name: esbuild
  dependency-version:
  dependency-type: indirect
- dependency-name: vite
  dependency-version: 8.0.16
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…udPlatform#416)

Vite 8 deprecated the bundled transformWithEsbuild and now requires
esbuild to be installed separately. This fixes the CI build failure:
"Failed to load transformWithEsbuild. It is deprecated and it now
requires esbuild to be installed separately."

Co-authored-by: Scion Agent (ci-vite-esbuild-fix) <agent@scion.dev>
Co-authored-by: Scion Agent (ci-gofmt-fix) <agent@scion.dev>
…latform#418)

- Use BadRequest() helper for base64 validation in all 4 secret handlers
- Store decoded plaintext (string(decoded)) instead of base64-encoded
  req.Value, preventing double-encoding when secrets are injected as
  environment variables or written to secrets.json
- Add MaxBytesReader (128KB) to setSecret, handleProjectSecretByKey,
  and handleBrokerSecretByKey to match handleAgentSecrets
- Encode secret values as base64 in the frontend using TextEncoder
  before sending to the API

Co-authored-by: Scion Agent (secret-400-pr418-fix) <agent@scion.dev>
…rm#420)

Restores the build feature code that was accidentally removed by PR GoogleCloudPlatform#412.
This includes:
- BuildHarnessConfigImageExecutor in maintenance_executors.go
- build-harness-config-image seeded operation
- Executor wiring in admin_maintenance.go
- Build Image button, dialog, and log streaming UI on harness-config detail page

Originally shipped in PRs GoogleCloudPlatform#406 and GoogleCloudPlatform#410.

Co-authored-by: Scion Agent (harness-local-build) <agent@scion.dev>
Stop auto-closing tasks on the first content message from an agent.
Previously, any non-state-change message immediately marked the task
as completed and closed all subscriptions (the MVP single-turn
limitation documented in the TODO at bridge.go:633).

Now content messages are broadcast to streaming and push subscribers
with state=working and Final=false, keeping the task alive. Task
lifecycle is driven solely by agent state-change messages:
- working/thinking/executing → working (non-terminal)
- waiting_for_input → input-required (non-terminal)
- completed → completed (terminal, closes task)
- error/stalled → failed (terminal, closes task)

This enables multi-turn conversations where agents ask clarifying
questions, send progress updates, or emit interim artifacts before
completing.

Design doc: .design/a2a-multi-turn-lifecycle.md
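
The state mapping listed above can be written as a small lookup. The type and function names here are illustrative assumptions. Only the state strings and their terminal or non-terminal status are taken from the commit message.

```go
package main

import "fmt"

// TaskState is the task state reported to A2A clients.
type TaskState string

const (
	TaskWorking       TaskState = "working"
	TaskInputRequired TaskState = "input-required"
	TaskCompleted     TaskState = "completed"
	TaskFailed        TaskState = "failed"
)

// mapAgentState translates an agent state-change into a task state and
// reports whether that state is terminal (closes the task). ok is false
// for agent states the mapping does not cover.
func mapAgentState(agentState string) (state TaskState, terminal bool, ok bool) {
	switch agentState {
	case "working", "thinking", "executing":
		return TaskWorking, false, true
	case "waiting_for_input":
		return TaskInputRequired, false, true
	case "completed":
		return TaskCompleted, true, true
	case "error", "stalled":
		return TaskFailed, true, true
	default:
		return "", false, false
	}
}

func main() {
	for _, s := range []string{"thinking", "waiting_for_input", "completed", "stalled"} {
		state, terminal, _ := mapAgentState(s)
		fmt.Printf("%s -> %s (terminal=%v)\n", s, state, terminal)
	}
}
```

Content messages never pass through this mapping, which is why they can no longer close a task.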
20 tests covering the multi-turn task lifecycle:
- Content messages don't complete tasks
- Content broadcasts with state=working, Final=false
- Multiple content messages keep task alive
- State-change to completed/failed closes task properly
- State-change to input-required keeps task alive
- Blocking SendMessage returns working (not completed)
- Blocking timeout/error/cancel cleans up activeTask
- Full multi-turn lifecycle integration test
- Slug-based fallback correlation with content
- Metrics not incremented on content messages
Fixes from code review:

1. Terminal state-changes dropped during blocking calls: dispatchToWaiter
   skipped state-change messages entirely, even terminal ones. The task's
   DB state was never updated to completed/failed. Fix: update DB state
   for terminal state-changes even when a waiter is active.

2. Janitor reaping active multi-turn tasks: content messages didn't
   refresh the task's UpdatedAt timestamp, so long conversations could
   be reaped as stale. Fix: call UpdateTaskState(working) on content
   messages to refresh the timestamp.

Added/updated tests for both scenarios.
…review

Debug/refactor cycle findings:
- Refactored dispatchToActiveTask for clarity
- Added test coverage for edge cases in state-change handling
- All tests pass
Enable multi-turn conversations by routing message/send with a taskID
to the same agent, continuing the conversation instead of creating a
new task.

When SendMessageParams includes a taskID:
1. Look up the existing task and verify ownership
2. Reject if task is in a terminal state (completed/failed/canceled)
3. Resolve the agent from stored task metadata
4. Send the follow-up message to the agent
5. Return the existing task (not a new one)

This works with both blocking and non-blocking modes. Combined with
the multi-turn lifecycle change (PR 1), this enables the full A2A
multi-turn flow: client sends initial message → agent responds or
asks for input → client sends follow-up → agent continues.

Design doc: .design/a2a-task-followup.md
22 tests covering follow-up message routing:
- Valid/unknown/terminal/wrong-project/wrong-agent task ID handling
- Task state transitions (input-required → working)
- Blocking timeout/error/cancel/success cleanup paths
- Non-blocking registration and send-failure cleanup
- Concurrent follow-ups on same task
- Message content translation
- Server-level TaskID passthrough and error handling

Bugs fixed during review:
- Blocking success path leaked activeTask (added defer unregister)
- Non-blocking send failure didn't mark task failed or unregister
Fixes from code review:
- Blocking success: refresh task timestamp with UpdateTaskState(working)
- Send failure: mark task as failed + unregister activeTask
- Timeout/cancel: mark task as failed
- Added tests verifying DB state after each path
Found during 12-cycle debug/refactor:
- Fixed edge cases in follow-up routing and state management
- Added test coverage for discovered paths
- 3 consecutive clean cycles after fixes
Update agent cards to advertise streaming and push notification
support now that multi-turn conversations are implemented.

- Registry card: streaming=true, pushNotifications=true
- Per-agent cards: streaming=true, pushNotifications=true
- Remove MVP streaming warning from handleStreamMessage
- Update README: remove single-turn limitation, update known
  limitations to reflect current state (no gRPC/REST transport)
4 tests verifying multi-turn capability advertisement:
- Registry card advertises streaming=true, pushNotifications=true
- Per-agent card matches registry capabilities
- Direct unit test of GenerateAgentCard capability values
- Drift prevention test ensuring registry and per-agent cards stay in sync
…d use topic helpers (GoogleCloudPlatform#421)

Address review feedback from merged A2A PRs GoogleCloudPlatform#314 and GoogleCloudPlatform#315:

- Add TouchTask store method to refresh timestamps without changing state
- Guard dispatchToActiveTask so content messages read and preserve the
  current task state instead of unconditionally resetting to working
- Replace hardcoded fmt.Sprintf topic patterns in sendFollowUp and
  SendStreamingMessage with projectcompat.UserTopic/LegacyUserTopic
- Fix SendStructuredMessage call sites missing second return value
- Update followup_test.go mocks to match current hubclient interfaces

Co-authored-by: Scion Agent (a2a-review-followup-dev) <agent@scion.dev>
Co-authored-by: Scion Agent (broker-shutdown-inv) <agent@scion.dev>
Templates that specify a fully-qualified custom image (e.g.
ghcr.io/myorg/scion-myimage:latest) currently get their registry
prefix rewritten by the broker's image_registry setting. This makes
it impossible to use custom scion-* images hosted in external
registries without push access to the broker's registry.

Add an `image_pinned` field to ScionConfig. When set to true in a
template's scion-agent.yaml, the image is used as-is without
registry rewriting.
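
A minimal sketch of the opt-out check, assuming a simplified rewrite rule. `ScionConfig`, `ImagePinned`, and `RewriteImageRegistry()` are named in this PR, but `rewriteImageRegistry` below is a hypothetical stand-in that swaps everything before the final path segment, and `resolveImage` is an illustrative helper rather than the code in pkg/agent/run.go.

```go
package main

import (
	"fmt"
	"strings"
)

// ScionConfig shows only the two fields relevant to this change.
type ScionConfig struct {
	Image       string `yaml:"image"`
	ImagePinned bool   `yaml:"image_pinned"`
}

// rewriteImageRegistry is a stand-in for the broker's rewrite: it keeps
// the image name and tag and replaces the registry prefix.
func rewriteImageRegistry(image, registry string) string {
	name := image
	if i := strings.LastIndex(image, "/"); i >= 0 {
		name = image[i+1:]
	}
	return strings.TrimSuffix(registry, "/") + "/" + name
}

// resolveImage returns the image reference to run. A pinned image is
// used exactly as written in the template.
func resolveImage(cfg ScionConfig, registry string) string {
	if cfg.ImagePinned || registry == "" {
		return cfg.Image
	}
	return rewriteImageRegistry(cfg.Image, registry)
}

func main() {
	registry := "registry.example.com/scion" // assumed broker image_registry value
	image := "ghcr.io/myorg/scion-custom-image:latest"

	fmt.Println(resolveImage(ScionConfig{Image: image}, registry))
	fmt.Println(resolveImage(ScionConfig{Image: image, ImagePinned: true}, registry))
}
```

Because the field defaults to false, templates that omit it keep the existing rewrite behavior.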
@zeroasterisk

Owner Author

Closing: superseded by format-based detection (#8 / ptone#266). The image_pinned approach is deprecated.


3 participants