Two-tier settings architecture for HA multi-node deployments #268

## Problem: Settings drift in HA multi-node deployments

In HA deployments where multiple Hub instances share a Postgres database, **all Hub-level operational settings live in a local `settings.yaml` file on each node's filesystem**. There is no mechanism to synchronize these settings across replicas. This creates several problems:

1. **Settings drift** — An admin changes `admin_emails`, `user_access_mode`, or telemetry defaults on one node via `PUT /api/v1/admin/server-config`. This writes to that node's `~/.scion/settings.yaml`. Other replicas behind the load balancer still serve the old values until they are independently updated or restarted.

2. **No single source of truth** — The admin settings API (`handleGetServerConfig` / `handlePutServerConfig` in `pkg/hub/admin_settings.go`) reads from and writes to the local filesystem. Behind a load balancer, GET and PUT may hit different nodes, making the admin UI unreliable.

3. **Runtime-reloadable settings are node-local** — `reloadSettings()` applies changes like `telemetry_default`, `admin_emails`, `auto_suspend_stalled`, `user_access_mode`, `log_level`, and `github_app` to the calling node only. Other replicas remain stale.

By contrast, **project settings already work correctly**. Project-level settings (default template, harness config, agent limits) are stored as project annotations in the database (`pkg/hub/project_settings_handlers.go`), so they are already consistent across all nodes. This shows that database-backed settings are a workable pattern in this codebase.

### Current architecture summary

| What | Where stored | HA-safe? |
|------|-------------|----------|
| Agent, project, user, template data | Postgres (Ent ORM) | Yes |
| Project settings (limits, defaults) | DB (project annotations) | Yes |
| Event delivery | Postgres LISTEN/NOTIFY (`events_postgres.go`) | Yes |
| Dispatch coordination | Postgres command bus (`command_bus.go`) | Yes |
| Leader election for scheduled work | Postgres advisory locks (`concurrency.go`) | Yes |
| **Hub operational settings** | **`~/.scion/settings.yaml` on each node** | **No** |

## Proposal: Two-tier settings architecture

Split settings into two tiers (called Layer 0 and Layer 1 below) based on when they are needed during Hub startup:

### Layer 0 — Bootstrap settings (local config / env vars)

Settings that MUST be known before the Hub can connect to the database. These stay in `settings.yaml` or environment variables and must be consistent across nodes via deployment tooling (Helm values, Terraform, Cloud Run env vars, etc.).

| Setting | Rationale |
|---------|-----------|
| `database.driver`, `database.url`, pool settings | Needed to establish DB connection |
| `hub.port`, `hub.host` | Needed to bind the HTTP listener |
| `broker.port`, `broker.host`, `broker.enabled` | Needed to bind broker listener |
| `auth.mode` (oauth/proxy/dev) | Determines middleware stack at startup |
| `oauth.*` (client IDs/secrets) | Needed for OAuth middleware init |
| `auth.proxy.*` (IAP config) | Needed for proxy auth middleware init |
| `secrets.backend`, `secrets.gcp_*` | Needed to init secret backend for signing keys |
| `storage.provider`, `storage.bucket` | Needed to init storage backend |
| `hub.hub_id` | Needed for secret namespacing before DB access |
| `server.mode` (workstation/hosted) | Determines startup behavior |
| `log_level`, `log_format` | Needed for initial logging setup |

### Layer 1 — Operational settings (stored in Postgres)

Settings that are only needed after the Hub is running and connected to the database. These should be stored in a new `hub_settings` table so all nodes read the same values.

| Setting | Currently in | Notes |
|---------|-------------|-------|
| `admin_emails` | settings.yaml | Already runtime-reloadable |
| `user_access_mode` | settings.yaml | Already runtime-reloadable |
| `telemetry_enabled` (default) | settings.yaml | Already runtime-reloadable |
| `telemetry` (full config) | settings.yaml | Already runtime-reloadable |
| `auto_suspend_stalled` | settings.yaml | Already runtime-reloadable |
| `admin_mode` / `maintenance_message` | settings.yaml | Should be instant across all nodes |
| `soft_delete_retention` | settings.yaml | Operational policy |
| `soft_delete_retain_files` | settings.yaml | Operational policy |
| `hub.public_url` (endpoint) | settings.yaml | Only used after startup for agent callbacks |
| `github_app` (non-secret fields) | settings.yaml | Already runtime-reloadable |
| `cors.*` | settings.yaml | Could be changed at runtime |
| `image_registry` | settings.yaml | Operational — agents need it at dispatch |
| `default_template` | settings.yaml | Operational default |
| `default_harness_config` | settings.yaml | Operational default |
| `default_max_turns/model_calls/duration` | settings.yaml | Operational limits |
| `notification_channels` | settings.yaml | Routing config (secrets stay in secret backend) |
| `message_broker.*` | settings.yaml | Requires restart anyway — borderline |

## Implementation sketch

### 1. New `hub_settings` table

```sql
CREATE TABLE hub_settings (
    key         TEXT PRIMARY KEY,
    value       TEXT NOT NULL,       -- JSON-encoded value
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_by  TEXT                 -- email of admin who changed it
);
```

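Using a key-value design rather than a wide table allows adding new settings without schema migrations. The trade-off is that the database cannot enforce per-setting types, so values have to be validated in the Hub when they are written and read.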
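Since `value` is JSON-encoded text, each setting needs an encode/decode step on the Go side. The following is a minimal sketch of that round trip, not the Hub's actual code: `SettingRow`, `EncodeSetting`, and `DecodeSetting` are hypothetical names, and the struct simply mirrors the columns of the table above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// SettingRow mirrors one row of the proposed hub_settings table.
// (Hypothetical type; the real model would likely be generated by Ent.)
type SettingRow struct {
	Key       string
	Value     string // JSON-encoded value
	UpdatedAt time.Time
	UpdatedBy string // email of the admin who changed it; may be empty
}

// EncodeSetting JSON-encodes v and wraps it in a row for the given key.
func EncodeSetting(key string, v any, updatedBy string) (SettingRow, error) {
	b, err := json.Marshal(v)
	if err != nil {
		return SettingRow{}, fmt.Errorf("encode %q: %w", key, err)
	}
	return SettingRow{
		Key:       key,
		Value:     string(b),
		UpdatedAt: time.Now().UTC(),
		UpdatedBy: updatedBy,
	}, nil
}

// DecodeSetting decodes a row's JSON value into T and fails if the
// stored JSON does not match the requested Go type.
func DecodeSetting[T any](row SettingRow) (T, error) {
	var out T
	if err := json.Unmarshal([]byte(row.Value), &out); err != nil {
		return out, fmt.Errorf("decode %q: %w", row.Key, err)
	}
	return out, nil
}
```

A caller would write `EncodeSetting("admin_emails", emails, adminEmail)` before an upsert and `DecodeSetting[[]string](row)` after a read. Decoding into the wrong type returns an error, which is where the type checks that the schema cannot provide would live.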

### 2. Settings cache with invalidation

Each Hub replica loads Layer 1 settings into an in-memory cache at startup. Invalidation options:

- **Postgres LISTEN/NOTIFY** (preferred) — A `scion_settings_changed` channel. On PUT, the writing node issues NOTIFY; all replicas re-read from the `hub_settings` table. Piggybacks on the existing `PostgresEventPublisher` infrastructure.
- **Polling fallback** — Re-read every N seconds (e.g., 30s) as a simpler alternative or backstop.
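The two invalidation options can share one reload loop. The sketch below assumes a hypothetical `SettingsCache` whose loader function stands in for a `SELECT` over `hub_settings`, and a `changed` channel that stands in for notifications arriving on `scion_settings_changed`; none of these names come from the existing code. Polling is kept as a backstop because Postgres does not deliver notifications to a listener that was disconnected when they were sent.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// LoadFunc returns all Layer 1 rows as key -> JSON value.
// Hypothetical: in the Hub this would query the hub_settings table.
type LoadFunc func() (map[string]string, error)

// SettingsCache holds an in-memory snapshot of Layer 1 settings.
type SettingsCache struct {
	mu   sync.RWMutex
	data map[string]string
	load LoadFunc
}

// NewSettingsCache returns an empty cache; call Reload to populate it.
func NewSettingsCache(load LoadFunc) *SettingsCache {
	return &SettingsCache{data: map[string]string{}, load: load}
}

// Reload swaps in a fresh snapshot. On a load error the previous
// snapshot is kept so the replica keeps serving its last known values.
func (c *SettingsCache) Reload() error {
	fresh, err := c.load()
	if err != nil {
		return err
	}
	c.mu.Lock()
	c.data = fresh
	c.mu.Unlock()
	return nil
}

// Get returns the cached JSON value for key, if present.
func (c *SettingsCache) Get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.data[key]
	return v, ok
}

// Run reloads whenever a change signal arrives or the poll interval
// elapses, until ctx is cancelled. poll must be greater than zero.
func (c *SettingsCache) Run(ctx context.Context, changed <-chan struct{}, poll time.Duration) {
	ticker := time.NewTicker(poll)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-changed: // NOTIFY received by the listener
		case <-ticker.C: // polling backstop
		}
		_ = c.Reload() // a real implementation would log the error
	}
}
```

Each replica would call `Reload` once at startup and then start `Run` in a goroutine, for example with a 30-second poll interval.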

### 3. Migration path

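1. On startup, if the `hub_settings` table is empty, seed it from the local `settings.yaml` (Layer 1 fields only). This provides a zero-effort migration for existing deployments. Because several replicas may start at the same time, the seed should be idempotent so that concurrent seeds do not overwrite each other.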
2. Layer 1 values in `settings.yaml` become fallback defaults — the DB value takes precedence when present.
3. The admin API (`/api/v1/admin/server-config`) switches to reading/writing Layer 1 settings from the DB instead of the filesystem.
4. Env var overrides for Layer 1 settings should still work (highest priority), allowing per-node operational overrides when genuinely needed.
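Step 1 of the migration path could look roughly like the sketch below. It assumes a hypothetical `Store` interface over `hub_settings` and an illustrative, incomplete set of Layer 1 keys. `memStore` exists only to make the example self-contained; a Postgres-backed store would implement `InsertIfAbsent` with something like `INSERT ... ON CONFLICT DO NOTHING`.

```go
package main

// layer1Keys lists settings that may be copied into the database.
// Illustrative subset only; the full list is the Layer 1 table.
var layer1Keys = map[string]bool{
	"admin_emails":         true,
	"user_access_mode":     true,
	"auto_suspend_stalled": true,
}

// Store is a hypothetical minimal interface over hub_settings.
type Store interface {
	Count() (int, error)
	// InsertIfAbsent must never overwrite an existing key, so that
	// replicas seeding concurrently cannot clobber each other.
	InsertIfAbsent(key, jsonValue string) error
}

// SeedFromFile copies Layer 1 values from the parsed settings file into
// an empty store and returns how many keys it attempted to insert.
// It does nothing when the store already has rows.
func SeedFromFile(s Store, fileValues map[string]string) (int, error) {
	n, err := s.Count()
	if err != nil {
		return 0, err
	}
	if n > 0 {
		return 0, nil // already seeded; DB values take precedence
	}
	seeded := 0
	for key, value := range fileValues {
		if !layer1Keys[key] {
			continue // Layer 0 stays in local config
		}
		if err := s.InsertIfAbsent(key, value); err != nil {
			return seeded, err
		}
		seeded++
	}
	return seeded, nil
}

// memStore is an in-memory Store used only for demonstration.
type memStore map[string]string

func (m memStore) Count() (int, error) { return len(m), nil }

func (m memStore) InsertIfAbsent(key, jsonValue string) error {
	if _, exists := m[key]; !exists {
		m[key] = jsonValue
	}
	return nil
}
```

Running the seed a second time is a no-op, so later edits made through the admin API are not reverted by a node restart.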

### 4. Priority chain (highest to lowest)

```
Environment variable (SCION_SERVER_*)
  → hub_settings table (Layer 1)
    → settings.yaml fallback (Layer 1 fields)
      → compiled defaults
```
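The chain above maps onto a short lookup function. This is a sketch under assumptions: the rule that turns a setting key into a `SCION_SERVER_*` variable name is a guess, and `Resolve`, `Source`, and the plain maps standing in for the cache, the parsed file, and the compiled defaults are hypothetical.

```go
package main

import "strings"

// Source records which tier supplied a value.
type Source string

const (
	SourceEnv     Source = "env"
	SourceDB      Source = "db"
	SourceFile    Source = "file"
	SourceDefault Source = "default"
)

// envName maps a setting key to its override variable, for example
// "user_access_mode" -> "SCION_SERVER_USER_ACCESS_MODE".
// The exact naming rule is an assumption.
func envName(key string) string {
	return "SCION_SERVER_" + strings.ToUpper(strings.ReplaceAll(key, ".", "_"))
}

// Resolve walks the priority chain from highest to lowest and reports
// the winning value and its source. lookupEnv is injected so the chain
// can be exercised without touching the process environment.
func Resolve(
	key string,
	lookupEnv func(name string) (string, bool),
	db, file, defaults map[string]string,
) (string, Source, bool) {
	if v, ok := lookupEnv(envName(key)); ok {
		return v, SourceEnv, true
	}
	if v, ok := db[key]; ok {
		return v, SourceDB, true
	}
	if v, ok := file[key]; ok {
		return v, SourceFile, true
	}
	if v, ok := defaults[key]; ok {
		return v, SourceDefault, true
	}
	return "", "", false
}
```

In the Hub, `os.LookupEnv` would be passed as `lookupEnv`. Returning the `Source` alongside the value would let the admin UI show when a per-node environment override is shadowing the shared database value.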

### 5. Admin API changes

- `GET /api/v1/admin/server-config` reads Layer 1 from the DB and Layer 0 from local config.
- `PUT /api/v1/admin/server-config` writes Layer 1 to the DB and issues NOTIFY; it rejects Layer 0 changes with guidance to use config files.
- New: `GET /api/v1/admin/server-config/nodes` could report per-node Layer 0 values for diagnostics.
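The PUT behaviour described above needs a way to tell Layer 1 keys from everything else. One possible shape is an allow-list, sketched below with an illustrative subset of keys; `SplitUpdate` and `layer1Writable` are hypothetical names, and whether a request containing rejected keys fails as a whole or is applied partially is left to the handler.

```go
package main

import "sort"

// layer1Writable is the allow-list of keys the admin API may write to
// the database. Illustrative subset only.
var layer1Writable = map[string]bool{
	"admin_emails":        true,
	"user_access_mode":    true,
	"hub.public_url":      true,
	"maintenance_message": true,
}

// SplitUpdate partitions a PUT payload (key -> JSON value) into changes
// that can be written to hub_settings and keys that must be rejected
// because they are bootstrap (Layer 0) or unknown settings.
// Rejected keys are sorted so error messages are stable.
func SplitUpdate(payload map[string]string) (accepted map[string]string, rejected []string) {
	accepted = make(map[string]string)
	for key, value := range payload {
		if layer1Writable[key] {
			accepted[key] = value
		} else {
			rejected = append(rejected, key)
		}
	}
	sort.Strings(rejected)
	return accepted, rejected
}
```

An allow-list fails closed: a setting added later is rejected until someone decides which layer it belongs to, which seems safer than prefix matching given that `hub.port` and `hub.public_url` fall in different layers.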

## What this does NOT change

- **Layer 0 settings** remain file/env-based — operators must keep them consistent via deployment tooling (this is standard practice for DB connection strings, TLS certs, etc.)
- **Project-level settings** stay in project annotations (already HA-safe)
- **Secrets** (signing keys, OAuth secrets, broker tokens) stay in the secret backend
- **Single-node / workstation mode** continues to work unchanged — Layer 1 falls back to `settings.yaml` when no Postgres is configured

## Relevant code paths

- Config loading: `pkg/config/hub_config.go` (`LoadGlobalConfig`, `GlobalConfig`)
- Versioned settings: `pkg/config/settings_v1.go` (`VersionedSettings`, `V1ServerConfig`)
- Admin settings API: `pkg/hub/admin_settings.go` (`handleAdminServerConfig`, `reloadSettings`)
- Project settings (DB model): `pkg/hub/project_settings_handlers.go`
- Event publisher infra: `pkg/hub/events_postgres.go` (`PostgresEventPublisher`)
- Hub server config: `pkg/hub/server.go` (`ServerConfig`)
- Store interfaces: `pkg/store/store.go`
