Problem: Settings drift in HA multi-node deployments
In HA deployments where multiple Hub instances share a Postgres database, all operational settings live in local settings.yaml files on each node's filesystem. There is no mechanism to synchronize settings across replicas. This creates several problems:
- Settings drift: An admin changes admin_emails, user_access_mode, or telemetry defaults on one node via PUT /api/v1/admin/server-config. This writes to that node's ~/.scion/settings.yaml. Other replicas behind the load balancer still serve the old values until they are independently updated or restarted.
- No single source of truth: The admin settings API (handleGetServerConfig / handlePutServerConfig in pkg/hub/admin_settings.go) reads from and writes to the local filesystem. Behind a load balancer, GET and PUT may hit different nodes, making the admin UI unreliable.
- Runtime-reloadable settings are node-local: reloadSettings() applies changes like telemetry_default, admin_emails, auto_suspend_stalled, user_access_mode, log_level, and github_app to the calling node only. Other replicas remain stale.
- Project settings already work correctly: Project-level settings (default template, harness config, agent limits) are stored as project annotations in the database (pkg/hub/project_settings_handlers.go), so they are already consistent across all nodes. This proves the pattern works.
Current architecture summary
| What | Where stored | HA-safe? |
|---|---|---|
| Agent, project, user, template data | Postgres (Ent ORM) | Yes |
| Project settings (limits, defaults) | DB (project annotations) | Yes |
| Event delivery | Postgres LISTEN/NOTIFY (events_postgres.go) | Yes |
| Dispatch coordination | Postgres command bus (command_bus.go) | Yes |
| Leader election for scheduled work | Postgres advisory locks (concurrency.go) | Yes |
| Hub operational settings | ~/.scion/settings.yaml on each node | No |
Proposal: Two-tier settings architecture
Split settings into two tiers based on when they are needed during Hub startup:
Layer 0 — Bootstrap settings (local config / env vars)
Settings that MUST be known before the Hub can connect to the database. These stay in settings.yaml or environment variables and must be consistent across nodes via deployment tooling (Helm values, Terraform, Cloud Run env vars, etc.).
| Setting | Rationale |
|---|---|
| database.driver, database.url, pool settings | Needed to establish DB connection |
| hub.port, hub.host | Needed to bind the HTTP listener |
| broker.port, broker.host, broker.enabled | Needed to bind broker listener |
| auth.mode (oauth/proxy/dev) | Determines middleware stack at startup |
| oauth.* (client IDs/secrets) | Needed for OAuth middleware init |
| auth.proxy.* (IAP config) | Needed for proxy auth middleware init |
| secrets.backend, secrets.gcp_* | Needed to init secret backend for signing keys |
| storage.provider, storage.bucket | Needed to init storage backend |
| hub.hub_id | Needed for secret namespacing before DB access |
| server.mode (workstation/hosted) | Determines startup behavior |
| log_level, log_format | Needed for initial logging setup |
Layer 1 — Operational settings (stored in Postgres)
Settings that are only needed after the Hub is running and connected to the database. These should be stored in a new hub_settings table so all nodes read the same values.
| Setting | Currently in | Notes |
|---|---|---|
| admin_emails | settings.yaml | Already runtime-reloadable |
| user_access_mode | settings.yaml | Already runtime-reloadable |
| telemetry_enabled (default) | settings.yaml | Already runtime-reloadable |
| telemetry (full config) | settings.yaml | Already runtime-reloadable |
| auto_suspend_stalled | settings.yaml | Already runtime-reloadable |
| admin_mode / maintenance_message | settings.yaml | Should be instant across all nodes |
| soft_delete_retention | settings.yaml | Operational policy |
| soft_delete_retain_files | settings.yaml | Operational policy |
| hub.public_url (endpoint) | settings.yaml | Only used after startup for agent callbacks |
| github_app (non-secret fields) | settings.yaml | Already runtime-reloadable |
| cors.* | settings.yaml | Could be changed at runtime |
| image_registry | settings.yaml | Operational; agents need it at dispatch |
| default_template | settings.yaml | Operational default |
| default_harness_config | settings.yaml | Operational default |
| default_max_turns/model_calls/duration | settings.yaml | Operational limits |
| notification_channels | settings.yaml | Routing config (secrets stay in secret backend) |
| message_broker.* | settings.yaml | Borderline; requires restart anyway |
Implementation sketch
1. New hub_settings table
    CREATE TABLE hub_settings (
        key        TEXT PRIMARY KEY,
        value      TEXT NOT NULL,                      -- JSON-encoded value
        updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
        updated_by TEXT                                -- email of admin who changed it
    );
Using a key-value design rather than a wide table allows adding new settings without migrations.
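To illustrate how typed settings could round-trip through the JSON-encoded value column, here is a minimal Go sketch. The settingRow type and the encodeSetting / decodeSetting helpers are hypothetical names for illustration, not existing code in the repository, and the example email addresses are placeholders.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// settingRow mirrors the key and value columns of the proposed
// hub_settings table (hypothetical type, for illustration only).
type settingRow struct {
	Key   string
	Value string // JSON-encoded, matching the value column
}

// encodeSetting JSON-encodes an arbitrary value so it fits the TEXT column.
func encodeSetting(key string, v any) (settingRow, error) {
	b, err := json.Marshal(v)
	if err != nil {
		return settingRow{}, fmt.Errorf("encode %s: %w", key, err)
	}
	return settingRow{Key: key, Value: string(b)}, nil
}

// decodeSetting decodes the stored JSON into out, which must be a pointer.
func decodeSetting(row settingRow, out any) error {
	if err := json.Unmarshal([]byte(row.Value), out); err != nil {
		return fmt.Errorf("decode %s: %w", row.Key, err)
	}
	return nil
}

func main() {
	// A list-valued setting and a scalar one share the same column type.
	row, err := encodeSetting("admin_emails", []string{"a@example.com", "b@example.com"})
	if err != nil {
		panic(err)
	}
	fmt.Println(row.Key, row.Value)

	var emails []string
	if err := decodeSetting(row, &emails); err != nil {
		panic(err)
	}
	fmt.Println(len(emails), "admin emails decoded")
}
```

The trade-off of this design is that type checking moves from the schema into application code, so each key would need a known Go type at the point where it is decoded.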
2. Settings cache with invalidation
Each Hub replica loads Layer 1 settings into an in-memory cache at startup. Invalidation options:
- Postgres LISTEN/NOTIFY (preferred): a scion_settings_changed channel. On PUT, the writing node issues NOTIFY; all replicas re-read from the hub_settings table. Piggybacks on the existing PostgresEventPublisher infrastructure.
- Polling fallback: re-read every N seconds (e.g., 30s) as a simpler alternative or backstop.
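The cache and its two invalidation triggers could be sketched roughly as follows. This is not the Hub's actual implementation: settingsCache, loadFunc, and the channel wiring are hypothetical, and the database read is abstracted behind a function so the sketch stays self-contained. In a real build the notify channel would be fed by a LISTEN on scion_settings_changed and the poll channel by a ticker.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"sync"
	"time"
)

// loadFunc reads all Layer 1 rows, for example via
// SELECT key, value FROM hub_settings (abstracted here).
type loadFunc func() (map[string]string, error)

// settingsCache holds an in-memory snapshot of Layer 1 settings.
type settingsCache struct {
	mu     sync.RWMutex
	values map[string]string
	load   loadFunc
}

// newSettingsCache performs the initial load at startup.
func newSettingsCache(load loadFunc) (*settingsCache, error) {
	c := &settingsCache{load: load}
	if err := c.reload(); err != nil {
		return nil, err
	}
	return c, nil
}

// reload swaps in a fresh snapshot; on error the previous one is kept.
func (c *settingsCache) reload() error {
	fresh, err := c.load()
	if err != nil {
		return err
	}
	c.mu.Lock()
	c.values = fresh
	c.mu.Unlock()
	return nil
}

// get returns the cached raw value for key.
func (c *settingsCache) get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.values[key]
	return v, ok
}

// run reloads whenever a change notification arrives or the poll timer
// fires, until ctx is cancelled.
func (c *settingsCache) run(ctx context.Context, notify <-chan struct{}, poll <-chan time.Time) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-notify:
		case <-poll:
		}
		if err := c.reload(); err != nil {
			log.Printf("settings reload failed, keeping previous snapshot: %v", err)
		}
	}
}

func main() {
	// Fake table contents; the values are placeholders.
	table := map[string]string{"user_access_mode": `"mode-a"`}
	load := func() (map[string]string, error) {
		snapshot := make(map[string]string, len(table))
		for k, v := range table {
			snapshot[k] = v
		}
		return snapshot, nil
	}

	cache, err := newSettingsCache(load)
	if err != nil {
		panic(err)
	}
	table["user_access_mode"] = `"mode-b"` // another node writes to the DB
	v, _ := cache.get("user_access_mode")
	fmt.Println("before reload:", v)
	if err := cache.reload(); err != nil { // what run does on NOTIFY or poll
		panic(err)
	}
	v, _ = cache.get("user_access_mode")
	fmt.Println("after reload:", v)
}
```

Passing both channels to run lets the polling timer act as a backstop for missed notifications, for instance after a dropped listener connection.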
3. Migration path
- On startup, if the hub_settings table is empty, seed it from the local settings.yaml (Layer 1 fields only). This provides a zero-effort migration for existing deployments.
- Layer 1 values in settings.yaml become fallback defaults; the DB value takes precedence when present.
- The admin API (/api/v1/admin/server-config) switches to reading/writing Layer 1 settings from the DB instead of the filesystem.
- Env var overrides for Layer 1 settings should still work (highest priority), allowing per-node operational overrides when genuinely needed.
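The seed-on-empty step could look something like the sketch below. The settingsStore interface, memStore, and seedIfEmpty are hypothetical names; an in-memory map stands in for the hub_settings table, and the set of Layer 1 keys is passed in rather than assumed.

```go
package main

import "fmt"

// settingsStore is a hypothetical abstraction over the hub_settings table.
type settingsStore interface {
	Count() (int, error)
	Put(key, value string) error
}

// memStore is an in-memory stand-in used only for this sketch.
type memStore map[string]string

func (m memStore) Count() (int, error)         { return len(m), nil }
func (m memStore) Put(key, value string) error { m[key] = value; return nil }

// seedIfEmpty copies Layer 1 values from the local settings file into the
// store, but only when the store has no rows yet. It returns the number of
// rows written. Keys not listed in layer1 (Layer 0) are never copied.
func seedIfEmpty(s settingsStore, fileValues map[string]string, layer1 map[string]bool) (int, error) {
	n, err := s.Count()
	if err != nil {
		return 0, err
	}
	if n > 0 {
		return 0, nil // already seeded or managed via the admin API
	}
	written := 0
	for key, value := range fileValues {
		if !layer1[key] {
			continue
		}
		if err := s.Put(key, value); err != nil {
			return written, err
		}
		written++
	}
	return written, nil
}

func main() {
	store := memStore{}
	fileValues := map[string]string{
		"image_registry": `"registry.example.com/scion"`, // placeholder value
		"database.url":   `"postgres://..."`,             // Layer 0, must not be seeded
	}
	layer1 := map[string]bool{"image_registry": true}

	n, err := seedIfEmpty(store, fileValues, layer1)
	if err != nil {
		panic(err)
	}
	fmt.Println("seeded rows:", n)
}
```

One thing a real implementation would need to handle is several replicas starting at once against an empty table. Running the seed inside a transaction or using INSERT ... ON CONFLICT DO NOTHING would be one way to keep the first writer's values.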
4. Priority chain (highest to lowest)
Environment variable (SCION_SERVER_*)
→ hub_settings table (Layer 1)
→ settings.yaml fallback (Layer 1 fields)
→ compiled defaults
5. Admin API changes
- GET /api/v1/admin/server-config reads Layer 1 from DB, Layer 0 from local config
- PUT /api/v1/admin/server-config writes Layer 1 to DB + issues NOTIFY, rejects Layer 0 changes with guidance to use config files
- New: GET /api/v1/admin/server-config/nodes could report per-node Layer 0 values for diagnostics
What this does NOT change
- Layer 0 settings remain file/env-based — operators must keep them consistent via deployment tooling (this is standard practice for DB connection strings, TLS certs, etc.)
- Project-level settings stay in project annotations (already HA-safe)
- Secrets (signing keys, OAuth secrets, broker tokens) stay in the secret backend
- Single-node / workstation mode continues to work unchanged: Layer 1 falls back to settings.yaml when no Postgres is configured
Relevant code paths
- Config loading: pkg/config/hub_config.go (LoadGlobalConfig, GlobalConfig)
- Versioned settings: pkg/config/settings_v1.go (VersionedSettings, V1ServerConfig)
- Admin settings API: pkg/hub/admin_settings.go (handleAdminServerConfig, reloadSettings)
- Project settings (DB model): pkg/hub/project_settings_handlers.go
- Event publisher infra: pkg/hub/events_postgres.go (PostgresEventPublisher)
- Hub server config: pkg/hub/server.go (ServerConfig)
- Store interfaces: pkg/store/store.go