Two-tier settings architecture for HA multi-node deployments #268

Description

@ptone

Problem: Settings drift in HA multi-node deployments

In HA deployments where multiple Hub instances share a Postgres database, all operational settings live in local settings.yaml files on each node's filesystem. There is no mechanism to synchronize settings across replicas. This creates several problems:

  1. Settings drift — An admin changes admin_emails, user_access_mode, or telemetry defaults on one node via PUT /api/v1/admin/server-config. This writes to that node's ~/.scion/settings.yaml. Other replicas behind the load balancer still serve the old values until they are independently updated or restarted.

  2. No single source of truth — The admin settings API (handleGetServerConfig / handlePutServerConfig in pkg/hub/admin_settings.go) reads from and writes to the local filesystem. Behind a load balancer, GET and PUT may hit different nodes, making the admin UI unreliable.

  3. Runtime-reloadable settings are node-local: reloadSettings() applies changes like telemetry_default, admin_emails, auto_suspend_stalled, user_access_mode, log_level, and github_app to the calling node only. Other replicas remain stale.

  4. Project settings already work correctly — Project-level settings (default template, harness config, agent limits) are stored as project annotations in the database (pkg/hub/project_settings_handlers.go), so they are already consistent across all nodes. This proves the pattern works.

Current architecture summary

| What | Where stored | HA-safe? |
|---|---|---|
| Agent, project, user, template data | Postgres (Ent ORM) | Yes |
| Project settings (limits, defaults) | DB (project annotations) | Yes |
| Event delivery | Postgres LISTEN/NOTIFY (events_postgres.go) | Yes |
| Dispatch coordination | Postgres command bus (command_bus.go) | Yes |
| Leader election for scheduled work | Postgres advisory locks (concurrency.go) | Yes |
| Hub operational settings | ~/.scion/settings.yaml on each node | No |

Proposal: Two-tier settings architecture

Split settings into two tiers based on when they are needed during Hub startup:

Layer 0 — Bootstrap settings (local config / env vars)

Settings that MUST be known before the Hub can connect to the database. These stay in settings.yaml or environment variables and must be consistent across nodes via deployment tooling (Helm values, Terraform, Cloud Run env vars, etc.).

| Setting | Rationale |
|---|---|
| database.driver, database.url, pool settings | Needed to establish DB connection |
| hub.port, hub.host | Needed to bind the HTTP listener |
| broker.port, broker.host, broker.enabled | Needed to bind broker listener |
| auth.mode (oauth/proxy/dev) | Determines middleware stack at startup |
| oauth.* (client IDs/secrets) | Needed for OAuth middleware init |
| auth.proxy.* (IAP config) | Needed for proxy auth middleware init |
| secrets.backend, secrets.gcp_* | Needed to init secret backend for signing keys |
| storage.provider, storage.bucket | Needed to init storage backend |
| hub.hub_id | Needed for secret namespacing before DB access |
| server.mode (workstation/hosted) | Determines startup behavior |
| log_level, log_format | Needed for initial logging setup |

Layer 1 — Operational settings (stored in Postgres)

Settings that are only needed after the Hub is running and connected to the database. These should be stored in a new hub_settings table so all nodes read the same values.

| Setting | Currently in | Notes |
|---|---|---|
| admin_emails | settings.yaml | Already runtime-reloadable |
| user_access_mode | settings.yaml | Already runtime-reloadable |
| telemetry_enabled (default) | settings.yaml | Already runtime-reloadable |
| telemetry (full config) | settings.yaml | Already runtime-reloadable |
| auto_suspend_stalled | settings.yaml | Already runtime-reloadable |
| admin_mode / maintenance_message | settings.yaml | Should be instant across all nodes |
| soft_delete_retention | settings.yaml | Operational policy |
| soft_delete_retain_files | settings.yaml | Operational policy |
| hub.public_url (endpoint) | settings.yaml | Only used after startup for agent callbacks |
| github_app (non-secret fields) | settings.yaml | Already runtime-reloadable |
| cors.* | settings.yaml | Could be changed at runtime |
| image_registry | settings.yaml | Operational; agents need it at dispatch |
| default_template | settings.yaml | Operational default |
| default_harness_config | settings.yaml | Operational default |
| default_max_turns/model_calls/duration | settings.yaml | Operational limits |
| notification_channels | settings.yaml | Routing config (secrets stay in secret backend) |
| message_broker.* | settings.yaml | Requires restart anyway; borderline |

Implementation sketch

1. New hub_settings table

CREATE TABLE hub_settings (
    key         TEXT PRIMARY KEY,
    value       TEXT NOT NULL,       -- JSON-encoded value
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_by  TEXT                 -- email of admin who changed it
);

Using a key-value design rather than a wide table allows adding new settings without migrations.
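
To make the JSON-encoded value column concrete, here is a minimal Go sketch of how a replica might encode and decode one row. SettingRow, EncodeSetting, and DecodeSetting are hypothetical names for illustration, not existing Hub code, and the sketch leaves out the actual database access.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// SettingRow mirrors one row of the proposed hub_settings table.
// The type and its helpers are hypothetical, not existing Hub code.
type SettingRow struct {
	Key       string
	Value     string // JSON-encoded value, as stored in the value column
	UpdatedAt time.Time
	UpdatedBy string // empty when unknown; the column is nullable
}

// EncodeSetting JSON-encodes v into a row that is ready to be upserted.
func EncodeSetting(key string, v any, updatedBy string) (SettingRow, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return SettingRow{}, fmt.Errorf("encode %q: %w", key, err)
	}
	return SettingRow{
		Key:       key,
		Value:     string(raw),
		UpdatedAt: time.Now().UTC(),
		UpdatedBy: updatedBy,
	}, nil
}

// DecodeSetting decodes the row's JSON value into the requested Go type
// and returns an error if the stored JSON does not fit that type.
func DecodeSetting[T any](row SettingRow) (T, error) {
	var out T
	if err := json.Unmarshal([]byte(row.Value), &out); err != nil {
		return out, fmt.Errorf("decode %q: %w", row.Key, err)
	}
	return out, nil
}
```

Because every value is stored as JSON text, a list such as admin_emails and a boolean such as auto_suspend_stalled can share one column. The cost is that type checking moves from the schema into the decoding code.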

2. Settings cache with invalidation

Each Hub replica loads Layer 1 settings into an in-memory cache at startup. Invalidation options:

  • Postgres LISTEN/NOTIFY (preferred) — A scion_settings_changed channel. On PUT, the writing node issues NOTIFY; all replicas re-read from the hub_settings table. Piggybacks on the existing PostgresEventPublisher infrastructure.
  • Polling fallback — Re-read every N seconds (e.g., 30s) as a simpler alternative or backstop.

3. Migration path

  1. On startup, if the hub_settings table is empty, seed it from the local settings.yaml (Layer 1 fields only). This provides a zero-effort migration for existing deployments.
  2. Layer 1 values in settings.yaml become fallback defaults — the DB value takes precedence when present.
  3. The admin API (/api/v1/admin/server-config) switches to reading/writing Layer 1 settings from the DB instead of the filesystem.
  4. Env var overrides for Layer 1 settings should still work (highest priority), allowing per-node operational overrides when genuinely needed.
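
Step 1 of the migration path could look roughly like the following Go sketch. SettingsStore, memStore, and SeedIfEmpty are hypothetical; the in-memory store stands in for the hub_settings table so the logic can be read without a database.

```go
package main

// SettingsStore is a hypothetical minimal view of the hub_settings table.
type SettingsStore interface {
	Count() (int, error)
	Put(key, jsonValue, updatedBy string) error
}

// memStore is an in-memory stand-in used only for illustration.
type memStore map[string]string

func (m memStore) Count() (int, error) { return len(m), nil }

func (m memStore) Put(key, jsonValue, _ string) error {
	m[key] = jsonValue
	return nil
}

// SeedIfEmpty copies the Layer 1 values found in the local settings.yaml
// into the store, but only when the store has no rows yet. It returns the
// number of rows written.
func SeedIfEmpty(store SettingsStore, yamlLayer1 map[string]string) (int, error) {
	n, err := store.Count()
	if err != nil {
		return 0, err
	}
	if n > 0 {
		return 0, nil // already seeded or already managed via the admin API
	}
	seeded := 0
	for key, jsonValue := range yamlLayer1 {
		if err := store.Put(key, jsonValue, "settings.yaml seed"); err != nil {
			return seeded, err
		}
		seeded++
	}
	return seeded, nil
}
```

One design point this sketch does not settle: two replicas starting at the same moment could both see an empty table. Against Postgres, writing the seed rows with INSERT ... ON CONFLICT DO NOTHING inside one transaction would be one way to make the first writer win.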

4. Priority chain (highest to lowest)

Environment variable (SCION_SERVER_*)
  → hub_settings table (Layer 1)
    → settings.yaml fallback (Layer 1 fields)
      → compiled defaults

5. Admin API changes

  • GET /api/v1/admin/server-config reads Layer 1 from DB, Layer 0 from local config
  • PUT /api/v1/admin/server-config writes Layer 1 to DB + issues NOTIFY, rejects Layer 0 changes with guidance to use config files
  • New: GET /api/v1/admin/server-config/nodes could report per-node Layer 0 values for diagnostics

What this does NOT change

  • Layer 0 settings remain file/env-based — operators must keep them consistent via deployment tooling (this is standard practice for DB connection strings, TLS certs, etc.)
  • Project-level settings stay in project annotations (already HA-safe)
  • Secrets (signing keys, OAuth secrets, broker tokens) stay in the secret backend
  • Single-node / workstation mode continues to work unchanged — Layer 1 falls back to settings.yaml when no Postgres is configured

Relevant code paths

  • Config loading: pkg/config/hub_config.go (LoadGlobalConfig, GlobalConfig)
  • Versioned settings: pkg/config/settings_v1.go (VersionedSettings, V1ServerConfig)
  • Admin settings API: pkg/hub/admin_settings.go (handleAdminServerConfig, reloadSettings)
  • Project settings (DB model): pkg/hub/project_settings_handlers.go
  • Event publisher infra: pkg/hub/events_postgres.go (PostgresEventPublisher)
  • Hub server config: pkg/hub/server.go (ServerConfig)
  • Store interfaces: pkg/store/store.go
