Comprehensive architecture documentation for the Industrial Data Gateway platform.
- Core Design Principles
- System Components
- Data Flow Architecture
- Protocol Adapter System
- Schema Management
- Edge Buffering Strategy
- Time Semantics
- Configuration Distribution
- Data Quality & Validation
- Aggregation & Downsampling
- Alarm Lifecycle
- Alert Routing
- Security Model
- Failure Modes & Recovery
- State Management
- Overflow Handling
- Observability
- Update & Rollback Strategy
- Disaster Recovery
- Deployment Patterns
- Planned Copilot Tool Surface
Current status: this document describes the implemented baseline and the remaining production direction. StreamForge is still in production-readiness work. Redpanda is the chosen embedded broker direction and the local dev/runtime path is now Redpanda-backed, while production packaging and image-pull templates are still pending.
Non-negotiable principle: All industrial data flows through the gateway's local Kafka-compatible stream as the authoritative local source.
Key clarification:
- StreamForge manages only the local stream broker on each gateway
- Redpanda is the chosen embedded edge broker direction
- the local dev/runtime stack currently uses Redpanda, but production packaging is still pending
- the Kafka protocol, client libraries, topics, partitions, and consumer-group semantics remain part of the architecture
- a customer's Kafka-compatible system, if any, is treated as a sink destination, not as StreamForge-managed infrastructure
Rationale:
- Local durability: the gateway stream retains records during network or downstream sink outages
- Temporal decoupling: Adapters and sinks evolve independently
- Replayability: Any sink can replay from the local stream
- Simplicity: StreamForge is a gateway platform, not a broker operations platform
Clear separation of concerns:
| Control Plane | Data Plane |
|---|---|
| Configuration management | Protocol adapters |
| Topology visualization | Local stream broker |
| Health monitoring | Sink services |
| User interface | Data transformation |
| Planned tool/Copilot surface | Validation & aggregation |
Benefits:
- Control Plane failures don't stop data flow
- Gateway runs indefinitely with cached config
- Clear security boundaries
All data is classified at the adapter level:
| Type | Definition | Examples | Stream Topics |
|---|---|---|---|
| Telemetry | Continuous numeric measurements | Temperature, pressure | telemetry.* |
| Events | Discrete state changes | Valve opened, motor started | events.* |
| Alarms | Events with severity & lifecycle | Overpressure, fire detected | alarms.* |
Classification is determined by adapter configuration, not runtime detection.
Gateways are fully self-contained:
- The gateway-local Kafka-compatible stream buffers all data
- Configuration cached locally
- First boot requires Control Plane
- Subsequent boots work completely offline
- Sinks push to customer's destinations
┌─────────────────────────────────────────────────────────┐
│ Control Plane │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ Config API │ │ UI │ │ Planned Tools │ │
│ │ (FastAPI) │ │ (React/TS) │ │ (Tools) │ │
│ └──────┬───────┘ └──────────────┘ └───────────────┘ │
│ │ │
│ ┌──────┴───────┐ ┌──────────────┐ │
│ │ PostgreSQL │ │ Schema │ │
│ │ │ │ Registry │ │
│ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────┘
- Technology: FastAPI (Python)
- Database: PostgreSQL in the current implementation. SQLite remains an evaluated local-control-plane option, not the shipped storage model.
- Endpoints:
/api/v1/gateways- Gateway inventory, approval, status, token issue/renew, config delivery, and heartbeat surfaces (self-registration is not enabled in the currently shipped implementation)/api/v1/pipelines- Current deployment/pipeline management path/api/v1/sinks- Sink configuration/api/v1/schemas- Schema management/api/v1/alarms- Alarm management/api/v1/health- Aggregated health metrics
- Technology: React + TypeScript + CSS-based design system
- Features:
- Adapter, sink, and deployment management
- Protocol-aware configuration forms
- Gateway fleet management
- Real-time topology visualization
- Health dashboards
- Alarm management with ACK/SUPPRESS
- DLQ viewer with reprocess workflow
- Log viewer with filtering
The intended operator model is:
Gatewaysas deployment targetsAdaptersas first-class configured source connectorsSinksas first-class configured destination connectorsDeploymentsas composed gateway configurations that attach adapters and sinks to a gateway, alongside validation, event, and aggregate rules
This distinction matters because the gateway runtime already supports one active gateway configuration containing multiple adapters and multiple sinks. The UI and control-plane configuration model should reflect that directly instead of implying a one-adapter-to-one-sink workflow.
- Status: planned/deferred, not a production-complete component today
- Intended purpose: expose StreamForge capabilities as tools for external agents after the core runtime and operator workflows are production-trustworthy
- See Planned Copilot Tool Surface section
┌─────────────────────────────────────────────────────────┐
│ StreamForge Gateway │
│ │
│ ┌─────────────────────────────────────────────────────┐│
│ │ Gateway Runtime (Python asyncio) ││
│ │ ├── Config Poller (30s interval) ││
│ │ ├── Adapter Manager ││
│ │ ├── Validator Module ││
│ │ ├── Aggregator Module ││
│ │ ├── Health Reporter ││
│ │ └── Metrics Exporter (/metrics) ││
│ └─────────────────────────────────────────────────────┘│
│ │
│ ┌───────────────┐ ┌───────────────┐ ┌─────────────┐ │
│ │ Local Stream │ │ Adapters │ │ Sinks │ │
│ │ (Kafka API) │ │ (Docker) │ │ (Docker) │ │
│ └───────────────┘ └───────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
- Technology: Python asyncio
- Responsibilities:
- Load and cache configurations
- Manage adapter/sink containers
- Run validator and aggregator modules (not containers)
- Monitor component health
- Report to Control Plane
- Deployment: Docker containers
- Isolation: Each adapter runs in its own container with resource limits
- Implemented Adapters:
adapter_modbus_tcp- Modbus TCP clientadapter_modbus_rtu- Modbus RTU over serialadapter_opcua- OPC UA clientadapter_mqtt- MQTT subscriber
Planned or deferred adapter families include:
adapter_xbee- XBee wirelessadapter_lora- LoRa wireless
Adapter configuration must be protocol-aware. Flat scalar field lists are not enough for the real configuration contracts:
- Modbus adapters require repeatable point mappings, including registers and optional event/state points
- MQTT requires repeatable subscriptions and field mappings
- OPC UA requires monitored-item definitions and subscription settings
An adapter instance represents one source connection or session context and may contain many mapped signals inside it. Multi-parameter industrial sources such as PLCs, MQTT payloads, and OPC UA servers should normally be modeled as one adapter with repeatable point/subscription/monitored-item mappings, not as one adapter per parameter.
See Adapters And Deployments Spec for the locked operator-facing configuration contract.
- Deployment: Embedded Kafka-compatible broker
- Chosen direction: Redpanda
- Current local dev/runtime image:
redpandadata/redpanda:v26.1.9 - Production packaging: Pending
- Storage: Configurable (e.g., 100GB)
- Purpose: Buffer all data locally, survive network outages
- Deployment: Docker containers on gateway
- Pattern: Kafka-compatible consumer → external writer
- Implemented sinks:
sink_timescaledb- Writes to TimescaleDBsink_kafka- Replicates to customer's Kafka-compatible clustersink_http- HTTP POST to cloud endpointssink_alert_router- Routes alarms to webhooks such as Slack
Planned or deferred sink families include separate PostgreSQL/Supabase-specific writers and object-store/archive sinks such as S3/Parquet.
- Deployment: Module inside Gateway Runtime (not a container)
- Purpose: Validate data quality, route bad data to DLQ
- Deployment: Module inside Gateway Runtime (not a container)
- Purpose: Compute multi-resolution aggregates (1s, 1min, etc.)
┌─────────────────┐
│ OT Device │ (PLC, sensor, SCADA)
└────────┬────────┘
│
▼
┌───────────────────────────────────────────────┐
│ StreamForge Gateway │
│ │
│ ┌─────────────────┐ │
│ │ Protocol Adapter│ (Docker container) │
│ │ - Polls device │ │
│ │ - Normalizes │ │
│ │ - Classifies │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Local Stream │ (Kafka-compatible broker)│
│ │ telemetry.raw │ │
│ │ events.raw │ │
│ │ alarms.raw │ │
│ └────────┬────────┘ │
│ │ │
│ ├──────────────┐ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────┐ │
│ │ Validator │ │Aggregator│ │
│ │ (module) │ │(module) │ │
│ └──────┬───────┘ └────┬─────┘ │
│ │ │ │
│ ▼ ▼ │
│ telemetry.clean telemetry.1s │
│ dlq.telemetry telemetry.1min │
└───────────────────────────────────────────────┘
│
▼
┌─────────────────────┐
│ Sink Containers │
│ (S3, DB, Kafka-compatible...) │
└─────────────────────┘
│
▼
┌───────────────────────────────┐
│ Customer's Infrastructure │
│ (DB, Kafka-compatible, Cloud, etc.) │
└───────────────────────────────┘
Topics by data type, partitioned by asset_id:
telemetry.raw # Raw telemetry from adapters
telemetry.clean # Validated telemetry (GOOD/SUSPECT/UNCERTAIN)
telemetry.1s # 1-second aggregates
telemetry.1min # 1-minute aggregates
events.raw # Raw events from adapters
events.clean # Validated events
alarms.raw # Raw alarm detections
alarms.clean # Validated alarms with lifecycle
dlq.telemetry # Failed telemetry validation
dlq.events # Failed event validation
Partitioning: 12 partitions per topic, keyed by asset_id
Every adapter is a Docker container implementing this interface:
Passed via environment variable ADAPTER_CONFIG (JSON):
{
"adapter_id": "adapter_modbus_001",
"adapter_type": "modbus_tcp",
"host": "192.168.1.50",
"port": 502,
"unit_id": 1,
"asset_id": "line_1_sensor_01",
"registers": {
"40001": {"param": "temperature", "unit": "celsius", "type": "float32"}
},
"kafka_bootstrap": "kafka:9092",
"topic": "telemetry.raw",
"poll_interval_ms": 1000
}Adapter containers publish through the shared adapter publisher, which targets the gateway-local Kafka-compatible broker, keys records by asset_id when present, and waits for broker acknowledgement with acks=all.
{
"asset_id": "line_1_sensor_01",
"parameter": "temperature",
"value": 85.4,
"unit": "celsius",
"quality": "GOOD",
"classification": "TELEMETRY",
"timestamps": {
"device_time": "2026-01-29T10:00:00.123Z",
"gateway_time": "2026-01-29T10:00:00.456Z"
},
"metadata": {
"adapter_id": "adapter_modbus_001",
"adapter_version": "1.2.0"
}
}For Modbus adapters, contiguous or overlapping holding-register definitions are batched into shared reads before decoding so the gateway minimizes round trips to the device while preserving the same normalized message contract. Connection and read failures are handled with bounded retry/backoff and reconnect attempts inside the adapter before control falls back to container restart supervision.
GET /health → {"status": "healthy", "last_reading_at": "..."}
GET /metrics → Prometheus format
- Serialization: Avro (mandatory)
- Registry: Kafka-compatible Schema Registry API. The local dev stack uses Redpanda's built-in Schema Registry endpoint.
- Compatibility: BACKWARD (default)
Gateways cache schemas locally for offline operation:
- On startup, pull all relevant schemas
- Cache to local file
- If offline, use cached schemas
- When online, sync updates
Each gateway runs a single-node Kafka-compatible broker:
- Chosen direction: Redpanda
- Current local dev/runtime image:
redpandadata/redpanda:v26.1.9 - Production packaging: Pending
- Storage: Configurable per gateway (e.g., 50-500GB)
- Retention: Time-based + size-based
- Proven durability and replay
- Standard tooling and client semantics
- No custom buffering code
- Handles backpressure natively
- Redpanda keeps those semantics while reducing the operational burden for edge deployments
Every message carries multiple timestamps:
{
"timestamps": {
"device_time": "2026-01-29T10:00:00.123Z",
"gateway_time": "2026-01-29T10:00:00.456Z"
},
"clock_skew_ms": 333
}| Timestamp | Purpose |
|---|---|
device_time |
When sensor measured (authoritative for analytics) |
gateway_time |
When gateway received (for lag detection) |
clock_skew_ms |
Difference (for quality monitoring) |
Rules:
- Analytics use
device_time - If
device_timemissing → usegateway_time, mark quality asUNCERTAIN - Alert if
clock_skew > 10 seconds
Gateway ──(poll every 30s)──> Control Plane API
<──(config JSON)────
Optional: WebSocket notification on config change triggers immediate pull.
- Gateway caches control-plane config to a local file (
/data/config/gateway.jsonby default) - First boot requires Control Plane if no valid cached config exists yet
- Subsequent boots start from cached config immediately, then refresh in the background
- Gateway runs indefinitely offline once a valid cache has been established
| Code | Meaning | Action |
|---|---|---|
| GOOD | Passed all validation | Forward to .clean topic |
| SUSPECT | Anomalous but valid | Forward with flag |
| UNCERTAIN | Low confidence (e.g., missing device_time) | Forward with warning |
| BAD | Failed validation | Route to DLQ |
- Range checks: Value within min/max
- Rate-of-change: Detect spikes
- Duplicate detection: Same asset + param + timestamp
- Gap detection: Missing expected readings
- BAD messages go to
dlq.telemetry(ordlq.events) - Operator views in UI with error reason
- Operator can:
- Approve single message → reprocess
- Bulk approve → reprocess batch
- Update validation rules → reprocess all
- Approved messages flow to
.cleantopic
Aggregator module produces multiple resolutions:
telemetry.raw (100 Hz) → telemetry.1s → telemetry.1min → telemetry.1hour
{
"asset_id": "vibration_sensor_01",
"parameter": "vibration",
"window_start": "2026-01-29T10:00:00Z",
"window_end": "2026-01-29T10:01:00Z",
"aggregates": {
"avg": 3.42,
"min": 2.91,
"max": 4.15,
"count": 6000
}
}Each pipeline can specify which resolutions to produce.
| State | Meaning |
|---|---|
| ACTIVE | Alarm condition detected |
| ACKNOWLEDGED | Operator has seen it |
| CLEARED | Condition no longer true |
| SUPPRESSED | Manually silenced |
| Action | Owner |
|---|---|
| Detect alarm condition | Gateway (adapter or validator) |
| ACK/SUPPRESS | Control Plane (via UI) |
| Store state | Local stream + Control Plane DB |
Dedicated sink for routing alarms to external systems:
routes:
- severity: CRITICAL
destinations: [pagerduty, sms]
- severity: HIGH
destinations: [slack]
- severity: [MEDIUM, LOW]
destinations: [email]
digest_interval: 15minAll alarms visible in Control Plane UI regardless of alert routing.
- JWT tokens (1-year validity)
- Auto-renew before expiry
- Gateway enrollment tokens create pending gateway records
- Operator approval is required before token issuance
- Built-in username/password
- Optional OAuth2/OIDC integration
| Role | Permissions |
|---|---|
| Viewer | Read dashboards, metrics, logs |
| Operator | + ACK alarms, restart components, approve DLQ |
| Engineer | + Create/edit pipelines, modify validation rules |
| Admin | + Manage users, gateways, full access |
Roles can be scoped to specific gateways:
- "Engineer for Factory North only"
- "Operator for all gateways"
| Connection | Encryption |
|---|---|
| Gateway → Control Plane | TLS required |
| Sink → External | TLS required |
| Internal (localhost) | Optional (demo: none) |
Hybrid approach: Simple restart + bulkhead isolation + circuit breaker
| Component | Failure | Recovery |
|---|---|---|
| Adapter | Crash | Auto-restart (max 5 in 5 min) |
| Adapter | Protocol timeout | Retry with backoff, mark DEGRADED |
| Sink | Downstream unreachable | Circuit breaker opens, local stream retains uncommitted records, retry after cooldown |
| Validator | Crash | Auto-restart, messages queue |
| Local broker | Crash | Auto-restart, data preserved |
| Gateway Runtime | Crash | Systemd restarts, children restart |
Each adapter/sink has Docker resource limits:
- Memory limit
- CPU limit
- One failing adapter cannot starve others
For sinks and Control Plane communication:
- Closed: Normal operation
- Open: After 5 consecutive failures, pause 30s
- Half-open: Try one request, success → Closed, fail → Open
- Sink consumers commit local stream offsets only after successful downstream writes so buffered records remain retryable while the breaker is open
| State | Meaning |
|---|---|
| HEALTHY | All components working |
| DEGRADED | Some non-critical components failed |
| UNHEALTHY | Critical component failed |
| FAILED | Max retries exceeded, needs manual intervention |
| State | Location | Persistence |
|---|---|---|
| Gateway config | Cached locally (file) | Synced from Control Plane |
| Adapter state | In-memory | Stateless (restart clean) |
| Local stream data | Broker disk | Persistent |
| Sink progress | Kafka-compatible consumer offsets | Persistent |
| Alarm state | Local stream + Control Plane DB | Persistent |
| DLQ messages | Local stream topic | Persistent |
| Schema cache | Local file | Synced from registry |
- The gateway-local stream is source of truth for data
- Control Plane DB is source of truth for config
- Stateless adapters — restart cleanly
- Consumer offsets track progress — sinks resume exactly where they left off
When local broker storage fills up:
| Disk Usage | Action |
|---|---|
| 0-70% | Normal operation |
| 70-80% | Alert. Compress old segments. |
| 80-90% | Aggressive downsampling (keep 1min, discard raw older than 1hr) |
| 90-95% | Evict oldest low-priority data |
| 95%+ | Block producers. Critical alert. |
| Priority | Topics |
|---|---|
| Critical | alarms.*, dlq.* |
| High | events.*, telemetry.1min |
| Medium | telemetry.1s |
| Low | telemetry.raw |
Raw telemetry is evicted first. Alarms are never evicted.
Collection: Local Prometheus + remote push
Architecture:
[Adapters /metrics]
[Sinks /metrics]
[Gateway /metrics]
↓
[Local Prometheus] ──(push)──> Control Plane
Retention:
- Local: 24 hours
- Control Plane: 30 days
Format: Structured JSON
{
"timestamp": "2026-01-29T10:00:00Z",
"level": "ERROR",
"component": "adapter-modbus",
"gateway_id": "gw-factory-01",
"message": "Connection timeout"
}Destinations:
- Local: 7 days, 100MB ring buffer
- Remote: Push to Control Plane (30 days)
- Optional: External (Loki, Elasticsearch)
Endpoints (all components):
GET /health/live → 200 or 503
GET /health/ready → 200 or 503
GET /health → Full status JSON
Gateway Status Roll-up:
| Status | Condition |
|---|---|
| HEALTHY | All components healthy |
| DEGRADED | Some non-critical failed |
| UNHEALTHY | Critical component failed |
| OFFLINE | No heartbeat to Control Plane |
Each gateway config specifies exact versions:
adapters:
modbus: "v1.2.3"
opcua: "v2.0.1"
sinks:
s3: "v1.0.0"- Operator triggers update in UI
- Control Plane pushes new config with version
- Gateway pulls new container image
- Stop old container → Start new container
- Health check → Success: done / Fail: auto-rollback
- Previous image kept locally
- One-click rollback in UI
- Automatic rollback if health check fails after update
| What | Method | Frequency |
|---|---|---|
| PostgreSQL | pg_dump | Daily + before updates |
| Schema Registry | Export | Daily |
| Config files | File backup | On change |
Restore: Deploy fresh → restore database → gateways reconnect.
Gateways are cattle, not pets.
| Scenario | Recovery |
|---|---|
| Hardware dies | Deploy new, register same ID, config pulled |
| Local broker data corrupted | Restart, affected buffered data may be lost, adapters resume polling |
| Config corrupted | Re-pull from Control Plane |
Configuration state should be recoverable from the control plane. Buffered data that has not yet drained to a sink still depends on the gateway's local broker disk, so disk protection and backup choices matter in production packaging.
| Mode | Description |
|---|---|
| Remote | Control Plane in cloud/datacenter, gateways connect |
| Local | Control Plane on gateway itself (single-gateway deployment) |
| Hybrid | Central Control Plane + local UI per gateway |
Cloud/Datacenter:
└── Control Plane (API + UI + DB)
Edge (each location):
└── Gateway (Runtime + local stream + Adapters + Sinks)
Gateway Device:
├── Control Plane (PostgreSQL in current implementation)
├── Gateway Runtime
├── Local stream broker
├── Adapters
└── Sinks
Cloud:
└── Central Control Plane
Edge:
├── Gateway
└── Local UI (optional, connects to same API)
Status: planned/deferred. This section captures the intended direction, not a production-complete subsystem in the current codebase. AI/copilot work should remain behind the production-readiness items tracked in Production Readiness Reconciliation.
StreamForge should expose MCP-compatible tools, not a monolithic agent.
Why: Customers may have their own orchestrating agent. Multiple agents = chaos. One agent calling multiple tools = clean.
[Customer's Orchestrating Agent]
↓
[MCP Protocol]
↓
┌─────┴─────┬──────────┬──────────┐
↓ ↓ ↓ ↓
[StreamForge] [Data [RAG [ML
Tools] Platform] System] Model]
| Tool | Purpose |
|---|---|
list_gateways |
List all gateways with status |
get_gateway_details |
Full details for one gateway |
get_adapter_status |
Adapter health and metrics |
get_sink_status |
Sink health and lag |
get_logs |
Query logs with filters |
get_metrics |
Query metrics |
get_alarms |
List alarms |
get_dlq_messages |
Query DLQ |
get_config |
Read config for any component |
get_schemas |
List schemas |
| Tool | Purpose |
|---|---|
diagnose_gateway |
Analyze health, suggest fixes |
analyze_dlq |
Pattern analysis on failed messages |
explain_alarm |
Context and actions for alarm |
compare_configs |
Diff two versions |
| Tool | Purpose |
|---|---|
suggest_config_change |
Returns diff, no apply |
apply_config_change |
Apply with approval token |
restart_component |
Restart adapter/sink |
approve_dlq_messages |
Reprocess DLQ messages |
acknowledge_alarm |
Ack an alarm |
generate_adapter_template |
Scaffold new adapter |
generate_sink_template |
Scaffold new sink |
For users who only use StreamForge:
- Chat UI in dashboard
- Uses the same MCP tools internally
- Can be disabled
- Delete gateway
- Delete sink data
- Modify production without review
- Access credentials directly
- Execute arbitrary code
- Disable authentication
- Self-approve changes
- Deploy new adapter/sink
- Change validation rules
- Update gateway version
- Modify alarm thresholds
- Restart components
Every Copilot action logged:
{
"timestamp": "...",
"user": "john@example.com",
"copilot_action": "suggested_config_change",
"target": "adapter-modbus-01",
"approved": true,
"approved_by": "john@example.com"
}| # | Decision | Choice |
|---|---|---|
| 1 | Edge Buffering | Embedded Kafka-compatible broker; Redpanda chosen for edge, production packaging pending |
| 2 | Protocol Adapters | Docker containers |
| 3 | Schema Management | Avro + Schema Registry + offline caching |
| 4 | Config Distribution | Poll every 30s + optional push |
| 5 | Time Semantics | Multi-timestamp (device_time, gateway_time, clock_skew) |
| 6 | Data Quality | Validator module. GOOD/BAD/SUSPECT/UNCERTAIN. BAD → DLQ |
| 7 | Aggregation | Edge module. Multi-resolution topics |
| 8 | Serialization | Avro |
| 9 | Topics | By data type, partition by asset_id, 12 partitions |
| 10 | Alarms | ACTIVE/ACK/CLEARED/SUPPRESSED. Gateway detects, Control Plane manages |
| 11 | Classification | At adapter level via config |
| 12 | Alert Routing | Optional Alert Router sink |
| 13 | Coupling | Loosely coupled. Gateway runs offline indefinitely |
| 14 | Sinks | Docker containers on gateway |
| 15 | Autonomy | Full. First boot needs Control Plane, then offline OK |
| 16 | Registration | Admin-managed enrollment tokens, pending gateway review, and operator approval today; packaged site runbooks still pending |
| 17 | Auth | Gateways: JWT. Users: built-in auth today; OAuth/OIDC later |
| 18 | RBAC | 4 roles + resource scoping |
| 19 | Failure | Hybrid: restart + bulkhead + circuit breaker |
| 20 | State | Local Kafka-compatible stream = data truth. Control Plane = config truth |
| 21 | Overflow | Tiered: compress → downsample → evict by priority |
| 22 | Metrics | Local Prometheus + remote push |
| 23 | Logging | Structured JSON, push to Control Plane |
| 24 | Health | Liveness/readiness/deep. Status roll-up |
| 25 | Multi-tenancy | Single-tenant (demo). Multi-tenant later if SaaS |
| 26 | Network | TLS for external. Optional for internal |
| 27 | Updates | Version pinning, operator-triggered, auto-rollback |
| 28 | DR | Control Plane backups. Gateways are cattle |
| 29 | Copilot Safety | Hard blocks, soft blocks, audit trail |
| 30 | Copilot Scope | MCP tools. Built-in Copilot optional |
End of Architecture Document
For detailed decision rationale, see ADR documents.