Skip to content

Latest commit

 

History

History
1129 lines (918 loc) · 48.6 KB

File metadata and controls

1129 lines (918 loc) · 48.6 KB

Agentic Sandbox - System Architecture

Version: 1.0 Date: 2025-02-07 Status: Complete

This document describes the architecture of the Agentic Sandbox system, which provides runtime isolation for persistent, unrestricted AI agent processes.


Table of Contents

  1. System Overview
  2. Management Server Architecture
  3. Agent Client Architecture
  4. Task Orchestration
  5. Communication Protocols
  6. Security Architecture
  7. Observability
  8. Data Flow Diagrams
  9. AIWG Integration Architecture

1. System Overview

1.1 Purpose

The Agentic Sandbox provides secure, isolated runtime environments for AI agents (primarily Claude Code). Each agent runs inside a dedicated QEMU/KVM virtual machine with:

  • Full hardware virtualization isolation from the host
  • Persistent storage for workspaces and outputs
  • Bidirectional real-time communication with the management server
  • Resource limits (CPU, memory, disk) enforcement
  • Network isolation options (isolated, outbound-only, full)

1.2 High-Level Architecture

+----------------------------------------------------------------------------+
|                              Host System                                    |
|                                                                            |
|  +----------------------------------------------------------------------+  |
|  |                    Management Server (Rust/Tokio)                    |  |
|  |                                                                      |  |
|  |  +----------+  +------------+  +----------+  +------------------+    |  |
|  |  |  gRPC    |  | WebSocket  |  |   HTTP   |  |   Orchestrator   |    |  |
|  |  |  :8120   |  |   :8121    |  |  :8122   |  |                  |    |  |
|  |  +----+-----+  +-----+------+  +----+-----+  +--------+---------+    |  |
|  |       |              |              |                 |              |  |
|  |       +------+-------+------+-------+--------+--------+              |  |
|  |              |              |                |                       |  |
|  |  +----------+--+  +---------+---+  +---------+-------+               |  |
|  |  |   Agent     |  |   Command   |  |    Telemetry    |               |  |
|  |  |  Registry   |  |  Dispatcher |  |  (Logs/Metrics) |               |  |
|  |  +-------------+  +-------------+  +-----------------+               |  |
|  +----------------------------------------------------------------------+  |
|                              |                                              |
|                              | gRPC Bidirectional Streaming                 |
|                              |                                              |
|  +----------------------------------------------------------------------+  |
|  |                    QEMU/KVM Virtual Machines                         |  |
|  |                                                                      |  |
|  |  +----------------+  +----------------+  +----------------+          |  |
|  |  |   Agent VM 1   |  |   Agent VM 2   |  |   Agent VM N   |   ...    |  |
|  |  |   (agent-01)   |  |   (agent-02)   |  |   (agent-NN)   |          |  |
|  |  |                |  |                |  |                |          |  |
|  |  | +------------+ |  | +------------+ |  | +------------+ |          |  |
|  |  | |   Agent    | |  | |   Agent    | |  | |   Agent    | |          |  |
|  |  | |   Client   | |  | |   Client   | |  | |   Client   | |          |  |
|  |  | | (Rust)     | |  | | (Rust)     | |  | | (Rust)     | |          |  |
|  |  | +------------+ |  | +------------+ |  | +------------+ |          |  |
|  |  +----------------+  +----------------+  +----------------+          |  |
|  +----------------------------------------------------------------------+  |
|                              |                                              |
|  +----------------------------------------------------------------------+  |
|  |                    Agentshare (virtiofs)                             |  |
|  |  +--------------------+  +--------------------------------------+    |  |
|  |  |  /global (RO)      |  |  /inbox/<agent-id> (RW)              |    |  |
|  |  |  Shared resources  |  |  Per-agent workspaces and outputs    |    |  |
|  |  +--------------------+  +--------------------------------------+    |  |
|  +----------------------------------------------------------------------+  |
+----------------------------------------------------------------------------+

1.3 Key Components

Component Technology Purpose
Management Server Rust (Tokio, Tonic, Axum) Central control plane for agent orchestration
Agent Client Rust In-VM process connecting to management server
VM Runtime QEMU/KVM via libvirt Hardware virtualization for isolation
Shared Storage virtiofs Host-guest filesystem sharing
Provisioning Bash + cloud-init VM creation and configuration

1.4 Port Assignments

Port Protocol Purpose
8120 gRPC Agent bidirectional streaming
8121 WebSocket Real-time output streaming to clients
8122 HTTP Dashboard, REST API, metrics

2. Management Server Architecture

The management server (management/src/main.rs) is the central control plane, built on Tokio's async runtime.

2.1 Module Structure

management/src/
├── main.rs              # Entry point, server bootstrap
├── config.rs            # Configuration (env-based)
├── grpc.rs              # gRPC service implementation
├── registry.rs          # Agent registry (DashMap)
├── auth.rs              # Secret store, token verification
├── dispatch.rs          # Command dispatcher
├── output/              # Output aggregation
├── ws/                  # WebSocket hub
├── http/                # HTTP server (Axum)
│   ├── mod.rs
│   ├── server.rs
│   ├── health.rs
│   ├── vms.rs
│   ├── tasks.rs
│   ├── events.rs
│   └── ...
├── heartbeat.rs         # Stale connection detection
├── libvirt_events.rs    # VM lifecycle events
├── crash_loop.rs        # Crash loop detection
├── orchestrator/        # Task orchestration (22 modules)
└── telemetry/           # Logging, metrics, tracing

2.2 Core Subsystems

2.2.1 Agent Registry

File: management/src/registry.rs

The AgentRegistry tracks all connected agents using a lock-free DashMap:

pub struct AgentRegistry {
    agents: DashMap<String, ConnectedAgent>,
}

pub struct ConnectedAgent {
    pub agent_id: String,
    pub registration: AgentRegistration,
    pub status: AgentStatus,
    pub connected_at: DateTime<Utc>,
    pub last_heartbeat: DateTime<Utc>,
    pub command_tx: mpsc::Sender<ManagementMessage>,
    pub metrics: Option<AgentMetrics>,
}

Key Operations:

  • register() - Add new agent connection
  • unregister() - Remove disconnected agent
  • heartbeat() - Update last-seen timestamp
  • send_command() - Route command to specific agent
  • mark_stale() / mark_disconnected() - Connection health tracking

2.2.2 Command Dispatcher

File: management/src/dispatch.rs

Routes commands to agents and tracks pending executions:

  • Generates unique command IDs (UUIDs)
  • Maintains pending command map for result correlation
  • Handles session reconciliation after agent reconnection
  • Supports three session types:
    • Interactive - PTY-based terminal sessions
    • Headless - Non-interactive command execution
    • Background - Long-running daemon processes

2.2.3 Output Aggregator

File: management/src/output/aggregator.rs

Buffers and broadcasts output streams:

  • Per-agent circular buffers for stdout/stderr/log
  • Subscription system for WebSocket clients
  • Handles backpressure when clients are slow

2.3 Orchestrator Subsystem

Directory: management/src/orchestrator/

The orchestrator manages complete task lifecycles for Claude Code execution:

Module Purpose
mod.rs Central Orchestrator struct
task.rs Task state machine (10 states)
manifest.rs Task manifest parsing
storage.rs Filesystem operations (inbox/outbox)
checkpoint.rs State persistence for recovery
executor.rs Task execution logic
monitor.rs Real-time output monitoring
collector.rs Artifact collection
artifacts.rs Streaming artifact uploads
secrets.rs Secret resolution (env/file/Vault)
timeouts.rs Timeout enforcement
retry.rs Exponential backoff retry
degradation.rs Graceful degradation
hang_detection.rs Stuck task detection
reconciliation.rs State reconciliation
cleanup.rs Resource cleanup
multi_agent.rs Parent-child task tracking
slo.rs SLO/SLI tracking
circuit_breaker.rs Failure isolation
vm_pool.rs VM pool management
audit.rs Audit logging

2.4 Telemetry Subsystem

Directory: management/src/telemetry/

Module Purpose
mod.rs Telemetry initialization
logging.rs Structured logging (pretty/JSON/compact)
metrics.rs Prometheus metrics
trace_id.rs Distributed trace ID propagation
otel.rs OpenTelemetry export (optional)

Logging Configuration (Environment Variables):

Variable Values Default
LOG_LEVEL trace, debug, info, warn, error info
LOG_FORMAT pretty, json, compact pretty
LOG_FILE Path to log file (none)
LOG_FILE_ROTATION hourly, daily, never daily

2.5 Crash Loop Detection

File: management/src/crash_loop.rs

Monitors VM lifecycle events for crash patterns:

pub struct CrashLoopConfig {
    pub max_restarts: u32,          // 5 restarts triggers loop
    pub window_minutes: u32,        // 10 minute window
    pub min_uptime_seconds: u32,    // 60s to count as healthy
    pub remediation_enabled: bool,  // Auto-rebuild on crash loop
    pub max_rebuild_attempts: u32,  // 3 attempts before giving up
}

VM States:

  • Healthy - Normal operation
  • Starting - VM booting
  • Recovering - Recovering from crash
  • CrashLoop - Repeated crashes detected
  • Rebuilding - Auto-remediation in progress
  • Failed - Max rebuilds exhausted

3. Agent Client Architecture

The agent client (agent-rs/src/main.rs) runs inside each VM, connecting to the management server.

3.1 Module Structure

agent-rs/src/
├── main.rs      # Entry point, connection loop
├── health.rs    # Health state machine
├── metrics.rs   # Prometheus metrics export
├── claude.rs    # Claude Code task runner
└── lib.rs       # Library exports

3.2 Connection Lifecycle

                    ┌─────────────────────┐
                    │     Boot VM         │
                    └──────────┬──────────┘
                               │
                    ┌──────────▼──────────┐
                    │  Load Configuration │
                    │  (/etc/agentic-     │
                    │   sandbox/agent.env)│
                    └──────────┬──────────┘
                               │
               ┌───────────────▼───────────────┐
               │  Connect to Management Server │◄────┐
               │  (host.internal:8120)         │     │
               └───────────────┬───────────────┘     │
                               │                     │
                    ┌──────────▼──────────┐          │
                    │  Send Registration  │          │
                    │  (AgentRegistration)│          │
                    └──────────┬──────────┘          │
                               │                     │
                    ┌──────────▼──────────┐          │
                    │  Receive Ack +      │          │
                    │  Session Query      │          │
                    └──────────┬──────────┘          │
                               │                     │
           ┌───────────────────▼───────────────────┐ │
           │         Main Stream Loop              │ │
           │  ┌─────────────────────────────────┐  │ │
           │  │ Receive Commands                │  │ │
           │  │ Send Heartbeats (5s interval)   │  │ │
           │  │ Stream stdout/stderr            │  │ │
           │  │ Report Command Results          │  │ │
           │  └─────────────────────────────────┘  │ │
           └───────────────────┬───────────────────┘ │
                               │                     │
                    ┌──────────▼──────────┐          │
                    │  Disconnect         │          │
                    │  (cleanup sessions) │          │
                    └──────────┬──────────┘          │
                               │                     │
                    ┌──────────▼──────────┐          │
                    │  Reconnect Backoff  ├──────────┘
                    │  (5s → 60s max)     │
                    └─────────────────────┘

3.3 Command Execution

The agent supports three execution modes:

Standard Execution (Non-PTY)

async fn execute_command(cmd: CommandRequest, ...) {
    let mut process = Command::new(&cmd.command)
        .args(&cmd.args)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;

    // Stream stdout/stderr back via gRPC
    // Forward stdin from management server
    // Report CommandResult on completion
}

PTY Execution (Interactive)

async fn execute_command_pty(cmd: CommandRequest, ...) {
    let pty = openpty(None, None)?;

    // Fork child process
    match unsafe { unistd::fork() } {
        Ok(ForkResult::Child) => {
            // Set up PTY as controlling terminal
            // Exec command via bash
        }
        Ok(ForkResult::Parent { child }) => {
            // Forward stdin to PTY master
            // Read PTY output, stream to gRPC
            // Handle resize (SIGWINCH)
            // Handle signals (SIGTERM, SIGINT)
        }
    }
}

Claude Task Execution

async fn execute_claude_task(cmd: CommandRequest, ...) {
    let config: ClaudeTaskConfig = serde_json::from_str(&cmd.args[0])?;

    let runner = ClaudeRunner::new(config);
    let exit_code = runner.run(output_tx).await?;
    // Output streamed via gRPC OutputChunk messages
}

3.4 Health Monitoring

File: agent-rs/src/health.rs

Three-state health model:

State Description Behavior
Healthy Normal operation Accept new tasks
Degraded Limited capacity Finish existing, reject new
Unhealthy Recovery mode Diagnostic only

Health Transitions:

Healthy ──(3 consecutive failures)──► Degraded
Degraded ──(5 consecutive failures)──► Unhealthy
Unhealthy ──(3 consecutive successes)──► Healthy

Triggers:

  • Connection failures
  • Memory usage > 85% (degraded) / > 95% (unhealthy)
  • Frequent restarts (> 3)
  • Circuit breaker trips

3.5 Systemd Watchdog Integration

The agent integrates with systemd's watchdog:

pub struct SystemdWatchdog {
    enabled: bool,
    interval: Duration,  // Half of WATCHDOG_USEC
    health: Arc<HealthMonitor>,
}

impl SystemdWatchdog {
    pub fn notify_ready(&self) -> Result<(), String>;
    pub fn ping(&self) -> Result<(), String>;
    pub async fn run_ping_loop(self: Arc<Self>);
}

The management server also integrates with systemd. Its packaged agentic-mgmt.service uses Type=notify, WatchdogSec=30, KillMode=mixed, and LimitNOFILE=1048576. agentic-mgmt sends READY=1 after the gRPC listener has bound and the HTTP/WebSocket startup tasks have been launched, then sends WATCHDOG=1 at half of the watchdog interval reported by systemd. This complements the in-process HTTP self-watchdog: systemd observes runtime scheduler stalls and restarts the process, while the HTTP self-watchdog exits when the HTTP task wedges but the runtime still schedules other tasks.


4. Task Orchestration

4.1 Task Lifecycle States

File: management/src/orchestrator/task.rs

            ┌─────────────────────────────────────────────────────────────┐
            │                                                             │
   ┌────────▼────────┐                                                    │
   │     Pending     │                                                    │
   └────────┬────────┘                                                    │
            │                                                             │
   ┌────────▼────────┐     ┌──────────────────────────────────────────┐   │
   │     Staging     │────►│                                          │   │
   │ (clone repo)    │     │                                          │   │
   └────────┬────────┘     │                                          │   │
            │              │                                          │   │
   ┌────────▼────────┐     │              Failed                      │   │
   │  Provisioning   │────►│              (destroy VM)                │   │
   │  (create VM)    │     │                                          │   │
   └────────┬────────┘     │                       OR                 │   │
            │              │                                          │   │
   ┌────────▼────────┐     │           FailedPreserved               │   │
   │      Ready      │────►│           (preserve for debug)           │   │
   │                 │     │                                          │   │
   └────────┬────────┘     └──────────────────────────────────────────┘   │
            │                                                             │
   ┌────────▼────────┐                                                    │
   │     Running     │───────────────────────────────────────────────────►│
   │ (Claude Code)   │                      Cancelled                     │
   └────────┬────────┘                                                    │
            │                                                             │
   ┌────────▼────────┐                                                    │
   │   Completing    │                                                    │
   │ (collect arts)  │                                                    │
   └────────┬────────┘                                                    │
            │                                                             │
   ┌────────▼────────┐                                                    │
   │    Completed    │                                                    │
   └─────────────────┘                                                    │

4.2 Storage Model

File: management/src/orchestrator/storage.rs

Each task has a dedicated directory structure:

/srv/agentshare/tasks/{task-id}/
├── manifest.yaml            # Original task manifest
├── inbox/                   # Inputs for agent
│   └── TASK.md             # Task instructions
└── outbox/                  # Outputs from agent
    ├── metadata.json        # Task metadata
    ├── progress/
    │   ├── stdout.log       # Claude stdout
    │   ├── stderr.log       # Claude stderr
    │   └── events.jsonl     # Structured events
    └── artifacts/           # Collected output files
        └── ...

Agentshare Mounts (virtiofs):

Host Path VM Mount Access
/srv/agentshare/global-ro /mnt/global Read-only
/srv/agentshare/inbox/<agent-id> /mnt/inbox Read-write

4.3 Hang Detection

File: management/src/orchestrator/hang_detection.rs

Multiple detection strategies:

Strategy Threshold Description
OutputSilence 10 minutes No stdout/stderr output
CpuIdle 15 minutes CPU < 5%
ProcessStuck 20 minutes No progress indicators

Recovery Actions:

  • NotifyOnly - Alert but don't intervene
  • Terminate - Kill task, cleanup VM
  • Restart - Restore from checkpoint
  • PreserveForDebug - Keep VM for investigation

4.4 Retry Policies

File: management/src/orchestrator/retry.rs

pub struct RetryPolicy {
    pub max_attempts: u32,
    pub initial_delay: Duration,
    pub max_delay: Duration,
    pub multiplier: f64,      // Exponential backoff
    pub jitter: bool,         // ±15% randomization
}

// Predefined policies
RetryPolicy::GIT_CLONE     // 3 attempts, 5s→60s
RetryPolicy::VM_PROVISION  // 2 attempts, 10s→30s
RetryPolicy::SSH_CONNECT   // 5 attempts, 2s→30s

4.5 Secret Resolution

File: management/src/orchestrator/secrets.rs

Secrets can be resolved from multiple sources:

Source Format Example
env Environment variable ANTHROPIC_API_KEY
file File path /run/secrets/api-key
vault HashiCorp Vault myapp/db:password

Vault Configuration:

export VAULT_ADDR="https://vault.example.com:8200"
export VAULT_TOKEN="s.xxxxx"
export VAULT_MOUNT="secret"  # Default KV v2 mount

5. Communication Protocols

5.1 gRPC Bidirectional Streaming

File: proto/agent.proto

The primary communication channel uses gRPC bidirectional streaming:

service AgentService {
  // Bidirectional stream for agent-management communication
  rpc Connect(stream AgentMessage) returns (stream ManagementMessage);

  // One-shot command execution with streaming output
  rpc Exec(ExecRequest) returns (stream ExecOutput);
}

5.2 Message Types

Agent → Management

Message Purpose
AgentRegistration Initial connection with system info
Heartbeat Liveness signal with basic metrics
OutputChunk Stdout/stderr/log streaming
CommandResult Command completion notification
Metrics Full metrics snapshot
SessionReport Active sessions for reconciliation
SessionReconcileAck Confirmation of session cleanup

Management → Agent

Message Purpose
RegistrationAck Accept/reject connection
CommandRequest Execute command
StdinChunk Input for running command
PtyControl Resize terminal, send signal
ConfigUpdate Runtime configuration
ShutdownSignal Graceful shutdown
SessionQuery Request active sessions
SessionReconcile Instruct session cleanup

5.3 Session Reconciliation Protocol

On agent reconnection, stale sessions are cleaned up:

Management                              Agent
    │                                     │
    │    SessionQuery(report_all=true)    │
    │────────────────────────────────────►│
    │                                     │
    │    SessionReport(sessions=[...])    │
    │◄────────────────────────────────────│
    │                                     │
    │    SessionReconcile(                │
    │      keep=[known_ids],              │
    │      kill=[orphan_ids])             │
    │────────────────────────────────────►│
    │                                     │
    │    SessionReconcileAck(             │
    │      killed=[...],                  │
    │      kept=[...])                    │
    │◄────────────────────────────────────│

5.4 PTY Control

For interactive sessions:

message PtyControl {
  string command_id = 1;
  oneof action {
    PtyResize resize = 2;   // Terminal resize (SIGWINCH)
    PtySignal signal = 3;   // Send signal (SIGINT, SIGTERM)
  }
}

message PtyResize {
  uint32 cols = 1;
  uint32 rows = 2;
}

6. Security Architecture

6.1 VM Isolation

Each agent runs in a fully isolated QEMU/KVM virtual machine:

  • Hardware Virtualization: Full x86 emulation with VT-x/AMD-V
  • Memory Isolation: Separate address space, no shared memory with host
  • Disk Isolation: Per-VM qcow2 images with optional COW
  • Process Isolation: Agent cannot see host processes

6.2 Network Isolation

Three network modes:

Mode Description Use Case
Isolated No network access Sandboxed tasks
Outbound Egress only with allowlist API access
Full Full network access Development VMs

Implementation:

  • libvirt network with NAT
  • Optional DNS-based filtering (Blocky)
  • Per-VM firewall rules via iptables

6.3 Secret Management

Host Side:

~/.config/agentic-sandbox/agent-tokens
├── agent-01.hash    # SHA256 hash of secret
├── agent-02.hash
└── ...

VM Side:

/etc/agentic-sandbox/agent.env    # Plaintext secret (mode 600)

Authentication Flow:

Agent                            Management Server
  │                                      │
  │  gRPC Connect with headers:          │
  │    x-agent-id: agent-01              │
  │    x-agent-secret: <plaintext>       │
  │─────────────────────────────────────►│
  │                                      │
  │                    ┌─────────────────┤
  │                    │ SHA256(secret)  │
  │                    │ == stored hash? │
  │                    └─────────────────┤
  │                                      │
  │  RegistrationAck(accepted=true)      │
  │◄─────────────────────────────────────│

6.4 Resource Limits

Enforced via libvirt domain XML:

<domain>
  <vcpu>4</vcpu>
  <memory unit='GiB'>8</memory>
  <blkiotune>
    <device>
      <path>/var/lib/libvirt/images/agent-01.qcow2</path>
      <read_bytes_sec>100000000</read_bytes_sec>
      <write_bytes_sec>50000000</write_bytes_sec>
    </device>
  </blkiotune>
</domain>

Additional Limits:

  • Disk quotas via ext4 project quotas
  • Network bandwidth via tc
  • Process limits via cgroups

7. Observability

7.1 Prometheus Metrics

Endpoint: http://localhost:8122/metrics

Server Metrics:

# Agent metrics
agentic_agents_connected                    # Current connections
agentic_agents_by_status{status="ready"}    # By status
agentic_agent_sessions_active{agent_id}     # Active sessions

# Command metrics
agentic_commands_total                      # Total dispatched
agentic_commands_by_result{result="success"}
agentic_command_latency_seconds_bucket{le="1"}  # Histogram

# Task metrics
agentic_tasks_total
agentic_tasks_by_state{state="running"}
agentic_task_outcomes_total{outcome="success"}

# WebSocket metrics
agentic_ws_connections_current
agentic_ws_connections_total

Agent Metrics (in-VM):

agentic_agent_health_state{agent_id,state}
agentic_agent_restarts_total{agent_id}
agentic_agent_watchdog_pings_total{agent_id}
agentic_agent_circuit_breaker_trips{agent_id}
agentic_agent_uptime_seconds{agent_id}

7.2 Structured Logging

Formats:

  • pretty - Human-readable with colors (default)
  • json - Machine-parseable JSON lines
  • compact - Single-line minimal

Log Fields:

{
  "timestamp": "2025-02-07T12:00:00Z",
  "level": "INFO",
  "target": "agentic_management::grpc",
  "message": "Agent registered",
  "agent_id": "agent-01",
  "ip_address": "192.168.122.201",
  "trace_id": "abc123"
}

7.3 Trace ID Propagation

Every request carries a trace ID for correlation:

// Extract from incoming request
let trace_id = extract_trace_id(&request).unwrap_or_else(generate_trace_id);

// Include in all log messages
info!(trace_id = %trace_id, "Processing request");

// Pass to downstream services
request.metadata_mut().insert("x-trace-id", trace_id);

7.4 Audit Trail

File: management/src/orchestrator/audit.rs

All significant operations are audit-logged:

pub enum AuditEventType {
    TaskSubmitted,
    TaskCompleted,
    TaskFailed,
    TaskCancelled,
    VmProvisioned,
    VmDestroyed,
    SecretAccessed,
    SessionReconciled,
}

pub struct AuditEvent {
    pub timestamp: DateTime<Utc>,
    pub event_type: AuditEventType,
    pub actor: String,      // User or system
    pub resource: String,   // Task ID, VM name
    pub outcome: Outcome,   // Success, Failure
    pub details: Value,     // Additional context
}

8. Data Flow Diagrams

8.1 Task Submission Flow

CLI/API                 Management Server                 Agent VM
   │                           │                             │
   │  POST /api/v1/tasks       │                             │
   │  {manifest}               │                             │
   │──────────────────────────►│                             │
   │                           │                             │
   │                    ┌──────┴──────┐                      │
   │                    │  Validate   │                      │
   │                    │  manifest   │                      │
   │                    └──────┬──────┘                      │
   │                           │                             │
   │                    ┌──────┴──────┐                      │
   │                    │  Create     │                      │
   │                    │  storage    │                      │
   │                    │  dirs       │                      │
   │                    └──────┬──────┘                      │
   │                           │                             │
   │                    ┌──────┴──────┐                      │
   │                    │  Clone repo │                      │
   │                    │  to inbox   │                      │
   │                    └──────┬──────┘                      │
   │                           │                             │
   │                    ┌──────┴──────┐                      │
   │                    │  Provision  │──────────────────────►
   │                    │  VM         │   (provision-vm.sh)  │
   │                    └──────┬──────┘                      │
   │                           │                             │
   │                           │    gRPC Connect             │
   │                           │◄────────────────────────────│
   │                           │                             │
   │                           │    CommandRequest           │
   │                           │    (__claude_task__)        │
   │                           │────────────────────────────►│
   │                           │                             │
   │                           │    OutputChunk (stream)     │
   │                           │◄────────────────────────────│
   │                           │                             │
   │                           │    CommandResult            │
   │                           │◄────────────────────────────│
   │                           │                             │
   │  {task_id, status}        │                             │
   │◄──────────────────────────│                             │

8.2 Agent Registration Flow

Agent Client                   Management Server                Dashboard
    │                                 │                             │
    │   gRPC Connect()                │                             │
    │   x-agent-id: agent-01          │                             │
    │   x-agent-secret: xxx           │                             │
    │────────────────────────────────►│                             │
    │                                 │                             │
    │                          ┌──────┴──────┐                      │
    │                          │  Verify     │                      │
    │                          │  secret     │                      │
    │                          └──────┬──────┘                      │
    │                                 │                             │
    │   AgentRegistration             │                             │
    │   {agent_id, hostname, ip, ...} │                             │
    │────────────────────────────────►│                             │
    │                                 │                             │
    │                          ┌──────┴──────┐                      │
    │                          │  Register   │                      │
    │                          │  in DashMap │                      │
    │                          └──────┬──────┘                      │
    │                                 │                             │
    │   RegistrationAck               │                             │
    │   {accepted: true}              │                             │
    │◄────────────────────────────────│                             │
    │                                 │                             │
    │   SessionQuery                  │                             │
    │   {report_all: true}            │                             │
    │◄────────────────────────────────│                             │
    │                                 │                             │
    │                                 │   WebSocket event            │
    │                                 │   "agent_registered"         │
    │                                 │────────────────────────────►│
    │                                 │                             │
    │   Heartbeat (every 5s)          │                             │
    │────────────────────────────────►│                             │
    │                                 │                             │
    │   Metrics (every 5s)            │   WebSocket update          │
    │────────────────────────────────►│────────────────────────────►│

8.3 Command Execution Flow

Dashboard            Management Server                  Agent
    │                       │                             │
    │  WS: run_command      │                             │
    │  {agent_id, command}  │                             │
    │──────────────────────►│                             │
    │                       │                             │
    │                ┌──────┴──────┐                      │
    │                │  Generate   │                      │
    │                │  command_id │                      │
    │                └──────┬──────┘                      │
    │                       │                             │
    │                ┌──────┴──────┐                      │
    │                │  Register   │                      │
    │                │  pending    │                      │
    │                │  command    │                      │
    │                └──────┬──────┘                      │
    │                       │                             │
    │                       │  CommandRequest             │
    │                       │  {command_id, command,      │
    │                       │   allocate_pty: true}       │
    │                       │────────────────────────────►│
    │                       │                             │
    │                       │                      ┌──────┴──────┐
    │                       │                      │  Fork PTY   │
    │                       │                      │  process    │
    │                       │                      └──────┬──────┘
    │                       │                             │
    │                       │  OutputChunk (stdout)       │
    │  WS: output           │◄────────────────────────────│
    │◄──────────────────────│                             │
    │                       │                             │
    │  WS: stdin            │                             │
    │──────────────────────►│  StdinChunk                 │
    │                       │────────────────────────────►│
    │                       │                             │
    │  WS: resize           │                             │
    │──────────────────────►│  PtyControl(resize)         │
    │                       │────────────────────────────►│
    │                       │                             │
    │                       │  CommandResult              │
    │  WS: result           │  {exit_code, duration}      │
    │◄──────────────────────│◄────────────────────────────│

9. AIWG Integration Architecture

9.1 Standalone vs. AIWG-Connected Operation

Agentic Sandbox works in two modes and the choice is purely operational — no code changes required.

Standalone mode (default, no extra configuration): All features — VM provisioning, task orchestration, PTY streaming, HITL detection, web dashboard — work without any external dependency. Agentic Sandbox is a complete agent runtime platform on its own.

AIWG-connected mode (set AIWG_SERVE_ENDPOINT): The management server registers with an aiwg serve instance and becomes a managed compute node in the AIWG ecosystem. Events flow outbound in real time; the aiwg serve dashboard gains visibility into all sandboxes, agents, sessions, and HITL requests across the fleet.

9.2 Role in the AIWG Ecosystem

AIWG is a multi-agent AI framework providing 188 specialized agents, workflow commands, SDLC phases, and an operator dashboard. Agentic Sandbox provides the persistent execution layer:

┌──────────────────────────────────────────────────────────────┐
│                     AIWG Operator Layer                       │
│                                                              │
│  aiwg serve dashboard    Mission Control    AIWG CLI         │
│       :7337               (mc dispatch)    (ralph, flows)    │
│         │                      │                │            │
│         │  WebSocket events    │                │            │
│         │◄─────────────────────┤                │            │
└─────────┼──────────────────────┼────────────────┼────────────┘
          │  Registration API    │  PTY adapter   │  Loadout
          │  sandbox events      │  delegation    │  manifests
          ▼                      ▼                ▼
┌──────────────────────────────────────────────────────────────┐
│                  Agentic Sandbox Layer                         │
│                                                              │
│  Management Server                                           │
│  ├── gRPC :8120        ← agent-client connections            │
│  ├── WebSocket :8121   ← terminal streaming, metrics         │
│  └── HTTP :8122        ← REST API, dashboard, HITL           │
│                                                              │
│  QEMU/KVM VMs          Docker Containers                     │
│  ├── agent-01 (Claude Code + AIWG sdlc-complete)             │
│  ├── agent-02 (Claude Code + AIWG forensics)                 │
│  └── ...                                                     │
└──────────────────────────────────────────────────────────────┘

AIWG's daemon PTY adapter can delegate sessions to agentic-sandbox when AIWG_SANDBOX_ENDPOINT is set in aiwg's configuration, routing Tier 2 terminal sessions through agentic-sandbox VMs instead of local processes.

9.3 Event Integration

The AiwgServeHandle component in the management server is the outbound integration point. It emits SandboxEvent messages to aiwg serve asynchronously:

// src/aiwg_serve.rs
pub enum SandboxEvent {
    AgentConnected    { agent_id, hostname, ip_address, loadout },
    AgentDisconnected { agent_id, reason },
    AgentReady        { agent_id },
    AgentProvisioning { agent_id, step, progress_json },
    SessionStart      { agent_id, session_id, command },
    SessionEnd        { agent_id, session_id, exit_code },
    HitlInputRequired { agent_id, session_id, hitl_id, prompt, context },
}

All events are fire-and-forget (try_send into a buffered channel). The background task serializes them to JSON and sends over the authenticated WebSocket. If aiwg serve is unreachable, events are buffered up to 256 messages then dropped — the management server is never blocked.

9.4 Loadout Manifests and AIWG Frameworks

VMs provisioned with AIWG framework loadouts arrive pre-configured with agents, commands, and skills already deployed. The management server reads aiwg_frameworks metadata from agent registrations and surfaces it in the agent list API:

# profiles/claude-only.yaml
aiwg_frameworks:
  - name: sdlc-complete
    providers: [claude]

This means agentic-sandbox VMs can participate in AIWG multi-agent workflows from the moment they connect — no manual setup required.

9.5 HITL Integration with aiwg serve

HITL requests created by agentic-sandbox's prompt detector are immediately pushed to the aiwg serve dashboard:

Agent PTY → prompt_detector → HitlStore::create() → AiwgServeHandle::emit(HitlInputRequired)
                                                              │
                                                   aiwg serve HITL drawer
                                                              │
                                              Operator responds via dashboard
                                                              │
                                         POST /api/v1/hitl/{id}/respond
                                                              │
                                          CommandDispatcher injects → PTY stdin

The aiwg serve dashboard and the agentic-sandbox REST API are both valid response channels — either can resolve a pending HITL request.


Appendix: Source File Reference

Management Server

File Lines Purpose
management/src/main.rs ~180 Server entry point
management/src/grpc.rs ~400 gRPC service
management/src/registry.rs ~290 Agent registry
management/src/dispatch.rs ~500 Command dispatcher
management/src/orchestrator/mod.rs ~380 Orchestrator main
management/src/orchestrator/task.rs ~340 Task state machine
management/src/orchestrator/storage.rs ~270 Storage operations
management/src/telemetry/metrics.rs ~590 Prometheus metrics
management/src/crash_loop.rs ~560 Crash detection

Agent Client

File Lines Purpose
agent-rs/src/main.rs ~1700 Agent entry point
agent-rs/src/health.rs ~550 Health monitoring
agent-rs/src/metrics.rs ~135 Metrics export
agent-rs/src/claude.rs ~475 Claude runner

Protocol

File Lines Purpose
proto/agent.proto ~510 gRPC protocol definition

Document History

Version Date Changes
1.0 2025-02-07 Initial comprehensive documentation