Version: 1.0 Date: 2025-02-07 Status: Complete
This document describes the architecture of the Agentic Sandbox system, which provides runtime isolation for persistent, unrestricted AI agent processes.
- System Overview
- Management Server Architecture
- Agent Client Architecture
- Task Orchestration
- Communication Protocols
- Security Architecture
- Observability
- Data Flow Diagrams
- AIWG Integration Architecture
The Agentic Sandbox provides secure, isolated runtime environments for AI agents (primarily Claude Code). Each agent runs inside a dedicated QEMU/KVM virtual machine with:
- Full hardware virtualization isolation from the host
- Persistent storage for workspaces and outputs
- Bidirectional real-time communication with the management server
- Enforced resource limits (CPU, memory, disk)
- Network isolation options (isolated, outbound-only, full)
+----------------------------------------------------------------------------+
|                                Host System                                 |
|                                                                            |
|  +----------------------------------------------------------------------+  |
|  |                    Management Server (Rust/Tokio)                    |  |
|  |                                                                      |  |
|  |  +----------+  +------------+  +----------+  +------------------+    |  |
|  |  |   gRPC   |  | WebSocket  |  |   HTTP   |  |   Orchestrator   |    |  |
|  |  |  :8120   |  |   :8121    |  |  :8122   |  |                  |    |  |
|  |  +----+-----+  +-----+------+  +----+-----+  +--------+---------+    |  |
|  |       |              |              |                 |              |  |
|  |       +------+-------+------+-------+--------+--------+              |  |
|  |              |              |                |                       |  |
|  |   +----------+--+ +---------+---+  +---------+-------+               |  |
|  |   |    Agent    | |   Command   |  |    Telemetry    |               |  |
|  |   |  Registry   | | Dispatcher  |  | (Logs/Metrics)  |               |  |
|  |   +-------------+ +-------------+  +-----------------+               |  |
|  +----------------------------------------------------------------------+  |
|                                      |                                     |
|                                      |  gRPC Bidirectional Streaming       |
|                                      |                                     |
|  +----------------------------------------------------------------------+  |
|  |                      QEMU/KVM Virtual Machines                       |  |
|  |                                                                      |  |
|  |  +----------------+  +----------------+  +----------------+          |  |
|  |  |   Agent VM 1   |  |   Agent VM 2   |  |   Agent VM N   |  ...     |  |
|  |  |   (agent-01)   |  |   (agent-02)   |  |   (agent-NN)   |          |  |
|  |  |                |  |                |  |                |          |  |
|  |  | +------------+ |  | +------------+ |  | +------------+ |          |  |
|  |  | |   Agent    | |  | |   Agent    | |  | |   Agent    | |          |  |
|  |  | |   Client   | |  | |   Client   | |  | |   Client   | |          |  |
|  |  | |   (Rust)   | |  | |   (Rust)   | |  | |   (Rust)   | |          |  |
|  |  | +------------+ |  | +------------+ |  | +------------+ |          |  |
|  |  +----------------+  +----------------+  +----------------+          |  |
|  +----------------------------------------------------------------------+  |
|                                      |                                     |
|  +----------------------------------------------------------------------+  |
|  |                        Agentshare (virtiofs)                         |  |
|  |  +--------------------+  +--------------------------------------+    |  |
|  |  |  /global (RO)      |  |  /inbox/<agent-id> (RW)              |    |  |
|  |  |  Shared resources  |  |  Per-agent workspaces and outputs    |    |  |
|  |  +--------------------+  +--------------------------------------+    |  |
|  +----------------------------------------------------------------------+  |
+----------------------------------------------------------------------------+
| Component | Technology | Purpose |
|---|---|---|
| Management Server | Rust (Tokio, Tonic, Axum) | Central control plane for agent orchestration |
| Agent Client | Rust | In-VM process connecting to management server |
| VM Runtime | QEMU/KVM via libvirt | Hardware virtualization for isolation |
| Shared Storage | virtiofs | Host-guest filesystem sharing |
| Provisioning | Bash + cloud-init | VM creation and configuration |
| Port | Protocol | Purpose |
|---|---|---|
| 8120 | gRPC | Agent bidirectional streaming |
| 8121 | WebSocket | Real-time output streaming to clients |
| 8122 | HTTP | Dashboard, REST API, metrics |
The management server (management/src/main.rs) is the central control plane, built on Tokio's async runtime.
management/src/
├── main.rs # Entry point, server bootstrap
├── config.rs # Configuration (env-based)
├── grpc.rs # gRPC service implementation
├── registry.rs # Agent registry (DashMap)
├── auth.rs # Secret store, token verification
├── dispatch.rs # Command dispatcher
├── output/ # Output aggregation
├── ws/ # WebSocket hub
├── http/ # HTTP server (Axum)
│ ├── mod.rs
│ ├── server.rs
│ ├── health.rs
│ ├── vms.rs
│ ├── tasks.rs
│ ├── events.rs
│ └── ...
├── heartbeat.rs # Stale connection detection
├── libvirt_events.rs # VM lifecycle events
├── crash_loop.rs # Crash loop detection
├── orchestrator/ # Task orchestration (22 modules)
└── telemetry/ # Logging, metrics, tracing
File: management/src/registry.rs
The AgentRegistry tracks all connected agents using a concurrent DashMap (a sharded map with per-shard locking, so most operations do not contend on a single global lock):
pub struct AgentRegistry {
    agents: DashMap<String, ConnectedAgent>,
}

pub struct ConnectedAgent {
    pub agent_id: String,
    pub registration: AgentRegistration,
    pub status: AgentStatus,
    pub connected_at: DateTime<Utc>,
    pub last_heartbeat: DateTime<Utc>,
    pub command_tx: mpsc::Sender<ManagementMessage>,
    pub metrics: Option<AgentMetrics>,
}

Key Operations:
- register() - Add new agent connection
- unregister() - Remove disconnected agent
- heartbeat() - Update last-seen timestamp
- send_command() - Route command to specific agent
- mark_stale() / mark_disconnected() - Connection health tracking
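The registry operations can be illustrated with a minimal, std-only sketch. It is not the server's implementation: a `Mutex<HashMap>` stands in for DashMap, `Instant` stands in for `DateTime<Utc>`, and the `SketchRegistry`, `Entry`, and `Status` names, the `status()` helper, and the timeout parameter are all assumptions made for the example.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Connection health tracked by this sketch (hypothetical subset of AgentStatus).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ready,
    Stale,
}

struct Entry {
    status: Status,
    last_heartbeat: Instant,
}

/// Stand-in registry: Mutex<HashMap> instead of the DashMap the document describes.
#[derive(Default)]
pub struct SketchRegistry {
    agents: Mutex<HashMap<String, Entry>>,
}

impl SketchRegistry {
    /// Add (or replace) an agent connection.
    pub fn register(&self, agent_id: &str, now: Instant) {
        let entry = Entry { status: Status::Ready, last_heartbeat: now };
        self.agents.lock().unwrap().insert(agent_id.to_string(), entry);
    }

    /// Remove a disconnected agent; returns whether it was present.
    pub fn unregister(&self, agent_id: &str) -> bool {
        self.agents.lock().unwrap().remove(agent_id).is_some()
    }

    /// Update the last-seen timestamp; returns false for unknown agents.
    pub fn heartbeat(&self, agent_id: &str, now: Instant) -> bool {
        match self.agents.lock().unwrap().get_mut(agent_id) {
            Some(entry) => {
                entry.last_heartbeat = now;
                entry.status = Status::Ready;
                true
            }
            None => false,
        }
    }

    /// Mark every agent whose last heartbeat is older than `timeout` as stale
    /// and return the affected agent IDs.
    pub fn mark_stale(&self, now: Instant, timeout: Duration) -> Vec<String> {
        let mut stale = Vec::new();
        for (id, entry) in self.agents.lock().unwrap().iter_mut() {
            if now.duration_since(entry.last_heartbeat) > timeout {
                entry.status = Status::Stale;
                stale.push(id.clone());
            }
        }
        stale
    }

    /// Current status of an agent, if registered.
    pub fn status(&self, agent_id: &str) -> Option<Status> {
        self.agents.lock().unwrap().get(agent_id).map(|e| e.status)
    }
}
```

Passing `now` explicitly keeps the sketch deterministic under test; a real registry would more likely read the clock internally.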
File: management/src/dispatch.rs
Routes commands to agents and tracks pending executions:
- Generates unique command IDs (UUIDs)
- Maintains pending command map for result correlation
- Handles session reconciliation after agent reconnection
- Supports three session types:
- Interactive - PTY-based terminal sessions
- Headless - Non-interactive command execution
- Background - Long-running daemon processes
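A small sketch of the pending-command map used for result correlation follows. It assumes a simple counter in place of UUIDs so that it needs no external crate, and the `PendingCommands`, `Pending`, and `SessionType` names are hypothetical rather than taken from dispatch.rs.

```rust
use std::collections::HashMap;

/// The three session types listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionType {
    Interactive,
    Headless,
    Background,
}

/// Bookkeeping kept for one in-flight command (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Pending {
    pub agent_id: String,
    pub session: SessionType,
}

#[derive(Default)]
pub struct PendingCommands {
    next_id: u64,
    pending: HashMap<String, Pending>,
}

impl PendingCommands {
    /// Register a dispatched command and return its ID.
    /// Assumption: a counter-based ID stands in for the UUIDs the dispatcher generates.
    pub fn dispatch(&mut self, agent_id: &str, session: SessionType) -> String {
        self.next_id += 1;
        let id = format!("cmd-{}", self.next_id);
        let entry = Pending { agent_id: agent_id.to_string(), session };
        self.pending.insert(id.clone(), entry);
        id
    }

    /// Correlate a result with its pending entry.
    /// Returns None for unknown or already-completed command IDs.
    pub fn complete(&mut self, command_id: &str) -> Option<Pending> {
        self.pending.remove(command_id)
    }

    /// Number of commands still awaiting a result.
    pub fn in_flight(&self) -> usize {
        self.pending.len()
    }
}
```

Removing the entry on completion means a duplicate or late result for the same ID is simply ignored by the caller.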
File: management/src/output/aggregator.rs
Buffers and broadcasts output streams:
- Per-agent circular buffers for stdout/stderr/log
- Subscription system for WebSocket clients
- Handles backpressure when clients are slow
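The per-agent circular buffer can be sketched with a `VecDeque` that evicts its oldest chunk once full. The `OutputRing` name, the chunk type (`String`), and the eviction counter are assumptions for illustration; the actual aggregator may store richer chunk metadata.

```rust
use std::collections::VecDeque;

/// Fixed-capacity ring buffer that keeps only the most recent output chunks.
pub struct OutputRing {
    capacity: usize,
    chunks: VecDeque<String>,
    dropped: u64,
}

impl OutputRing {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, chunks: VecDeque::with_capacity(capacity), dropped: 0 }
    }

    /// Append a chunk, evicting the oldest one when the buffer is full.
    pub fn push(&mut self, chunk: impl Into<String>) {
        if self.chunks.len() == self.capacity {
            self.chunks.pop_front();
            self.dropped += 1;
        }
        self.chunks.push_back(chunk.into());
    }

    /// Snapshot for a newly subscribed client, oldest chunk first.
    pub fn replay(&self) -> Vec<String> {
        self.chunks.iter().cloned().collect()
    }

    /// Number of chunks evicted so far.
    pub fn dropped(&self) -> u64 {
        self.dropped
    }
}
```

A bounded buffer like this is one way to keep memory flat when a subscriber falls behind: old output is discarded rather than queued without limit.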
Directory: management/src/orchestrator/
The orchestrator manages complete task lifecycles for Claude Code execution:
| Module | Purpose |
|---|---|
| mod.rs | Central Orchestrator struct |
| task.rs | Task state machine (10 states) |
| manifest.rs | Task manifest parsing |
| storage.rs | Filesystem operations (inbox/outbox) |
| checkpoint.rs | State persistence for recovery |
| executor.rs | Task execution logic |
| monitor.rs | Real-time output monitoring |
| collector.rs | Artifact collection |
| artifacts.rs | Streaming artifact uploads |
| secrets.rs | Secret resolution (env/file/Vault) |
| timeouts.rs | Timeout enforcement |
| retry.rs | Exponential backoff retry |
| degradation.rs | Graceful degradation |
| hang_detection.rs | Stuck task detection |
| reconciliation.rs | State reconciliation |
| cleanup.rs | Resource cleanup |
| multi_agent.rs | Parent-child task tracking |
| slo.rs | SLO/SLI tracking |
| circuit_breaker.rs | Failure isolation |
| vm_pool.rs | VM pool management |
| audit.rs | Audit logging |
Directory: management/src/telemetry/
| Module | Purpose |
|---|---|
| mod.rs | Telemetry initialization |
| logging.rs | Structured logging (pretty/JSON/compact) |
| metrics.rs | Prometheus metrics |
| trace_id.rs | Distributed trace ID propagation |
| otel.rs | OpenTelemetry export (optional) |
Logging Configuration (Environment Variables):
| Variable | Values | Default |
|---|---|---|
| LOG_LEVEL | trace, debug, info, warn, error | info |
| LOG_FORMAT | pretty, json, compact | pretty |
| LOG_FILE | Path to log file | (none) |
| LOG_FILE_ROTATION | hourly, daily, never | daily |
File: management/src/crash_loop.rs
Monitors VM lifecycle events for crash patterns:
pub struct CrashLoopConfig {
    pub max_restarts: u32,          // 5 restarts triggers loop
    pub window_minutes: u32,        // 10 minute window
    pub min_uptime_seconds: u32,    // 60s to count as healthy
    pub remediation_enabled: bool,  // Auto-rebuild on crash loop
    pub max_rebuild_attempts: u32,  // 3 attempts before giving up
}

VM States:
- Healthy - Normal operation
- Starting - VM booting
- Recovering - Recovering from crash
- CrashLoop - Repeated crashes detected
- Rebuilding - Auto-remediation in progress
- Failed - Max rebuilds exhausted
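The crash-loop thresholds lend themselves to a sliding-window counter. The sketch below is one plausible reading of the configuration, not the logic in crash_loop.rs: it assumes that a run lasting at least `min_uptime` resets the restart history, and the `CrashLoopDetector` type and `record_exit` method are hypothetical names.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Sliding-window restart counter driven by the documented thresholds.
pub struct CrashLoopDetector {
    max_restarts: usize,
    window: Duration,
    min_uptime: Duration,
    restarts: VecDeque<Instant>,
}

impl CrashLoopDetector {
    pub fn new(max_restarts: usize, window: Duration, min_uptime: Duration) -> Self {
        Self { max_restarts, window, min_uptime, restarts: VecDeque::new() }
    }

    /// Record a VM exit at `now` after it ran for `uptime`.
    /// Returns true when the number of short-lived runs inside the window
    /// reaches `max_restarts`.
    pub fn record_exit(&mut self, now: Instant, uptime: Duration) -> bool {
        // Assumption: a run that lasted long enough counts as healthy
        // and clears the restart history.
        if uptime >= self.min_uptime {
            self.restarts.clear();
            return false;
        }
        self.restarts.push_back(now);
        // Drop restarts that have aged out of the window.
        while let Some(&oldest) = self.restarts.front() {
            if now.duration_since(oldest) > self.window {
                self.restarts.pop_front();
            } else {
                break;
            }
        }
        self.restarts.len() >= self.max_restarts
    }
}
```

With the documented defaults this would be constructed as `CrashLoopDetector::new(5, Duration::from_secs(600), Duration::from_secs(60))`.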
The agent client (agent-rs/src/main.rs) runs inside each VM, connecting to the management server.
agent-rs/src/
├── main.rs # Entry point, connection loop
├── health.rs # Health state machine
├── metrics.rs # Prometheus metrics export
├── claude.rs # Claude Code task runner
└── lib.rs # Library exports
         ┌─────────────────────┐
         │       Boot VM       │
         └──────────┬──────────┘
                    │
         ┌──────────▼──────────┐
         │ Load Configuration  │
         │   (/etc/agentic-    │
         │  sandbox/agent.env) │
         └──────────┬──────────┘
                    │
    ┌───────────────▼───────────────┐
    │ Connect to Management Server  │◄────┐
    │     (host.internal:8120)      │     │
    └───────────────┬───────────────┘     │
                    │                     │
         ┌──────────▼──────────┐          │
         │  Send Registration  │          │
         │ (AgentRegistration) │          │
         └──────────┬──────────┘          │
                    │                     │
         ┌──────────▼──────────┐          │
         │    Receive Ack +    │          │
         │    Session Query    │          │
         └──────────┬──────────┘          │
                    │                     │
┌───────────────────▼───────────────────┐ │
│           Main Stream Loop            │ │
│  ┌─────────────────────────────────┐  │ │
│  │ Receive Commands                │  │ │
│  │ Send Heartbeats (5s interval)   │  │ │
│  │ Stream stdout/stderr            │  │ │
│  │ Report Command Results          │  │ │
│  └─────────────────────────────────┘  │ │
└───────────────────┬───────────────────┘ │
                    │                     │
         ┌──────────▼──────────┐          │
         │     Disconnect      │          │
         │ (cleanup sessions)  │          │
         └──────────┬──────────┘          │
                    │                     │
         ┌──────────▼──────────┐          │
         │  Reconnect Backoff  ├──────────┘
         │   (5s → 60s max)    │
         └─────────────────────┘
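The reconnect backoff at the bottom of the lifecycle (5s → 60s max) could be computed as in the sketch below. The document gives only the two bounds, so the doubling factor is an assumption, as is the `reconnect_delay` function name.

```rust
use std::time::Duration;

/// Delay before reconnect attempt `attempt` (0-based).
/// Assumption: the delay doubles on each attempt, starting at `initial`
/// and never exceeding `max`.
pub fn reconnect_delay(attempt: u32, initial: Duration, max: Duration) -> Duration {
    // Cap the exponent so the shift cannot overflow a u32.
    let factor = 1u32 << attempt.min(16);
    initial.saturating_mul(factor).min(max)
}
```

Under these assumptions, and with the documented bounds, successive attempts would wait 5s, 10s, 20s, 40s, and then 60s for every later attempt.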
The agent supports three execution modes:
async fn execute_command(cmd: CommandRequest, ...) {
    let mut process = Command::new(&cmd.command)
        .args(&cmd.args)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;
    // Stream stdout/stderr back via gRPC
    // Forward stdin from management server
    // Report CommandResult on completion
}

async fn execute_command_pty(cmd: CommandRequest, ...) {
    let pty = openpty(None, None)?;
    // Fork child process
    match unsafe { unistd::fork() } {
        Ok(ForkResult::Child) => {
            // Set up PTY as controlling terminal
            // Exec command via bash
        }
        Ok(ForkResult::Parent { child }) => {
            // Forward stdin to PTY master
            // Read PTY output, stream to gRPC
            // Handle resize (SIGWINCH)
            // Handle signals (SIGTERM, SIGINT)
        }
        // fork() itself failed: propagate the error to the caller
        Err(e) => return Err(e.into()),
    }
}

async fn execute_claude_task(cmd: CommandRequest, ...) {
    let config: ClaudeTaskConfig = serde_json::from_str(&cmd.args[0])?;
    let runner = ClaudeRunner::new(config);
    let exit_code = runner.run(output_tx).await?;
    // Output streamed via gRPC OutputChunk messages
}

File: agent-rs/src/health.rs
Three-state health model:
| State | Description | Behavior |
|---|---|---|
| Healthy | Normal operation | Accept new tasks |
| Degraded | Limited capacity | Finish existing, reject new |
| Unhealthy | Recovery mode | Diagnostic only |
Health Transitions:
Healthy ──(3 consecutive failures)──► Degraded
Degraded ──(5 consecutive failures)──► Unhealthy
Unhealthy ──(3 consecutive successes)──► Healthy
Triggers:
- Connection failures
- Memory usage > 85% (degraded) / > 95% (unhealthy)
- Frequent restarts (> 3)
- Circuit breaker trips
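The three transitions above can be modelled as a small state machine over consecutive probe results. This sketch encodes only the documented edges. It assumes that streak counters restart after each transition and that a single success clears the failure streak; neither rule is stated in the source, and the `HealthTracker` type is a hypothetical name.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Healthy,
    Degraded,
    Unhealthy,
}

pub struct HealthTracker {
    state: Health,
    failures: u32,  // consecutive failures
    successes: u32, // consecutive successes
}

impl HealthTracker {
    pub fn new() -> Self {
        Self { state: Health::Healthy, failures: 0, successes: 0 }
    }

    pub fn state(&self) -> Health {
        self.state
    }

    /// Only Healthy agents accept new tasks (see the table above).
    pub fn accepts_new_tasks(&self) -> bool {
        self.state == Health::Healthy
    }

    /// Feed one probe result and apply the documented transitions.
    pub fn observe(&mut self, ok: bool) -> Health {
        if ok {
            self.successes += 1;
            self.failures = 0;
        } else {
            self.failures += 1;
            self.successes = 0;
        }
        let next = match self.state {
            Health::Healthy if self.failures >= 3 => Health::Degraded,
            Health::Degraded if self.failures >= 5 => Health::Unhealthy,
            Health::Unhealthy if self.successes >= 3 => Health::Healthy,
            current => current,
        };
        if next != self.state {
            // Assumption: streak counters restart after every transition.
            self.state = next;
            self.failures = 0;
            self.successes = 0;
        }
        self.state
    }
}
```

Because the source lists no Degraded-to-Healthy edge, the sketch leaves a Degraded agent in place on success rather than inventing a recovery rule.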
The agent integrates with systemd's watchdog:
pub struct SystemdWatchdog {
    enabled: bool,
    interval: Duration, // Half of WATCHDOG_USEC
    health: Arc<HealthMonitor>,
}

impl SystemdWatchdog {
    pub fn notify_ready(&self) -> Result<(), String>;
    pub fn ping(&self) -> Result<(), String>;
    pub async fn run_ping_loop(self: Arc<Self>);
}

The management server also integrates with systemd. Its packaged
agentic-mgmt.service uses Type=notify, WatchdogSec=30, KillMode=mixed,
and LimitNOFILE=1048576. agentic-mgmt sends READY=1 after the gRPC
listener has bound and the HTTP/WebSocket startup tasks have been launched,
then sends WATCHDOG=1 at half of the watchdog interval reported by systemd.
This complements the in-process HTTP self-watchdog: systemd observes runtime
scheduler stalls and restarts the process, while the HTTP self-watchdog exits
when the HTTP task wedges but the runtime still schedules other tasks.
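Both watchdog integrations ping at half of the interval that systemd reports through the WATCHDOG_USEC environment variable (a value in microseconds). A minimal sketch of that calculation follows; the `watchdog_ping_interval` function name is hypothetical, and treating a missing, malformed, or zero value as "watchdog disabled" is an assumption.

```rust
use std::time::Duration;

/// Parse a WATCHDOG_USEC value and return the ping interval (half the timeout).
/// Returns None when the watchdog is not configured or the value is unusable.
pub fn watchdog_ping_interval(watchdog_usec: Option<&str>) -> Option<Duration> {
    let usec: u64 = watchdog_usec?.trim().parse().ok()?;
    if usec == 0 {
        return None;
    }
    Some(Duration::from_micros(usec / 2))
}
```

A caller would typically pass `std::env::var("WATCHDOG_USEC").ok().as_deref()`. With WatchdogSec=30, systemd sets the variable to 30000000, which yields a 15-second ping interval.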
File: management/src/orchestrator/task.rs
┌─────────────────────────────────────────────────────────────┐
│ │
┌────────▼────────┐ │
│ Pending │ │
└────────┬────────┘ │
│ │
┌────────▼────────┐ ┌──────────────────────────────────────────┐ │
│ Staging │────►│ │ │
│ (clone repo) │ │ │ │
└────────┬────────┘ │ │ │
│ │ │ │
┌────────▼────────┐ │ Failed │ │
│ Provisioning │────►│ (destroy VM) │ │
│ (create VM) │ │ │ │
└────────┬────────┘ │ OR │ │
│ │ │ │
┌────────▼────────┐ │ FailedPreserved │ │
│ Ready │────►│ (preserve for debug) │ │
│ │ │ │ │
└────────┬────────┘ └──────────────────────────────────────────┘ │
│ │
┌────────▼────────┐ │
│ Running │───────────────────────────────────────────────────►│
│ (Claude Code) │ Cancelled │
└────────┬────────┘ │
│ │
┌────────▼────────┐ │
│ Completing │ │
│ (collect arts) │ │
└────────┬────────┘ │
│ │
┌────────▼────────┐ │
│ Completed │ │
└─────────────────┘ │
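The ten task states and the edges visible in the diagram can be expressed as a transition check. This is a sketch rather than the contents of task.rs: the method names are hypothetical, and the edge from Running to the failure states is an assumption that the diagram does not draw.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskState {
    Pending,
    Staging,
    Provisioning,
    Ready,
    Running,
    Completing,
    Completed,
    Failed,
    FailedPreserved,
    Cancelled,
}

impl TaskState {
    /// Terminal states have no outgoing transitions.
    pub fn is_terminal(self) -> bool {
        matches!(
            self,
            TaskState::Completed
                | TaskState::Failed
                | TaskState::FailedPreserved
                | TaskState::Cancelled
        )
    }

    /// Whether moving from `self` to `next` is allowed.
    pub fn can_transition_to(self, next: TaskState) -> bool {
        use TaskState::*;
        match (self, next) {
            // Happy path, top to bottom in the diagram.
            (Pending, Staging)
            | (Staging, Provisioning)
            | (Provisioning, Ready)
            | (Ready, Running)
            | (Running, Completing)
            | (Completing, Completed) => true,
            // Failure edges drawn from the setup states.
            (Staging | Provisioning | Ready, Failed | FailedPreserved) => true,
            // Cancellation edge drawn from Running.
            (Running, Cancelled) => true,
            // Assumption: a running task can also fail (not drawn in the diagram).
            (Running, Failed | FailedPreserved) => true,
            _ => false,
        }
    }
}
```

Centralizing the rule in one function makes it easy to reject illegal jumps such as Pending directly to Running.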
File: management/src/orchestrator/storage.rs
Each task has a dedicated directory structure:
/srv/agentshare/tasks/{task-id}/
├── manifest.yaml # Original task manifest
├── inbox/ # Inputs for agent
│ └── TASK.md # Task instructions
└── outbox/ # Outputs from agent
├── metadata.json # Task metadata
├── progress/
│ ├── stdout.log # Claude stdout
│ ├── stderr.log # Claude stderr
│ └── events.jsonl # Structured events
└── artifacts/ # Collected output files
└── ...
Agentshare Mounts (virtiofs):
| Host Path | VM Mount | Access |
|---|---|---|
| /srv/agentshare/global-ro | /mnt/global | Read-only |
| /srv/agentshare/inbox/<agent-id> | /mnt/inbox | Read-write |
File: management/src/orchestrator/hang_detection.rs
Multiple detection strategies:
| Strategy | Threshold | Description |
|---|---|---|
| OutputSilence | 10 minutes | No stdout/stderr output |
| CpuIdle | 15 minutes | CPU < 5% |
| ProcessStuck | 20 minutes | No progress indicators |

Recovery Actions:
- NotifyOnly - Alert but don't intervene
- Terminate - Kill task, cleanup VM
- Restart - Restore from checkpoint
- PreserveForDebug - Keep VM for investigation
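The three detection strategies amount to comparing elapsed durations against fixed thresholds. The following sketch shows one way to evaluate them together. The `TaskObservation` struct and its field names are hypothetical, and how the real detector measures CPU idleness or progress is not specified here.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HangSignal {
    OutputSilence,
    CpuIdle,
    ProcessStuck,
}

/// What the monitor knows about one running task (hypothetical shape).
pub struct TaskObservation {
    pub since_last_output: Duration,
    pub cpu_below_5_percent_for: Duration,
    pub since_last_progress: Duration,
}

const MINUTE: u64 = 60;

/// Return every strategy whose documented threshold has been reached.
pub fn detect_hang(obs: &TaskObservation) -> Vec<HangSignal> {
    let checks = [
        (obs.since_last_output, 10 * MINUTE, HangSignal::OutputSilence),
        (obs.cpu_below_5_percent_for, 15 * MINUTE, HangSignal::CpuIdle),
        (obs.since_last_progress, 20 * MINUTE, HangSignal::ProcessStuck),
    ];
    checks
        .iter()
        .filter(|(elapsed, limit_secs, _)| *elapsed >= Duration::from_secs(*limit_secs))
        .map(|(_, _, signal)| *signal)
        .collect()
}
```

Returning all matching signals, rather than the first, lets the caller choose a recovery action based on how many strategies agree.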
File: management/src/orchestrator/retry.rs
pub struct RetryPolicy {
    pub max_attempts: u32,
    pub initial_delay: Duration,
    pub max_delay: Duration,
    pub multiplier: f64,  // Exponential backoff
    pub jitter: bool,     // ±15% randomization
}

// Predefined policies
RetryPolicy::GIT_CLONE      // 3 attempts, 5s→60s
RetryPolicy::VM_PROVISION   // 2 attempts, 10s→30s
RetryPolicy::SSH_CONNECT    // 5 attempts, 2s→30s

File: management/src/orchestrator/secrets.rs
Secrets can be resolved from multiple sources:
| Source | Format | Example |
|---|---|---|
| env | Environment variable | ANTHROPIC_API_KEY |
| file | File path | /run/secrets/api-key |
| vault | HashiCorp Vault | myapp/db:password |
Vault Configuration:
export VAULT_ADDR="https://vault.example.com:8200"
export VAULT_TOKEN="s.xxxxx"
export VAULT_MOUNT="secret"  # Default KV v2 mount

File: proto/agent.proto
The primary communication channel uses gRPC bidirectional streaming:
service AgentService {
    // Bidirectional stream for agent-management communication
    rpc Connect(stream AgentMessage) returns (stream ManagementMessage);
    // One-shot command execution with streaming output
    rpc Exec(ExecRequest) returns (stream ExecOutput);
}

| Message | Purpose |
|---|---|
| AgentRegistration | Initial connection with system info |
| Heartbeat | Liveness signal with basic metrics |
| OutputChunk | Stdout/stderr/log streaming |
| CommandResult | Command completion notification |
| Metrics | Full metrics snapshot |
| SessionReport | Active sessions for reconciliation |
| SessionReconcileAck | Confirmation of session cleanup |
| Message | Purpose |
|---|---|
| RegistrationAck | Accept/reject connection |
| CommandRequest | Execute command |
| StdinChunk | Input for running command |
| PtyControl | Resize terminal, send signal |
| ConfigUpdate | Runtime configuration |
| ShutdownSignal | Graceful shutdown |
| SessionQuery | Request active sessions |
| SessionReconcile | Instruct session cleanup |
On agent reconnection, stale sessions are cleaned up:
Management                        Agent
│                                     │
│  SessionQuery(report_all=true)      │
│────────────────────────────────────►│
│                                     │
│  SessionReport(sessions=[...])      │
│◄────────────────────────────────────│
│                                     │
│  SessionReconcile(                  │
│    keep=[known_ids],                │
│    kill=[orphan_ids])               │
│────────────────────────────────────►│
│                                     │
│  SessionReconcileAck(               │
│    killed=[...],                    │
│    kept=[...])                      │
│◄────────────────────────────────────│
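The keep/kill decision in the exchange above reduces to partitioning the reported session IDs by whether the management server still tracks them. A minimal sketch, assuming session IDs are plain strings and using a hypothetical `plan_reconcile` helper:

```rust
use std::collections::HashSet;

/// Split the sessions an agent reports into those the server still tracks
/// (keep) and orphans to terminate (kill). Order follows the report.
pub fn plan_reconcile(
    reported: &[String],
    known: &HashSet<String>,
) -> (Vec<String>, Vec<String>) {
    reported.iter().cloned().partition(|id| known.contains(id))
}
```

The first element of the returned pair corresponds to the `keep` list of SessionReconcile and the second to its `kill` list.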
For interactive sessions:
message PtyControl {
    string command_id = 1;
    oneof action {
        PtyResize resize = 2;  // Terminal resize (SIGWINCH)
        PtySignal signal = 3;  // Send signal (SIGINT, SIGTERM)
    }
}

message PtyResize {
    uint32 cols = 1;
    uint32 rows = 2;
}

Each agent runs in a fully isolated QEMU/KVM virtual machine:
- Hardware Virtualization: Full x86 virtualization, hardware-accelerated via VT-x/AMD-V
- Memory Isolation: Separate address space, no shared memory with host
- Disk Isolation: Per-VM qcow2 images with optional COW
- Process Isolation: Agent cannot see host processes
Three network modes:
| Mode | Description | Use Case |
|---|---|---|
| Isolated | No network access | Sandboxed tasks |
| Outbound | Egress only with allowlist | API access |
| Full | Full network access | Development VMs |
Implementation:
- libvirt network with NAT
- Optional DNS-based filtering (Blocky)
- Per-VM firewall rules via iptables
Host Side:
~/.config/agentic-sandbox/agent-tokens
├── agent-01.hash # SHA256 hash of secret
├── agent-02.hash
└── ...
VM Side:
/etc/agentic-sandbox/agent.env # Plaintext secret (mode 600)
Authentication Flow:
Agent                  Management Server
│                                      │
│  gRPC Connect with headers:          │
│  x-agent-id: agent-01                │
│  x-agent-secret: <plaintext>         │
│─────────────────────────────────────►│
│                                      │
│                    ┌─────────────────┤
│                    │ SHA256(secret)  │
│                    │ == stored hash? │
│                    └─────────────────┤
│                                      │
│  RegistrationAck(accepted=true)      │
│◄─────────────────────────────────────│
Enforced via libvirt domain XML:
<domain>
  <vcpu>4</vcpu>
  <memory unit='GiB'>8</memory>
  <blkiotune>
    <device>
      <path>/var/lib/libvirt/images/agent-01.qcow2</path>
      <read_bytes_sec>100000000</read_bytes_sec>
      <write_bytes_sec>50000000</write_bytes_sec>
    </device>
  </blkiotune>
</domain>

Additional Limits:
- Disk quotas via ext4 project quotas
- Network bandwidth via tc
- Process limits via cgroups
Endpoint: http://localhost:8122/metrics
Server Metrics:
# Agent metrics
agentic_agents_connected # Current connections
agentic_agents_by_status{status="ready"} # By status
agentic_agent_sessions_active{agent_id} # Active sessions
# Command metrics
agentic_commands_total # Total dispatched
agentic_commands_by_result{result="success"}
agentic_command_latency_seconds_bucket{le="1"} # Histogram
# Task metrics
agentic_tasks_total
agentic_tasks_by_state{state="running"}
agentic_task_outcomes_total{outcome="success"}
# WebSocket metrics
agentic_ws_connections_current
agentic_ws_connections_total
Agent Metrics (in-VM):
agentic_agent_health_state{agent_id,state}
agentic_agent_restarts_total{agent_id}
agentic_agent_watchdog_pings_total{agent_id}
agentic_agent_circuit_breaker_trips{agent_id}
agentic_agent_uptime_seconds{agent_id}
Formats:
- pretty - Human-readable with colors (default)
- json - Machine-parseable JSON lines
- compact - Single-line minimal
Log Fields:
{
  "timestamp": "2025-02-07T12:00:00Z",
  "level": "INFO",
  "target": "agentic_management::grpc",
  "message": "Agent registered",
  "agent_id": "agent-01",
  "ip_address": "192.168.122.201",
  "trace_id": "abc123"
}

Every request carries a trace ID for correlation:
// Extract from incoming request
let trace_id = extract_trace_id(&request).unwrap_or_else(generate_trace_id);
// Include in all log messages
info!(trace_id = %trace_id, "Processing request");
// Pass to downstream services (metadata values are parsed from the string form)
request.metadata_mut().insert("x-trace-id", trace_id.parse()?);

File: management/src/orchestrator/audit.rs
All significant operations are audit-logged:
pub enum AuditEventType {
    TaskSubmitted,
    TaskCompleted,
    TaskFailed,
    TaskCancelled,
    VmProvisioned,
    VmDestroyed,
    SecretAccessed,
    SessionReconciled,
}

pub struct AuditEvent {
    pub timestamp: DateTime<Utc>,
    pub event_type: AuditEventType,
    pub actor: String,     // User or system
    pub resource: String,  // Task ID, VM name
    pub outcome: Outcome,  // Success, Failure
    pub details: Value,    // Additional context
}

CLI/API Management Server Agent VM
│ │ │
│ POST /api/v1/tasks │ │
│ {manifest} │ │
│──────────────────────────►│ │
│ │ │
│ ┌──────┴──────┐ │
│ │ Validate │ │
│ │ manifest │ │
│ └──────┬──────┘ │
│ │ │
│ ┌──────┴──────┐ │
│ │ Create │ │
│ │ storage │ │
│ │ dirs │ │
│ └──────┬──────┘ │
│ │ │
│ ┌──────┴──────┐ │
│ │ Clone repo │ │
│ │ to inbox │ │
│ └──────┬──────┘ │
│ │ │
│ ┌──────┴──────┐ │
│ │ Provision │──────────────────────►
│ │ VM │ (provision-vm.sh) │
│ └──────┬──────┘ │
│ │ │
│ │ gRPC Connect │
│ │◄────────────────────────────│
│ │ │
│ │ CommandRequest │
│ │ (__claude_task__) │
│ │────────────────────────────►│
│ │ │
│ │ OutputChunk (stream) │
│ │◄────────────────────────────│
│ │ │
│ │ CommandResult │
│ │◄────────────────────────────│
│ │ │
│ {task_id, status} │ │
│◄──────────────────────────│ │
Agent Client Management Server Dashboard
│ │ │
│ gRPC Connect() │ │
│ x-agent-id: agent-01 │ │
│ x-agent-secret: xxx │ │
│────────────────────────────────►│ │
│ │ │
│ ┌──────┴──────┐ │
│ │ Verify │ │
│ │ secret │ │
│ └──────┬──────┘ │
│ │ │
│ AgentRegistration │ │
│ {agent_id, hostname, ip, ...} │ │
│────────────────────────────────►│ │
│ │ │
│ ┌──────┴──────┐ │
│ │ Register │ │
│ │ in DashMap │ │
│ └──────┬──────┘ │
│ │ │
│ RegistrationAck │ │
│ {accepted: true} │ │
│◄────────────────────────────────│ │
│ │ │
│ SessionQuery │ │
│ {report_all: true} │ │
│◄────────────────────────────────│ │
│ │ │
│ │ WebSocket event │
│ │ "agent_registered" │
│ │────────────────────────────►│
│ │ │
│ Heartbeat (every 5s) │ │
│────────────────────────────────►│ │
│ │ │
│ Metrics (every 5s) │ WebSocket update │
│────────────────────────────────►│────────────────────────────►│
Dashboard Management Server Agent
│ │ │
│ WS: run_command │ │
│ {agent_id, command} │ │
│──────────────────────►│ │
│ │ │
│ ┌──────┴──────┐ │
│ │ Generate │ │
│ │ command_id │ │
│ └──────┬──────┘ │
│ │ │
│ ┌──────┴──────┐ │
│ │ Register │ │
│ │ pending │ │
│ │ command │ │
│ └──────┬──────┘ │
│ │ │
│ │ CommandRequest │
│ │ {command_id, command, │
│ │ allocate_pty: true} │
│ │────────────────────────────►│
│ │ │
│ │ ┌──────┴──────┐
│ │ │ Fork PTY │
│ │ │ process │
│ │ └──────┬──────┘
│ │ │
│ │ OutputChunk (stdout) │
│ WS: output │◄────────────────────────────│
│◄──────────────────────│ │
│ │ │
│ WS: stdin │ │
│──────────────────────►│ StdinChunk │
│ │────────────────────────────►│
│ │ │
│ WS: resize │ │
│──────────────────────►│ PtyControl(resize) │
│ │────────────────────────────►│
│ │ │
│ │ CommandResult │
│ WS: result │ {exit_code, duration} │
│◄──────────────────────│◄────────────────────────────│
Agentic Sandbox works in two modes and the choice is purely operational — no code changes required.
Standalone mode (default, no extra configuration): All features — VM provisioning, task orchestration, PTY streaming, HITL detection, web dashboard — work without any external dependency. Agentic Sandbox is a complete agent runtime platform on its own.
AIWG-connected mode (set AIWG_SERVE_ENDPOINT): The management server registers with an aiwg serve instance and becomes a managed compute node in the AIWG ecosystem. Events flow outbound in real time; the aiwg serve dashboard gains visibility into all sandboxes, agents, sessions, and HITL requests across the fleet.
AIWG is a multi-agent AI framework providing 188 specialized agents, workflow commands, SDLC phases, and an operator dashboard. Agentic Sandbox provides the persistent execution layer:
┌──────────────────────────────────────────────────────────────┐
│ AIWG Operator Layer │
│ │
│ aiwg serve dashboard Mission Control AIWG CLI │
│ :7337 (mc dispatch) (ralph, flows) │
│ │ │ │ │
│ │ WebSocket events │ │ │
│ │◄─────────────────────┤ │ │
└─────────┼──────────────────────┼────────────────┼────────────┘
│ Registration API │ PTY adapter │ Loadout
│ sandbox events │ delegation │ manifests
▼ ▼ ▼
┌──────────────────────────────────────────────────────────────┐
│ Agentic Sandbox Layer │
│ │
│ Management Server │
│ ├── gRPC :8120 ← agent-client connections │
│ ├── WebSocket :8121 ← terminal streaming, metrics │
│ └── HTTP :8122 ← REST API, dashboard, HITL │
│ │
│ QEMU/KVM VMs Docker Containers │
│ ├── agent-01 (Claude Code + AIWG sdlc-complete) │
│ ├── agent-02 (Claude Code + AIWG forensics) │
│ └── ... │
└──────────────────────────────────────────────────────────────┘
AIWG's daemon PTY adapter can delegate sessions to agentic-sandbox when AIWG_SANDBOX_ENDPOINT is set in aiwg's configuration, routing Tier 2 terminal sessions through agentic-sandbox VMs instead of local processes.
The AiwgServeHandle component in the management server is the outbound integration point. It emits SandboxEvent messages to aiwg serve asynchronously:
// src/aiwg_serve.rs
pub enum SandboxEvent {
    AgentConnected { agent_id, hostname, ip_address, loadout },
    AgentDisconnected { agent_id, reason },
    AgentReady { agent_id },
    AgentProvisioning { agent_id, step, progress_json },
    SessionStart { agent_id, session_id, command },
    SessionEnd { agent_id, session_id, exit_code },
    HitlInputRequired { agent_id, session_id, hitl_id, prompt, context },
}

All events are fire-and-forget (try_send into a buffered channel). The background task serializes them to JSON and sends them over the authenticated WebSocket. If aiwg serve is unreachable, up to 256 events are buffered and any further events are dropped, so the management server is never blocked.
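The fire-and-forget behaviour can be demonstrated with a bounded channel and a non-blocking send. The sketch uses `std::sync::mpsc::sync_channel` as a stand-in for whatever async channel the management server actually uses. The `EventEmitter` type, its drop counter, and the use of pre-serialized JSON strings as the event payload are assumptions.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Non-blocking event emitter backed by a bounded queue.
pub struct EventEmitter {
    tx: SyncSender<String>,
    dropped: u64,
}

impl EventEmitter {
    /// Create an emitter and the receiving end for the background sender task.
    pub fn new(capacity: usize) -> (Self, Receiver<String>) {
        let (tx, rx) = sync_channel(capacity);
        (Self { tx, dropped: 0 }, rx)
    }

    /// Queue an event without ever blocking the caller.
    /// Returns false when the event was dropped (queue full or receiver gone).
    pub fn emit(&mut self, event_json: String) -> bool {
        match self.tx.try_send(event_json) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => {
                self.dropped += 1;
                false
            }
        }
    }

    /// Number of events dropped so far.
    pub fn dropped(&self) -> u64 {
        self.dropped
    }
}
```

Created with `EventEmitter::new(256)`, the emitter accepts 256 queued events while the consumer is stalled and counts every later one as dropped until the consumer drains the queue.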
VMs provisioned with AIWG framework loadouts arrive pre-configured with agents, commands, and skills already deployed. The management server reads aiwg_frameworks metadata from agent registrations and surfaces it in the agent list API:
# profiles/claude-only.yaml
aiwg_frameworks:
- name: sdlc-complete
providers: [claude]

This means agentic-sandbox VMs can participate in AIWG multi-agent workflows from the moment they connect, with no manual setup required.
HITL requests created by agentic-sandbox's prompt detector are immediately pushed to the aiwg serve dashboard:
Agent PTY → prompt_detector → HitlStore::create() → AiwgServeHandle::emit(HitlInputRequired)
                                                                        │
                                                             aiwg serve HITL drawer
                                                                        │
                                                         Operator responds via dashboard
                                                                        │
                                                         POST /api/v1/hitl/{id}/respond
                                                                        │
                                                      CommandDispatcher injects → PTY stdin
The aiwg serve dashboard and the agentic-sandbox REST API are both valid response channels — either can resolve a pending HITL request.
| File | Lines | Purpose |
|---|---|---|
| management/src/main.rs | ~180 | Server entry point |
| management/src/grpc.rs | ~400 | gRPC service |
| management/src/registry.rs | ~290 | Agent registry |
| management/src/dispatch.rs | ~500 | Command dispatcher |
| management/src/orchestrator/mod.rs | ~380 | Orchestrator main |
| management/src/orchestrator/task.rs | ~340 | Task state machine |
| management/src/orchestrator/storage.rs | ~270 | Storage operations |
| management/src/telemetry/metrics.rs | ~590 | Prometheus metrics |
| management/src/crash_loop.rs | ~560 | Crash detection |
| File | Lines | Purpose |
|---|---|---|
| agent-rs/src/main.rs | ~1700 | Agent entry point |
| agent-rs/src/health.rs | ~550 | Health monitoring |
| agent-rs/src/metrics.rs | ~135 | Metrics export |
| agent-rs/src/claude.rs | ~475 | Claude runner |
| File | Lines | Purpose |
|---|---|---|
| proto/agent.proto | ~510 | gRPC protocol definition |
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2025-02-07 | Initial comprehensive documentation |