A distributed execution runtime built from first principles — custom TCP protocol, DAG scheduler, AOF-backed store, and agentic runtime in a single zero-dependency JAR.
5-Minute Quickstart · Full Documentation · How Titan Compares · Architecture
Titan is a distributed execution runtime with a custom DAG scheduler, binary wire protocol, AOF-backed KV store, and agentic execution engine. The core engine is a single JAR with zero external dependencies — it schedules, routes, and executes across a cluster with nothing else installed. The dashboard (Flask) and MCP server (mcp package) are optional extensions on top. It runs on a bare VM or laptop.
Submit jobs in YAML or Python. Titan resolves dependencies, routes to capable workers, streams logs back, and recovers from crashes via AOF replay. At the agentic tier, tasks can spawn new tasks mid-execution, share state across nodes through TitanStore, and pause for human approval before continuing.
It covers three capability tiers in one binary:
| Tier | What you run |
|---|---|
| T1 — Distributed Task Scheduler | Batch jobs, static DAGs, GPU/CPU-routed workloads, delayed execution |
| T2 — Service Orchestrator | Long-running APIs and daemons with auto-restart and port management |
| T3 — Agentic Runtime | Self-mutating DAGs, LLM-driven agents, multi-agent pipelines with HITL gates |
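Dependency resolution for a static DAG (tier T1) can be pictured as a topological ordering of jobs. The sketch below uses Kahn's algorithm as an illustration; it is not Titan's scheduler, and the `deps` mapping and `resolve_order` name are assumptions made for the example.

```python
from collections import deque


def resolve_order(deps):
    """Return job ids in an order that respects dependencies.

    `deps` maps each job id to the list of job ids it depends on.
    Every referenced parent must also appear as a key.
    This layout is hypothetical, not Titan's internal format.
    """
    # Count unmet dependencies per job and index the reverse edges.
    pending = {job: len(parents) for job, parents in deps.items()}
    children = {job: [] for job in deps}
    for job, parents in deps.items():
        for parent in parents:
            children[parent].append(job)

    # Jobs with no unmet dependencies can run immediately.
    ready = deque(job for job, count in pending.items() if count == 0)
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for child in children[job]:
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)

    # Any job left unscheduled is part of a cycle.
    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order


# A diamond-shaped pipeline: two branches fan out and join again.
diamond = {
    "extract": [],
    "clean": ["extract"],
    "stats": ["extract"],
    "report": ["clean", "stats"],
}
order = resolve_order(diamond)
```

In a real scheduler, jobs that become ready at the same time (`clean` and `stats` here) would be dispatched in parallel rather than serialized.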
Status: v1.0, research-grade. Titan currently uses a single-Master topology (Raft is on the v2 roadmap) and process-level isolation (Docker is planned for v2), and it does not support mTLS yet. It is built to be understood, not to replace Kubernetes or Temporal in production today.
Titan ships with a built-in Python Flask dashboard. Four views:
Real-time view of every connected worker — capability tag (GENERAL / GPU / HIGH_MEM), active job count, running services, and recent activity. Includes a + Launch Worker button to spin up new nodes from the browser.
Every pipeline submitted to the cluster — via CLI, SDK, YAML, or the visual Constructor — is automatically rendered as a live dependency graph. Node colours update in real-time as jobs move through PENDING → RUNNING → COMPLETED / FAILED. Click any node to stream stdout/stderr live.
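The status progression that drives node colours can be modelled as a small state machine. This sketch assumes only the four states named above and no retry or cancel transitions; `JobState`, `ALLOWED`, and `advance` are illustrative names, not Titan APIs.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Assumed transition table: terminal states have no outgoing edges.
ALLOWED = {
    JobState.PENDING: {JobState.RUNNING},
    JobState.RUNNING: {JobState.COMPLETED, JobState.FAILED},
    JobState.COMPLETED: set(),
    JobState.FAILED: set(),
}


def advance(current, target):
    """Return the new state, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


state = advance(JobState.PENDING, JobState.RUNNING)
state = advance(state, JobState.COMPLETED)
```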
Browser-based drag-and-drop DAG editor. Add task and service nodes, draw dependency edges, configure script, capability, priority, and HITL gates per node — then deploy directly to the cluster in one click. Auto-generates equivalent Python SDK and YAML as you build.
Groups all DAG stages that share an agent_run_id into a single timeline row. When an agent runs iteratively (PLAN → ITER → EVAL → SYNTH), each stage is a separate DAG submission — Agent Runs reconstructs the full lifecycle so you don't have to hunt through individual entries.
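As a rough picture of that grouping, the sketch below collects DAG submissions by `agent_run_id` and orders each run's stages by submission time. Only `agent_run_id` comes from the description above; the record shape and the `dag_id`, `stage`, and `submitted_at` fields are hypothetical.

```python
from collections import defaultdict


def group_agent_runs(submissions):
    """Group DAG submissions into one timeline per agent_run_id."""
    runs = defaultdict(list)
    for submission in submissions:
        runs[submission["agent_run_id"]].append(submission)
    # Order the stages of each run by (assumed) submission timestamp.
    return {
        run_id: sorted(stages, key=lambda s: s["submitted_at"])
        for run_id, stages in runs.items()
    }


# Hypothetical records; the real dashboard data may look different.
submissions = [
    {"dag_id": "d2", "agent_run_id": "run-1", "stage": "ITER", "submitted_at": 2},
    {"dag_id": "d1", "agent_run_id": "run-1", "stage": "PLAN", "submitted_at": 1},
    {"dag_id": "d3", "agent_run_id": "run-2", "stage": "PLAN", "submitted_at": 3},
]
timeline = group_agent_runs(submissions)
```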
| Method | Best for |
|---|---|
| YAML file | Repeatable, version-controlled pipelines. Commit to git and re-run any time. |
| Python SDK | Programmatic pipelines where shape is determined at runtime — agent loops, dynamic fan-out. |
| Visual Constructor | Building pipelines without writing code. Drag nodes, draw edges, deploy in one click. |
| MCP (natural language) | Controlling Titan from Claude Desktop or Cursor. Describe what you want — the agent writes the scripts and submits the DAG on your behalf. |
All four paths produce the same result: a tracked DAG in the visualizer with per-job logs, status, and workspace files.
Build a pipeline by dragging nodes and drawing edges — deploy to the cluster with one click.
A DAG pauses at a checkpoint and waits for a human Approve/Reject before downstream jobs resume.
HITL gate mid-execution on a multi-branch pipeline — shows how the visualizer reflects the paused state.
A multi-stage agent loop — each stage is a separate DAG submission, grouped into one timeline in Agent Runs.
More demo recordings: Dynamic DAG Execution (control plane), Reactive Worker Scaling, GPU Affinity Routing, Parallel Execution (Fanout), Full Load Cycle (Scale Up & Descale).
```bash
# 1. Build the engine
mvn clean package -DskipTests

# 2. Start the cluster (Master + 2 workers + TitanStore + dashboard)
./titan-dev.sh up

# 3. Open the dashboard
open http://localhost:5000
```

Submit your first job:

```python
from titan_sdk.titan_sdk import TitanClient, TitanJob

client = TitanClient()
client.submit_dag("hello-world", [
    TitanJob(job_id="hello", script_content="print('hello from Titan')")
])
```

Full walkthrough: 5-Minute Quickstart
Three components:
- Control Plane (Master) — DAG scheduling, dependency resolution, capability routing, AOF-backed state recovery
- Workers — Capability-tagged execution nodes that self-register on startup and re-register after restart
- TitanStore (optional) — AOF-backed KV store for crash recovery and cross-job shared agent state. No external database required.
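To illustrate what "AOF-backed" means in general, here is a minimal append-only-file KV store: every write is appended to a log before it is applied in memory, and the in-memory state is rebuilt by replaying the log on startup. This is a sketch of the technique only. The JSON-lines log format and the `AppendOnlyStore` class are assumptions; TitanStore's actual on-disk format is not described here.

```python
import json
import os


class AppendOnlyStore:
    """Toy AOF-backed KV store (hypothetical; not TitanStore's format)."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        self._replay()
        self._log = open(path, "a", encoding="utf-8")

    def _replay(self):
        # Rebuild in-memory state by re-applying every logged operation.
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as log:
            for line in log:
                try:
                    entry = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn final write from a crash; stop replaying
                if entry["op"] == "set":
                    self.data[entry["key"]] = entry["value"]
                elif entry["op"] == "del":
                    self.data.pop(entry["key"], None)

    def _append(self, entry):
        # Persist the operation before it becomes visible in memory.
        self._log.write(json.dumps(entry) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())

    def set(self, key, value):
        self._append({"op": "set", "key": key, "value": value})
        self.data[key] = value

    def delete(self, key):
        self._append({"op": "del", "key": key})
        self.data.pop(key, None)

    def close(self):
        self._log.close()
```

Reopening the same path after a crash yields the state as of the last fully written entry, which is the property that lets a Master resume in-flight work from its log.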
The wire protocol (TITAN_PROTO) is a custom fixed-header binary format over raw TCP — no JSON on the dispatch path.
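A fixed-header binary frame can be sketched with Python's `struct` module. The layout below (4-byte magic, 1-byte version, 1-byte opcode, 4-byte payload length, big-endian) is entirely hypothetical and chosen for illustration; it is not the TITAN_PROTO specification.

```python
import struct

# Hypothetical header: magic (4s), version (B), opcode (B), payload length (I).
HEADER = struct.Struct(">4sBBI")
MAGIC = b"TITN"  # placeholder value, not the real protocol magic


def encode_frame(opcode, payload, version=1):
    """Prefix the payload with a fixed-size header."""
    return HEADER.pack(MAGIC, version, opcode, len(payload)) + payload


def decode_frame(buffer):
    """Return (opcode, payload, remaining_bytes), or None if more bytes are needed."""
    if len(buffer) < HEADER.size:
        return None
    magic, _version, opcode, length = HEADER.unpack_from(buffer)
    if magic != MAGIC:
        raise ValueError("bad magic bytes")
    end = HEADER.size + length
    if len(buffer) < end:
        return None  # header arrived, payload still in flight
    return opcode, buffer[HEADER.size:end], buffer[end:]


frame = encode_frame(7, b"hello")
decoded = decode_frame(frame + b"xx")
```

Because the header has a fixed size, a reader on a TCP stream always knows how many bytes to wait for before it can determine the payload length, which avoids delimiter scanning and text parsing on the dispatch path.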
Titan ships a built-in MCP server. Connect Claude Desktop or Cursor and control your cluster in natural language:
"Research three approaches to distributed ML scheduling. Analyze gaps, methodology, and open problems in parallel. Synthesize into a report."
Titan executes the parallel jobs, fans results into a synthesis job, and returns the output — all from a single prompt. No terminal, no code.
Execution
- Static DAGs via YAML or Python SDK
- Dynamic DAG mutation at runtime — tasks can spawn new tasks mid-execution
- Long-running services alongside batch jobs in one runtime
Routing & Scaling
- Capability-based routing: tag workers GPU, HIGH_MEM, or custom; jobs are held until a matching node is free
- Affinity routing: pin jobs to specific workers by tag
- Least-connection dispatch across available workers
- Reactive auto-scaling — when a worker's queue saturates, it spawns child worker processes on the same machine to absorb the spike; idle burst workers decommission automatically after 45 seconds
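Capability-based routing combined with least-connection dispatch can be sketched as "filter by tag, then pick the least-loaded worker". The `Worker` dataclass and `pick_worker` function below are illustrative names, and the single-tag-per-worker model is an assumption rather than Titan's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    capability: str = "GENERAL"  # e.g. GENERAL, GPU, HIGH_MEM, or a custom tag
    active_jobs: int = 0


def pick_worker(workers, required):
    """Return the least-loaded worker with the required tag.

    Returns None when no worker matches, in which case the caller
    would keep the job queued until a matching node registers.
    """
    eligible = [w for w in workers if w.capability == required]
    if not eligible:
        return None
    return min(eligible, key=lambda w: w.active_jobs)


workers = [
    Worker("w1", "GENERAL", active_jobs=3),
    Worker("w2", "GPU", active_jobs=2),
    Worker("w3", "GPU", active_jobs=0),
]
chosen = pick_worker(workers, "GPU")
```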
Worker Lifecycle
- Permanent vs. ephemeral workers — mark nodes as permanent so they stay alive across job completions; burst workers exit when idle
- Workers re-register with the Master on reconnect — the cluster heals through a Master restart without manual intervention
- Orphan cleanup on startup — workers scan for and kill leftover processes from a previous crash before accepting new work
Resilience
- AOF crash recovery — Master replays state on restart, resumes in-flight DAGs
- Worker re-registration — cluster recovers through a Master restart without manual intervention
- Callback retry with exponential backoff
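Retry with exponential backoff generally follows the pattern below: wait, retry, and multiply the wait after each failure. The function name, default delays, and attempt count are assumptions for illustration and do not reflect Titan's actual retry settings.

```python
import time


def retry_with_backoff(callback, attempts=5, base_delay=0.5, factor=2.0,
                       sleep=time.sleep):
    """Call `callback` until it succeeds or `attempts` is exhausted.

    The delay starts at `base_delay` seconds and is multiplied by
    `factor` after each failure. `sleep` is injectable for testing.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return callback()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= factor
```

Production implementations often add random jitter to the delay so that many clients recovering at once do not retry in lockstep.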
Observability
- Live DAG visualizer with per-node status
- Real-time log streaming from any job
- Agent Runs timeline for multi-stage agent workflows
Human-in-the-Loop
- Native HITL gates — pause a DAG at any checkpoint, Approve/Reject from the dashboard
- Configurable timeout (default 48 hours)
- Automatic gate injection via SDK
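Conceptually, a gate with a timeout is a blocking wait on a human decision. The sketch below models that with `threading.Event`; the `ApprovalGate` class, its method names, and the `TIMED_OUT` outcome are hypothetical, since what Titan does when a gate expires is not stated here.

```python
import threading


class ApprovalGate:
    """Toy HITL gate: blocks until approved, rejected, or timed out."""

    def __init__(self, timeout_seconds=48 * 3600):  # 48-hour default, as above
        self.timeout_seconds = timeout_seconds
        self._decided = threading.Event()
        self._approved = False

    def approve(self):
        self._approved = True
        self._decided.set()

    def reject(self):
        self._approved = False
        self._decided.set()

    def wait(self):
        """Return 'APPROVED', 'REJECTED', or 'TIMED_OUT' (assumed outcome names)."""
        if not self._decided.wait(self.timeout_seconds):
            return "TIMED_OUT"
        return "APPROVED" if self._approved else "REJECTED"


gate = ApprovalGate(timeout_seconds=5)
gate.approve()          # in practice this would come from the dashboard
outcome = gate.wait()   # returns immediately because a decision exists
```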
Titan runs locally out of the box. When you're ready to move to the cloud, package_cloud.sh builds two deployment bundles:
```bash
./package_cloud.sh
# → titan-master-bundle.zip (~2.3 MB): everything needed on the Master VM
# → titan-worker-bundle.zip (~120 KB): Worker.jar + titan_sdk for remote workers
```

- Multi-VM Setup: permanent cluster on GCP / AWS / Azure
- Remote GPU via SSH Tunnel: keep your local machine as Master, tunnel a RunPod or cloud VM as a worker with no open ports
| Example | What it shows |
|---|---|
| Build Your First Agent | Writer → Critic loop — simplest agentic pattern in ~60 lines |
| Human-in-the-Loop Pipeline | ML pipeline that pauses for human Approve/Reject before training |
| Multi-Agent Research Pipeline | Parallel agents + HITL gate + synthesis fan-in |
| Static YAML Pipelines | Diamond patterns, GPU routing, parallel fan-out |
| LangChain Integration | Wrap the Titan SDK as LangChain tools — no MCP needed |
Titan is an experimental runtime built from first principles by a single developer. Bug reports, edge case findings, and contributions are encouraged.
Report an Issue · How to Contribute
Licensed under the Apache License 2.0. © 2026 Ram Narayanan A S.





