ramn51/titan-orchestrator

Titan Orchestrator


A distributed execution runtime built from first principles — custom TCP protocol, DAG scheduler, AOF-backed store, and agentic runtime in a single zero-dependency JAR.

5-Minute Quickstart · Full Documentation · How Titan Compares · Architecture


Titan is a distributed execution runtime with a custom DAG scheduler, binary wire protocol, AOF-backed KV store, and agentic execution engine. The core engine is a single JAR with zero external dependencies — it schedules, routes, and executes across a cluster with nothing else installed. The dashboard (Flask) and MCP server (mcp package) are optional extensions on top. It runs on a bare VM or laptop.

Submit jobs in YAML or Python. Titan resolves dependencies, routes to capable workers, streams logs back, and recovers from crashes via AOF replay. At the agentic tier, tasks can spawn new tasks mid-execution, share state across nodes through TitanStore, and pause for human approval before continuing.

It covers three capability tiers in one binary:

| Tier | What you run |
| --- | --- |
| T1: Distributed Task Scheduler | Batch jobs, static DAGs, GPU/CPU-routed workloads, delayed execution |
| T2: Service Orchestrator | Long-running APIs and daemons with auto-restart and port management |
| T3: Agentic Runtime | Self-mutating DAGs, LLM-driven agents, multi-agent pipelines with HITL gates |

Status: v1.0 research release. Titan currently uses a single-Master topology (Raft is on the v2 roadmap) and process-level isolation (Docker is planned for v2), and it has no mTLS yet. It is built to be understood, not to replace Kubernetes or Temporal in production today.


Dashboard

Titan ships with a built-in Python Flask dashboard. Four views:

Orchestrator — Cluster health and worker load

[Screenshot: Titan Orchestrator Dashboard]

Real-time view of every connected worker — capability tag (GENERAL / GPU / HIGH_MEM), active job count, running services, and recent activity. Includes a + Launch Worker button to spin up new nodes from the browser.

DAG Pipelines — Live dependency graph

[Screenshot: DAG Visualizer]

Every pipeline submitted to the cluster — via CLI, SDK, YAML, or the visual Constructor — is automatically rendered as a live dependency graph. Node colours update in real-time as jobs move through PENDING → RUNNING → COMPLETED / FAILED. Click any node to stream stdout/stderr live.

DAG Constructor — Visual pipeline builder

[Screenshot: DAG Constructor]

Browser-based drag-and-drop DAG editor. Add task and service nodes, draw dependency edges, configure script, capability, priority, and HITL gates per node — then deploy directly to the cluster in one click. Auto-generates equivalent Python SDK and YAML as you build.

Agent Runs — Multi-stage agent timeline

[Screenshot: Agent Runs]

Groups all DAG stages that share an agent_run_id into a single timeline row. When an agent runs iteratively (PLAN → ITER → EVAL → SYNTH), each stage is a separate DAG submission — Agent Runs reconstructs the full lifecycle so you don't have to hunt through individual entries.
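
The grouping described above amounts to a group-by on agent_run_id. A minimal sketch, assuming each DAG submission is a plain dict; the field names `dag_id`, `stage`, and `submitted_at` are hypothetical and are not taken from Titan's actual schema:

```python
from collections import defaultdict

def group_agent_runs(submissions):
    """Group DAG submissions into one timeline per agent_run_id.

    Field names other than agent_run_id are hypothetical and used
    only for illustration.
    """
    timelines = defaultdict(list)
    for sub in submissions:
        run_id = sub.get("agent_run_id")
        if run_id is None:
            continue  # plain DAGs without an agent run are not grouped
        timelines[run_id].append(sub)
    # Order each timeline by submission time so stages read in sequence.
    return {
        run_id: sorted(stages, key=lambda s: s["submitted_at"])
        for run_id, stages in timelines.items()
    }

submissions = [
    {"dag_id": "d2", "agent_run_id": "run-1", "stage": "ITER", "submitted_at": 2},
    {"dag_id": "d1", "agent_run_id": "run-1", "stage": "PLAN", "submitted_at": 1},
    {"dag_id": "d3", "agent_run_id": None, "stage": None, "submitted_at": 3},
    {"dag_id": "d4", "agent_run_id": "run-1", "stage": "EVAL", "submitted_at": 4},
]
timeline = group_agent_runs(submissions)
stages = [s["stage"] for s in timeline["run-1"]]
```

Here `stages` lists the three stages of `run-1` in submission order, and the ungrouped DAG `d3` is left out of the timeline.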


Four Ways to Define a Pipeline

| Method | Best for |
| --- | --- |
| YAML file | Repeatable, version-controlled pipelines. Commit to git and re-run any time. |
| Python SDK | Programmatic pipelines where shape is determined at runtime: agent loops, dynamic fan-out. |
| Visual Constructor | Building pipelines without writing code. Drag nodes, draw edges, deploy in one click. |
| MCP (natural language) | Controlling Titan from Claude Desktop or Cursor. Describe what you want, and the agent writes the scripts and submits the DAG on your behalf. |

All four paths produce the same result: a tracked DAG in the visualizer with per-job logs, status, and workspace files.


Demos

Visual DAG Constructor

Build a pipeline by dragging nodes and drawing edges — deploy to the cluster with one click.

[Demo video: Screen.Recording.2026-05-17.at.10.23.48.PM.mov]

Human-in-the-Loop (HITL) Gate

A DAG pauses at a checkpoint and waits for a human Approve/Reject before downstream jobs resume.

[Demo video: Screen.Recording.2026-05-17.at.10.25.48.PM.mov]

HITL on a Complex Graph

HITL gate mid-execution on a multi-branch pipeline — shows how the visualizer reflects the paused state.

[Demo video: Screen.Recording.2026-05-17.at.10.29.38.PM.mov]

Agentic AI Workflow

A multi-stage agent loop — each stage is a separate DAG submission, grouped into one timeline in Agent Runs.

[Demo video: Screen.Recording.2026-05-17.at.11.22.29.PM.mov]

More demos:

  • Control Plane: Dynamic DAG Execution (dynamic_dag.mp4)
  • Reactive Worker Scaling (titan_load_scaling.mp4)
  • GPU Affinity Routing (GPU_Affinity_yaml.mp4)
  • Parallel Execution (Fanout) (fanout_yaml_dag.mp4)
  • Full Load Cycle (Scale Up & Descale) (titan_load_descaling.mp4)

Quick Start

# 1. Build the engine
mvn clean package -DskipTests

# 2. Start the cluster (Master + 2 workers + TitanStore + dashboard)
./titan-dev.sh up

# 3. Open the dashboard
open http://localhost:5000

Submit your first job:

from titan_sdk.titan_sdk import TitanClient, TitanJob

client = TitanClient()
client.submit_dag("hello-world", [
    TitanJob(job_id="hello", script_content="print('hello from Titan')")
])

Full walkthrough: 5-Minute Quickstart


Architecture

Titan Architecture Diagram

Three components:

  • Control Plane (Master) — DAG scheduling, dependency resolution, capability routing, AOF-backed state recovery
  • Workers — Capability-tagged execution nodes that self-register on startup and re-register after restart
  • TitanStore (optional) — AOF-backed KV store for crash recovery and cross-job shared agent state. No external database required.
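
The AOF (append-only file) idea behind TitanStore and Master recovery can be illustrated with a toy store: every write is appended to a log before it is applied, and a restart rebuilds state by replaying the log. This is a sketch under stated assumptions, not Titan's implementation; the JSON-lines record format and the `AofStore` class are invented for illustration.

```python
import json
import os
import tempfile

class AofStore:
    """Toy append-only-file KV store (hypothetical, not Titan's format)."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        self._replay()  # rebuild in-memory state from the log, if any
        self._log = open(path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                record = json.loads(line)
                if record["op"] == "set":
                    self.data[record["key"]] = record["value"]
                elif record["op"] == "del":
                    self.data.pop(record["key"], None)

    def _append(self, record):
        # Persist the operation before applying it in memory.
        self._log.write(json.dumps(record) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())

    def set(self, key, value):
        self._append({"op": "set", "key": key, "value": value})
        self.data[key] = value

    def delete(self, key):
        self._append({"op": "del", "key": key})
        self.data.pop(key, None)

    def close(self):
        self._log.close()

aof_path = os.path.join(tempfile.mkdtemp(), "store.aof")
store = AofStore(aof_path)
store.set("job:hello", "RUNNING")
store.set("job:hello", "COMPLETED")
store.set("scratch", "tmp")
store.delete("scratch")
store.close()  # stand-in for a crash or restart

recovered = AofStore(aof_path)  # state comes back purely from replay
```

After the restart, `recovered.data` holds only the last value written for `job:hello`, because replay applies the logged operations in their original order.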

The wire protocol (TITAN_PROTO) is a custom fixed-header binary format over raw TCP — no JSON on the dispatch path.
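
To make "fixed-header binary format" concrete, here is a small framing sketch. The header layout (2-byte magic, 1-byte version, 1-byte opcode, 4-byte payload length) and the opcode values are assumptions made up for this example; the real TITAN_PROTO layout is described in the architecture documentation, not here.

```python
import struct

# Hypothetical layout, NOT the real TITAN_PROTO header:
# 2-byte magic, 1-byte version, 1-byte opcode, 4-byte payload length (big-endian).
HEADER = struct.Struct("!2sBBI")
MAGIC = b"TP"

def encode_frame(opcode, payload, version=1):
    """Prefix the payload with a fixed-size header."""
    return HEADER.pack(MAGIC, version, opcode, len(payload)) + payload

def decode_frame(buf):
    """Return (opcode, payload, remaining_bytes), or None if the frame is incomplete."""
    if len(buf) < HEADER.size:
        return None  # header not fully received yet
    magic, _version, opcode, length = HEADER.unpack_from(buf)
    if magic != MAGIC:
        raise ValueError("bad magic")
    end = HEADER.size + length
    if len(buf) < end:
        return None  # payload not fully received yet
    return opcode, buf[HEADER.size:end], buf[end:]

frame = encode_frame(0x01, b"print('hello from Titan')")
# Two frames back to back, as they would arrive on a TCP byte stream.
stream = frame + encode_frame(0x02, b"")
opcode, payload, rest = decode_frame(stream)
```

Because the header has a fixed size and carries the payload length, a receiver can split a TCP byte stream into messages without parsing the payload itself.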

Architecture Deep Dive


MCP — Control Titan from Claude Desktop

Titan ships a built-in MCP server. Connect Claude Desktop or Cursor and control your cluster in natural language:

"Research three approaches to distributed ML scheduling. Analyze gaps, methodology, and open problems in parallel. Synthesize into a report."

Titan executes the parallel jobs, fans results into a synthesis job, and returns the output — all from a single prompt. No terminal, no code.

MCP Setup and Use Cases


Key Features

Execution

  • Static DAGs via YAML or Python SDK
  • Dynamic DAG mutation at runtime — tasks can spawn new tasks mid-execution
  • Long-running services alongside batch jobs in one runtime
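
Resolving dependencies in a static DAG comes down to a topological ordering. The sketch below uses Kahn's algorithm to group jobs into "waves" that could run in parallel; it is an illustration of the general technique, not a description of Titan's scheduler, and the job names are invented.

```python
def execution_waves(deps):
    """Group jobs into waves using Kahn's algorithm.

    deps maps job_id -> set of job_ids it depends on. Every job,
    including those with no dependencies, must appear as a key.
    """
    indegree = {job: len(parents) for job, parents in deps.items()}
    children = {job: [] for job in deps}
    for job, parents in deps.items():
        for parent in parents:
            children[parent].append(job)

    ready = sorted(job for job, degree in indegree.items() if degree == 0)
    waves, scheduled = [], 0
    while ready:
        waves.append(ready)
        scheduled += len(ready)
        next_ready = []
        for job in ready:
            for child in children[job]:
                indegree[child] -= 1
                if indegree[child] == 0:  # all parents finished
                    next_ready.append(child)
        ready = sorted(next_ready)

    if scheduled != len(deps):
        raise ValueError("cycle detected in DAG")
    return waves

diamond = {
    "extract": set(),
    "clean": {"extract"},
    "features": {"extract"},
    "train": {"clean", "features"},
}
waves = execution_waves(diamond)
```

For the diamond-shaped pipeline above, `clean` and `features` land in the same wave, which is the situation where a scheduler can dispatch them to different workers at once.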

Routing & Scaling

  • Capability-based routing — tag workers GPU, HIGH_MEM, or custom; jobs are held until a matching node is free
  • Affinity routing — pin jobs to specific workers by tag
  • Least-connection dispatch across available workers
  • Reactive auto-scaling — when a worker's queue saturates, it spawns child worker processes on the same machine to absorb the spike; idle burst workers decommission automatically after 45 seconds
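
Capability-based routing combined with least-connection dispatch can be sketched in a few lines: filter workers by the required tag, then pick the one with the fewest active jobs. The `Worker` class and `pick_worker` function are hypothetical names for illustration and do not mirror Titan's internal types.

```python
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    capabilities: set
    active_jobs: int = 0

def pick_worker(workers, required):
    """Return the least-loaded worker with the required capability, or None."""
    candidates = [w for w in workers if required in w.capabilities]
    if not candidates:
        return None  # no match: the caller keeps the job queued
    return min(candidates, key=lambda w: w.active_jobs)

workers = [
    Worker("w1", {"GENERAL"}, active_jobs=3),
    Worker("w2", {"GENERAL", "GPU"}, active_jobs=2),
    Worker("w3", {"GPU"}, active_jobs=0),
]
gpu_choice = pick_worker(workers, "GPU")
general_choice = pick_worker(workers, "GENERAL")
high_mem_choice = pick_worker(workers, "HIGH_MEM")
```

In this example the GPU job goes to the idle `w3`, the GENERAL job goes to `w2` because it has fewer active jobs than `w1`, and the HIGH_MEM job has no match and stays queued.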

Worker Lifecycle

  • Permanent vs. ephemeral workers — mark nodes as permanent so they stay alive across job completions; burst workers exit when idle
  • Workers re-register with the Master on reconnect — the cluster heals through a Master restart without manual intervention
  • Orphan cleanup on startup — workers scan for and kill leftover processes from a previous crash before accepting new work

Resilience

  • AOF crash recovery — Master replays state on restart, resumes in-flight DAGs
  • Worker re-registration — cluster recovers through a Master restart without manual intervention
  • Callback retry with exponential backoff
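
Retry with exponential backoff follows a common pattern: wait, retry, and multiply the wait after each failure. A minimal sketch, assuming illustrative values for the attempt limit, base delay, and growth factor; Titan's actual retry parameters are not stated in this README.

```python
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.5, factor=2.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call fn until it succeeds, multiplying the wait by `factor` after each failure.

    All default values are assumptions for illustration.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(min(delay, max_delay))
            delay *= factor

# Simulate a callback that fails twice before succeeding.
delays = []
calls = {"n": 0}

def flaky_callback():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("master unreachable")
    return "delivered"

# Record the delays instead of actually sleeping.
result = retry_with_backoff(flaky_callback, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the example fast and testable; in real use the default `time.sleep` applies the delays.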

Observability

  • Live DAG visualizer with per-node status
  • Real-time log streaming from any job
  • Agent Runs timeline for multi-stage agent workflows

Human-in-the-Loop

  • Native HITL gates — pause a DAG at any checkpoint, Approve/Reject from the dashboard
  • Configurable timeout (default 48 hours)
  • Automatic gate injection via SDK
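
Conceptually, a HITL gate is a blocking wait on a human decision with a timeout. The sketch below models that with `threading.Event`; the `HitlGate` class and the `TIMED_OUT` outcome are assumptions for illustration, since this README does not say how Titan resolves a gate whose timeout expires.

```python
import threading

class HitlGate:
    """Toy approval gate (hypothetical, not the Titan SDK API)."""

    def __init__(self, timeout_seconds=48 * 3600):  # 48 hours, matching the stated default
        self.timeout_seconds = timeout_seconds
        self._decided = threading.Event()
        self._approved = False

    def approve(self):
        self._approved = True
        self._decided.set()

    def reject(self):
        self._approved = False
        self._decided.set()

    def wait(self):
        """Block until a decision arrives or the timeout expires."""
        if not self._decided.wait(self.timeout_seconds):
            return "TIMED_OUT"
        return "APPROVED" if self._approved else "REJECTED"

# A reviewer approves shortly after the gate starts waiting.
gate = HitlGate()
threading.Timer(0.05, gate.approve).start()
outcome = gate.wait()

# A gate with a very short timeout and no reviewer expires on its own.
expired = HitlGate(timeout_seconds=0.01).wait()
```

Downstream jobs would be released only on an approved outcome; what happens on rejection or expiry is a policy decision for the orchestrator.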

Cloud Deployment

Titan runs locally out of the box. When you're ready to move to the cloud, package_cloud.sh builds two deployment bundles:

./package_cloud.sh
# → titan-master-bundle.zip   (~2.3 MB)  — everything needed on the Master VM
# → titan-worker-bundle.zip   (~120 KB)  — Worker.jar + titan_sdk for remote workers

Examples

| Example | What it shows |
| --- | --- |
| Build Your First Agent | Writer → Critic loop; the simplest agentic pattern, in ~60 lines |
| Human-in-the-Loop Pipeline | ML pipeline that pauses for human Approve/Reject before training |
| Multi-Agent Research Pipeline | Parallel agents + HITL gate + synthesis fan-in |
| Static YAML Pipelines | Diamond patterns, GPU routing, parallel fan-out |
| LangChain Integration | Wrap the Titan SDK as LangChain tools; no MCP needed |

Contributing

Titan is an experimental runtime built from first principles by a single developer. Bug reports, edge case findings, and contributions are encouraged.

Report an Issue · How to Contribute


License

Licensed under the Apache License 2.0. © 2026 Ram Narayanan A S.
