AgentRL

Local-first Harness Operating System for defining, evaluating, evolving, versioning, and deploying agent harnesses.

Bring your own agent, repo, tools, or runtime. AgentRL turns them into harnesses you can evaluate, improve, version, and deploy locally.

Packages

Install from PyPI:

pip install agentrl-os

Run from GitHub Container Registry:

docker pull ghcr.io/junaidahmed361/agentrl:latest
docker run --rm ghcr.io/junaidahmed361/agentrl:latest --version

GitHub package: ghcr.io/junaidahmed361/agentrl

Package name: agentrl-os

Import package: agentrl

CLI command: agentrl

AgentRL is a systems layer for defining, evaluating, evolving, versioning, and deploying agent harnesses through one unified interface. It standardizes how agent systems are tested and improved over time without forcing teams to rebuild task formats, reward schemas, trajectory traces, registries, and local deployment plumbing from scratch.

AgentRL is not an orchestration framework and not primarily an RL framework. RL, prompt optimization, skill optimization, memory optimization, preference learning, and tool optimization are implementation details behind the harness abstraction.

Why AgentRL exists

Agentic systems are becoming easier to prototype and harder to operate.

A team can assemble a capable agent from a model, a prompt, tools, memory, an orchestration library, and a few eval scripts. The hard part starts after the demo:

How are tasks represented?
How are rewards represented?
How are evaluation results compared over time?
How are trajectories stored and replayed?
How are harness changes versioned?
How do you know a prompt, skill, tool policy, or memory policy actually improved behavior?
How do you safely deploy a new behavior and roll it back?

Most agent projects answer these questions with ad-hoc glue code. That glue code becomes the real operating system for the agent, but it is rarely standardized, versioned, or portable.

AgentRL exists to make that systems layer explicit.

The goal is to make building, improving, and operating agent harnesses feel closer to using scikit-learn-style project primitives:

from agentrl import Project

project = Project("./my-agent-system")
project.compile()
project.fit()          # sklearn-style alias for lifecycle train/evolve
project.transform()    # compiled harness artifacts
project.score()        # average harness evaluation pass rate
project.reinforce({    # retro feedback routed from Campaigns or another product layer
    "source": "campaign_retrospective",
    "target": "Market Researcher",
    "instruction": "Require competitor price citations before recommendations.",
    "reinforcement_targets": ["evaluation", "memory", "prompts"],
})
project.self_retro({   # final-review trace traversal -> root cause -> reinforcement
    "final_review": "Weak competitor pricing evidence in the Market Researcher trace.",
    "trace_paths": ["traces/market-researcher.jsonl", "traces/analytics-agent.jsonl"],
    "signals": ["Market Researcher omitted competitor pricing citations"],
})
project.train(strategy="verification")
project.evaluate()
project.auto_harness()
project.deploy()

The gap AgentRL fills

The ecosystem already has strong tools, but they solve different parts of the lifecycle:

LangGraph helps build stateful agent orchestration graphs.
TRL helps train models with reinforcement learning and preference methods.
Ray helps scale distributed execution.
Verifiers and RLVR-style systems help score verifiable tasks.
Atropos-style systems help collect rollouts and trajectories.
Repo2RLEnv/Harbor-style systems turn repositories into verifiable coding tasks.
Evaluation frameworks help run benchmark suites.

AgentRL is the layer above those pieces:

tasks + harness + rewards + evaluation + traces + versions + deployment

It gives those pieces a common operating model so teams can move from experiments to repeatable harness evolution without locking into a single runtime or training method.

What AgentRL is and is not

AgentRL is a Harness Operating System.

It owns:

Project layout
Harness definitions
Task and reward schemas
Evaluation records
Trajectory/trace observability
Local version registry
Self-evolution candidate management
Local deployment records
Adapter boundaries to external systems

It does not try to replace:

agent runtimes
graph orchestration frameworks
RL training libraries
distributed compute systems
repository-to-task synthesis pipelines
hosted experiment platforms

Those are integrations or backends. AgentRL standardizes the harness lifecycle around them.

Differentiation from existing tools

AgentRL vs LangGraph

LangGraph is for orchestrating stateful agent workflows.

AgentRL is for defining, evaluating, evolving, versioning, and deploying the harnesses those workflows run inside.

A LangGraph app can be wrapped by an AgentRL harness. AgentRL should not become a competing graph runtime.

LangGraph: how should the agent transition between steps?
AgentRL: how is this behavior evaluated, improved, versioned, and deployed?

AgentRL vs OpenHarness

OpenHarness and AgentRL are adjacent, not substitutes.

OpenHarness is an agent runtime harness: it focuses on goal execution with tools, memory, permissions, skills, hooks, MCP, subagents, context management, and an agent loop.

AgentRL is a harness lifecycle system: it starts from execution artifacts and manages evaluation, reward definitions, trace collection, versioning, evolution, deployment, and rollback.

OpenHarness:
Goal → Execution

AgentRL:
Execution → Evaluation → Evolution → Versioning → Deployment

AgentRL should not add a competing agent loop, MCP layer, permissions framework, subagent runtime, campaign autorun loop, contracted-worker queue, or context manager. OpenHarness can instead be attached as a runtime adapter:

from agentrl import Project
from agentrl.adapters import OpenHarnessAdapter

project = Project("./agent-system")
runtime = OpenHarnessAdapter.from_endpoint("http://localhost:8000")
project.attach_runtime(runtime)

Clean layering:

Repo2RLEnv / Harbor: repo → verifiable coding tasks
OpenHarness: task or goal → execution trajectory
AgentRL: execution trajectory → evaluate, evolve, version, deploy

AgentRL vs TRL/RL libraries

TRL and similar libraries help train models.

AgentRL covers what sits before and around training: task schemas, reward specs, evaluations, traces, versioning, and deployment. Training is one possible optimization backend, not the identity of the project.

TRL: optimize model weights.
AgentRL: operate harnesses; use training only when cheaper optimizations are insufficient.

AgentRL’s preferred improvement order is intentionally practical:

prompts → skills → memory policies → routing → tools → fine-tuning → RL
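One way to read this ordering is as a cheapest-first search that stops at the first stage that is good enough. The sketch below is illustrative only: the stage names come from the line above, while `first_sufficient_stage`, the `try_stage` callback, and the pass-rate numbers are hypothetical and not part of the AgentRL API.

```python
from typing import Callable, Optional

# Stage names follow the ordering in the text; everything else is hypothetical.
IMPROVEMENT_ORDER = [
    "prompts", "skills", "memory policies", "routing",
    "tools", "fine-tuning", "RL",
]


def first_sufficient_stage(
    try_stage: Callable[[str], float],
    target_pass_rate: float,
) -> Optional[str]:
    """Return the cheapest stage whose pass rate meets the target, or None."""
    for stage in IMPROVEMENT_ORDER:
        if try_stage(stage) >= target_pass_rate:
            return stage
    return None


# Made-up pass rates, for illustration only.
fake_scores = {"prompts": 0.62, "skills": 0.71, "memory policies": 0.83}
chosen = first_sufficient_stage(lambda s: fake_scores.get(s, 0.0), 0.80)
print(chosen)  # memory policies
```

In this toy run the search never reaches fine-tuning or RL, because a memory-policy change already clears the target.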

AgentRL vs eval frameworks

Eval frameworks usually run tests and produce scores.

AgentRL includes evaluation, but connects it to harness compilation, candidate evolution, trace replay, version registry, deployment preflight checks, and rollback.

Eval framework: did this system pass a test?
AgentRL: should this harness change be promoted and deployed?
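The promotion question can be pictured as a comparison between a baseline evaluation and a candidate evaluation. This is a minimal sketch under stated assumptions: `EvalSummary`, `should_promote`, and the `min_gain` and `min_tasks` thresholds are hypothetical names chosen for illustration, not AgentRL types or defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSummary:
    """Hypothetical summary of one evaluation run (not an AgentRL type)."""
    version_id: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def should_promote(
    baseline: EvalSummary,
    candidate: EvalSummary,
    min_gain: float = 0.02,
    min_tasks: int = 20,
) -> bool:
    """Promote only if the candidate ran enough tasks and beats the baseline."""
    if candidate.total < min_tasks:
        return False  # too little evidence to justify a promotion
    return candidate.pass_rate - baseline.pass_rate >= min_gain


baseline = EvalSummary("v1", passed=30, total=50)
candidate = EvalSummary("v2", passed=35, total=50)
print(should_promote(baseline, candidate))  # True
```

The minimum-task check is one possible guard against promoting a change on the strength of a handful of lucky runs.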

AgentRL vs Repo2RLEnv / Harbor

Repo2RLEnv/Harbor-style systems generate verifiable coding tasks from repositories.

AgentRL does not reimplement that synthesis. It imports those tasks into CodingHarness using Repo2RLEnvAdapter, preserving provenance, content hashes, sandbox metadata, and executable verification rewards.

Repo2RLEnv: repo → verifiable coding tasks
AgentRL: tasks + harness → evaluate, optimize, version, deploy

If Repo2RLEnv output is unavailable or invalid, AgentRL records provenance errors and imports no fabricated passing tasks.
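The "record errors, import nothing fabricated" behavior can be sketched as a validation pass that either accepts a record unchanged or logs why it was rejected. The field names (`task_id`, `prompt`, `content_hash`) and the use of SHA-256 over the prompt are assumptions made for this example; the actual Repo2RLEnvAdapter format is not described here.

```python
import hashlib
from typing import Any


def import_tasks(
    records: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[str]]:
    """Split raw records into accepted tasks and provenance errors."""
    accepted: list[dict[str, Any]] = []
    errors: list[str] = []
    for i, rec in enumerate(records):
        missing = [k for k in ("task_id", "prompt", "content_hash") if k not in rec]
        if missing:
            errors.append(f"record {i}: missing fields {missing}")
            continue
        digest = hashlib.sha256(rec["prompt"].encode("utf-8")).hexdigest()
        if digest != rec["content_hash"]:
            errors.append(f"record {i}: content hash mismatch for {rec['task_id']}")
            continue
        accepted.append(rec)  # accepted as-is; nothing is repaired or invented
    return accepted, errors


good_prompt = "Fix the failing test in cli.py"
records = [
    {"task_id": "t1", "prompt": good_prompt,
     "content_hash": hashlib.sha256(good_prompt.encode("utf-8")).hexdigest()},
    {"task_id": "t2", "prompt": "tampered", "content_hash": "0" * 64},
    {"task_id": "t3"},
]
tasks, errors = import_tasks(records)
print(len(tasks), len(errors))  # 1 2
```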

AgentRL vs hosted experiment platforms

AgentRL is local-first. The MVP works without a hosted service:

harness compilation
local registry
local traces
local evaluation
local self-evolution candidates
local deployment records

Hosted registries, managed evals, GPU training, or enterprise governance can exist later as optional services, not prerequisites.

Research and library lineage

AgentRL is designed to sit on top of, or interoperate with, research and libraries such as:

RLVR / execution rewards for verifiable tasks
DPO and preference learning for subjective tasks
GEPA-style reflective prompt evolution
SkillOpt-style skill evolution
learned reward models and LLM judges
TRL for training backends
Ray for distributed execution
LangGraph for orchestration backends
Verifiers for executable reward environments
Atropos-style trajectory collection
Repo2RLEnv / Harbor for repo-derived coding tasks

These remain implementation details. AgentRL’s public abstraction stays centered on Project and Harness.

Install

pip install agentrl-os

For local development:

git clone https://github.com/junaidahmed361/agentrl.git
cd agentrl
uv sync --extra dev
uv run pytest -q

Quick start

agentrl init my-agent-system
cd my-agent-system
agentrl compile
agentrl train
agentrl evaluate
agentrl deploy

Targeted agent harnesses for Campaigns

Campaigns can employ AgentRL-backed agents, but AgentRL owns the harness lifecycle for each targeted agent.

Example: create a market researcher harness before a marketing campaign employs it.

agentrl init campaign-harnesses
cd campaign-harnesses
agentrl create-agent-harness \
  --agent "Market Researcher" \
  --role market_researcher \
  --objective "Support a marketing campaign with RAG-grounded market analysis"
agentrl evaluate
agentrl deploy

AgentRL infers applicable components from the role/objective, such as RAG, trace recording, decision logs, evaluation, memory, tool use, approval gates, and contracting support. Campaigns then references the resulting pod/harness declaration as an employed fleet agent while preserving AgentRL as the lifecycle owner.

Example demo

For a local Hermes-style agent replication demo, see:

examples/local-hermes-agent-os.md

The demo shows how AgentRL can represent a local agent OS with router, coding, RAG, tool-use, memory, skills, registry, traces, and local deployment while keeping Hermes-style execution as a harness capability rather than a competing runtime.

You can also dogfood it directly from the CLI:

agentrl demo local-agent-os --path local-agent-os --goal "Fix a failing pytest in this repo"
cd local-agent-os
agentrl agent-os

The demo local-agent-os command removes the manual setup otherwise needed to dogfood the project: it initializes the template, compiles harnesses, routes a sample goal, records trace/memory files, runs evaluation, creates an adaptive auto-harness candidate, and writes a local deployment record. The agent-os shell then lets you keep interacting with the same project.

The shell routes goals to a coding, RAG, or tool-use harness, records JSONL memory under .agentrl/agent_os/, writes trace files under .agentrl/traces/, and reuses the same evaluation, version registry, auto-harness, and local deployment commands as the rest of AgentRL.
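To make the routing and JSONL memory idea concrete, here is a rough sketch. The keyword table, the fallback to `tool_use`, and the `memory.jsonl` file name are assumptions for illustration; only the three harness names and the `.agentrl/agent_os/` directory come from the text above, and the real shell may route quite differently.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical keyword table; the actual routing logic is not described here.
ROUTES = {
    "coding": ("fix", "test", "pytest", "bug", "refactor"),
    "rag": ("what", "why", "explain", "summarize", "cite"),
    "tool_use": ("call", "fetch", "send", "schedule"),
}


def route_goal(goal: str) -> str:
    """Pick the harness whose keywords match the most words in the goal."""
    words = [w.strip(".,?!\"'") for w in goal.lower().split()]
    scores = {name: sum(w in kws for w in words) for name, kws in ROUTES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "tool_use"  # arbitrary fallback


def record_memory(root: Path, goal: str, harness: str) -> Path:
    """Append one JSON object per line under .agentrl/agent_os/."""
    memory_dir = root / ".agentrl" / "agent_os"
    memory_dir.mkdir(parents=True, exist_ok=True)
    path = memory_dir / "memory.jsonl"  # file name is an assumption
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"goal": goal, "harness": harness}) + "\n")
    return path


with tempfile.TemporaryDirectory() as tmp:
    goal = "Fix a failing pytest in this repo"
    path = record_memory(Path(tmp), goal, route_goal(goal))
    entries = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]

print(entries[0]["harness"])  # coding
```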

Python API

from agentrl import Project

project = Project("./my-agent-system")
project.compile()
project.fit()          # sklearn-style alias for lifecycle train/evolve
project.transform()    # compiled harness artifacts
project.score()        # average harness evaluation pass rate
project.reinforce({    # retro feedback routed from Campaigns or another product layer
    "source": "campaign_retrospective",
    "target": "Market Researcher",
    "instruction": "Require competitor price citations before recommendations.",
    "reinforcement_targets": ["evaluation", "memory", "prompts"],
})
project.self_retro({   # final-review trace traversal -> root cause -> reinforcement
    "final_review": "Weak competitor pricing evidence in the Market Researcher trace.",
    "trace_paths": ["traces/market-researcher.jsonl", "traces/analytics-agent.jsonl"],
    "signals": ["Market Researcher omitted competitor pricing citations"],
})
project.train(strategy="verification")
project.evaluate()
project.auto_harness()
project.deploy()
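The comment on project.score() describes it as an average harness evaluation pass rate. One plausible reading, shown here purely as an assumption, is an unweighted mean of per-harness pass rates. The helper names and the sample outcomes are hypothetical, and the actual aggregation inside AgentRL may differ (for example, it could pool all tasks).

```python
from statistics import mean


def harness_pass_rate(outcomes: list[bool]) -> float:
    """Fraction of passed tasks for a single harness."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def average_pass_rate(results: dict[str, list[bool]]) -> float:
    """Unweighted mean of per-harness pass rates (assumed aggregation)."""
    if not results:
        return 0.0
    return mean(harness_pass_rate(o) for o in results.values())


# Made-up outcomes, for illustration only.
results = {
    "coding": [True, True, False, True],  # 0.75
    "rag": [True, False],                 # 0.50
    "tool_use": [True, True],             # 1.00
}
print(average_pass_rate(results))  # 0.75
```

With this reading, a small harness counts as much as a large one, which is worth keeping in mind when comparing scores across projects.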

Core operating model

Project
├── Harnesses
│   ├── Tasks
│   ├── Rewards
│   ├── Evaluations
│   ├── Policies
│   └── Goal Workflows
├── Memory
├── Skills
├── Version Registry
├── Observability
└── Deployment

Public top-level concepts stay intentionally small:

Project
Harness
Memory
Skills
Version Registry
Observability
Deployment
Goal Workflows
Auto-Harness

Advanced methods such as RLVR, DPO, GEPA, SkillOpt, TRL, Ray, LangGraph, Verifiers, and Atropos are adapters or backend implementation details.

Built-in harnesses

coding: verifiable coding tasks using filesystem/terminal evidence
rag: retrieval-grounded question answering with citation/hallucination reward dimensions
tool_use: safe tool-selection and tool-call evaluation
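Reward dimensions such as the citation and hallucination dimensions mentioned for the rag harness can be thought of as named scores combined into one value. The sketch below assumes a simple weighted average; `RewardDimension`, `combine`, the dimension names, and the weights are hypothetical and do not describe the actual RewardSpec schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardDimension:
    """Hypothetical named reward dimension with a relative weight."""
    name: str
    weight: float


def combine(dimensions: list[RewardDimension], scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores; missing scores count as 0."""
    total_weight = sum(d.weight for d in dimensions)
    if total_weight <= 0:
        raise ValueError("at least one dimension needs a positive weight")
    weighted = sum(d.weight * scores.get(d.name, 0.0) for d in dimensions)
    return weighted / total_weight


rag_dimensions = [
    RewardDimension("citation", 1.0),
    RewardDimension("hallucination_free", 1.0),
]
reward = combine(rag_dimensions, {"citation": 1.0, "hallucination_free": 0.5})
print(reward)  # 0.75
```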

Repo2RLEnv adapter

from agentrl import Project
from agentrl.adapters import Repo2RLEnvAdapter

project = Project.init("./coding-agent")
source = Repo2RLEnvAdapter.from_repo(
    repo="pallets/click",
    pipeline="pr_runtime",
    limit=10,
)
project.harness("coding").add_tasks(source.to_taskset())
project.compile()
project.train(strategy="verification")
project.evaluate()

The adapter maps Repo2RLEnv/Harbor-style metadata into AgentRL TaskSet objects and attaches executable verification rewards to the coding harness.

OpenHarness runtime adapter

Use OpenHarness as an execution runtime adapter instead of rebuilding its runtime capabilities inside AgentRL.

from agentrl import Project
from agentrl.adapters import OpenHarnessAdapter

project = Project.init("./agent-system")
runtime = OpenHarnessAdapter.from_endpoint("http://localhost:8000")
project.attach_runtime(runtime)

You can also import exported OpenHarness-style traces for lifecycle evaluation/versioning:

runtime = OpenHarnessAdapter.from_trace_file("./openharness-traces.jsonl")
project.harness("coding").add_tasks(runtime.to_taskset())
project.evaluate()

The adapter boundary is intentional: OpenHarness owns goal execution; AgentRL owns evaluation, evolution, version registry, deployment records, and rollback.

CLI

agentrl --version
agentrl init my-project
agentrl compile
agentrl train --strategy verification
agentrl evaluate
agentrl evolve --targets prompts,skills,memory
agentrl auto-harness --mode static
agentrl run-goal "Fix the failing login test."
agentrl deploy
agentrl version list
agentrl version diff <left-version-id> <right-version-id>
agentrl version rollback <version-id>
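Conceptually, a version diff compares two stored harness snapshots. The following sketch uses only the Python standard library to show the idea; it does not reproduce the output format of agentrl version diff, and the snapshot fields (`prompt`, `tools`) are made up for the example.

```python
import difflib
import json


def diff_versions(left: dict, right: dict) -> list[str]:
    """Unified diff of two snapshots after stable JSON serialization."""
    a = json.dumps(left, indent=2, sort_keys=True).splitlines()
    b = json.dumps(right, indent=2, sort_keys=True).splitlines()
    return list(
        difflib.unified_diff(a, b, fromfile="left", tofile="right", lineterm="")
    )


# Hypothetical harness snapshots.
v1 = {"prompt": "Answer briefly.", "tools": ["search"]}
v2 = {"prompt": "Answer briefly and cite sources.", "tools": ["search"]}

for line in diff_versions(v1, v2):
    print(line)
```

Sorting keys before serializing keeps the diff stable, so only real changes show up rather than differences in key order.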

Local-first artifacts

AgentRL stores project-local state under .agentrl/:

.agentrl/
├── compiled/          # compiled harness specs
├── registry/          # local version registry artifacts
├── traces/            # JSONL evaluation traces
├── candidates/        # promoted self-evolution candidates
├── rejected/          # rejected candidates
└── deployments/local/ # local deployment records
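Because this state is plain files, it can be inspected with ordinary tooling. The sketch below recreates the directory layout in a temporary folder and reads JSONL traces back. The directory names come from the tree above; the trace file name and the `task_id` and `passed` fields are assumptions, since the trace schema is not specified here.

```python
import json
import tempfile
from pathlib import Path

# Subdirectories taken from the layout above.
LAYOUT = ["compiled", "registry", "traces", "candidates", "rejected", "deployments/local"]


def make_layout(root: Path) -> Path:
    """Create an empty .agentrl/ tree under root and return its path."""
    base = root / ".agentrl"
    for sub in LAYOUT:
        (base / sub).mkdir(parents=True, exist_ok=True)
    return base


def load_traces(base: Path) -> list[dict]:
    """Read every non-empty JSONL line from .agentrl/traces/."""
    records = []
    for path in sorted((base / "traces").glob("*.jsonl")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                records.append(json.loads(line))
    return records


with tempfile.TemporaryDirectory() as tmp:
    base = make_layout(Path(tmp))
    trace_file = base / "traces" / "eval-0001.jsonl"  # hypothetical file name
    trace_file.write_text(
        json.dumps({"task_id": "t1", "passed": True}) + "\n"
        + json.dumps({"task_id": "t2", "passed": False}) + "\n",
        encoding="utf-8",
    )
    traces = load_traces(base)
    created = sorted(p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_dir())

print(len(traces), created)
```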

MVP features

Project abstraction
Harness compilation
TaskSet, RewardSpec, EvaluationResult schemas
Local version registry with list/diff/rollback
Built-in coding, RAG, and tool-use harnesses
Repo2RLEnvAdapter for Harbor-style coding tasks
OpenHarnessAdapter for external runtime traces without reimplementing runtime concerns
Evaluation engine with JSONL traces
Basic self-evolution and auto-harness candidate promotion/archive
Local deployment artifacts with evaluation preflight gating
Typed Python package and console script

Development

uv sync --extra dev
uv run pytest -q
uv run python -m build
uv run twine check dist/*

License

Apache-2.0. See LICENSE.
