AgentUnit is designed as a research-grade framework. If you use it in your research, please cite it with the following BibTeX entry:

    @software{agentunit2024,
      author = {Aviral Garg},
      title = {AgentUnit: A Framework for Multi-Agent System Evaluation and Benchmarking},
      year = {2024},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/aviralgarg05/agentunit}},
      version = {0.1.0}
    }

AgentUnit ensures research reproducibility through:
- Deterministic Evaluation: Configurable seeds for random number generators and LLM temperature control.
- Versioned Benchmarks: Fixed versions of GAIA and AgentArena datasets.
- Traceability: Comprehensive logging of all agent interactions, tool calls, and metric calculations.
- Experiment Tracking: Built-in tracking of configuration, code versions (git commit), and results (metrics/tracker.py).
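
The experiment-tracking idea in the last bullet can be sketched with the standard library alone. This is not AgentUnit's tracker API (the interface of metrics/tracker.py is not shown here); `current_git_commit`, `build_run_record`, and the record fields and config keys are hypothetical names chosen for illustration.

```python
import json
import subprocess
from datetime import datetime, timezone


def current_git_commit(repo_dir="."):
    """Return the HEAD commit hash of repo_dir, or None if it cannot be read."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is missing, repo_dir does not exist, or it is not a repository.
        return None
    return result.stdout.strip() or None


def build_run_record(config, metrics, seed, repo_dir="."):
    """Bundle what is needed to rerun an experiment and compare its results."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "git_commit": current_git_commit(repo_dir),
        "seed": seed,
        "config": config,
        "metrics": metrics,
    }


if __name__ == "__main__":
    record = build_run_record(
        config={"temperature": 0.0, "benchmark": "GAIA"},  # hypothetical keys
        metrics={},  # fill in after the run finishes
        seed=42,
    )
    print(json.dumps(record, indent=2, sort_keys=True))
```

Storing the commit hash and seed next to the metrics means a result file can be traced back to the code and configuration that produced it.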
To reproduce published results:

- Environment Setup:

      git clone https://github.com/aviralgarg05/agentunit.git
      cd agentunit
      python -m venv .venv
      source .venv/bin/activate
      pip install -e ".[dev]"

- Configuration: Use the same environment variables (provider API keys) and `ExperimentConfig` parameters as the run you are reproducing. Start runs with a fixed seed:

      import random

      import numpy as np

      SEED = 42
      random.seed(SEED)
      np.random.seed(SEED)
      # Configure LLM temperature to 0.0 for deterministic outputs where possible

- Running Benchmarks: Use the provided scripts in `examples/` or `experiments/`, which log all parameters.
- Verifying Results: Compare your `experiments/experiment_*.json` output with published results. Use the `src/agentunit/stats` module for statistical significance testing between runs.
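
As a rough illustration of what a significance check between two runs can look like, here is a paired sign-flip permutation test written with the standard library only. It does not use or mirror the `src/agentunit/stats` API, which is not documented in this section; `paired_permutation_test` and its parameters are hypothetical. It assumes you have already extracted per-task scores from both runs, in the same task order. The schema of the experiment JSON files is not assumed here.

```python
import random
from statistics import mean


def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=42):
    """Two-sided p-value for the mean paired difference between two runs.

    scores_a[i] and scores_b[i] must be scores for the same task.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both runs must cover the same tasks")
    if not scores_a:
        raise ValueError("at least one paired score is required")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))

    rng = random.Random(seed)  # local generator keeps the test reproducible
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1

    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Made-up per-task scores for illustration only, not real results.
    yours = [1, 0, 1, 1, 0, 1, 1, 0]
    published = [1, 0, 1, 0, 0, 1, 1, 1]
    print(paired_permutation_test(yours, published))
```

A large p-value means the difference between the two runs is compatible with task-level noise. A small one suggests the runs differ, so it is worth checking seeds, configuration, and code version.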