VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

arXiv Paper: Dataset:

📢 News

[2026/03/16] We are thrilled to introduce VTC-Bench, a comprehensive benchmark designed to rigorously evaluate the advanced tool-use proficiency and multi-tool composition capabilities of Multimodal Large Language Models (MLLMs). 🎉

📌 Introduction

Recent advancements have extended Multimodal Large Language Models (MLLMs) beyond standard visual question answering to utilizing external tools for advanced visual tasks, effectively transforming them into active, agentic problem solvers.
Despite this progress, accurately executing and effectively composing diverse tools for complex visual tasks remains a persistent bottleneck. Existing benchmarks are often constrained by sparse tool-sets and simple tool-use trajectories, failing to capture the complex tool interactions required in practical, real-world conditions.
To bridge this critical gap, we introduce Visual Tool Chain-Bench (VTC-Bench). To emulate authentic computer vision pipelines, our framework integrates 35 diverse OpenCV-based visual operations.
VTC-Bench features 680 curated problems structured across a progressive nine-category cognitive hierarchy. A key feature of our benchmark is that every problem is paired with a ground-truth execution trajectory. These reference toolchains are primarily designed to facilitate fine-grained diagnostic analysis of the models' intermediate planning and tool-calling behaviors, providing deeper insights beyond just the final accuracy.
Extensive experiments on 19 leading MLLMs reveal that even the top-performing model (Gemini-3.0-Pro) achieves only 51.2% on our benchmark. Multi-tool composition remains a persistent challenge, and models often fall back on suboptimal heuristics instead of selecting the most suitable tools.
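
To make the role of the ground-truth execution trajectories more concrete, here is a minimal sketch of how a predicted toolchain could be compared with a reference one. It is not the benchmark's official metric: the order-aware overlap based on the longest common subsequence, the function names `lcs_length` and `chain_overlap`, and the tool names are all assumptions made for illustration.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two tool-name lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tool_a in enumerate(a, start=1):
        for j, tool_b in enumerate(b, start=1):
            if tool_a == tool_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def chain_overlap(predicted: List[str], reference: List[str]) -> dict:
    """Order-aware precision and recall of a predicted chain against a reference."""
    matched = lcs_length(predicted, reference)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    return {"matched": matched, "precision": precision, "recall": recall}


# Hypothetical tool names, for illustration only.
reference = ["rotate", "crop", "convert_color", "binarize"]
predicted = ["crop", "convert_color", "sharpen", "binarize"]
result = chain_overlap(predicted, reference)
print(result)
```

A subsequence-based score tolerates extra or missing steps while still penalizing tools called in the wrong order, which is one plausible way to look at intermediate planning separately from final-answer accuracy.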

🔍 Benchmark Overview

VTC-Bench is organized into a three-tier cognitive hierarchy, with three task categories per tier, that traces the evolution of multimodal agents from passive visual sensing to active constructive reasoning:

Tier 1: Visual Perception Enhancement: Foundational tasks including Robust OCR, Perceptual Restoration, and Attention Focusing. These require models to mitigate environmental interference and rectify geometric distortions.
Tier 2: Quantitative Visual Estimation: Tasks including Measurement, Color, and Counting. These evaluate the model's capacity to perceive and precisely quantify physical attributes.
Tier 3: Compositional Visual Reasoning: Advanced tasks including Chart, Math, and Spatial Reasoning. These demand complex logical deduction through multi-step tool orchestration and auxiliary construction.
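
The tier structure above can be written down as a small lookup table, which makes it easy to roll per-problem results up to tier level. The sketch below is an assumption about how one might aggregate scores, not the repository's evaluation code; `accuracy_by_tier` is a hypothetical helper and the records are toy values invented for illustration, not benchmark results.

```python
from collections import defaultdict

# Tier -> task categories, as listed in the overview above.
TIERS = {
    "Visual Perception Enhancement": ["Robust OCR", "Perceptual Restoration", "Attention Focusing"],
    "Quantitative Visual Estimation": ["Measurement", "Color", "Counting"],
    "Compositional Visual Reasoning": ["Chart", "Math", "Spatial Reasoning"],
}
TASK_TO_TIER = {task: tier for tier, tasks in TIERS.items() for task in tasks}


def accuracy_by_tier(records):
    """records: iterable of (task_name, is_correct) pairs; returns tier -> accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, is_correct in records:
        tier = TASK_TO_TIER[task]
        total[tier] += 1
        correct[tier] += int(is_correct)
    return {tier: correct[tier] / total[tier] for tier in total}


# Toy records, invented for illustration; these are not benchmark results.
records = [
    ("Robust OCR", True),
    ("Attention Focusing", False),
    ("Counting", True),
    ("Color", True),
    ("Math", False),
    ("Chart", False),
    ("Spatial Reasoning", True),
    ("Spatial Reasoning", False),
]
tier_accuracy = accuracy_by_tier(records)
print(tier_accuracy)
```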

✨ Evaluation Pipeline

VTC-Bench supports evaluating models across two distinct tool-use interaction paradigms:

📍 Track A: Code Interpreter (Code-Driven)

In this track, the agent utilizes a code interpreter to synthesize Python code for visual manipulation.
Models must write raw OpenCV (cv2) code, restricted to a provided list of allowed capabilities and their parameter logic.

📍 Track B: Atomic OpenCV Toolbox (Interface-Driven)

In this track, the agent interacts iteratively with predefined interfaces from a suite of 32 distinct tools (categorized into Geometry, Enhancement, Feature Extraction, and Drawing).
We utilize frameworks like Qwen-Agent (for models with native tool-calling) or Thyme (for generating code/interfaces for open-source models) to manage the reasoning and execution layer.
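
The interface-driven loop can be pictured as a registry of named tools plus a dispatcher that applies each call to the output of the previous one. The sketch below is a toy model of that idea in plain Python and does not use or reproduce the Qwen-Agent or Thyme APIs. The registry `TOOLS`, the `tool` decorator, `run_chain`, the two tools, and the call format are all hypothetical, and images are represented as nested lists so the example stays dependency-free.

```python
from typing import Callable, Dict, List

Image = List[List[int]]  # toy grayscale image: rows of 0-255 intensities

TOOLS: Dict[str, Callable[..., Image]] = {}


def tool(name: str):
    """Register a function under a tool name (hypothetical registry)."""
    def decorator(fn):
        TOOLS[name] = fn
        return fn
    return decorator


@tool("crop")
def crop(image: Image, top: int, left: int, height: int, width: int) -> Image:
    return [row[left:left + width] for row in image[top:top + height]]


@tool("binarize")
def binarize(image: Image, threshold: int) -> Image:
    return [[255 if px >= threshold else 0 for px in row] for row in image]


def run_chain(image: Image, calls: List[dict]) -> Image:
    """Apply tool calls in order, feeding each output into the next call."""
    for call in calls:
        fn = TOOLS[call["name"]]  # a KeyError here signals an unknown tool
        image = fn(image, **call["arguments"])
    return image


image = [
    [10, 20, 200, 210],
    [15, 25, 220, 230],
    [12, 22, 90, 240],
]
calls = [
    {"name": "crop", "arguments": {"top": 0, "left": 2, "height": 3, "width": 2}},
    {"name": "binarize", "arguments": {"threshold": 128}},
]
output = run_chain(image, calls)
print(output)
```

Keeping each tool atomic and passing arguments as plain dictionaries mirrors the general shape of tool calling, where the model emits a tool name with arguments and the runtime executes it.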

🚀 Quick Start / Evaluation Usage

Follow these steps to quickly set up the environment and run evaluations on VTC-Bench:

Install the qwen-agent environment:
```
pip install -U qwen-agent
```
Modify the configuration file: Update the evaluation settings (e.g., model API keys, paths) in the YAML configuration file according to your setup.
```
./eval_config/gpt_4o_interface.yaml
```
Run the evaluation script: Execute the evaluation pipeline using the configured YAML file.
```
python VTC_Bench_Eval.py -c ./eval_config/gpt_4o_interface.yaml
```

💡 Representative Examples of Each Task

VTC-Bench evaluates models across nine tasks, each requiring complex tool chaining:

Attention Focusing: Re-orienting focus via spatial normalization (e.g., Rotate, Crop, Convert Color, Binarize).
Chart: Simultaneous restoration, perception, and inference of chart data.
Color: Quantifying color proportions using chromatic space manipulations.
Counting: Overcoming visual occlusion using morphological utilities for "segment-and-count" pipelines.
Math: STEM-oriented geometric reasoning requiring auxiliary lines.
Measurement: Estimating physical dimensions with sub-pixel precision.
Perceptual Restoration: Neutralizing haze and noise to recover semantic info.
Robust OCR: Strategic planning to binarize and sharpen before text recognition under compound degradation.
Spatial Reasoning: Transforming visual cues into precise spatial coordinates.
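
To illustrate one of the tasks above, the Color task's idea of quantifying color proportions in a chromatic space can be sketched with the standard-library `colorsys` module. This is a simplified stand-in for an OpenCV HSV pipeline, not the benchmark's reference toolchain. The function `hue_fraction`, the hue range, the saturation and value cutoffs, and the toy pixels are assumptions chosen for illustration.

```python
import colorsys
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0-255


def hue_fraction(pixels: List[Pixel], hue_min: float, hue_max: float,
                 min_saturation: float = 0.3, min_value: float = 0.3) -> float:
    """Fraction of pixels whose hue, in degrees, lies in [hue_min, hue_max]."""
    if not pixels:
        return 0.0
    hits = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        hue_deg = h * 360
        # Ignore washed-out or dark pixels, whose hue is unreliable.
        if s >= min_saturation and v >= min_value and hue_min <= hue_deg <= hue_max:
            hits += 1
    return hits / len(pixels)


# Toy 2x4 image flattened to a pixel list.
pixels = [
    (0, 200, 0), (10, 180, 20), (255, 0, 0), (0, 0, 255),
    (30, 220, 40), (255, 255, 255), (0, 0, 0), (250, 10, 10),
]
green_share = hue_fraction(pixels, hue_min=90, hue_max=150)
print(green_share)
```

When porting this idea to OpenCV, note that 8-bit HSV images store hue in the range 0 to 179 instead of 0 to 360, so the bounds would need to be halved.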
