MERGE: Guided Vision-Language Models for Multi-Actor Event Reasoning and Grounding in Human-Robot Interaction

This repository contains the implementation of MERGE, accepted at the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026).

MERGE is a system for multi-actor event reasoning and grounding in human-robot interaction scenarios. It combines a lightweight perception pipeline with Vision-Language Models (VLMs) to identify actors and objects, track them over time, and structure interactions as actor-action-object relations. The system is designed to support temporally consistent situational grounding in dynamic group interactions involving humans and robots.
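As a rough illustration of what an actor-action-object relation could look like as a data structure, the sketch below defines a hypothetical `Interaction` record with a time interval. The class name, field names, and example values are assumptions made for illustration; they are not taken from the MERGE code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """Hypothetical actor-action-object relation over a time interval."""

    actor: str
    action: str
    obj: str
    start_s: float  # interval start in seconds (assumed unit)
    end_s: float    # interval end in seconds (assumed unit)

    def overlaps(self, other: "Interaction") -> bool:
        """Return True if the two time intervals share any time."""
        return self.start_s < other.end_s and other.start_s < self.end_s


# Illustrative values only, not from the GROUND dataset.
pour = Interaction("person_1", "pours", "bottle", 2.0, 5.5)
hand_over = Interaction("robot", "hands over", "cup", 5.0, 7.0)
print(pour.overlaps(hand_over))
```

Storing the interval with each relation is one simple way to reason about concurrent events in a group scene; the actual representation used by MERGE may differ.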

For details, please refer to the paper:

MERGE: Guided Vision-Language Models for Multi-Actor Event Reasoning and Grounding in Human-Robot Interaction
Joerg Deigmoeller, Nakul Agarwal, Stephan Hasler, Daniel Tanneberg, Anna Belardinelli, Reza Ghoddoosian, Chao Wang, Felix Ocker, Fan Zhang, Behzad Dariush, Michael Gienger
Accepted at ICRA 2026
arXiv:2603.18988

Installation

We recommend using uv for installation and dependency management.

After installing uv, run the following commands from the project root:

uv venv
uv sync

Dataset

Before running the experiments, download the GROUND dataset from:

https://usa.honda-ri.com/ground

Unpack the dataset and run the following script, pointing it to the extracted GROUND-eval folder:

bash scripts/copy_ground_eval.sh ~/ground/GROUND-eval

The script copies the required evaluation data into the expected data/scene_* directories.

API Keys

To use GPT-based models, set your OpenAI API key:

export OPENAI_API_KEY="SECRET_KEY"

To use Gemini-based models, set your Gemini API key:

export GEMINI_API_KEY="SECRET_KEY"
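Before starting a long run, it can help to confirm that the key for the chosen model family is actually visible to Python. The following is a minimal sketch; the helper `missing_api_keys` is hypothetical and not part of this repository. Only the key for the model family you intend to use needs to be set.

```python
import os


def missing_api_keys(names=("OPENAI_API_KEY", "GEMINI_API_KEY"), env=None):
    """Return the variable names from `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Report which of the two keys named in this README are not set.
for name in missing_api_keys():
    print(f"{name} is not set")
```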

Evaluation

The main evaluation script can be used to reproduce the quantitative results reported in the paper:

uv run scripts/evaluate.py

The script evaluates the method outputs stored in the dataset directories:

data/scene_*/runs

These folders contain the outputs produced by the different methods and baselines. The evaluation script compares these outputs against the corresponding annotations and reports the metrics used in the paper.
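To check that the dataset was copied into the expected layout before evaluating, the run folders can be listed with a short script. This is a sketch under the assumption that it is executed from the project root; the helper `find_run_dirs` is hypothetical and not part of the repository, and it relies only on the `data/scene_*/runs` pattern given above.

```python
from pathlib import Path


def find_run_dirs(project_root="."):
    """Return the sorted data/scene_*/runs directories under project_root."""
    root = Path(project_root)
    return sorted(p for p in root.glob("data/scene_*/runs") if p.is_dir())


# Print every run folder found; no output means the layout is missing.
for run_dir in find_run_dirs():
    print(run_dir)
```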

Different experiment configurations can be selected by editing the corresponding configuration section in:

scripts/evaluate.py

By default, the script evaluates MERGE with GPT-4o.

Experiments

To run the full MERGE pipeline, use:

uv run scripts/merge_full.py

To run the baseline experiments, use:

uv run scripts/baselines.py

Citation

If you use this repository, the GROUND dataset, or the MERGE system in your research, please cite our paper:

@inproceedings{deigmoeller2026merge,
  title={MERGE: Guided Vision-Language Models for Multi-Actor Event Reasoning and Grounding in Human-Robot Interaction},
  author={Deigmoeller, Joerg and Agarwal, Nakul and Hasler, Stephan and Tanneberg, Daniel and Belardinelli, Anna and Ghoddoosian, Reza and Wang, Chao and Ocker, Felix and Zhang, Fan and Dariush, Behzad and Gienger, Michael},
  booktitle={Proceedings of the 2026 IEEE International Conference on Robotics and Automation (ICRA)},
  year={2026}
}

License

This project is licensed under the BSD 3-Clause License.

Copyright (c) 2025, Honda Research Institute Europe GmbH. All rights reserved.

See the LICENSE file for the full license text.

SPDX-License-Identifier: BSD-3-Clause
