Overview

Collaborative Instance-object Navigation (CoIN) requires an agent to navigate to a target object instance specified in natural language, collaborating with a human user along the way, in cluttered environments that contain many similar objects (distractors). One of the most important capabilities of such agents is the ability to ask the user clarifying questions when facing uncertain detections or ambiguous instructions. This challenge therefore tests the agents' ability to ask questions that disambiguate similar objects during navigation.

Challenge

In this challenge, participants will be asked to train multimodal agents that, given an object description in natural language and an image of an object, can either

1) determine whether the object shown in the image matches the given object description (when sufficient information is available), or

2) ask the user a clarifying question when the description and the image are ambiguous.

The agents' goal is to correctly identify the image that matches the object description while asking the user as few questions as possible.
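The choice between the two options can be pictured as a small action interface. The following is only an illustrative sketch, not the challenge codebase: the names `ActionType`, `AgentAction`, and `decide`, the threshold values, and the placeholder question are all hypothetical, and it assumes the agent can produce some match probability for a description-image pair.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionType(Enum):
    """The outcomes an agent can produce for one image (hypothetical names)."""
    MATCH = "match"        # the image shows the described object
    NO_MATCH = "no_match"  # the image shows a different object (a distractor)
    ASK = "ask"            # not enough information: ask the user


@dataclass
class AgentAction:
    kind: ActionType
    question: Optional[str] = None  # only set when kind is ASK


def decide(match_prob: float, low: float = 0.2, high: float = 0.8) -> AgentAction:
    """Assumed threshold rule: commit when confident, otherwise ask.

    `match_prob` is the agent's own estimate that the image matches the
    description; `low` and `high` are illustrative thresholds.
    """
    if match_prob >= high:
        return AgentAction(ActionType.MATCH)
    if match_prob <= low:
        return AgentAction(ActionType.NO_MATCH)
    # Ambiguous region: a real agent would generate a targeted question here.
    return AgentAction(ActionType.ASK, question="Can you describe the object in more detail?")
```

Under this assumption, widening the gap between `low` and `high` makes the agent ask more often, which trades extra questions for fewer wrong commitments.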

Evaluation

Each multimodal agent will be evaluated on N episodes covering a varied set of object descriptions and images. To ensure reproducibility, the user is replaced by a large Vision-Language Model ('Oracle'), which answers the agent's questions during evaluation. Each episode contains a variable number of images, presented to the agent sequentially. The agents will be evaluated on a combination of accuracy and number of questions asked, with a strong emphasis on accuracy.
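The exact way accuracy and question count are combined is not specified here. As a rough sketch of how the two quantities could be aggregated over episodes, the snippet below uses an assumed linear penalty; `EpisodeResult`, `summarize`, and the `question_penalty` weight are hypothetical and do not represent the official metric.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EpisodeResult:
    correct: bool       # did the agent identify the matching image?
    num_questions: int  # questions asked to the Oracle during the episode


def summarize(results: List[EpisodeResult], question_penalty: float = 0.01) -> Dict[str, float]:
    """Aggregate per-episode results into accuracy, mean questions, and a score.

    The score is an ASSUMED combination: accuracy minus a small penalty per
    question, so that accuracy dominates. The official formula may differ.
    """
    if not results:
        raise ValueError("at least one episode result is required")
    n = len(results)
    accuracy = sum(r.correct for r in results) / n
    mean_questions = sum(r.num_questions for r in results) / n
    score = accuracy - question_penalty * mean_questions
    return {"accuracy": accuracy, "mean_questions": mean_questions, "score": score}


# Example: four episodes, three solved correctly, four questions in total.
example = [
    EpisodeResult(True, 2),
    EpisodeResult(False, 0),
    EpisodeResult(True, 1),
    EpisodeResult(True, 1),
]
summary = summarize(example)
```

A small penalty weight keeps the emphasis on accuracy: under this assumed rule, an extra question only pays off when it raises the chance of a correct identification by more than `question_penalty`.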

Dates

Challenge start date: June 15th 2026

Challenge end date: August 15th 2026

Leaderboard

🥇 1st Place

TBD

🥈 2nd Place

TBD

🥉 3rd Place

TBD

Submission

Participants will have access to the codebase and to a training set of episodes with which to train and tune their agents. They will be asked to submit their trained agents (weights uploaded to Hugging Face) before the deadline, alongside a technical report and the original code. These agents will be evaluated on a held-out test set of episodes that will be released after the deadline.

Organizers

Edoardo Zorzi

Sapienza University of Rome, Italy

Ph.D. Student.

Yiming Wang

Fondazione Bruno Kessler, Trento, Italy

Senior Researcher.

Citation

If you use the provided materials, please cite the relevant paper below.

CoIN

@InProceedings{taioli2025coin,
          author    = {Taioli, Francesco and Zorzi, Edoardo and Franchi, Gianni and Castellini, Alberto and Farinelli, Alessandro and Cristani, Marco and Wang, Yiming},
          title     = {Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues},
          booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
          month     = {October},
          year      = {2025},
          pages     = {18781-18792}
      }

Question asking for CoIN

@InProceedings{zorzi2026coinqa,
          author    = {Zorzi, Edoardo and Taioli, Francesco and Wang, Yiming and Cristani, Marco and Farinelli, Alessandro and Castellini, Alberto and Bazzani, Loris},
          title     = {Benchmarking Interaction, Beyond Policy: a Reproducible Benchmark for Collaborative Instance Object Navigation},
          booktitle = {https://arxiv.org/pdf/2604.00265},
          month     = {March},
          year      = {2026},
      }

CoIN Challenge 2026