Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning
Official implementation of "Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning".
- [24/02] We released the code on GitHub.
- [23/01] Our paper was accepted to ICLR 2025.
Visual-O1 introduces multi-modal multi-turn chain-of-thought reasoning to understand ambiguous instructions, with (a) instantial experience, which enables highly intelligent models to generate the correct answer directly, and (b) empirical experience, which enables generally intelligent models to transform ambiguous instructions into clear ones and then generate the correct answer. Instantial experience develops during inference, while empirical experience is built in a one-time optimization stage.
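As a rough illustration of how the two modes differ, the sketch below shows the control flow in Python; every name in it (`ask_model`, `answer_instantial`, `answer_empirical`, `rewrite`, `final`) is a hypothetical placeholder rather than this repository's actual API.

```python
# Minimal sketch of the two modes described above. All names here are
# hypothetical placeholders, not the repository's actual API.

def answer_instantial(image, instruction, ask_model, max_turns=3):
    """High-intelligence models: accumulate instantial experience through
    multi-turn reasoning on this image-instruction pair, then answer."""
    experience = []
    for _ in range(max_turns):
        # Each turn appends one reasoning step about the ambiguity.
        thought = ask_model(image, instruction, context=experience)
        experience.append(thought)
    return ask_model(image, instruction, context=experience, final=True)

def answer_empirical(image, instruction, ask_model, empirical_experience):
    """General-intelligence models: reuse empirical experience from the
    one-time optimization stage to rewrite the ambiguous instruction into
    a clear one, then answer the clear instruction."""
    clear_instruction = ask_model(image, instruction,
                                  context=empirical_experience, rewrite=True)
    return ask_model(image, clear_instruction, final=True)
```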
To run the code, please configure the environment using the following commands:
```bash
conda env create -f environment.yaml
conda activate visualo1
cd ops && sh make.sh && cd ..
conda install -c conda-forge mpi4py -y
```
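As a quick sanity check that the environment built correctly (this snippet is a suggestion, not part of the repository), confirm that the freshly installed mpi4py imports inside the activated environment:

```python
# Run inside the activated visualo1 environment; a clean import with a
# printed version number indicates the conda setup succeeded.
import mpi4py
print("mpi4py", mpi4py.__version__)
```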
Then, please download the checkpoints for segmentation:
```bash
sh ./download_ckpt.sh
```
Finally, please configure your API key and endpoint. We recommend the Azure OpenAI Service, as it allows content filters to be disabled manually.
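Where exactly main.py reads the credentials depends on the repository's configuration; as one common pattern (an assumption, not necessarily how this codebase wires it up), the official openai SDK accepts Azure credentials like this:

```python
# One common way to supply Azure OpenAI credentials (an assumption --
# check the repository's own config for where main.py actually reads them).
import os
from openai import AzureOpenAI

os.environ.setdefault("AZURE_OPENAI_API_KEY", "<your-api-key>")
os.environ.setdefault("AZURE_OPENAI_ENDPOINT", "https://<your-resource>.openai.azure.com/")

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01",  # choose a version your deployment supports
)
```

With the key and endpoint in place, run Visual-O1: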
```bash
python main.py --data_name ambiguous_ris.jsonl --mode instantial
```
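Here `--data_name` selects the evaluation file and `--mode` selects the experience type; given the two modes described above, an empirical setting is presumably selected analogously, though only the instantial invocation is shown here.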
Please download our dataset via this link.
If you find this work useful for your research and applications, please cite using this BibTeX:
```bibtex
@inproceedings{ni2025visual,
  title={Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning},
  author={Ni, Minheng and Fan, Yutao and Zhang, Lei and Zuo, Wangmeng},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}
```