Abstract
The reasoning segmentation task, which demands a nuanced comprehension of intricate queries to accurately pinpoint object regions, is attracting increasing attention. However, Multi-modal Large Language Models (MLLMs) often struggle to accurately localize the objects described in complex reasoning contexts. We believe that reasoning segmentation should mirror the cognitive stages of human visual search, where each step progressively refines the chain of thought toward the final object. We therefore introduce Chains of Reasoning and Segmenting (CoReS) and find that this top-down visual hierarchy indeed enhances the visual search process. Specifically, we propose a dual-chain structure that generates multi-modal, chain-like outputs to aid the segmentation process. Furthermore, to steer the MLLM's outputs into the intended hierarchy, we incorporate in-context inputs as guidance. Extensive experiments demonstrate the superior performance of CoReS, which surpasses the state-of-the-art method by 6.5% on the ReasonSeg dataset.
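To make the dual-chain structure concrete, the sketch below gives one illustrative reading of the abstract, not the authors' implementation: it assumes a multi-modal LLM and a SAM-style mask decoder as injected placeholders, and the names `mllm`, `mask_decoder`, `seg_embedding`, and `is_final` are all hypothetical. It shows the alternation the abstract describes: a chain of textual reasoning steps interleaved with a chain of progressively refined segmentation masks, with in-context examples steering the output into the intended hierarchy.

```python
# A minimal sketch of the dual-chain idea from the abstract, written as
# pseudocode-style Python. Every name here (mllm, mask_decoder,
# seg_embedding, is_final, ...) is a hypothetical stand-in; the paper's
# actual interfaces and training details are not reproduced.

from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class ChainState:
    """The two interleaved chains: textual reasoning and segmentation masks."""
    reasoning_steps: List[str] = field(default_factory=list)
    masks: List[Any] = field(default_factory=list)


def cores_dual_chain(
    mllm: Any,                # multi-modal LLM (placeholder object)
    mask_decoder: Callable,   # SAM-style mask decoder (placeholder)
    image: Any,
    query: str,
    in_context_examples: List[str],
    max_steps: int = 4,
):
    """Alternate reasoning and segmenting, refining toward the final object."""
    state = ChainState()
    # In-context examples steer the MLLM's output into the intended
    # top-down, chain-like hierarchy, as the abstract describes.
    prompt = list(in_context_examples) + [query]
    for _ in range(max_steps):
        # Chain of reasoning: one more step of textual refinement,
        # conditioned on the image and all previous steps.
        step = mllm.generate(image, prompt + state.reasoning_steps)
        state.reasoning_steps.append(step.text)
        # Chain of segmenting: the step's segmentation embedding prompts
        # the decoder, narrowing the mask toward the target region.
        state.masks.append(mask_decoder(image, step.seg_embedding))
        if step.is_final:     # hypothetical stop signal from the MLLM
            break
    # The last mask is the prediction; the state keeps both full chains.
    return state.masks[-1], state
```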
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Bao, X. et al. (2025). CoReS: Orchestrating the Dance of Reasoning and Segmentation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15076. Springer, Cham. https://doi.org/10.1007/978-3-031-72649-1_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72648-4
Online ISBN: 978-3-031-72649-1