CoReS: Orchestrating the Dance of Reasoning and Segmentation

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15076)

Abstract

The reasoning segmentation task, which demands a nuanced comprehension of intricate queries to accurately pinpoint object regions, is attracting increasing attention. However, Multi-modal Large Language Models (MLLMs) often find it difficult to accurately localize the objects described in complex reasoning contexts. We believe that reasoning segmentation should mirror the cognitive stages of human visual search, where each step progressively refines the thought toward the final object. We therefore introduce Chains of Reasoning and Segmenting (CoReS) and find that this top-down visual hierarchy indeed enhances the visual search process. Specifically, we propose a dual-chain structure that generates multi-modal, chain-like outputs to aid the segmentation process. Furthermore, to steer the MLLM’s outputs into this intended hierarchy, we incorporate in-context inputs as guidance. Extensive experiments demonstrate the superior performance of CoReS, which surpasses the state-of-the-art method by 6.5% on the ReasonSeg dataset.
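To make the dual-chain idea above concrete, here is a minimal, hypothetical Python sketch: a textual chain of reasoning interleaved with a chain of progressively refined masks, with an in-context example steering the output format. All names (MockMLLM, MockSegmenter, IN_CONTEXT_EXAMPLE, dual_chain_inference) and the hard-coded two-step refinement are illustrative assumptions, not the authors' implementation or API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChainStep:
    thought: str       # one textual reasoning step from the MLLM
    region_hint: str   # the region phrase that step localizes


# In-context guidance: an example showing the model the intended
# step -> region output hierarchy (hypothetical format).
IN_CONTEXT_EXAMPLE = (
    "Q: Where would you plug in a laptop?\n"
    "Step 1: a laptop needs power -> region: wall near the desk\n"
    "Step 2: power comes from an outlet -> region: wall outlet\n"
)


class MockMLLM:
    """Stand-in for a multi-modal LLM that emits chain-like output."""

    def reason(self, query: str, context: str) -> List[ChainStep]:
        # A real MLLM would condition on the image, the query, and the
        # in-context guidance; here we hard-code a two-step refinement.
        return [
            ChainStep("Locate the broad scene region first", "kitchen counter"),
            ChainStep("Narrow down to the queried object", "coffee machine"),
        ]


class MockSegmenter:
    """Stand-in for a promptable mask decoder."""

    def segment(self, region_hint: str, prior_mask: Optional[str] = None) -> str:
        # A real decoder would return a pixel mask refined from the prior;
        # we return a string so the sketch stays self-contained.
        return f"mask({region_hint!r}, prior={prior_mask})"


def dual_chain_inference(query: str) -> Optional[str]:
    """Interleave the two chains: each reasoning step drives one
    progressively finer segmentation step."""
    mllm, segmenter = MockMLLM(), MockSegmenter()
    mask: Optional[str] = None
    for step in mllm.reason(query, IN_CONTEXT_EXAMPLE):
        mask = segmenter.segment(step.region_hint, prior_mask=mask)
        print(f"{step.thought}: {mask}")
    return mask


if __name__ == "__main__":
    dual_chain_inference("Segment the appliance used to brew coffee.")
```

The point of the structure, as the abstract describes it, is that each reasoning step supplies the prompt for the next segmentation step, so the mask is refined top-down rather than predicted in a single shot.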



Author information

Corresponding author

Correspondence to Xingang Wang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 14,336 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Bao, X. et al. (2025). CoReS: Orchestrating the Dance of Reasoning and Segmentation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15076. Springer, Cham. https://doi.org/10.1007/978-3-031-72649-1_11

  • DOI: https://doi.org/10.1007/978-3-031-72649-1_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72648-4

  • Online ISBN: 978-3-031-72649-1

  • eBook Packages: Computer Science, Computer Science (R0)
