Abstract
Although great progress has been made in 3D visual grounding, current models still rely on explicit textual descriptions for grounding and lack the ability to reason about human intentions from implicit instructions. We propose a new task, 3D reasoning grounding, and introduce a new benchmark, ScanReason, which provides over 10K question-answer-location pairs spanning five reasoning types that require the synergy of reasoning and grounding. We further design our approach, ReGround3D, composed of a visual-centric reasoning module empowered by a Multi-modal Large Language Model (MLLM) and a 3D grounding module that obtains accurate object locations by looking back at the enhanced geometry and fine-grained details of the 3D scene. A chain-of-grounding mechanism is proposed to further boost performance with interleaved reasoning and grounding steps during inference. Extensive experiments on the proposed benchmark validate the effectiveness of our approach.
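To make the interleaved reasoning-and-grounding idea concrete, below is a minimal, self-contained Python sketch of a chain-of-grounding inference loop, in which a reasoning module repeatedly proposes object queries and a grounding module returns 3D boxes until reasoning concludes. All names (Scene, ReasoningMLLM, GroundingModule, chain_of_grounding) are hypothetical placeholders for illustration and do not reflect the authors' actual implementation or API.

```python
# Hedged sketch of a chain-of-grounding inference loop.
# Every class and function below is a hypothetical stand-in, not ReGround3D itself.
from dataclasses import dataclass
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float, float, float]  # (x, y, z, dx, dy, dz)


@dataclass
class Scene:
    """Placeholder for a 3D scene (e.g., point-cloud features)."""
    name: str


@dataclass
class ReasoningStep:
    thought: str                 # free-form reasoning text produced by the MLLM
    query: Optional[str] = None  # object query to ground in this step, if any
    done: bool = False           # whether reasoning has concluded


class ReasoningMLLM:
    """Hypothetical visual-centric reasoning module (MLLM-based)."""

    def step(self, scene: Scene, question: str,
             history: List[Tuple[ReasoningStep, Optional[BBox]]]) -> ReasoningStep:
        # A real system would call the multi-modal LLM here; this stub
        # simply stops after one grounding round for illustration.
        if not history:
            return ReasoningStep(thought="Identify the object implied by the question.",
                                 query="object implied by the question")
        return ReasoningStep(thought="Enough evidence gathered.", done=True)


class GroundingModule:
    """Hypothetical 3D grounding module returning a box for a text query."""

    def ground(self, scene: Scene, query: str) -> BBox:
        return (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)  # dummy box


def chain_of_grounding(scene: Scene, question: str,
                       reasoner: ReasoningMLLM, grounder: GroundingModule,
                       max_rounds: int = 4) -> List[BBox]:
    """Interleave reasoning and grounding until the reasoner signals completion."""
    history: List[Tuple[ReasoningStep, Optional[BBox]]] = []
    boxes: List[BBox] = []
    for _ in range(max_rounds):
        step = reasoner.step(scene, question, history)
        box = grounder.ground(scene, step.query) if step.query else None
        if box is not None:
            boxes.append(box)
        history.append((step, box))
        if step.done:
            break
    return boxes


if __name__ == "__main__":
    print(chain_of_grounding(Scene("example_scene"),
                             "I'm thirsty; where can I find something to drink?",
                             ReasoningMLLM(), GroundingModule()))
```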
Acknowledgements
This work is supported in part by the HKU Startup Fund, HKU Seed Fund for Basic Research, HKU Seed Fund for Translational and Applied Research, HKU IDS Research Seed Fund, and HKU Fintech Academy R&D Funding.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhu, C., Wang, T., Zhang, W., Chen, K., Liu, X. (2025). ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15066. Springer, Cham. https://doi.org/10.1007/978-3-031-73242-3_9
DOI: https://doi.org/10.1007/978-3-031-73242-3_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73241-6
Online ISBN: 978-3-031-73242-3
eBook Packages: Computer Science, Computer Science (R0)