Abstract
Although great progress has been made in 3D visual grounding, current models still rely on explicit textual descriptions for grounding and lack the ability to reason about human intentions from implicit instructions. We propose a new task, 3D reasoning grounding, and introduce a new benchmark, ScanReason, which provides over 10K question-answer-location pairs spanning five reasoning types that require the synergy of reasoning and grounding. We further design our approach, ReGround3D, composed of a visual-centric reasoning module empowered by a Multi-modal Large Language Model (MLLM) and a 3D grounding module that obtains accurate object locations by looking back at the enhanced geometry and fine-grained details of the 3D scene. A chain-of-grounding mechanism is proposed to further boost performance with interleaved reasoning and grounding steps during inference. Extensive experiments on the proposed benchmark validate the effectiveness of our approach.
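To make the interleaved reasoning-and-grounding idea concrete, below is a minimal, self-contained Python sketch of a chain-of-grounding inference loop, in which a reasoning module repeatedly proposes object queries and a grounding module returns 3D boxes until reasoning concludes. All names (Scene, ReasoningMLLM, GroundingModule, chain_of_grounding) are hypothetical placeholders for illustration and do not reflect the authors' actual implementation or API.

```python
# Hedged sketch of a chain-of-grounding inference loop.
# Every class and function below is a hypothetical stand-in, not ReGround3D itself.
from dataclasses import dataclass
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float, float, float]  # (x, y, z, dx, dy, dz)


@dataclass
class Scene:
    """Placeholder for a 3D scene (e.g., point-cloud features)."""
    name: str


@dataclass
class ReasoningStep:
    thought: str                 # free-form reasoning text produced by the MLLM
    query: Optional[str] = None  # object query to ground in this step, if any
    done: bool = False           # whether reasoning has concluded


class ReasoningMLLM:
    """Hypothetical visual-centric reasoning module (MLLM-based)."""

    def step(self, scene: Scene, question: str,
             history: List[Tuple[ReasoningStep, Optional[BBox]]]) -> ReasoningStep:
        # A real system would call the multi-modal LLM here; this stub
        # simply stops after one grounding round for illustration.
        if not history:
            return ReasoningStep(thought="Identify the object implied by the question.",
                                 query="object implied by the question")
        return ReasoningStep(thought="Enough evidence gathered.", done=True)


class GroundingModule:
    """Hypothetical 3D grounding module returning a box for a text query."""

    def ground(self, scene: Scene, query: str) -> BBox:
        return (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)  # dummy box


def chain_of_grounding(scene: Scene, question: str,
                       reasoner: ReasoningMLLM, grounder: GroundingModule,
                       max_rounds: int = 4) -> List[BBox]:
    """Interleave reasoning and grounding until the reasoner signals completion."""
    history: List[Tuple[ReasoningStep, Optional[BBox]]] = []
    boxes: List[BBox] = []
    for _ in range(max_rounds):
        step = reasoner.step(scene, question, history)
        box = grounder.ground(scene, step.query) if step.query else None
        if box is not None:
            boxes.append(box)
        history.append((step, box))
        if step.done:
            break
    return boxes


if __name__ == "__main__":
    print(chain_of_grounding(Scene("example_scene"),
                             "I'm thirsty; where can I find something to drink?",
                             ReasoningMLLM(), GroundingModule()))
```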
Acknowledgements
This work is supported in part by the HKU Startup Fund, HKU Seed Fund for Basic Research, HKU Seed Fund for Translational and Applied Research, HKU IDS Research Seed Fund, and HKU Fintech Academy R&D Funding.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhu, C., Wang, T., Zhang, W., Chen, K., Liu, X. (2025). ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15066. Springer, Cham. https://doi.org/10.1007/978-3-031-73242-3_9
DOI: https://doi.org/10.1007/978-3-031-73242-3_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73241-6
Online ISBN: 978-3-031-73242-3
eBook Packages: Computer Science, Computer Science (R0)