
ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15066)


Abstract

Although great progress has been made in 3D visual grounding, current models still rely on explicit textual descriptions for grounding and lack the ability to reason about human intentions from implicit instructions. We propose a new task called 3D reasoning grounding and introduce a new benchmark, ScanReason, which provides over 10K question-answer-location pairs spanning five reasoning types that require the synergization of reasoning and grounding. We further design our approach, ReGround3D, composed of a visual-centric reasoning module empowered by a Multi-modal Large Language Model (MLLM) and a 3D grounding module that obtains accurate object locations by looking back at the enhanced geometry and fine-grained details of the 3D scenes. A chain-of-grounding mechanism is proposed to further boost performance with interleaved reasoning and grounding steps during inference. Extensive experiments on the proposed benchmark validate the effectiveness of our approach.
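
The abstract describes ReGround3D's two modules and the chain-of-grounding mechanism only at a high level. Below is a minimal, hypothetical Python sketch of how such an interleaved reasoning-and-grounding inference loop could be organized; all names here (ReasoningMLLM, GroundingModule, chain_of_grounding) are illustrative placeholders and not the authors' actual API or implementation.

```python
# Illustrative sketch of an interleaved reasoning-and-grounding loop,
# loosely following the chain-of-grounding idea in the abstract.
# All classes and methods below are hypothetical placeholders.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box3D:
    """Axis-aligned 3D box: center (x, y, z) and size (w, l, h)."""
    center: Tuple[float, float, float]
    size: Tuple[float, float, float]


class ReasoningMLLM:
    """Placeholder for the visual-centric reasoning module (an MLLM)."""

    def reason(self, scene_tokens, question: str, history: List[Box3D]) -> str:
        # In a real system this would produce a reasoning step, possibly
        # referring to objects grounded in earlier iterations.
        return f"reasoning step conditioned on {len(history)} grounded objects"

    def is_done(self, thought: str) -> bool:
        # Hypothetical stopping criterion for the interleaved loop.
        return "final" in thought


class GroundingModule:
    """Placeholder for the 3D grounding module that looks back at the scene
    geometry and fine-grained details to localize the referenced object."""

    def ground(self, scene_points, thought: str) -> Box3D:
        return Box3D(center=(0.0, 0.0, 0.0), size=(1.0, 1.0, 1.0))


def chain_of_grounding(scene_points, scene_tokens, question: str,
                       max_steps: int = 4) -> List[Box3D]:
    """Interleave reasoning and grounding: each reasoning step is grounded,
    and the grounded result feeds back into the next reasoning step."""
    reasoner, grounder = ReasoningMLLM(), GroundingModule()
    grounded: List[Box3D] = []
    for _ in range(max_steps):
        thought = reasoner.reason(scene_tokens, question, grounded)
        grounded.append(grounder.ground(scene_points, thought))
        if reasoner.is_done(thought):
            break
    return grounded
```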



Acknowledgements

This work is supported in part by the HKU Startup Fund, HKU Seed Fund for Basic Research, HKU Seed Fund for Translational and Applied Research, HKU IDS Research Seed Fund, and HKU Fintech Academy R&D Funding.

Author information


Corresponding author

Correspondence to Xihui Liu.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 5207 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zhu, C., Wang, T., Zhang, W., Chen, K., Liu, X. (2025). ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15066. Springer, Cham. https://doi.org/10.1007/978-3-031-73242-3_9


  • DOI: https://doi.org/10.1007/978-3-031-73242-3_9


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73241-6

  • Online ISBN: 978-3-031-73242-3

  • eBook Packages: Computer Science, Computer Science (R0)
