NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models

  • Conference paper

Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Capitalizing on the remarkable advancements in Large Language Models (LLMs), there is a burgeoning initiative to harness LLMs for instruction-following robotic navigation. Such a trend underscores the potential of LLMs to generalize navigational reasoning and diverse language understanding. However, a significant discrepancy in agent performance is observed when integrating LLMs into Vision-and-Language Navigation (VLN) tasks, compared with previous downstream specialist models. Furthermore, the inherent capacity of language to interpret and facilitate communication in agent interactions is often underutilized in these integrations. In this work, we strive to bridge the divide between VLN-specialized models and LLM-based navigation paradigms, while maintaining the interpretative prowess of LLMs in generating linguistic navigational reasoning. By aligning visual content within a frozen LLM, we enable the LLM to comprehend visual observations, and we devise a way to incorporate LLMs and navigation policy networks for effective action prediction and navigational reasoning. We demonstrate the data efficiency of the proposed method and eliminate the gap between LLM-based agents and state-of-the-art VLN specialists. The source code is available at https://github.com/GengzeZhou/NavGPT-2.
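To make the pipeline the abstract describes more concrete, below is a minimal PyTorch sketch of the high-level idea: visual features are aligned into the token space of a frozen LLM, and the LLM's latent states drive a lightweight navigation policy head. All dimensions, module names, and the query-based aligner are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class VisualAligner(nn.Module):
    """Maps image features into the LLM embedding space (Q-Former-style);
    the learned-query design here is an assumption, not the paper's code."""
    def __init__(self, vis_dim=1408, llm_dim=2048, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim))
        self.attn = nn.MultiheadAttention(vis_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, image_feats):  # image_feats: (B, N, vis_dim)
        q = self.queries.expand(image_feats.size(0), -1, -1)
        aligned, _ = self.attn(q, image_feats, image_feats)
        return self.proj(aligned)    # (B, num_queries, llm_dim)

class NavAgent(nn.Module):
    def __init__(self, frozen_llm_encoder, llm_dim=2048, num_actions=4):
        super().__init__()
        self.aligner = VisualAligner(llm_dim=llm_dim)
        self.llm = frozen_llm_encoder
        for p in self.llm.parameters():      # keep the LLM frozen
            p.requires_grad_(False)
        self.policy = nn.Sequential(         # trainable policy head
            nn.Linear(llm_dim, 512), nn.ReLU(), nn.Linear(512, num_actions))

    def forward(self, image_feats, instr_embeds):
        vis_tokens = self.aligner(image_feats)                # visual prompt
        llm_input = torch.cat([vis_tokens, instr_embeds], 1)  # prepend to text
        hidden = self.llm(llm_input)                          # (B, T, llm_dim)
        return self.policy(hidden.mean(dim=1))                # action logits

# Stand-in frozen "LLM" so the sketch runs end to end.
llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=2048, nhead=8, batch_first=True),
    num_layers=2)
agent = NavAgent(llm)
logits = agent(torch.randn(2, 36, 1408), torch.randn(2, 20, 2048))
print(logits.shape)  # torch.Size([2, 4])
```

The key design point this sketch captures is the split of responsibilities: the frozen LLM retains its language understanding (and can still be prompted for linguistic navigational reasoning), while only the small aligner and policy head are trained, which is what makes the approach data-efficient.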


Notes

  1. LLaMA2-7B [63] has 6.74 billion parameters, while DUET [12] has only 0.18 billion.

  2. Our models are smaller (1.5B and 5B) than the original FlanT5 models (3B and 11B) because we only utilize the LLM encoder during navigation (see the sketch below).
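To make Note 2 concrete, the encoder half of a T5-family checkpoint can be loaded on its own via Hugging Face Transformers' T5EncoderModel, which skips the decoder weights entirely. The checkpoint name below is the public FlanT5-XL release (about 3B parameters in full, roughly half in the encoder alone); it is used purely to illustrate encoder-only loading, not as the paper's exact setup.

```python
from transformers import AutoTokenizer, T5EncoderModel

# Load only the encoder of FlanT5-XL; decoder weights are never instantiated,
# so the in-memory model is roughly half the size of the full checkpoint.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
encoder = T5EncoderModel.from_pretrained("google/flan-t5-xl")

inputs = tokenizer("Walk past the sofa and stop at the stairs.",
                   return_tensors="pt")
hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, d_model)
print(sum(p.numel() for p in encoder.parameters()))  # encoder-only count
```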

References

  1. Alayrac, J.B., et al.: Flamingo: a visual language model for few-shot learning. Adv. Neural. Inf. Process. Syst. 35, 23716–23736 (2022)

  2. An, D., Qi, Y., Huang, Y., Wu, Q., Wang, L., Tan, T.: Neighbor-view enhanced model for vision and language navigation. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 5101–5109 (2021)

  3. An, D., et al.: BEVBert: topo-metric map pre-training for language-guided navigation. arXiv preprint arXiv:2212.04385 (2022)

  4. Anderson, P., et al.: On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757 (2018)

  5. Anderson, P., et al.: Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3674–3683 (2018)

  6. Bai, J., et al.: Qwen-VL: a frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 (2023)

  7. Chang, A., et al.: Matterport3D: learning from RGB-D data in indoor environments. In: 2017 International Conference on 3D Vision (3DV), pp. 667–676. IEEE (2017)

  8. Chen, J., Lin, B., Xu, R., Chai, Z., Liang, X., Wong, K.Y.K.: MapGPT: map-guided prompting for unified vision-and-language navigation. arXiv preprint arXiv:2401.07314 (2024)

  9. Chen, J., et al.: MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478 (2023)

  10. Chen, K., Chen, J.K., Chuang, J., Vázquez, M., Savarese, S.: Topological planning with transformers for vision-and-language navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11276–11286 (2021)

  11. Chen, S., Guhur, P.L., Schmid, C., Laptev, I.: History aware multimodal transformer for vision-and-language navigation. Adv. Neural. Inf. Process. Syst. 34, 5834–5847 (2021)

  12. Chen, S., Guhur, P.L., Tapaswi, M., Schmid, C., Laptev, I.: Think global, act local: dual-scale graph transformer for vision-and-language navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16537–16547 (2022)

  13. Chen, Y.-C., et al.: UNITER: UNiversal Image-TExt Representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 104–120. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_7

  14. Chiang, W.L., et al.: Vicuna: an open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. https://lmsys.org/blog/2023-03-30-vicuna/

  15. Chung, H.W., et al.: Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022)

  16. Dai, W., et al.: InstructBLIP: towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500 (2023)

  17. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  18. Dou, Z.Y., Peng, N.: FOAM: a follower-aware speaker model for vision-and-language navigation. arXiv preprint arXiv:2206.04294 (2022)

  19. Driess, D., et al.: PaLM-E: an embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023)

  20. Fang, Y., et al.: EVA: exploring the limits of masked visual representation learning at scale. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19358–19369 (2023)

  21. Fried, D., et al.: Speaker-follower models for vision-and-language navigation. In: Advances in Neural Information Processing Systems, vol. 31 (2018)

  22. Guhur, P.L., Tapaswi, M., Chen, S., Laptev, I., Schmid, C.: Airbert: in-domain pretraining for vision-and-language navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1634–1643 (2021)

  23. Hao, W., Li, C., Li, X., Carin, L., Gao, J.: Towards learning a generic agent for vision-and-language navigation via pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13137–13146 (2020)

  24. Hong, Y., Wu, Q., Qi, Y., Rodriguez-Opazo, C., Gould, S.: A recurrent vision-and-language BERT for navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1643–1653, June 2021

  25. Huang, H., et al.: Transferable representation learning in vision-and-language navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7404–7413 (2019)

  26. Ilharco, G., Jain, V., Ku, A., Ie, E., Baldridge, J.: General evaluation for instruction conditioned navigation using dynamic time warping. arXiv preprint arXiv:1907.05446 (2019)

  27. Kamath, A., et al.: A new path: scaling vision-and-language navigation with synthetic instructions and imitation learning. arXiv preprint arXiv:2210.03112 (2022)

  28. Ke, L., et al.: Tactical rewind: self-correction via backtracking in vision-and-language navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6741–6749 (2019)

  29. Ku, A., Anderson, P., Patel, R., Ie, E., Baldridge, J.: Room-across-room: multilingual vision-and-language navigation with dense spatiotemporal grounding. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4392–4412 (2020)

  30. Li, J., Bansal, M.: PanoGen: text-conditioned panoramic environment generation for vision-and-language navigation. arXiv preprint arXiv:2305.19195 (2023)

  31. Li, J., Tan, H., Bansal, M.: EnvEdit: environment editing for vision-and-language navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15407–15417 (2022)

  32. Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)

  33. Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: VisualBERT: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)

  34. Li, X., Wang, Z., Yang, J., Wang, Y., Jiang, S.: KERM: knowledge enhanced reasoning for vision-and-language navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2583–2592 (2023)

  35. Li, X., et al.: Robust navigation with language pretraining and stochastic sampling. arXiv preprint arXiv:1909.02244 (2019)

  36. Li, X., et al.: Oscar: object-semantics aligned pre-training for vision-language tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 121–137. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_8

  37. Lin, B., et al.: NavCoT: boosting LLM-based vision-and-language navigation via learning disentangled reasoning. arXiv preprint arXiv:2403.07376 (2024)

  38. Lin, B., Zhu, Y., Chen, Z., Liang, X., Liu, J., Liang, X.: Adapt: vision-language navigation with modality-aligned action prompts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15396–15406 (2022)

  39. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Advances in Neural Information Processing Systems, vol. 36 (2024)

  40. Liu, R., Wang, X., Wang, W., Yang, Y.: Bird’s-eye-view scene graph for vision-language navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10968–10980 (2023)

  41. Long, Y., Cai, W., Wang, H., Zhan, G., Dong, H.: InstructNav: zero-shot system for generic instruction navigation in unexplored environment. arXiv preprint arXiv:2406.04882 (2024)

  42. Long, Y., Li, X., Cai, W., Dong, H.: Discuss before moving: visual language navigation via multi-expert discussions. arXiv preprint arXiv:2309.11382 (2023)

  43. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2018)

  44. Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in Neural Information Processing Systems, vol. 32 (2019)

  45. Ma, C.Y., et al.: Self-monitoring navigation agent via auxiliary progress estimation. arXiv preprint arXiv:1901.03035 (2019)

  46. Ma, C.Y., Wu, Z., AlRegib, G., Xiong, C., Kira, Z.: The regretful agent: heuristic-aided navigation through progress estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6732–6740 (2019)

  47. Majumdar, A., Shrivastava, A., Lee, S., Anderson, P., Parikh, D., Batra, D.: Improving vision-and-language navigation with image-text pairs from the web. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12351, pp. 259–274. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58539-6_16

  48. OpenAI: GPT-4 technical report (2023)

  49. Pan, B., et al.: LangNav: language as a perceptual representation for navigation. arXiv preprint arXiv:2310.07889 (2023)

  50. Parvaneh, A., Abbasnejad, E., Teney, D., Shi, J.Q., van den Hengel, A.: Counterfactual vision-and-language navigation: unravelling the unseen. Adv. Neural. Inf. Process. Syst. 33, 5296–5307 (2020)

  51. Peng, Z., et al.: Kosmos-2: grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824 (2023)

  52. Qi, Y., et al.: REVERIE: remote embodied visual referring expression in real indoor environments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9982–9991 (2020)

  53. Qiao, Y., Qi, Y., Hong, Y., Yu, Z., Wang, P., Wu, Q.: HOP+: history-enhanced and order-aware pre-training for vision-and-language navigation. IEEE Trans. Pattern Anal. Mach. Intell. (2023)

  54. Qiao, Y., Qi, Y., Yu, Z., Liu, J., Wu, Q.: March in chat: interactive prompting for remote embodied referring expression. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15758–15767 (2023)

  55. Radford, A., et al.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)

  56. Ramakrishnan, S.K., et al.: Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) (2021)

  57. Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635. JMLR Workshop and Conference Proceedings (2011)

  58. Su, W., et al.: VL-BERT: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530 (2019)

  59. Tan, H., Bansal, M.: LXMERT: learning cross-modality encoder representations from transformers. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5100–5111 (2019)

  60. Tan, H., Yu, L., Bansal, M.: Learning to navigate unseen environments: back translation with environmental dropout. In: Proceedings of NAACL-HLT, pp. 2610–2621 (2019)

  61. Thomason, J., Murray, M., Cakmak, M., Zettlemoyer, L.: Vision-and-dialog navigation. In: Conference on Robot Learning, pp. 394–406 (2020)

  62. Touvron, H., et al.: LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  63. Touvron, H., et al.: Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

  64. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)

  65. Wang, H., Liang, W., Shen, J., Van Gool, L., Wang, W.: Counterfactual cycle-consistent learning for instruction following and generation in vision-language navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15471–15481 (2022)

  66. Wang, H., Wang, W., Shu, T., Liang, W., Shen, J.: Active visual information gathering for vision-language navigation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12367, pp. 307–322. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58542-6_19

  67. Wang, W., et al.: VisionLLM: large language model is also an open-ended decoder for vision-centric tasks. In: Advances in Neural Information Processing Systems, vol. 36 (2024)

  68. Wang, X., et al.: Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6629–6638 (2019)

  69. Wang, X., Xiong, W., Wang, H., Wang, W.Y.: Look before you leap: bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11220, pp. 38–55. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01270-0_3

  70. Wang, Z., Li, X., Yang, J., Liu, Y., Jiang, S.: GridMM: grid memory map for vision-and-language navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15625–15636 (2023)

  71. Wang, Z., et al.: Scaling data generation in vision-and-language navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12009–12020 (2023)

  72. Xia, F., Zamir, A.R., He, Z., Sax, A., Malik, J., Savarese, S.: Gibson Env: real-world perception for embodied agents. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9068–9079 (2018)

  73. Zhan, Z., Yu, L., Yu, S., Tan, G.: MC-GPT: empowering vision-and-language navigation with memory map and reasoning chains. arXiv preprint arXiv:2405.10620 (2024)

  74. Zhang, J., et al.: NaVid: video-based VLM plans the next step for vision-and-language navigation. arXiv preprint arXiv:2402.15852 (2024)

  75. Zhao, Y., et al.: Target-driven structured transformer planner for vision-language navigation. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 4194–4203 (2022)

  76. Zheng, D., Huang, S., Zhao, L., Zhong, Y., Wang, L.: Towards learning a generalist model for embodied navigation. arXiv preprint arXiv:2312.02010 (2023)

  77. Zhou, G., Hong, Y., Wu, Q.: NavGPT: explicit reasoning in vision-and-language navigation with large language models. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 7641–7649 (2024)

  78. Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)

  79. Zhu, F., Zhu, Y., Chang, X., Liang, X.: Vision-language navigation with self-supervised auxiliary reasoning tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10012–10022 (2020)

Acknowledgements

We thank all the reviewers for their valuable comments and suggestions. Yicong Hong wants to thank NVIDIA for the Academic Hardware Grant that provided GPU support for this project. This project is supported by the University of Adelaide’s Centre for Augmented Reasoning (CAR).

Author information

Corresponding author

Correspondence to Qi Wu.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 6873 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhou, G., Hong, Y., Wang, Z., Wang, X.E., Wu, Q. (2025). NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15065. Springer, Cham. https://doi.org/10.1007/978-3-031-72667-5_15

  • DOI: https://doi.org/10.1007/978-3-031-72667-5_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72666-8

  • Online ISBN: 978-3-031-72667-5

  • eBook Packages: Computer Science, Computer Science (R0)
