
Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models

  • Conference paper
Medical Image Computing and Computer Assisted Intervention – MICCAI 2023 (MICCAI 2023)

Abstract

Medical Visual Question Answering (VQA) is an important challenge, as it would lead to faster and more accurate diagnoses and treatment decisions. Most existing methods approach it as a multi-class classification problem, which restricts the outcome to a predefined, closed set of curated answers. We focus on open-ended VQA and, motivated by recent advances in language models, consider it a generative task. Leveraging pre-trained language models, we introduce a novel method particularly suited for small, domain-specific, medical datasets. To properly communicate the medical images to the language model, we develop a network that maps the extracted visual features to a set of learnable tokens. Then, alongside the question, these learnable tokens directly prompt the language model. We explore recent parameter-efficient fine-tuning strategies for language models, which allow for resource- and data-efficient fine-tuning. We evaluate our approach on the prime medical VQA benchmarks, namely Slake, OVQA and PathVQA. The results demonstrate that our approach outperforms existing methods across various training settings while also being computationally efficient.

T. van Sonsbeek, M. M. Derakhshani and I. Najdenkoska—Equal contribution.
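
To make the mechanism described in the abstract concrete, below is a minimal PyTorch sketch of prefix-style visual prompting: a small mapping network turns one image embedding into a few pseudo-token embeddings, which are prepended to the embedded question before a frozen language model generates the answer. Every specific choice here is an illustrative assumption rather than the paper's exact configuration: GPT-2 stands in for the language model, the image embedding is a 512-dimensional CLIP-style vector (here a random placeholder), and the two-layer MLP mapper and prefix length of 8 are arbitrary.

```python
# Minimal sketch of visual prefix prompting for open-ended VQA.
# Assumptions (illustrative, not from the paper): GPT-2 as the frozen LM,
# a 512-d CLIP-style image embedding, a 2-layer MLP mapper, prefix length 8.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer


class VisualPrefixMapper(nn.Module):
    """Maps one image embedding to `prefix_len` pseudo-token embeddings."""

    def __init__(self, visual_dim: int, lm_dim: int, prefix_len: int = 8):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        hidden = (visual_dim + lm_dim * prefix_len) // 2
        self.mlp = nn.Sequential(
            nn.Linear(visual_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, lm_dim * prefix_len),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # (batch, visual_dim) -> (batch, prefix_len, lm_dim)
        return self.mlp(visual_features).view(-1, self.prefix_len, self.lm_dim)


lm = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
for p in lm.parameters():  # freeze the LM; only the mapper is trained
    p.requires_grad = False

mapper = VisualPrefixMapper(visual_dim=512, lm_dim=lm.config.n_embd)

# In practice the 512-d vector would come from a vision encoder such as
# CLIP's image tower; a random tensor stands in for it here.
image_embedding = torch.randn(1, 512)
prefix_embeds = mapper(image_embedding)                    # (1, 8, n_embd)

question_ids = tokenizer("Question: what organ is shown? Answer:",
                         return_tensors="pt").input_ids
question_embeds = lm.transformer.wte(question_ids)         # (1, q_len, n_embd)

# The visual prefix tokens prompt the LM alongside the question.
inputs_embeds = torch.cat([prefix_embeds, question_embeds], dim=1)
with torch.no_grad():
    out = lm.generate(inputs_embeds=inputs_embeds,
                      max_new_tokens=10,
                      pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

During training, only the mapper's parameters (plus, in the parameter-efficient variants the abstract mentions, a small set of adapter or prefix parameters such as LoRA weights) would receive gradients; keeping the language model frozen is what makes fine-tuning feasible on small medical datasets.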


Notes

  1. Code available at: github.com/tjvsonsbeek/open-ended-medical-vqa.


Acknowledgements

This work is financially supported by the Inception Institute of Artificial Intelligence, the University of Amsterdam and the allowance Top consortia for Knowledge and Innovation (TKIs) from the Netherlands Ministry of Economic Affairs and Climate Policy.

Author information


Correspondence to Tom van Sonsbeek.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

van Sonsbeek, T., Derakhshani, M.M., Najdenkoska, I., Snoek, C.G.M., Worring, M. (2023). Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models. In: Greenspan, H., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2023. MICCAI 2023. Lecture Notes in Computer Science, vol 14224. Springer, Cham. https://doi.org/10.1007/978-3-031-43904-9_70


  • DOI: https://doi.org/10.1007/978-3-031-43904-9_70


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43903-2

  • Online ISBN: 978-3-031-43904-9

  • eBook Packages: Computer Science, Computer Science (R0)
