Abstract
Medical Visual Question Answering (VQA) is an important challenge, as successful systems would support faster and more accurate diagnoses and treatment decisions. Most existing methods approach it as a multi-class classification problem, which restricts the output to a predefined, closed set of curated answers. We focus on open-ended VQA and, motivated by recent advances in language models, treat it as a generative task. Leveraging pre-trained language models, we introduce a novel method particularly suited for small, domain-specific medical datasets. To properly communicate the medical images to the language model, we develop a network that maps the extracted visual features to a set of learnable tokens. These learnable tokens, alongside the question, then directly prompt the language model. We further explore recent parameter-efficient fine-tuning strategies for language models, which allow for resource- and data-efficient adaptation. We evaluate our approach on the principal medical VQA benchmarks, namely Slake, OVQA and PathVQA. The results demonstrate that our approach outperforms existing methods across various training settings while also being computationally efficient.
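The authors' code is in the repository linked under Notes. To make the architecture in the abstract concrete, the sketch below shows one way a mapping network can turn an image feature into prefix tokens that prompt a frozen language model. It is a minimal illustration, assuming a CLIP-style image feature of dimension 512 and GPT-2 as the frozen language model; the MappingNetwork class, its layer sizes, and prefix_length=8 are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class MappingNetwork(nn.Module):
    """Illustrative mapper from one visual feature vector to a sequence of
    `prefix_length` embeddings in the language model's input space."""

    def __init__(self, visual_dim=512, lm_dim=768, prefix_length=8):
        super().__init__()
        self.prefix_length = prefix_length
        self.lm_dim = lm_dim
        self.mlp = nn.Sequential(
            nn.Linear(visual_dim, lm_dim * prefix_length // 2),
            nn.Tanh(),
            nn.Linear(lm_dim * prefix_length // 2, lm_dim * prefix_length),
        )

    def forward(self, visual_features):            # (batch, visual_dim)
        prefix = self.mlp(visual_features)         # (batch, prefix_length * lm_dim)
        return prefix.view(-1, self.prefix_length, self.lm_dim)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
for p in lm.parameters():                          # freeze the language model;
    p.requires_grad = False                        # only the mapper is trained

mapper = MappingNetwork()
image_feat = torch.randn(1, 512)                   # stand-in for a CLIP image feature
visual_prefix = mapper(image_feat)                 # (1, 8, 768)

question = "question: Is there a fracture in this image? answer:"
q_ids = tokenizer(question, return_tensors="pt").input_ids
q_embeds = lm.transformer.wte(q_ids)               # embed the question tokens

# Prompt the frozen LM with [visual prefix; question] and read out the
# most likely next token of the generated answer.
inputs = torch.cat([visual_prefix, q_embeds], dim=1)
out = lm(inputs_embeds=inputs)
next_token = out.logits[0, -1].argmax()
print(tokenizer.decode([int(next_token)]))
```

In this setup only the mapping network (and, optionally, a small set of parameter-efficient adapter weights such as LoRA or prefix-tuning parameters) receives gradients, which is what makes the approach practical on small medical datasets.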
T. van Sonsbeek, M. M. Derakhshani and I. Najdenkoska—Equal contribution.
Notes
1. Code available at: github.com/tjvsonsbeek/open-ended-medical-vqa.
Acknowledgements
This work is financially supported by the Inception Institute of Artificial Intelligence, the University of Amsterdam and the allowance Top consortia for Knowledge and Innovation (TKIs) from the Netherlands Ministry of Economic Affairs and Climate Policy.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
van Sonsbeek, T., Derakhshani, M.M., Najdenkoska, I., Snoek, C.G.M., Worring, M. (2023). Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models. In: Greenspan, H., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2023. MICCAI 2023. Lecture Notes in Computer Science, vol 14224. Springer, Cham. https://doi.org/10.1007/978-3-031-43904-9_70
DOI: https://doi.org/10.1007/978-3-031-43904-9_70
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43903-2
Online ISBN: 978-3-031-43904-9