
Multimodal contrastive learning for radiology report generation

  • Original Research
  • Published:
Journal of Ambient Intelligence and Humanized Computing

Abstract

Automated radiology report generation can both lighten the workload of clinicians and improve the efficiency of disease diagnosis. However, generating radiology reports that are semantically coherent and also highly consistent with the underlying medical images is a challenging task. To meet this challenge, we propose a Multimodal Recursive model with Contrastive Learning (MRCL). The proposed MRCL method incorporates both visual and semantic features to generate the “Impression” and “Findings” sections of radiology reports through a recursive network, in which a contrastive pre-training method is proposed to improve the expressiveness of both visual and textual representations. Extensive experiments and analyses demonstrate the effectiveness of the proposed MRCL, which not only generates semantically coherent radiology reports but also outperforms state-of-the-art methods.
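The abstract describes a contrastive pre-training step that aligns visual and textual representations of paired images and reports. The sketch below illustrates the general image–text contrastive objective (a symmetric InfoNCE-style loss, in the spirit of the contrastive methods the paper builds on); the function names, embedding shapes, and temperature value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosines."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(img_emb, txt_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Row i of `img_emb` and row i of `txt_emb` are a matched image/report
    pair (the positive); every other pairing in the batch is a negative.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature   # (batch, batch) cosine similarities
    labels = np.arange(len(logits))      # positives sit on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls each image embedding toward the embedding of its own report while pushing it away from the other reports in the batch, which is what makes the learned representations more expressive for downstream generation.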


Fig. 1, Fig. 2, Fig. 3 (figures available in the full article)


Data availability

The datasets analysed during the current study are available from https://openi.nlm.nih.gov/ and https://physionet.org/content/mimic-cxr/2.0.0/ upon reasonable request.


Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant No. 62172267), the National Key R&D Program of China (Grant No. 2019YFE0190500), the Natural Science Foundation of Shanghai, China (Grant No. 20ZR1420400), the State Key Program of National Natural Science Foundation of China (Grant No. 61936001), the Shanghai Pujiang Program (Grant No. 21PJ1404200), and the Key Research Project of Zhejiang Laboratory (No. 2021PE0AC02).

Author information


Corresponding author

Correspondence to Xing Wu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wu, X., Li, J., Wang, J. et al. Multimodal contrastive learning for radiology report generation. J Ambient Intell Human Comput 14, 11185–11194 (2023). https://doi.org/10.1007/s12652-022-04398-4



Keywords