Abstract
Automated radiology report generation can both lighten the workload of clinicians and improve the efficiency of disease diagnosis. However, generating radiology reports that are semantically coherent and highly consistent with the underlying medical images remains challenging. To address this challenge, we propose a Multimodal Recursive model with Contrastive Learning (MRCL). MRCL incorporates both visual and semantic features to generate the “Impression” and “Findings” sections of radiology reports through a recursive network, and introduces a contrastive pre-training method that improves the expressiveness of both visual and textual representations. Extensive experiments and analyses demonstrate the efficacy of MRCL, which not only generates semantically coherent radiology reports but also outperforms state-of-the-art methods.
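For intuition, the following is a minimal sketch of the kind of image–text contrastive pre-training objective the abstract refers to: a symmetric InfoNCE loss that pulls the embeddings of paired images and reports together and pushes mismatched pairs apart. This is an illustrative sketch under assumptions, not the paper's actual implementation; the function names, tensor shapes, and temperature value are hypothetical.

```python
# Illustrative sketch of image-text contrastive pre-training (InfoNCE-style).
# All names and dimensions are hypothetical, not the authors' implementation.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/report embeddings.

    image_emb, text_emb: (batch, dim) projections of the visual encoder
    output and the report encoder output for the same examinations.
    """
    # Normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the true pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions: image-to-report and report-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Usage: given paired batches from the two encoders, e.g.
#   loss = contrastive_loss(visual_encoder(images), text_encoder(reports))
```

Objectives of this form treat every other example in the batch as a negative, so no explicit negative mining is needed; the representations learned this way can then initialize the encoders of a report-generation model.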



Data availability
The datasets analysed during the current study are available upon reasonable request from https://openi.nlm.nih.gov/ and https://physionet.org/content/mimic-cxr/2.0.0/.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant No. 62172267), the National Key R&D Program of China (Grant No. 2019YFE0190500), the Natural Science Foundation of Shanghai, China (Grant No. 20ZR1420400), the State Key Program of National Natural Science Foundation of China (Grant No. 61936001), the Shanghai Pujiang Program (Grant No. 21PJ1404200), and the Key Research Project of Zhejiang Laboratory (Grant No. 2021PE0AC02).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Cite this article
Wu, X., Li, J., Wang, J. et al. Multimodal contrastive learning for radiology report generation. J Ambient Intell Human Comput 14, 11185–11194 (2023). https://doi.org/10.1007/s12652-022-04398-4