Abstract
The automatic creation of textual descriptions from images, known as image captioning, has advanced significantly in recent years. The task lies at the intersection of computer vision and natural language processing. Image-to-text synthesis draws on several families of techniques, including encoder–decoder frameworks, attention mechanisms, Transformer-based models, reinforcement learning, and unsupervised learning. Despite these advances, resolving image ambiguity and effectively capturing image context, emotions, and facts remain challenging. This work extends previous surveys by providing a comprehensive analysis of the latest developments, evaluation challenges, and specific capabilities such as multilingual captioning, narrative generation, detailed description, and integration with other applications. It also examines the evaluation metrics used across application areas to train and assess image captioning algorithms. Finally, open issues and challenges, together with the primary findings of prior research, are discussed to guide future work, giving researchers and developers the insights and tools needed to advance image captioning technology.
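To make the encoder–decoder framework named above concrete, the following is a minimal PyTorch sketch of a caption decoder driven by CNN image features. It is an illustrative sketch of the general framework, not any surveyed paper's method; the dimensions, vocabulary size, and toy inputs are assumptions chosen for demonstration.

```python
# A minimal sketch of the encoder-decoder captioning framework in PyTorch.
# Assumptions for illustration: feature/embedding/hidden sizes, vocab size,
# and the random tensors standing in for real CNN features and captions.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=5000):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)  # image feature -> initial LSTM state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, captions):
        # image_feats: (B, feat_dim) pooled features from a pretrained CNN encoder
        # captions:    (B, T) token ids, teacher-forced during training
        h0 = torch.tanh(self.init_h(image_feats)).unsqueeze(0)  # (1, B, H)
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)                              # (B, T, E)
        hidden, _ = self.lstm(emb, (h0, c0))                    # (B, T, H)
        return self.out(hidden)                                 # (B, T, V) logits

# Toy forward pass with random features standing in for CNN encoder output.
decoder = CaptionDecoder()
feats = torch.randn(4, 2048)              # batch of 4 image feature vectors
tokens = torch.randint(0, 5000, (4, 12))  # batch of 12-token captions
logits = decoder(feats, tokens)
print(logits.shape)                       # torch.Size([4, 12, 5000])
```

In practice the image features would come from a pretrained CNN such as a ResNet, attention or Transformer variants would replace the plain LSTM, and inference would decode token by token with greedy or beam search rather than teacher forcing.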
Ethics declarations
Conflict of interest
The authors declare that they have no conflicts of interest.
About this article
Cite this article
Sharma, A., Aggarwal, M. A Holistic Review of Image-to-Text Conversion: Techniques, Evaluation Metrics, Multilingual Captioning, Storytelling and Integration. SN COMPUT. SCI. 6, 225 (2025). https://doi.org/10.1007/s42979-025-03719-6