Abstract
In this work, we present a novel multi-scale feature fusion network (M-FFN) for the image captioning task that incorporates both discriminative features and scene-contextual information of an image. The network is constructed by coupling a spatial transformer network with a multi-scale feature pyramid network through a feature fusion block, enriching both spatial and global semantic information. In particular, the multi-scale feature pyramid network captures global contextual information by applying atrous convolutions to the top layers of a convolutional neural network (CNN), while the spatial transformer network is applied to the early layers of the CNN to remove intra-class variability caused by spatial transformations. The feature fusion block then integrates the global contextual information and the spatial features to encode the visual information of the input image. In addition, a spatial-semantic attention module learns attentive contextual features that guide the captioning module. The efficacy of the proposed model is evaluated on the COCO dataset.
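For concreteness, the sketch below outlines the encoder pipeline the abstract describes, in PyTorch. It is a minimal illustration under assumed shapes and layer choices, not the authors' implementation: the class names, the dilation rates (6, 12, 18), the localisation-network size, and the ResNet-like feature dimensions are all assumptions of this sketch, and the spatial-semantic attention module and caption decoder are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScalePyramid(nn.Module):
    """Atrous (dilated) convolutions over top-layer CNN features to
    aggregate global context at several receptive-field sizes."""

    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):  # rates assumed
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        )
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        ctx = torch.cat([F.relu(b(x)) for b in self.branches], dim=1)
        return self.project(ctx)


class SpatialTransformer(nn.Module):
    """Predicts an affine warp of early-layer features to factor out
    spatial transformations (cf. Jaderberg et al., 2015)."""

    def __init__(self, in_ch):
        super().__init__()
        self.loc = nn.Sequential(
            nn.AdaptiveAvgPool2d(8),
            nn.Flatten(),
            nn.Linear(in_ch * 64, 32),
            nn.ReLU(),
            nn.Linear(32, 6),
        )
        # Initialise the predicted warp to the identity transform so
        # the module starts as a no-op and learns deviations from it.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1.0, 0, 0, 0, 1.0, 0]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)


class FeatureFusionBlock(nn.Module):
    """Fuses warped spatial features with upsampled global context."""

    def __init__(self, spatial_ch, ctx_ch, out_ch):
        super().__init__()
        self.fuse = nn.Conv2d(spatial_ch + ctx_ch, out_ch, 1)

    def forward(self, spatial, ctx):
        # Resize the coarse global context to the spatial resolution
        # before concatenation and 1x1 projection.
        ctx = F.interpolate(ctx, size=spatial.shape[-2:],
                            mode="bilinear", align_corners=False)
        return F.relu(self.fuse(torch.cat([spatial, ctx], dim=1)))


# Example with assumed ResNet-like shapes: early features 256x56x56,
# top-layer features 2048x7x7, batch size 2.
early = torch.randn(2, 256, 56, 56)
top = torch.randn(2, 2048, 7, 7)
visual = FeatureFusionBlock(256, 256, 512)(
    SpatialTransformer(256)(early), MultiScalePyramid(2048, 256)(top)
)
print(visual.shape)  # torch.Size([2, 512, 56, 56])
```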
Cite this article
Prudviraj, J., Vishnu, C. & Mohan, C.K. M-FFN: multi-scale feature fusion network for image captioning. Appl Intell 52, 14711–14723 (2022). https://doi.org/10.1007/s10489-022-03463-x