Abstract
In this work, we present a novel multi-scale feature fusion network (M-FFN) for the image captioning task that incorporates both discriminative features and scene-contextual information of an image. The network is constructed by coupling a spatial transformer network with a multi-scale feature pyramid network through a feature fusion block, enriching both spatial and global semantic information. In particular, the multi-scale feature pyramid network captures global contextual information by applying atrous convolutions to the top layers of a convolutional neural network (CNN), while the spatial transformer network is applied to the early layers of the CNN to remove intra-class variability caused by spatial transformations. The feature fusion block then integrates the global contextual information and the spatial features to encode the visual information of the input image. In addition, a spatial-semantic attention module learns attentive contextual features that guide the captioning module. The efficacy of the proposed model is evaluated on the COCO dataset.
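For concreteness, the sketch below outlines the encoder pipeline the abstract describes, in PyTorch. It is a minimal illustration under assumed shapes and layer choices, not the authors' implementation: the class names, the dilation rates (6, 12, 18), the localisation-network size, and the ResNet-like feature dimensions are all assumptions of this sketch, and the spatial-semantic attention module and caption decoder are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScalePyramid(nn.Module):
    """Atrous (dilated) convolutions over top-layer CNN features to
    aggregate global context at several receptive-field sizes."""

    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):  # rates assumed
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        )
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        ctx = torch.cat([F.relu(b(x)) for b in self.branches], dim=1)
        return self.project(ctx)


class SpatialTransformer(nn.Module):
    """Predicts an affine warp of early-layer features to factor out
    spatial transformations (cf. Jaderberg et al., 2015)."""

    def __init__(self, in_ch):
        super().__init__()
        self.loc = nn.Sequential(
            nn.AdaptiveAvgPool2d(8),
            nn.Flatten(),
            nn.Linear(in_ch * 64, 32),
            nn.ReLU(),
            nn.Linear(32, 6),
        )
        # Initialise the predicted warp to the identity transform so
        # the module starts as a no-op and learns deviations from it.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1.0, 0, 0, 0, 1.0, 0]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)


class FeatureFusionBlock(nn.Module):
    """Fuses warped spatial features with upsampled global context."""

    def __init__(self, spatial_ch, ctx_ch, out_ch):
        super().__init__()
        self.fuse = nn.Conv2d(spatial_ch + ctx_ch, out_ch, 1)

    def forward(self, spatial, ctx):
        # Resize the coarse global context to the spatial resolution
        # before concatenation and 1x1 projection.
        ctx = F.interpolate(ctx, size=spatial.shape[-2:],
                            mode="bilinear", align_corners=False)
        return F.relu(self.fuse(torch.cat([spatial, ctx], dim=1)))


# Example with assumed ResNet-like shapes: early features 256x56x56,
# top-layer features 2048x7x7, batch size 2.
early = torch.randn(2, 256, 56, 56)
top = torch.randn(2, 2048, 7, 7)
visual = FeatureFusionBlock(256, 256, 512)(
    SpatialTransformer(256)(early), MultiScalePyramid(2048, 256)(top)
)
print(visual.shape)  # torch.Size([2, 512, 56, 56])
```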
Cite this article
Prudviraj, J., Vishnu, C. & Mohan, C.K. M-FFN: multi-scale feature fusion network for image captioning. Appl Intell 52, 14711–14723 (2022). https://doi.org/10.1007/s10489-022-03463-x