Abstract
Image captioning is a challenging task that requires not only extracting semantic information from an image but also generating grammatically correct descriptions. Most previous studies employ a one- or two-layer Recurrent Neural Network (RNN) as the language model to predict sentence words. Such a shallow language model can readily capture word information for nouns and objects, but it may struggle to learn verbs and adjectives. To address this issue, a deep attention-based language model is proposed to learn more abstract word information, and three stacked approaches are designed to process attention. The proposed model makes full use of the Long Short-Term Memory (LSTM) network and employs transferred current attention to provide additional spatial information. Experimental results on the benchmark MSCOCO and Flickr30K datasets verify the effectiveness of the proposed model.
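The paper's specific stacked designs and the transferred-attention mechanism are detailed in the full text. Purely as an illustrative sketch, the PyTorch-style snippet below shows how a deep (two-layer) LSTM language model can attend over CNN feature maps at each decoding step; the class name, layer sizes, and the generic additive soft attention are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StackedAttentionDecoder(nn.Module):
    """Illustrative two-layer LSTM decoder with soft attention over CNN features.
    A hypothetical sketch; not the exact model proposed in the paper."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, feat_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # First LSTM layer consumes the word embedding and the attended visual context.
        self.lstm1 = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        # Second (deeper) LSTM layer refines the first layer's output.
        self.lstm2 = nn.LSTMCell(hidden_dim, hidden_dim)
        # Additive attention: score each spatial region against the hidden state.
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def attend(self, feats, h):
        # feats: (batch, num_regions, feat_dim); h: (batch, hidden_dim)
        scores = self.att_score(torch.tanh(self.att_feat(feats) + self.att_hid(h).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)    # attention weights over spatial regions
        return (alpha * feats).sum(dim=1)   # attended context vector

    def step(self, word_ids, feats, state1, state2):
        # One decoding step: attend with the previous top-layer hidden state,
        # then pass [embedding; context] through the stacked LSTM layers.
        context = self.attend(feats, state2[0])
        h1, c1 = self.lstm1(torch.cat([self.embed(word_ids), context], dim=1), state1)
        h2, c2 = self.lstm2(h1, state2)
        return self.classifier(h2), (h1, c1), (h2, c2)
```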




Acknowledgements
This work was supported in part by National Natural Science Foundation of China under Grants 61622115 and 61472281, Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning (No. GZ2015005), Shanghai Engineering Research Center of Industrial Vision Perception & Intelligent Computing (17DZ2251600), and IBM Shared University Research Awards Program.
Cite this article
Fang, F., Wang, H., Chen, Y. et al. Looking deeper and transferring attention for image captioning. Multimed Tools Appl 77, 31159–31175 (2018). https://doi.org/10.1007/s11042-018-6228-6