Abstract
Image captioning is a challenging task that requires not only extracting semantic information from an image but also generating grammatically correct descriptions. Most previous studies employ a one- or two-layer Recurrent Neural Network (RNN) as the language model to predict sentence words. Such a shallow language model can readily capture word information for nouns and objects, but it may struggle to learn verbs and adjectives. To address this issue, a deep attention-based language model is proposed to learn more abstract word information, and three stacked approaches are designed to process attention. The proposed model makes full use of the Long Short-Term Memory (LSTM) network and employs transferred current attention to provide additional spatial information. Experimental results on the benchmark MSCOCO and Flickr30K datasets verify the effectiveness of the proposed model.
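The paper's specific stacked designs and the transferred-attention mechanism are detailed in the full text. Purely as an illustrative sketch, the PyTorch-style snippet below shows how a deep (two-layer) LSTM language model can attend over CNN feature maps at each decoding step; the class name, layer sizes, and the generic additive soft attention are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StackedAttentionDecoder(nn.Module):
    """Illustrative two-layer LSTM decoder with soft attention over CNN features.
    A hypothetical sketch; not the exact model proposed in the paper."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, feat_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # First LSTM layer consumes the word embedding and the attended visual context.
        self.lstm1 = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        # Second (deeper) LSTM layer refines the first layer's output.
        self.lstm2 = nn.LSTMCell(hidden_dim, hidden_dim)
        # Additive attention: score each spatial region against the hidden state.
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def attend(self, feats, h):
        # feats: (batch, num_regions, feat_dim); h: (batch, hidden_dim)
        scores = self.att_score(torch.tanh(self.att_feat(feats) + self.att_hid(h).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)    # attention weights over spatial regions
        return (alpha * feats).sum(dim=1)   # attended context vector

    def step(self, word_ids, feats, state1, state2):
        # One decoding step: attend with the previous top-layer hidden state,
        # then pass [embedding; context] through the stacked LSTM layers.
        context = self.attend(feats, state2[0])
        h1, c1 = self.lstm1(torch.cat([self.embed(word_ids), context], dim=1), state1)
        h2, c2 = self.lstm2(h1, state2)
        return self.classifier(h2), (h1, c1), (h2, c2)
```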




Acknowledgements
This work was supported in part by National Natural Science Foundation of China under Grants 61622115 and 61472281, Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning (No. GZ2015005), Shanghai Engineering Research Center of Industrial Vision Perception & Intelligent Computing (17DZ2251600), and IBM Shared University Research Awards Program.
Cite this article
Fang, F., Wang, H., Chen, Y. et al. Looking deeper and transferring attention for image captioning. Multimed Tools Appl 77, 31159–31175 (2018). https://doi.org/10.1007/s11042-018-6228-6