Abstract
Multimodal machine translation (MMT) draws on information from modalities beyond plain text, such as images, with the aim of improving translation quality. Multimodal datasets remain scarce, particularly for Indian regional languages, and building such datasets is necessary to advance MMT research. In this work, we describe the creation of the Bengali Visual Genome (BVG) dataset. The BVG is the first multimodal dataset of paired text and images suitable for English-to-Bengali multimodal machine translation and related multimodal research. We also demonstrate sample use cases of machine translation and region-specific image captioning using the new BVG dataset. These results can serve as baselines for subsequent research.
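To illustrate how such a dataset might be consumed, the following is a minimal sketch of a loader for BVG-style text data. It assumes the tab-separated layout of the Hindi Visual Genome that BVG extends (image id, region coordinates, English source, Bengali target); the field layout and the file name `bvg_train.txt` are assumptions for illustration, not specifics confirmed by the paper.

```python
# Minimal sketch: parse BVG-style text data into (English, Bengali) pairs
# together with the image-region coordinates they describe. The layout
# (image_id, x, y, width, height, English text, Bengali text) follows the
# Hindi Visual Genome convention; the exact BVG file format and the name
# "bvg_train.txt" are assumptions.
from dataclasses import dataclass

@dataclass
class BVGExample:
    image_id: str
    box: tuple  # (x, y, width, height) of the captioned region
    english: str
    bengali: str

def load_bvg(path):
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 7:
                continue  # skip malformed lines
            image_id, x, y, w, h, en, bn = fields
            examples.append(
                BVGExample(image_id, (int(x), int(y), int(w), int(h)), en, bn)
            )
    return examples

if __name__ == "__main__":
    data = load_bvg("bvg_train.txt")
    print(f"{len(data)} segments; first English/Bengali pair:")
    print(data[0].english, "->", data[0].bengali)
```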
Acknowledgements
The author Ondřej Bojar would like to acknowledge the support of the grant 19-26934X (NEUREM3) of the Czech Science Foundation.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Sen, A., Parida, S., Kotwal, K., Panda, S., Bojar, O., Dash, S.R. (2022). Bengali Visual Genome: A Multimodal Dataset for Machine Translation and Image Captioning. In: Satapathy, S.C., Peer, P., Tang, J., Bhateja, V., Ghosh, A. (eds) Intelligent Data Engineering and Analytics. Smart Innovation, Systems and Technologies, vol 266. Springer, Singapore. https://doi.org/10.1007/978-981-16-6624-7_7
DOI: https://doi.org/10.1007/978-981-16-6624-7_7
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-6623-0
Online ISBN: 978-981-16-6624-7
eBook Packages: Intelligent Technologies and Robotics (R0)