Abstract
Social media platforms have seen a surge in the sharing of food experiences and recipes. To manage the large volume of data this trend generates, cross-modal retrieval systems have been developed to retrieve recipes and images. With the exception of work on cross-modal pre-trained models, the existing literature has focused mainly on the textual aspects of recipes and has disregarded the features of the food images themselves. To address this limitation, we comprehensively analyze the characteristics of food images and propose a new cross-modal retrieval framework that combines textual and visual information. Our analysis shows that food images have distinctive features, such as color, texture, and shape, that can be exploited to enhance cross-modal retrieval, and that current state-of-the-art models relying solely on textual information are limited in their capacity to retrieve images. In our framework, the image encoder is a network pre-trained on food images and extracts visual features from food images, while the text encoder is a hierarchical Transformer and extracts textual features from recipes. Our experiments demonstrate that the enhanced dual encoders significantly outperform the existing baseline models on the dataset. By incorporating both textual and visual information, the proposed framework can improve the accuracy of cross-modal retrieval systems and enhance the user experience of searching for food-related information.
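The dual-encoder retrieval step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or the evaluation protocol, so the sketch assumes that both encoders map into a shared embedding space, that matching is done by cosine similarity, and that query i is paired with recipe i. The function names (l2_normalize, rank_recipes, recall_at_k) are hypothetical, and small hand-made arrays stand in for the outputs of the image and text encoders.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def rank_recipes(image_emb, recipe_emb):
    """For each image embedding, return recipe indices sorted from best to worst match.

    Assumption: both inputs already live in a shared embedding space.
    """
    sim = l2_normalize(image_emb) @ l2_normalize(recipe_emb).T
    return np.argsort(-sim, axis=1)


def recall_at_k(ranking, k):
    """Fraction of queries whose paired recipe (same index) is in the top k."""
    targets = np.arange(ranking.shape[0])[:, None]
    hits = (ranking[:, :k] == targets).any(axis=1)
    return float(hits.mean())


# Toy stand-ins for encoder outputs (not real features):
# four recipes in a 6-dimensional space, and four images that are
# shifted copies of their paired recipes.
recipe_emb = np.eye(4, 6)
image_emb = recipe_emb + 0.1

ranking = rank_recipes(image_emb, recipe_emb)
print(ranking[:, 0])            # best-matching recipe index per image
print(recall_at_k(ranking, 1))  # share of images whose top match is correct
```

In a real system the two arrays would come from the trained image and text encoders; swapping the roles of the arguments gives recipe-to-image retrieval under the same assumptions.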
Acknowledgments
The authors thank the laboratory for providing the equipment and configuration that enabled the timely analysis of a large amount of data. This work was funded by the National Natural Science Foundation (grant number: 62377036).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Qin, H., Zhang, X., Song, C. (2024). Cross-modal Recipe Retrieval with Hierarchical Transformers and Pretrained Food Image Encoder. In: Huang, DS., Zhang, X., Zhang, C. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14879. Springer, Singapore. https://doi.org/10.1007/978-981-97-5675-9_36
DOI: https://doi.org/10.1007/978-981-97-5675-9_36
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5674-2
Online ISBN: 978-981-97-5675-9
eBook Packages: Computer Science, Computer Science (R0)