Abstract
This paper addresses actor and action video segmentation from natural language: given a video and a language query, the goal is to segment the actor and the action described by the query. Existing methods focus on elaborate multimodal feature-fusion networks that combine visual and linguistic features into a multimodal representation learnt directly from the labeled segmentation task. In this paper, we propose a novel self-supervised meta auxiliary learning method that improves the primary segmentation task by adding an auxiliary task for better generalization. The auxiliary task reconstructs the input sentence representation, so that the multimodal representation can be adapted to a specific query. Moreover, the auxiliary task requires no additional labels, and it can also be used at test time to update the multimodal representation for a specific query in a self-supervised way.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (Grant No. 62102289) and in part by the Zhejiang Provincial Natural Science Foundation (Grant No. LQ22F020005).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Ye, L., Wang, Z. (2024). Self-supervised Meta Auxiliary Learning for Actor and Action Video Segmentation from Natural Language. In: Fang, L., Pei, J., Zhai, G., Wang, R. (eds) Artificial Intelligence. CICAI 2023. Lecture Notes in Computer Science(), vol 14473. Springer, Singapore. https://doi.org/10.1007/978-981-99-8850-1_26
DOI: https://doi.org/10.1007/978-981-99-8850-1_26
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8849-5
Online ISBN: 978-981-99-8850-1