PointerNet with Local and Global Contexts for Natural Language Moment Localization

Ye, Linwei; Liu, Zhi; Wang, Yang

doi:10.1007/978-981-99-8850-1_25

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14473)

Included in the following conference series:

CAAI International Conference on Artificial Intelligence

Abstract

We consider the problem of natural language moment localization: given an untrimmed video and a natural language query, the goal is to automatically retrieve the moment in the video that the query sentence refers to. Most existing methods project visual and linguistic data into a feature embedding space, then either match semantic similarity or rank a set of pre-defined segments to select the moment. In this paper, we propose a novel PointerNet with local and global contexts to solve this problem. Our model first uses a recurrent network over words so that visual and linguistic features interact in a fine-grained fashion. This word recurrence represents each clip as a multimodal feature that captures the interaction of that clip with every word in the query sentence. The model then applies a bi-directional recurrent network over all clips in the video. This clip recurrence refines the local context information of each clip and produces a global context representation of the entire video. Finally, the global video context and the local context of each clip are jointly used to determine the start and end positions of the moment. Extensive experimental results demonstrate the effectiveness of our proposed method.
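
The final pointing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring function (a weighted elementwise product of each clip's local context with the global context), the weight vectors `w_start` and `w_end`, and the toy feature values are all assumptions chosen for illustration. The sketch only shows how per-clip local contexts and one global context could jointly yield a start distribution, an end distribution, and a selected span.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pointer_distribution(
    local_ctx: List[List[float]],
    global_ctx: List[float],
    weights: List[float],
) -> List[float]:
    """Distribution over clips for one boundary (start or end).

    Assumed scoring rule (hypothetical, not from the paper):
    score_t = sum_i weights[i] * local_ctx[t][i] * global_ctx[i]
    """
    scores = [
        sum(w * h * g for w, h, g in zip(weights, clip, global_ctx))
        for clip in local_ctx
    ]
    return softmax(scores)


def best_span(p_start: List[float], p_end: List[float]) -> Tuple[int, int]:
    """Pick (start, end) with start <= end maximizing p_start * p_end."""
    best, best_score = (0, 0), -1.0
    for s, ps in enumerate(p_start):
        for e in range(s, len(p_end)):
            score = ps * p_end[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Toy inputs: 4 clips with 2-dimensional local contexts (made-up values).
local_ctx = [[0.1, 0.0], [2.0, 0.1], [0.2, 0.3], [0.1, 2.0]]
global_ctx = [1.0, 1.0]   # stand-in for the global video representation
w_start = [1.0, 0.0]      # hypothetical learned weights for the start pointer
w_end = [0.0, 1.0]        # hypothetical learned weights for the end pointer

p_start = pointer_distribution(local_ctx, global_ctx, w_start)
p_end = pointer_distribution(local_ctx, global_ctx, w_end)
span = best_span(p_start, p_end)
print(span)  # (1, 3)
```

With these toy values the start distribution peaks at clip 1 and the end distribution at clip 3, so the selected moment spans clips 1 through 3. In a trained model the local and global contexts would come from the word and clip recurrences, and the weights would be learned.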

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Grant No. 62102289) and in part by the Zhejiang Provincial Natural Science Foundation (Grant No. LQ22F020005).

Author information

Authors and Affiliations

College of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou, China
Linwei Ye
School of Communication and Information Engineering, Shanghai University, Shanghai, China
Zhi Liu
Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada
Yang Wang

Corresponding author

Correspondence to Linwei Ye.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Lu Fang
Duke University, Durham, NC, USA
Jian Pei
Shanghai Jiao Tong University, Shanghai, China
Guangtao Zhai
Chinese Academy of Sciences, Beijing, China
Ruiping Wang

About this paper

Cite this paper

Ye, L., Liu, Z., Wang, Y. (2024). PointerNet with Local and Global Contexts for Natural Language Moment Localization. In: Fang, L., Pei, J., Zhai, G., Wang, R. (eds) Artificial Intelligence. CICAI 2023. Lecture Notes in Computer Science, vol 14473. Springer, Singapore. https://doi.org/10.1007/978-981-99-8850-1_25

DOI: https://doi.org/10.1007/978-981-99-8850-1_25
Published: 04 February 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8849-5
Online ISBN: 978-981-99-8850-1
eBook Packages: Computer Science, Computer Science (R0)
