Abstract
Modern video segmentation methods associate objects across frames through feature transitions between anchor and target queries. When these transitions are smooth, such methods track continuously visible objects well. However, object emergence and disappearance interrupt the transition and widen the feature gap between anchor and target queries, so these methods consistently underperform on newly emerging and disappearing objects, which are common in the real world. We introduce Dynamic Anchor Queries (DAQ) to narrow this gap by dynamically generating anchor queries from the features of potential emergence and disappearance candidates. We further introduce a query-level object Emergence and Disappearance Simulation (EDS) strategy, which unlocks DAQ's full potential at no additional cost. Finally, we combine the proposed DAQ and EDS with the prior method DVIS to obtain DVIS-DAQ. Extensive experiments demonstrate that DVIS-DAQ achieves new state-of-the-art (SOTA) performance on five mainstream video segmentation benchmarks.
Y. Zhou and T. Zhang: The first two authors contributed equally. This work was performed while Tao Zhang was an intern at Skywork AI.
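To make the abstract's mechanism concrete, below is a minimal PyTorch sketch of the two ideas it describes: seeding anchor queries from candidate features (DAQ) and simulating emergence/disappearance at the query level during training (EDS). This is an illustrative assumption of one plausible form, not the paper's actual implementation; all names and parameters here (DynamicAnchorQueries, score_thresh, simulate_emergence_disappearance, drop_prob, noise_std) are hypothetical.

```python
# A hedged sketch of the DAQ + EDS ideas from the abstract, not the paper's code.
import torch
import torch.nn as nn


class DynamicAnchorQueries(nn.Module):
    """Keeps tracked queries and spawns new anchor queries directly from the
    features of high-scoring emergence candidates, so a new object's anchor
    already lies close to its target query in feature space (assumed form)."""

    def __init__(self, dim: int = 256, score_thresh: float = 0.5):
        super().__init__()
        self.score_thresh = score_thresh  # hypothetical candidate cutoff
        self.proj = nn.Linear(dim, dim)   # project candidate features into query space

    def forward(self, tracked_queries, cand_feats, cand_scores):
        # tracked_queries: (N_track, dim) queries carried over from earlier frames
        # cand_feats:      (N_cand, dim)  per-frame features of unmatched candidates
        # cand_scores:     (N_cand,)      objectness scores for those candidates
        emerging = cand_scores > self.score_thresh
        new_anchors = self.proj(cand_feats[emerging])  # content-seeded anchors
        return torch.cat([tracked_queries, new_anchors], dim=0)


def simulate_emergence_disappearance(queries, drop_prob=0.3, noise_std=0.1):
    """Query-level EDS (one assumed form): randomly drop queries to mimic
    disappearance and inject perturbed copies to mimic emergence, purely as a
    training-time augmentation with no extra inference cost."""
    keep = torch.rand(queries.size(0), device=queries.device) > drop_prob
    survivors = queries[keep]                                    # disappearance
    fakes = survivors + noise_std * torch.randn_like(survivors)  # emergence
    return torch.cat([survivors, fakes], dim=0)
```

In a full system, the returned anchors would initialize the next frame's decoder queries; the threshold and Gaussian noise here merely stand in for whatever candidate selection and simulation scheme the paper actually uses.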
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhou, Y., Zhang, T., Ji, S., Yan, S., Li, X. (2025). Improving Video Segmentation via Dynamic Anchor Queries. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15108. Springer, Cham. https://doi.org/10.1007/978-3-031-72973-7_26
DOI: https://doi.org/10.1007/978-3-031-72973-7_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72972-0
Online ISBN: 978-3-031-72973-7
eBook Packages: Computer Science (R0)