Abstract
Modern video segmentation methods associate objects across frames through feature transitions between anchor and target queries. When these transitions are smooth, such methods track continuously visible objects well. However, object emergence and disappearance interrupt the transition and widen the feature gap between anchor and target queries, so these methods consistently underperform on newly emerging and disappearing objects, which are common in the real world. We introduce Dynamic Anchor Queries (DAQ) to narrow this gap by dynamically generating anchor queries from the features of potential emergence and disappearance candidates. We further introduce a query-level object Emergence and Disappearance Simulation (EDS) strategy, which unlocks DAQ's full potential at no additional cost. Finally, we combine the proposed DAQ and EDS with the prior method DVIS to obtain DVIS-DAQ. Extensive experiments demonstrate that DVIS-DAQ achieves new state-of-the-art (SOTA) performance on five mainstream video segmentation benchmarks.
Y. Zhou and T. Zhang: The first two authors contributed equally. This work was performed while Tao Zhang was an intern at Skywork AI.
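To make the abstract's mechanism concrete, below is a minimal PyTorch sketch of the two ideas it describes: seeding anchor queries from candidate features (DAQ) and simulating emergence/disappearance at the query level during training (EDS). This is an illustrative assumption of one plausible form, not the paper's actual implementation; all names and parameters here (DynamicAnchorQueries, score_thresh, simulate_emergence_disappearance, drop_prob, noise_std) are hypothetical.

```python
# A hedged sketch of the DAQ + EDS ideas from the abstract, not the paper's code.
import torch
import torch.nn as nn


class DynamicAnchorQueries(nn.Module):
    """Keeps tracked queries and spawns new anchor queries directly from the
    features of high-scoring emergence candidates, so a new object's anchor
    already lies close to its target query in feature space (assumed form)."""

    def __init__(self, dim: int = 256, score_thresh: float = 0.5):
        super().__init__()
        self.score_thresh = score_thresh  # hypothetical candidate cutoff
        self.proj = nn.Linear(dim, dim)   # project candidate features into query space

    def forward(self, tracked_queries, cand_feats, cand_scores):
        # tracked_queries: (N_track, dim) queries carried over from earlier frames
        # cand_feats:      (N_cand, dim)  per-frame features of unmatched candidates
        # cand_scores:     (N_cand,)      objectness scores for those candidates
        emerging = cand_scores > self.score_thresh
        new_anchors = self.proj(cand_feats[emerging])  # content-seeded anchors
        return torch.cat([tracked_queries, new_anchors], dim=0)


def simulate_emergence_disappearance(queries, drop_prob=0.3, noise_std=0.1):
    """Query-level EDS (one assumed form): randomly drop queries to mimic
    disappearance and inject perturbed copies to mimic emergence, purely as a
    training-time augmentation with no extra inference cost."""
    keep = torch.rand(queries.size(0), device=queries.device) > drop_prob
    survivors = queries[keep]                                    # disappearance
    fakes = survivors + noise_std * torch.randn_like(survivors)  # emergence
    return torch.cat([survivors, fakes], dim=0)
```

In a full system, the returned anchors would initialize the next frame's decoder queries; the threshold and Gaussian noise here merely stand in for whatever candidate selection and simulation scheme the paper actually uses.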
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhou, Y., Zhang, T., Ji, S., Yan, S., Li, X. (2025). Improving Video Segmentation via Dynamic Anchor Queries. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15108. Springer, Cham. https://doi.org/10.1007/978-3-031-72973-7_26
DOI: https://doi.org/10.1007/978-3-031-72973-7_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72972-0
Online ISBN: 978-3-031-72973-7
eBook Packages: Computer Science (R0)