
Improving Video Segmentation via Dynamic Anchor Queries

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Modern video segmentation methods perform cross-frame object association via feature transitions between anchor and target queries. When objects appear continuously, these transitions are smooth, and the methods achieve satisfactory tracking performance. However, the emergence and disappearance of objects interrupt the smooth feature transition and widen the feature gap between anchor and target queries, so these methods underperform on newly emerging and disappearing objects, which are common in the real world. We introduce Dynamic Anchor Queries (DAQ) to shorten the transition gap by dynamically generating anchor queries from the features of potential emergence and disappearance candidates. We further introduce a query-level object Emergence and Disappearance Simulation (EDS) strategy, which unlocks DAQ's potential at no additional cost. Finally, we combine the proposed DAQ and EDS with the previous method DVIS to obtain DVIS-DAQ. Extensive experiments demonstrate that DVIS-DAQ achieves new state-of-the-art (SOTA) performance on five mainstream video segmentation benchmarks.
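The core DAQ idea described above, replacing fixed anchor queries with anchors seeded from the features of emergence candidates, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, the score threshold, and the flat feature shapes are all hypothetical assumptions.

```python
import numpy as np

def dynamic_anchor_queries(tracked_anchors, candidate_feats, candidate_scores,
                           score_thresh=0.5):
    """Hypothetical DAQ sketch: extend the tracked anchor queries with new
    anchors initialized directly from high-scoring emergence candidates, so a
    newly appearing object starts its feature transition from a point close
    to its target query rather than from a generic learnable anchor.

    tracked_anchors:  (N, C) anchor queries for objects already being tracked
    candidate_feats:  (M, C) features of potential newly emerging objects
    candidate_scores: (M,)   confidence that each candidate is a real new object
    """
    keep = candidate_scores > score_thresh       # select likely new objects
    new_anchors = candidate_feats[keep]          # seed anchors from their features
    if new_anchors.size == 0:
        return tracked_anchors                   # no emergence this frame
    return np.concatenate([tracked_anchors, new_anchors], axis=0)
```

Under this sketch, the anchor set grows and shrinks per frame with the scene's objects instead of staying fixed, which is what shortens the transition gap for emerging objects.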

Y. Zhou and T. Zhang: the first two authors contributed equally. This work was performed while Tao Zhang was an intern at Skywork AI.



Author information


Corresponding author

Correspondence to Xiangtai Li.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1434 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zhou, Y., Zhang, T., Ji, S., Yan, S., Li, X. (2025). Improving Video Segmentation via Dynamic Anchor Queries. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15108. Springer, Cham. https://doi.org/10.1007/978-3-031-72973-7_26


  • DOI: https://doi.org/10.1007/978-3-031-72973-7_26

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72972-0

  • Online ISBN: 978-3-031-72973-7

  • eBook Packages: Computer Science, Computer Science (R0)
