Abstract
We attempt to reduce the computational costs of vision transformers (ViTs), which increase quadratically with the number of tokens. We present a novel training paradigm that trains only one ViT model at a time, yet is capable of providing improved image recognition performance at various computational costs. The trained ViT model, termed super vision transformer (SuperViT), is empowered with the versatile ability to process incoming patches of multiple sizes and to preserve informative tokens at multiple keeping rates (the ratio of tokens kept), achieving good hardware efficiency at inference, given that the available hardware resources often change over time. Experimental results on ImageNet demonstrate that our SuperViT considerably reduces the computational costs of ViT models while even improving performance. For example, we reduce the FLOPs of DeiT-S by 2\(\times\) while increasing the Top-1 accuracy by 0.2%, and by 1.5\(\times\) with a 0.7% increase. Our SuperViT also significantly outperforms existing studies on efficient vision transformers. For example, when consuming the same amount of FLOPs, our SuperViT surpasses the recent state-of-the-art EViT by 1.1% when using DeiT-S as the backbone. The project of this work is made publicly available at https://github.com/lmbxmu/SuperViT.
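To make the training paradigm concrete, the sketch below trains one shared ViT under several (input size, token keeping rate) configurations in every optimization step. This is a minimal PyTorch illustration only: the tiny backbone, the candidate size and rate grids, and the random token selection are assumptions for exposition, not the released SuperViT implementation (see the repository above for the actual code, which scores tokens by informativeness rather than at random).

```python
# Minimal sketch: one shared ViT optimized under multiple (input size, keeping rate)
# configurations per step, so a single set of weights can serve different budgets.
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyViT(nn.Module):
    """Toy ViT accepting a variable number of patch tokens (positional embeddings omitted)."""

    def __init__(self, patch=16, dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True,
                                           norm_first=True)  # pre-norm, as noted in the paper
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x, keep_rate=1.0):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        if keep_rate < 1.0:
            # Keep a subset of tokens; random selection is only a placeholder for
            # SuperViT's informativeness-based token preservation.
            n_keep = max(1, int(tokens.shape[1] * keep_rate))
            idx = torch.randperm(tokens.shape[1], device=tokens.device)[:n_keep]
            tokens = tokens[:, idx]
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        out = self.blocks(torch.cat([cls, tokens], dim=1))
        return self.head(out[:, 0])


# Hypothetical configuration grids; the grids used in the paper may differ.
INPUT_SIZES = [128, 160, 192, 224]   # incoming patches of multiple sizes
KEEP_RATES = [0.5, 0.7, 1.0]         # token keeping rates

model = TinyViT()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
criterion = nn.CrossEntropyLoss()


def train_step(images, labels):
    """One update of the single shared model over all sampled configurations."""
    optimizer.zero_grad()
    for size in INPUT_SIZES:                      # every step covers all input sizes ...
        keep = random.choice(KEEP_RATES)          # ... and a sampled keeping rate
        x = F.interpolate(images, size=(size, size), mode='bilinear', align_corners=False)
        loss = criterion(model(x, keep_rate=keep), labels) / len(INPUT_SIZES)
        loss.backward()                           # gradients accumulate on one set of weights
    optimizer.step()
```

At inference, the single trained checkpoint can then be run with whichever (input size, keeping rate) pair matches the currently available hardware budget.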





Data Availability
The dataset ImageNet-1k for this study can be downloaded at: https://www.image-net.org/download.php. The dataset CIFAR-100 for this study can be downloaded at: https://www.cs.toronto.edu/~kriz/cifar.html. The dataset ADE20K for this study can be downloaded at: https://groups.csail.mit.edu/vision/datasets/ADE20K/.
Code Availability
Code is made publicly available at https://github.com/lmbxmu/SuperViT.
Notes
Layer normalization is usually inserted before MHSA and FFN. We omit it here for brevity.
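For reference, a pre-norm block with this placement might look as follows; this is a generic PyTorch sketch, not the paper's code, and the dimensions and GELU activation are common ViT defaults assumed here.

```python
# Generic pre-norm transformer block: LayerNorm is applied before MHSA and before the FFN.
import torch.nn as nn


class PreNormBlock(nn.Module):
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                                  # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # LayerNorm -> MHSA -> residual
        return x + self.ffn(self.norm2(x))                 # LayerNorm -> FFN  -> residual
```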
References
Arnab, A., Dehghani, M., Heigold, G., et al. (2021). Vivit: A video vision transformer. In IEEE international conference on computer vision (pp. 6836–6846).
Bertasius, G., Wang, H., Torresani, L. (2021). Is space-time attention all you need for video understanding? In International conference on machine learning.
Cai, H., Gan, C., Wang, T., et al. (2019). Once-for-all: Train one network and specialize it for efficient deployment. In The international conference on learning representations.
Carion, N., Massa, F., Synnaeve, G., et al. (2020). End-to-end object detection with transformers. In The European conference on computer vision (pp. 213–229).
Chavan, A., Shen, Z., Liu, Z., et al. (2022). Vision transformer slimming: Multi-dimension searching in continuous optimization space. In IEEE conference on computer vision and pattern Recognition (pp. 4931–4941).
Chen, C. F. R., Fan, Q., Panda, R. (2021). Crossvit: Cross-attention multi-scale vision transformer for image classification. In IEEE conference on computer vision and pattern recognition (pp. 357–366).
Chen, M., Lin, M., Li, K., et al. (2022a). Cf-vit: A general coarse-to-fine method for vision transformer. arXiv:2203.03821
Chen, Y., Dai, X., Chen, D., et al. (2022b). Mobile-former: Bridging mobilenet and transformer. In IEEE conference on computer vision and pattern recognition (pp. 5270–5279).
Chu, X., Tian, Z., Wang, Y., et al. (2021a). Twins: Revisiting the design of spatial attention in vision transformers. In Advances in neural information processing systems (pp. 9355–9366).
Chu, X., Tian, Z., Zhang, B., et al. (2021b). Conditional positional encodings for vision transformers. arXiv:2102.10882
Dai, J., Qi, H., Xiong, Y., et al. (2017). Deformable convolutional networks. In IEEE international conference on computer vision (pp. 764–773).
Deng, J., Dong, W., Socher, R., et al. (2009). Imagenet: A large-scale hierarchical image database. In IEEE conference on computer vision and pattern recognition (pp. 248–255).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. In The international conference on learning representations.
Graham, B., El-Nouby, A., Touvron, H., et al. (2021). Levit: A vision transformer in convnet’s clothing for faster inference. In IEEE international conference on computer vision (pp. 12,259–12,269).
Guo, J., Han, K., Wu, H., et al. (2022). Cmt: Convolutional neural networks meet vision transformers. In IEEE conference on computer vision and pattern recognition (pp. 12,175–12,185).
Han, K., Xiao, A., Wu, E., et al. (2021). Transformer in transformer. In Advances in neural information processing systems (pp. 15,908–15,919).
Han, K., Wang, Y., Chen, H., et al. (2022). A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (gelus). arXiv:1606.08415
Howard, A. G., Zhu, M., Chen, B., et al. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861
Huang, G., Chen, D., Li, T., et al. (2018). Multi-scale dense networks for resource efficient image classification. In The international conference on learning representations.
Huang, L., Tan, J., Liu, J., et al. (2020). Hand-transformer: Non-autoregressive structured modeling for 3d hand pose estimation. In The European conference on computer vision (pp. 17–33).
Huang, Z., Ben, Y., Luo, G., et al. (2021). Shuffle transformer: Rethinking spatial shuffle for vision transformer. arXiv:2106.03650
Jiang, Z. H., Hou, Q., Yuan, L., et al. (2021). All tokens matter: Token labeling for training better vision transformers. In Advances in neural information processing systems (pp. 18,590–18,602).
Khan, S., Naseer, M., Hayat, M., et al. (2021). Transformers in vision: A survey. ACM Computing Surveys.
Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
Li, W., Wang, X., Xia, X., et al. (2022). Sepvit: Separable vision transformer. arXiv:2203.15380
Liang, J., Cao, J., Sun, G., et al. (2021). Swinir: Image restoration using swin transformer. In The international conference on computer vision (pp. 1833–1844).
Liang, Y., Ge, C., Tong, Z., et al. (2022). Not all patches are what you need: Expediting vision transformers via token reorganizations. In The international conference on learning representations.
Liu, Z., Lin, Y., Cao, Y., et al. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE international conference on computer vision (pp. 10,012–10,022).
Pan, B., Panda, R., Jiang, Y., et al. (2021). Ia-\(\text{red}^{2}\): Interpretability-aware redundancy reduction for vision transformers. In Advances in neural information processing systems (pp. 24,898–24,911).
Rao, Y., Zhao, W., Liu, B., et al. (2021). Dynamicvit: Efficient vision transformers with dynamic token sparsification. In Advances in neural information processing systems (pp. 13,937–13,949).
Ren, S., Zhou, D., He, S., et al. (2022). Shunted self-attention via multi-scale token aggregation. In IEEE conference on computer vision and pattern recognition (pp. 10,853–10,862).
Sun, C., Shrivastava, A., Singh, S., et al. (2017). Revisiting unreasonable effectiveness of data in deep learning era. In The international conference on computer vision (pp. 843–852).
Tang, Y., Han, K., Wang, Y., et al. (2022). Patch slimming for efficient vision transformers. In IEEE conference on computer vision and pattern recognition (pp. 12,165–12,174).
Touvron, H., Cord, M., Douze, M., et al. (2021a). Training data-efficient image transformers and distillation through attention. In International conference on machine learning (pp. 10,347–10,357).
Touvron, H., Cord, M., Sablayrolles, A., et al. (2021b). Going deeper with image transformers. In The international conference on computer vision (pp. 32–42).
Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need. In Advances in neural information processing systems.
Wang, W., Xie, E., Li, X., et al. (2021a). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In The international conference on computer vision (pp. 568–578).
Wang, Y., Huang, R., Song, S., et al. (2021b). Not all images are worth 16x16 words: Dynamic transformers for efficient image recognition. In Advances in neural information processing systems (pp. 11,960–11,973).
Xia, Z., Pan, X., Song, S., et al. (2022). Vision transformer with deformable attention. In IEEE conference on computer vision and pattern recognition (pp. 4794–4803).
Xiao, T., Liu, Y., Zhou, B., et al. (2018). Unified perceptual parsing for scene understanding. In The European conference on computer vision (pp. 418–434).
Xie, E., Wang, W., Yu, Z., et al. (2021). Segformer: Simple and efficient design for semantic segmentation with transformers. In Advances in neural information processing systems (pp. 12,077–12,090).
Xu, W., Xu, Y., Chang, T., et al. (2021). Co-scale conv-attentional image transformers. In IEEE conference on computer vision and pattern recognition (pp. 9981–9990).
Xu, Y., Zhang, Z., Zhang, M., et al. (2022). Evo-vit: Slow-fast token evolution for dynamic vision transformer. In AAAI conference on artificial intelligence (pp. 2964–2972).
Yang, L., Han, Y., Chen, X., et al. (2020a). Resolution adaptive networks for efficient inference. In IEEE conference on computer vision and pattern recognition (pp. 2369–2378).
Yang, T., Zhu, S., Chen, C., et al. (2020b). Mutualnet: Adaptive convnet via mutual learning from network width and resolution. In The European conference on computer vision (pp. 299–315).
Yin, H., Vahdat, A., Alvarez, J., et al. (2022). A-ViT: Adaptive tokens for efficient vision transformer. In IEEE conference on computer vision and pattern recognition (pp. 10,809–10,818).
Yu, J., Yang, L., Xu, N., et al. (2018). Slimmable neural networks. In The international conference on learning representations.
Yuan, L., Chen, Y., Wang, T., et al. (2021). Tokens-to-token vit: Training vision transformers from scratch on imagenet. In The international conference on computer vision (pp. 558–567).
Zamir, S. W., Arora, A., Khan, S., et al. (2022). Restormer: Efficient transformer for high-resolution image restoration. In IEEE conference on computer vision and pattern recognition (pp. 5728–5739).
Zhang, X., Zhou, X., Lin, M., et al. (2018). Shufflenet: An extremely efficient convolutional neural network for mobile devices. In IEEE conference on computer vision and pattern recognition (pp. 6848–6856).
Zheng, S., Lu, J., Zhao, H., et al. (2021). Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In IEEE conference on computer vision and pattern recognition (pp. 6881–6890).
Zhou, B., Zhao, H., Puig, X., et al. (2017). Scene parsing through ade20k dataset. In IEEE conference on computer vision and pattern recognition (pp. 633–641).
Zhu, X., Su, W., Lu, L., et al. (2022). Deformable detr: Deformable transformers for end-to-end object detection. In The international conference on learning representations.
Zhu, Y., Zhu, Y., Du, J., et al. (2021). Make a long image short: Adaptive token length for vision transformers. arXiv:2112.01686
Funding
This work was supported by the National Key R&D Program of China (No. 2022ZD0118202), the National Science Fund for Distinguished Young Scholars (No. 62025603), the National Natural Science Foundation of China (Nos. U21B2037, U22B2051, 62176222, 62176223, 62176226, 62072386, 62072387, 62072389, 62002305 and 62272401), and the Natural Science Foundation of Fujian Province of China (Nos. 2021J01002 and 2022J06001).
Author information
Contributions
Material preparation, data collection and analysis were mostly performed by Mingbao Lin, Mengzhao Chen and Yuxin Zhang. The SuperViT model was originally proposed by Mingbao Lin and Mengzhao Chen, and improved by Chunhua Shen, who also helped revise the manuscript. Rongrong Ji and Liujuan Cao, leaders of this project, engaged in detailed discussions on feasibility and polished the manuscript. Liujuan Cao was also involved in part of the experimental design and paper revision. The first draft of the manuscript was written by Mingbao Lin, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Additional information
Communicated by Nikos Komodakis.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lin, M., Chen, M., Zhang, Y. et al. Super Vision Transformer. Int J Comput Vis 131, 3136–3151 (2023). https://doi.org/10.1007/s11263-023-01861-3