Abstract
We attempt to reduce the computational costs of vision transformers (ViTs), which increase quadratically with the number of tokens. We present a novel training paradigm that trains only one ViT model at a time, yet is capable of providing improved image recognition performance at various computational costs. The trained ViT model, termed super vision transformer (SuperViT), is empowered with the versatile ability to process incoming patches of multiple sizes and to preserve informative tokens at multiple keeping rates (the ratio of tokens kept), achieving good hardware efficiency at inference, given that the available hardware resources often change over time. Experimental results on ImageNet demonstrate that our SuperViT considerably reduces the computational costs of ViT models while even improving performance. For example, we reduce the FLOPs of DeiT-S by 2\(\times\) while increasing the Top-1 accuracy by 0.2%, and by 1.5\(\times\) with a 0.7% increase. Our SuperViT also significantly outperforms existing studies on efficient vision transformers. For example, when consuming the same amount of FLOPs, our SuperViT surpasses the recent state-of-the-art EViT by 1.1% when using DeiT-S as the backbone. The project of this work is made publicly available at https://github.com/lmbxmu/SuperViT.
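To make the training paradigm concrete, the sketch below trains one shared ViT under several (input size, token keeping rate) configurations in every optimization step. This is a minimal PyTorch illustration only: the tiny backbone, the candidate size and rate grids, and the random token selection are assumptions for exposition, not the released SuperViT implementation (see the repository above for the actual code, which scores tokens by informativeness rather than at random).

```python
# Minimal sketch: one shared ViT optimized under multiple (input size, keeping rate)
# configurations per step, so a single set of weights can serve different budgets.
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyViT(nn.Module):
    """Toy ViT accepting a variable number of patch tokens (positional embeddings omitted)."""

    def __init__(self, patch=16, dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True,
                                           norm_first=True)  # pre-norm, as noted in the paper
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x, keep_rate=1.0):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        if keep_rate < 1.0:
            # Keep a subset of tokens; random selection is only a placeholder for
            # SuperViT's informativeness-based token preservation.
            n_keep = max(1, int(tokens.shape[1] * keep_rate))
            idx = torch.randperm(tokens.shape[1], device=tokens.device)[:n_keep]
            tokens = tokens[:, idx]
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        out = self.blocks(torch.cat([cls, tokens], dim=1))
        return self.head(out[:, 0])


# Hypothetical configuration grids; the grids used in the paper may differ.
INPUT_SIZES = [128, 160, 192, 224]   # incoming patches of multiple sizes
KEEP_RATES = [0.5, 0.7, 1.0]         # token keeping rates

model = TinyViT()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
criterion = nn.CrossEntropyLoss()


def train_step(images, labels):
    """One update of the single shared model over all sampled configurations."""
    optimizer.zero_grad()
    for size in INPUT_SIZES:                      # every step covers all input sizes ...
        keep = random.choice(KEEP_RATES)          # ... and a sampled keeping rate
        x = F.interpolate(images, size=(size, size), mode='bilinear', align_corners=False)
        loss = criterion(model(x, keep_rate=keep), labels) / len(INPUT_SIZES)
        loss.backward()                           # gradients accumulate on one set of weights
    optimizer.step()
```

At inference, the single trained checkpoint can then be run with whichever (input size, keeping rate) pair matches the currently available hardware budget.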





Data Availability
The dataset ImageNet-1k for this study can be downloaded at: https://www.image-net.org/download.php. The dataset CIFAR-100 for this study can be downloaded at: https://www.cs.toronto.edu/~kriz/cifar.html. The dataset ADE20K for this study can be downloaded at: https://groups.csail.mit.edu/vision/datasets/ADE20K/.
Code Availability
Code is made publicly available at https://github.com/lmbxmu/SuperViT.
Notes
Layer normalization is usually inserted before MHSA and FFN. We omit it here for brevity.
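For reference, a pre-norm block with this placement might look as follows; this is a generic PyTorch sketch, not the paper's code, and the dimensions and GELU activation are common ViT defaults assumed here.

```python
# Generic pre-norm transformer block: LayerNorm is applied before MHSA and before the FFN.
import torch.nn as nn


class PreNormBlock(nn.Module):
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                                  # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # LayerNorm -> MHSA -> residual
        return x + self.ffn(self.norm2(x))                 # LayerNorm -> FFN  -> residual
```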
References
Arnab, A., Dehghani, M., Heigold, G., et al. (2021). Vivit: A video vision transformer. In IEEE international conference on computer vision (pp. 6836–6846).
Bertasius, G., Wang, H., Torresani, L. (2021). Is space-time attention all you need for video understanding? In International conference on machine learning.
Cai, H., Gan, C., Wang, T., et al. (2019). Once-for-all: Train one network and specialize it for efficient deployment. In The international conference on learning representations.
Carion, N., Massa, F., Synnaeve, G., et al. (2020). End-to-end object detection with transformers. In The European conference on computer vision (pp. 213–229).
Chavan, A., Shen, Z., Liu, Z., et al. (2022). Vision transformer slimming: Multi-dimension searching in continuous optimization space. In IEEE conference on computer vision and pattern Recognition (pp. 4931–4941).
Chen, C. F. R., Fan, Q., Panda, R. (2021). Crossvit: Cross-attention multi-scale vision transformer for image classification. In IEEE conference on computer vision and pattern recognition (pp. 357–366).
Chen, M., Lin, M., Li, K., et al. (2022a). Cf-vit: A general coarse-to-fine method for vision transformer. arXiv:2203.03821
Chen, Y., Dai, X., Chen, D., et al. (2022b). Mobile-former: Bridging mobilenet and transformer. In IEEE conference on computer vision and pattern recognition (pp. 5270–5279).
Chu, X., Tian, Z., Wang, Y., et al. (2021a). Twins: Revisiting the design of spatial attention in vision transformers. In Advances in neural information processing systems (pp. 9355–9366).
Chu, X., Tian, Z., Zhang, B., et al. (2021b). Conditional positional encodings for vision transformers. arXiv:2102.10882
Dai, J., Qi, H., Xiong, Y., et al. (2017). Deformable convolutional networks. In IEEE international conference on computer vision (pp. 764–773).
Deng, J., Dong, W., Socher, R., et al. (2009). Imagenet: A large-scale hierarchical image database. In IEEE conference on computer vision and pattern recognition (pp. 248–255).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. In The international conference on learning representations.
Graham, B., El-Nouby, A., Touvron, H., et al. (2021). Levit: A vision transformer in convnet’s clothing for faster inference. In IEEE international conference on computer vision (pp. 12,259–12,269).
Guo, J., Han, K., Wu, H., et al. (2022). Cmt: Convolutional neural networks meet vision transformers. In IEEE conference on computer vision and pattern recognition (pp. 12,175–12,185).
Han, K., Xiao, A., Wu, E., et al. (2021). Transformer in transformer. In Advances in neural information processing systems (pp. 15,908–15,919).
Han, K., Wang, Y., Chen, H., et al. (2022). A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (gelus). arXiv:1606.08415
Howard, A. G., Zhu, M., Chen, B., et al. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861
Huang, G., Chen, D., Li, T., et al. (2018). Multi-scale dense networks for resource efficient image classification. In The international conference on learning representations.
Huang, L., Tan, J., Liu, J., et al. (2020). Hand-transformer: Non-autoregressive structured modeling for 3d hand pose estimation. In The European conference on computer vision (pp. 17–33).
Huang, Z., Ben, Y., Luo, G., et al. (2021). Shuffle transformer: Rethinking spatial shuffle for vision transformer. arXiv:2106.03650
Jiang, Z. H., Hou, Q., Yuan, L., et al. (2021). All tokens matter: Token labeling for training better vision transformers. In Advances in neural information processing systems (pp. 18,590–18,602).
Khan, S., Naseer, M., Hayat, M., et al. (2021). Transformers in vision: A survey. ACM Computing Surveys.
Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
Li, W., Wang, X., Xia, X., et al. (2022). Sepvit: Separable vision transformer. arXiv:2203.15380
Liang, J., Cao, J., Sun, G., et al. (2021). Swinir: Image restoration using swin transformer. In The international conference on computer vision (pp. 1833–1844).
Liang, Y., Ge, C., Tong, Z., et al. (2022). Not all patches are what you need: Expediting vision transformers via token reorganizations. In The international conference on learning representations.
Liu, Z., Lin, Y., Cao, Y., et al. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE international conference on computer vision (pp. 10,012–10,022).
Pan, B., Panda, R., Jiang, Y., et al. (2021). Ia-\(\text{red}^{2}\): Interpretability-aware redundancy reduction for vision transformers. In Advances in neural information processing systems (pp. 24,898–24,911).
Rao, Y., Zhao, W., Liu, B., et al. (2021). Dynamicvit: Efficient vision transformers with dynamic token sparsification. In Advances in neural information processing systems (pp. 13,937–13,949).
Ren, S., Zhou, D., He, S., et al. (2022). Shunted self-attention via multi-scale token aggregation. In IEEE conference on computer vision and pattern recognition (pp. 10,853–10,862).
Sun, C., Shrivastava, A., Singh, S., et al. (2017). Revisiting unreasonable effectiveness of data in deep learning era. In The international conference on computer vision (pp. 843–852).
Tang, Y., Han, K., Wang, Y., et al. (2022). Patch slimming for efficient vision transformers. In IEEE conference on computer vision and pattern recognition (pp. 12,165–12,174).
Touvron, H., Cord, M., Douze, M., et al. (2021a). Training data-efficient image transformers and distillation through attention. In International conference on machine learning (pp. 10,347–10,357).
Touvron, H., Cord, M., Sablayrolles, A., et al. (2021b). Going deeper with image transformers. In The international conference on computer vision (pp. 32–42).
Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need. In Advances in neural information processing systems.
Wang, W., Xie, E., Li, X., et al. (2021a). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In The international conference on computer vision (pp. 568–578).
Wang, Y., Huang, R., Song, S., et al. (2021b). Not all images are worth 16x16 words: Dynamic transformers for efficient image recognition. In Advances in neural information processing systems (pp. 11,960–11,973).
Xia, Z., Pan, X., Song, S., et al. (2022). Vision transformer with deformable attention. In IEEE conference on computer vision and pattern recognition (pp. 4794–4803).
Xiao, T., Liu, Y., Zhou, B., et al. (2018). Unified perceptual parsing for scene understanding. In The European conference on computer vision (pp. 418–434).
Xie, E., Wang, W., Yu, Z., et al. (2021). Segformer: Simple and efficient design for semantic segmentation with transformers. In Advances in neural information processing systems (pp. 12,077–12,090).
Xu, W., Xu, Y., Chang, T., et al. (2021). Co-scale conv-attentional image transformers. In IEEE conference on computer vision and pattern recognition (pp. 9981–9990).
Xu, Y., Zhang, Z., Zhang, M., et al. (2022). Evo-vit: Slow-fast token evolution for dynamic vision transformer. In AAAI conference on artificial intelligence (pp. 2964–2972).
Yang, L., Han, Y., Chen, X., et al. (2020a). Resolution adaptive networks for efficient inference. In IEEE conference on computer vision and pattern recognition (pp. 2369–2378).
Yang, T., Zhu, S., Chen, C., et al. (2020b). Mutualnet: Adaptive convnet via mutual learning from network width and resolution. In The European conference on computer vision (pp. 299–315).
Yin, H., Vahdat, A., Alvarez, J., et al. (2022). A-ViT: Adaptive tokens for efficient vision transformer. In IEEE conference on computer vision and pattern recognition (pp. 10,809–10,818).
Yu, J., Yang, L., Xu, N., et al. (2018). Slimmable neural networks. In The international conference on learning representations.
Yuan, L., Chen, Y., Wang, T., et al. (2021). Tokens-to-token vit: Training vision transformers from scratch on imagenet. In The international conference on computer vision (pp. 558–567).
Zamir, S. W., Arora, A., Khan, S., et al. (2022). Restormer: Efficient transformer for high-resolution image restoration. In IEEE conference on computer vision and pattern recognition (pp. 5728–5739).
Zhang, X., Zhou, X., Lin, M., et al. (2018). Shufflenet: An extremely efficient convolutional neural network for mobile devices. In IEEE conference on computer vision and pattern recognition (pp. 6848–6856).
Zheng, S., Lu, J., Zhao, H., et al. (2021). Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In IEEE conference on computer vision and pattern recognition (pp. 6881–6890).
Zhou, B., Zhao, H., Puig, X., et al. (2017). Scene parsing through ade20k dataset. In IEEE conference on computer vision and pattern recognition (pp. 633–641).
Zhu, X., Su, W., Lu, L., et al. (2022). Deformable detr: Deformable transformers for end-to-end object detection. In The international conference on learning representations.
Zhu, Y., Zhu, Y., Du, J., et al. (2021). Make a long image short: Adaptive token length for vision transformers. arXiv:2112.01686
Funding
This work was supported by the National Key R&D Program of China (No. 2022ZD0118202), the National Science Fund for Distinguished Young Scholars (No. 62025603), the National Natural Science Foundation of China (Nos. U21B2037, U22B2051, 62176222, 62176223, 62176226, 62072386, 62072387, 62072389, 62002305 and 62272401), and the Natural Science Foundation of Fujian Province of China (Nos. 2021J01002 and 2022J06001).
Author information
Contributions
Material preparation, data collection and analysis were mostly performed by Mingbao Lin, Mengzhao Chen and Yuxin Zhang. The SuperViT model was originally proposed by Mingbao Lin and Mengzhao Chen, and improved by Chunhua Shen, who also helped revise the manuscript. Rongrong Ji and Liujuan Cao, leaders of this project, engaged in detailed discussions on feasibility and polished the manuscript. Liujuan Cao was also involved in part of the experimental design and paper revision. The first draft of the manuscript was written by Mingbao Lin, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Additional information
Communicated by Nikos Komodakis.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lin, M., Chen, M., Zhang, Y. et al. Super Vision Transformer. Int J Comput Vis 131, 3136–3151 (2023). https://doi.org/10.1007/s11263-023-01861-3