Three Things Everyone Should Know About Vision Transformers

Touvron, Hugo; Cord, Matthieu; El-Nouby, Alaaeldin; Verbeek, Jakob; Jégou, Hervé

doi:10.1007/978-3-031-20053-3_29

Hugo Touvron^12,13,
Matthieu Cord¹³,
Alaaeldin El-Nouby^12,14,
Jakob Verbeek¹² &
…
Hervé Jégou¹²

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13684))

Included in the following conference series:

European Conference on Computer Vision

4325 Accesses
51 Citations

Abstract

After their initial success in natural language processing, transformer architectures have rapidly gained traction in computer vision, providing state-of-the-art results for tasks such as image classification, detection, segmentation, and video analysis. We offer three insights based on simple and easy to implement variants of vision transformers. (1) The residual layers of vision transformers, which are usually processed sequentially, can to some extent be processed efficiently in parallel without noticeably affecting the accuracy. (2) Fine-tuning the weights of the attention layers is sufficient to adapt vision transformers to a higher resolution and to other classification tasks. This saves compute, reduces the peak memory consumption at fine-tuning time, and allows sharing the majority of weights across tasks. (3) Adding MLP-based patch pre-processing layers improves Bert-like self-supervised training based on patch masking. We evaluate the impact of these design choices using the ImageNet-1k dataset, and confirm our findings on the ImageNet-v2 test set. Transfer performance is measured across six smaller datasets.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 89.00; Price excludes VAT (USA)

Softcover Book: USD 119.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

The Current State of the Art in Deep Learning for Image Classification: A Review

Pyramid Swin Transformer for Multi-task: Expanding to More Computer Vision Tasks

Deep Architectures in Visual Transfer Learning

Notes

1.
We have not found any papers in the literature analyzing the effect of width versus depth for ViT on common GPUs and CPUs.

References

Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: ViVit: a video vision transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6836–6846 (2021)
Google Scholar
Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
Bao, H., Dong, L., Wei, F.: BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021)
Bello, I., Zoph, B., Vaswani, A., Shlens, J., Le, Q.V.: Attention augmented convolutional networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3286–3295 (2019)
Google Scholar
Berriel, R., et al.: Budget-aware adapters for multi-domain learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 382–391 (2019)
Google Scholar
Brown, T.B., et al.: Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Chapter Google Scholar
Caron, M., et al.: Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294 (2021)
Chang, H., Zhang, H., Jiang, L., Liu, C., Freeman, W.T.: MaskGIT: masked generative image transformer. arXiv preprint arXiv:2202.04200 (2022)
Chen, X., Xie, S., He, K.: An empirical study of training self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9640–9649 (2021)
Google Scholar
d’Ascoli, S., Touvron, H., Leavitt, M.L., Morcos, A.S., Biroli, G., Sagun, L.: ConViT: improving vision transformers with soft convolutional inductive biases. In: International Conference on Machine Learning, pp. 2286–2296. PMLR (2021)
Google Scholar
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019)
Google Scholar
Ding, X., Zhang, X., Ma, N., Han, J., Ding, G., Sun, J.: RepVGG: making VGG-style convnets great again. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13733–13742 (2021)
Google Scholar
Ding, X., Zhang, X., Han, J., Ding, G.: RepMLP: re-parameterizing convolutions into fully-connected layers for image recognition. arXiv preprint arXiv:2105.01883 (2021)
Dong, X., et al.: PeCo: perceptual codebook for BERT pre-training of vision transformers. arXiv preprint arXiv:2111.12710 (2021)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2021)
Google Scholar
El-Nouby, A., Izacard, G., Touvron, H., Laptev, I., Jegou, H., Grave, E.: Are large-scale datasets necessary for self-supervised pre-training? arXiv preprint arXiv:2112.10740 (2021)
El-Nouby, A., et al.: XCiT: cross-covariance image transformers. In: NeurIPS (2021)
Google Scholar
Fan, H., et al.: Multiscale vision transformers. arXiv preprint arXiv:2104.11227 (2021)
Goyal, A., Bochkovskiy, A., Deng, J., Koltun, V.: Non-deep networks. arXiv preprint arXiv:2110.07641 (2021)
Graham, B., et al.: LeViT: a vision transformer in convnet’s clothing for faster inference. arXiv preprint arXiv:2104.01136 (2021)
Guo, Y., Shi, H., Kumar, A., Grauman, K., Simunic, T., Feris, R.S.: SpotTune: transfer learning through adaptive fine-tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4805–4814 (2019)
Google Scholar
Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in transformer. arXiv preprint arXiv:2103.00112 (2021)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377 (2021)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Google Scholar
He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027 (2016)
Hendrycks, D., Gimpel, K.: Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415 (2016)
Horn, G.V., et al.: The inaturalist challenge 2017 dataset. arXiv preprint arXiv:1707.06642 (2017)
Houlsby, N., et al.: Parameter-efficient transfer learning for NLP. In: International Conference on Machine Learning, pp. 2790–2799. PMLR (2019)
Google Scholar
Huang, G., Sun, Yu., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 646–661. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_39
Chapter Google Scholar
Hudson, D.A., Zitnick, C.L.: Generative adversarial transformers. In: International Conference on Machine Learning, pp. 4487–4499. PMLR (2021)
Google Scholar
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456. PMLR (2015)
Google Scholar
Karita, S., Chen, N., Hayashi, T., et al.: A comparative study on transformer vs RNN in speech applications. arXiv preprint arXiv:1909.06317 (2019)
Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3D object representations for fine-grained categorization. In: IEEE Workshop on 3D Representation and Recognition (2013)
Google Scholar
Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2012)
Article Google Scholar
Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech. rep., CIFAR (2009)
Google Scholar
Lample, G., Charton, F.: Deep learning for symbolic mathematics. arXiv preprint arXiv:1912.01412 (2019)
Li, X., Wang, W., Hu, X., Yang, J.: Selective kernel networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 510–519 (2019)
Google Scholar
Liu, H., Dai, Z., So, D.R., Le, Q.V.: Pay attention to MLPs. arXiv preprint arXiv:2105.08050 (2021)
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)
Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. arXiv preprint arXiv:2201.03545 (2022)
Loshchilov, I., Hutter, F.: Fixing weight decay regularization in Adam. arXiv preprint arXiv:1711.05101 (2017)
Lüscher, C., Beck, E., Irie, K., et al.: RWTH ASR systems for LibriSpeech: hybrid vs attention. In: Interspeech (2019)
Google Scholar
Mahabadi, R.K., Ruder, S., Dehghani, M., Henderson, J.: Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks. In: ACL/IJCNLP (2021)
Google Scholar
Mancini, M., Ricci, E., Caputo, B., Bulò, S.R.: Adding new tasks to a single network with weight transformations using binary masks. In: European Conference on Computer Vision Workshops (2018)
Google Scholar
Melas-Kyriazi, L.: Do you even need attention? A stack of feed-forward layers does surprisingly well on ImageNet. arXiv preprint arXiv:2105.02723 (2021)
Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing (2008)
Google Scholar
Nouby, A.E., Izacard, G., Touvron, H., Laptev, I., Jégou, H., Grave, E.: Are large-scale datasets necessary for self-supervised pre-training? arXiv preprint arXiv:2112.10740 (2021)
Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Learning and transferring mid-level image representations using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1717–1724 (2014)
Google Scholar
Pfeiffer, J., Rücklé, A., Poth, C., Kamath, A., Vulic, I., Ruder, S., Cho, K., Gurevych, I.: AdapterHub: A framework for adapting transformers. In: EMNLP (2020)
Google Scholar
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)
Google Scholar
Ramesh, A., et al.: Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092 (2021)
Rebuffi, S.A., Bilen, H., Vedaldi, A.: Efficient parametrization of multi-domain deep neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8119–8127 (2018)
Google Scholar
Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do ImageNet classifiers generalize to ImageNet? In: International Conference on Machine Learning, pp. 5389–5400. PMLR (2019)
Google Scholar
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
Article MathSciNet Google Scholar
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)
Google Scholar
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
Google Scholar
Tolstikhin, I., et al.: MLP-Mixer: an all-MLP architecture for vision. arXiv preprint arXiv:2105.01601 (2021)
Touvron, H., et al.: ResMLP: feedforward networks for image classification with data-efficient training. arXiv preprint arXiv:2105.03404 (2021)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357. PMLR (2021)
Google Scholar
Touvron, H., et al.: Augmenting convolutional networks with attention-based aggregation. arXiv preprint arXiv:2112.13692 (2021)
Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper with image transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 32–42(2021)
Google Scholar
Touvron, H., Vedaldi, A., Douze, M., Jegou, H.: Fixing the train-test resolution discrepancy. Adv. Neural Inf. Process. Syst. 32 (2019)
Google Scholar
Vaswani, A., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)
Google Scholar
Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122 (2021)
Wei, C., Fan, H., Xie, S., Wu, C.Y., Yuille, A., Feichtenhofer, C.: Masked feature prediction for self-supervised visual pre-training. arXiv preprint arXiv:2112.09133 (2021)
Wightman, R., Touvron, H., Jégou, H.: ResNet strikes back: an improved training procedure in timm. arXiv preprint arXiv:2110.00476 (2021)
Wu, H., et al.: CvT: introducing convolutions to vision transformers. arXiv preprint arXiv:2103.15808 (2021)
Xiao, T., Singh, M., Mintun, E., Darrell, T., Dollár, P., Girshick, R.B.: Early convolutions help transformers see better. arXiv preprint arXiv:2106.14881 (2021)
Xie, Z., et al.: SimMIM: a simple framework for masked image modeling. arXiv preprint arXiv:2111.09886 (2021)
Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? arXiv preprint arXiv:1411.1792 (2014)
Yuan, L., et al.: Tokens-to-Token ViT: training vision transformers from scratch on ImageNet. arXiv preprint arXiv:2101.11986 (2021)
Zagoruyko, S., Komodakis, N.: Wide residual networks. arXiv preprint arXiv:1605.07146 (2016)
Zhou, J., et al.: iBOT: image BERT pre-training with online tokenizer. International Conference on Learning Representations (2022)
Google Scholar

Download references

Acknowledgement

We thank Francisco Massa for valuable discussions and insights about optimizing the implementation of block parallelization.

Author information

Authors and Affiliations

Meta AI, Fundamental AI Research (FAIR), Paris, France
Hugo Touvron, Alaaeldin El-Nouby, Jakob Verbeek & Hervé Jégou
Sorbonne University, Paris, France
Hugo Touvron & Matthieu Cord
INRIA, Paris, France
Alaaeldin El-Nouby

Authors

Hugo Touvron
View author publications
You can also search for this author in PubMed Google Scholar
Matthieu Cord
View author publications
You can also search for this author in PubMed Google Scholar
Alaaeldin El-Nouby
View author publications
You can also search for this author in PubMed Google Scholar
Jakob Verbeek
View author publications
You can also search for this author in PubMed Google Scholar
Hervé Jégou
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Hugo Touvron .

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 38 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Touvron, H., Cord, M., El-Nouby, A., Verbeek, J., Jégou, H. (2022). Three Things Everyone Should Know About Vision Transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13684. Springer, Cham. https://doi.org/10.1007/978-3-031-20053-3_29

Download citation

DOI: https://doi.org/10.1007/978-3-031-20053-3_29
Published: 06 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20052-6
Online ISBN: 978-3-031-20053-3
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics