SUMMNet: Using Transformer as a Summary of ConvNet for Image Classification

  • Conference paper
  • Published in: Pattern Recognition (ICPR 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15308)


Abstract

Convolutional neural networks (CNNs) and Vision Transformers (ViTs) are essential in computer vision. While CNNs remain robust and widely used in traditional computer vision tasks, ViTs are rapidly emerging as a new trend. CNNs capture local image features through local dependencies and are computationally efficient, especially when dealing with large-scale image data. ViTs capture global relationships among input image tokens via self-attention. However, the similarity computation between tokens demands substantial computational resources, and its complexity grows quadratically with the number of tokens. To address these issues, we propose a hybrid neural network architecture combining convolution and transformer, called SUMMNet, which consists of four stages, each containing two key blocks: the Local Large Convolution Block (LLCB) and the Global Self-Attention Block (GSAB). The LLCB introduces Large Kernel Convolution Attention (LKCA) to capture local detail features more efficiently. The GSAB employs a new Lightweight Cross-Head Self-Attention (LCHSA) to enhance the interactions between heads for global abstract information, while reducing computational complexity by lowering the dimensionality of the Key and Value in self-attention. The proposed SUMMNet combines the advantages of both CNNs and ViTs in terms of efficiency and effectiveness. We evaluate SUMMNet through extensive experiments, and it shows promising performance on image classification tasks. SUMMNet achieves 84.1% top-1 accuracy with 11.4G FLOPs on the ImageNet-1K image classification task, surpassing Swin-Transformer by 0.6% with 36% fewer parameters and 30% fewer FLOPs. Our source code is available at https://github.com/YaqiLi01/SUMMNet.git.
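
The abstract describes two mechanisms: attention built from large-kernel convolutions (LKCA) inside the LLCB, and self-attention whose Key and Value are reduced in dimensionality (LCHSA) inside the GSAB. Below is a minimal PyTorch sketch of both ideas, modeled on common published designs (VAN-style decomposed large-kernel attention and PVT-style spatial reduction of keys and values). It illustrates the general techniques only, not the authors' implementation: the module names, kernel sizes, and reduction ratio here are assumptions, and LCHSA's cross-head interaction is omitted.

import torch
import torch.nn as nn


class LargeKernelConvAttention(nn.Module):
    # Hypothetical LKCA analogue: a large receptive field decomposed into a
    # depthwise conv, a dilated depthwise conv, and a pointwise conv, whose
    # output gates the input (the usual cheap large-kernel attention trick).
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                     # x: (B, C, H, W)
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return x * attn                       # attention map modulates the input


class ReducedKVAttention(nn.Module):
    # Hypothetical LCHSA analogue (cross-head interaction omitted): keys and
    # values are computed on a spatially reduced feature map, so the QK^T
    # cost drops from O(N^2) to O(N^2 / r^2) for N = H*W tokens.
    def __init__(self, dim, num_heads=4, reduction=4):
        super().__init__()
        self.h = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.sr = nn.Conv2d(dim, dim, reduction, stride=reduction)  # shrink grid
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, hw):                 # x: (B, N, C), N = H*W
        b, n, c = x.shape
        q = self.q(x).reshape(b, n, self.h, c // self.h).transpose(1, 2)
        red = self.sr(x.transpose(1, 2).reshape(b, c, *hw))   # (B, C, H/r, W/r)
        red = red.flatten(2).transpose(1, 2)                  # (B, N/r^2, C)
        kv = self.kv(red).reshape(b, -1, 2, self.h, c // self.h)
        k, v = kv.permute(2, 0, 3, 1, 4).unbind(0)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)

With a reduction ratio of r = 4, for example, a 56×56 token grid produces a 3136×196 attention map instead of 3136×3136, which is the kind of saving behind the FLOPs reduction the abstract reports.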



Acknowledgment

This work is supported by the Shandong Provincial Natural Science Foundation (Grant No. ZR2022MF263), the Science and Technology Program of the University of Jinan (Grant No. XKY1913), the Doctoral Foundation of the University of Jinan (Grant No. XJ2024002305), and the High-Performance Computing Platform at the University of Jinan.

Author information

Corresponding authors

Correspondence to Qingbei Guo or Zhongtao Li.



Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Li, Y., Guo, Q., Li, Z. (2025). SUMMNet: Using Transformer as a Summary of ConvNet for Image Classification. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15308. Springer, Cham. https://doi.org/10.1007/978-3-031-78186-5_18

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-78186-5_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-78185-8

  • Online ISBN: 978-3-031-78186-5

  • eBook Packages: Computer Science, Computer Science (R0)
