Abstract
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) are essential in computer vision. While CNNs remain robust and widely used in traditional computer vision tasks, ViTs are rapidly emerging as a new trend. CNNs capture local image features through local dependencies and are computationally efficient, especially on large-scale image data. ViTs capture global relationships among input image tokens through self-attention. However, computing the similarity between tokens demands substantial computational resources, and the complexity grows quadratically with the number of tokens. To address these issues, we propose a hybrid neural network architecture combining convolution and transformer, called SUMMNet, which consists of four stages, each containing two key blocks: the Local Large Convolution Block (LLCB) and the Global Self-Attention Block (GSAB). LLCB introduces Large Kernel Convolution Attention (LKCA) to capture local detail features more efficiently. GSAB employs a new Lightweight Cross-Head Self-Attention (LCHSA) that enhances the interactions between heads for global abstract information while reducing the computational complexity by lowering the dimensionality of the Key and Value in self-attention. The proposed SUMMNet combines the advantages of CNNs and ViTs in both efficiency and effectiveness. Extensive experiments show that SUMMNet delivers promising performance on image classification: it achieves 84.1% top-1 accuracy with 11.4G FLOPs on ImageNet-1K, surpassing Swin-Transformer by 0.6% with 36% fewer parameters and 30% fewer FLOPs. Our source code is available at https://github.com/YaqiLi01/SUMMNet.git.
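Since the abstract describes the two blocks only at a high level, the following is a minimal, illustrative PyTorch sketch of how LKCA and GSAB's LCHSA might be realized; the internal designs below are our assumptions for illustration, not the paper's actual implementation. LKCA is sketched after the decomposed large-kernel attention of the Visual Attention Network (Guo et al., listed in the references below), and LCHSA as self-attention with spatially reduced Key/Value (which cuts the quadratic token-similarity cost) plus a 1x1 convolution that mixes attention maps across heads to model cross-head interaction.

```python
# Illustrative sketch only: the paper does not specify these internals here.
import torch
import torch.nn as nn


class LKCA(nn.Module):
    """Hypothetical Large Kernel Convolution Attention (details assumed)."""

    def __init__(self, dim: int):
        super().__init__()
        # Decompose a large receptive field into depthwise, dilated depthwise,
        # and pointwise convolutions (assumption, following VAN-style LKA).
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return x * attn  # reweight local features with the attention map


class LCHSA(nn.Module):
    """Hypothetical Lightweight Cross-Head Self-Attention (details assumed)."""

    def __init__(self, dim: int, num_heads: int = 4, sr_ratio: int = 2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        # Key/Value are computed on a spatially reduced map, so attention cost
        # drops from O(N^2) toward O(N^2 / sr_ratio^2) (assumption).
        self.sr = nn.Conv2d(dim, dim, sr_ratio, stride=sr_ratio)
        self.kv = nn.Linear(dim, 2 * dim)
        # Cross-head interaction: a 1x1 conv mixes attention logits across
        # heads, treating heads as channels (assumption).
        self.head_mix = nn.Conv2d(num_heads, num_heads, 1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        x_ = self.sr(x.transpose(1, 2).reshape(B, C, H, W))   # downsample K/V map
        x_ = x_.flatten(2).transpose(1, 2)                    # (B, N', C), N' < N
        k, v = self.kv(x_).chunk(2, dim=-1)
        k = k.reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.scale         # (B, heads, N, N')
        attn = self.head_mix(attn)                            # cross-head mixing
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    B, C, H, W = 2, 64, 14, 14
    feat = torch.randn(B, C, H, W)
    print(LKCA(C)(feat).shape)                # torch.Size([2, 64, 14, 14])
    tokens = feat.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
    print(LCHSA(C)(tokens, H, W).shape)       # torch.Size([2, 196, 64])
```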
References
Guo, Q., Wu, X., Yang, X., Xu, T., Si, T.: A new hybrid neural architecture of convolution and transformer for visual recognition (2023). https://github.com/QingbeiGuo/HybridFormer.git
Ding, X., Zhang, X., Han, J., Ding, G.: Scaling up your kernels to 31x31: revisiting large kernel design in cnns. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11963–11975 (2022)
Liu, S., et al.: More convnets in the 2020s: scaling up kernels beyond 51x51 using sparsity. In: International Conference on Learning Representations (ICLR) (2023)
Chen, H., Chu, X., Ren, Y., Zhao, X., Huang, K.: Pelk: parameter-efficient large kernel convnets with peripheral convolution. arXiv preprint arXiv:2403.07589 (2024)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (ICLR) (2021)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357. PMLR (2021)
Zhou, D., et al.: Deepvit: towards deeper vision transformer. arXiv preprint arXiv:2103.11886 (2021)
Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper with image transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 32–42 (2021)
Yuan, L., et al.: Tokens-to-token vit: training vision transformers from scratch on imagenet. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 558–567 (2021)
Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 568–578 (2021)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10012–10022 (2021)
Chen, C.F., Panda, R., Fan, Q.: Regionvit: regional-to-local attention for vision transformers. arXiv preprint arXiv:2106.02689 (2021)
Grainger, R., Paniagua, T., Song, X., Cuntoor, N., Lee, M.W., Wu, T.: Paca-vit: learning patch-to-cluster attention in vision transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18568–18578 (2023)
Li, K., et al.: Uniformer: unifying convolution and self-attention for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 45, 1–18 (2023)
Graham, B., et al.: Levit: a vision transformer in convnet’s clothing for faster inference. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12259–12269 (2021)
Yang, C., et al.: Moat: alternating mobile convolution and attention brings strong vision models. In: International Conference on Learning Representations (ICLR) (2023)
Yu, W., et al.: Metaformer is actually what you need for vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10819–10829 (2022)
Fan, H., et al.: Multiscale vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6824–6835 (2021)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. IEEE (2009)
Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical report, University of Toronto (2009)
Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3d object representations for fine-grained categorization. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 554–561 (2013)
Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.V.: Cats and dogs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3498–3505. IEEE (2012)
Chu, X., et al.: Twins: revisiting the design of spatial attention in vision transformers. Adv. Neural. Inf. Process. Syst. 34, 9355–9366 (2021)
Wang, W., et al.: Crossformer: a versatile vision transformer hinging on cross-scale attention. In: International Conference on Learning Representations (ICLR) (2022)
Pan, Z., Zhuang, B., He, H., Liu, J., Cai, J.: Less is more: pay less attention in vision transformers. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 36, pp. 2035–2043 (2022)
Li, Y., et al.: Mvitv2: improved multiscale vision transformers for classification and detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4804–4814 (2022)
Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11976–11986 (2022)
Xia, Z., Pan, X., Song, S., Li, L.E., Huang, G.: Vision transformer with deformable attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4794–4803 (2022)
Long, S., Zhao, Z., Pi, J., Wang, S., Wang, J.: Beyond attentive tokens: incorporating token importance and diversity for efficient vision transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10334–10343 (2023)
Li, S., et al.: Moganet: multi-order gated aggregation network. In: International Conference on Learning Representations (2024)
Zhang, S., Liu, H., Lin, S., He, K.: You only need less attention at each stage in vision transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6057–6066 (2024)
Guo, M.-H., Lu, C.-Z., Liu, Z.-N., Cheng, M.-M., Hu, S.-M.: Visual attention network. Comput. Visual Media 9(4), 733–752 (2023)
Hassani, A., Walton, S., Li, J., Li, S., Shi, H.: Neighborhood attention transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6185–6194 (2023)
Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: beyond empirical risk minimization. In: International Conference on Learning Representations (ICLR) (2018)
Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: Cutmix: regularization strategy to train strong classifiers with localizable features. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6023–6032 (2019)
Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 34, pp. 13001–13008 (2020)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (ICLR) (2019)
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 618–626 (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)
Acknowledgment
This work is supported by the Shandong Provincial Natural Science Foundation (Grant No. ZR2022MF263), the Science and Technology Program of the University of Jinan (Grant No. XKY1913), the Doctoral Foundation of the University of Jinan (Grant No. XJ2024002305), and the High-performance Computing Platform at the University of Jinan.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, Y., Guo, Q., Li, Z. (2025). SUMMNet: Using Transformer as a Summary of ConvNet for Image Classification. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15308. Springer, Cham. https://doi.org/10.1007/978-3-031-78186-5_18
DOI: https://doi.org/10.1007/978-3-031-78186-5_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78185-8
Online ISBN: 978-3-031-78186-5
eBook Packages: Computer Science, Computer Science (R0)