Differentiable Feature Aggregation Search for Knowledge Distillation

Guan, Yushuo; Zhao, Pengyu; Wang, Bingxuan; Zhang, Yuanxing; Yao, Cong; Bian, Kaigui; Tang, Jian

doi:10.1007/978-3-030-58520-4_28

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12362)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Knowledge distillation has become increasingly important in model compression. It boosts the performance of a miniaturized student network with the supervision of the output distribution and feature maps from a sophisticated teacher network. Some recent works introduce multi-teacher distillation to provide more supervision to the student network. However, the effectiveness of multi-teacher distillation methods comes at the cost of expensive computation resources. To address both the efficiency and the effectiveness of knowledge distillation, we introduce feature aggregation to imitate multi-teacher distillation within the single-teacher distillation framework by extracting informative supervision from multiple teacher feature maps. Specifically, we introduce DFA, a two-stage Differentiable Feature Aggregation search method motivated by DARTS in neural architecture search, to efficiently find the aggregations. In the first stage, DFA formulates the search problem as a bi-level optimization and leverages a novel bridge loss, which consists of a student-to-teacher path and a teacher-to-student path, to find appropriate feature aggregations. The two paths act as two players against each other, trying to optimize the unified architecture parameters in opposite directions while guaranteeing both the expressivity and the learnability of the feature aggregation. In the second stage, DFA performs knowledge distillation with the derived feature aggregation. Experimental results show that DFA outperforms existing distillation methods on the CIFAR-100 and CINIC-10 datasets under various teacher-student settings, verifying the effectiveness and robustness of the design.
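
The abstract does not spell out the form of the aggregation, so the following is only a minimal sketch under stated assumptions: it borrows the DARTS-style continuous relaxation, in which a softmax over learnable architecture parameters weights the candidates, and applies it to several same-shaped teacher feature maps. The names `softmax`, `aggregate`, `teacher_features`, and `alpha` are hypothetical and are not taken from the paper; the bridge loss and the bi-level search are not reproduced here.

```python
import math
from typing import List, Sequence


def softmax(alpha: Sequence[float]) -> List[float]:
    """Numerically stable softmax over architecture parameters."""
    m = max(alpha)
    exps = [math.exp(a - m) for a in alpha]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(features: Sequence[Sequence[float]],
              alpha: Sequence[float]) -> List[float]:
    """Softmax-weighted sum of same-shaped (flattened) feature maps.

    Assumption: one scalar parameter per teacher feature map, as in a
    DARTS-style relaxation; the paper's exact formulation may differ.
    """
    if len(features) != len(alpha):
        raise ValueError("need one architecture parameter per feature map")
    size = len(features[0])
    if any(len(f) != size for f in features):
        raise ValueError("feature maps must share one shape")
    weights = softmax(alpha)
    return [sum(w * f[i] for w, f in zip(weights, features))
            for i in range(size)]


# Three hypothetical teacher feature maps, each flattened to length 4.
teacher_features = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 0.0, 3.0],
    [2.0, 2.0, 2.0, 2.0],
]
alpha = [0.0, 0.0, 0.0]  # equal parameters reduce to a plain average
aggregated = aggregate(teacher_features, alpha)
print(aggregated)
```

Because the weights are a smooth function of `alpha`, the aggregated target can in principle be tuned by gradient descent, which is what makes a differentiable search over aggregations possible. Pushing one parameter far above the others makes the aggregation collapse toward a single teacher feature map.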

Y. Guan and P. Zhao: These authors contributed equally to this work.

Acknowledgment

This work is partially supported by National Key Research and Development Program No. 2017YFB0803302, Beijing Academy of Artificial Intelligence (BAAI), and NSFC 61632017.

Author information

Authors and Affiliations

Peking University, Beijing, China
Yushuo Guan, Pengyu Zhao, Bingxuan Wang, Yuanxing Zhang & Kaigui Bian
Megvii (Face++) Technology Inc, Beijing, China
Cong Yao
National Engineering Laboratory for Big Data Analysis and Applications, Beijing, China
Kaigui Bian
DiDi AI Labs, Beijing, China
Jian Tang

Corresponding author

Correspondence to Kaigui Bian.

Editor information

Editors and Affiliations

University of Oxford, Oxford, UK
Andrea Vedaldi
Graz University of Technology, Graz, Austria
Horst Bischof
University of Freiburg, Freiburg im Breisgau, Germany
Thomas Brox
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jan-Michael Frahm

About this paper

Cite this paper

Guan, Y. et al. (2020). Differentiable Feature Aggregation Search for Knowledge Distillation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12362. Springer, Cham. https://doi.org/10.1007/978-3-030-58520-4_28

DOI: https://doi.org/10.1007/978-3-030-58520-4_28
Published: 19 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58519-8
Online ISBN: 978-3-030-58520-4
eBook Packages: Computer Science, Computer Science (R0)
