Abstract
With the widespread use of unmanned aerial vehicles (UAVs), their safety issues have become increasingly prominent in recent years, making UAV detection and identification a research hot spot. Radar-based methods struggle to monitor low-flying UAVs, and video-based methods require high imaging quality; acoustic UAV detection can compensate for these shortcomings of traditional methods. This paper proposes an integrated learning model based on multi-scale convolution and global-local attention that processes audio signals. The model identifies UAVs accurately from their acoustic emissions and is intended to complement other detection modalities. It adopts an integrated learning framework that operates directly on raw UAV audio signals without manual feature extraction, and consists of two first-level expert models and a meta-classifier. First, the two expert models extract features from the data independently; their classification results are then passed to the meta-classifier, which fuses them and outputs the final UAV detection and recognition result. Both expert models add a multi-scale global-local attention module on top of residual and depthwise separable convolutional structures. The proposed method is compared with other one-dimensional signal processing methods on a self-built UAV dataset, and experiments verify its effectiveness and superiority.
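The two-level flow described above (two first-level experts whose outputs are fused by a meta-classifier) can be sketched as follows. This is a minimal illustration of the stacking idea only: the expert networks, which in the paper are 1D CNNs with residual/depthwise-separable convolutions and multi-scale global-local attention, are replaced here by hypothetical fixed linear maps, and all names, shapes, and weights are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def expert(x, w):
    # Stand-in for one first-level expert model: maps a raw audio frame
    # to 2-class probabilities (UAV / no UAV). In the paper this would be
    # a deep 1D CNN, not a linear map.
    return softmax(x @ w)

def stacked_predict(x, w1, w2, w_meta):
    # Level 1: each expert scores the raw signal independently.
    p1 = expert(x, w1)
    p2 = expert(x, w2)
    # Level 2: the meta-classifier fuses the concatenated expert outputs
    # and produces the final detection result.
    z = np.concatenate([p1, p2], axis=-1)
    return softmax(z @ w_meta)

rng = np.random.default_rng(0)
frame = rng.standard_normal(16)          # one raw audio frame (16 samples)
w1 = rng.standard_normal((16, 2))        # hypothetical expert 1 weights
w2 = rng.standard_normal((16, 2))        # hypothetical expert 2 weights
w_meta = rng.standard_normal((4, 2))     # hypothetical meta-classifier weights

probs = stacked_predict(frame, w1, w2, w_meta)
print(probs.shape, float(probs.sum()))   # a 2-class probability vector
```

In practice the meta-classifier would be trained on held-out expert predictions so that it learns how much to trust each expert, which is what distinguishes stacking from simple output averaging.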
Data availability
A link to download the data has been provided.
Funding
This work is supported in part by the Laboratory of Aerodynamic Noise Control Program (Grant No. ANCL20230204), in part by the National Natural Science Foundation of China (Grant Nos. 62201478 and 61971100), in part by the Sichuan Science and Technology Program (Grant Nos. 2024NSFSC1434 and 2022YFG0148), in part by the Southwest University of Science and Technology Doctor Fund (Grant No. 20zx7119), and in part by the Heilongjiang Provincial Science and Technology Program (Grant No. 2022ZX01A16).
Author information
Authors and Affiliations
Contributions
JL: Methodology, Validation, Writing—original draft, Software, Investigation, Conceptualization. JZ: Writing—review and editing, Supervision, Funding acquisition. JR: Methodology, Investigation, Software, Conceptualization. XG: Writing—review, Visualization. ZL: Writing—review and editing, Visualization.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval and consent to participate
The data used in this article raise no ethical or privacy issues.
Consent for publication
Written informed consent for publication was obtained from all participants.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, J., Zhao, J., Ren, J. et al. A multi-scale integrated learning model with attention mechanisms for UAV audio signal detection. SIViP 19, 344 (2025). https://doi.org/10.1007/s11760-025-03944-9