Abstract
Speech recognition technology is profoundly changing human-computer interaction. Owing to the remarkable progress of deep learning, the performance of Automatic Speech Recognition (ASR) systems has improved significantly. As the core component of voice assistants in smartphones and other smart devices, an ASR system receives speech and responds accordingly, allowing us to control and interact with those devices remotely. However, speech adversarial examples, crafted by adding tiny perturbations to original speech, can cause an ASR system to produce malicious instructions while remaining imperceptible to humans. This new class of attack poses several potentially severe security risks to deep-learning-based ASR systems. In this paper, we provide a systematic survey of speech adversarial examples. We first propose a taxonomy of existing adversarial examples. Next, we give a brief introduction to existing adversarial examples for acoustic systems, especially ASR systems, and summarize the major methods for generating speech adversarial examples. Finally, after elaborating on existing countermeasures, we discuss the current challenges in defending against speech adversarial examples. We also outline several promising research directions, both for making attacks more realistic and for making acoustic systems more robust.
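The core idea described above is that an adversarial example is the original waveform plus a perturbation whose magnitude is kept small enough to be imperceptible. The following is a minimal illustrative sketch (not the method of any specific paper surveyed here) showing the two constraints such attacks typically enforce: the perturbation is projected into an L-infinity ball of radius epsilon, and the resulting distortion can be measured as a signal-to-noise ratio. The function names and the epsilon value are hypothetical choices for illustration.

```python
import numpy as np

def apply_perturbation(x, delta, eps=0.01):
    """Add a perturbation to an audio signal under an L-infinity bound.

    x     : original waveform, samples in [-1.0, 1.0]
    delta : candidate perturbation (e.g. produced by an optimizer)
    eps   : maximum per-sample perturbation magnitude
    """
    # Project the perturbation into the L-inf ball of radius eps.
    delta = np.clip(delta, -eps, eps)
    # Add it to the signal and keep samples in the valid audio range.
    return np.clip(x + delta, -1.0, 1.0)

def snr_db(x, x_adv):
    """Signal-to-noise ratio (dB) between original and adversarial audio.

    Higher values mean the perturbation is quieter relative to the speech.
    """
    noise = x_adv - x
    return 10.0 * np.log10(np.sum(x ** 2) / (np.sum(noise ** 2) + 1e-12))
```

In a real attack, `delta` would be optimized (e.g. by gradient descent on the ASR model's loss toward a target transcription) rather than chosen freely; the sketch only captures the imperceptibility constraint that all such methods share.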
Supported by the National Natural Science Foundation of China (Grant No. U1736215, 61672302, 61901237), Zhejiang Natural Science Foundation (Grant No. LY20F020010, LY17F020010), K.C. Wong Magna Fund in Ningbo University.
© 2020 Springer Nature Singapore Pte Ltd.
Cite this paper
Wang, D., Wang, R., Dong, L., Yan, D., Zhang, X., Gong, Y. (2020). Adversarial Examples Attack and Countermeasure for Speech Recognition System: A Survey. In: Yu, S., Mueller, P., Qian, J. (eds) Security and Privacy in Digital Economy. SPDE 2020. Communications in Computer and Information Science, vol 1268. Springer, Singapore. https://doi.org/10.1007/978-981-15-9129-7_31
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-9128-0
Online ISBN: 978-981-15-9129-7