Abstract
Speech recognition technology is profoundly changing human-computer interaction. Owing to the remarkable progress of deep learning, the performance of Automatic Speech Recognition (ASR) systems has improved significantly. As the core component of voice assistants in smartphones and other smart devices, an ASR system receives speech and responds accordingly, allowing us to control and interact with those devices remotely. However, speech adversarial examples, crafted by adding tiny perturbations to original speech, can cause an ASR system to produce malicious instructions while remaining imperceptible to humans. This new class of attack poses several potentially severe security risks to deep-learning-based ASR systems. In this paper, we provide a systematic survey of speech adversarial examples. We first propose a taxonomy of existing adversarial examples. Next, we give a brief introduction to existing adversarial examples for acoustic systems, especially ASR systems, and summarize the major methods for generating speech adversarial examples. Finally, after elaborating on existing countermeasures, we discuss the current challenges in defending against speech adversarial examples. We also outline several promising research directions, both for making attacks more realistic and for making acoustic systems more robust.
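The core idea described above is that an adversarial example is the original waveform plus a perturbation whose magnitude is kept small enough to be imperceptible. The following is a minimal illustrative sketch (not the method of any specific paper surveyed here) showing the two constraints such attacks typically enforce: the perturbation is projected into an L-infinity ball of radius epsilon, and the resulting distortion can be measured as a signal-to-noise ratio. The function names and the epsilon value are hypothetical choices for illustration.

```python
import numpy as np

def apply_perturbation(x, delta, eps=0.01):
    """Add a perturbation to an audio signal under an L-infinity bound.

    x     : original waveform, samples in [-1.0, 1.0]
    delta : candidate perturbation (e.g. produced by an optimizer)
    eps   : maximum per-sample perturbation magnitude
    """
    # Project the perturbation into the L-inf ball of radius eps.
    delta = np.clip(delta, -eps, eps)
    # Add it to the signal and keep samples in the valid audio range.
    return np.clip(x + delta, -1.0, 1.0)

def snr_db(x, x_adv):
    """Signal-to-noise ratio (dB) between original and adversarial audio.

    Higher values mean the perturbation is quieter relative to the speech.
    """
    noise = x_adv - x
    return 10.0 * np.log10(np.sum(x ** 2) / (np.sum(noise ** 2) + 1e-12))
```

In a real attack, `delta` would be optimized (e.g. by gradient descent on the ASR model's loss toward a target transcription) rather than chosen freely; the sketch only captures the imperceptibility constraint that all such methods share.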
Supported by the National Natural Science Foundation of China (Grant No. U1736215, 61672302, 61901237), Zhejiang Natural Science Foundation (Grant No. LY20F020010, LY17F020010), K.C. Wong Magna Fund in Ningbo University.
© 2020 Springer Nature Singapore Pte Ltd.
Cite this paper
Wang, D., Wang, R., Dong, L., Yan, D., Zhang, X., Gong, Y. (2020). Adversarial Examples Attack and Countermeasure for Speech Recognition System: A Survey. In: Yu, S., Mueller, P., Qian, J. (eds) Security and Privacy in Digital Economy. SPDE 2020. Communications in Computer and Information Science, vol 1268. Springer, Singapore. https://doi.org/10.1007/978-981-15-9129-7_31
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-9128-0
Online ISBN: 978-981-15-9129-7