
Adversarial Examples Attack and Countermeasure for Speech Recognition System: A Survey

Conference paper
Security and Privacy in Digital Economy (SPDE 2020)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1268)
Abstract

Speech recognition technology is profoundly affecting and changing current human-computer interaction. Owing to the remarkable progress of deep learning, the performance of Automatic Speech Recognition (ASR) systems has also increased significantly. As the core component of the speech assistant in smartphones and other smart devices, ASR receives speech and responds accordingly, which allows us to control and interact with those devices remotely. However, speech adversarial examples are crafted by adding tiny perturbations to original speech, and can make an ASR system generate malicious instructions while remaining imperceptible to humans. This new attack brings several potentially severe security risks to deep-learning-based ASR systems. In this paper, we provide a systematic survey of speech adversarial examples. We first propose a taxonomy of existing adversarial examples. Next, we give a brief introduction to existing adversarial examples for acoustic systems, especially ASR systems, and summarize several major methods of generating speech adversarial examples. Finally, after elaborating on the existing countermeasures against adversarial examples, we discuss the current challenges they face. We also give several promising research directions on making attacks more realistic and acoustic systems more robust, respectively.
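The core idea the abstract describes, adding a tiny perturbation to a waveform so a model's output changes while the distortion stays imperceptible, can be sketched with a toy gradient-sign step. This is a minimal illustration under stated assumptions: the linear `score` function, its weights `w`, and the budget `eps` are hypothetical stand-ins, not the paper's method or a real ASR model.

```python
import numpy as np

# Toy sketch of an adversarial perturbation: nudge input x by an
# epsilon-bounded gradient-sign step (FGSM-style) so the model's output
# shifts, while the per-sample distortion never exceeds eps.

rng = np.random.default_rng(0)
w = rng.standard_normal(16000)        # hypothetical model weights
x = 0.1 * rng.standard_normal(16000)  # stand-in for 1 s of 16 kHz audio

def score(signal):
    """Hypothetical scalar model output for a waveform."""
    return float(w @ signal)

eps = 1e-3                       # perturbation budget (max per-sample change)
grad = w                         # gradient of score() w.r.t. the input
x_adv = x + eps * np.sign(grad)  # adversarial example

print("max distortion:", np.max(np.abs(x_adv - x)))  # bounded by eps
print("score shift   :", score(x_adv) - score(x))    # eps * sum(|w|), positive
```

Real audio attacks replace the linear scorer with a differentiable ASR loss (e.g., CTC loss against a target transcription) and iterate this step, but the imperceptibility-versus-effect trade-off is the same.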

Supported by the National Natural Science Foundation of China (Grant Nos. U1736215, 61672302, 61901237), the Zhejiang Natural Science Foundation (Grant Nos. LY20F020010, LY17F020010), and the K.C. Wong Magna Fund at Ningbo University.



Author information

Correspondence to Rangding Wang.


Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Wang, D., Wang, R., Dong, L., Yan, D., Zhang, X., Gong, Y. (2020). Adversarial Examples Attack and Countermeasure for Speech Recognition System: A Survey. In: Yu, S., Mueller, P., Qian, J. (eds) Security and Privacy in Digital Economy. SPDE 2020. Communications in Computer and Information Science, vol 1268. Springer, Singapore. https://doi.org/10.1007/978-981-15-9129-7_31


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-9128-0

  • Online ISBN: 978-981-15-9129-7

  • eBook Packages: Computer Science (R0)
