Context-Sensitive Adapter: Contextual Biasing for Personalized End-to-End Speech Recognition with Attention Fusion and Bias Filtering

Cai, Yineng; Sun, Lixu; Li, Yongchao; Yolwas, Nurmemet; Silamu, Wushouer

doi:10.1007/978-981-97-5594-3_30

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14866)

Included in the following conference series:

International Conference on Intelligent Computing

Abstract

Despite improvements in the generalization performance of Automatic Speech Recognition (ASR) models, accurately recognizing infrequent words remains challenging, for example in language assistants for smart homes. A straightforward and viable way to improve recognition accuracy for such rare vocabulary is to incorporate contextual information into the model, and contextual biasing has consequently attracted growing attention from researchers. In this work, we introduce the Context-Sensitive Adapter, which uses an attention mechanism to extract pertinent information from the hidden vectors of acoustic and contextual data. We also apply Hyperconformer to contextual biasing for the first time, exploring its potential for novel applications. We propose a dual-thread architecture for training our model that preserves the accuracy of general speech recognition while strengthening the recognition of context-specific words. Experimental results demonstrate that our method, employing the Hyperconformer-based Context-Sensitive Adapter, outperforms both non-contextual models and shallow fusion models. Compared to the baseline, our method achieved maximum relative error rate reductions of 5.9% and 2.98%. Notably, against current state-of-the-art (SOTA) models, our method achieved a performance increase of up to 41.72%.
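As a rough illustration of what attention-based fusion of acoustic and contextual hidden vectors can look like, the sketch below applies generic scaled dot-product attention in plain Python. It is not the authors' Context-Sensitive Adapter: the function names `softmax` and `attention_bias`, the residual addition, and the toy two-dimensional vectors are all assumptions made for this example, and the bias filtering and Hyperconformer components are not modeled.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_bias(acoustic, context):
    """Fuse context embeddings into acoustic hidden vectors (hypothetical helper).

    acoustic: list of T frame vectors (queries), each of dimension d
    context:  list of K phrase embeddings (keys and values), dimension d
    Returns (fused, weights): T fused vectors and T attention distributions.
    """
    d = len(acoustic[0])
    fused, weights = [], []
    for q in acoustic:
        # Scaled dot-product score between this frame and every context entry.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in context]
        w = softmax(scores)
        # Attention-weighted sum of the context embeddings.
        bias = [sum(w[j] * context[j][i] for j in range(len(context)))
                for i in range(d)]
        # Residual fusion (an assumption): add the bias vector to the frame.
        fused.append([qi + bi for qi, bi in zip(q, bias)])
        weights.append(w)
    return fused, weights

# Toy data: two acoustic frames and two context phrase embeddings.
acoustic = [[1.0, 0.0], [0.0, 1.0]]
context = [[4.0, 0.0], [0.0, 4.0]]
fused, weights = attention_bias(acoustic, context)
```

In this toy setup each frame attends most strongly to the context embedding it is aligned with, so the fused vector is pulled toward that entry. A real system would learn projection matrices for queries, keys, and values rather than using raw vectors.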

Acknowledgments

This work was supported by the National Natural Science Foundation of China—Research on Key Technologies of Speech Recognition of Chinese and Western Asian Languages under Resource Constraints (Grant No. 62066043).

Author information

Authors and Affiliations

Xinjiang University, Uygur Autonomous Region, Urumqi, 830000, Xinjiang, China
Yineng Cai, Lixu Sun, Yongchao Li, Nurmemet Yolwas & Wushouer Silamu

Corresponding author

Correspondence to Nurmemet Yolwas.

Editor information

Editors and Affiliations

Eastern Institute of Technology, Ningbo, China
De-Shuang Huang
Tianjin University of Science and Technology, Tianjin, China
Xiankun Zhang
Xiamen University, Xiamen, China
Jiayang Guo

About this paper

Cite this paper

Cai, Y., Sun, L., Li, Y., Yolwas, N., Silamu, W. (2024). Context-Sensitive Adapter: Contextual Biasing for Personalized End-to-End Speech Recognition with Attention Fusion and Bias Filtering. In: Huang, DS., Zhang, X., Guo, J. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14866. Springer, Singapore. https://doi.org/10.1007/978-981-97-5594-3_30

DOI: https://doi.org/10.1007/978-981-97-5594-3_30
Published: 14 August 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5593-6
Online ISBN: 978-981-97-5594-3
eBook Packages: Computer Science, Computer Science (R0)
