Abstract
This paper presents experiments with a multi-lingual multi-speaker TTS synthesis system jointly trained on English, German, Russian, and Czech speech data. The experimental LSTM-based TTS system with a trainable neural vocoder uses the International Phonetic Alphabet (IPA), which allows a straightforward combination of different languages. We analyzed whether the joint model is capable of generalizing and mixing the information contained in the training data and whether particular voices can be used for the synthesis of different languages, including the language-specific phonemes. The intelligibility of the generated speech was assessed by SUS (Semantically Unpredictable Sentences) listening tests containing Czech sentences spoken by non-Czech speakers. The performance of the joint multi-lingual model was also compared, in a preference test, with independent single-voice models in which the missing non-native phonemes were mapped to the most similar native phonemes. Besides the Czech sentences, the preference test also contained English sentences spoken by Czech voices. The multi-lingual model was preferred for all evaluated voices. Although the generated speech did not sound like a native speaker, its phonetic and prosodic features were clearly better than those of the single-voice models.
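The single-voice baseline described above, in which missing non-native phonemes are replaced by the most similar native ones, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the inventory subset, the fallback table, and the function name `map_to_native` are hypothetical and are not taken from the paper, which does not list its actual mapping here.

```python
# Hypothetical, heavily reduced phoneme inventory of a single-voice Czech model.
# A real inventory would be much larger; this subset is for illustration only.
CZECH_INVENTORY = {"a", "ɛ", "ɪ", "o", "u", "t", "d", "s", "z", "v", "r", "l", "n", "m", "k"}

# Hypothetical fallback table: non-native IPA symbol -> assumed closest native symbol.
# The pairs below are illustrative guesses, not the mapping used by the authors.
FALLBACK = {"θ": "s", "ð": "d", "w": "v", "æ": "ɛ", "ɹ": "r"}


def map_to_native(phonemes, inventory, fallback):
    """Replace every phoneme missing from `inventory` by its fallback substitute."""
    mapped = []
    for ph in phonemes:
        if ph in inventory:
            # Native phoneme: keep it unchanged.
            mapped.append(ph)
        elif fallback.get(ph) in inventory:
            # Non-native phoneme with a known native substitute.
            mapped.append(fallback[ph])
        else:
            # No substitute is defined; fail loudly instead of guessing.
            raise ValueError(f"no native substitute for phoneme {ph!r}")
    return mapped


# English "this" transcribed as /ð ɪ s/, prepared for the hypothetical Czech voice.
print(map_to_native(["ð", "ɪ", "s"], CZECH_INVENTORY, FALLBACK))
```

A joint multi-lingual model with a shared IPA inventory would not need such a table, since every phoneme occurring in any training language is already part of its input set.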
Notes
1. We realize that IPA primarily describes phones, not phonemes; nevertheless, it can be used for phonemic transcription, too.
2. Though the glottal stop is rather a phone than a phoneme in the Czech language, too.
3. Audio samples are available at https://bit.ly/2Ryog0I.
Acknowledgment
This research was supported by the Czech Science Foundation (GA CR), project No. GA19-19324S, and by the grant of the University of West Bohemia, project No. SGS-2019-027. Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Hanzlíček, Z., Vít, J., Řezáčková, M. (2021). Speakers Talking Foreign Languages in a Multi-lingual TTS System. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham. https://doi.org/10.1007/978-3-030-83527-9_42
DOI: https://doi.org/10.1007/978-3-030-83527-9_42
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-83526-2
Online ISBN: 978-3-030-83527-9
eBook Packages: Computer Science (R0)