Speakers Talking Foreign Languages in a Multi-lingual TTS System

  • Conference paper
Text, Speech, and Dialogue (TSD 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12848)

Abstract

This paper presents experiments with a multi-lingual, multi-speaker TTS synthesis system jointly trained on English, German, Russian, and Czech speech data. The experimental LSTM-based TTS system with a trainable neural vocoder uses the International Phonetic Alphabet (IPA), which allows different languages to be combined directly. We analyzed whether the joint model is capable of generalizing and mixing the information contained in the training data, and whether particular voices can be used to synthesize other languages, including their language-specific phonemes. The intelligibility of the generated speech was assessed by a SUS (Semantically Unpredictable Sentences) listening test containing Czech sentences spoken by non-Czech speakers. The performance of the joint multi-lingual model was also compared, in a preference test, with independent single-voice models in which the missing non-native phonemes were mapped to the most similar native phonemes. Besides the Czech sentences, the preference test also contained English sentences spoken by Czech voices. The multi-lingual model was preferred for all evaluated voices. Although the generated speech did not sound like a native speaker, its phonetic and prosodic features were clearly better than those produced by the single-voice models.
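
The two input representations contrasted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the truncated inventories, the `FALLBACK` substitution table, and the function names `joint_inventory` and `map_for_single_voice` are all hypothetical and chosen only to show the idea. A joint model can work with the union of all IPA symbols, whereas a single-voice baseline has to replace each symbol missing from the voice's native inventory with a similar native one.

```python
# Illustrative sketch only. The inventories and the substitution table are
# hypothetical toy data, not the phoneme sets or mapping used in the paper.

R_RAISED = "r\u031d"  # IPA raised alveolar trill (Czech "ř"): r + combining up tack

# Heavily truncated per-language IPA inventories (assumption).
INVENTORY = {
    "cs": {"t", "s", "d", "ɛ", "r", R_RAISED, "x", "ɦ"},
    "en": {"t", "s", "d", "ɛ", "ɹ", "θ", "ð", "h"},
}

# Hypothetical "most similar native phoneme" table, keyed by the voice's language.
FALLBACK = {
    "en": {R_RAISED: "ɹ", "r": "ɹ", "x": "h", "ɦ": "h"},
    "cs": {"θ": "s", "ð": "d", "ɹ": "r", "h": "ɦ"},
}


def joint_inventory(inventories):
    """Union of all per-language symbols: the input alphabet of a joint model."""
    symbols = set()
    for language_symbols in inventories.values():
        symbols |= language_symbols
    return symbols


def map_for_single_voice(phonemes, voice_lang):
    """Replace symbols unknown to a single-voice model with native substitutes."""
    native = INVENTORY[voice_lang]
    substitutes = FALLBACK[voice_lang]
    mapped = []
    for phoneme in phonemes:
        if phoneme in native:
            mapped.append(phoneme)              # already native: keep as is
        elif phoneme in substitutes:
            mapped.append(substitutes[phoneme])  # non-native: use the substitute
        else:
            raise KeyError(f"no substitute for {phoneme!r} in voice {voice_lang!r}")
    return mapped


if __name__ == "__main__":
    czech_like = ["t", R_RAISED, "ɛ", "x"]
    print(sorted(joint_inventory(INVENTORY)))
    print(map_for_single_voice(czech_like, "en"))
```

In this toy setup the joint alphabet keeps the Czech-specific symbols available to every voice, while the single-voice mapping collapses them onto native substitutes before synthesis.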

Notes

  1. We realize that the IPA primarily describes phones, not phonemes; nevertheless, it can be used for phonemic transcription, too.

  2. Admittedly, the glottal stop is a phone rather than a phoneme in the Czech language, too.

  3. Audio samples are available at https://bit.ly/2Ryog0I.

Acknowledgment

This research was supported by the Czech Science Foundation (GA CR), project No. GA19-19324S, and by the grant of the University of West Bohemia, project No. SGS-2019-027. Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.

Author information

Corresponding author

Correspondence to Zdeněk Hanzlíček.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Hanzlíček, Z., Vít, J., Řezáčková, M. (2021). Speakers Talking Foreign Languages in a Multi-lingual TTS System. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham. https://doi.org/10.1007/978-3-030-83527-9_42

  • DOI: https://doi.org/10.1007/978-3-030-83527-9_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-83526-2

  • Online ISBN: 978-3-030-83527-9

  • eBook Packages: Computer Science, Computer Science (R0)
