Abstract
This paper proposes a new approach to parameterizing the excitation signal in order to improve the quality of an HMM-based speech synthesis system. The proposed method models the excitation (residual) signal by separating regions of the residual signal according to their perceptual importance. First, the characteristics of the residual signal around the glottal closure instant (GCI) are studied using principal component analysis (PCA). Based on this study and on previous literature (Adiga and Prasanna in Proceedings of Interspeech, pp 1677–1681, 2013; Cabral in Proceedings of Interspeech, pp 1082–1086, 2013), the segment of the residual signal around the GCI, which carries perceptually important information, is treated as the deterministic component, and the remaining part of the residual signal is treated as the noise component. The deterministic component is compactly represented using PCA coefficients (with about 95% accuracy), and the noise component is parameterized in terms of its spectral and amplitude envelopes. The proposed excitation model is incorporated into the HMM-based speech synthesis system. Subjective evaluation results show a significant improvement in quality for speech synthesized by the proposed method, for both female and male speakers, compared with three existing excitation modeling methods. The accurate parameterization of the segment of the residual signal around the GCI accounts for the improvement in the quality of the synthesized speech. Synthesized speech samples of the proposed and existing source models are available online at http://www.sit.iitkgp.ernet.in/~ksrao/parametric-hts/pcd-hts.html.
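As a rough illustration of the compact PCA representation mentioned above, the following sketch fits PCA to fixed-length frames and keeps the fewest components needed to reach a target share of the variance. It is not the authors' implementation. The frames are synthetic stand-ins for GCI-centred residual segments, the function names `fit_pca`, `encode`, and `decode` are hypothetical, and reading "about 95% accuracy" as 95% cumulative explained variance is an assumption, since the abstract does not define the measure.

```python
import numpy as np

def fit_pca(frames, target=0.95):
    """Fit PCA to row-wise frames and keep the fewest components whose
    cumulative explained variance reaches `target` (assumed criterion)."""
    mean = frames.mean(axis=0)
    centered = frames - mean
    # SVD of the centered data: rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    # First index at which the cumulative variance reaches the target.
    k = int(np.searchsorted(np.cumsum(explained), target)) + 1
    k = min(k, len(s))
    return mean, vt[:k], explained[:k]

def encode(frame, mean, basis):
    """Project one frame onto the retained principal directions."""
    return basis @ (frame - mean)

def decode(coeffs, mean, basis):
    """Rebuild an approximate frame from its PCA coefficients."""
    return mean + basis.T @ coeffs

# Synthetic stand-in data (NOT real speech residuals): each frame is a
# random mix of three pulse-like shapes plus a small amount of noise.
rng = np.random.default_rng(0)
n = np.arange(64)
shapes = np.stack([np.exp(-0.5 * ((n - 32) / w) ** 2) for w in (2.0, 4.0, 8.0)])
weights = rng.normal(size=(200, 3))
frames = weights @ shapes + 0.01 * rng.normal(size=(200, 64))

mean, basis, explained = fit_pca(frames, target=0.95)
coeffs = encode(frames[0], mean, basis)
recon = decode(coeffs, mean, basis)

print("components kept:", basis.shape[0])
print("variance captured:", float(explained.sum()))
```

In this sketch each 64-sample frame is replaced by a handful of coefficients, which is the sense in which a PCA representation is compact. With real data, the frame length, the alignment around each GCI, and the stopping criterion would all need to follow the paper's own definitions.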
















References
N. Adiga, S.R.M. Prasanna, Significance of instants of significant excitation for source modeling, in Proceedings of Interspeech (2013), pp. 1677–1681
P. Alku, Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering. Speech Commun. 11(2–3), 109–118 (1992)
J.P. Cabral, S. Renals, J. Yamagishi, K. Richmond, HMM-based speech synthesiser using the LF-model of the glottal source, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2011), pp. 4704–4707
J.P. Cabral, Uniform concatenative excitation model for synthesising speech without voiced/unvoiced classification, in Proceedings of Interspeech (2013), pp. 1082–1086
CMU ARCTIC speech synthesis databases (online). http://festvox.org/cmu_arctic/
T.G. Csapó, G. Németh, A novel irregular voice model for HMM-based speech synthesis, in Proceedings of ISCA Speech Synthesis Workshop (2013), pp. 229–234
T.G. Csapó, G. Németh, Modeling irregular voice in statistical parametric speech synthesis with residual codebook based excitation. IEEE J. Sel. Top. Signal Process. 8(2), 209–220 (2014)
T. Drugman, A. Moinet, T. Dutoit, G. Wilfart, Using a pitch-synchronous residual codebook for hybrid HMM/frame selection speech synthesis, in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2009), pp. 3793–3796
T. Drugman, G. Wilfart, T. Dutoit, A deterministic plus stochastic model of the residual signal for improved parametric speech synthesis, in Proceedings of Interspeech (2009), pp. 1779–1782
T. Drugman, G. Wilfart, T. Dutoit, Eigenresiduals for improved parametric speech synthesis, in Proceedings of European Signal Processing Conference (EUSIPCO) (2009), pp. 2177–2180
T. Drugman, T. Dutoit, The deterministic plus stochastic model of the residual signal and its applications. IEEE Trans. Audio Speech Lang. Process. 20(3), 968–981 (2012)
T. Drugman, T. Raitio, Excitation modeling for HMM-based speech synthesis: breaking down the impact of periodic and aperiodic components, in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2014), pp. 260–264
HMM-based speech synthesis system (HTS) (online). http://hts.sp.nitech.ac.jp/
X. Huang, A. Acero, H.W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm and System Development (Prentice Hall, Upper Saddle River, 2001)
ITU-T Draft Recommendation P.862, Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs (2000)
H. Kawahara, I. Masuda-Katsuse, A. de Cheveigne, Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Commun. 27, 187–207 (1998)
H. Kawahara, M. Morise, T. Takahashi, R. Nisimura, T. Irino, H. Banno, Tandem-STRAIGHT: a temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation, in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2008), pp. 3933–3936
S. Kim, J. Kim, M. Hahn, HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron. 52, 1384–1390 (2006)
P. Loizou, Speech Enhancement: Theory and Practice (CRC Press, Boca Raton, 2007)
S.L. Maguer, N. Barbot, O. Boeffard, Evaluation of contextual descriptors for HMM-based speech synthesis in French, in Proceedings of ISCA Speech Synthesis Workshop (2013), pp. 153–158
R. Maia, T. Toda, H. Zen, Y. Nankaku, K. Tokuda, An excitation model for HMM-based speech synthesis based on residual modeling, in Proceedings of International Speech Communication Association Speech Synthesis Workshop 6 (ISCA SW6) (2007), pp. 131–136
J.D. Markel, A.H. Gray, Linear Prediction of Speech (Springer, Berlin, 1976)
A. McCree, K. Truong, E. George, T. Barnwell, V. Viswanathan, A 2.4 kbit/s MELP coder candidate for the new U.S. Federal Standard, in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP) (1996), pp. 200–203
A. McCree, A 14 kb/s wideband speech coder with a parametric highband model, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2000), pp. 1153–1156
K.S.R. Murty, B. Yegnanarayana, Epoch extraction from speech signals. IEEE Trans. Audio Speech Lang. Process. 16(8), 1602–1613 (2008)
N.P. Narendra, K.S. Rao, K. Ghosh, R.R. Vempada, S. Maity, Development of syllable-based text to speech synthesis system in Bengali. Int. J. Speech Technol. 14(3), 167–181 (2011)
N.P. Narendra, K.S. Rao, K. Ghosh, V.R. Reddy, S. Maity, Development of Bengali screen reader using Festival speech synthesizer, in Proceedings of IEEE India Conference (INDICON) (2011), pp. 1–4
N.P. Narendra, K.S. Rao, Robust voicing detection and F0 estimation for HMM-based speech synthesis. Circuits Syst. Signal Process. 34(8), 2597–2619 (2015)
N.P. Narendra, K.S. Rao, A deterministic plus noise model of excitation signal using principal component analysis for parametric speech synthesis, in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2016), pp. 5635–5639
J.J. Odell, The Use of Context in Large Vocabulary Speech Recognition. Ph.D. thesis, Cambridge University, Cambridge (1995)
K. Paliwal, W. Kleijn, Quantization of LPC parameters, in Speech Coding and Synthesis, ed. by W.B. Kleijn, K.K. Paliwal (Elsevier, Amsterdam, 1995)
Y. Pantazis, Y. Stylianou, Improving the modeling of the noise part in the harmonic plus noise model of speech, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2008), pp. 4609–4612
B. Picart, T. Drugman, T. Dutoit, HMM-based speech synthesis with various degrees of articulation: a perceptual study. J. Neurocomput. 132, 142–147 (2014)
T.F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice (Prentice Hall, Upper Saddle River, 2002)
E.V. Raghavendra, K. Prahallad, A multilingual screen reader in Indian languages, in Proceedings of National Conference on Communications (NCC) (2010), pp. 1–5
T. Raitio, A. Suni, J. Yamagishi, H. Pulakka, J. Nurminen, M. Vainio, P. Alku, HMM-based speech synthesis utilizing glottal inverse filtering. IEEE Trans. Audio Speech Lang. Process. 19(1), 153–165 (2011)
T. Raitio, A. Suni, H. Pulakka, M. Vainio, P. Alku, Utilizing glottal source pulse library for generating improved excitation signal for HMM-based speech synthesis, in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2011), pp. 4564–4567
K. Shinoda, T. Watanabe, MDL-based context-dependent subword modeling for speech recognition. J. Acoust. Soc. Jpn. (E) 21(2), 79–86 (2000)
F. Soong, B. Juang, Line spectrum pair (LSP) and speech data compression, in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP) (1984), pp. 37–40
Y. Stylianou, Harmonic Plus Noise Models for Speech, Combined with Statistical Methods, for Speech and Speaker Modification. Ph.D. thesis, Ecole Nationale Supérieure des Télécommunications (1996)
T. Toda, K. Tokuda, A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Trans. Inform. Syst. 90(5), 816–824 (2007)
K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, T. Kitamura, Speech parameter generation algorithms for HMM-based speech synthesis, in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2000), pp. 1315–1318
K. Tokuda, Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, K. Oura, Speech synthesis based on hidden Markov models. Proc. IEEE 101(5), 1234–1252 (2013)
Z. Wen, J. Tao, S. Pan, Y. Wang, Pitch-scaled spectrum based excitation model for HMM-based speech synthesis. J. Signal Process. Syst. 74(3), 423–435 (2013)
T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, T. Kitamura, Mixed-excitation for HMM-based speech synthesis, in Proceedings of Eurospeech (2001), pp. 2259–2262
E. Yumoto, W. Gould, T. Baer, Harmonics-to-noise ratio as an index of the degree of hoarseness. J. Acoust. Soc. Am. 71(6), 1544–1550 (1982)
H. Zen, T. Toda, M. Nakamura, K. Tokuda, Details of Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005. IEICE Trans. Inform. Syst. E90-D, 325–333 (2007)
H. Zen, T. Toda, K. Tokuda, The Nitech-NAIST HMM-based speech synthesis system for the Blizzard Challenge 2006. IEICE Trans. Inform. Syst. E91-D(6), 1764–1773 (2008)
Cite this article
Narendra, N.P., Rao, K.S. Parameterization of Excitation Signal for Improving the Quality of HMM-Based Speech Synthesis System. Circuits Syst Signal Process 36, 3650–3673 (2017). https://doi.org/10.1007/s00034-016-0476-3