Abstract
The performance of a speaker recognition system depends strongly on the duration of the speech used for enrollment and test. This work presents a detailed experimental review and analysis of the GMM-SVM based speaker recognition system in the presence of duration variability. It also compares the GMM-SVM classifier with its precursor, the Gaussian mixture model-universal background model (GMM-UBM) classifier, under the same duration variability. The goal of this work is not to propose a new algorithm for improving speaker recognition performance in the presence of duration variability. Instead, the main focus is on utterance partitioning (UP), a commonly used strategy for compensating for duration variability. We analyse in detail the impact of training utterance partitioning on speaker recognition performance under the GMM-SVM framework. We further investigate why utterance partitioning is important for boosting speaker recognition performance, and we show the cases in which it is useful and those in which it is not. Our study reveals that utterance partitioning does not reduce the data imbalance problem of the GMM-SVM classifier, contrary to the claim of an earlier study. In addition, we discuss, from the speech duration perspective, how parameters such as the number of Gaussians, the supervector length, and the amount of splitting affect performance in short and long duration test conditions. The experiments were performed with telephone speech from the POLYCOST corpus, which consists of 130 speakers.
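As a rough illustration of the utterance partitioning idea mentioned above, the following minimal sketch splits one training utterance, represented as a sequence of feature frames, into contiguous sub-utterances. It is not the authors' implementation: the function names `partition_utterance` and `training_segments`, the contiguous equal-length split, and the choice to keep the full utterance alongside its parts are all assumptions made for illustration. In a GMM-SVM system, each returned segment would then be mapped to its own GMM supervector.

```python
from typing import List, Sequence

Frame = Sequence[float]  # one acoustic feature vector (e.g. an MFCC frame)


def partition_utterance(frames: List[Frame], num_parts: int) -> List[List[Frame]]:
    """Split a frame sequence into num_parts contiguous, near-equal sub-utterances.

    Hypothetical helper: a plain contiguous split, assumed for illustration.
    """
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    if num_parts > len(frames):
        raise ValueError("more parts requested than frames available")
    base, extra = divmod(len(frames), num_parts)
    parts: List[List[Frame]] = []
    start = 0
    for i in range(num_parts):
        # The first `extra` parts receive one additional frame.
        end = start + base + (1 if i < extra else 0)
        parts.append(frames[start:end])
        start = end
    return parts


def training_segments(frames: List[Frame], num_parts: int) -> List[List[Frame]]:
    """Return the full utterance followed by its sub-utterances.

    Assumption: the full-length utterance is kept, giving num_parts + 1 segments.
    """
    return [frames] + partition_utterance(frames, num_parts)


if __name__ == "__main__":
    # Toy utterance: 10 frames of 2-dimensional features.
    toy = [[float(t), 0.5 * float(t)] for t in range(10)]
    segments = training_segments(toy, 4)
    print([len(s) for s in segments])  # [10, 3, 3, 2, 2]
```

With this scheme a single enrollment utterance yields several shorter training segments, so the speaker class is represented by more, but shorter, examples; the amount of splitting (`num_parts` here) is one of the parameters the abstract refers to.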
















Acknowledgements
The authors are grateful to Professor Goutam Saha, Department of E & ECE, IIT Kharagpur, for his help with the experiments on the POLYCOST database. The first author is extremely grateful to Dr. Richa Mittal, erstwhile student of the Department of CET, IIT Kharagpur, for her help during the preparation of the manuscript. The first author is also extremely grateful to Dr. Rahul Dasgupta, erstwhile student of the Department of CET, IIT Kharagpur, for rigorous technical discussions.
Cite this article
Sen, N., Sahidullah, M., Patil, H.A. et al. Utterance partitioning for speaker recognition: an experimental review and analysis with new findings under GMM-SVM framework. Int J Speech Technol 24, 1067–1088 (2021). https://doi.org/10.1007/s10772-021-09862-8
DOI: https://doi.org/10.1007/s10772-021-09862-8