Abstract
This chapter reviews distant speech recognition experiments on the AMI corpus of multiparty meetings. It compares conventional approaches, in which microphone array beamforming is followed by single-channel acoustic modelling, with approaches that combine multichannel signal processing and acoustic modelling within convolutional networks.
Notes
1. Mics 1 and 5 were used in the 2-mic case; mics 1, 3, 5 and 7 in the 4-mic case.
2. However, since the networks were being tasked with additional processing, it may be that deeper architectures would be more suitable.
3. The convolution of two vectors of size X and Y may result either in a vector of size X + Y − 1 for a full convolution with zero-padding of non-overlapping regions, or a vector of size X − Y + 1 for a valid convolution where only the points which overlap completely are considered.
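The output sizes given in the note on full and valid convolution can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `conv_output_sizes` and the vector lengths 5 and 3 are illustrative choices and do not come from the chapter.

```python
import numpy as np


def conv_output_sizes(x_len, y_len):
    """Return (full_size, valid_size) for 1-D vectors of the given lengths.

    Hypothetical helper for illustration only; assumes x_len >= y_len >= 1.
    """
    x = np.arange(x_len, dtype=float)  # signal of size X
    y = np.ones(y_len)                 # kernel of size Y
    # "full": zero-pads the non-overlapping regions, giving X + Y - 1 points.
    full = np.convolve(x, y, mode="full")
    # "valid": keeps only points where the vectors overlap completely,
    # giving X - Y + 1 points.
    valid = np.convolve(x, y, mode="valid")
    return full.size, valid.size


full_size, valid_size = conv_output_sizes(5, 3)
print(full_size, valid_size)
```

With X = 5 and Y = 3 this gives 5 + 3 − 1 = 7 points for the full case and 5 − 3 + 1 = 3 points for the valid case. `np.convolve` flips the kernel, whereas convolutional network layers typically compute a cross-correlation, but the output sizes are the same in both cases.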
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Renals, S., Swietojanski, P. (2017). Distant Speech Recognition Experiments Using the AMI Corpus. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_16
DOI: https://doi.org/10.1007/978-3-319-64680-0_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64679-4
Online ISBN: 978-3-319-64680-0
eBook Packages: Computer Science (R0)