Distant Speech Recognition Experiments Using the AMI Corpus

Renals, Steve; Swietojanski, Pawel

doi:10.1007/978-3-319-64680-0_16

Abstract

This chapter reviews distant speech recognition experiments using the AMI corpus of multiparty meetings. It compares conventional approaches, in which microphone array beamforming is followed by single-channel acoustic modelling, with approaches that combine multichannel signal processing and acoustic modelling within convolutional networks.

Notes

1. Mics 1 and 5 were used in the 2-mic case; mics 1, 3, 5 and 7 in the 4-mic case.
2. However, since the networks were being tasked with additional processing, it may be that deeper architectures would be more suitable.
3. The convolution of two vectors of size X and Y may result either in a vector of size X + Y − 1 for a full convolution with zero-padding of non-overlapping regions, or a vector of size X − Y + 1 for a valid convolution where only the points which overlap completely are considered.
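
The two output sizes in note 3 can be checked with a small sketch. This is an illustrative pure-Python convolution, not code from the chapter; the function name `convolve` and its `mode` argument are assumptions made for this example, and the "valid" case assumes X ≥ Y.

```python
def convolve(x, y, mode="full"):
    """Direct 1-D convolution of two sequences (hypothetical helper).

    mode="full"  -> length X + Y - 1 (non-overlapping regions zero-padded)
    mode="valid" -> length X - Y + 1 (only complete overlaps; assumes X >= Y)
    """
    n, m = len(x), len(y)
    # Accumulate every pairwise product into its output position.
    full = [0.0] * (n + m - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            full[i + j] += xi * yj
    if mode == "full":
        return full
    if mode == "valid":
        # Outputs m-1 .. n-1 are those where y lies entirely inside x.
        return full[m - 1:n]
    raise ValueError("mode must be 'full' or 'valid'")


if __name__ == "__main__":
    x = [1.0] * 7  # X = 7
    y = [1.0] * 3  # Y = 3
    print(len(convolve(x, y, "full")))   # X + Y - 1
    print(len(convolve(x, y, "valid")))  # X - Y + 1
```

With X = 7 and Y = 3, the full convolution has 9 points and the valid convolution has 5, matching the two formulas in the note.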

Author information

Authors and Affiliations

Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK
Steve Renals & Pawel Swietojanski

Corresponding author

Correspondence to Steve Renals.

Editor information

Editors and Affiliations

Mitsubishi Electric Research Laboratories (MERL), Cambridge, Massachusetts, USA
Shinji Watanabe
NTT Communication Science Laboratories, NTT Corporation, Kyoto, Japan
Marc Delcroix
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
Florian Metze
Mitsubishi Electric Research Laboratories (MERL), Cambridge, Massachusetts, USA
John R. Hershey

About this chapter

Cite this chapter

Renals, S., Swietojanski, P. (2017). Distant Speech Recognition Experiments Using the AMI Corpus. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_16

DOI: https://doi.org/10.1007/978-3-319-64680-0_16
Published: 26 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64679-4
Online ISBN: 978-3-319-64680-0
eBook Packages: Computer Science, Computer Science (R0)