Multimodal speaker clustering in full length movies

Kapsouras, I.; Tefas, A.; Nikolaidis, N.; Peeters, G.; Benaroya, L.; Pitas, I.

doi:10.1007/s11042-015-3181-5

Published: 11 January 2016

Multimedia Tools and Applications, volume 76, pages 2223–2242 (2017)

I. Kapsouras¹, A. Tefas¹, N. Nikolaidis¹, G. Peeters², L. Benaroya² & I. Pitas¹

Abstract

Multimodal clustering/diarization tries to answer the question "who spoke when" by using audio and visual information. Diarization consists of two steps: first, segmentation of the audio information and detection of the speech segments; then, clustering of the speech segments to group the speakers. This task has been mainly studied on audiovisual data from meetings, news broadcasts or talk shows. In this paper, we use visual information to aid speaker clustering and we introduce a new video-based feature, called actor presence, that can be used to enhance audio-based speaker clustering. We tested the proposed method on three full length stereoscopic movies, a scenario much more difficult than the ones used so far, since there is no certainty that speech segments and video appearances of actors will always overlap. The results showed that the visual information can improve the speaker clustering accuracy and hence the diarization process.
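The abstract does not define how actor presence is computed, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that a face clustering step has already produced on-screen time intervals per actor, takes the fraction of each speech segment during which each actor is visible as that segment's presence vector, and blends the cosine similarity of those vectors with a given audio similarity matrix. The function names, the weight `alpha`, and the toy data are all hypothetical.

```python
from math import sqrt


def overlap(a, b):
    """Length in seconds of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def actor_presence(segment, actor_tracks):
    """Per-actor fraction of a speech segment during which the actor is on screen.

    actor_tracks[k] is a list of (start, end) intervals for actor k
    (assumed output of a face clustering step; hypothetical format).
    """
    duration = segment[1] - segment[0]
    if duration <= 0:
        raise ValueError("speech segment must have positive duration")
    return [
        min(1.0, sum(overlap(segment, t) for t in tracks) / duration)
        for tracks in actor_tracks
    ]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = sqrt(sum(x * x for x in u))
    nv = sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)


def fused_similarity(audio_sim, presence, alpha=0.5):
    """Blend audio similarity with presence similarity (alpha is an assumed weight).

    When no actor is visible during either segment, the visual cue carries
    no information, so the audio similarity is used unchanged.
    """
    n = len(presence)
    fused = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if any(presence[i]) and any(presence[j]):
                visual = cosine(presence[i], presence[j])
                fused[i][j] = (1 - alpha) * audio_sim[i][j] + alpha * visual
            else:
                fused[i][j] = audio_sim[i][j]
    return fused


# Toy data: three speech segments and two actors (all values invented).
segments = [(0.0, 4.0), (5.0, 9.0), (10.0, 14.0)]
actor_tracks = [[(0.0, 5.0)], [(9.0, 15.0)]]
audio_sim = [[1.0, 0.6, 0.2], [0.6, 1.0, 0.3], [0.2, 0.3, 1.0]]

presence = [actor_presence(s, actor_tracks) for s in segments]
fused = fused_similarity(audio_sim, presence)
```

In the toy data the second segment overlaps no actor track, which mirrors the difficulty noted in the abstract: in movies, speech and on-screen appearance do not always coincide, so the sketch falls back to audio alone for that segment. A fused matrix of this kind could then be handed to any clustering routine that accepts a precomputed affinity.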

Acknowledgments

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement number 287674 (3DTVS). This publication reflects only the authors' views. The European Union is not liable for any use that may be made of the information contained therein.

Author information

Authors and Affiliations

Department of Informatics, Aristotle University of Thessaloniki, 54124, Thessaloniki, Greece
I. Kapsouras, A. Tefas, N. Nikolaidis & I. Pitas
Sound Analysis/Synthesis Team, STMS IRCAM-CNRS-UPMC, 1, Place Igor-Stravinsky, 75004, Paris, France
G. Peeters & L. Benaroya

Corresponding author

Correspondence to I. Kapsouras.

About this article

Cite this article

Kapsouras, I., Tefas, A., Nikolaidis, N. et al. Multimodal speaker clustering in full length movies. Multimed Tools Appl 76, 2223–2242 (2017). https://doi.org/10.1007/s11042-015-3181-5

Received: 08 April 2015
Revised: 25 November 2015
Accepted: 18 December 2015
Published: 11 January 2016
Issue Date: January 2017
DOI: https://doi.org/10.1007/s11042-015-3181-5
