Abstract
Multimodal clustering/diarization tries to answer the question "who spoke when" by using both audio and visual information. Diarization consists of two steps: first, the audio is segmented and the speech segments are detected; then, the speech segments are clustered to group them by speaker. This task has mainly been studied on audiovisual data from meetings, news broadcasts, or talk shows. In this paper, we use visual information to aid speaker clustering, and we introduce a new video-based feature, called actor presence, that can be used to enhance audio-based speaker clustering. We tested the proposed method on three full-length stereoscopic movies, i.e., a scenario much more difficult than those used so far, because there is no certainty that speech segments and on-screen appearances of actors will always overlap. The results show that visual information can improve speaker clustering accuracy and hence the diarization process.
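As a rough illustration of how a video-based cue could enhance audio-based clustering, the sketch below builds a per-segment presence vector and fuses it with an audio similarity matrix. The abstract does not define the actor presence feature or the fusion rule, so everything here is an assumption made for illustration: the overlap-fraction definition in `presence_vector`, the cosine similarity, the weighted-sum `fuse` step, the weight `alpha`, and the toy data are all hypothetical and are not the authors' formulation.

```python
import math

def presence_vector(segment, appearances, actors):
    """Hypothetical feature: fraction of a speech segment during which
    each actor is on screen. Assumes one actor's intervals do not overlap."""
    start, end = segment
    duration = end - start
    vec = []
    for actor in actors:
        overlap = sum(
            max(0.0, min(end, a_end) - max(start, a_start))
            for a_start, a_end in appearances.get(actor, [])
        )
        vec.append(overlap / duration if duration > 0 else 0.0)
    return vec

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

def fuse(audio_sim, visual_sim, alpha=0.5):
    """Assumed fusion rule: weighted sum of the two similarity matrices."""
    n = len(audio_sim)
    return [
        [alpha * audio_sim[i][j] + (1 - alpha) * visual_sim[i][j] for j in range(n)]
        for i in range(n)
    ]

# Toy data (made up): three speech segments and two actors' on-screen intervals.
actors = ["A", "B"]
segments = [(0.0, 4.0), (5.0, 9.0), (10.0, 14.0)]
appearances = {"A": [(0.0, 4.0), (10.0, 14.0)], "B": [(5.0, 9.0)]}

vectors = [presence_vector(s, appearances, actors) for s in segments]
visual_sim = [[cosine(u, v) for v in vectors] for u in vectors]

# Made-up audio similarities that separate the speakers only weakly.
audio_sim = [
    [1.0, 0.5, 0.6],
    [0.5, 1.0, 0.5],
    [0.6, 0.5, 1.0],
]
fused = fuse(audio_sim, visual_sim, alpha=0.5)
```

In this toy case, segments 0 and 2 share the same on-screen actor, so their fused similarity rises above the audio-only value, while the similarity between segments 0 and 1 falls. The fused matrix could then be handed to any affinity-based clustering method. The sketch also shows the difficulty the abstract points out: when the speaker is off screen, the presence vector carries no information about who is talking.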

Acknowledgments
The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement number 287674 (3DTVS). This publication reflects only the authors' views. The European Union is not liable for any use that may be made of the information contained therein.
Cite this article
Kapsouras, I., Tefas, A., Nikolaidis, N. et al. Multimodal speaker clustering in full length movies. Multimed Tools Appl 76, 2223–2242 (2017). https://doi.org/10.1007/s11042-015-3181-5
DOI: https://doi.org/10.1007/s11042-015-3181-5