Abstract
This research explores the development of a deep learning-based audio-visual emotion recognition system, aiming to enhance the accuracy and robustness of emotion classification by integrating multiple modalities. Traditional speech emotion recognition (SER) systems often rely on unimodal data, which limits their ability to fully capture human emotional expressions. Our study leverages the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) to implement a multimodal approach, combining audio and visual data. The proposed model incorporates Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, and attention mechanisms to improve performance. Experimental results demonstrate that the attention-based audio model achieves the highest accuracy of 62%, outperforming other tested configurations. The study highlights the potential of integrating attention mechanisms and multimodal data to enhance SER systems, while also identifying areas for future research, such as utilizing additional datasets and transfer learning techniques to further improve model performance and generalizability.
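The page does not include the model code, but the core idea named in the abstract (an attention mechanism that weights per-frame audio features before classification) can be sketched minimally. The following is an illustrative NumPy sketch, not the authors' implementation: `attention_pool` and the fixed scoring vector `w` are hypothetical stand-ins for learned components of a CNN/LSTM pipeline.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(frames, w):
    """Collapse a sequence of per-frame features into one utterance vector.

    frames: (T, D) array of per-frame features (e.g. CNN/LSTM outputs).
    w:      (D,) scoring vector; in a trained model this would be learned.
    """
    scores = frames @ w          # (T,) one relevance score per frame
    alpha = softmax(scores)      # attention weights, non-negative, sum to 1
    return alpha @ frames        # (D,) attention-weighted summary vector

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 16))   # 50 time steps, 16-dim features
w = rng.normal(size=16)
summary = attention_pool(frames, w)
print(summary.shape)
```

In a full system the summary vector would feed a dense classifier over the emotion labels; the attention weights let the model emphasize emotionally salient frames rather than averaging the whole utterance uniformly.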
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tolegenov, M., Saheer, L.B., Oghaz, M.M. (2025). Audio-Visual Emotion Recognition Using Deep Learning Methods. In: Bramer, M., Stahl, F. (eds) Artificial Intelligence XLI. SGAI 2024. Lecture Notes in Computer Science(), vol 15446. Springer, Cham. https://doi.org/10.1007/978-3-031-77915-2_24
Print ISBN: 978-3-031-77914-5
Online ISBN: 978-3-031-77915-2