
Audio-Visual Emotion Recognition Using Deep Learning Methods

  • Conference paper
Artificial Intelligence XLI (SGAI 2024)

Abstract

This research explores the development of a deep learning-based audio-visual emotion recognition system, aiming to enhance the accuracy and robustness of emotion classification by integrating multiple modalities. Traditional speech emotion recognition (SER) systems often rely on unimodal data, which limits their ability to fully capture human emotional expressions. Our study leverages the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) to implement a multimodal approach, combining audio and visual data. The proposed model incorporates Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, and attention mechanisms to improve performance. Experimental results demonstrate that the attention-based audio model achieves the highest accuracy of 62%, outperforming other tested configurations. The study highlights the potential of integrating attention mechanisms and multimodal data to enhance SER systems, while also identifying areas for future research, such as utilizing additional datasets and transfer learning techniques to further improve model performance and generalizability.
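The abstract describes an architecture that combines CNNs, LSTM networks, and attention, with the best result coming from an attention-based audio model. The sketch below is a minimal, illustrative PyTorch version of such an audio branch, not the authors' implementation: the use of 40 MFCC features, the layer widths, and the eight RAVDESS emotion classes are assumptions made for the example.

```python
# Illustrative sketch only (assumed architecture, not the paper's code):
# a 1D-CNN front end over MFCC frames, a bidirectional LSTM for temporal
# context, and additive attention pooling before an 8-way emotion classifier.
import torch
import torch.nn as nn


class AudioEmotionNet(nn.Module):
    def __init__(self, n_mfcc: int = 40, n_classes: int = 8, hidden: int = 128):
        super().__init__()
        # Convolutions extract local spectral patterns from the MFCC sequence.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # The LSTM models longer-range temporal dependencies across frames.
        self.lstm = nn.LSTM(128, hidden, batch_first=True, bidirectional=True)
        # Additive attention weights each frame and pools them into one vector.
        self.attn = nn.Linear(2 * hidden, 1)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, n_mfcc, time)
        feats = self.cnn(mfcc).transpose(1, 2)          # (batch, time', 128)
        out, _ = self.lstm(feats)                       # (batch, time', 2*hidden)
        weights = torch.softmax(self.attn(out), dim=1)  # attention over time
        pooled = (weights * out).sum(dim=1)             # (batch, 2*hidden)
        return self.classifier(pooled)                  # emotion logits


if __name__ == "__main__":
    model = AudioEmotionNet()
    clips = torch.randn(4, 40, 200)   # 4 clips, 40 MFCCs, 200 frames
    print(model(clips).shape)         # torch.Size([4, 8])
```

In the multimodal setting the abstract describes, a visual branch (for example, a CNN over face frames) would produce a second embedding that is fused with the pooled audio vector before classification.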



Author information

Correspondence to Mukhambet Tolegenov.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Tolegenov, M., Saheer, L.B., Oghaz, M.M. (2025). Audio-Visual Emotion Recognition Using Deep Learning Methods. In: Bramer, M., Stahl, F. (eds) Artificial Intelligence XLI. SGAI 2024. Lecture Notes in Computer Science, vol 15446. Springer, Cham. https://doi.org/10.1007/978-3-031-77915-2_24


  • DOI: https://doi.org/10.1007/978-3-031-77915-2_24


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-77914-5

  • Online ISBN: 978-3-031-77915-2

  • eBook Packages: Computer Science, Computer Science (R0)
