In this paper, we develop a real-time deep learning system for emotion recognition from both speech and facial inputs. For speech emotion recognition, we use three major datasets: SAVEE, the Toronto Emotional Speech Set (TESS), and CREMA-D, which together contain over 75,000 samples. These datasets cover a wide range of human emotions, including Anger, Sadness, Fear, Disgust, Calm, Happiness, Neutral, and Surprise, with each emotion mapped to a numerical label from 1 to 8. Our primary objective was to build a system capable of detecting emotions both from live speech captured by a PC microphone and from pre-recorded audio files. To this end, we employ a Long Short-Term Memory (LSTM) network, a type of Recurrent Neural Network (RNN) that is particularly effective for sequential data such as speech. The LSTM model was trained on the RAVDESS dataset, which contains 7,356 distinct audio files, 5,880 of which were used for training, and achieved a training accuracy of 83%.

For facial emotion recognition, we apply a Convolutional Neural Network (CNN), an architecture well suited to image processing tasks. We draw on four well-known facial expression datasets: FER2013, CK+, AffectNet, and JAFFE. FER2013 includes over 35,000 labeled images covering seven key emotions, while CK+ provides 593 video sequences that capture the transition from a neutral to a peak expression, allowing for precise emotion classification.

By combining the LSTM for speech emotion detection with the CNN for facial emotion recognition, the system identifies and classifies emotions across multiple modalities. The integration of these two architectures yields a comprehensive real-time emotion recognition system capable of processing both audio and visual data.
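To make the speech branch concrete, the following is a minimal sketch of an LSTM classifier over frame-level audio features, assuming MFCC features extracted with librosa and a Keras model; the feature dimensions, frame count, and layer widths shown here are illustrative assumptions rather than the exact configuration trained in our experiments.

```python
# Illustrative sketch of the speech-emotion branch (assumed MFCC features + Keras LSTM).
import numpy as np
import librosa
from tensorflow.keras import layers, models

NUM_EMOTIONS = 8   # Anger, Sadness, Fear, Disgust, Calm, Happiness, Neutral, Surprise
N_MFCC = 40        # MFCC coefficients per frame (assumed)
MAX_FRAMES = 200   # clips padded/truncated to a fixed frame count (assumed)

def extract_mfcc(path, sr=22050):
    """Load an audio file and return a fixed-size (MAX_FRAMES, N_MFCC) MFCC matrix."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC).T   # (frames, N_MFCC)
    if mfcc.shape[0] < MAX_FRAMES:                              # zero-pad short clips
        mfcc = np.pad(mfcc, ((0, MAX_FRAMES - mfcc.shape[0]), (0, 0)))
    return mfcc[:MAX_FRAMES]

def build_speech_lstm():
    """Stacked LSTM over MFCC frames, ending in an 8-way softmax over emotion labels."""
    model = models.Sequential([
        layers.Input(shape=(MAX_FRAMES, N_MFCC)),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(64),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),
        # Emotion labels 1-8 would be shifted to 0-7 to match the sparse categorical loss.
        layers.Dense(NUM_EMOTIONS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The same feature-extraction routine can serve both pre-recorded files and buffered microphone audio, which is what allows the trained model to be reused for live inputs.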
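Similarly, a minimal sketch of the facial branch is given below, assuming 48x48 grayscale inputs in the FER2013 style and a small Keras CNN; the depth and filter counts are illustrative assumptions, not the trained configuration.

```python
# Illustrative sketch of the facial-expression branch (assumed FER2013-style inputs + Keras CNN).
from tensorflow.keras import layers, models

NUM_EXPRESSIONS = 7   # FER2013 labels: angry, disgust, fear, happy, sad, surprise, neutral

def build_face_cnn(input_shape=(48, 48, 1)):
    """Conv-pool blocks followed by a dense head with a 7-way softmax over expressions."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.4),
        layers.Dense(NUM_EXPRESSIONS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

In a combined system, the softmax outputs of the speech and facial branches can be fused (for example, by averaging class probabilities) to produce a single multimodal emotion prediction.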