Development of a Model for Detecting Emotions Using CNN and LSTM

Shashwat Singh*
Periodicity: July - September 2024

Abstract

In this paper, we focus on developing a real-time deep learning system for emotion recognition from both speech and facial inputs. For speech emotion recognition, we used three major datasets: SAVEE, the Toronto Emotional Speech Set (TESS), and CREMA-D, which collectively contain over 75,000 samples. These datasets cover a wide range of human emotions, including anger, sadness, fear, disgust, calm, happiness, neutral, and surprise, with each emotion mapped to a numerical label from 1 to 8. Our primary objective was to build a system capable of detecting emotions both from live speech captured through a PC microphone and from pre-recorded audio files. To achieve this, we employed a Long Short-Term Memory (LSTM) network, a type of Recurrent Neural Network (RNN) that is particularly effective for sequential data such as speech. The LSTM model was trained on the RAVDESS dataset, which contains 7,356 distinct audio files, of which 5,880 were used for training; the model achieved a training accuracy of 83%, marking significant progress in speech-based emotion recognition. For facial emotion recognition, we applied a Convolutional Neural Network (CNN), an architecture known for its strength in image-processing tasks, and leveraged four well-known facial emotion datasets: FER2013, CK+, AffectNet, and JAFFE. FER2013 includes over 35,000 labeled images representing facial expressions associated with seven key emotions, while CK+ provides 593 video sequences that capture the transition from a neutral to a peak expression, allowing for precise emotion classification. By combining the LSTM for speech emotion detection with the CNN for facial emotion recognition, our system demonstrated robust capabilities in identifying and classifying emotions across multiple modalities. The integration of these two architectures enabled a comprehensive real-time emotion recognition system capable of processing both audio and visual data.
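To make the two-branch design described above concrete, the following is a minimal, illustrative Keras sketch of an LSTM speech-emotion model and a CNN facial-emotion model. The layer sizes, the use of 40 MFCC coefficients per frame, the 48x48 grayscale input (the FER2013 image format), and the function names are assumptions for illustration only, not the authors' exact architecture.

# Minimal sketch under illustrative assumptions, not the authors' exact models.
from tensorflow.keras import layers, models

# Speech branch: LSTM over MFCC frame sequences; 8 emotion classes
# (labels assumed zero-indexed 0-7 for training, matching the 1-8 mapping above).
def build_speech_lstm(time_steps=200, n_mfcc=40, n_classes=8):
    model = models.Sequential([
        layers.Input(shape=(time_steps, n_mfcc)),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(64),
        layers.Dropout(0.3),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Face branch: small CNN over 48x48 grayscale face images (FER2013 format);
# 7 emotion classes.
def build_face_cnn(img_size=48, n_classes=7):
    model = models.Sequential([
        layers.Input(shape=(img_size, img_size, 1)),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

In practice, the speech model would be fed MFCC sequences extracted from microphone or file audio (for example with a library such as librosa), while the CNN would be fed face crops detected in the video stream; the two predictions can then be reported separately or fused for multimodal emotion recognition.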

Keywords

Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Speech Emotion Recognition, Face Emotion Recognition, RAVDESS, CREMA-D, TESS, SAVEE, FER2013, CK+, AffectNet, JAFFE
