Speech Feature Extraction and Emotion Recognition using Deep Learning Techniques

Pagidirayi Anil Kumar*, Anuradha B.**
*-** Department of Electronics and Communication Engineering, Sri Venkateswara University College of Engineering, Tirupati, Andhra Pradesh, India.
Periodicity: July - December 2024
DOI : https://doi.org/10.26634/jdp.12.2.21179

Abstract

Speech Emotion Recognition (SER) is crucial for human-computer interaction, enabling systems to better understand emotions. Traditional feature extraction methods such as Gamma Tone Cepstral Coefficients (GTCC) are used in SER because they capture auditory features aligned with human hearing, but they do not capture emotional nuances effectively. Mel Frequency Cepstral Coefficients (MFCC) have gained prominence for better representing speech signals in emotion recognition. This work introduces an approach that combines traditional and modern techniques, comparing GTCC-based extraction with MFCC and employing an Ensemble Subspace k-Nearest Neighbors (ES-kNN) classifier to improve accuracy. Additionally, deep learning models such as Long Short-Term Memory (LSTM) and Bidirectional LSTM (Bi-LSTM) networks are explored for their ability to capture temporal dependencies in speech. The CREMA-D and SAVEE datasets are used to evaluate the proposed approach.
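
The abstract describes a pipeline of frame-level cepstral feature extraction followed by a recurrent classifier. The sketch below is only an illustrative outline of that kind of pipeline, not the authors' implementation: it assumes librosa for MFCC extraction and TensorFlow/Keras for a Bi-LSTM, and the coefficient count, frame length, layer sizes, and number of emotion classes are placeholder choices.

# Illustrative sketch: MFCC extraction + Bi-LSTM emotion classifier.
# All hyperparameters below are assumptions, not values reported in the paper.
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers, models

N_MFCC = 40          # assumed number of MFCC coefficients per frame
MAX_FRAMES = 300     # assumed fixed number of frames per utterance
NUM_CLASSES = 7      # e.g., the seven SAVEE emotion categories

def extract_mfcc(path, sr=16000):
    """Load an utterance and return a (MAX_FRAMES, N_MFCC) MFCC matrix."""
    signal, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=N_MFCC).T  # (frames, N_MFCC)
    # Pad or truncate so every utterance yields the same sequence length.
    if mfcc.shape[0] < MAX_FRAMES:
        mfcc = np.pad(mfcc, ((0, MAX_FRAMES - mfcc.shape[0]), (0, 0)))
    return mfcc[:MAX_FRAMES]

def build_bilstm():
    """Bi-LSTM over the MFCC frame sequence with a softmax over emotion classes."""
    model = models.Sequential([
        layers.Input(shape=(MAX_FRAMES, N_MFCC)),
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

In practice, a GTCC front end or an ES-kNN classifier could be substituted at the corresponding stages of this pipeline; the structure (fixed-length cepstral sequences feeding a classifier) stays the same.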

Keywords

Speech Emotion Recognition (SER), Deep Learning, Bidirectional LSTM (Bi-LSTM), Temporal Dependencies, SAVEE Dataset, Feature Extraction.

How to Cite this Article?

Kumar, P. A., and Anuradha, B. (2024). Speech Feature Extraction and Emotion Recognition using Deep Learning Techniques. i-manager’s Journal on Digital Signal Processing, 12(2), 1-12. https://doi.org/10.26634/jdp.12.2.21179
