Speech Emotion Recognition (SER) is crucial for human-computer interaction, enabling systems to interpret users' emotional states from speech. Traditional feature extraction methods such as Gammatone Cepstral Coefficients (GTCC) have been used in SER because they capture auditory features aligned with human hearing, but they often fall short in capturing emotional nuances. Mel Frequency Cepstral Coefficients (MFCC) have gained prominence for representing speech signals more effectively in emotion recognition. This work introduces an approach that combines traditional and modern techniques, comparing GTCC-based feature extraction with MFCC and employing the Ensemble Subspace k-Nearest Neighbors (ES-kNN) classifier to improve accuracy. In addition, deep learning models such as Long Short-Term Memory (LSTM) and Bidirectional LSTM (Bi-LSTM) are explored for their ability to capture temporal dependencies in speech. Experiments are conducted on the CREMA-D and SAVEE datasets.
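As a rough illustration of the pipeline summarized above, the sketch below extracts MFCC features with librosa and trains a random-subspace ensemble of k-NN learners using scikit-learn's BaggingClassifier, which approximates an ES-kNN setup. The file layout, label parsing, and parameter values (number of coefficients, neighbors, estimators) are illustrative assumptions rather than the configuration used in this work.

```python
import glob
import os

import librosa
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def mfcc_features(path, n_mfcc=13):
    """Load a speech clip and return a fixed-length MFCC feature vector
    (mean and standard deviation of each coefficient over time)."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])


# Hypothetical corpus layout: one WAV file per utterance, with the emotion
# label encoded in the file name (e.g. "happy_001.wav"); adapt as needed.
files = sorted(glob.glob("speech_clips/*.wav"))
X = np.vstack([mfcc_features(f) for f in files])
labels = np.array([os.path.basename(f).split("_")[0] for f in files])

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=0
)

# Random-subspace ensemble of k-NN learners: each base k-NN sees all
# training samples (bootstrap=False) but only a random subset of the
# feature dimensions, roughly mirroring Ensemble Subspace k-NN.
es_knn = BaggingClassifier(
    KNeighborsClassifier(n_neighbors=5),
    n_estimators=30,
    max_features=0.5,
    bootstrap=False,
    random_state=0,
)
es_knn.fit(X_train, y_train)
print("Held-out accuracy:", es_knn.score(X_test, y_test))
```

The same feature vectors could be swapped for GTCC features to reproduce the comparison described above; the classifier and evaluation split would remain unchanged.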