In this paper, we develop a real-time deep learning system for emotion recognition from both speech and facial inputs. For speech emotion recognition, we use three major datasets: SAVEE, the Toronto Emotional Speech Set (TESS), and CREMA-D, which together contain over 75,000 samples. These datasets cover a wide range of human emotions, including Anger, Sadness, Fear, Disgust, Calm, Happiness, Neutral, and Surprise, with each emotion mapped to a numerical label from 1 to 8. Our primary objective was to build a system capable of detecting emotions both from live speech captured by a PC microphone and from pre-recorded audio files. To this end, we employ a Long Short-Term Memory (LSTM) network, a type of Recurrent Neural Network (RNN) that is particularly effective for sequential data such as speech. The LSTM model was trained on the RAVDESS dataset, which contains 7,356 distinct audio files, 5,880 of which were used for training, and achieved a training accuracy of 83%.

For facial emotion recognition, we apply a Convolutional Neural Network (CNN), an architecture well suited to image processing tasks. We draw on four well-known facial expression datasets: FER2013, CK+, AffectNet, and JAFFE. FER2013 includes over 35,000 labeled images covering seven key emotions, while CK+ provides 593 video sequences that capture the transition from a neutral to a peak expression, allowing for precise emotion classification.

By combining the LSTM for speech emotion detection with the CNN for facial emotion recognition, the system identifies and classifies emotions across multiple modalities. The integration of these two architectures yields a comprehensive real-time emotion recognition system capable of processing both audio and visual data.
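To make the speech branch concrete, the following is a minimal sketch of an LSTM classifier over frame-level audio features, assuming MFCC features extracted with librosa and a Keras model; the feature dimensions, frame count, and layer widths shown here are illustrative assumptions rather than the exact configuration trained in our experiments.

```python
# Illustrative sketch of the speech-emotion branch (assumed MFCC features + Keras LSTM).
import numpy as np
import librosa
from tensorflow.keras import layers, models

NUM_EMOTIONS = 8   # Anger, Sadness, Fear, Disgust, Calm, Happiness, Neutral, Surprise
N_MFCC = 40        # MFCC coefficients per frame (assumed)
MAX_FRAMES = 200   # clips padded/truncated to a fixed frame count (assumed)

def extract_mfcc(path, sr=22050):
    """Load an audio file and return a fixed-size (MAX_FRAMES, N_MFCC) MFCC matrix."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC).T   # (frames, N_MFCC)
    if mfcc.shape[0] < MAX_FRAMES:                              # zero-pad short clips
        mfcc = np.pad(mfcc, ((0, MAX_FRAMES - mfcc.shape[0]), (0, 0)))
    return mfcc[:MAX_FRAMES]

def build_speech_lstm():
    """Stacked LSTM over MFCC frames, ending in an 8-way softmax over emotion labels."""
    model = models.Sequential([
        layers.Input(shape=(MAX_FRAMES, N_MFCC)),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(64),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),
        # Emotion labels 1-8 would be shifted to 0-7 to match the sparse categorical loss.
        layers.Dense(NUM_EMOTIONS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The same feature-extraction routine can serve both pre-recorded files and buffered microphone audio, which is what allows the trained model to be reused for live inputs.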
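Similarly, a minimal sketch of the facial branch is given below, assuming 48x48 grayscale inputs in the FER2013 style and a small Keras CNN; the depth and filter counts are illustrative assumptions, not the trained configuration.

```python
# Illustrative sketch of the facial-expression branch (assumed FER2013-style inputs + Keras CNN).
from tensorflow.keras import layers, models

NUM_EXPRESSIONS = 7   # FER2013 labels: angry, disgust, fear, happy, sad, surprise, neutral

def build_face_cnn(input_shape=(48, 48, 1)):
    """Conv-pool blocks followed by a dense head with a 7-way softmax over expressions."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.4),
        layers.Dense(NUM_EXPRESSIONS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

In a combined system, the softmax outputs of the speech and facial branches can be fused (for example, by averaging class probabilities) to produce a single multimodal emotion prediction.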