Automatic lip-reading, the task of decoding spoken language through visual analysis of lip movements, is a promising avenue for advancing human-computer interaction and accessibility. This research proposes a model that integrates 3D Convolutional Neural Networks (3D-CNNs) and Long Short-Term Memory (LSTM) networks to improve the accuracy and efficiency of lip-reading systems while addressing challenges posed by lighting variations, differences in speaker articulation, and linguistic diversity. Unlike a traditional 2D-CNN, which captures only the spatial information within individual frames and therefore misses the temporal dynamics essential to accurate lip-reading, the 3D-CNN extracts spatiotemporal features across consecutive frames, and the LSTM models longer-range dependencies over the resulting feature sequence, yielding a more complete representation of speech and higher recognition accuracy. Extensive training on a diverse dataset, together with the exploration of transfer learning techniques, contributes to the robustness and generalization of the model.
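To make the proposed pipeline concrete, the following is a minimal sketch in PyTorch of a 3D-CNN front end feeding an LSTM classifier. The input shape (grayscale mouth crops, 75 frames at 64x64 pixels), the vocabulary size, the class name LipReader, and all layer widths are illustrative assumptions for this sketch, not the configuration used in the paper.

```python
# Minimal 3D-CNN + LSTM lip-reading sketch (assumed hyperparameters throughout).
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, num_classes: int = 500):  # assumed 500-word vocabulary
        super().__init__()
        # 3D convolutions capture spatiotemporal features: each kernel spans
        # several consecutive frames as well as a spatial neighborhood,
        # unlike a 2D-CNN, which sees one frame at a time.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool space, keep time
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        # The LSTM models longer-range temporal dependencies over the
        # per-frame feature vectors produced by the 3D-CNN front end.
        self.lstm = nn.LSTM(input_size=64 * 16 * 16, hidden_size=256,
                            num_layers=2, batch_first=True)
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, frames, height, width)
        feats = self.frontend(x)                   # (B, 64, T, 16, 16)
        b, c, t, h, w = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        out, _ = self.lstm(feats)                  # (B, T, 256)
        return self.classifier(out[:, -1])         # classify from last step

model = LipReader()
clip = torch.randn(2, 1, 75, 64, 64)               # two 75-frame mouth clips
print(model(clip).shape)                           # torch.Size([2, 500])
```

In this sketch the 3D pooling layers downsample only the spatial dimensions, so the temporal resolution reaching the LSTM is preserved; a transfer-learning variant of the kind the abstract mentions would typically initialize the convolutional front end from a pretrained video model before fine-tuning.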