Automatic Wave to Lip Syncing and Voice Dubbing Using Machine Learning

C. Vignesh*, J. Yaswanth Kumar **, M. Mayuranathan ***, J. Sunil Kumar ****
*-**** Department of Computer Science Engineering, SRM Valliammai Engineering College, Kattankulathur, India.
Periodicity: January - June 2021
DOI : https://doi.org/10.26634/jpr.8.1.18157

Abstract

The provision of methods to support audiovisual interaction with growing volumes of video data is an increasingly important challenge for data processing. There has recently been some success in generating lip movements from speech and in generating talking faces. Talking face generation aims to produce realistic talking heads synchronized with the audio or text input. This task requires mining the connection between the audio signal/text and lip-synced video frames while ensuring temporal continuity between frames. Due to problems such as polysemy, ambiguity, and fuzziness of sentences, creating visual images with lip synchronization remains challenging. This problem is addressed with a data-mining framework that learns the synchronous pattern between channels from large recorded audio/text and visual datasets and applies it to generate realistic talking face animations. Specifically, the task is decomposed into two steps: mouth movement prediction and video synthesis. First, a multimodal learning method is proposed to predict accurate lip movement during speech from multimedia inputs (both text and audio). In the second step, the Face2Vid framework is used to generate video frames conditioned on the predicted lip movement. The model translates the speech in the audio into a different language and dubs the video in the new language with proper lip synchronization. It uses natural language processing and machine translation (MT) to translate the audio, then uses a generative adversarial network (GAN) and a recurrent neural network (RNN) to apply proper lip synchronization.
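
The following Python sketch illustrates the two-step pipeline described above (translation, lip movement prediction, video synthesis). All function and class names (translate_speech, LipMovementPredictor, Face2VidRenderer, dub_video) are hypothetical placeholders standing in for the MT, RNN, and GAN components; this is a structural sketch under assumed interfaces, not the authors' implementation.

# Minimal sketch of the two-step dubbing pipeline described in the abstract.
# All module names below are assumed placeholders, not released code.

from dataclasses import dataclass

import numpy as np


@dataclass
class DubbingResult:
    translated_text: str
    lip_keypoints: np.ndarray   # (num_frames, num_points, 2) mouth landmarks
    video_frames: np.ndarray    # (num_frames, height, width, 3) rendered frames


def translate_speech(audio: np.ndarray, target_lang: str) -> str:
    """Assumed step 0: speech recognition followed by machine translation.
    A real system would call an ASR model and an MT model here."""
    return "placeholder translated transcript"


class LipMovementPredictor:
    """Assumed step 1: multimodal (audio + text) recurrent model that predicts
    per-frame mouth landmarks synchronized with the dubbed audio."""

    def predict(self, audio: np.ndarray, text: str, fps: int = 25) -> np.ndarray:
        num_frames = max(1, int(len(audio) / 16000 * fps))  # assumes 16 kHz audio
        # Stub: return neutral mouth landmarks for every frame.
        return np.zeros((num_frames, 20, 2), dtype=np.float32)


class Face2VidRenderer:
    """Assumed step 2: GAN-based renderer that synthesizes video frames
    conditioned on a reference face and the predicted lip movement."""

    def render(self, reference_frame: np.ndarray, lip_keypoints: np.ndarray) -> np.ndarray:
        # Stub: repeat the reference frame once per predicted landmark frame.
        return np.repeat(reference_frame[None], len(lip_keypoints), axis=0)


def dub_video(reference_frame: np.ndarray, audio: np.ndarray, target_lang: str) -> DubbingResult:
    text = translate_speech(audio, target_lang)
    lips = LipMovementPredictor().predict(audio, text)
    frames = Face2VidRenderer().render(reference_frame, lips)
    return DubbingResult(text, lips, frames)


if __name__ == "__main__":
    face = np.zeros((256, 256, 3), dtype=np.uint8)    # dummy reference frame
    speech = np.zeros(16000 * 2, dtype=np.float32)    # 2 s of dummy audio at 16 kHz
    result = dub_video(face, speech, target_lang="hi")
    print(result.video_frames.shape, result.lip_keypoints.shape)

In a full system, each stub would be replaced by a trained model: the RNN-based predictor learns the audio/text-to-lip mapping, and the GAN-based renderer produces the final photorealistic frames.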

Keywords

Machine Learning, Deep Learning, Audio, Video, Speech Converter, Lip Synchronization.

How to Cite this Article?

Vignesh, C., Kumar, J. Y., Mayuranathan, M., and Kumar, J. S. (2021). Automatic Wave to Lip Syncing and Voice Dubbing Using Machine Learning. i-manager's Journal on Pattern Recognition, 8(1), 19-24. https://doi.org/10.26634/jpr.8.1.18157
