Synthetic Audio and Video Generation for Language Translation using GANs

Aynaan Quraishi*, Jaydeep Jethwa**, Shiwani Gupta***
*-***Department of Computer Engineering, Thakur College of Engineering & Technology, Mumbai, India.
Periodicity: January-June 2023
DOI: https://doi.org/10.26634/javr.1.1.19412

Abstract

Language barriers create a digital divide that prevents people from benefiting from the vast amount of content produced worldwide. In addition, content creators face challenges in producing content in multiple languages to reach a wider audience. To address this problem, this study surveys a solution that combines Generative Adversarial Networks (GANs), Natural Language Processing (NLP), and Computer Vision. A Generative Adversarial Network is a Machine Learning (ML) model in which two neural networks compete with each other, using deep learning methods to make increasingly accurate predictions. The solution presented in this study can generate synthesized videos that are close to reality, ultimately bridging the language barrier and widening access to content.
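
The adversarial setup the abstract describes can be made concrete with a short sketch. The code below is a minimal, illustrative GAN training loop, not the architecture used in the paper; it assumes PyTorch as the framework, and all names (latent_dim, data_dim) and the synthetic "real" data are hypothetical, chosen only to show how the two networks compete.

# Minimal GAN sketch (illustrative only; PyTorch is an assumed framework,
# and the toy 1-D "real" data is hypothetical).
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64  # hypothetical sizes

# Generator G: maps random noise to a fake sample.
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                  nn.Linear(128, data_dim))
# Discriminator D: scores a sample with the probability that it is real.
D = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(32, data_dim) * 0.5 + 1.0  # stand-in "real" data
    fake = G(torch.randn(32, latent_dim))

    # Discriminator step: learn to label real samples 1 and fakes 0.
    opt_d.zero_grad()
    loss_d = (bce(D(real), torch.ones(32, 1)) +
              bce(D(fake.detach()), torch.zeros(32, 1)))
    loss_d.backward()
    opt_d.step()

    # Generator step: try to make the discriminator label fakes as real.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(32, 1))
    loss_g.backward()
    opt_g.step()

In the video-translation setting the abstract targets, the generator would instead synthesize translated audio or lip-synced video frames, and the discriminator would judge how close they are to real recordings.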

Keywords

Generative Adversarial Networks (GANs), Machine Learning (ML), Natural Language Processing, Language Barrier, Computer Vision.

How to Cite this Article?

Quraishi, A., Jethwa, J., and Gupta, S. (2023). Synthetic Audio and Video Generation for Language Translation using GANs. i-manager's Journal on Augmented & Virtual Reality, 1(1), 1-8. https://doi.org/10.26634/javr.1.1.19412
