References
[1]. Brock, A., Donahue, J., & Simonyan, K. (2018). Large
scale GAN training for high fidelity natural image synthesis.
arXiv preprint arXiv:1809.11096.
[2]. Brundage, M., Avin, S., Clark, J., Toner, H., Eckersley, P.,
Garfinkel, B., ... & Amodei, D. (2018). The malicious use of
artificial intelligence: Forecasting, prevention, and
mitigation. arXiv preprint arXiv:1802.07228.
[3]. Chen, L., Maddox, R. K., Duan, Z., & Xu, C. (2019).
Hierarchical cross-modal talking face generation with
dynamic pixel-wise loss. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition
(pp. 7832-7841).
[4]. Christian, J. (2018). Experts fear face swapping tech could start an international showdown. Retrieved from https://theoutline.com/post/3179/deepfake-videos-are-freaking-experts-out
[5]. Cooke, M., Barker, J., Cunningham, S., & Shao, X. (2006). An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, 120(5), 2421-2424. https://doi.org/10.1121/1.2229005
[6]. Dale, K., Sunkavalli, K., Johnson, M. K., Vlasic, D., Matusik, W., & Pfister, H. (2011, December). Video face replacement. In Proceedings of the 2011 SIGGRAPH Asia Conference (pp. 1-10). https://doi.org/10.1145/2024156.2024164
[7]. Fried, O., Tewari, A., Zollhöfer, M., Finkelstein, A.,
Shechtman, E., Goldman, D. B., Genova, K., Jin, Z.,
Theobalt, C., & Agrawala, M. (2019). Text-based editing of
talking-head video. ACM Transactions on Graphics, 38(4), 1-14. https://doi.org/10.1145/3306346.3323028
[8]. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014).
Generative adversarial nets. Communications of the ACM,
63(11), 139-144. https://doi.org/10.1145/3422622
[9]. Hashmi, M. F., Ashish, B. K. K., Keskar, A. G., Bokde, N. D., Yoon, J. H., & Geem, Z. W. (2020). An exploratory analysis on visual counterfeits using Conv-LSTM hybrid architecture. IEEE Access, 8, 101293-101308. https://doi.org/10.1109/ACCESS.2020.2998330
[10]. Jamaludin, A., Chung, J. S., & Zisserman, A. (2019). You said that?: Synthesising talking faces from audio. International Journal of Computer Vision, 127(11), 1767-1779. https://doi.org/10.1007/s11263-019-01150-y
[11]. Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017).
Progressive growing of GANs for improved quality, stability,
and variation. arXiv preprint arXiv:1710.10196.
[12]. Karras, T., Laine, S., & Aila, T. (2019). A style-based
generator architecture for generative adversarial networks.
In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (pp. 4401-4410).
[13]. Laptev, I., Marszalek, M., Schmid, C., & Rozenfeld, B. (2008, June). Learning realistic human actions from movies. In 2008 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-8). IEEE. https://doi.org/10.1109/CVPR.2008.4587756
[14]. Prajwal, K. R., Mukhopadhyay, R., Namboodiri, V. P., & Jawahar, C. V. (2020, October). A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM International Conference on Multimedia (pp. 484-492). https://doi.org/10.1145/3394171.3413532
[15]. Prajwal, K. R., Mukhopadhyay, R., Philip, J., Jha, A.,
Namboodiri, V., & Jawahar, C. V. (2019, October). Towards
automatic face-to-face translation. In Proceedings of the
27th ACM International Conference on Multimedia (pp.
1428-1436). https://doi.org/10.1145/3343031.3351066
[16]. Thies, J., Elgharib, M., Tewari, A., Theobalt, C., & Nießner, M. (2020, August). Neural voice puppetry: Audio-driven facial reenactment. In European Conference on Computer Vision (pp. 716-731). Springer, Cham. https://doi.org/10.1007/978-3-030-58517-4_42
[17]. Yoo, J. H. (2017). Large-scale video classification
guided by batch normalized LSTM translator. arXiv preprint
arXiv:1707.04045.
[18]. Yu, L., Yu, J., Li, M., & Ling, Q. (2020). Multimodal inputs driven talking face generation with spatial–temporal dependency. IEEE Transactions on Circuits and Systems for Video Technology, 31(1), 203-216. https://doi.org/10.1109/TCSVT.2020.2973374
[19]. Yu, L., Yu, J., & Ling, Q. (2019, November). Mining audio, text and visual information for talking face generation. In 2019 IEEE International Conference on Data Mining (ICDM) (pp. 787-795). IEEE. https://doi.org/10.1109/ICDM.2019.00089