Voice Conversion using GMM with Minimum Distance Spectral Mapping Plus Amplitude Scaling

Neha Yadav*, Vinay Kumar Jain**
* PG Scholar, Department of Electronics and Telecommunication Engineering, Shri Shankaracharya Technical Campus, Bhilai, India.
** Associate Professor, Department of Electronics and Telecommunication Engineering, Shri Shankaracharya Technical Campus, Bhilai, India.
Periodicity: September - November 2016
DOI: https://doi.org/10.26634/jele.7.1.8273

Abstract

Voice Conversion (VC) is a field of speech processing that transforms the characteristics of a source speaker's voice, whether spoken, sung, or taken from audio samples, so that a listener identifies the speech as having been uttered by a target speaker. Speech processing has been widely researched over the last two decades, with increasing commercial interest in VC applications such as Speech-to-Speech Translation (SST) and Text-To-Speech (TTS). Several traditional voice conversion methods are available, but they do not produce high-quality converted speech; for example, Gaussian Mixture Model (GMM) based VC suffers from over-smoothing. Hence, in this research, the authors propose a new voice conversion algorithm, Minimum Distance Spectral Mapping (MDSM), based on the idea of Dynamic Time Warping (DTW), in which point-to-point mapping is used in the warping function. In addition, an amplitude scaling function adjusts the mean source log amplitude spectrum to the mean target log amplitude spectrum in order to reduce over-smoothing. Because the mean log spectral envelopes used in amplitude scaling are already smooth, there is no need to find smoothing factors. The proposed MDSM plus Amplitude Scaling (MDSMAS) method preserves spectral details, improves speech quality, captures the similarity between source and target data, and gives improved results in objective tests.
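The two ideas named in the abstract, DTW-style point-to-point alignment and mean log-spectrum amplitude scaling, can be illustrated with a minimal sketch. This is not the authors' MDSM or MDSMAS implementation: the function names `dtw_path` and `amplitude_scale` are hypothetical, the DTW shown is the textbook version with a Euclidean local distance, and the amplitude scaling is assumed here to be a simple per-bin additive offset in the log domain.

```python
import numpy as np

def dtw_path(src, tgt):
    """Textbook DTW between two sequences of feature vectors (rows).

    Returns the point-to-point alignment as a list of (src_index, tgt_index).
    Assumption: Euclidean local distance; the paper may use another measure.
    """
    n, m = len(src), len(tgt)
    # Pairwise distances between every source row and every target row.
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]
            )
    # Backtrack from the end of both sequences to recover the warping path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def amplitude_scale(src_log_spec, tgt_log_spec):
    """Shift source log-amplitude spectra (frames x bins) so that the
    per-bin mean matches the mean target log-amplitude spectrum.

    Assumption: a single additive offset per frequency bin.
    """
    offset = tgt_log_spec.mean(axis=0) - src_log_spec.mean(axis=0)
    return src_log_spec + offset
```

In this sketch the offset is computed from means over all frames, which are smooth by construction, so no separate smoothing step is applied.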

Keywords

Gaussian Mixture Model, Minimum Distance Spectral Mapping, Dynamic Time Warping, Amplitude Scaling, Speech-to-Speech, Text-to-Speech

How to Cite this Article?

Yadav, N., and Jain, V.K. (2016). Voice Conversion using GMM with Minimum Distance Spectral Mapping Plus Amplitude Scaling. i-manager's Journal on Electronics Engineering, 7(1), 9-15. https://doi.org/10.26634/jele.7.1.8273
