Kernel methods match deep neural networks on TIMIT
Po-Sen Huang, Haim Avron, et al.
ICASSP 2014
HMM-TTS synthesis is a popular approach toward flexible, low-foot-print, data driven systems that produce highly intelligible speech. In spite of these strengths, speech generated by these systems exhibit some degradation in quality, attributable to an inadequacy in modeling the excitation signal that drives the parametric models of the vocal tract. This paper proposes a novel method for modeling the excitation as a low-dimensional set of coefficients, based on a non-linear map learned through an autoencoder. Through analysis-and-resynthesis experiments, and a formal listening test, we show that this model produces speech of higher perceptual quality compared to conventional pulse-excited speech signals at the p < 0.01 significance level. ©2010 IEEE.
Po-Sen Huang, Haim Avron, et al.
ICASSP 2014
Bhuvana Ramabhadran, Jing Huang, et al.
INTERSPEECH - Eurospeech 2003
Asaf Rendel, Raul Fernandez, et al.
ICASSP 2016
Tara N. Sainath, Avishy Carmi, et al.
ICASSP 2010