TTS-project-synthesis
Project name
Text To Speech
Project members
Dong Wang, Zhiyong Zhang
Introduction
We are interested in flexible speech synthesis based on neural models. The basic idea is that, since a neural model can be trained with multiple conditions, we can treat speaker and emotion as conditional factors. We use a speaker vector and an emotion vector as additional inputs to the model, and then train a single model that can produce the voices of different speakers with different emotions.
In the following experiments, we use a simple DNN architecture to implement the training. The vocoder is WORLD.
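To make the conditioning concrete, below is a minimal sketch of such a model (not the actual system; the layer sizes, feature dimensions and variable names are assumptions, though the 40-dimensional speaker vector and tanh activations are suggested by the sample file names). The speaker and emotion vectors are simply appended to the frame-level linguistic features, so one network serves all speakers and emotions.

import torch
import torch.nn as nn

linguistic_dim = 300   # assumed dimension of the frame-level linguistic features
speaker_dim    = 40    # 40-dim speaker d-vector, as in the samples below
emotion_dim    = 10    # assumed dimension of the emotion vector
acoustic_dim   = 187   # assumed WORLD parameters (MGC, lf0, band aperiodicity, ...)

class ConditionedDNN(nn.Module):
    def __init__(self):
        super().__init__()
        in_dim = linguistic_dim + speaker_dim + emotion_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.Tanh(),
            nn.Linear(1024, 1024), nn.Tanh(),
            nn.Linear(1024, acoustic_dim),
        )

    def forward(self, linguistic, speaker_vec, emotion_vec):
        # Concatenate the conditioning vectors with the linguistic features
        # of every frame, then map to WORLD acoustic parameters.
        x = torch.cat([linguistic, speaker_vec, emotion_vec], dim=-1)
        return self.net(x)

model = ConditionedDNN()
frames = torch.randn(100, linguistic_dim)            # 100 frames of linguistic features
spk = torch.randn(1, speaker_dim).expand(100, -1)    # same speaker vector for every frame
emo = torch.randn(1, emotion_dim).expand(100, -1)    # same emotion vector for every frame
acoustic = model(frames, spk, emo)                   # frame-level WORLD parameters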
Mono-speaker
The first step is mono-speaker systems. We trained three systems: a female, a male and a child voice, each with a single network. The performance is demonstrated by the following samples.
Synthesis text:好雨知时节,当春乃发声,随风潜入夜,润物细无声
- Female[1]
- Male[2]
- Child[3]
Multi-speaker
Now we combine all the data from the male, female and child speakers to train a single model.
Without Speaker-vector
In the first experiment, the data are blindly combined, without any indicator of the speaker.
- Female & Male[4]
- Female & Child[5]
- Male & Child[6]
With Speaker-vector
Now we use a speaker vector as an indicator of the speaker's traits.
- Specific person
First, we use the speaker vector to specify a particular speaker:
- Female[7]
- Male[8]
- Interpolation of different speakers
Now let's produce an interpolated voice by interpolating between two speakers, the female and the male (a minimal sketch of the interpolation follows the ratio list below).
- Female & Male with different ratio
- (1) 0.0:1.0[9]
- (2) 0.1:0.9[10]
- (3) 0.2:0.8[11]
- (4) 0.3:0.7[12]
- (5) 0.4:0.6[13]
- (6) 0.5:0.5[14]
- (7) 0.6:0.4[15]
- (8) 0.7:0.3[16]
- (9) 0.8:0.2[17]
- (10) 0.9:0.1[18]
- (11) 1.0:0.0[19]
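As a rough sketch of how the interpolation can be realised (the vector files and the synthesize() helper are hypothetical, not part of the original system), the two 40-dimensional speaker d-vectors are simply mixed with the ratios listed above before being fed to the network:

import numpy as np

female_vec = np.load("female01.dvec40.npy")   # hypothetical 40-dim d-vector files
male_vec   = np.load("male01.dvec40.npy")

for i in range(11):
    ratio = i / 10.0                           # female:male ratio 0.0:1.0, 0.1:0.9, ..., 1.0:0.0
    mixed = ratio * female_vec + (1.0 - ratio) * male_vec
    # The mixed vector replaces the speaker vector at synthesis time;
    # synthesize() stands for the DNN + WORLD pipeline sketched above.
    # synthesize(text, speaker_vec=mixed)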
Mono-speaker Multi-Emotion
Emotion vectors can be used to specify which emotion to use, and emotions can also be interpolated.
- Specific emotion
- Interpolation emotion
- Angry & neutral with different ratio
Multi-speaker Multi-emotion
Finally, all the data (different speakers and different emotions) are combined. Note that only the child voice has training data with different emotions. We hope that the emotion patterns can be learned and shared, so that we can generate emotional voices for the other speakers, although they do not have any emotional training data (a minimal sketch of this combined conditioning follows the sample list below).
- Female
- Male
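As a rough sketch of the combined conditioning (hypothetical file names and helper; not the actual pipeline), any speaker vector can be paired with an emotion vector that was learned essentially from the child's emotional data:

import numpy as np

female_vec = np.load("female01.dvec40.npy")   # speaker with only neutral training data
angry_vec  = np.load("angry.evec.npy")        # hypothetical emotion vector, learned mainly from the child's data

# Both vectors are appended to the linguistic features of every frame, so the
# single network can, in principle, render an angry female voice even though
# no angry female data was seen during training.
# synthesize(text, speaker_vec=female_vec, emotion_vec=angry_vec)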