Sinovoice-2014-02-17
DNN training
Environment setting
- The 2nd GPU machine is ready; its 3T * 4 RAID-0 array is fast enough for training I/O.
- The new machine has been added to the SGE environment.
Corpora
- The 120h of Beijing Mobile speech data are ready.
- The PICC data (200h) are being labeled and should be ready in two weeks.
- In total, 1100h of telephone speech will be ready soon.
470 hour 8k training
- Training on 470h + 300h + the 120h of Beijing Mobile data.
- Re-train the whole model set (GMM + DNN) with a noise model involved:
  - Train the noise model by treating noise as a special phone.
  - Noise needs special handling when constructing L (the lexicon FST); see the sketch after this list.
- At 7.2h per iteration, the xEnt training should finish within a week (168h / 7.2h ≈ 23 iterations).
- Run incremental DT training on the CSLT cluster, mapping noise to the silence phone.
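Below is a minimal sketch of the two noise-handling steps above, assuming a Kaldi-style lexicon.txt and hypothetical `<NOISE>`/`NSN` symbols; the actual symbols and file names used in this setup are not recorded here.

```python
# Sketch only: the <NOISE>/NSN symbols and file names are assumptions.

def add_noise_entry(lexicon_in, lexicon_out,
                    noise_word="<NOISE>", noise_phone="NSN"):
    """Give noise its own lexicon entry so it becomes a 'special phone'
    when L (the lexicon FST) is compiled."""
    with open(lexicon_in) as fin:
        entries = fin.read().splitlines()
    entries.append(f"{noise_word} {noise_phone}")
    with open(lexicon_out, "w") as fout:
        fout.write("\n".join(sorted(set(entries))) + "\n")

def map_noise_to_silence(transcript, noise_word="<NOISE>",
                         silence_word="<SIL>"):
    """For the incremental DT run: map noise tokens to the silence phone."""
    return " ".join(silence_word if w == noise_word else w
                    for w in transcript.split())

if __name__ == "__main__":
    add_noise_entry("lexicon.txt", "lexicon_noise.txt")
    print(map_noise_to_silence("HELLO <NOISE> WORLD"))  # HELLO <SIL> WORLD
```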
6000 hour 16k training
- CE DNN training has reached iteration 8 (8400 states, 80000 pdfs).
- The test WER has come down to 12.69% (Sinovoice's result: 10.70%).
Model | WER (%) | RT |
---|---|---|
small LM, it 4, -5/-9 | 15.80 | 1.18 |
large LM, it 4, -5/-9 | 15.30 | 1.50 |
large LM, it 4, -6/-9 | 15.36 | 1.30 |
large LM, it 4, -7/-9 | 15.25 | 1.30 |
large LM, it 5, -5/-9 | 14.17 | 1.10 |
large LM, it 5, -5/-10 | 13.77 | 1.29 |
large LM, it 6, -5/-9 | 13.64 | - |
large LM, it 6, -5/-10 | 13.25 | - |
large LM, it 7, -5/-9 | 13.29 | - |
large LM, it 7, -5/-10 | 12.87 | - |
large LM, it 8, -5/-9 | 13.09 | - |
large LM, it 8, -5/-10 | 12.69 | - |

("it N" is the CE training iteration; RT is the real-time factor.)
- A new round of training with trees shared across tone variations has been kicked off and has reached the DNN training stage.
- Still need to test the new GMM model and compare it against Xiaoming's original settings.
Adaptation
- Adaptation with 10, 20, and 30 sentences has been conducted.
- 30 sentences reach reasonable performance (WER from 14.6% to 11.2%).
- Hidden-layer adaptation works better than input- and output-layer adaptation.
- Cross-entropy regularization with P=0.3 works reasonably well; see the sketch after this list.
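A minimal sketch of the cross-entropy regularized adaptation target, assuming P is the weight that interpolates the hard adaptation labels with posteriors from the unadapted (speaker-independent) model; all names below are illustrative rather than from the project code.

```python
import numpy as np

def ce_regularized_targets(hard_labels, si_posteriors, P=0.3):
    """Interpolate one-hot adaptation labels with the SI model's
    posteriors; P is the regularization weight (P=0.3 above)."""
    num_classes = si_posteriors.shape[1]
    one_hot = np.eye(num_classes)[hard_labels]
    return (1.0 - P) * one_hot + P * si_posteriors

def cross_entropy(targets, adapted_posteriors, eps=1e-10):
    """Mean per-frame CE between the interpolated targets and the
    adapted model's output."""
    return -np.mean(np.sum(targets * np.log(adapted_posteriors + eps),
                           axis=1))

# Toy usage: 3 frames, 4 classes, flat posteriors for illustration.
labels = np.array([0, 2, 1])
si = np.full((3, 4), 0.25)
adapted = np.full((3, 4), 0.25)
print(cross_entropy(ce_regularized_targets(labels, si), adapted))
```

With P=0.3, 70% of each target comes from the hard label and 30% from the SI model, which keeps the adapted network from drifting too far when only 10-30 sentences are available.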
Auto Transcription
- Decoding the PICC development set gave 45% WER.
- Decoding of the PICC training set (200h) is done, with confidence scores generated.
- With the confidence threshold set to 0.9, the training data shrink from 230k sentences to 40k (see the filtering sketch after this list).
- Next: run discriminative training on the filtered 40k sentences and test on the development set.
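A minimal sketch of the confidence filtering step, assuming a score file with one `utt-id confidence` pair per line; the file names and format are assumptions.

```python
def filter_by_confidence(conf_file, out_file, threshold=0.9):
    """Keep only utterances whose decoding confidence reaches the
    threshold (here reducing ~230k auto-transcribed sentences to ~40k)."""
    kept = 0
    with open(conf_file) as fin, open(out_file, "w") as fout:
        for line in fin:
            utt_id, conf = line.split()
            if float(conf) >= threshold:
                fout.write(utt_id + "\n")
                kept += 1
    return kept

if __name__ == "__main__":
    n = filter_by_confidence("utt_confidence.txt", "kept_utts.txt", 0.9)
    print(f"kept {n} utterances")
```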
DNN Decoder
- Faster decoder:
  - The RT of the latest decoder on train203 is 0.144 with HCLG and 0.148 with CLG.
- Online decoder:
  - Interface design is complete.
  - The CMN strategy is settled: (1) train a global CMN model first; (2) apply the model directly in decoding; (3) the DNN model may need slight adaptation to the normalized features. A sketch follows below.
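A minimal sketch of the three-step CMN strategy, assuming per-utterance feature matrices of shape (frames, dims); function names are illustrative.

```python
import numpy as np

def estimate_global_cmn(training_utts):
    """Step (1): estimate one global mean vector over all training frames."""
    frames = np.concatenate(training_utts, axis=0)  # (total_frames, dim)
    return frames.mean(axis=0)

def apply_global_cmn(feats, global_mean):
    """Step (2): subtract the global mean directly at decode time; no
    per-utterance statistics are needed, which suits online decoding."""
    return feats - global_mean

# Toy usage with random 13-dim features.
rng = np.random.default_rng(0)
train = [rng.normal(size=(100, 13)), rng.normal(size=(80, 13))]
mean = estimate_global_cmn(train)
normalized = apply_global_cmn(rng.normal(size=(50, 13)), mean)
# Step (3) would lightly fine-tune the DNN on such normalized features.
```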