Sinovoice-2014-02-17

DNN training

Environment setting

  • The 2nd GPU machine is ready; its 4 × 3 TB RAID-0 array is fast enough.
  • The new machine has been added to the SGE environment.

Corpora

  • The Beijing Mobile 120h speech data are ready.
  • The PICC data (200h) are being labeled and should be ready in two weeks.
  • With these, a total of about 1100h of telephone speech will be ready soon.

470 hour 8k training

  • Training set: 470h + 300h + Beijing Mobile 120h.
  • Re-train the whole model set (GMM + DNN), with a noise model included.
  • Train the noise model by treating noise as a special phone (a sketch follows this list).
  • The noise phone needs special treatment when constructing L.
  • At 7.2 hours per iteration, the xEnt training should finish within about one week.
  • Run incremental DT training on the CSLT cluster, mapping the noise phone to the silence phone.
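
A possible realization of the noise-phone idea, sketched under the assumption of a Kaldi-style data preparation (the notes do not name the exact setup): map a noise word to a dedicated noise phone in lexicon.txt and list that phone among the silence phones, so that L construction and tree building treat it like SIL rather than as a regular speech phone. The word <NOISE>, the phone name NSN, and the directory path are illustrative placeholders.

    #!/usr/bin/env python3
    # Sketch: register a noise "phone" in a Kaldi-style dict dir before L is built.
    # Assumptions: standard dict-dir layout (lexicon.txt, silence_phones.txt);
    # <NOISE> and NSN are placeholder names, not taken from the meeting notes.
    import sys

    dict_dir = sys.argv[1] if len(sys.argv) > 1 else "data/local/dict"
    noise_word, noise_phone = "<NOISE>", "NSN"

    # 1) Map the noise word to the noise phone in the lexicon.
    with open(dict_dir + "/lexicon.txt", "a", encoding="utf-8") as f:
        f.write("%s %s\n" % (noise_word, noise_phone))

    # 2) List the noise phone with the silence phones, so that L construction
    #    and tree building handle it like SIL instead of a speech phone.
    with open(dict_dir + "/silence_phones.txt", "a", encoding="utf-8") as f:
        f.write(noise_phone + "\n")

    print("Added %s -> %s in %s" % (noise_word, noise_phone, dict_dir))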


6000 hour 16k training

  • CE DNN training has reached iteration 8 (8400 states, 80000 pdfs).
  • Test WER has come down to 12.69% (Sinovoice's own result: 10.70%).
  Model                      WER (%)   RT
  small LM, it 4, -5/-9      15.80     1.18
  large LM, it 4, -5/-9      15.30     1.50
  large LM, it 4, -6/-9      15.36     1.30
  large LM, it 4, -7/-9      15.25     1.30
  large LM, it 5, -5/-9      14.17     1.10
  large LM, it 5, -5/-10     13.77     1.29
  large LM, it 6, -5/-9      13.64     -
  large LM, it 6, -5/-10     13.25     -
  large LM, it 7, -5/-9      13.29     -
  large LM, it 7, -5/-10     12.87     -
  large LM, it 8, -5/-9      13.09     -
  large LM, it 8, -5/-10     12.69     -
  (it = CE training iteration)
  • A new round of training with decision trees shared across tone variations has been kicked off and has reached the DNN training stage again (a roots-file sketch follows this list).
  • Need to test the new GMM model and compare it with Xiaoming's original settings.
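
One way the shared trees for tone variations could be set up, sketched here under the assumption of a Kaldi-style roots file in which phones listed on one "shared split" line share a single decision-tree root; the phone naming scheme (base phone plus tone digit) and file name are illustrative, not taken from the notes.

    # Sketch: group tonal variants of each base phone onto one "shared split"
    # line of a Kaldi-style roots.txt, so they share a decision-tree root while
    # the tree can still split on tone-dependent questions.
    # Assumption: phones are written as base + tone digit, e.g. a1 a2 a3 a4 a5.
    import re
    from collections import defaultdict

    def write_shared_tone_roots(phones, out_path="roots.txt"):
        groups = defaultdict(list)
        for p in phones:
            m = re.match(r"^(.+?)(\d)$", p)   # split off a trailing tone digit
            groups[m.group(1) if m else p].append(p)
        with open(out_path, "w", encoding="utf-8") as f:
            for base in sorted(groups):
                f.write("shared split " + " ".join(sorted(groups[base])) + "\n")

    # e.g. write_shared_tone_roots(["a1", "a2", "a3", "a4", "ai1", "ai2", "sil"])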


Adaptation

  • Adaptation experiments with 10, 20, and 30 sentences have been conducted.
  • 30 sentences already give reasonable performance (WER from 14.6% down to 11.2%).
  • Hidden-layer adaptation works better than input- or output-layer adaptation.
  • Cross-entropy regularization with P=0.3 is a reasonable setting (a sketch follows this list).
  • Results are here
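
A sketch of what the cross-entropy regularization could look like, assuming "P=0.3" is the interpolation weight of a KL-style regularizer that pulls the adaptation targets toward the unadapted (speaker-independent) model's posteriors; the function names and array layout below are illustrative.

    # Sketch: cross-entropy (KL) regularized adaptation targets.
    # Assumption: P = 0.3 weights the speaker-independent model's posteriors;
    # the adapted model is then trained with plain cross entropy against
    # these interpolated targets, which keeps it close to the original model
    # when only 10-30 adaptation sentences are available.
    import numpy as np

    def regularized_targets(hard_labels, si_posteriors, p=0.3):
        """hard_labels: (T,) senone indices from the alignment.
        si_posteriors: (T, S) outputs of the unadapted DNN on the same frames."""
        T, S = si_posteriors.shape
        one_hot = np.zeros((T, S), dtype=np.float32)
        one_hot[np.arange(T), hard_labels] = 1.0
        return (1.0 - p) * one_hot + p * si_posteriors

    def adaptation_loss(log_probs, targets):
        """Frame-averaged cross entropy of the adapted model against the targets."""
        return float(-(targets * log_probs).sum(axis=1).mean())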

Auto Transcription

  • Decoding the PICC development set gave 45% WER.
  • Decoding of the PICC training set (200h) is done, with confidence scores generated.
  • With the confidence threshold set to 0.9, the training data shrinks from 230k sentences to 40k (a filtering sketch follows this list).
  • Next: run discriminative training on the filtered 40k sentences and test on the development set.
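
A minimal sketch of the confidence-based filtering step, assuming per-utterance confidence scores have already been written out alongside the hypothesized transcripts; the file names and formats below are illustrative, not the actual pipeline.

    # Sketch: keep only auto-transcribed utterances whose confidence passes the
    # threshold, producing the subset used for discriminative training.
    # Assumed inputs: "utt-id score" lines for confidences, "utt-id words ..."
    # lines for the hypotheses (illustrative formats).
    def filter_by_confidence(conf_path, text_path, out_path, threshold=0.9):
        with open(conf_path, encoding="utf-8") as f:
            conf = {}
            for line in f:
                utt, score = line.split(None, 1)
                conf[utt] = float(score)
        kept = 0
        with open(text_path, encoding="utf-8") as fin, \
             open(out_path, "w", encoding="utf-8") as fout:
            for line in fin:
                utt = line.split(None, 1)[0]
                if conf.get(utt, 0.0) >= threshold:
                    fout.write(line)
                    kept += 1
        return kept

    # e.g. filter_by_confidence("decode/confidence.txt", "decode/text",
    #                           "data/picc_filtered/text", threshold=0.9)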


DNN Decoder

  • Faster decoder
      • The new RT numbers are reported here.
      • The RT of the latest decoder on train203 is 0.144 (HCLG) / 0.148 (CLG).
  • Online decoder
      • Interface design is completed.
      • The CMN strategy is clear: (1) train a global CMN model first; (2) apply that model directly in decoding; (3) the DNN model may need slight adaptation to the normalized features (see the sketch below).
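
A sketch of the global-CMN idea for the online decoder: estimate one global cepstral mean offline over the training features, then subtract it frame-by-frame at decode time, since per-utterance statistics are not available in streaming mode. The data layout (an iterable of (T, D) feature matrices) is an assumption for illustration.

    # Sketch: global CMN for online decoding.
    import numpy as np

    def train_global_cmn(feature_matrices):
        """Accumulate one global mean over all training features.
        feature_matrices: iterable of (T, D) numpy arrays."""
        total, frames = None, 0
        for feats in feature_matrices:
            s = feats.sum(axis=0)
            total = s if total is None else total + s
            frames += feats.shape[0]
        return total / frames

    def apply_global_cmn(frame, global_mean):
        """Online: subtract the precomputed global mean from each incoming frame."""
        return frame - global_mean

Step (3) in the list, slightly adapting the DNN to the normalized features, would then be a short fine-tuning pass on features processed this way.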