Sinovoice-2014-03-04
From cslt Wiki
DNN training
Environment setting
- Schedule a cluster power-off on 3/06; construct RAID-0 on train212
- The new RAID-0 uses three new disks on train212
- Change NFS names: disk1 -> /nfs/disk212, the RAID disks -> /nfs/raid212, disk2 -> /nfs/raid215
Corpora
- PICC data (200h) are being labeled and should be ready in one week.
- 105h data from BJ mobile
- 127h Hubei telecom
- In total, 1121h (470 + 346 + 105 + 200) of telephone speech will be ready soon.
- 16k 6000h data: 978h online data from DataTang + 656h online mobile data + 4300h recording data
Telephone model training
470 + 300h + BJ mobile 105h training
Training condition | NO NOISE | NOISE in LM | opt noise | NOISE LM + opt noise |
---|---|---|---|---|
No noise | 30.61% | - | - | - |
Noise phone added | 31.88% | 30.76% | 31.27% | 31.07% |
BJ mobile incremental training
(1) Original 470 + 300 model: 30.24% WER
MPE2 | MPE3 | MPE3+iLM | MPE4+iLM |
---|---|---|---|
27.01% | 26.72% | 25.09% | 24.53% |
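To make the gain from incremental training easier to read, the figures above can be turned into relative WER reductions. This is a minimal sketch, assuming the 30.24% baseline and the MPE results were measured on the same test set; the function name `relative_wer_reduction` is illustrative and not part of any toolkit.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# WER figures (in %) reported above for BJ mobile incremental training.
# Assumption: all numbers come from the same test set.
baseline = 30.24
results = {"MPE2": 27.01, "MPE3": 26.72, "MPE3+iLM": 25.09, "MPE4+iLM": 24.53}

for name, wer in results.items():
    reduction = relative_wer_reduction(baseline, wer)
    print(f"{name}: {wer:.2f}% WER, {reduction:.1f}% relative reduction")
```

Under that assumption, MPE4+iLM corresponds to a relative reduction of roughly 19% over the original 470 + 300 model.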
6000 hour 16k training
Training progress
- Ran CE DNN to iteration 11 (8400 states, 80000 pdf)
- Test WER goes down to 12.49% (Sinovoice result: 10.49%).
Model | WER | RT |
---|---|---|
small LM, it 4, -5/-9 | 15.80 | 1.18 |
large LM, it 4, -5/-9 | 15.30 | 1.50 |
large LM, it 4, -6/-9 | 15.36 | 1.30 |
large LM, it 4, -7/-9 | 15.25 | 1.30 |
large LM, it 5, -5/-9 | 14.17 | 1.10 |
large LM, it 5, -5/-10 | 13.77 | 1.29 |
large LM, it 6, -5/-9 | 13.64 | 1.12 |
large LM, it 6, -5/-10 | 13.25 | 1.33 |
large LM, it 7, -5/-9 | 13.29 | 1.12 |
large LM, it 7, -5/-10 | 12.87 | 1.17 |
large LM, it 8, -5/-9 | 13.09 | - |
large LM, it 8, -5/-10 | 12.69 | - |
large LM, it 9, -5/-9 | 12.87 | - |
large LM, it 9, -5/-10 | 12.55 | - |
large LM, it 10, -5/-9 | 12.83 | 1.51 |
large LM, it 10, -5/-10 | 12.48 | 1.65 |
large LM, it 11, -5/-9 | 12.87 | 1.61 |
large LM, it 11, -5/-10 | 12.46 | 1.28 |
large LM, it 12, -5/-9 | 12.91 | 1.61 |
large LM, it 12, -5/-10 | 12.49 | 1.28 |
- xEnt training is done
Training Analysis
- Shared-tree GMM model training completed; the WER is similar to that of the non-shared model.
- Selected 100h online data, trained two systems: (1) di-syllable system (2) jt-phone system
Training stage | di-syl | jt-ph |
---|---|---|
Xent | 15.42% | 14.78% |
MPE1 | 14.46% | 14.23% |
MPE2 | 14.22% | 14.09% |
MPE3 | 14.26% | 13.80% |
MPE4 | 14.24% | 13.68% |
Hybrid training
- Recipe
- 100h MPE training
- 1700h MPE alignment/lattice
- 1700h MPE training
- 1 week to complete 3 MPE iterations
- MPE2 result: 1e-9: 10.67% (8.61%), 1e-10: 10.34% (8.27%)
Auto Transcription
PICC
- Decoding the PICC development set gives 45% WER.
- PICC auto-trans incremental DT training completed
Threshold | WER |
---|---|
org | 45.03% |
0.9 | 41.89% |
0.8 | 41.64% |
- The current training data selected with threshold 0.8 comprise 80k sentences, about 60h of data.
- Sampling 60h of labelled data to enrich the training
- Preparing to compare unsupervised incremental training with supervised training
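The threshold-based selection described above can be sketched as follows. This is only an illustration: the `Hypothesis` record, its fields, and the toy utterances are hypothetical, and the source does not say how the confidence scores are computed or at what granularity.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One automatically transcribed utterance (hypothetical record layout)."""
    utt_id: str
    text: str
    confidence: float  # assumed per-utterance confidence in [0, 1]
    duration_s: float  # utterance length in seconds


def select_by_confidence(hyps, threshold):
    """Keep only the utterances whose confidence reaches the threshold."""
    return [h for h in hyps if h.confidence >= threshold]


def total_hours(hyps):
    """Total audio duration of the given utterances, in hours."""
    return sum(h.duration_s for h in hyps) / 3600.0


# Toy data for illustration only; not taken from the PICC set.
hyps = [
    Hypothesis("utt1", "hello", 0.95, 3.0),
    Hypothesis("utt2", "good morning", 0.85, 2.5),
    Hypothesis("utt3", "thank you", 0.70, 4.0),
    Hypothesis("utt4", "goodbye", 0.92, 1.5),
]

for threshold in (0.9, 0.8):
    kept = select_by_confidence(hyps, threshold)
    print(f"threshold {threshold}: {len(kept)} utterances, {total_hours(kept):.4f} h")
```

A lower threshold keeps more data at the cost of noisier transcripts, which is the trade-off the 0.9 and 0.8 rows above are probing.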
Hubei telecom
- Hubei telecom data (127h): retrieved 60k sentences with confidence threshold 0.9, amounting to 50%
Model | wer_14 | wer_15 |
---|---|---|
xEnt org | - | 29.05 |
MPE iter1 | 29.23 | 29.38 |
MPE iter2 | 29.05 | 29.11 |
MPE iter3 | 29.32 | 29.28 |
MPE iter4 | 29.29 | 29.28 |
- Retrieved 30k sentences with confidence threshold 0.95, amounting to 25%, plus the original 770h data
Model | wer_14 | wer_15 |
---|---|---|
xEnt org | - | 29.05 |
MPE iter1 | - | 29.36 |
DNN Decoder
Online decoder
- Tested various CMN implementations
- 200ms/500ms frame block adaptation
- 10ms frame block adaptation: results are completely wrong
prior weight | -1 | 1 | 5 | 10 | 20 | 50 | 100 |
---|---|---|---|---|---|---|---|
200ms | 28.29 | 37.53 | 35.50 | 34.08 | 32.90 | 32.30 | 32.77 |
500ms | 28.29 | 31.28 | 30.83 | 30.22 | 29.50 | 29.32 | 29.36 |
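One possible reading of block-wise CMN with a prior weight is sketched below. The mean is re-estimated once per block as a weighted combination of a prior mean and all frames seen so far, with the prior weight acting as a pseudo-count of frames. This formula, the function name `cmn_with_prior`, and the 10ms frame shift (so that 200ms is 20 frames) are assumptions, not the decoder's actual implementation. The -1 column of the table is not modeled, since the source does not say what it denotes.

```python
def cmn_with_prior(frames, prior_mean, prior_weight, block_size):
    """Block-wise cepstral mean normalization with a prior (assumed scheme).

    frames:       list of feature vectors (lists of floats)
    prior_mean:   mean vector assumed to be estimated offline
    prior_weight: pseudo-count of frames given to the prior (>= 0)
    block_size:   frames per adaptation block, e.g. 20 for 200ms
                  if the frame shift is 10ms (an assumption)
    """
    if prior_weight < 0:
        raise ValueError("negative prior weights are not modeled in this sketch")
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    dim = len(prior_mean)
    running_sum = [0.0] * dim
    seen = 0
    normalized = []

    for start in range(0, len(frames), block_size):
        block = frames[start:start + block_size]
        # Accumulate statistics of the current block.
        for frame in block:
            for d in range(dim):
                running_sum[d] += frame[d]
        seen += len(block)
        # Interpolate between the prior mean and the observed mean.
        mean = [
            (prior_weight * prior_mean[d] + running_sum[d]) / (prior_weight + seen)
            for d in range(dim)
        ]
        # Subtract the updated mean from every frame of this block.
        for frame in block:
            normalized.append([frame[d] - mean[d] for d in range(dim)])

    return normalized


# Toy one-dimensional example, for illustration only.
frames = [[1.0], [3.0], [5.0], [7.0]]
out = cmn_with_prior(frames, prior_mean=[0.0], prior_weight=2, block_size=2)
print(out)
```

In this scheme a larger prior weight keeps the estimate close to the prior mean for longer, while longer blocks give each update more frames to average over. Whether that matches the behaviour behind the 200ms and 500ms rows depends on the actual implementation.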