Sinovoice-2016-4-28
From the cslt Wiki
Contents
Data
- 16K LingYun
- 2000h data ready
- 4300h real-env data to label
- YueYu
- Total 250h (190h YueYu + 60h English)
- Add 60h YueYu
- CER: 75%->76%
- WeiYu
- 50h for training
- 120h labeled ready
Model training
Deletion Error Problem
- Add one noise phone to alleviate the silence over-training
- Omit sil accuracy in discriminative training
- Smoothing of MPE with XEnt (interpolation weight H)
- Testdata: test_1000ju from 8000ju
-------------------------------------------------------------------------
model                                     | ins | del | sub | wer/tot-err
-------------------------------------------------------------------------
svd600_lr2e-5_1000H_mpe_uv-fix            |  24 |  56 | 408 | 8.26/488
svd600_lr2e-5_1000H_mpe_uv-fix_omitsilacc |  32 |  48 | 409 | 8.28/489
svd600_lr2e-5_1000H_mpe_uv-fix_xent0.1    |  24 |  57 | 406 | 8.24/487
-------------------------------------------------------------------------
- Testdata: test_8000ju
--------------------------------------------------------------------------------------
model                                  | ins | del | sub  | wer/tot-err | total words
--------------------------------------------------------------------------------------
svd600_lr2e-5_1000H_mpe_uv-fix         | 140 | 562 | 3686 | 9.19/4388   | 47753
svd600_lr2e-5_1000H_mpe_uv-fix_xent0.1 | 146 | 510 | 3705 | 9.13/4361   |
--------------------------------------------------------------------------------------
- Testdata: test_2000ju from 10000ju
--------------------------------------------------------------------------
model                                     | ins | del | sub  | wer/tot-err
--------------------------------------------------------------------------
svd600_lr2e-5_1000H_mpe_uv-fix            |  86 | 790 | 1471 | 18.55/2347
svd600_lr2e-5_1000H_mpe_uv-fix_omitsilacc | 256 | 473 | 1669 | 18.95/2398
svd600_lr2e-5_1000H_mpe_uv-fix_xent0.1    |  95 | 704 | 1548 | 18.55/2347
--------------------------------------------------------------------------
- Testdata: test_10000ju
---------------------------------------------------------------------------------------
model                                  | ins | del  | sub  | wer/tot-err | total words
---------------------------------------------------------------------------------------
svd600_lr2e-5_1000H_mpe_uv-fix         | 478 | 3905 | 7698 | 18.31/12081 | 65989
svd600_lr2e-5_1000H_mpe_uv-fix_xent0.1 | 481 | 3741 | 7773 | 18.18/11995 |
---------------------------------------------------------------------------------------
- Add one silence arc from start-state to end-state
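The XEnt smoothing used in the `_xent0.1` runs above interpolates the discriminative objective with a cross-entropy term to keep the model close to the frame-level alignment. A minimal numpy sketch of the idea, assuming frame posteriors and alignment targets; the function names and the exact form of the interpolation are illustrative, not the production recipe:

```python
import numpy as np

def xent_loss(posteriors, targets):
    # Frame-level cross-entropy against forced-alignment targets.
    eps = 1e-10
    picked = posteriors[np.arange(len(targets)), targets]
    return float(-np.mean(np.log(picked + eps)))

def smoothed_objective(mpe_loss, posteriors, targets, h=0.1):
    # Interpolate the MPE loss with a cross-entropy regularizer;
    # h=0.1 corresponds to the "_xent0.1" runs in the tables above.
    # Hypothetical sketch of the smoothing idea only.
    return mpe_loss + h * xent_loss(posteriors, targets)
```

With a small XEnt weight the smoothed objective stays dominated by the MPE term while the cross-entropy component discourages the silence over-training that produces deletions.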
Big-Model Training
- 16k
=====================================================================================
|                      | TDNN 7-1200    | TDNN 7-1200 enhance | TDNN 7-1200 svd600 |
-------------------------------------------------------------------------------------
| 8000ju frame_skip=1  |                | 0.0556 / 0.349      | 0.0559 / 0.306     |
| 8000ju frame_skip=2  | 0.059 / 0.243  | 0.0591 / 0.231      | 0.0589 / 0.228     |
-------------------------------------------------------------------------------------
| 10000ju frame_skip=1 |                | 0.1241 / 0.341      | 0.1244 / 0.358     |
| 10000ju frame_skip=2 | 0.1348 / 0.234 | 0.1315 / 0.245      | 0.1311 / 0.204     |
-------------------------------------------------------------------------------------
| English frame_skip=1 |                | 0.3897 / 0.370      | 0.4062 / 0.353     |
| English frame_skip=2 | 0.4296         | 0.4237 / 0.276      | 0.4306 / 0.252     |
=====================================================================================
- 8k
PingAn:
===============================================================================
| AM / config                | all beam9 | all beam9 biglm || KeHu beam9    |
-------------------------------------------------------------------------------
| tdnn 7-2048 xEnt           | 16.45     | 16.22           || 36.49 / 25.18 |
| tdnn 7-2048 MPE            | 15.22     | 14.87           || 32.77 / 23.48 |
| tdnn 7-2048 MPE adapt-PABX | 14.67     | 14.63           || 31.33 / 22.76 |
-------------------------------------------------------------------------------
| tdnn 7-1024 xEnt           | 16.60     | 16.25           || 35.91 / 25.58 |
| tdnn 7-1024 MPE            | 15.67     | 15.61           || 32.77 / 26.09 |
| tdnn 7-1024 MPE adapt-PABX | 14.80     | 14.76           || 30.48 / 22.56 |
===============================================================================
LiaoNingYiDong:
==============================================================================
| AM / config                | beam9 | beam9 biglm | beam13 |
------------------------------------------------------------------------------
| tdnn 7-2048 xEnt           | 21.51 | 21.05       | 21.17  |
| tdnn 7-2048 MPE            | 20.09 | 19.74       | 19.74  |
| tdnn 7-2048 MPE adapt-LNYD | 17.92 | 17.87       | 17.58  |
------------------------------------------------------------------------------
| tdnn 7-1024 xEnt           | 21.72 | 22.74       | 21.64  |
| tdnn 7-1024 MPE            | 20.99 | 20.77       | 20.74  |
| tdnn 7-1024 MPE adapt-LNYD |       |             |        |
==============================================================================
Embedding
- The nnet1 AM is 6.4M (3M after decomposition), so the AM size needs to be kept within 10M.
- 5*576-2400 TDNN model training done. AM size is about 17M.
- 5*500-2400 TDNN model in training.
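The `svd600` models and the post-decomposition size quoted above come from low-rank factorization of trained weight matrices. A minimal numpy sketch of the technique, with illustrative dimensions (this is the general idea, not the group's actual script):

```python
import numpy as np

def svd_compress(W, rank):
    # Factor a weight matrix W (out_dim x in_dim) into A (out_dim x rank)
    # and B (rank x in_dim) via truncated SVD, so one affine layer becomes
    # two thin layers with a rank-sized linear bottleneck. Parameter count
    # drops from m*n to (m+n)*rank when rank < m*n/(m+n).
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # absorb singular values into A
    B = Vt[:rank, :]
    return A, B

# Example: a 1200x1200 layer truncated to rank 256 keeps
# (1200 + 1200) * 256 = 614,400 parameters instead of 1,440,000.
```

After the decomposition the factored network is typically fine-tuned for a few iterations to recover any accuracy lost to the truncation.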
SinSong Robot
- Test based on the 10000h (7*2048 xent) model
------------------------------------------------
condition | clean | replay (0.5m) | real-env
------------------------------------------------
wer       | 3     | 18 (mpe: 14)  | too bad
------------------------------------------------
- Plan to record in restaurant on April 10.
Character LM
- Except for Sogou-2T, the 9-gram has been trained.
- Worse than the word-LM (9% -> 6%)
- Add word boundary tags to character-LM training
- Merge Character-LM & word-LM
- Union
- Compose, success.
- 2-step decoding: first, character-based LM. Then, word-based LM.
- Word boundary character training
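The boundary-tag idea above can be sketched as follows: each word is expanded into characters and the word-final character carries a boundary marker, so a character LM trained on the output retains segmentation information. The `+` tag scheme is an assumption for illustration:

```python
def to_tagged_chars(words):
    # Expand each word into characters and mark the word-final
    # character with a hypothetical "+" boundary tag.
    chars = []
    for w in words:
        chars.extend(list(w[:-1]))
        chars.append(w[-1] + "+")
    return chars

# e.g. to_tagged_chars(["今天", "天气"]) -> ["今", "天+", "天", "气+"]
```

The inverse mapping (splitting a tagged character stream back into words at `+` tags) gives a word segmentation for free after character-level decoding.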
Project
- PingAn & YueYu: too many deletion errors
- TDNN deletion error rate > DNN deletion error rate
- The TDNN silence scale is too sensitive across different test cases.
SID
Digit
- DNN-PLDA achieves better performance than i-vector;
DNN cosine:
  10.4167%, at threshold 89.3973
  9.72222%, at threshold 87.8146
  8.68056%, at threshold 84.2021
  3.47222%, at threshold 11.5852
DNN lda:
  3.125%,   at threshold 54.1172
  2.77778%, at threshold 50.1447
  2.43056%, at threshold 48.6887
  1.73611%, at threshold 14.5075
DNN plda:
  2.43056%, at threshold -23.954
  2.08333%, at threshold -24.6051
  2.08333%, at threshold -21.0524
  1.73611%, at threshold 4.83949
i-vector plda:
  3.15789%, at threshold 0.563044
  3.85965%, at threshold 0.525273
  3.85965%, at threshold 0.502531
  2.80702%, at threshold 0.429186