Sinovoice-2014-04-22
h1. Environment setting
- Sinovoice internal server deployment: the usage standard draft has been released.
- Email notification is problematic; need to obtain an SMTP server.
- Will train a Redmine administrator for Sinovoice.
h1. Corpora
- Text transcription of the 300h Guangxi telecom data has been arranged; 180h is completed.
- In total, 1338h of telephone speech is now ready (470h + 346h + 105h BJ Mobile + 200h PICC + 108h HBTc + 109h new BJ Mobile).
- 16k 6000h data: 978h of online data from DataTang + 656h of online mobile data + 4300h of recorded data.
- A standard has been established for LM-speech-text labeling (speech data transcription for LM enhancement).
- Xiaona is preparing a noise database by extracting noise segments from the original wav files (a sketch of one extraction approach follows).
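As a rough illustration of the extraction step, the sketch below pulls low-energy (presumably non-speech) spans out of a wav file with a simple frame-energy threshold. The 16-bit mono assumption, the percentile threshold, and the frame sizes are illustrative guesses, not the actual procedure.

<pre>
import wave
import numpy as np

def noise_spans(path, frame_ms=25, hop_ms=10, min_len_s=0.5):
    """Return (start_s, end_s) spans whose frame energy stays below a threshold."""
    with wave.open(path) as w:                 # assumes 16-bit mono PCM
        sr = w.getframerate()
        pcm = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n = max(0, (len(pcm) - frame) // hop + 1)
    energy = np.array([np.mean(pcm[i * hop:i * hop + frame].astype(np.float64) ** 2)
                       for i in range(n)])
    thresh = np.percentile(energy, 20)         # treat the quietest 20% of frames as noise
    spans, start = [], None
    for i in range(n + 1):
        is_noise = i < n and energy[i] <= thresh
        if is_noise and start is None:
            start = i                          # a quiet run begins
        elif not is_noise and start is not None:
            s, e = start * hop / sr, ((i - 1) * hop + frame) / sr
            if e - s >= min_len_s:             # keep only usably long noise spans
                spans.append((s, e))
            start = None
    return spans
</pre>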
h1. Acoustic modeling
h2. Telephone model training
h3. 1000h Training
- Baseline: 8k states, 470h + 300h data, MPE4: 20.29
- Jietong phone set, 200-hour seed model, 10k-state training:
  - Xent, 16 iterations: 22.90
  - MPE1: 20.89
  - MPE2: 20.68
  - MPE3: 20.61
  - MPE4: 20.56
- CSLT phone set, 8k-state training:
  - MPE1: 20.60
  - MPE2: 20.37
  - MPE3: 20.37
  - MPE4: 20.37
- Found a problem in data processing: some utterances were cut off incorrectly. The model is being retrained (a validation check is sketched below).
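One way to catch the cut-off problem is to verify that each segment's end time fits inside its source wav file. A minimal sketch, assuming Kaldi-style segment lines (utt-id, wav-id, start, end); the file layout is an assumption, not the actual Sinovoice setup:

<pre>
import wave

def wav_duration(path):
    """Duration of a wav file in seconds."""
    with wave.open(path) as w:
        return w.getnframes() / float(w.getframerate())

def find_truncated(segments_file, wav_paths, tol=0.01):
    """Flag segments whose end time runs past the end of the audio.

    wav_paths maps wav-id -> file path; tol absorbs rounding in the timestamps.
    """
    bad = []
    with open(segments_file) as f:
        for line in f:
            utt, wav_id, start, end = line.split()
            dur = wav_duration(wav_paths[wav_id])
            if float(end) > dur + tol:
                bad.append((utt, float(end), dur))
    return bad
</pre>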
h2. 6000 hour 16k training
h3. Training progress
- Baseline: 1700h, JT phone set, MPE5: 9.91
- 6000h / CSLT phone set training:
  - Xent: 12.83
  - MPE1: 9.21
  - MPE2: 9.13
  - MPE3: 9.10
- 6000h / JT phone set training:
  - MPE1: 10.63
h3. Training Analysis
- The Qihang model used a subset of the 6k data:
  - 2500 + 950h + tang500h* + 20131220, approximately 1700 + 2400 hours
- GMM training on this subset achieved 22.47%, while Xiaoming's result is 16.1%.
- It seems the database is still not very consistent.
- Xiaoming kicked off a job to reproduce the Qihang training on this subset.
h3. Multilingual Training
- Preparing Chinglish data: will first select 100h to train a baseline model.
- The AMIDA database is being downloaded.
- Preparing a shared DNN structure for multilingual training (see the sketch after this list).
- The baseline Chinese-English system is done.
- Need to tune the hidden-layer sizes; need more sharing in the structure.
- Need to investigate knowledge-based phone sharing.
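One common sharing scheme, sketched below with numpy, keeps the hidden stack common to all languages and gives each language its own softmax output layer; each language's data updates the shared layers but only its own head. The layer sizes, the two-language setup, and the ReLU nonlinearity are illustrative assumptions; the meeting fixed no configuration.

<pre>
import numpy as np

rng = np.random.default_rng(0)

def layer(n_in, n_out):
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

# Hidden stack shared across languages (sizes are assumptions)
shared = [layer(440, 1200), layer(1200, 1200), layer(1200, 1200)]
# One softmax output layer per language (state counts are assumptions)
heads = {"zh": layer(1200, 8000), "en": layer(1200, 6000)}

def forward(x, lang):
    h = x
    for W, b in shared:                   # shared layers: trained on all languages
        h = np.maximum(h @ W + b, 0.0)
    W, b = heads[lang]                    # language-specific output layer
    z = h @ W + b
    z -= z.max(axis=-1, keepdims=True)    # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

post = forward(rng.normal(size=(1, 440)), "zh")  # per-state posteriors for Chinese
</pre>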
h3. Noise-robust features
- GFbank can be propagated to Sinovoice (a reference sketch of the feature follows this list).
- 1700h, JT phone set, MPE3: Fbank 10.48 vs. GFbank 10.23.
- Preparing to train on the 1000h telephone speech.
- Liuchao will prepare fast-computation code.
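For reference, GFbank replaces the mel-spaced triangular filters of Fbank with gammatone filters spaced on the ERB scale. A minimal frequency-domain sketch; the filter count, frequency range, and the 4th-order response approximation are assumptions, not the Sinovoice implementation:

<pre>
import numpy as np

def erb_centers(n, fmin, fmax):
    """Center frequencies equally spaced on the ERB-rate scale (Glasberg & Moore)."""
    rate = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)
    inv = lambda e: (10.0 ** (e / 21.4) - 1.0) / 4.37e-3
    return inv(np.linspace(rate(fmin), rate(fmax), n))

def gfbank(frames, sr, n_filters=40, nfft=512):
    """Log gammatone filterbank energies from framed audio [n_frames, frame_len]."""
    spec = np.abs(np.fft.rfft(frames * np.hamming(frames.shape[1]), nfft)) ** 2
    freqs = np.fft.rfftfreq(nfft, 1.0 / sr)
    cfs = erb_centers(n_filters, 50.0, sr / 2.0)
    bw = 1.019 * 24.7 * (4.37e-3 * cfs + 1.0)      # gammatone bandwidth per channel
    # Approximate power response of a 4th-order gammatone filter
    resp = (1.0 + ((freqs[None, :] - cfs[:, None]) / bw[:, None]) ** 2) ** -4
    return np.log(spec @ resp.T + 1e-10)
</pre>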
h1. Language modeling
h2. Domain-specific atom-LM construction
h3. Some potential problems
- The domain definition is unclear.
- Using the same development set (the 8k transcriptions) for all domains is not very appropriate.
h3. Text data filtering
- A telecom-specific word list is ready. Will work with Xiaona to prepare a new version of the lexicon.
- LiuRong completed a comparison of document classification methods (per-domain accuracy; a vsm reproduction sketch follows the table):

| Method | Finance | IT | Health | Sports | Travel | Education | Recruitment | Culture | Military | Overall |
| vsm | 0.92 | 0.906 | 0.921 | 0.983 | 0.954 | 0.916 | 0.953 | 0.996 | 0.9339 | 0.94 |
| lda(50) | 0.84 | 0.39 | 0.79 | 0.85 | 0.60 | 0.368 | 0.61 | 0.31 | 0.86 | 0.62 |
| w2v(50) | 0.69 | 0.77 | 0.67 | 0.59 | 0.70 | 0.62 | 0.74 | 0.79 | 0.88 | 0.73 |
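The vsm row corresponds to the classic TF-IDF vector-space model. A toy reproduction sketch with scikit-learn; the classifier choice, tokenization, and toy data are assumptions, since the wiki does not record LiuRong's exact setup:

<pre>
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy pre-segmented documents; the real experiment used the nine news domains
train_docs = ["股市 基金 上涨", "利率 银行 贷款", "比赛 球队 夺冠", "教练 球员 联赛"]
train_labels = ["Finance", "Finance", "Sports", "Sports"]

vsm = make_pipeline(TfidfVectorizer(token_pattern=r"\S+"),   # split on whitespace
                    LogisticRegression(max_iter=1000))
vsm.fit(train_docs, train_labels)
print(vsm.predict(["赛季 球队 球员"]))   # -> ['Sports']
</pre>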
h1. DNN Decoder
h2. Decoder optimization
- Measured the computation cost of each decoding step:
  - beam 9/5000: netforward accounts for 65% of the time
  - beam 13/7000: netforward accounts for 28%
- This has been verified by Liuchao with the CSLT engine.
- The acceleration code was checked into Git, with a small modification to heap management.
h2. Frame-skipping
- Zhiyong & Liuchao will deliver the frame-skipping approach (the idea is sketched below).
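Frame-skipping trades a little accuracy for speed by running the DNN forward pass only on every k-th frame and reusing those posteriors for the frames in between. A minimal sketch; the copy strategy and skip factor are assumptions, as the exact variant to be delivered was not specified:

<pre>
import numpy as np

def net_forward_skipped(frames, net_forward, skip=2):
    """Evaluate net_forward on every `skip`-th frame; copy posteriors in between."""
    out = [None] * len(frames)
    for i in range(0, len(frames), skip):
        out[i] = net_forward(frames[i])       # real DNN evaluation
        for j in range(i + 1, min(i + skip, len(frames))):
            out[j] = out[i]                   # reuse posteriors for skipped frames
    return np.stack(out)
</pre>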
h2. BigLM optimization
- Investigate BigLM retrieval optimization (one candidate, caching repeated n-gram lookups, is sketched below).
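Decoding against a big LM repeats many identical (history, word) queries, so one standard retrieval optimization is memoizing scores. A hypothetical wrapper sketch; the lm.score(history, word) interface is an assumption, not the Sinovoice decoder's API:

<pre>
from functools import lru_cache

class CachedLM:
    """Wrap an n-gram LM so repeated (history, word) queries hit a cache."""
    def __init__(self, lm, cache_size=1 << 20):
        self._score = lru_cache(maxsize=cache_size)(lm.score)

    def score(self, history, word):
        # history must be hashable, e.g. a tuple of word ids
        return self._score(history, word)
</pre>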