2014-03-07
From cslt Wiki
Resource Building
- Current text resources have been rearranged and listed
AM development
Sparse DNN
- Optimal Brain Damage (OBD).
- GA-based block sparsity
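As a rough illustration of the OBD item above: OBD ranks each weight by a saliency of 0.5 * h_kk * w_k^2, where h_kk is the diagonal Hessian term for that weight, and removes the weights with the lowest saliency. The sketch below assumes the diagonal Hessian has already been estimated; the function name `obd_prune` and the `prune_fraction` parameter are hypothetical and not taken from the actual experiment.

```python
def obd_prune(weights, hessian_diag, prune_fraction):
    """Zero out the fraction of weights with the smallest OBD saliency.

    weights, hessian_diag: flat lists of equal length (assumed inputs).
    prune_fraction: share of weights to remove, between 0 and 1.
    """
    if len(weights) != len(hessian_diag):
        raise ValueError("weights and hessian_diag must have the same length")
    if not 0.0 <= prune_fraction <= 1.0:
        raise ValueError("prune_fraction must be in [0, 1]")

    # OBD saliency: estimated increase in loss if the weight is set to zero.
    saliency = [0.5 * h * w * w for w, h in zip(weights, hessian_diag)]

    n_prune = int(len(weights) * prune_fraction)
    order = sorted(range(len(weights)), key=lambda i: saliency[i])

    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned
```

Note that a large weight can still be pruned when its Hessian term is small, which is what separates this from plain magnitude pruning.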
Efficient DNN training
- Asymmetric window: great improvement on the training set (WER 34% to 24%), but the improvement is lost on the test set. Overfitting?
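A minimal sketch of what an asymmetric context window could look like, assuming the term refers to splicing a different number of past and future frames around each input frame. The function name `splice_frames` and the edge-replication padding are assumptions for illustration, not the configuration used in the experiment.

```python
def splice_frames(frames, left, right):
    """Concatenate each frame with `left` past and `right` future frames.

    frames: list of feature vectors (lists of floats).
    Frames beyond the utterance edges are filled by repeating the
    first or last frame (an assumed padding choice).
    """
    if left < 0 or right < 0:
        raise ValueError("left and right must be non-negative")
    n = len(frames)
    spliced = []
    for t in range(n):
        context = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp to valid range
            context.extend(frames[idx])
        spliced.append(context)
    return spliced
```

With `left` larger than `right`, the network sees more history than look-ahead, which also reduces the latency a decoder needs before it can score a frame.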
Multi GPU training
- Error encountered
GMM - DNN co-training
- Error encountered
Multilanguage training
- Pure Chinese training reached 4.9%
- Chinese + English reduced to 7.9%
- The English phone set should distinguish beginning phones from ending phones
- Should set up a multilingual network structure that shares the low layers but separates languages at the high layers
Noise training
- Train with the WSJ database by corrupting the data with various noise types
- White noise + car noise training partially completed
- Mixture training produces better performance for both car and white noise
- Unknown noise testing is in progress
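The corruption step above can be sketched as mixing a noise signal into clean speech at a chosen signal-to-noise ratio. This is a minimal sketch, assuming both signals are lists of samples at the same sampling rate; the function name `mix_at_snr` and the choice to loop the noise when it is shorter than the speech are assumptions, not the actual data preparation script.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the result has the requested SNR in dB."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")

    # Repeat the noise so that it covers the whole utterance.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]

    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in tiled) / len(tiled)
    if p_noise == 0.0:
        raise ValueError("noise has zero power")

    # Noise power needed for the target SNR: P_speech / 10^(SNR/10).
    target_noise_power = p_speech / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / p_noise)

    return [s + scale * n for s, n in zip(speech, tiled)]
```

For mixture training, the same clean utterance could be passed through this function with different noise types (for example white and car noise) and different SNR values.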
AMR compression re-training
- WeChat uses the AMR compression method, which requires adapting our AM
- Test AMR & non-AMR model
  model \ test-wav   WAV     AMR
  WAV                4.31    26.09
  AMR                13.80   6.77
- Prepare to do adaptation
GFbank
- Finished the first round of gfbank training & test
- The same GMM model (MFCC features) was used to get the alignment
- Train fbank & gfbank based on the MFCC alignment
- Clean training and noise test
  feature   clean   25db   5db
  gf        4.22    5.60   73.03
  fb        5.87, 84.12 (third value missing in the source)
Engine optimization
- Investigating LOUDS FST.
Word to Vector
- Tested a training toolkit from Stanford University, which can incorporate global information into word-vector training
- C++ implementation (instead of Python) for data pre-processing failed. Just use Python.
- Basic wordvector plus global sense
- 1 MB corpus takes 5 mins, vocab size 16698
- 10 MB corpus takes about 82 mins, vocab size 56287
- Improved wordvector with multi sense
- Almost impossible with the toolkit
- Could consider pre-training vectors and then clustering them
- WordVector-based keyword extraction
- Prepared 7 categories, 500+ articles in total
- A problem in keyword identification. Fix it by using the article vector space
- Investigating Senna toolkit from NEC. Intending to implement POS tagging based on word vectors.
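One possible reading of the keyword identification fix "using the article vector space" noted above is to represent an article by the mean of its word vectors and rank the words by cosine similarity to that article vector. The sketch below illustrates that reading only; the function names `cosine` and `extract_keywords` and the toy vectors in the usage are hypothetical, and the actual method may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def extract_keywords(words, vectors, top_k):
    """Rank the words of one article by similarity to the article vector.

    words: tokens of the article, in order, with repetitions.
    vectors: dict mapping a word to its word vector (assumed pre-trained).
    """
    known = [w for w in words if w in vectors]
    if not known:
        return []
    dim = len(vectors[known[0]])

    # Article vector: mean of the word vectors, so frequent words weigh more.
    article = [sum(vectors[w][d] for w in known) / len(known) for d in range(dim)]

    scores = {w: cosine(vectors[w], article) for w in set(known)}
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

Usage with toy two-dimensional vectors:

```python
toy_vectors = {"speech": [1.0, 0.0], "recognition": [0.9, 0.1], "the": [0.0, 1.0]}
print(extract_keywords(["speech", "recognition", "speech", "the"], toy_vectors, 2))
```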
LM development
NN LM
- Character-based NNLM (6700 chars, 7gram), 500M data training done.
- Performance lower than word-based NNLM
- Prepare to run boundary-involved char NNLM
- WordVector-based word and char NNLM training done
- Google word-vector-based NNLM is worse than the randomly initialized NNLM
3T Sogou LM
- Improved training
- 3T LM + Tencent 80k LM: performance worse than the original 80k LM
- Need to check if it is caused by the mismatched vocabulary
- 3T LM + QA LM: use online1 as the EM target, performance worse than QA LM
- Probably due to the incorrect EM target
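For context on why the EM target matters: when two LMs are linearly interpolated, the mixture weights are usually tuned by EM to maximize the likelihood of a target text, so a mismatched target yields weights that suit the wrong domain. A minimal sketch of that weight estimation follows, assuming each LM's per-token probabilities on the target text are already computed; the function name `em_interpolation_weights` is hypothetical and this is not the tool actually used.

```python
def em_interpolation_weights(prob_streams, iterations=50):
    """Estimate linear interpolation weights with EM.

    prob_streams[k][i] is the probability that LM k assigns to the
    i-th token of the target (tuning) text.
    """
    n_models = len(prob_streams)
    if n_models == 0 or not prob_streams[0]:
        raise ValueError("need at least one LM and one token")
    n_tokens = len(prob_streams[0])
    if any(len(stream) != n_tokens for stream in prob_streams):
        raise ValueError("all probability streams must have the same length")

    weights = [1.0 / n_models] * n_models
    for _ in range(iterations):
        counts = [0.0] * n_models
        for i in range(n_tokens):
            mix = sum(weights[k] * prob_streams[k][i] for k in range(n_models))
            if mix == 0.0:
                continue  # token impossible under every LM; skip it
            # E-step: posterior that LM k generated token i.
            for k in range(n_models):
                counts[k] += weights[k] * prob_streams[k][i] / mix
        total = sum(counts)
        if total == 0.0:
            break
        # M-step: new weights are the normalized expected counts.
        weights = [c / total for c in counts]
    return weights
```

If one LM fits the target text better on every token, the weights drift toward that LM, which is the expected behaviour only when the target text resembles the test condition.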
QA Matching
- Working on edit FST for fuzzy matching
- TF/IDF score matching completed
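A minimal sketch of TF/IDF score matching for QA, assuming the task is to find the stored question closest to a tokenized user query. The smoothed IDF formula, the cosine scoring, and the function names are assumptions chosen for illustration; the completed implementation may score differently.

```python
import math
from collections import Counter


def build_idf(documents):
    """Smoothed inverse document frequency for every term in the collection."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector; terms unseen in the collection are dropped."""
    tf = Counter(tokens)
    return {t: tf[t] * idf[t] for t in tf if t in idf}


def sparse_cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query, questions):
    """Return (index, score) of the stored question most similar to the query."""
    if not questions:
        raise ValueError("questions must be non-empty")
    idf = build_idf(questions)
    query_vec = tfidf_vector(query, idf)
    scores = [sparse_cosine(query_vec, tfidf_vector(q, idf)) for q in questions]
    best = max(range(len(questions)), key=lambda i: scores[i])
    return best, scores[best]
```

Exact term matching like this cannot handle misrecognized or misspelled words, which is presumably where the edit FST for fuzzy matching would come in.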
Embedded development
- CLG embedded decoder is almost done. Online compiler is in progress.
- English scoring is under way
Speech QA
- N-best with entity LM was analyzed
- Entity-class LM comparison
- re-segmentation & re-train
- SRILM class-based LM ???
- Subgraph integration from Zhiyong
- WER summary is done
- Prepare to compose a paper