<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="http://cslt.org/mediawiki/skins/common/feed.css?303"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
		<id>http://cslt.org/mediawiki/index.php?action=history&amp;feed=atom&amp;title=2014-02-28</id>
		<title>2014-02-28 - Revision history</title>
		<link rel="self" type="application/atom+xml" href="http://cslt.org/mediawiki/index.php?action=history&amp;feed=atom&amp;title=2014-02-28"/>
		<link rel="alternate" type="text/html" href="http://cslt.org/mediawiki/index.php?title=2014-02-28&amp;action=history"/>
		<updated>2026-04-15T05:13:19Z</updated>
		<subtitle>Revision history for this page on the wiki</subtitle>
		<generator>MediaWiki 1.23.3</generator>

	<entry>
		<id>http://cslt.org/mediawiki/index.php?title=2014-02-28&amp;diff=9235&amp;oldid=prev</id>
		<title>Cslt: Created page with "==Resource Building== * Current text resource has been re-arranged and listed  == AM development ==  === Sparse DNN ===  * Optimal Brain Damage (OBD).   # GA-based block..."</title>
		<link rel="alternate" type="text/html" href="http://cslt.org/mediawiki/index.php?title=2014-02-28&amp;diff=9235&amp;oldid=prev"/>
				<updated>2014-02-28T02:24:21Z</updated>
		
		<summary type="html">&lt;p&gt;Created page with "==Resource Building== * Current text resource has been re-arranged and listed  == AM development ==  === Sparse DNN ===  * Optimal Brain Damage (OBD).   # GA-based block..."&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;==Resource Building==&lt;br /&gt;
* Current text resource has been re-arranged and listed&lt;br /&gt;
&lt;br /&gt;
== AM development ==&lt;br /&gt;
&lt;br /&gt;
=== Sparse DNN ===&lt;br /&gt;
&lt;br /&gt;
* Optimal Brain Damage (OBD).&lt;br /&gt;
&lt;br /&gt;
# GA-based block sparsity&lt;br /&gt;
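As a rough illustration of the OBD item above (not the group's actual code): OBD ranks each weight by the saliency 0.5 * h_ii * w_i^2, where h_ii is the diagonal Hessian of the loss, and removes the least-salient weights. A minimal NumPy sketch with hypothetical names:

```python
import numpy as np

def obd_prune(weights, hessian_diag, prune_frac=0.5):
    """Zero out the least-salient weights, following Optimal Brain Damage.

    The saliency of weight w_i is 0.5 * h_ii * w_i**2, where h_ii is the
    diagonal Hessian of the training loss (LeCun et al., 1990).
    """
    saliency = 0.5 * hessian_diag * weights ** 2
    k = int(prune_frac * weights.size)
    idx = np.argsort(saliency, axis=None)[:k]  # k least-salient positions
    pruned = weights.copy()
    pruned.flat[idx] = 0.0
    return pruned

# With a unit Hessian, saliency reduces to squared weight magnitude,
# so the two smallest-magnitude weights are removed.
w = np.array([0.1, -2.0, 0.5, 0.01])
pruned = obd_prune(w, np.ones_like(w), prune_frac=0.5)
```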
&lt;br /&gt;
=== Efficient DNN training ===&lt;br /&gt;
&lt;br /&gt;
# Asymmetric window: large improvement on the training set (WER 34% to 24%), but the gain is lost on the test set. Overfitting?&lt;br /&gt;
&lt;br /&gt;
===Multi GPU training===&lt;br /&gt;
* Error encountered&lt;br /&gt;
&lt;br /&gt;
===GMM - DNN co-training===&lt;br /&gt;
* Error encountered&lt;br /&gt;
&lt;br /&gt;
=== Multilanguage training===&lt;br /&gt;
&lt;br /&gt;
# Pure Chinese training reached 4.9%&lt;br /&gt;
# Chinese + English reduced to 7.9%&lt;br /&gt;
# The English phone set should discriminate between beginning and ending phones&lt;br /&gt;
# Should set up a multilingual network structure that shares the low layers but separates languages at the high layers&lt;br /&gt;
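The shared-low-layer idea in the last item can be sketched as below; the layer sizes and phone-set sizes are illustrative, not the group's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical sizes: 40-dim acoustic features, two shared hidden layers,
# and a separate output layer per language (phone-set sizes are made up).
W_shared = [rng.standard_normal((40, 128)) * 0.1,
            rng.standard_normal((128, 128)) * 0.1]
W_out = {"zh": rng.standard_normal((128, 100)) * 0.1,
         "en": rng.standard_normal((128, 120)) * 0.1}

def forward(x, lang):
    """Low layers are shared across languages; only the top layer differs."""
    h = x
    for W in W_shared:
        h = relu(h @ W)
    logits = h @ W_out[lang]
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)  # softmax posteriors

x = rng.standard_normal((1, 40))
zh_post = forward(x, "zh")  # shape (1, 100)
en_post = forward(x, "en")  # shape (1, 120)
```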
&lt;br /&gt;
===Noise training===&lt;br /&gt;
&lt;br /&gt;
* Train with the WSJ database by corrupting the data with various noise types&lt;br /&gt;
:* White noise training completed. All results are fine&lt;br /&gt;
:* Car noise training almost finished. Large-variance training in progress&lt;br /&gt;
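The corruption step above amounts to mixing noise into the clean waveform at a chosen signal-to-noise ratio. A minimal sketch; the tone, sampling rate, and SNR value are illustrative:

```python
import numpy as np

def add_noise(clean, noise, snr_db):
    """Scale the noise so the mix has the target SNR in dB, then add it.

    Both inputs are 1-D float waveforms of equal length; the scale makes
    10 * log10(P_clean / P_noise) equal snr_db.
    """
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
sr = 16000
clean = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)  # 1 s, 440 Hz tone
noisy = add_noise(clean, rng.standard_normal(sr), snr_db=10.0)
```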
&lt;br /&gt;
&lt;br /&gt;
===Engine optimization===&lt;br /&gt;
&lt;br /&gt;
* Investigating LOUDS (level-order unary degree sequence) FSTs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Word to Vector==&lt;br /&gt;
&lt;br /&gt;
* Tested a training toolkit from Stanford University that can incorporate global information into word2vec training&lt;br /&gt;
:* A C++ implementation (instead of Python) of the data pre-processing failed. Just use Python.&lt;br /&gt;
&lt;br /&gt;
* Basic word vectors plus global sense&lt;br /&gt;
:* A 1 MB corpus takes 5 minutes; vocab size 16,698&lt;br /&gt;
:* A 10 MB corpus takes about 82 minutes; vocab size 56,287&lt;br /&gt;
&lt;br /&gt;
* Improved word vectors with multiple senses&lt;br /&gt;
:* Almost impossible with the toolkit&lt;br /&gt;
:* Could consider pre-training vectors and then clustering&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Word-vector-based keyword extraction&lt;br /&gt;
:* Word-vector keyword extraction seems more reasonable when the keywords are in the lexicon&lt;br /&gt;
:* For OOV words, word-vector-based extraction is limited by the vocabulary&lt;br /&gt;
:* Need a standard new-word extraction method&lt;br /&gt;
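One common word-vector keyword-extraction scheme consistent with the notes above ranks in-vocabulary words by similarity to the document's centroid vector; OOV words simply drop out, which is exactly the limitation noted. A toy sketch with made-up embeddings:

```python
import numpy as np

def extract_keywords(doc_words, embeddings, topk=2):
    """Rank in-vocabulary words by cosine similarity to the document centroid.

    Words missing from `embeddings` (OOV) are skipped, which is the
    vocabulary limitation of word-vector-based extraction.
    """
    in_vocab = [w for w in doc_words if w in embeddings]
    if not in_vocab:
        return []
    centroid = np.mean([embeddings[w] for w in in_vocab], axis=0)
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(set(in_vocab),
                    key=lambda w: cosine(embeddings[w], centroid),
                    reverse=True)
    return ranked[:topk]

# Toy 2-D embeddings; "oov" has no vector and is ignored.
emb = {"speech": np.array([1.0, 0.0]),
       "recognition": np.array([0.9, 0.1]),
       "banana": np.array([0.0, 1.0])}
keywords = extract_keywords(["speech", "recognition", "banana", "oov"], emb)
```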
&lt;br /&gt;
&lt;br /&gt;
* Investigating the SENNA toolkit from NEC. Intending to implement POS tagging based on word vectors.&lt;br /&gt;
&lt;br /&gt;
==LM development==&lt;br /&gt;
&lt;br /&gt;
===NN LM===&lt;br /&gt;
&lt;br /&gt;
* Character-based NNLM (6,700 chars, 7-gram); training on 500M of data is done.&lt;br /&gt;
:* 3 hours per iteration&lt;br /&gt;
:* For comparison, the word-based NNLM takes 1 hour/iteration with a 1,024-word vocabulary and 4 hours/iteration with 10,240 words&lt;br /&gt;
:* Performance is lower than the word-based NNLM&lt;br /&gt;
&lt;br /&gt;
* WordVector-based word and char NNLM training done&lt;br /&gt;
:* The NNLM initialized with Google word vectors is worse than the randomly initialized NNLM&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===3T Sogou LM===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Improved training&lt;br /&gt;
:* re-segmentation with the Tencent 110k lexicon&lt;br /&gt;
:* re-training with 4 GB text blocks&lt;br /&gt;
:* 1/6 of the merge done. PPL reduced to 466 (vs. the Tencent 8w8 LM at 213.74)&lt;br /&gt;
:* Need to check the OOV problem&lt;br /&gt;
:* Need to finish the final merge.&lt;br /&gt;
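The PPL figures above follow the standard definition: the exponential of the average negative log-probability per test token. A minimal sketch of that definition (the numbers below are illustrative, not the reported measurements):

```python
import math

def perplexity(log_probs):
    """PPL = exp(-(1/N) * sum of log P(w_i | history)) over N test tokens."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that assigns probability 1/466 to every token has PPL 466.
ppl = perplexity([math.log(1.0 / 466.0)] * 10)
print(round(ppl))  # 466
```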
&lt;br /&gt;
&lt;br /&gt;
==Embedded development==&lt;br /&gt;
&lt;br /&gt;
* The CLG embedded decoder is almost done. The online compiler is in progress.&lt;br /&gt;
* Zhiyong is working on layer-by-layer DNN training.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Speech QA==&lt;br /&gt;
&lt;br /&gt;
* N-best with the entity LM was analyzed&lt;br /&gt;
* Entity-class LM comparison&lt;br /&gt;
:* re-segmentation &amp;amp; re-training&lt;br /&gt;
:* SRILM class-based LM ???&lt;br /&gt;
:* Subgraph integration from Zhiyong&lt;/div&gt;</summary>
		<author><name>Cslt</name></author>	</entry>

	</feed>