<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="http://cslt.org/mediawiki/skins/common/feed.css?303"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
		<id>http://cslt.org/mediawiki/index.php?action=history&amp;feed=atom&amp;title=2014-02-21</id>
		<title>2014-02-21 - Revision history</title>
		<link rel="self" type="application/atom+xml" href="http://cslt.org/mediawiki/index.php?action=history&amp;feed=atom&amp;title=2014-02-21"/>
		<link rel="alternate" type="text/html" href="http://cslt.org/mediawiki/index.php?title=2014-02-21&amp;action=history"/>
		<updated>2026-04-14T11:24:09Z</updated>
		<subtitle>Revision history for this page on the wiki</subtitle>
		<generator>MediaWiki 1.23.3</generator>

	<entry>
		<id>http://cslt.org/mediawiki/index.php?title=2014-02-21&amp;diff=9210&amp;oldid=prev</id>
		<title>Cslt: Created page with "==Resource Building== * Current text resource has been re-arranged and listed  == AM development ==  === Sparse DNN ===  * Optimal Brain Damage (OBD).   # GA-based block..."</title>
		<link rel="alternate" type="text/html" href="http://cslt.org/mediawiki/index.php?title=2014-02-21&amp;diff=9210&amp;oldid=prev"/>
				<updated>2014-02-21T02:22:19Z</updated>
		
		<summary type="html">&lt;p&gt;Created page with "==Resource Building== * Current text resource has been re-arranged and listed  == AM development ==  === Sparse DNN ===  * Optimal Brain Damage (OBD).   # GA-based block..."&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;== Resource Building ==&lt;br /&gt;
* The current text resources have been re-arranged and listed&lt;br /&gt;
&lt;br /&gt;
== AM development ==&lt;br /&gt;
&lt;br /&gt;
=== Sparse DNN ===&lt;br /&gt;
&lt;br /&gt;
* Optimal Brain Damage (OBD) pruning.&lt;br /&gt;
&lt;br /&gt;
# Genetic algorithm (GA)-based block sparsity&lt;br /&gt;
&lt;br /&gt;
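A minimal sketch of the OBD pruning step mentioned above, assuming the standard diagonal-Hessian saliency 0.5 * h_kk * w_k**2; the `hessian_diag` input here is a stand-in, since computing it for a real DNN needs a second-order backward pass:

```python
import numpy as np

def obd_prune(weights, hessian_diag, prune_frac=0.5):
    """Zero out the lowest-saliency fraction of weights (OBD-style)."""
    # OBD saliency: s_k = 0.5 * h_kk * w_k^2 (diagonal Hessian approximation).
    saliency = 0.5 * hessian_diag * weights ** 2
    k = int(prune_frac * weights.size)
    prune_idx = np.argsort(saliency.ravel())[:k]  # least-salient weights first
    mask = np.ones(weights.size, dtype=bool)
    mask[prune_idx] = False
    return weights * mask.reshape(weights.shape)
```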
=== Efficient DNN training ===&lt;br /&gt;
&lt;br /&gt;
# Asymmetric window: great improvement on the training set (WER 34% to 24%), but the improvement is lost on the test set. Overfitting?&lt;br /&gt;
&lt;br /&gt;
=== Multilingual training ===&lt;br /&gt;
&lt;br /&gt;
# Pure Chinese training reached 4.9%&lt;br /&gt;
# Adding English data degraded this to 7.9%&lt;br /&gt;
# The English phone set should distinguish word-initial and word-final phones&lt;br /&gt;
# We should set up a multilingual network structure that shares the low layers but separates the languages at the high layers&lt;br /&gt;
&lt;br /&gt;
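The proposed shared-low-layer structure can be sketched as follows; this is an illustrative numpy forward pass with assumed layer sizes, not the actual training setup:

```python
import numpy as np

# Hypothetical multilingual DNN: shared bottom layers learn
# language-independent features; each language gets its own output layer.
rng = np.random.default_rng(0)

def layer(n_in, n_out):
    return rng.standard_normal((n_in, n_out)) * 0.1

shared = [layer(40, 256), layer(256, 256)]             # shared low layers
heads = {"zh": layer(256, 100), "en": layer(256, 60)}  # per-language high layers

def forward(x, lang):
    h = x
    for w in shared:
        h = np.maximum(h @ w, 0.0)         # ReLU hidden layers
    logits = h @ heads[lang]
    e = np.exp(logits - logits.max())      # softmax over that language's states
    return e / e.sum()

x = rng.standard_normal(40)                # one 40-dim acoustic feature frame
p_zh = forward(x, "zh")
```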
=== Noise training ===&lt;br /&gt;
&lt;br /&gt;
* Train on the WSJ database, corrupting the data with various noise types&lt;br /&gt;
:* baseline system ready&lt;br /&gt;
:* noise data ready; selected 5 noise types that occur in real environments&lt;br /&gt;
:* Liuchao's noise-adding toolkit ready&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
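The noise-corruption step can be illustrated as additive mixing at a target SNR; this is a hedged sketch, not Liuchao's actual toolkit:

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix `noise` into `speech` so the result has the requested SNR (in dB)."""
    # Tile/crop the noise to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that p_speech / p_scaled_noise equals 10^(snr/10).
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```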
===Engine optimization===&lt;br /&gt;
&lt;br /&gt;
* Investigating LOUDS (level-order unary degree sequence) FSTs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Adaptation ===&lt;br /&gt;
&lt;br /&gt;
* Tested adaptation performance with the number of adaptation utterances varied from 10 to 40.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Word to Vector==&lt;br /&gt;
&lt;br /&gt;
* Testing a training toolkit from Stanford University that can incorporate global information into word-vector training&lt;br /&gt;
:* C++ implementation (instead of Python) for data pre-processing; problems encountered&lt;br /&gt;
&lt;br /&gt;
* Basic word vectors plus global sense&lt;br /&gt;
:* Training on 100M of data (with global sense) caused memory overflow&lt;br /&gt;
:* Split the data into small pieces&lt;br /&gt;
&lt;br /&gt;
* Improved word vectors with multiple senses&lt;br /&gt;
:* Preparing scripts&lt;br /&gt;
&lt;br /&gt;
* Keyword extraction based on word vectors&lt;br /&gt;
:* using Google word vectors&lt;br /&gt;
:* using k-means clustering&lt;br /&gt;
&lt;br /&gt;
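A toy sketch of the clustering step above: run k-means over word vectors and treat the words nearest each centroid as keyword candidates. The numpy implementation below is illustrative; real experiments would use the Google word2vec vectors mentioned above:

```python
import numpy as np

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means: returns (labels, centroids) for the given vectors."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest centroid (Euclidean distance).
        d = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned vectors.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = vectors[labels == j].mean(axis=0)
    return labels, centroids
```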
* Investigating the SENNA toolkit from NEC; planning to implement POS tagging based on word vectors.&lt;br /&gt;
&lt;br /&gt;
==LM development==&lt;br /&gt;
&lt;br /&gt;
===NN LM===&lt;br /&gt;
&lt;br /&gt;
* Character-based NNLM (6,700 characters, 7-gram), training on 500M of data done.&lt;br /&gt;
:* 3 hours per iteration&lt;br /&gt;
:* For the word-based NNLM: 1 hour/iteration with a 1,024-word vocabulary, 4 hours/iteration with a 10,240-word vocabulary&lt;br /&gt;
:* Performance is lower than the word-based NNLM&lt;br /&gt;
&lt;br /&gt;
* Word-vector-based word and character NNLM training done&lt;br /&gt;
:* The Google word-vector-initialized NNLM is worse than the randomly initialized NNLM&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===3T Sogou LM===&lt;br /&gt;
&lt;br /&gt;
* Naive training&lt;br /&gt;
:* all words in the lexicon&lt;br /&gt;
:* split into 9G text blocks&lt;br /&gt;
:* merge sub-models one by one&lt;br /&gt;
:* prune to a 110k lexicon&lt;br /&gt;
:* test on QA&lt;br /&gt;
:* performance degraded compared to Liurong's previous LM&lt;br /&gt;
&lt;br /&gt;
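The one-by-one merge step amounts to interpolating sub-models trained on separate text blocks. A toy illustration, assuming simple linear interpolation of probabilities (real merging would use SRILM's mixture facilities):

```python
def interpolate(lm_a, lm_b, lam=0.5):
    """Merge two probability tables: p(w) = lam*a(w) + (1-lam)*b(w)."""
    vocab = set(lm_a) | set(lm_b)
    return {w: lam * lm_a.get(w, 0.0) + (1 - lam) * lm_b.get(w, 0.0)
            for w in vocab}
```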
* Improved training&lt;br /&gt;
:* re-segmentation with the Tencent 110k lexicon&lt;br /&gt;
:* re-training with 4G text blocks&lt;br /&gt;
:* sub-model training done; ready for merging based on the Tencent online1 test set.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Embedded development==&lt;br /&gt;
&lt;br /&gt;
* The CLG embedded decoder is almost done. The online compiler is in progress.&lt;br /&gt;
* Zhiyong is working on layer-by-layer DNN training.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Speech QA==&lt;br /&gt;
&lt;br /&gt;
* Current N-best results&lt;br /&gt;
:* N-best search plus pinyin correction&lt;br /&gt;
:* 2,718 QA requests in total&lt;br /&gt;
:* default setup: 1,844 answered correctly&lt;br /&gt;
:* no-entity setup: 1,650 answered correctly&lt;br /&gt;
:* with-entity setup: 1,884 answered correctly&lt;br /&gt;
&lt;br /&gt;
* Analyzed error patterns for N-best matching&lt;br /&gt;
&lt;br /&gt;
:* 10.8% song transcription errors&lt;br /&gt;
:* 18.3% English errors&lt;br /&gt;
:* 38.7% entities (song names, singer names) lost in recognition&lt;br /&gt;
:* 32.3% non-entity recognition errors&lt;br /&gt;
&lt;br /&gt;
* Computing complexity&lt;br /&gt;
:* 11,000 entities have 23,000 different pronunciations&lt;br /&gt;
:* Using a tree structure to improve efficiency&lt;br /&gt;
&lt;br /&gt;
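The tree idea above can be sketched as a pronunciation prefix tree: sharing common phone-sequence prefixes cuts the number of comparisons when matching hypotheses against the roughly 23,000 entity pronunciations. A hedged, dict-based sketch (names are illustrative):

```python
def build_trie(pron_dict):
    """pron_dict maps entity name to a list of space-separated phone strings."""
    root = {}
    for entity, prons in pron_dict.items():
        for pron in prons:
            node = root
            for phone in pron.split():
                node = node.setdefault(phone, {})
            # "#entities" marks a complete pronunciation at this node
            # (assumes no phone symbol is literally named "#entities").
            node.setdefault("#entities", []).append(entity)
    return root

def lookup(trie, phones):
    """Return the entities whose pronunciation exactly matches `phones`."""
    node = trie
    for phone in phones.split():
        if phone not in node:
            return []
        node = node[phone]
    return node.get("#entities", [])
```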
* Entity-class LM comparison&lt;br /&gt;
:* re-segmentation and re-training&lt;br /&gt;
:* SRILM class-based LM&lt;br /&gt;
:* Subgraph integration from Zhiyong&lt;/div&gt;</summary>
		<author><name>Cslt</name></author>	</entry>

	</feed>