<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="http://cslt.org/mediawiki/skins/common/feed.css?303"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
		<id>http://cslt.org/mediawiki/index.php?action=history&amp;feed=atom&amp;title=2014-02-21</id>
		<title>2014-02-21 - Revision history</title>
		<link rel="self" type="application/atom+xml" href="http://cslt.org/mediawiki/index.php?action=history&amp;feed=atom&amp;title=2014-02-21"/>
		<link rel="alternate" type="text/html" href="http://cslt.org/mediawiki/index.php?title=2014-02-21&amp;action=history"/>
		<updated>2026-04-14T11:24:09Z</updated>
		<subtitle>Revision history for this page on the wiki</subtitle>
		<generator>MediaWiki 1.23.3</generator>

	<entry>
		<id>http://cslt.org/mediawiki/index.php?title=2014-02-21&amp;diff=9210&amp;oldid=prev</id>
		<title>Cslt: Created page with "==Resource Building== * Current text resource has been re-arranged and listed  == AM development ==  === Sparse DNN ===  * Optimal Brain Damage (OBD).   # GA-based block..."</title>
		<link rel="alternate" type="text/html" href="http://cslt.org/mediawiki/index.php?title=2014-02-21&amp;diff=9210&amp;oldid=prev"/>
				<updated>2014-02-21T02:22:19Z</updated>
		
		<summary type="html">&lt;p&gt;Created page with "==Resource Building== * Current text resource has been re-arranged and listed  == AM development ==  === Sparse DNN ===  * Optimal Brain Damage (OBD).   # GA-based block..."&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;== Resource Building ==&lt;br /&gt;
* The current text resources have been re-arranged and listed&lt;br /&gt;
&lt;br /&gt;
== AM development ==&lt;br /&gt;
&lt;br /&gt;
=== Sparse DNN ===&lt;br /&gt;
&lt;br /&gt;
* Optimal Brain Damage (OBD) pruning.&lt;br /&gt;
&lt;br /&gt;
# Genetic algorithm (GA)-based block sparsity&lt;br /&gt;
&lt;br /&gt;
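A minimal sketch of the OBD pruning step mentioned above, assuming the standard diagonal-Hessian saliency 0.5 * h_kk * w_k**2; the `hessian_diag` input here is a stand-in, since computing it for a real DNN needs a second-order backward pass:

```python
import numpy as np

def obd_prune(weights, hessian_diag, prune_frac=0.5):
    """Zero out the lowest-saliency fraction of weights (OBD-style)."""
    # OBD saliency: s_k = 0.5 * h_kk * w_k^2 (diagonal Hessian approximation).
    saliency = 0.5 * hessian_diag * weights ** 2
    k = int(prune_frac * weights.size)
    prune_idx = np.argsort(saliency.ravel())[:k]  # least-salient weights first
    mask = np.ones(weights.size, dtype=bool)
    mask[prune_idx] = False
    return weights * mask.reshape(weights.shape)
```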
=== Efficient DNN training ===&lt;br /&gt;
&lt;br /&gt;
# Asymmetric window: great improvement on the training set (WER 34% to 24%), but the improvement is lost on the test set. Overfitting?&lt;br /&gt;
&lt;br /&gt;
=== Multilingual training ===&lt;br /&gt;
&lt;br /&gt;
# Pure Chinese training reached 4.9%&lt;br /&gt;
# Adding English data degraded this to 7.9%&lt;br /&gt;
# The English phone set should distinguish word-initial and word-final phones&lt;br /&gt;
# We should set up a multilingual network structure that shares the low layers but separates the languages at the high layers&lt;br /&gt;
&lt;br /&gt;
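The proposed shared-low-layer structure can be sketched as follows; this is an illustrative numpy forward pass with assumed layer sizes, not the actual training setup:

```python
import numpy as np

# Hypothetical multilingual DNN: shared bottom layers learn
# language-independent features; each language gets its own output layer.
rng = np.random.default_rng(0)

def layer(n_in, n_out):
    return rng.standard_normal((n_in, n_out)) * 0.1

shared = [layer(40, 256), layer(256, 256)]             # shared low layers
heads = {"zh": layer(256, 100), "en": layer(256, 60)}  # per-language high layers

def forward(x, lang):
    h = x
    for w in shared:
        h = np.maximum(h @ w, 0.0)         # ReLU hidden layers
    logits = h @ heads[lang]
    e = np.exp(logits - logits.max())      # softmax over that language's states
    return e / e.sum()

x = rng.standard_normal(40)                # one 40-dim acoustic feature frame
p_zh = forward(x, "zh")
```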
=== Noise training ===&lt;br /&gt;
&lt;br /&gt;
* Train on the WSJ database, corrupting the data with various noise types&lt;br /&gt;
:* baseline system ready&lt;br /&gt;
:* noise data ready; selected 5 noise types that occur in real environments&lt;br /&gt;
:* Liuchao's noise-adding toolkit ready&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
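The noise-corruption step can be illustrated as additive mixing at a target SNR; this is a hedged sketch, not Liuchao's actual toolkit:

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix `noise` into `speech` so the result has the requested SNR (in dB)."""
    # Tile/crop the noise to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that p_speech / p_scaled_noise equals 10^(snr/10).
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```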
===Engine optimization===&lt;br /&gt;
&lt;br /&gt;
* Investigating LOUDS (level-order unary degree sequence) FSTs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Adaptation ===&lt;br /&gt;
&lt;br /&gt;
* Tested adaptation performance with the number of adaptation utterances varied from 10 to 40.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Word to Vector==&lt;br /&gt;
&lt;br /&gt;
* Testing a training toolkit from Stanford University that can incorporate global information into word-vector training&lt;br /&gt;
:* C++ implementation (instead of Python) for data pre-processing; problems encountered&lt;br /&gt;
&lt;br /&gt;
* Basic word vectors plus global sense&lt;br /&gt;
:* Training on 100M of data (with global sense) caused memory overflow&lt;br /&gt;
:* Split the data into small pieces&lt;br /&gt;
&lt;br /&gt;
* Improved word vectors with multiple senses&lt;br /&gt;
:* Preparing scripts&lt;br /&gt;
&lt;br /&gt;
* Keyword extraction based on word vectors&lt;br /&gt;
:* using Google word vectors&lt;br /&gt;
:* using k-means clustering&lt;br /&gt;
&lt;br /&gt;
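A toy sketch of the clustering step above: run k-means over word vectors and treat the words nearest each centroid as keyword candidates. The numpy implementation below is illustrative; real experiments would use the Google word2vec vectors mentioned above:

```python
import numpy as np

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means: returns (labels, centroids) for the given vectors."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest centroid (Euclidean distance).
        d = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned vectors.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = vectors[labels == j].mean(axis=0)
    return labels, centroids
```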
* Investigating the SENNA toolkit from NEC; planning to implement POS tagging based on word vectors.&lt;br /&gt;
&lt;br /&gt;
==LM development==&lt;br /&gt;
&lt;br /&gt;
===NN LM===&lt;br /&gt;
&lt;br /&gt;
* Character-based NNLM (6,700 characters, 7-gram), training on 500M of data done.&lt;br /&gt;
:* 3 hours per iteration&lt;br /&gt;
:* For the word-based NNLM: 1 hour/iteration with a 1,024-word vocabulary, 4 hours/iteration with a 10,240-word vocabulary&lt;br /&gt;
:* Performance is lower than the word-based NNLM&lt;br /&gt;
&lt;br /&gt;
* Word-vector-based word and character NNLM training done&lt;br /&gt;
:* The Google word-vector-initialized NNLM is worse than the randomly initialized NNLM&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===3T Sogou LM===&lt;br /&gt;
&lt;br /&gt;
* Naive training&lt;br /&gt;
:* all words in the lexicon&lt;br /&gt;
:* split into 9G text blocks&lt;br /&gt;
:* merge sub-models one by one&lt;br /&gt;
:* prune to a 110k lexicon&lt;br /&gt;
:* test on QA&lt;br /&gt;
:* performance degraded compared to Liurong's previous LM&lt;br /&gt;
&lt;br /&gt;
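The one-by-one merge step amounts to interpolating sub-models trained on separate text blocks. A toy illustration, assuming simple linear interpolation of probabilities (real merging would use SRILM's mixture facilities):

```python
def interpolate(lm_a, lm_b, lam=0.5):
    """Merge two probability tables: p(w) = lam*a(w) + (1-lam)*b(w)."""
    vocab = set(lm_a) | set(lm_b)
    return {w: lam * lm_a.get(w, 0.0) + (1 - lam) * lm_b.get(w, 0.0)
            for w in vocab}
```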
* Improved training&lt;br /&gt;
:* re-segmentation with the Tencent 110k lexicon&lt;br /&gt;
:* re-training with 4G text blocks&lt;br /&gt;
:* sub-model training done; ready for merging based on the Tencent online1 test set.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Embedded development==&lt;br /&gt;
&lt;br /&gt;
* The CLG embedded decoder is almost done. The online compiler is in progress.&lt;br /&gt;
* Zhiyong is working on layer-by-layer DNN training.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Speech QA==&lt;br /&gt;
&lt;br /&gt;
* Current N-best results&lt;br /&gt;
:* N-best search plus pinyin correction&lt;br /&gt;
:* 2,718 QA requests in total&lt;br /&gt;
:* default setup: 1,844 answered correctly&lt;br /&gt;
:* no-entity setup: 1,650 answered correctly&lt;br /&gt;
:* with-entity setup: 1,884 answered correctly&lt;br /&gt;
&lt;br /&gt;
* Analyzed error patterns for N-best matching&lt;br /&gt;
&lt;br /&gt;
:* 10.8% song transcription errors&lt;br /&gt;
:* 18.3% English errors&lt;br /&gt;
:* 38.7% entities (song names, singer names) lost in recognition&lt;br /&gt;
:* 32.3% non-entity recognition errors&lt;br /&gt;
&lt;br /&gt;
* Computing complexity&lt;br /&gt;
:* 11,000 entities have 23,000 different pronunciations&lt;br /&gt;
:* Using a tree structure to improve efficiency&lt;br /&gt;
&lt;br /&gt;
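The tree idea above can be sketched as a pronunciation prefix tree: sharing common phone-sequence prefixes cuts the number of comparisons when matching hypotheses against the roughly 23,000 entity pronunciations. A hedged, dict-based sketch (names are illustrative):

```python
def build_trie(pron_dict):
    """pron_dict maps entity name to a list of space-separated phone strings."""
    root = {}
    for entity, prons in pron_dict.items():
        for pron in prons:
            node = root
            for phone in pron.split():
                node = node.setdefault(phone, {})
            # "#entities" marks a complete pronunciation at this node
            # (assumes no phone symbol is literally named "#entities").
            node.setdefault("#entities", []).append(entity)
    return root

def lookup(trie, phones):
    """Return the entities whose pronunciation exactly matches `phones`."""
    node = trie
    for phone in phones.split():
        if phone not in node:
            return []
        node = node[phone]
    return node.get("#entities", [])
```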
* Entity-class LM comparison&lt;br /&gt;
:* re-segmentation and re-training&lt;br /&gt;
:* SRILM class-based LM&lt;br /&gt;
:* Subgraph integration from Zhiyong&lt;/div&gt;</summary>
		<author><name>Cslt</name></author>	</entry>

	</feed>