2014-11-10
From cslt Wiki
Text Processing
LM development
Domain specific LM
- domain lm
- weibo lm with pruning 0 10 10 20 20: testing done. weibo lm with pruning 0 10 8 8 8: under testing. weibo lm without pruning: 4/8 done.
- merge the weibo, baiduhi and baiduzhidao lm and test (this week)
- confirm the size of alpa with xiaomin for business application (like e-13).
- get the general test data from miaomin. This test set may be obtained online.
- new dict.
- Tested the earlier vocabulary on 6000.txt with ppl:
  test set      old150K  new166K  new150K
  baiduzhidao   394      369      333
  baiduhi       217      190      188
- Built new 100K, 150K and 200K vocabularies.
- Fixed some bugs in the sogou dict spider.
- new toolkit: find a method to update the new dict. It can get a new word list from sogou and word information from baidu (two weeks).
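The ppl figures above compare vocabularies by test-set perplexity. As a reminder of what that number measures, here is a minimal sketch, assuming the LM reports one log10 probability per token (the function name and the example values are hypothetical, not taken from the actual test on 6000.txt):

```python
def perplexity(log10_probs):
    """Perplexity from per-token log10 probabilities (assumed input format)."""
    if not log10_probs:
        raise ValueError("need at least one token probability")
    # Average log10 probability per token, then invert the log.
    avg = sum(log10_probs) / len(log10_probs)
    return 10 ** (-avg)


# Hypothetical per-token log10 probabilities for a short test sentence.
example = [-1.0, -2.0, -3.0]
print(perplexity(example))
```

A lower value means the model assigns higher probability to the test text, which is why the drop from old150K to new150K reads as an improvement.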
tag LM
- set new test
- result
RNN LM
- rnn
- RNNLM => ALPA: make a report
- test RNNLM on Chinese data from jietong-data
- check the rnnlm code.
- lstm+rnn
- check the lstm-rnnlm code
Word2Vector
W2V based doc classification
- Initial results with variational Bayesian GMM obtained. Performance is not as good as the conventional GMM.
- Non-linear inter-language transform: English-Spanish-Czech: wv model training done, transform model under investigation
- SSA-based local linear mapping still running.
- k-means classes changed to 2.
- Knowledge vector started
- format the data
- yuanbin will continue this work with the help of xingchao.
- Character to word conversion
- prepare the task: word similarity
- prepare the dict.
- Google word vector train
- some ideas will be discussed in the weekly report.
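For the word similarity task being prepared above, a pair of word vectors is commonly scored with cosine similarity. The following is only a minimal sketch under that assumption; the function name and the toy vectors are hypothetical and are not the vectors trained in this work:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 2-dimensional vectors, for illustration only.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Parallel vectors score close to 1 and orthogonal vectors score 0, so the measure ignores vector length and compares direction only.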
Translation
- v4.0 demo released
- cut the dict and use the new segmentation tool
QA
- lucene optimization
- rewrite the method to select the 50 standard questions that do not share the same template (this week).
- test boosting the keyword weight and extracting synonyms (this week).
- check the word segmentation for templates (this week).
- the min-segment method improves the accuracy (0.61 -> 0.66).
- check the query method for getting lucene information and rewrite the scoring method, e.g. the idf value.
- test
- test the different idf values from baidu and sogou in fuzzymatch (this week).
- need to check the remaining 10% of errors (this week).
- spell check
- simple demo done.
- new intern will install SEMPRE
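The QA items above mention rewriting the scoring method around the idf value. As a point of reference, here is a minimal sketch of one textbook idf variant, log(N / df), computed over tokenized documents. It is an assumption for illustration only: it is not Lucene's own formula, and the function name and toy corpus are hypothetical.

```python
import math
from collections import Counter


def idf_table(docs):
    """Map each term to log(N / df), where df is the number of docs containing it."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        # Count each term once per document, however often it repeats.
        df.update(set(tokens))
    return {term: math.log(n / count) for term, count in df.items()}


# Toy corpus of already-segmented documents, for illustration only.
corpus = [["a", "b"], ["a", "c"], ["a"]]
table = idf_table(corpus)
print(sorted(table.items()))
```

A term that occurs in every document gets an idf of 0, so it contributes nothing to a match score, while rarer terms are weighted more heavily.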