2013-10-11
Data sharing
- LM count files still undelivered!
DNN progress
Sparse DNN
- Optimal Brain Damage (OBD): code is ready and awaiting testing.
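Since OBD ranks weights by a second-order saliency before pruning, a minimal sketch of that step is given below. It is illustrative only: the function name, the diagonal-Hessian input, and the pruning ratio are assumptions, not the code referred to above.

```python
import numpy as np

def obd_prune(W, H_diag, prune_ratio=0.5):
    """Zero out the lowest-saliency weights of one layer (OBD sketch).

    W           : weight matrix of the layer
    H_diag      : diagonal of the Hessian of the loss w.r.t. W (same shape as W);
                  how it is estimated (e.g. a Gauss-Newton approximation) is assumed here
    prune_ratio : fraction of weights to remove (illustrative value)
    """
    # OBD saliency of each weight: s_i = 0.5 * h_ii * w_i^2
    saliency = 0.5 * H_diag * W ** 2
    # threshold at the requested sparsity level
    k = int(prune_ratio * W.size)
    thresh = np.partition(saliency.ravel(), k)[k]
    mask = saliency >= thresh
    return W * mask, mask   # pruned weights and the sparsity mask
```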
Tencent exps
N/A
Noisy training
- Dirichlet-based random noise corruption is done. Performance shows significant improvement on the noisy test sets.
- The impact on clean speech varies; some test cases (e.g., online1 and rec1900) even obtain better performance than normal training.
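How the Dirichlet enters the corruption is not detailed above; the sketch below shows one plausible reading, where per-utterance mixing weights over several noise types are drawn from a Dirichlet and the resulting noise mixture is added at a randomly sampled SNR. The function, the noise inputs, the SNR range, and the concentration value are all illustrative assumptions, not the actual recipe.

```python
import numpy as np

def corrupt_utterance(clean, noises, alpha=1.0, snr_db_range=(5, 20), rng=None):
    """Corrupt one clean waveform with a Dirichlet-weighted noise mixture (sketch).

    clean        : 1-D array of clean speech samples
    noises       : list of 1-D noise arrays, each at least as long as `clean`
    alpha        : Dirichlet concentration (illustrative value)
    snr_db_range : SNR range in dB to sample from (illustrative values)
    """
    rng = rng or np.random.default_rng()
    # mixing weights over noise types, drawn from a symmetric Dirichlet
    w = rng.dirichlet(alpha * np.ones(len(noises)))
    noise = sum(wi * n[:len(clean)] for wi, n in zip(w, noises))
    # scale the noise mixture to the sampled SNR
    snr_db = rng.uniform(*snr_db_range)
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    noise *= np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + noise
```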
Continuous LM
1. SogouT 3T data clean-up keeps on running. Initial results with 7G of training text, in terms of PPL:
- SogouQ test: 292
- Tencent online1: 578
- Tencent online2: 475
This indicates that the SogouQ text differs significantly from the Tencent online1 and online2 sets, due to the domain mismatch.
2. NN LM
Split the most frequent 10k words into 10 sub-sets of 1024 words each and model them with 10 networks (see the combination sketch after this list).
- Training data: QA 500M text
- Test data: Tencent online2
- Dev data: Tencent online1
short_list    cslm_ppl  cslm_sum  n-gram_sum  all_ppl  coverage
0-1023        12.12     39.7%     60.30%      122.54   58.86%
1024-2047     1.75      6.56%     93.44%      118.92   11.35%
2048-3071     1.41      3.75%     96.25%      117.16   6.41%
3072-4095     1.23      2.17%     97.83%      116.24   4.27%
4096-5119     1.26      2.24%     97.76%      116.13   3.10%
5120-6143     1.18      1.69%     98.31%      116.82   2.38%
6144-7167     1.15      1.22%     98.78%      117.19   1.85%
7168-8191     1.13      1.13%     98.87%      117.34   1.50%
8192-9217     1.07      0.58%     99.42%      116.06   1.23%
9218-10241    1.06      0.44%     99.56%      115.86   1.03%
n-gram baseline (n-gram predicts 100% of the words): PPL 402
Notes:
- coverage: proportion of the short-list word frequency in the training data
- cslm_sum: percentage of test words predicted by the CSLM
- n-gram_sum: percentage of test words predicted by the n-gram
- cslm_ppl: PPL of the short-list words as computed by the CSLM
3. Converting the CSLM to an n-gram model failed (with threshold=1e-5), due to the very large number of n-grams expanded from the network. The expansion approach is therefore not suitable, which is reasonable since the network is a highly compact representation.
4. Keep on lattice re-scoring with multiple CSLM networks.
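As a sketch of how the combined numbers in the table above can be computed, the code below scores each test word with the CSLM when it falls in the current 1024-word short-list and with the back-off n-gram otherwise, then derives the overall PPL and the cslm/n-gram shares. The scoring callbacks cslm_logprob and ngram_logprob are hypothetical placeholders, not real APIs.

```python
import math

def combined_ppl(test_words, histories, short_list, cslm_logprob, ngram_logprob):
    """Score a test set with a short-list CSLM plus an n-gram fallback (sketch).

    test_words    : list of target words
    histories     : list of word histories aligned with test_words
    short_list    : set of words covered by the CSLM sub-network
    cslm_logprob  : hypothetical fn(history, word) -> natural-log prob from the CSLM
    ngram_logprob : hypothetical fn(history, word) -> natural-log prob from the n-gram
    """
    total_logprob, n_cslm = 0.0, 0
    for hist, word in zip(histories, test_words):
        if word in short_list:
            total_logprob += cslm_logprob(hist, word)   # counted in cslm_sum
            n_cslm += 1
        else:
            total_logprob += ngram_logprob(hist, word)  # counted in n-gram_sum
    n = len(test_words)
    all_ppl = math.exp(-total_logprob / n)
    cslm_share = 100.0 * n_cslm / n                     # analogue of cslm_sum
    return all_ppl, cslm_share, 100.0 - cslm_share
```

Each row of the table would then correspond to plugging in a different 1024-word short-list and its sub-network.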