Difference between revisions of “2014-06-27”
From cslt Wiki
Latest revision as of 05:53, 27 June 2014 (Friday)
Resource Building
Leftover questions
- Asymmetric window: large improvement on the training set (WER 34% to 24%), but the improvement is lost on the test set.
- Multi GPU training: Error encountered
- Multilanguage training
- Investigating LOUDS FST.
- CLG embedded decoder plus online compiler.
- DNN-GMM co-training
AM development
Sparse DNN
- GA-based block sparsity (++++++++)
Noise training
- Paper writing ongoing
GFbank
- Now running the Sinovoice 8k 1400 + 100 mixture training.
- FBank/GFbank, stream/non-stream MPE completed:
                         Huawei disanpi   BJ mobile   8k English data
FBank non-stream (MPE4)  20.44%           22.28%      24.36%
FBank stream (MPE4)      19.46%           22.00%      21.19%
GFbank stream (MPE4)     20.69%           22.84%      24.45%
GFbank non-stream (MPE)  -                -           -
Multilingual ASR
                         HW 27h (HW TR LM not involved)   HW 27h (HW TR LM involved)
FBank stream (monolang)  21.64                            20.72
FBank non-stream (MPE4)  22.23                            21.38
FBank stream (MPE4)      21.99                            -
Denoising & Farfield ASR
- Correlation-based alignment is done. This is necessary because the recording device may introduce an artificial delay.
- How about the output CMVN test?
- deliver the recording to /nfs/disk/perm/data/corpora/reverberant
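The correlation-based alignment noted above can be sketched as follows. This is a minimal illustration using NumPy, not the script actually used for the recordings: the function names `estimate_delay` and `align` are hypothetical, and it assumes the delay is a constant sample offset between the source signal and the re-recorded signal.

```python
import numpy as np

def estimate_delay(reference, recorded):
    """Return the lag (in samples) by which `recorded` trails `reference`."""
    ref = np.asarray(reference, dtype=float)
    rec = np.asarray(recorded, dtype=float)
    ref = ref - ref.mean()
    rec = rec - rec.mean()
    # Full cross-correlation; output index i corresponds to lag i - (len(ref) - 1).
    corr = np.correlate(rec, ref, mode="full")
    return int(np.argmax(corr)) - (len(ref) - 1)

def align(reference, recorded):
    """Trim both signals so that they start at the same point in time."""
    lag = estimate_delay(reference, recorded)
    if lag > 0:
        recorded = recorded[lag:]      # drop the artificial delay
    elif lag < 0:
        reference = reference[-lag:]   # recording started early
    n = min(len(reference), len(recorded))
    return reference[:n], recorded[:n], lag
```

For long recordings an FFT-based correlation would be faster; the direct form is kept here for clarity.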
Original model:
xEnt model:
           middle-field   far-field
dev93      74.79          96.68
eval92     63.42          94.75
MPE model:
MPE adaptation:
           middle-field   far-field
dev93      63.71          94.84
eval92     52.67          90.45
VAD
- DNN-based VAD (7.49) shows much better performance than energy-based VAD (45.74)
- 100 X n (n<=3) hidden units with 2 output units seem sufficient for VAD
- report form
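For reference, an energy-based VAD of the kind used as the baseline above can be sketched as below. The frame length, hop size and threshold are illustrative assumptions (25 ms / 10 ms frames at 16 kHz, fixed dB threshold), not the settings behind the reported numbers, and `energy_vad` is a hypothetical name.

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold_db=-35.0):
    """Label each frame as speech (True) or non-speech (False) by log energy."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) == 0:
        return np.zeros(0, dtype=bool)
    n_frames = 1 + max(0, len(signal) - frame_len) // hop
    decisions = np.zeros(n_frames, dtype=bool)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        energy = np.mean(frame ** 2) + 1e-12  # small floor avoids log(0)
        decisions[i] = 10.0 * np.log10(energy) > threshold_db
    return decisions
```

A fixed threshold like this is sensitive to recording level and background noise, which is one plausible reason a small DNN classifier does better.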
Scoring
- refine the model with AMIDA database. Local minimum observed.
- i-vector-based speaker detection seems fine, reaching 96% with 100 speakers
Embedded decoder
AM: 600x4+800 xent9 model:
pruning threshold: 1e-5, Nobiglm
------------------------------------------------------------
           | 150k  | 80k   | 40k   | 20k   | 10k   | 5k    |
------------------------------------------------------------
wer        | 26.60 | 27.16 | 28.11 | 29.14 | 31.02 | 33.37 |
------------------------------------------------------------
RT         | 0.68  | 0.66  | 0.61  | 0.61  | 0.58  | 0.56  |
------------------------------------------------------------
graph size | 21M   | 14M   | 9.1M  | 6.9M  | 5.5M  | 4.1M  |
------------------------------------------------------------
YINSHI:2014-Jun-24,Wednesday,10:7:0
pruning threshold: 1e-6, Nobiglm
------------------------------------------------------------
           | 150k  | 80k   | 40k   | 20k   | 10k   | 5k    |
------------------------------------------------------------
wer        | 22.49 | 23.05 | 24.15 | 25.51 | 27.71 | 30.71 |
------------------------------------------------------------
RT         | 0.89  | 0.84  | 0.76  | 0.70  | 0.68  | 0.64  |
------------------------------------------------------------
graph size | 98M   | 86M   | 67M   | 49M   | 34M   | 24M   |
------------------------------------------------------------
YINSHI:2014-Jun-27,Saturday,0:52:35
pruning threshold: 1e-6.5, biglm
------------------------------------------------------------
           | 150k  | 80k   | 40k   | 20k   | 10k   | 5k    |
------------------------------------------------------------
wer        | 21.12 | 21.75 | 22.92 | 24.39 | 26.89 | 30.01 |
------------------------------------------------------------
RT         | 1.45  | 1.25  | 1.16  | 1.11  | 1.02  | 0.94  |
------------------------------------------------------------
graph size | 38M   | 35M   | 30M   | 25M   | 20M   | 15M   |
------------------------------------------------------------
YINSHI:2014-Jun-27,Saturday,0:58:27
pruning threshold: 1e-5.5, Nobiglm
------------------------------------------------------------
           | 150k  | 80k   | 40k   | 20k   | 10k   | 5k    |
------------------------------------------------------------
wer        | 24.46 | 25.05 | 26.05 | 27.11 | 29.36 | 32.01 |
------------------------------------------------------------
RT         | 0.71  | 0.69  | 0.66  | 0.63  | 0.60  | 0.58  |
------------------------------------------------------------
graph size | 39M   | 32M   | 25M   | 19M   | 14M   | 9.2M  |
------------------------------------------------------------
LM development
Domain specific LM
- Baidu Zhidao + Weibo extraction done with various thresholds
- The extracted text appears to help to some extent, but the major change seems to come from pre-processing.
- Check the proportion of tags in the HW 30h data
Word2Vector
W2V based doc classification
- Full Gaussian based doc vector
- Represent each doc with a Gaussian distribution over the word vectors it contains.
- Use k-NN for classification
              mean Euclidean distance   KL distance   diagonal KL   baseline (NB with mean)
Acc (50dim)   81.84                     79.65         -             69.7
- SVM-based classification
                      mean Euclidean distance   KL distance   diagonal KL   LDA
2-class Acc (50dim)   95.57                     -             -             95.80
8-class Acc (50dim)   88.79                     -             -             -
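The Gaussian document representation with a KL-based k-NN classifier can be sketched as follows. This is an illustration under stated assumptions rather than the code behind the tables: the function names are hypothetical, a small ridge term is added to keep the covariance invertible, and a symmetrised KL is used as the distance (the notes do not say which direction or variant was used).

```python
from collections import Counter
import numpy as np

def fit_gaussian(word_vectors, ridge=1e-3):
    """Summarise a document (n_words x dim, n_words >= 2) as (mean, full covariance)."""
    x = np.asarray(word_vectors, dtype=float)
    mean = x.mean(axis=0)
    cov = np.atleast_2d(np.cov(x, rowvar=False)) + ridge * np.eye(x.shape[1])
    return mean, cov

def kl_gaussian(p, q):
    """KL(p || q) between two full-covariance Gaussians."""
    (m0, s0), (m1, s1) = p, q
    d = m0.shape[0]
    s1_inv = np.linalg.inv(s1)
    diff = m1 - m0
    _, logdet0 = np.linalg.slogdet(s0)
    _, logdet1 = np.linalg.slogdet(s1)
    return 0.5 * (np.trace(s1_inv @ s0) + diff @ s1_inv @ diff - d + logdet1 - logdet0)

def symmetric_kl(p, q):
    """Symmetrised KL, an assumed choice so the measure behaves like a distance."""
    return kl_gaussian(p, q) + kl_gaussian(q, p)

def knn_classify(query, train_docs, train_labels, k=3):
    """Majority vote among the k training documents closest to `query`."""
    dists = [symmetric_kl(query, doc) for doc in train_docs]
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

The "diagonal KL" variant would keep only the diagonal of each covariance, and the "mean Euclidean distance" variant would compare the means alone.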
Semantic word tree
- Version v2.0 released (filter with query log)
- Please deliver to /nfs/disk/perm/data/corpora/semanticTree (Xingchao)
- Version v3.0 underway: further refinement with the Baidu Baike hierarchy
NN LM
- Character-based NNLM (6700 chars, 7gram), 500M data training done.
- Inconsistent WER patterns were found on the Tencent test sets
- Probably need to use another test set for the investigation.
- Investigate MS RNN LM training
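For the character-based 7-gram NNLM above, training examples pair the six preceding characters with the next character. The sketch below shows that extraction step only, assuming a feed-forward n-gram setup; the function name and the `<s>` padding symbol are hypothetical, and mapping characters to the 6700-entry vocabulary is left out.

```python
def char_ngram_examples(text, order=7, pad="<s>"):
    """Return (context, target) pairs: order-1 history characters and the next character."""
    chars = [pad] * (order - 1) + list(text)
    return [
        (tuple(chars[i : i + order - 1]), chars[i + order - 1])
        for i in range(len(text))
    ]
```

Each sentence of n characters yields n examples, with the earliest contexts filled by padding.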