2014-06-27


Resource Building

Leftover questions

  • Asymmetric window: great improvement on the training set (WER 34% to 24%), but the improvement is lost on the test set (a window-construction sketch follows this list).
  • Multi-GPU training: error encountered.
  • Multilingual training
  • Investigating LOUDS FST.
  • CLG embedded decoder plus online compiler.
  • DNN-GMM co-training
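
The asymmetric-window item above does not specify the window shape. Below is a minimal numpy sketch of one common construction (the rising half of a long Hamming window joined to the falling half of a short one, so the two sides of the frame are weighted differently); the half-lengths are illustrative assumptions, not the settings actually used.

    import numpy as np

    def asymmetric_window(n_left=300, n_right=100):
        # Rising half of a long Hamming window followed by the falling half of a
        # short one. n_left / n_right are placeholder sample counts.
        left = np.hamming(2 * n_left)[:n_left]
        right = np.hamming(2 * n_right)[n_right:]
        return np.concatenate([left, right])

    # Apply to one 400-sample frame (25 ms at 16 kHz) before the FFT.
    frame = np.random.randn(400)
    windowed = frame * asymmetric_window(300, 100)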

AM development

Sparse DNN

  • GA-based block sparsity (++++++++)


Noise training

  • Paper writing ongoing

GFbank

  • Running the Sinovoice 8k 1400 + 100 mixture training.
  • FBank/GFbank, stream/non-stream MPE completed (a GFbank feature sketch follows the table):
                                    Huawei 3rd batch    BJ mobile   8k English data       
FBank non-stream (MPE4)             20.44%              22.28%      24.36%
FBank stream (MPE4)                 19.46%              22.00%      21.19%
GFbank stream    (MPE4)             20.69%              22.84%      24.45%
GFbank non-stream (MPE)             -                     -           -
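
For reference, GFbank here denotes gammatone filterbank features, the gammatone analogue of FBank. A minimal numpy sketch of the usual recipe follows: build a bank of 4th-order gammatone filters with ERB-spaced centre frequencies, filter the waveform, and take log frame energies per channel. The filter count, frame sizes, and sample rate below are illustrative assumptions, not the actual front-end configuration.

    import numpy as np

    def erb_space(low_hz, high_hz, n):
        # Centre frequencies equally spaced on the ERB-rate scale.
        erb = lambda f: 21.4 * np.log10(1 + 0.00437 * f)
        inv = lambda e: (10 ** (e / 21.4) - 1) / 0.00437
        return inv(np.linspace(erb(low_hz), erb(high_hz), n))

    def gammatone_ir(fc, fs, dur=0.025, order=4, b_factor=1.019):
        # Truncated impulse response of a gammatone filter centred at fc (Hz).
        t = np.arange(int(dur * fs)) / fs
        erb_width = 24.7 * (4.37 * fc / 1000.0 + 1.0)
        g = t ** (order - 1) * np.exp(-2 * np.pi * b_factor * erb_width * t) * np.cos(2 * np.pi * fc * t)
        return g / np.sqrt(np.sum(g ** 2) + 1e-12)

    def gfbank(signal, fs=8000, n_filters=24, frame_len=0.025, frame_shift=0.010):
        # Log frame energies of the gammatone-filtered signal (GFbank-style features).
        centres = erb_space(100.0, 0.45 * fs, n_filters)
        flen, fshift = int(frame_len * fs), int(frame_shift * fs)
        n_frames = 1 + max(0, (len(signal) - flen) // fshift)
        feats = np.zeros((n_frames, n_filters))
        for j, fc in enumerate(centres):
            y = np.convolve(signal, gammatone_ir(fc, fs), mode="same")
            for i in range(n_frames):
                frame = y[i * fshift: i * fshift + flen]
                feats[i, j] = np.log(np.sum(frame ** 2) + 1e-10)
        return feats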

Multilingual ASR

                                   HW 27h (HW TR LM not involved)    HW 27h (HW TR LM involved)
FBank stream (monolang)             21.64                                   20.72
FBank non-stream (MPE4)             22.23                                   21.38
FBank stream (MPE4)                 21.99                                     -  

Denoising & Farfield ASR

  • Correlation-based alignment is done. This is necessary since the recording devices may introduce an artificial delay (a cross-correlation sketch follows this list).
  • What about the output CMVN test?
  • Deliver the recordings to /nfs/disk/perm/data/corpora/reverberant
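
The report does not say how the correlation-based alignment is implemented; a minimal numpy sketch of the usual approach is given below: estimate the device delay from the cross-correlation peak, then trim the lagging signal so the reference and the far-field recording line up. Variable names are illustrative.

    import numpy as np

    def estimate_delay(ref, rec):
        # Delay (in samples) of `rec` relative to `ref`, taken from the
        # cross-correlation peak. Positive means `rec` starts later than `ref`.
        # For long recordings an FFT-based correlation is preferable to this
        # O(N^2) direct form.
        corr = np.correlate(rec, ref, mode="full")
        return int(np.argmax(corr)) - (len(ref) - 1)

    def align(ref, rec):
        # Drop the leading samples introduced by the device delay, then cut both
        # signals to a common length.
        lag = estimate_delay(ref, rec)
        if lag > 0:
            rec = rec[lag:]
        elif lag < 0:
            ref = ref[-lag:]
        n = min(len(ref), len(rec))
        return ref[:n], rec[:n]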

Original model:

xEnt model:
               middle-field    far-field
    dev93       74.79          96.68
    eval92      63.42          94.75

MPE model:


MPE adaptation: 

               middle-field    far-field
    dev93       63.71          94.84
    eval92      52.67          90.45

VAD

  • DNN-based VAD (7.49) shows much better performance than energy-based VAD (45.74)
  • 100 x n (n <= 3) hidden units with 2 output units seem sufficient for VAD (see the forward-pass sketch after this list)
  • report form
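
A minimal numpy sketch of the frame-level decision implied above: a small feed-forward network with up to three 100-unit hidden layers and a 2-unit softmax output (non-speech / speech), followed by simple majority smoothing. The weights, feature dimension, and smoothing window below are placeholders; the actual model is trained elsewhere.

    import numpy as np

    def dnn_vad(feats, weights, biases, smooth=11):
        # feats: (n_frames, feat_dim); weights/biases: parameters of a trained
        # network whose last layer has 2 outputs (index 1 = speech).
        h = feats
        for W, b in zip(weights[:-1], biases[:-1]):
            h = np.maximum(0.0, h @ W + b)        # hidden layers (ReLU here; sigmoid also common)
        logits = h @ weights[-1] + biases[-1]
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        raw = (e[:, 1] / e.sum(axis=1) > 0.5).astype(int)
        # majority smoothing over `smooth` frames to remove isolated flips
        k = smooth // 2
        padded = np.pad(raw, k, mode="edge")
        return np.array([int(padded[i:i + smooth].sum() > k) for i in range(len(raw))])

    # Example with random placeholder parameters: 40-dim features, 2 x 100 hidden units.
    rng = np.random.default_rng(0)
    dims = [40, 100, 100, 2]
    Ws = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(dims[:-1], dims[1:])]
    bs = [np.zeros(b) for b in dims[1:]]
    decisions = dnn_vad(rng.standard_normal((200, 40)), Ws, bs)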

Scoring

  • Refine the model with the AMIDA database; a local minimum was observed.
  • i-vector-based speaker detection seems fine, reaching 96% with 100 speakers (a minimal cosine-scoring sketch follows)
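
The backend for the i-vector speaker detection is not stated; below is a minimal sketch of the common cosine-scoring approach for closed-set identification, assuming enrollment and test i-vectors have already been extracted. Names are illustrative.

    import numpy as np

    def cosine_id(enroll, test_ivec):
        # enroll: dict speaker_id -> list of enrollment i-vectors.
        # Returns the speaker whose mean enrollment i-vector is closest in cosine
        # similarity to the test i-vector, plus all scores.
        unit = lambda v: v / (np.linalg.norm(v) + 1e-12)
        scores = {spk: float(unit(np.mean(vecs, axis=0)) @ unit(test_ivec))
                  for spk, vecs in enroll.items()}
        return max(scores, key=scores.get), scores

    # e.g. enroll = {"spk01": [ivec_a, ivec_b], "spk02": [ivec_c]}; best, _ = cosine_id(enroll, ivec_test)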


Embedded decoder


AM: 600x4+800 xent9 model: 



pruning threshold: 1e-5, Nobiglm
------------------------------------------------------------------------------------------
             |    150k   |   80k    |     40k     |     20k    |    10k     |      5k    |
------------------------------------------------------------------------------------------
      wer    |    26.60  |   27.16  |    28.11    |    29.14   |   31.02    |    33.37   |
------------------------------------------------------------------------------------------
       RT    |    0.68   |   0.66   |    0.61     |    0.61    |    0.58    |    0.56    |
------------------------------------------------------------------------------------------
 graph size  |     21M   |    14M   |    9.1M     |    6.9M    |    5.5M    |    4.1M    |
------------------------------------------------------------------------------------------

YINSHI:2014-Jun-24,Wednesday,10:7:0 


pruning threshold: 1e-6, Nobiglm
------------------------------------------------------------------------------------------
             |    150k   |   80k    |     40k     |     20k    |    10k     |      5k    |
------------------------------------------------------------------------------------------
      wer    |    22.49  |   23.05  |    24.15    |    25.51   |   27.71    |    30.71   |
------------------------------------------------------------------------------------------
       RT    |    0.89   |   0.84   |    0.76     |    0.70    |    0.68    |    0.64    |
------------------------------------------------------------------------------------------
 graph size  |     98M   |    86M   |     67M     |    49M     |    34M     |     24M    |
------------------------------------------------------------------------------------------

YINSHI:2014-Jun-27,Saturday,0:52:35 


pruning threshold: 1e-6.5, biglm
------------------------------------------------------------------------------------------
             |    150k   |   80k    |     40k     |     20k    |    10k     |      5k    |
------------------------------------------------------------------------------------------
      wer    |    21.12  |   21.75  |    22.92    |    24.39   |   26.89    |    30.01   |
------------------------------------------------------------------------------------------
       RT    |    1.45   |   1.25   |    1.16     |    1.11    |    1.02    |    0.94    |
------------------------------------------------------------------------------------------
 graph size  |     38M   |    35M   |     30M     |    25M     |    20M     |     15M    |
------------------------------------------------------------------------------------------

YINSHI:2014-Jun-27,Saturday,0:58:27 


pruning threshold: 1e-5.5, Nobiglm
------------------------------------------------------------------------------------------
             |    150k   |   80k    |     40k     |     20k    |    10k     |      5k    |
------------------------------------------------------------------------------------------
      wer    |    24.46  |   25.05  |    26.05    |    27.11   |   29.36    |    32.01   |
------------------------------------------------------------------------------------------
       RT    |    0.71   |   0.69   |    0.66     |    0.63    |    0.60    |    0.58    |
------------------------------------------------------------------------------------------
 graph size  |     39M   |    32M   |     25M     |    19M     |    14M     |    9.2M    |
------------------------------------------------------------------------------------------


LM development

Domain specific LM

  • Baidu Zhidao + Weibo extraction done with various thresholds
  • The extracted text seems to help to some extent, but the major gain appears to come from pre-processing.
  • Check the proportion of tags in the HW 30h data


Word2Vector

W2V based doc classification

  • Full-Gaussian-based doc vectors
  • Represent each doc with a Gaussian distribution over the word vectors it contains.
  • Use k-NN to conduct classification (see the sketch after the tables below):
                  mean Euclidean distance   KL distance   diagonal KL   baseline (NB with mean)

Acc (50dim)               81.84                79.65           -                69.7
  • SVM-based classification (results below):


               mean Euclidean distance     KL distance     diagonal KL         LDA

2-class Acc (50dim)       95.57                 -               -              95.80
8-class Acc (50dim)       88.79                 -               -                -
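
A minimal numpy sketch of the document representation and k-NN classifier described above: each document is summarised by the mean and (regularised) covariance of its word vectors, documents are compared with a symmetrised Gaussian KL divergence, and the label is a majority vote over the k nearest training documents. The ridge term and k are placeholder choices.

    import numpy as np

    def doc_gaussian(word_vecs, ridge=1e-3):
        # word_vecs: (n_words, dim) word2vec vectors of one document.
        mu = word_vecs.mean(axis=0)
        cov = np.cov(word_vecs, rowvar=False) + ridge * np.eye(word_vecs.shape[1])
        return mu, cov

    def kl_gauss(p, q):
        # KL( N(mu_p, S_p) || N(mu_q, S_q) ) for full-covariance Gaussians.
        (mu_p, S_p), (mu_q, S_q) = p, q
        d = len(mu_p)
        S_q_inv = np.linalg.inv(S_q)
        diff = mu_q - mu_p
        _, logdet_p = np.linalg.slogdet(S_p)
        _, logdet_q = np.linalg.slogdet(S_q)
        return 0.5 * (np.trace(S_q_inv @ S_p) + diff @ S_q_inv @ diff - d + logdet_q - logdet_p)

    def knn_classify(test_doc, train_docs, train_labels, k=5):
        # Majority vote over the k training documents closest in symmetric KL.
        dists = [kl_gauss(test_doc, d) + kl_gauss(d, test_doc) for d in train_docs]
        nearest = np.argsort(dists)[:k]
        votes = [train_labels[i] for i in nearest]
        return max(set(votes), key=votes.count)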

Semantic word tree

  • Version v2.0 released (filtered with query logs)
  • Please deliver to /nfs/disk/perm/data/corpora/semanticTree (Xingchao)
  • Version v3.0 ongoing; further refinement with the Baidu Baike hierarchy


NN LM

  • Character-based NNLM (6,700 characters, 7-gram); training on 500M data done (an architecture sketch follows this list).
  • Inconsistent WER patterns were found on the Tencent test sets.
  • Probably need another test set for further investigation.
  • Investigate MS RNN LM training
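
A minimal numpy sketch of the forward pass of the character-based NNLM described above: a 7-gram feed-forward model that embeds the six previous characters, concatenates the embeddings, applies one hidden layer, and predicts the next character with a softmax over the 6,700-character vocabulary. Embedding and hidden sizes are placeholders; the actual training setup is not given in this report.

    import numpy as np

    V, CONTEXT, EMB, HID = 6700, 6, 100, 500   # vocab, history length (7-gram), placeholder sizes

    rng = np.random.default_rng(0)
    params = {
        "E":  rng.standard_normal((V, EMB)) * 0.01,               # character embeddings
        "W1": rng.standard_normal((CONTEXT * EMB, HID)) * 0.01,   # projection -> hidden
        "b1": np.zeros(HID),
        "W2": rng.standard_normal((HID, V)) * 0.01,               # hidden -> output
        "b2": np.zeros(V),
    }

    def next_char_distribution(history, p):
        # history: list of 6 character ids; returns P(next char | history).
        x = p["E"][history].reshape(-1)          # concatenate the 6 embeddings
        h = np.tanh(x @ p["W1"] + p["b1"])       # hidden layer
        logits = h @ p["W2"] + p["b2"]
        logits -= logits.max()                   # numerical stability
        probs = np.exp(logits)
        return probs / probs.sum()

    probs = next_char_distribution([17, 204, 5, 99, 3021, 42], params)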