== Speech Processing ==

=== AM development ===

==== Environment ====
* Already bought 3 GTX 760 GPUs.
* The 760 GPUs on grid-9 and grid-12 crashed again; grid-11 shut down automatically.
* Replace the 760 GPU cards of grid-12 and grid-14. (+)

==== Sparse DNN ====
* Performance improvement found when pruned slightly (a magnitude-pruning sketch follows this list).
* Retraining is needed for the unpruned one; training loss.
* Details at http://liuc.cslt.org/pages/sparse.html
* HOLD

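A minimal sketch of magnitude-based pruning (an illustration under assumed details, not the exact recipe from these experiments; see the page linked above for those):
<pre>
import numpy as np

def prune_by_magnitude(weights, sparsity=0.1):
    """Zero out the smallest-magnitude fraction of weights.

    Slight pruning (e.g. sparsity=0.1) keeps most connections; the
    pruned network is then retrained to recover any loss.
    """
    flat = np.abs(weights).ravel()
    k = int(len(flat) * sparsity)
    if k == 0:
        return weights, 0.0
    threshold = np.partition(flat, k)[k]   # k-th smallest magnitude
    mask = np.abs(weights) >= threshold
    return weights * mask, 1.0 - mask.mean()

# Example: prune 10% of a 1200x1200 hidden layer.
W = np.random.randn(1200, 1200)
W_pruned, achieved = prune_by_magnitude(W, sparsity=0.1)
print("fraction pruned: %.3f" % achieved)
</pre>
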
==== RNN AM ====
* The initial nnet does not perform well; it needs pre-training, or a lower learning rate should be tested.
* For AURORA 4 (about 1 h/epoch), model training is done.
* Using AURORA 4 short sentences with a smaller number of targets. (+)
* Adjusting the learning rate. (+)
* Trying Microsoft's toolkit. (+)
* Details at http://liuc.cslt.org/pages/rnnam.html
* Reading papers.

==== A new nnet training scheduler ====
* Initial code done. It is no better than the original scheduler, considering that it takes many more iterations (the baseline halving rule is sketched below).
* Details at http://liuc.cslt.org/pages/nnet-sched.html
* Done.

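For comparison, a minimal sketch of the newbob-style halving rule commonly used as the baseline scheduler in this kind of nnet training (parameter names and thresholds here are illustrative assumptions):
<pre>
def next_learning_rate(lr, prev_loss, cur_loss,
                       halving_factor=0.5, start_halving_impr=0.01):
    """One scheduling step: keep the learning rate while the relative
    improvement on the cross-validation set is large; halve it once
    the improvement stalls."""
    rel_impr = (prev_loss - cur_loss) / max(prev_loss, 1e-12)
    if rel_impr < start_halving_impr:
        return lr * halving_factor
    return lr

# Example: cross-validation losses over epochs drive the halving.
lr = 0.008
cv_losses = [2.31, 2.10, 2.02, 2.014, 2.012]
for prev, cur in zip(cv_losses, cv_losses[1:]):
    lr = next_learning_rate(lr, prev, cur)
    print("next epoch lr = %.5f" % lr)
</pre>
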
==== Drop out & Rectification & convolutive network ====

* Drop out (+)
:* AURORA4 dataset
:* Use different proportions of noise data to investigate the effects of xEnt, MPE, and dropout:
:** Problem 1) the effect of dropout under different noise proportions;
<pre>
No. | data & config         | test_clean_wv1 | test_airport_wv1 | test_babble_wv1 | test_car_wv1 |
---------------------------------------------------------------------------------------------------
 1  | clean-std             | 6.74           |   28.77          |   31.84         |  14.24       |
 2  | clean-dropout0.8      | 6.78           |   25.89          |   26.45         |  12.57       |
 3  | noise-20%-std         | 6.76           |   14.74          |   14.32         |   8.87       |
 4  | noise-20%-dropout0.8  | 7.01           |   14.51          |   13.61         |   9.22       |
 5  | noise-100%-std        | 9.03           |   11.21          |   11.44         |   7.96       |
 6  | noise-100%-dropout0.8 | 8.87           |   11.58          |   12.22         |   8.38       |
</pre>
:** Problem 2) the effect of MPE under different noise proportions;
:** Problem 3) the effect of MPE+dropout under different noise proportions.
:** http://cslt.riit.tsinghua.edu.cn/cgi-bin/cvss/cvss_request.pl?step=view_request&cvssid=261
:** Find and test unknown-noise test data. (++)
:** Dropout has been tried on normally trained xEnt nnets, e.g. WSJ (learning rate 1e-4/1e-5). A small learning rate seems to give the best balance between accuracy and training time. (A dropout sketch follows this item.)

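For reference on the dropout0.8 runs above, a minimal sketch of dropout with retention probability 0.8 (the inverted-dropout scaling is an assumption; the exact formulation in the toolkit used here may differ):
<pre>
import numpy as np

def dropout(activations, retention=0.8, train=True):
    """Inverted dropout: keep each unit with probability `retention`
    during training and rescale, so test-time forwarding is unchanged."""
    if not train:
        return activations
    mask = np.random.rand(*activations.shape) < retention
    return activations * mask / retention

# Example: one minibatch of 256 frames, 1200 hidden units.
h = np.random.randn(256, 1200)
print(dropout(h).shape)
</pre>
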
* MaxOut (+)
:* Pre-training-based maxout cannot use a large learning rate.
:** Select units at group-size intervals, but a low learning rate is needed.

* SoftMaxout

* P-norm
:* Need to solve the too-small-learning-rate problem.
:** Add one normalization layer after the p-norm layer (see the sketch after this item).

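A minimal sketch of the maxout and group p-norm nonlinearities and of the normalization layer proposed above (group sizes and the target RMS are hypothetical choices, not values from these experiments):
<pre>
import numpy as np

def maxout(x, group_size=2):
    """Maxout: each output is the max over a group of linear units."""
    T, D = x.shape
    return x.reshape(T, D // group_size, group_size).max(axis=-1)

def pnorm(x, group_size=10, p=2.0):
    """Group p-norm: y_j = (sum_i |x_ji|^p)^(1/p)."""
    T, D = x.shape
    g = x.reshape(T, D // group_size, group_size)
    return (np.abs(g) ** p).sum(axis=-1) ** (1.0 / p)

def renorm(y, target_rms=1.0):
    """Normalization layer after the p-norm layer: rescale each frame
    to a fixed RMS so activations (and gradients) keep a stable scale,
    which is the motivation given above for attacking the
    too-small-learning-rate problem."""
    rms = np.sqrt((y ** 2).mean(axis=1, keepdims=True)) + 1e-8
    return y * (target_rms / rms)

# Example: 1200 linear units -> 120 p-norm outputs, renormalized.
x = np.random.randn(8, 1200)
print(renorm(pnorm(x)).shape)   # (8, 120)
</pre>
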
* Convolutive network (+)
:* AURORA 4 (the unit-count arithmetic behind these configurations is sketched after the table)

  -------------------------------------------------------------------------------------------------------
  nonlda           | %WER | Dnn l-u  | pool size-step | cnn dim-step-num                 | cnn_init_opts
  -------------------------------------------------------------------------------------------------------
  cnn_std          | 5.73 | 4 - 1200 | 3 - 3          | 8-1-128 512-128-256              | --patch-dim1 8
                   |      |          |                |                                  | --input_dim~patch-dim1
  cnn_cnnunit_384  | 5.85 | 4 - 1200 | 3 - 3          | 8-1-128 512-128-384              | --patch-dim1 8
                   |      |          |                |                                  | --num-filters2 384
  cnn_patchdim1_5  | 5.92 | 4 - 1200 | 3 - 3          | 5-1-128 512-128-256              | --patch-dim1 5
  cnn_patchdim1_11 | 6.05 | 4 - 1200 | 3 - 3          | 11-1-128 512-128-256             | --patch-dim1 11
  cnn_delta_1      | 5.98 | 4 - 1200 | 3 - 3          | 8-1-128 512-128-256              | --patch-dim1 8
  cnn_delta_2      | 6.05 | 4 - 1200 | 3 - 3          | 8-1-128 512-128-256              | --patch-dim1 8
  cnn_layer_3      | 6.00 | 4 - 1200 | 3 - 3 3 - 1    | 8-1-128 512-128-256 768-256-512  | --patch-dim1 8
  cnn_layer_3_2    | 5.85 | 4 - 1200 | 3 - 3 2 - 2    | 8-1-128 512-128-256 768-256-512  | --patch-dim1 8
  cnn_layer_3_3    | 5.73 | 4 - 1200 | 3 - 3 2 - 2    | 8-1-128 512-128-256 512-256-512  | --patch-dim1 8
  cnn_layer_3_4    | 5.96 | 4 - 1200 | 3 - 3 2 - 2    | 8-1-128 512-128-256 256-256-512  | --patch-dim1 8
  -------------------------------------------------------------------------------------------------------

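To make the table's notation concrete, a sketch of the unit-count arithmetic for a 1-D convolution plus max-pooling, reading "cnn dim-step-num" as patch dim, patch step, and number of filters, and assuming a 40-channel filterbank input (the input dimensionality is not stated in the table):
<pre>
def conv_positions(input_dim, patch_dim, patch_step=1):
    """Number of patch positions for a 1-D convolution over the
    frequency axis."""
    return (input_dim - patch_dim) // patch_step + 1

def pooled_positions(positions, pool_size, pool_step):
    """Positions left per filter after max-pooling."""
    return (positions - pool_size) // pool_step + 1

# Example: 40 channels, --patch-dim1 8, step 1 -> 33 positions;
# pool size-step 3 - 3 -> 11 positions per filter.
p = conv_positions(40, 8, 1)
print(p, pooled_positions(p, 3, 3))   # 33 11
</pre>
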
==== DAE (Deep Auto-Encoder) ====
 (1) train_clean
  drop-retention/testcase (WER) | test_clean_wv1 | test_airport_wv1 | test_babble_wv1 | test_car_wv1
  ---------------------------------------------------------------------------------------------------
  std-xEnt-sigmoid-baseline     | 6.04           |   29.91          |   27.76         | 16.37
  std+dae_cmvn_noFT_2-1200      | 7.10           |   15.33          |   16.58         |  9.23
  std+dae_cmvn_splice5_2-100    | 8.19           |   15.21          |   15.25         |  9.31
  ---------------------------------------------------------------------------------------------------

:* Test on XinWenLianBo music; results at:
:** http://cslt.riit.tsinghua.edu.cn/cgi-bin/cvss/cvss_request.pl?account=zhaomy&step=view_request&cvssid=318

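For reference, a sketch of the front end suggested by the run names above ("cmvn" and "splice5"); the DAE itself is a feed-forward network trained to reconstruct clean features from noisy inputs (the feature dimensionality below is an assumption):
<pre>
import numpy as np

def cmvn(frames):
    """Per-utterance mean and variance normalization, applied before
    the DAE (the "cmvn" tag in the run names)."""
    return (frames - frames.mean(0)) / (frames.std(0) + 1e-8)

def splice(frames, context=5):
    """Splice +/-context frames into one input vector per frame (the
    "splice5" configuration), padding at the utterance edges."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode='edge')
    T = frames.shape[0]
    return np.stack([padded[t:t + 2 * context + 1].ravel()
                     for t in range(T)])

# Example: 100 frames of 40-dim features -> 100 x 440 DAE inputs.
feats = np.random.randn(100, 40)
print(splice(cmvn(feats)).shape)   # (100, 440)
</pre>
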
==== Denoising & Farfield ASR ====
* ICASSP paper submitted.
* HOLD

==== VAD ====
* Harmonics and Teager energy features are being investigated (see the sketch below). (++)

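For context, the Teager energy operator has a simple discrete form, psi[x](n) = x(n)^2 - x(n-1)x(n+1); a minimal sketch of a frame-level VAD feature built on it (the frame length is a placeholder):
<pre>
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator:
    psi[x](n) = x(n)^2 - x(n-1) * x(n+1)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def frame_teager_feature(signal, frame_len=256):
    """Mean Teager energy per frame, a candidate VAD feature: speech
    frames tend to score higher than stationary background noise."""
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    return np.array([teager_energy(f).mean() for f in frames])
</pre>
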
==== Speech rate training ====
* Data ready on the Tencent set; there were some errors in the speech-rate-dependent model; the errors are fixed.
* Retrain the new model. (+)

==== Scoring ====
* Timbre comparison done.
* Harmonics-based timbre comparison: the frequency-based feature is better. Done.
* GMM-based timbre comparison is done; it is similar to speaker recognition. Done.
* TODO: code check-in and '''technical report'''. Done.

==== Confidence ====
* Reproduce the experiments on the Fisher dataset.
* Use the Fisher DNN model to decode the all-wsj dataset.
* Preparing scoring for the puqiang data.
* HOLD

=== Speaker ID ===
* Preparing the GMM-based server.
* EER ~ 4% (GMM-based system), text-independent.
* EER ~ 6% (1 s) / 0.5% (5 s) (GMM-based system), text-dependent (an EER sketch follows this list).
* Test different numbers of components; fast i-vector computation.

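For reference, a minimal sketch of how an EER figure is computed from trial scores (a simple threshold sweep; production evaluation typically interpolates on the DET curve):
<pre>
import numpy as np

def eer(target_scores, impostor_scores):
    """Equal error rate: the operating point where the false-rejection
    rate on targets equals the false-acceptance rate on impostors."""
    scores = np.concatenate([target_scores, impostor_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(impostor_scores))])
    labels = labels[np.argsort(scores)]
    frr = np.cumsum(labels) / labels.sum()                 # targets rejected
    far = 1 - np.cumsum(1 - labels) / (1 - labels).sum()   # impostors accepted
    i = np.argmin(np.abs(frr - far))
    return (frr[i] + far[i]) / 2

# Example with synthetic scores.
tar = np.random.normal(2.0, 1.0, 1000)
imp = np.random.normal(0.0, 1.0, 10000)
print("EER ~ %.1f%%" % (100 * eer(tar, imp)))
</pre>
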
=== Language ID ===
* The GMM-based language ID system is ready.
* Delivered to Jietong.
* Prepare the test case.

=== Voice Conversion ===
* Yiye is reading materials. (+)

 
== Text Processing ==

=== LM development ===

==== Domain specific LM ====
* domain lm
:* Sougou2T: kn-count continues.
:* LM v2.0 set-up (this week).
* new dict
:* Released vocab v2.0 (mainly done by Dongxu) to JieTong.
:* Use minimum-size segmentation and manually add long words (like 中华人民共和国).
:* Check the v2.0 dict with small data.

==== tag LM ====
* Summary done.
* To do:
:* Tag probability: test adding the weight (hanzhenglong), then hand over to hanzhenglong. (hold)
:* Write a summary of the tag LM and a journal paper (wxx and yuanb) (this week).
:* Reviewed papers and began to write the paper (this week).

==== RNN LM ====
* rnn
:* Test the WER of the RNNLM on Chinese data from jietong-data (this week).
:* Generate an n-gram model from the RNNLM and test the PPL on texts of different sizes (a PPL sketch follows this list). [1]
* lstm+rnn
:* Check how the LSTM-RNNLM code initializes and updates the learning rate. (hold)
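For reference, the perplexity used in that comparison is computed from per-word log probabilities; a minimal sketch:
<pre>
import math

def perplexity(logprobs):
    """Per-word perplexity from natural-log word probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))

# Example: three words with probabilities 0.1, 0.2, 0.05.
print(perplexity([math.log(0.1), math.log(0.2), math.log(0.05)]))
</pre>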

=== Word2Vector ===

==== W2V based doc classification ====
* Initial results with the variational Bayesian GMM obtained. Performance is not as good as the conventional GMM. (hold)
* Non-linear inter-language transform, English-Spanish-Czech: wv model training is done; the transform model is under investigation.

==== Knowledge vector ====
* Knowledge vector work started.
* Parsing the wiki category and link information into JSON is done; building the knowledge-vector graph is done.
* Beginning to code the training.

==== Relation ====
* Accomplished transE with almost the same performance as the paper reported (even better); a scoring sketch follows. [2]
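For reference, a minimal sketch of the transE scoring function and margin loss (embedding dimension and margin below are illustrative):
<pre>
import numpy as np

def transe_score(h, r, t, norm=1):
    """transE plausibility ||h + r - t||: small for a true
    (head, relation, tail) triple."""
    return np.linalg.norm(h + r - t, ord=norm)

def margin_loss(pos_score, neg_score, margin=1.0):
    """Margin ranking loss: push a true triple's score below a
    corrupted triple's score by at least the margin."""
    return max(0.0, margin + pos_score - neg_score)

# Example with 50-dim embeddings.
rng = np.random.default_rng(0)
h, r, t = (rng.normal(size=50) for _ in range(3))
print(margin_loss(transe_score(h, r, t), transe_score(h, r, -t)))
</pre>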

==== Character to word ====
* Character-to-word conversion. (hold)
:* Prepare the task: word similarity.
:* Prepare the dict.

=== Translation ===
* v5.0 demo released.
* Cut the dict and use the new segmentation tool.

=== QA ===
Details:

==== Spell mistake ====
* Add the xiaoI pinyin correction to the framework.

==== Improve fuzzy match ====
* Add synonym similarity using the MERT-4 method. (hold)

==== Improve lucene search ====
* Use the MERT-4 method to find good weights for multiple features, like IDF, NER, baidu_weight, keyword, etc. (liurong, this month; a scoring sketch follows this list)
* Now testing the performance.
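A minimal sketch of the weighted feature combination being tuned (feature values and weights below are placeholders; MERT searches for the weights that maximize answer accuracy on a development set):
<pre>
def rerank_score(features, weights):
    """Linear combination of retrieval features for a candidate."""
    return sum(weights[k] * features.get(k, 0.0) for k in weights)

# Hypothetical feature values for one candidate answer.
candidate = {"idf": 2.3, "ner": 1.0, "baidu_weight": 0.4, "keyword": 3.0}
weights = {"idf": 0.5, "ner": 0.2, "baidu_weight": 0.1, "keyword": 0.2}
print(rerank_score(candidate, weights))
</pre>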

==== Multi-Scene Recognition ====
* Done.

==== XiaoI framework ====
* NER from xiaoI.
* The new intern will install SEMPRE.

==== Patent ====
* Done.