ASR:2014-12-01

From cslt Wiki
Revision as of 07:48, 1 December 2014 (Mon) by Zhangzy


Speech Processing

AM development

Environment

  • Already bought three 760 GPUs.
  • The 760 GPUs on grid-9 and grid-12 crashed again; grid-11 shut down automatically.
  • Replace the 760 GPU cards in grid-12 and grid-14 (+).

Sparse DNN

RNN AM

  • The initial nnet does not seem to perform well; it needs pre-training, or a lower learning rate should be tested.
  • On AURORA 4 (1 h/epoch), model training is done.
  • Using AURORA 4 short sentences with a smaller number of targets (+).
  • Adjusting the learning rate (+).
  • Trying Microsoft's toolkit (+).
  • Details at http://liuc.cslt.org/pages/rnnam.html
  • Reading papers

A new nnet training scheduler

Dropout & Rectification & Convolutional network

  • Dropout (+)
  • AURORA4 dataset
  • Use different proportions of noisy data to investigate the effects of xEnt, MPE and dropout
    • Problems: 1) the effect of dropout at different noise proportions;
                2) the effect of MPE at different noise proportions;
                3) the effect of MPE+dropout at different noise proportions.
    • Find and test on test data with unknown noise (++).
    • Done: dropout on a normally trained xEnt nnet, e.g. WSJ (learn-rate 1e-4/1e-5). A small learning rate seems to give a good balance between accuracy and training time.
    • Debugging the low CV frame accuracy
  • Maxout (+)
  • Pretraining-based maxout
    • Select units at group-size intervals; this needs a low learning rate
    • Force-accept the first iteration to jump out of the local minimum
  • SoftMaxout
  • P-norm
  • Need to solve the problem of the learning rate being too small
    • Add a normalization layer after the p-norm layer
  • Convolutional network (+)
  • AURORA 4
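(DNN l-u: hidden layers - units per layer; pool size-step: pooling size - step; cnn dim-step-num: patch dimension - step - number of filters, one triple per convolutional layer)
--------------------------------------------------------------------------------------------------------------------------
nonlda                 | %WER      | DNN l-u   | pool size-step| cnn dim-step-num                | cnn_init_opts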

cnn_std                | 5.73      | 4 - 1200  | 3 - 3         | 8-1-128 512-128-256             |--patch-dim1 8 
                       |           |           |               |                                 |--input_dim~patch-dim1

cnn_cnnunit_384        | 5.85      | 4 - 1200  | 3 - 3         | 8-1-128 512-128-384             |--patch-dim1 8
                       |           |           |               |                                 |--num-filters2 384     

cnn_patchdim1_5        | 5.92      | 4 - 1200  | 3 - 3         | 5-1-128 512-128-256             |--patch-dim1 5

cnn_patchdim1_11       | 6.05      | 4 - 1200  | 3 - 3         | 11-1-128 512-128-256            |--patch-dim1 11

cnn_delta_1            | 5.98      | 4 - 1200  | 3 - 3         | 8-1-128 512-128-256             |--patch-dim1 8

cnn_delta_2            | 6.05      | 4 - 1200  | 3 - 3         | 8-1-128 512-128-256             |--patch-dim1 8

cnn_layer_3            | 6.00      | 4 - 1200  | 3 - 3 3 - 1   | 8-1-128 512-128-256 768-256-512 |--patch-dim1 8

cnn_layer_3_2          | 5.85      | 4 - 1200  | 3 - 3 2 - 2   | 8-1-128 512-128-256 768-256-512 |--patch-dim1 8

cnn_layer_3_3          | 5.73      | 4 - 1200  | 3 - 3 2 - 2   | 8-1-128 512-128-256 512-256-512 |--patch-dim1 8

cnn_layer_3_4          | 5.96      | 4 - 1200  | 3 - 3 2 - 2   | 8-1-128 512-128-256 256-256-512 |--patch-dim1 8
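The `pool size-step` and `cnn dim-step-num` columns determine how many values each configuration passes to the 4 × 1200 DNN. The sketch below computes this under assumptions the table does not state: 40 filterbank bins as input positions with one channel, each `dim-step-num` triple describing one convolutional layer (patch dimension, patch step, number of filters) with dimension and step counted in flattened position × channel units, and the i-th `size-step` pair being a max-pooling layer after the i-th convolutional layer. All function names are hypothetical; this is not the toolkit's own code.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, int]  # (positions, channels)


def conv1d_out(shape: Shape, dim: int, step: int, num: int) -> Shape:
    """1-D convolution along the position axis. dim and step are assumed to be
    counted in flattened units (positions x channels), as in the table."""
    positions, channels = shape
    if dim % channels or step % channels:
        raise ValueError("dim and step must be multiples of the input channels")
    window, hop = dim // channels, step // channels
    if window > positions:
        raise ValueError("patch is wider than the input")
    return (positions - window) // hop + 1, num


def pool_out(shape: Shape, size: int, step: int) -> Shape:
    """Max-pooling along the position axis; the channel count is unchanged."""
    positions, channels = shape
    return (positions - size) // step + 1, channels


def cnn_geometry(bins: int, convs: List[Tuple[int, int, int]],
                 pools: List[Tuple[int, int]]) -> Shape:
    """Apply conv layer i, then pooling layer i if one is given."""
    shape: Shape = (bins, 1)  # assumption: one channel per frequency bin
    for i, (dim, step, num) in enumerate(convs):
        shape = conv1d_out(shape, dim, step, num)
        if i < len(pools):
            shape = pool_out(shape, *pools[i])
    return shape


# A few rows of the table, assuming 40 filterbank bins as input.
ROWS: Dict[str, Tuple[list, list]] = {
    "cnn_std": ([(8, 1, 128), (512, 128, 256)], [(3, 3)]),
    "cnn_patchdim1_5": ([(5, 1, 128), (512, 128, 256)], [(3, 3)]),
    "cnn_layer_3": ([(8, 1, 128), (512, 128, 256), (768, 256, 512)], [(3, 3), (3, 1)]),
    "cnn_layer_3_3": ([(8, 1, 128), (512, 128, 256), (512, 256, 512)], [(3, 3), (2, 2)]),
}

if __name__ == "__main__":
    for name, (convs, pools) in ROWS.items():
        positions, channels = cnn_geometry(40, convs, pools)
        print(f"{name}: {positions} x {channels} = {positions * channels} DNN inputs")
```

Under these assumptions every row yields whole-number patch windows; for example `cnn_std` ends with 8 positions × 256 filters (2048 DNN inputs) and `cnn_layer_3_3` with 3 × 512.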
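For reference, the group nonlinearities in the rectification items (maxout, p-norm) and the normalization layer proposed for the p-norm learning-rate problem can be sketched as below. This is a minimal pure-Python sketch assuming nnet2-style definitions: inputs are split into non-overlapping groups of `group_size`, maxout takes the group maximum, p-norm takes the group's p-norm, and the normalization layer rescales each output vector to unit root-mean-square. Function names are hypothetical; this is not the group's training code.

```python
import math
from typing import List, Sequence


def _groups(x: Sequence[float], group_size: int) -> List[Sequence[float]]:
    """Split x into consecutive, non-overlapping groups of group_size values."""
    if len(x) % group_size != 0:
        raise ValueError("input dimension must be a multiple of group_size")
    return [x[i:i + group_size] for i in range(0, len(x), group_size)]


def maxout(x: Sequence[float], group_size: int) -> List[float]:
    """Maxout: each output unit is the maximum of its group."""
    return [max(g) for g in _groups(x, group_size)]


def pnorm(x: Sequence[float], group_size: int, p: float = 2.0) -> List[float]:
    """P-norm: each output unit is (sum |x_i|^p)^(1/p) over its group."""
    return [sum(abs(v) ** p for v in g) ** (1.0 / p) for g in _groups(x, group_size)]


def normalize(y: Sequence[float], floor: float = 1e-10) -> List[float]:
    """Normalization layer: rescale y so that its root-mean-square is 1."""
    rms = math.sqrt(sum(v * v for v in y) / len(y))
    return [v / max(rms, floor) for v in y]


if __name__ == "__main__":
    x = [1.0, -3.0, 2.0, 0.5, -0.5, 4.0]
    print(maxout(x, 3))            # [2.0, 4.0]
    print(normalize(pnorm(x, 3)))  # p-norm outputs rescaled to unit RMS
```

One plausible reason the normalization layer helps is that p-norm outputs are unbounded and their scale can grow during training; rescaling to unit RMS keeps the input scale of the next layer fixed, so a larger learning rate may become usable.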

DAE (Deep Auto-Encoder)

 (1) train_clean
   drop-retention/testcase(WER)| test_clean_wv1  | test_airport_wv1 | test_babble_wv1 | test_car_wv1 
  ---------------------------------------------------------------------------------------------------------
      std-xEnt-sigmoid-baseline| 6.04            |    29.91         |   27.76         | 16.37
  ---------------------------------------------------------------------------------------------------------
      std+dae_cmvn_noFT_2-1200 | 7.10            |    15.33         |   16.58         | 9.23
  ---------------------------------------------------------------------------------------------------------
   std+dae_cmvn_splice5_2-100  | 8.19            |    15.21         |   15.25         | 9.31
  ---------------------------------------------------------------------------------------------------------
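The DAE rows are labelled `cmvn` and `splice5`. Reading these (an assumption based only on the names) as per-utterance CMVN followed by splicing ±5 neighbouring frames, the input preparation can be sketched with NumPy as below. `dae_training_pair`, which pairs spliced noisy input with the time-aligned clean frame as the regression target, is a hypothetical illustration of how a denoising auto-encoder is typically trained, not the configuration used in these experiments.

```python
import numpy as np


def cmvn(feats: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Per-utterance mean and variance normalization of a (frames x dim) matrix."""
    mean = feats.mean(axis=0)
    std = feats.std(axis=0)
    return (feats - mean) / np.maximum(std, eps)


def splice(feats: np.ndarray, context: int = 5) -> np.ndarray:
    """Concatenate each frame with its +/-context neighbours (edge frames are
    repeated), giving a (frames x (2*context+1)*dim) matrix."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    n = feats.shape[0]
    return np.hstack([padded[i:i + n] for i in range(2 * context + 1)])


def dae_training_pair(noisy: np.ndarray, clean: np.ndarray, context: int = 5):
    """Hypothetical DAE training pair: spliced noisy frames as input and the
    time-aligned clean frame as the regression target (both CMVN'd)."""
    if noisy.shape != clean.shape:
        raise ValueError("noisy and clean features must be frame-aligned")
    return splice(cmvn(noisy), context), cmvn(clean)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.normal(size=(100, 40))            # stand-in for 40-dim fbank
    noisy = clean + 0.3 * rng.normal(size=clean.shape)
    x, y = dae_training_pair(noisy, clean)
    print(x.shape, y.shape)                       # (100, 440) (100, 40)
```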

Denoising & Farfield ASR

  • ICASSP paper submitted.
  • HOLD

VAD

  • Frame energy feature extraction: done
  • Harmonics and Teager energy features under investigation (++)
  • Previous results to be organized into a paper
  • VAD with the MPE model: good performance observed.
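The two energy features can be sketched with NumPy as below: per-frame log energy, and the per-frame mean of the Teager-Kaiser energy operator ψ[x(n)] = x(n)² − x(n−1)·x(n+1). The 25 ms / 10 ms framing at 16 kHz and all function names are assumptions for illustration, not the group's feature extractor.

```python
import numpy as np


def frame_signal(x, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def log_energy(frames: np.ndarray, floor: float = 1e-10) -> np.ndarray:
    """Per-frame log energy, log(sum_n x[n]^2)."""
    return np.log(np.maximum(np.sum(frames ** 2, axis=1), floor))


def teager_energy(frames: np.ndarray) -> np.ndarray:
    """Per-frame mean of the Teager-Kaiser operator x[n]^2 - x[n-1] * x[n+1]."""
    psi = frames[:, 1:-1] ** 2 - frames[:, :-2] * frames[:, 2:]
    return psi.mean(axis=1)


if __name__ == "__main__":
    n = np.arange(16000)
    signal = np.where(n < 8000, 0.01, 1.0) * np.cos(0.2 * n)  # quiet, then loud
    frames = frame_signal(signal)
    print(log_energy(frames)[:3], log_energy(frames)[-3:])
    print(teager_energy(frames)[:3], teager_energy(frames)[-3:])
```

For a pure tone A·cos(Ωn + φ) the operator gives the constant A²·sin²Ω, so the Teager energy grows with both amplitude and frequency, which is one reason it is considered alongside plain frame energy.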

Speech rate training

  • Data ready on the Tencent set; some errors in the speech-rate-dependent model
  • Retraining a new model (+)

Scoring

  • Timbre comparison done.
  • Harmonics-based timbre comparison: frequency-based features are better
  • GMM-based timbre comparison done; similar to speaker recognition
  • TODO: code check-in and technical report

Confidence

  • Reproduce the experiments on the Fisher dataset.
  • Use the Fisher DNN model to decode the all-WSJ dataset
  • Preparing scoring for the puqiang data

Speaker ID

  • Preparing GMM-based server.
  • EER ~ 4% (GMM-based system)--Text independent
  • EER ~ 6%(1s) / 0.5%(5s) (GMM-based system)--Text dependent
  • Testing different numbers of components; fast i-vector computation
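The EER figures above are the operating point where the miss rate equals the false-alarm rate. A minimal sketch of how an EER can be computed from trial scores, assuming higher scores mean "more likely the target speaker" and accepting a trial when its score is at or above the threshold (the scores in the usage block are made up for illustration):

```python
from typing import Sequence


def equal_error_rate(target: Sequence[float], nontarget: Sequence[float]) -> float:
    """Approximate EER: sweep every observed score as a threshold (accept when
    score >= threshold) and return the mean of the miss and false-alarm rates
    at the threshold where they are closest."""
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(target) | set(nontarget)):
        miss = sum(s < t for s in target) / len(target)
        false_alarm = sum(s >= t for s in nontarget) / len(nontarget)
        if abs(miss - false_alarm) < best_gap:
            best_gap, eer = abs(miss - false_alarm), (miss + false_alarm) / 2
    return eer


if __name__ == "__main__":
    target_scores = [2.1, 1.7, 0.9, 1.3, 0.4]
    nontarget_scores = [-1.2, -0.3, 0.5, -0.8, 0.1, -2.0]
    print(f"EER = {100 * equal_error_rate(target_scores, nontarget_scores):.1f}%")
```

This quadratic sweep is fine for a few thousand trials; for large trial lists, sort the scores once and accumulate the miss and false-alarm counts in a single pass.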

Language ID

  • GMM-based language ID system is ready.
  • Delivered to Jietong
  • Preparing test cases

Voice Conversion

  • Yiye is reading materials


Text Processing

LM development

Domain specific LM

  • Domain LM (need to discuss with xiaoxi)
  • Embedded language model (this week)
  • Train more LMs with Zhenlong (dianzishu, sogou, bbs chosen) (results needed).
  • Keep training the sogou2T LM (14/16 on the 3rd iteration) (this week).
  • New dict.
  • Hand over this work to hanzhenglong with a simple document (this week).

tag LM

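Different weights
--------------------------------------------------------------------------------------------------------
method       | tag-jsgf (#tags)                      | corpus (#tags) | weight | WER%  | SER%  | add_wer
--------------------------------------------------------------------------------------------------------
experiment 3 | 500 (490 less frequent and 10 unseen) | 500            | 0.1    | 16.72 | 77.92 | -
             |                                       |                | 0.3    | 15.42 | 71.25 | -
             |                                       |                | 0.5    | 15.40 | 69.58 | -
             |                                       |                | 0.7    | 15.28 | 68.75 | -
             |                                       |                | 0.8    | 15.38 | 68.33 | -
             |                                       |                | 1      | 15.98 | 69.17 | -
             |                                       |                | 2      | 19.08 | 70.83 | -
experiment 4 | 100 (90 less frequent and 10 unseen)  | 100            | 0.008  | 15.28 | 69.58 | -
             |                                       |                | 0.02   | 14.84 | 69.58 | -
             |                                       |                | 0.05   | 15.11 | 69.58 | -
             |                                       |                | 0.1    | 15.30 | 69.75 | -
             |                                       |                | 0.3    | 16.01 | 70.42 | -
experiment 5 | 500                                   | 100            | 0.01   | 17.57 | 78.75 | -
             |                                       |                | 0.05   | 16.84 | 77.08 | -
             |                                       |                | 0.08   | 16.59 | 76.25 | -
             |                                       |                | 0.15   | 16.76 | 75.42 | -
experiment 6 | 1280                                  | 500            | 0.1    | 17.42 | 77.92 | -
             |                                       |                | 0.5    | 15.20 | 69.17 | -
             |                                       |                | 0.8    | 15.30 | 68.33 | -
             |                                       |                | 1      | 15.69 | 69.58 | -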
  • Conclusions:
     1. Comparing experiment 3 with experiment 5: same JSGF file, but a different number of tags in the corpus.
        When more tags are added to the corpus, the optimal weight is larger (best WER at weight 0.7 vs. 0.08).
     2. Comparing experiment 3 with experiment 6: same number of tags in the corpus, but different JSGF sizes.
        Different JSGF sizes give about the same optimal weight (lowest SER at 0.8 in both).
  • To do:
    • Test the tag probability with the weight added (hanzhenglong), then hand over to hanzhenglong (this week).
    • Write a summary of tag LM and a journal paper (wxx and yuanb) (two weeks).
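The two conclusions can be checked directly against the table. The sketch below only re-encodes the reported (weight, WER, SER) values and picks the weight with the lowest error per experiment; it does not model how the tag weight enters the LM.

```python
from typing import Dict, List, Tuple

# (weight, WER %, SER %) for each experiment, copied from the table above.
RESULTS: Dict[str, List[Tuple[float, float, float]]] = {
    "experiment 3": [(0.1, 16.72, 77.92), (0.3, 15.42, 71.25), (0.5, 15.40, 69.58),
                     (0.7, 15.28, 68.75), (0.8, 15.38, 68.33), (1.0, 15.98, 69.17),
                     (2.0, 19.08, 70.83)],
    "experiment 4": [(0.008, 15.28, 69.58), (0.02, 14.84, 69.58), (0.05, 15.11, 69.58),
                     (0.1, 15.30, 69.75), (0.3, 16.01, 70.42)],
    "experiment 5": [(0.01, 17.57, 78.75), (0.05, 16.84, 77.08), (0.08, 16.59, 76.25),
                     (0.15, 16.76, 75.42)],
    "experiment 6": [(0.1, 17.42, 77.92), (0.5, 15.20, 69.17), (0.8, 15.30, 68.33),
                     (1.0, 15.69, 69.58)],
}

WER, SER = 1, 2  # column indices in each (weight, WER, SER) tuple


def best_weight(rows: List[Tuple[float, float, float]], metric: int) -> float:
    """Weight with the lowest error for the given metric (first one on ties)."""
    return min(rows, key=lambda row: row[metric])[0]


if __name__ == "__main__":
    for name, rows in RESULTS.items():
        print(f"{name}: best weight by WER = {best_weight(rows, WER)}, "
              f"by SER = {best_weight(rows, SER)}")
```

On these numbers the best weight is 0.7 (WER) / 0.8 (SER) for experiment 3, 0.08 / 0.15 for experiment 5 and 0.5 / 0.8 for experiment 6, so the "same optimal weight" observation holds exactly for SER and only approximately for WER; the weights were sampled on a coarse grid.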

RNN LM

  • RNN
    • Test the WER of RNNLM on Chinese data from jietong-data (this week).
    • Check how the rnnlm code initializes and updates the learning rate.
    • Generate an n-gram model from the RNNLM and test the PPL with text of different sizes (this week).
  • LSTM+RNN
    • Check how the lstm-rnnlm code initializes and updates the learning rate.
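For the learning-rate check, a useful reference point is the halving ("newbob"-style) rule that, to our understanding, Mikolov's rnnlm toolkit uses with its defaults `-alpha 0.1` and `-min-improvement 1.003` (worth confirming in the source): keep the rate while the validation log-probability improves enough, then halve it every epoch, and stop at the next insufficient improvement. The sketch simulates that rule from a list of per-epoch validation log10-probabilities and shows how PPL follows from the total log10 probability; it ignores details such as restoring the previous weights after a bad epoch, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def perplexity(log10_prob_sum: float, n_words: int) -> float:
    """Perplexity from a total log10 probability over n_words tokens."""
    return 10.0 ** (-log10_prob_sum / n_words)


def halving_schedule(valid_logp: Sequence[float], alpha: float = 0.1,
                     min_improvement: float = 1.003) -> List[float]:
    """Learning rate used in each epoch, given the validation log10-probability
    (a negative number) measured after each epoch. The rate is kept until the
    relative improvement drops below min_improvement, then halved every epoch;
    training stops at the next insufficient improvement."""
    rates: List[float] = []
    halving, prev = False, -math.inf
    for logp in valid_logp:
        rates.append(alpha)                    # rate used for this epoch
        if logp * min_improvement < prev:      # improvement too small
            if halving:
                break                          # second time: stop training
            halving = True
        if halving:
            alpha /= 2                         # rate for the next epoch
        prev = logp
    return rates


if __name__ == "__main__":
    print(halving_schedule([-1000.0, -900.0, -899.5, -899.4, -899.3]))  # [0.1, 0.1, 0.1, 0.05]
    print(perplexity(-4.0, 2))  # 100.0
```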

Word2Vector

W2V based doc classification

  • Initial results with variational Bayesian GMM obtained; performance is not as good as the conventional GMM (hold).
  • Non-linear inter-language transform (English-Spanish-Czech): word-vector model training done; transform model under investigation
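As a reference point for the non-linear transform, the common linear baseline learns a matrix W that maps source-language word vectors onto their target-language translations by least squares over a seed dictionary, then translates by nearest cosine neighbour. A minimal NumPy sketch of that linear baseline (not the non-linear model under investigation; the synthetic data and names are hypothetical):

```python
import numpy as np
from typing import List, Sequence


def fit_linear_map(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Least-squares W minimising ||src @ W - tgt||^2 over seed-dictionary
    pairs; src is (n_pairs x d_src), tgt is (n_pairs x d_tgt)."""
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return W


def translate(vec: np.ndarray, W: np.ndarray, tgt_vocab: Sequence[str],
              tgt_matrix: np.ndarray, k: int = 1) -> List[str]:
    """Map a source-language vector and return the k nearest target words by cosine."""
    z = vec @ W
    norms = np.linalg.norm(tgt_matrix, axis=1) * np.linalg.norm(z)
    sims = tgt_matrix @ z / np.maximum(norms, 1e-12)
    return [tgt_vocab[i] for i in np.argsort(-sims)[:k]]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.normal(size=(200, 50))                          # synthetic "English" vectors
    W_true = rng.normal(size=(50, 40))
    tgt = src @ W_true + 0.01 * rng.normal(size=(200, 40))    # synthetic "Spanish" vectors
    W = fit_linear_map(src[:150], tgt[:150])                  # fit on a seed dictionary
    vocab = [f"es_{i}" for i in range(200)]
    print(translate(src[160], W, vocab, tgt, k=3))            # held-out word
```

A non-linear transform would replace `src @ W` with, for example, a small neural network trained on the same pairs; the linear fit gives the number it has to beat.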

Knowledge vector

  • Knowledge vector started
  • Began coding

Character to word

  • Character to word conversion(hold)
  • Prepare the task: word similarity
  • Prepare the dict.
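Assuming the word-similarity task follows the usual benchmark setup (human-rated word pairs, with the model's cosine similarities compared against the human scores by Spearman rank correlation), the evaluation can be sketched in pure Python as below. The data layout and function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def ranks(xs: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else float("nan")


def evaluate(pairs: Sequence[Tuple[str, str, float]],
             vectors: Dict[str, Sequence[float]]) -> Tuple[float, int]:
    """pairs: (word1, word2, human score). Returns (Spearman rho, pairs used);
    pairs with an out-of-vocabulary word are skipped."""
    model, human = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            model.append(cosine(vectors[w1], vectors[w2]))
            human.append(score)
    return spearman(model, human), len(model)
```

Reporting the number of pairs used alongside rho matters here, since a character-to-word conversion step can change which benchmark words are in the vocabulary.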

Translation

  • v5.0 demo released
  • Cut the dict and use the new segmentation tool

QA

Details: [1]

Spell mistake

  • Retrain the n-gram model (caoli)

improve fuzzy match

  • Add synonym similarity using the MERT-4 method (hold)

improve lucene search

  • Use the MERT-4 method to find good weights for multiple features, e.g. IDF, NER, baidu_weight, keyword (liurong, this month)
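Weight tuning of this kind searches the weights of a linear score Σ wᵢ·fᵢ over features such as IDF, NER, baidu_weight and keyword. The sketch below uses a much simpler coordinate-wise grid search (not MERT's exact line search) that maximizes top-1 accuracy on labelled queries; the feature names follow the bullet above, while the data layout, grid and function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Features = Dict[str, float]         # e.g. {"idf": ..., "ner": ..., "baidu_weight": ..., "keyword": ...}
Query = Tuple[List[Features], int]  # (candidate answers, index of the correct one)


def score(features: Features, weights: Dict[str, float]) -> float:
    """Linear combination of feature values; missing features count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def top1_accuracy(queries: Sequence[Query], weights: Dict[str, float]) -> float:
    """Fraction of queries whose highest-scoring candidate is the correct one."""
    hits = 0
    for candidates, gold in queries:
        best = max(range(len(candidates)), key=lambda i: score(candidates[i], weights))
        hits += int(best == gold)
    return hits / len(queries)


def coordinate_search(queries: Sequence[Query], names: Sequence[str],
                      grid: Sequence[float] = (0.0, 0.25, 0.5, 1.0, 2.0),
                      sweeps: int = 3) -> Tuple[Dict[str, float], float]:
    """Tune one weight at a time over a fixed grid, keeping strict improvements."""
    weights = {name: 1.0 for name in names}
    best = top1_accuracy(queries, weights)
    for _ in range(sweeps):
        for name in names:
            for value in grid:
                trial = {**weights, name: value}
                acc = top1_accuracy(queries, trial)
                if acc > best:
                    best, weights = acc, trial
    return weights, best
```

Like MERT, this optimizes the task metric directly rather than a smooth surrogate, but it only tries grid points, so it can miss the exact thresholds where the top-ranked candidate changes.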

Multi-Scene Recognition

  • Hand over to duxk (this week)

XiaoI framework

  • Give a report on the XiaoI framework
  • A new intern will install SEMPRE

patent

  • Improve QA with a GA method (this week)