Difference between revisions of "ASR-howto"
From cslt Wiki
Revision as of 10:33, 26 May 2013 (Sunday)
1. how to build Kaldi on Windows (Cygwin)?
Building Kaldi on Windows with Visual Studio is cumbersome. We therefore strongly recommend building it within Cygwin. The process is simple:
- Install Cygwin and select the following components: a. make b. gcc c. automake d. perl e. python f. clapack g. wget h. gfortran+g77+f77 i. zlib
- Download Kaldi from the CSLT server at /nfs/disk/perm/tool/kaldi
- Install the tools: go to kaldi/tools and run install.sh once all the required components are installed.
- Install the core: go to kaldi/src and run ./configure; make
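The steps above can be sketched as a small shell helper. This is only an illustration under assumptions: the function names `missing_tools` and `build_kaldi` are hypothetical, the argument is wherever you copied the Kaldi tree, and only the components that appear as commands on the PATH are checked (clapack and zlib are libraries and are not covered here).

```shell
#!/bin/sh
# Print the name of every given command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Build the tools and then the core, following the steps listed above.
# $1 is the local copy of the Kaldi tree (hypothetical location).
build_kaldi() {
    kaldi_root=$1
    missing=$(missing_tools make gcc automake perl python wget)
    if [ -n "$missing" ]; then
        echo "missing components:" $missing >&2
        return 1
    fi
    # Run each stage in a subshell so the caller's working directory is kept.
    (cd "$kaldi_root/tools" && ./install.sh) || return 1
    (cd "$kaldi_root/src" && ./configure && make)
}
```

For example, `build_kaldi "$HOME/kaldi"` would stop early with a list of missing components instead of failing halfway through install.sh.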
2. how to create a dictionary
Given a list of words, the lexicon can be built as follows:
- awk '{print $1}' word.list |sort -u |/nfs/disk/work/asr/toolkit/lex/gen_word_lexicon_from_big_lexicon.py
- Check the lexicon manually and remove any incorrect pronunciations.
- Check for words that failed to get a pronunciation and create their entries by hand.
- By default, the command above uses the Tencent 110k lexicon. To produce a dictionary based on another phone system, set the -w argument of gen_word_lexicon_from_big_lexicon.py; to provide an additional background dictionary, set the -e option.
- I set the current word segmentation system (IKAnalyzer3.2.5Stable) to use the Tencent 110k lexicon for consistency with pronunciation generation. If you use a different background dictionary, it is better to replace the lexicon for IKAnalyzer as well. Simply put your dictionary in /nfs/disk/work/asr/toolkit/lex/wordseg/IKAnalyzer3.2.5Stable_src, and then specify it in /nfs/disk/work/asr/toolkit/lex/wordseg/IKAnalyzer3.2.5Stable_src/IKAnalyzer.cfg.xml
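Finding the words that failed to get a pronunciation can be partly automated. The following is a minimal sketch, assuming the generated lexicon has been saved to a file with one entry per line and the word in the first whitespace-separated field; that layout is an assumption, not something confirmed for gen_word_lexicon_from_big_lexicon.py, and the function name `find_missing_words` is hypothetical.

```shell
#!/bin/sh
# Print each word from the word list that has no entry in the lexicon.
# $1: word list (word in the first field), $2: lexicon file (assumed layout:
# word in the first field, pronunciation in the remaining fields).
find_missing_words() {
    word_list=$1
    lexicon=$2
    awk '
        NF == 0 { next }                                  # skip blank lines
        FILENAME == ARGV[1] { seen[$1] = 1; next }        # first file: lexicon
        !($1 in seen) && !reported[$1]++ { print $1 }     # second file: word list
    ' "$lexicon" "$word_list"
}
```

For example, `find_missing_words word.list lexicon.txt > missing.list` would give a list of words whose pronunciations still have to be written by hand.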