Difference between revisions of "LM-release-v0.2.2"
From cslt Wiki
Latest revision as of 06:29, 9 March 2017 (Thursday)
RELEASE TITLE: LM RELEASE
RELEASE VERSION: v0.2.2
RELEASE TYPE: STEP RELEASE
RELEASE LOCATION: /work5/release/weiy/project/myhexin/lm/v0.2.2
RELATED BUGDB:

1. BACKGROUND:
This release is an internal milestone deliverable (STEP RELEASE) of the Tonghuashun (同花顺) speech recognition project, version v0.2.2. Its purpose is to verify whether Tonghuashun's targets are achievable with current technology, to provide a selectable baseline version, and to serve as a reference for summarizing problems and validating performance.

2. TECHNOLOGY SUMMARY:
Starting from two models trained on 49G of financial text and 85G of general-domain text, corpora are added step by step to obtain three groups of language models trained on different corpora, and their results are compared.

3. RELEASE COMPONENT:
LM: LM RELEASE v0.2.1
    LM RELEASE v0.2.2

4. TEST RESULT:
Each cell gives ppl / OOV / WER on the named test set.
================================================================================================================
Corpus  LM                               | myhexin               | 2000ju                | recheck
================================================================================================================
1       lm_fin_3gram_1e-7                | 763.126 / 2787 / 5.99 | 856.218 / 558 / 40.32 | 136.406 / 183 / 15.93
        lm_non_3gram_1e-7                | 1490.42 / 2790 / 7.89 | 579.682 / 557 / 38.61 | 224.078 / 183 / 18.40
        lm_hybrid_3gram_1e-7             | 788.452 / 2785 / 6.23 | 598.666 / 557 / 38.73 | 135.459 / 183 / 15.99
================================================================================================================
2       lm_fin_all_3gram_1e-7            | 571.8 / 2268 / 5.72   | 1150.24 / 508 / 41.35 | 163.308 / 21 / 15.84
        lm_hybrid_all_3gram_1e-7         | 607.18 / 2266 / 5.95  | 686.223 / 507 / 39.18 | 149.625 / 21 / 15.71
        lm_hybrid_all_3gram_1e-7_myhexin | 563.656 / 2266 / 5.75 | 918.887 / 507 / 40.60 | 153.768 / 21 / 15.59
        lm_hybrid_all_3gram_1e-7_2000ju  | 985.705 / 2266 / 6.73 | 602.71 / 507 / 38.61  | 190.716 / 21 / 16.98
        lm_hybrid_all_3gram_1e-7_recheck | 580.343 / 2266 / 5.88 | 738.939 / 507 / 39.48 | 148.432 / 21 / 15.63
        lm_hybrid_all_5gram_1e-9         | 315.16 / 2266 / 4.86  | 476.592 / 507 / 36.66 | 94.8788 / 21 / 14.81
================================================================================================================
3       lm_hybrid_all_100h_3gram_1e-7    | 720.404 / 2266 / 6.31 | 376.68 / 507 / 36.64  | 101.352 / 21 / 14.87
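================================================================================================================

Corpus composition and sources:
=================================================================
Corpus | Composition           | Source
       | Financial   General   |
=================================================================
104G   | 40G         64G       | Text crawled before 2016-11-08
=================================================================
30G    | 9.1G        21G       | Text crawled before 2016-12-23
=================================================================
10G    | 10G         0         | Text provided by Tonghuashun
=================================================================
100h   | 6.2M                  |
=================================================================

Interpolation setup:
===================================================================================================
output                            input1                    input2               weight
===================================================================================================
lm_hybrid_3gram_1e-7              lm_fin_3gram_1e-7         lm_non_3gram_1e-7    0.5
lm_fin_all_3gram_1e-7             lm_fin_3gram_1e-7         lm_ft_3gram_1e-7     0.5
lm_hybrid_all_3gram_1e-7          lm_fin_all_3gram_1e-7     lm_non_3gram_1e-7    0.5
lm_hybrid_all_3gram_1e-7_myhexin  lm_fin_all_3gram_1e-7     lm_non_3gram_1e-7    0.900766
lm_hybrid_all_3gram_1e-7_2000ju   lm_fin_all_3gram_1e-7     lm_non_3gram_1e-7    0.0547782
lm_hybrid_all_3gram_1e-7_recheck  lm_fin_all_3gram_1e-7     lm_non_3gram_1e-7    0.638826
lm_hybrid_all_5gram_1e-9          lm_fin_all_5gram_1e-9     lm_non_5gram_1e-9    0.5
lm_hybrid_all_100h_3gram_1e-7     lm_hybrid_all_3gram_1e-7  lm_100h_3gram_1e-7   0.5
===================================================================================================

Note:
1. Corpus 1 is the 104G corpus + the 30G corpus.
2. Corpus 2 is the 104G corpus + the 30G corpus + the 10G corpus.
3. Corpus 3 is the 104G corpus + the 30G corpus + the 10G corpus + the 100h corpus.
4. myhexin stands for test_myhexin_20161019, a financial-domain test set.
5. 2000ju stands for test_2000ju, a general-domain test set.
6. recheck stands for test_myhexin_finance_recheck, a mixed test set dominated by financial-domain content.
7. lm_ft (an intermediate product, not tested) is trained on the 10G corpus.
8. lm_100h (an intermediate product, not tested) is trained on the 100h text.
9. lm_hybrid_all_3gram_1e-7_myhexin is built by interpolating the two LMs with the weight that gives the best result on the test_myhexin_20161019 test set; lm_hybrid_all_3gram_1e-7_2000ju and lm_hybrid_all_3gram_1e-7_recheck are built the same way on their respective test sets.

Test environment:
AM = /work5/release/project/myhexin/am/v0.1
Vocabulary: vocab = /work5/release/project/myhexin/vocab/vocab.v0.2
Word segmentation dictionary: dict = /nfs/disk/work/users/zhaomy/soft/jieba/jieba-0.38/jieba/dict.txt.myhexin.v0.2
Pronunciation lexicon: lexicon = /work4/singular/public/release/lexicon/lexicon.v0.2
beam = 13

Conclusions:
1. LMs trained on financial-domain data do well on the financial-domain test set and poorly on the general-domain test set; LMs trained on general-domain data show the opposite pattern. The interpolated hybrid LM performs reasonably well on both test sets.
2. For the same LM, the 5-gram version beats the 3-gram version on every test set, but testing it is very resource-intensive, so only one 5-gram group was tested, as a point of comparison.
3. Comparing lm_hybrid_all with lm_hybrid: after adding the 10G of financial data, financial-domain results improve and general-domain results get worse.
4. The LM interpolated with the best_mix weight computed on a given test set performs best on that test set (it has the lowest perplexity there among the corpus-2 3-gram LMs in the TEST RESULT table).
5. Following the principle of prioritizing the financial domain while still covering the general domain, lm_hybrid_all_myhexin is chosen as the final result.
6. OOV depends only on the corpus and the test set; it has little to do with whether the language model is financial-domain or general-domain.

5. RELEASE TEAM:
Author: 魏扬
Contributor: 白子薇
Monitor: 赵梦原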
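
The interpolation weights in Note 9 and Conclusion 4 can be illustrated with a small sketch. The release note does not say which tool produced best_mix, so the code below is only an assumption: it uses the standard EM procedure for fitting the weight of a two-model linear interpolation on held-out text. The function names, the per-token probabilities, and the reading of "weight" as the weight on input1 are all hypothetical and not taken from the release.

```python
import math


def interpolate(p1, p2, lam):
    """Per-token linear interpolation: lam * p1 + (1 - lam) * p2."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(p1, p2)]


def perplexity(probs):
    """Perplexity is exp of the average negative log-probability."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


def best_mix(p1, p2, lam=0.5, tol=1e-6, max_iter=1000):
    """EM estimate of the weight on model 1 (assumed meaning of 'weight')."""
    for _ in range(max_iter):
        # E-step: posterior probability that model 1 produced each token.
        post = [lam * a / (lam * a + (1.0 - lam) * b) for a, b in zip(p1, p2)]
        # M-step: the new weight is the mean posterior.
        new_lam = sum(post) / len(post)
        if abs(new_lam - lam) < tol:
            return new_lam
        lam = new_lam
    return lam


# Hypothetical probabilities that two LMs assign to the same five
# held-out tokens (illustrative numbers only, not from the release).
p_fin = [0.020, 0.001, 0.030, 0.0005, 0.010]
p_non = [0.002, 0.004, 0.001, 0.0030, 0.002]

lam = best_mix(p_fin, p_non)
print("estimated weight on model 1:", lam)
print("ppl at 0.5:", perplexity(interpolate(p_fin, p_non, 0.5)))
print("ppl at estimated weight:", perplexity(interpolate(p_fin, p_non, lam)))
```

Because each EM step cannot lower the held-out likelihood, the tuned weight gives a perplexity no higher than the starting 0.5 mix on the text it was fitted to. The same weight is not guaranteed to help on a different test set, which is consistent with Conclusion 4.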