"M2asr-2018-04-20-decision": difference between revisions
From cslt Wiki
Latest revision as of 01:50, 21 April 2018 (Saturday)
Technical guidance
- http://cslt.riit.tsinghua.edu.cn/mediawiki/images/9/9d/ASR-Technologies-in-M2ASR-Project.pdf
Progress
- Data resource
  - Uyghur: 250h seed speech data ready, 10k sentences for morpheme learning ready (XJU)
  - Kazak: 300h seed speech data ready, 5k sentences for morpheme learning ready (XJU)
  - Kirgiz: no speech data yet; 500k text sentences collected (XJU)
  - Tibetan: seed speech data of 42 people; lexicon with 50k words; 50M text + 40M new blog data collected (NMU)
  - Mongolian: lexicon with 30k words; 50M text collected; text sentences for seed speech dataset recording under preparation (NMU)
- Technical progress
  - Multilingual decoding is done. Performance is good, and better than the single-language systems. Uyghur and Kazak are easily confused with each other. (THU)
  - Zero-resource ASR is under way: structure & knowledge transfer + learning with unlabelled data (THU)
Problems
- Resource collection
  - Seed data for Kirgiz and Mongolian should be collected quickly. This should be done before August 1st.
  - Body data should be collected as soon as possible. Shiying will release a recording app and a checking platform for the collection. This should be done before Just 1st.
- Resource centralization
  - A key problem is that the resources have not been well managed. We should put all light resources (lexicon, transcription, recipe, tools) on GitHub, and keep heavy resources (speech data, text data) on disk but accessible by URL. All the resources should be indexed from the wiki.
- State-of-the-art recipe
  - The research has not been put on a unified baseline. We should set up baseline systems for the 5 languages, so that individual research has a good reference.
  - We also need to put the multilingual ASR system on GitHub, so that everyone can start their research from the state of the art.
  - Tang Zhiyuan will be responsible for the above task, and Shiying will be the main researcher (to be done before June 1st).