AP17:OLR-special session

Title

Multilingual speech and language processing for minority languages

Organizers

Dong Wang: Tsinghua University (wangdong99@mails.tsinghua.edu.cn)

Dr. Dong Wang received his PhD from the University of Edinburgh and has worked at Oracle, IBM, and Nuance. He is now an assistant professor at the Center for Speech and Language Technologies (CSLT) at Tsinghua University. Dr. Wang’s research interests cover speech processing, language processing, and financial processing. He has published more than 80 academic papers in related areas and has received three best paper awards. Dr. Wang plays active roles in the speech research community: he serves as the secretary of the National Conference on Man-Machine Speech Communication (NCMMSC) and as a country representative of mainland China in Oriental COCOSDA. He was the local chair of ChinaSIP 2013, special session co-chair of ISCSLP 14, and plenary talk co-chair of ISCSLP 16. Dr. Wang is now serving as the vice chair of the SLA track of APSIPA.

Guanyu Li: Northwest University for Nationalities (guanyu-li@163.com)

Dr. Guanyu Li received his PhD from the Northwest University for Nationalities, Gansu Province, China. He worked at several ERP software development companies as a development engineer, and is now an associate professor at the Northwest University for Nationalities and the Key Laboratory of National Language Intelligent Processing, Gansu Province. His research interests include speech processing for minority languages in China, especially speech recognition and speech synthesis. In recent years, he has published more than ten papers in related areas.

Mijit Ablimit: Xinjiang University (mijit@xju.edu.cn)

Dr. Mijit Ablimit received his PhD from Kyoto University, Japan. He is now an associate professor at the Information Technology and Engineering College of Xinjiang University. His research interests cover speech, language, and multilingual information processing for less widely used languages of China.

Target track

Speech and Language processing

Introduction

Minor- and multi-lingual phenomena are important in modern international societies. This special session focuses on minor- and multi-lingual speech and language processing, including but not limited to the following topics:

Minor- and Multi-lingual phonetic and phonological analysis
Minor- and Multi-lingual speech recognition
Minor- and Multi-lingual speaker recognition
Minor- and Multi-lingual speech synthesis
Minor- and Multi-lingual language understanding
Resource construction for minority languages

Potential Papers

Title: AP17-OLR Challenge: Data, Plan, and Baseline

Author: Zhiyuan Tang, Dong Wang, Yixiang Chen, Qing Chen

Abstract: We present the data profile and the evaluation plan of the second oriental language recognition (OLR) challenge, AP17-OLR. Compared to last year's event (AP16-OLR), the new challenge involves more languages and focuses more on short utterances. The data are offered by SpeechOcean and the NSFC M2ASR project. Two types of baselines were constructed to assist the participants, one based on the i-vector model and the other on various neural networks. We report the baseline results evaluated with the metrics defined by the AP17-OLR evaluation plan and demonstrate that the combined database is a reasonable data resource for multilingual research. All the data are free for participants, and the Kaldi recipes for the baselines have been published online.

Title: Memory-augmented Chinese-Uyghur Neural Machine Translation

Author: Shiyue Zhang, Gulnigar Mahmut, Dong Wang, Askar Hamdulla

Abstract: Neural machine translation (NMT) has achieved notable performance recently. However, this approach has not been widely applied to translation between Chinese and Uyghur, partly due to the limited parallel data and the large proportion of rare words caused by the agglutinative nature of Uyghur. In this paper, we collect ~200,000 sentence pairs and show that with this medium-scale database, an attention-based NMT can perform very well on Chinese-Uyghur/Uyghur-Chinese translation. To tackle rare words, we propose a novel memory structure to assist the NMT inference. Our experiments demonstrate that the memory-augmented NMT (M-NMT) outperforms both the vanilla NMT and the phrase-based statistical machine translation (SMT). Interestingly, the memory structure provides an elegant way of dealing with out-of-vocabulary words.

Title: Language Resource Construction for Mongolian

Author: Shipeng Xu, Hongzhi Yu, Thomas Fang Zheng, Jinghao Yan

Abstract: Mongolian is a typical low-resource language. Resources are limited in many respects, from acoustic analysis and phonetic rules to lexicons and speech and text data. This paper describes our recent progress on Mongolian resource construction, supported by the NSFC project.

Title: Free Linguistic and Speech Resources for Tibetan

Author: Guanyu Li, Hongzhi Yu, Jinghao Yan

Abstract: Tibetan is an important low-resource language in China. A key factor that hinders speech and language research for Tibetan is the lack of resources, particularly free ones. This paper describes our recent progress on Tibetan resource construction supported by the NSFC M2ASR project, including the phone set, the lexicon, and the transcription of a large-scale speech corpus. Following the M2ASR free data program, all the resources are publicly available and free for researchers. We also release a small Tibetan speech database that can be used to build a prototype Tibetan speech recognition system.

Title: A Free Kazak Speech Database and a Speech Recognition Baseline

Author: Ying Shi, Askar Hamdulla, Zhiyuan Tang, Dong Wang, Thomas Fang Zheng

Abstract: Automatic speech recognition (ASR) has improved significantly for major languages such as English and Chinese, partly due to the emergence of deep neural networks (DNN) and large amounts of training data. For minority languages, however, progress lags far behind the mainstream. A particular obstacle is that there are almost no large-scale speech databases for minority languages; the few that exist are held by institutes as private property, are far from open or standardized, and are rarely free. Besides speech databases, phonetic and linguistic resources are also scarce, including phone sets, lexicons, and language models.

In this paper, we publish a speech database in Kazak, a major minority language in western China. Accompanying this database, a full set of phonetic and linguistic resources is also published, with which a full-fledged Kazak ASR system can be constructed. We describe the recipe for constructing a baseline system and report our present results. The resources are free for research institutes and can be obtained on request. The publication is supported by the NSFC M2ASR project, which aims to build multilingual ASR systems for minority languages in China.

Title: A Multilingual Language Processing Tool for Uyghur, Kazak and Kirghiz

Author: Mijit Ablimit, Sardar Parhat, Askar Hamdulla, Thomas Fang Zheng

Abstract: Natural language processing for less widely used languages is difficult, partly due to the high variation in their written forms. On the other hand, many minority languages in the same region share similar properties and can be processed in a similar way. This paper presents an integrated multilingual language processing tool. Our aim is to provide an open, free, and standard toolkit for minority language processing tasks, with a uniform user interface that supports multiple languages. The present implementation supports Uyghur, Kazak, and Kirghiz, three major minority languages in western China, and focuses on phonetic and morphological analysis. For the phonetic analysis, we build a multilingual parallel phoneme list, with similar phonemes grouped and character codes standardized. A multilingual syllable analyzer is also developed to detect spelling mistakes and extract irregular spellings. For the morphological analysis, we build a multilingual morpheme segmentation tool that extracts morphemes by statistical analysis. The toolkit is extensible in terms of both functions and languages.