Deal with numbers in LM training
Numbers are hard to handle in any language. The basic problem is that the set of numbers is open, so the contexts in which numbers appear are hard to model. Our approach is to substitute numbers with a single token "NUM". By building a NUM-bearing LM and a separate graph for NUM, and composing these two graphs, we hope to train a robust model.
The first step, therefore, is to substitute numbers with NUM. The following steps are taken:
1. Find all words containing Arabic digits 0-9 and replace them with NUM directly.
2. Find all words containing Chinese digits '零'-'九' and collect them into a number word list L0.
3. Since some of these words are actually not numbers, such as '三纲五常', remove the words in a pre-defined lexicon V from L0, giving L = L0 - V.
4. The pre-defined lexicon V is built from a general lexicon V0 by removing pure numbers, such as '一', '二', '一九一九'.
5. Design the mapping M: L -> NUM.
6. Use M to substitute the numbers in the training text with 'NUM'.
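The steps above can be sketched in Python. This is a minimal illustration, not the actual pipeline: the sample lexicon V0 and the function names are assumptions, and a real system would load V0 from a lexicon file and apply the mapping M during corpus preparation.

```python
import re

# Chinese digit characters (step 2).
CN_DIGITS = '零一二三四五六七八九'

def is_pure_number(word):
    """True if the word consists only of Chinese digits, e.g. '一九一九'."""
    return all(ch in CN_DIGITS for ch in word)

def has_cn_digit(word):
    """True if the word contains at least one Chinese digit."""
    return any(ch in CN_DIGITS for ch in word)

def build_exception_lexicon(V0):
    """Step 4: V = digit-bearing words in the general lexicon V0 that are
    not pure numbers (e.g. '三纲五常'); these must NOT be mapped to NUM."""
    return {w for w in V0 if has_cn_digit(w) and not is_pure_number(w)}

def substitute_numbers(text, V):
    """Steps 1-3 and 5-6: map number words to 'NUM' in whitespace-tokenized text."""
    out = []
    for tok in text.split():
        if re.search(r'[0-9]', tok):
            # Step 1: any token containing an Arabic digit becomes NUM.
            out.append('NUM')
        elif has_cn_digit(tok) and tok not in V:
            # Steps 2-3: Chinese-digit tokens in L = L0 - V become NUM.
            out.append('NUM')
        else:
            out.append(tok)
    return ' '.join(out)

# Toy general lexicon V0 (assumed sample for illustration only).
V0 = {'三纲五常', '一', '二', '一九一九'}
V = build_exception_lexicon(V0)
print(substitute_numbers('价格 是 128 元', V))          # 价格 是 NUM 元
print(substitute_numbers('三纲五常 与 一九一九 年', V))  # 三纲五常 与 NUM 年
```

Note that '三纲五常' survives because it is in V, while the pure number '一九一九' is mapped to NUM.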