LM optimization with annealing for Chinese
There is a problem particular to Chinese when building an LM.
It is well known that a word-based LM outperforms a character-based LM, so we choose a word list, say 20k words. The problem is that the set of Chinese words is open while the set of characters is closed. For words outside the 20k list, simply deleting them from the training data loses information.
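For intuition, here is a minimal sketch (file name and whitespace-based segmentation are assumptions, not a fixed setup) that measures how many tokens a fixed 20k list would throw away if OOV words were simply deleted:

    # Sketch: estimate the OOV rate of a 20k word list over a segmented corpus.
    from collections import Counter

    def oov_rate(corpus_path, vocab_size=20000):
        counts = Counter()
        with open(corpus_path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())   # words are space-separated after segmentation
        vocab = {w for w, _ in counts.most_common(vocab_size)}
        total = sum(counts.values())
        oov = sum(c for w, c in counts.items() if w not in vocab)
        return oov / total                    # fraction of tokens lost by deleting OOV words

    # e.g. print(oov_rate("segmented_corpus.txt"))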
A possible solution is:
1. Segment the text into words, and choose the 20k list by frequency (with some extra tips, e.g., substitute numbers).
2. For words outside the 20k list, split them into sequences of shorter words (or even single characters), and then amend the word frequencies accordingly.
3. Double-check whether the 20k word list has changed. Since words beyond the 20k cutoff usually account for few counts, this should not change things significantly.
4. Use the same splitting rules to split the corresponding words in the training data into short-word sequences.
5. Re-train the model.
A rough sketch of steps 1-4 follows below.
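The sketch below strings these steps together under a few assumptions: the corpus is already word-segmented (space-separated), and OOV words are split by greedy longest match against the current list, which is one reasonable splitting rule among several. Function and file names are illustrative.

    # Sketch of steps 1-4: choose the list, fold OOV counts back, re-check the list,
    # and rewrite the corpus for re-training (step 5).
    from collections import Counter

    def split_oov(word, vocab):
        """Greedy longest-match split of an OOV word into in-vocabulary pieces,
        falling back to single characters (the character set is closed)."""
        pieces, i = [], 0
        while i < len(word):
            for j in range(len(word), i, -1):
                if word[i:j] in vocab or j == i + 1:
                    pieces.append(word[i:j])
                    i = j
                    break
        return pieces

    def rebuild(corpus_in, corpus_out, vocab_size=20000):
        with open(corpus_in, encoding="utf-8") as f:
            lines = [line.split() for line in f]
        counts = Counter()
        for words in lines:
            counts.update(words)

        # Step 1: choose the 20k list by frequency.
        vocab = {w for w, _ in counts.most_common(vocab_size)}

        # Steps 2-3: amend frequencies by folding OOV counts onto their pieces,
        # then re-check the list (it usually changes little).
        amended = Counter()
        for w, c in counts.items():
            for piece in (split_oov(w, vocab) if w not in vocab else [w]):
                amended[piece] += c
        vocab = {w for w, _ in amended.most_common(vocab_size)}

        # Step 4: rewrite the corpus with OOV words split into short-word sequences.
        with open(corpus_out, "w", encoding="utf-8") as out:
            for words in lines:
                out.write(" ".join(p for w in words
                                   for p in (split_oov(w, vocab) if w not in vocab else [w])) + "\n")

The rewritten corpus can then be fed to the usual LM toolkit for re-training.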