13 September 2012 (Thu) 00:38 115.170.223.144

2012-09-13T00:38:17Z

115.170.223.144: Created new page with content " A particular problem of LM is that some words exists only a few times, but the context of these words should not be computed as such. For example, numbers 12537. It ma..."

2012-09-13T00:35:44Z

Created new page with content " A particular problem of LM is that some words exists only a few times, but the context of these words should not be computed as such. For example, numbers 12537. It ma..."

New page

A particular problem of language models (LMs) is that some words occur only a few times in the training text, yet their context should not be estimated from those few occurrences alone. Take a number such as 12537: it may occur in the training text only once, but its context (the context in which numbers appear) is well attested. This motivates the class LM.

In class LMs, words that share the same context are grouped into a class, and the context is estimated after replacing every word of the class with the class token. Within the class, a word may be selected uniformly at random or according to some probability. This idea is somewhat similar to a decision tree (by the way, can we introduce a tree LM?).
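The grouping described above can be sketched as a class-based bigram model, in which P(word | previous word) is approximated by P(class | previous class) × P(word | class). This is only a minimal illustration, not a definitive implementation: the `<NUM>` token, the function names, the toy sentences and the `uniform_size` parameter are all assumptions made for the example, and no smoothing is applied.

```python
import re
from collections import Counter

NUM = "<NUM>"  # hypothetical class token shared by all digit strings


def to_class(word):
    """Map a word to its class token; words outside the class map to themselves."""
    return NUM if re.fullmatch(r"\d+", word) else word


def train(sentences):
    """Count class bigrams, context (history) counts and class-member counts."""
    bigrams, contexts, members = Counter(), Counter(), Counter()
    for sent in sentences:
        classes = ["<s>"] + [to_class(w) for w in sent]
        for w in sent:
            members[(to_class(w), w)] += 1
        for prev, cur in zip(classes, classes[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, members


def class_bigram_prob(prev_word, word, model, uniform_size=None):
    """P(word | prev_word) ~= P(class | prev_class) * P(word | class)."""
    bigrams, contexts, members = model
    pc, c = to_class(prev_word), to_class(word)
    if contexts[pc] == 0:
        return 0.0  # unseen history; a real model would back off or smooth
    p_class = bigrams[(pc, c)] / contexts[pc]
    if c != NUM:
        p_member = 1.0  # a word that is its own class
    elif uniform_size:
        p_member = 1.0 / uniform_size  # members chosen uniformly at random
    else:
        total = sum(n for (cls, _), n in members.items() if cls == c)
        p_member = members[(c, word)] / total if total else 0.0
    return p_class * p_member


# Toy data: each number occurs once, but the number context occurs twice.
sentences = [["room", "12537", "is", "free"], ["room", "42", "is", "taken"]]
model = train(sentences)
print(class_bigram_prob("99", "is", model))  # "99" was never seen, yet the history is known
print(class_bigram_prob("room", "7", model, uniform_size=100000))
```

With the uniform option, an unseen number such as "7" still receives a non-zero probability after "room", whereas the frequency-based option gives it zero because it relies on observed members only.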

The members of a class should (1) share the same linguistic context and (2) be numerous enough, possibly infinitely many (an open class), that token-based context estimation is unreliable.

There are at least two such classes: numbers and named entities. Numbers are relatively simple, while named entities are not trivial. An interesting research direction is to apply NLP approaches to find named entities first, and then group them into one or a few classes, if these can be obtained, for example address, name, city...
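As a rough sketch of the preprocessing this implies, the snippet below replaces numbers and named entities with class tokens before the LM is trained. It is an illustration under stated assumptions: the small `GAZETTEER` dictionary is a hypothetical stand-in for the output of a real named-entity recognizer, and the class labels `<NUM>`, `<CITY>` and `<NAME>` are invented for the example.

```python
import re

# Hypothetical gazetteer standing in for a real named-entity recognizer.
GAZETTEER = {"beijing": "<CITY>", "london": "<CITY>", "alice": "<NAME>"}


def tag_classes(tokens, gazetteer=GAZETTEER):
    """Replace numbers and known entities with class tokens.

    Returns the class sequence (used to estimate context) and the
    (class, word) pairs (used to estimate word selection within a class).
    """
    tagged, members = [], []
    for tok in tokens:
        if re.fullmatch(r"\d+(\.\d+)?", tok):
            cls = "<NUM>"  # integers and simple decimals share one class
        else:
            cls = gazetteer.get(tok.lower(), tok)  # unknown words stay as-is
        tagged.append(cls)
        if cls != tok:
            members.append((cls, tok))
    return tagged, members


tagged, members = tag_classes("Alice flew to Beijing with 12537 yuan".split())
print(tagged)   # ['<NAME>', 'flew', 'to', '<CITY>', 'with', '<NUM>', 'yuan']
print(members)  # [('<NAME>', 'Alice'), ('<CITY>', 'Beijing'), ('<NUM>', '12537')]
```

Numbers need only a pattern match, which is why they are the simple case; the quality of the entity classes depends entirely on whatever recognizer replaces the toy dictionary.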

← Previous revision		Revision as of 00:38, 13 September 2012 (Thu)
		+
		+	This also motivates two further ideas:
		+
		+	1. Can we use higher-level knowledge, such as parsing, to improve ASR? Some people have done this using shallow parsing, but we may want a stochastic way to integrate such knowledge: FST? CRF? Re-scoring lattices?
		+
		+	2. Can we use real-world knowledge, e.g., the semantic web, to increase ASR accuracy? Suppose we give just the phone sequence of a named entity...

NLP based class LM - Revision history
