Difference between revisions of "Document classification test"
From cslt Wiki
Revision as of 06:28, 9 September 2014 (Tuesday)
Problem And Solve
Document classification of Sogou data
- DATA
- Data from Sogou Labs [1], using SogouC.reduced (30M)
- 9 classes: Finance (财经), IT, Health (健康), Sports (体育), Travel (旅游), Education (教育), Recruitment (招聘), Culture (文化), Military (军事)
- Train and test: train(), test(), dev()
- Text preprocessing
- Segment words using a word list of 9W (90,000) entries (Tencent)
- Remove stop words. stop_wordlist is
- Some Tools
- weka
- scw
- Google word2vec
- LDA
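- Class map
- C000007: Automobile (汽车)
- C000008: Finance (财经)
- C000010: IT
- C000013: Health (健康)
- C000014: Sports (体育)
- C000016: Travel (旅游)
- C000020: Education (教育)
- C000022: Recruitment (招聘)
- C000023: Culture (文化)
- C000024: Military (军事)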
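The text preprocessing step above (word-list segmentation followed by stop-word removal) could be sketched as follows. This is a minimal illustration, not the procedure used on this page: forward maximum matching is an assumed segmentation strategy, and `segment`, `remove_stop_words`, the tiny `wordlist`, and `stop_words` are hypothetical stand-ins for the Tencent word list and the stop-word list.

```python
def segment(text, wordlist, max_len=4):
    """Forward maximum matching: at each position, take the longest
    substring (up to max_len characters) that appears in wordlist.
    A single character is always accepted as a fallback."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in wordlist:
                tokens.append(piece)
                i += size
                break
    return tokens


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


# Hypothetical toy word list and stop-word list (assumptions for illustration).
wordlist = {"文档", "分类", "测试"}
stop_words = {"的"}

tokens = remove_stop_words(segment("文档的分类测试", wordlist), stop_words)
print(tokens)
```

Sets are used for both lists so that each membership check is constant time, which matters once the word list holds tens of thousands of entries.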
VSM Test
- Data
- dimension:9402
- Method
- Document representation: each word is weighted by its tf-idf value
- Classifier: Naive Bayes
- Result
Metric | Finance | IT | Health | Sports | Travel | Education | Recruitment | Culture | Military | sum (mean) |
---|---|---|---|---|---|---|---|---|---|---|
ACC-test | 0.72139 | 0.72139 | 0.75124 | 0.82089 | 0.79602 | 0.61194 | 0.70647 | 0.64179 | 0.79104 | 0.72913 |
ACC-train | 0.678 | 0.718 | 0.708 | 0.708 | 0.73 |
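As a rough sketch of the VSM method above, the following pure-Python example builds tf-idf vectors and trains a multinomial Naive Bayes classifier on them. It assumes an idf of log(N/df) + 1 and Laplace smoothing with alpha = 1. The function names and the toy documents are hypothetical, and this is not the code that produced the results on this page.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Compute idf = log(N / df) + 1 for every word in the training docs."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    return {w: math.log(n / df[w]) + 1.0 for w in df}


def vectorize(doc, idf):
    """Map a tokenized document to {word: tf * idf}; unknown words are skipped."""
    counts = Counter(w for w in doc if w in idf)
    return {w: c / len(doc) * idf[w] for w, c in counts.items()}


def train_nb(vectors, labels, alpha=1.0):
    """Multinomial Naive Bayes over tf-idf weights with Laplace smoothing."""
    class_docs = Counter(labels)
    weight = defaultdict(Counter)
    for v, y in zip(vectors, labels):
        weight[y].update(v)  # accumulate per-class word weights
    vocab = {w for v in vectors for w in v}
    model = {}
    for y in class_docs:
        total = sum(weight[y].values()) + alpha * len(vocab)
        log_prior = math.log(class_docs[y] / len(labels))
        log_like = {w: math.log((weight[y][w] + alpha) / total) for w in vocab}
        model[y] = (log_prior, log_like)
    return model


def predict(model, vector):
    """Return the class with the highest log posterior score."""
    def score(y):
        log_prior, log_like = model[y]
        return log_prior + sum(x * log_like[w]
                               for w, x in vector.items() if w in log_like)
    return max(model, key=score)


# Hypothetical toy corpus (already segmented, stop words removed).
train_docs = [["stock", "market", "rise"], ["bank", "stock", "rate"],
              ["team", "match", "goal"], ["match", "champion", "team"]]
train_labels = ["finance", "finance", "sports", "sports"]

idf = fit_idf(train_docs)
model = train_nb([vectorize(d, idf) for d in train_docs], train_labels)
print(predict(model, vectorize(["stock", "rate", "market"], idf)))
print(predict(model, vectorize(["team", "champion"], idf)))
```

The idf table is fitted on the training documents only and reused for test documents, so no information from the test set leaks into the weights.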
LDA Test
Word2vec Test
- Word2vec result
Dimension | Finance | IT | Health | Sports | Travel | Education | Recruitment | Culture | Military | sum (mean) |
---|---|---|---|---|---|---|---|---|---|---|
10 | 0.72139 | 0.72139 | 0.75124 | 0.82089 | 0.79602 | 0.61194 | 0.70647 | 0.64179 | 0.79104 | 0.72913 |
20 | 0.766169154 | 0.383084577 | 0.52238806 | 0.820895522 | 0.666666667 | 0.44278607 | 0.567164179 | 0.721393035 | 0.850746269 | 0.637921504 |
30 |
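The last column of the table above appears to be the unweighted mean of the nine per-class accuracies rather than a literal sum. The short check below recomputes it for the dimension-20 row; the English class names and the helper `macro_accuracy` are labels introduced here for illustration.

```python
from statistics import mean

classes = ["Finance", "IT", "Health", "Sports", "Travel",
           "Education", "Recruitment", "Culture", "Military"]

# Per-class accuracies copied from the dimension-20 row of the table.
acc_dim20 = [0.766169154, 0.383084577, 0.52238806, 0.820895522, 0.666666667,
             0.44278607, 0.567164179, 0.721393035, 0.850746269]


def macro_accuracy(per_class):
    """Unweighted mean of the per-class accuracies."""
    return mean(per_class)


overall = macro_accuracy(acc_dim20)
best = classes[acc_dim20.index(max(acc_dim20))]
worst = classes[acc_dim20.index(min(acc_dim20))]

print(f"mean accuracy: {overall:.6f}")
print(f"best class: {best}, worst class: {worst}")
```

The recomputed mean agrees with the reported 0.637921504 to within rounding, and the per-class breakdown shows that IT is the weakest class at this dimension.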