Difference between revisions of “Document classification test”
From cslt Wiki
Revision as of 02:15, 9 September 2014 (Tuesday)
Problem And Solve
Document classification of Sogou data
- DATA
- Data from Sogou Lab [1], using SogouC.reduced (30M)
- 9 classes: Finance (财经), IT, Health (健康), Sports (体育), Travel (旅游), Education (教育), Recruitment (招聘), Culture (文化), Military (军事)
- train and test: train(), test(), dev()
- Text preprocessing
- Segment words using a word list of about 90,000 entries (Tencent)
- Remove stop words. stop_wordlist is
- Some Tools
- weka
- scw
- google word2vec
- LDA
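The text preprocessing steps listed above (dictionary-based word segmentation, then stop-word removal) can be sketched as follows. This is only a minimal illustration: the page does not say which segmentation algorithm was used, so forward maximum matching is an assumption here, and `WORDLIST`, `STOP_WORDS`, `segment`, and `remove_stop_words` are hypothetical names with toy contents, not the Tencent word list or the actual stop_wordlist.

```python
def segment(text, wordlist, max_len=4):
    """Forward maximum matching (assumed algorithm): at each position,
    take the longest word-list entry, falling back to one character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in wordlist:
                tokens.append(piece)
                i += size
                break
    return tokens


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in stop_words]


# Toy stand-ins for the real word list and stop-word list (hypothetical).
WORDLIST = {"体育", "新闻", "报道", "比赛"}
STOP_WORDS = {"的", "了"}

tokens = segment("体育新闻报道了比赛的", WORDLIST)
clean = remove_stop_words(tokens, STOP_WORDS)
print(tokens)
print(clean)
```

With a real word list, `max_len` would be set to the length of the longest entry so that no dictionary word is missed.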
VSM Test
- Data
- dimension:9402
- Method
- document representation: each word is weighted by its tf-idf value
- classifier: Naive Bayes
- Result
Training Set | Finance (财经) | IT | Health (健康) | Sports (体育) | Travel (旅游) | Education (教育) | Recruitment (招聘) | Culture (文化) | Military (军事) |
---|---|---|---|---|---|---|---|---|---|
TFIDF | 0.678 | 0.718 | 0.708 | 0.708 | 0.73 |
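The method used in the VSM test (tf-idf document vectors fed to a Naive Bayes classifier) can be sketched in plain Python as below. This is a minimal sketch under stated assumptions, not the configuration behind the numbers above: the idf formula `log(N / df)`, the multinomial Naive Bayes variant with add-one smoothing, and all function names and toy documents are illustrative choices that the page does not specify.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """idf(w) = log(N / df(w)) over tokenized training documents (assumed formula)."""
    df = Counter(w for doc in docs for w in set(doc))
    n = len(docs)
    return {w: math.log(n / c) for w, c in df.items()}


def tfidf(doc, idf):
    """Sparse tf-idf vector; words unseen in training are dropped."""
    tf = Counter(w for w in doc if w in idf)
    return {w: c * idf[w] for w, c in tf.items()}


def train_nb(vectors, labels, vocab, alpha=1.0):
    """Multinomial Naive Bayes over tf-idf weights with add-alpha smoothing."""
    class_docs = Counter(labels)
    weight = defaultdict(lambda: defaultdict(float))
    for vec, y in zip(vectors, labels):
        for w, v in vec.items():
            weight[y][w] += v
    model = {}
    for y, n_docs in class_docs.items():
        total = sum(weight[y].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(labels))
        log_like = {w: math.log((weight[y].get(w, 0.0) + alpha) / total)
                    for w in vocab}
        model[y] = (log_prior, log_like)
    return model


def predict(model, vec):
    """Return the class with the highest log posterior score."""
    def score(y):
        log_prior, log_like = model[y]
        return log_prior + sum(v * log_like[w] for w, v in vec.items())
    return max(model, key=score)


# Toy, already-segmented training documents (hypothetical data).
train_docs = [["股票", "市场", "上涨"], ["银行", "股票", "利率"],
              ["比赛", "球队", "冠军"], ["球队", "比赛", "进球"]]
train_labels = ["财经", "财经", "体育", "体育"]

idf = fit_idf(train_docs)
model = train_nb([tfidf(d, idf) for d in train_docs], train_labels, idf.keys())
print(predict(model, tfidf(["股票", "利率", "上涨"], idf)))
print(predict(model, tfidf(["球队", "冠军"], idf)))
```

On the real data the vocabulary would be the 9402 dimensions noted above, and the nine class labels would replace the two toy labels.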