"DataBase": Difference between revisions
From cslt Wiki
Latest revision as of 06:24, 26 February 2014 (Wednesday)
lm
name | size | dir | description |
---|---|---|---|
SogouQ.full.train.3gram.gz | 132M | /work/lxs/nlphome/lm/SogouQ-500M | trainData=SogouQ(800M);dict=11w-tencent |
SogouT-11w-merge2-1.3gram.gz | 4.1G | /work/lxs/nlphome/lm/SogouT-140G | trainData=SogouT(140G);dict=11w-tencent |
SogouT-11w-merge2-2.3gram.gz | 3.9G | /work/lxs/nlphome/lm/SogouT-140G | |
8w8.3gram.tencent.gz | 452M | /work/lxs/nlphome/lm/Tencent | |
musicQuery-ltc.3gram.gz | 28M | /work/lxs/nlphome/lm/TencentQ/musicQuery | use qa15w-singer-songs.wordlist |
TencentQ.3gram.gz | 1.4G | /work/lxs/nlphome/lm/TencentQ/qa15w | use qa15w.lexicion |
mix-corp1-corp2.3gram.gz | 1.3G | /work/lxs/nlphome/lm/TencentQ/qa15w-nosinger-song | use qa15w-nosinger-song.wordlist |
mix-corp1_0.5-corp2_0.5.3gram.gz | 1.4G | /work/lxs/nlphome/lm/TencentQ/qa15w-singer-song | use qa15w-singer-song.wordlist |
11w_merge6_kn.3gram.gz | 4.3G | /work/lxs/nlphome/lm/TencentQA-100G | trainData=qa(100G),dict=11w-tencent |
8w8_new_merge6_kn.3gram0.gz | 4.5G | /work/lxs/nlphome/lm/TencentQA-100G | trainData=qa(100G),dict=8w8-tencent |
Hunhe_zhongzi_and_add_and_PPL_5yuan_3e9.lm.utf8.1e-5.3gram.gz | 1.4M | /work/lxs/nlphome/lm/jietong | |
Hunhe_zhongzi_and_add_and_PPL_5yuan_3e9.lm.utf8.1e-9.5gram.gz | 389M | /work/lxs/nlphome/lm/jietong | |
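
The `.3gram.gz` and `.5gram.gz` names suggest gzip-compressed n-gram language models. As a rough sketch, assuming these files use the ARPA text format (a `\data\` header with `ngram N=count` lines, which this page does not confirm), the n-gram counts could be read from the header without decompressing the whole file. The function name `read_arpa_counts` is hypothetical.

```python
import gzip


def read_arpa_counts(path):
    """Return {order: count} from the header of a gzipped ARPA LM.

    Assumes the ARPA text format; stops at the first n-gram section,
    so only the header is decompressed.
    """
    counts = {}
    in_header = False
    # errors="replace" keeps the scan going if the body is not UTF-8.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            if line == "\\data\\":
                in_header = True
            elif in_header and line.startswith("ngram "):
                order, count = line[len("ngram "):].split("=")
                counts[int(order)] = int(count)
            elif in_header and line.startswith("\\"):
                break  # reached a section such as "\1-grams:"
    return counts
```

For example, `read_arpa_counts("/work/lxs/nlphome/lm/Tencent/8w8.3gram.tencent.gz")` would return a dictionary keyed by n-gram order, provided the format assumption holds for that file.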
lexicion wordlist
name | size | dir | description |
---|---|---|---|
singer.lexicion | 2060 | /work/lxs/nlphome/dict/lex-wordlist/music/lr | |
singer.low.lexicion | 2060 | /work/lxs/nlphome/dict/lex-wordlist/music/lr | |
singer.pinyin | 2104 | /work/lxs/nlphome/dict/lex-wordlist/music/lr | |
song.lexicion | 4639 | /work/lxs/nlphome/dict/lex-wordlist/music/lr | |
song.low.lexicion | 4639 | /work/lxs/nlphome/dict/lex-wordlist/music/lr | |
song.pinyin | 4644 | /work/lxs/nlphome/dict/lex-wordlist/music/lr | |
qa15w-ch-sinovoice.lexicion | 92469 | /work/lxs/nlphome/dict/lex-wordlist/qa-check | |
qa15w-ch.pinyin | 92469 | /work/lxs/nlphome/dict/lex-wordlist/qa-check | |
qa15w.lexicion | 158404 | /work/lxs/nlphome/dict/lex-wordlist/qa-check | |
11w.lexicion | 122172 | /work/lxs/nlphome/dict/lex-wordlist/tencent | |
8w8.lexicion | 90795 | /work/lxs/nlphome/dict/lex-wordlist/tencent | |
nolexicion wordlist
name | size | dir | description |
---|---|---|---|
singer.wordlist | 2060 | /work/lxs/nlphome/dict/nolex-wordlist/music/lr | |
song.wordlist | 4639 | /work/lxs/nlphome/dict/nolex-wordlist/music/lr | |
album.txt | 11736 | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc | |
area.txt | 4 | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc | |
chart.txt | 28 | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc | |
drama.txt | 517 | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc | |
language.txt | 35 | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc | |
singer.txt | 4456 | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc | |
stopwords.txt | 894 | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc | |
song.txt | 26153 | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc | |
style.txt | 562 | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc | |
type.txt | 3 | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc | |
entity.txt | 36198 | /work/lxs/nlphome/dict/nolex-wordlist/music/ltc | merge album area chart drama language singer song stopwords style type |
qa15w.wordlist | 147996 | /work/lxs/nlphome/dict/nolex-wordlist/qa-check | |
11w.wordlist | 111895 | /work/lxs/nlphome/dict/nolex-wordlist/tencent | |
8w8.wordlist | 88055 | /work/lxs/nlphome/dict/nolex-wordlist/tencent | |
scws20w-utf8.wordlist | 284646 | /work/lxs/nlphome/dict/nolex-wordlist | |
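
The entity.txt row says it merges ten of the lists in music/ltc. Their sizes sum to 44388 while entity.txt is listed at 36198, which would fit duplicates being dropped if the size column is a line count; this page states neither point. A minimal sketch under those assumptions (one entry per line, UTF-8, first occurrence kept) follows. `merge_wordlists` is a hypothetical helper, not the script that produced the file.

```python
from pathlib import Path

# The ten source lists named in the entity.txt description.
SOURCES = ["album", "area", "chart", "drama", "language",
           "singer", "song", "stopwords", "style", "type"]


def merge_wordlists(src_dir, out_path, names=SOURCES):
    """Merge <name>.txt files into one list, dropping blanks and duplicates."""
    seen = set()
    merged = []
    for name in names:
        text = (Path(src_dir) / f"{name}.txt").read_text(encoding="utf-8")
        for line in text.splitlines():
            word = line.strip()
            # Keep only the first occurrence of each entry (assumed behavior).
            if word and word not in seen:
                seen.add(word)
                merged.append(word)
    Path(out_path).write_text("\n".join(merged) + "\n", encoding="utf-8")
    return len(merged)
```

The return value is the number of entries written, which can be compared against the size column as a sanity check.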
lenvxx
Path: /nfs/corpus/data/corpora/lenvxx
Description: the data is stored in /nfs/corpus/data/corpora/lenvxx/data/text/nlpcorpus/nlp_corpus
(This directory contains 4 subdirectories: ChinaDivision, dict, dict4VOD, document Resource.)
- 1. Directory: /nfs/corpus/data/corpora/lenvxx/data/text/nlpcorpus/nlp_corpus/dict
  - 1. Subdirectory: sogou-dict
    - 城市信息 (city information): city and place names by province, some local dialect terms, and bus station and street names for some cities.
    - 电子游戏 (video games)
      - 单机游戏 (single-player games): names of games from 2001 to 2011, plus wordlists for some individual games.
      - 网游 (online games): names of online games from 2008 to 2011, plus wordlists for some individual games.
    - 工程与应用科学 (engineering and applied science): specialized vocabulary wordlists for engineering fields.
      - 计算机 (computing): specialized computing vocabulary, plus Alibaba product vocabulary across many fields.
    - 农林鱼畜 (agriculture, forestry, fishery, and livestock): wordlists about livestock and agriculture.
    - 人文科学 (humanities)
      - 文学 (literature): wordlists about ancient Chinese literature and classic works, plus wordlists for some novels.
      - 语言 (language): wordlists of idioms, folklore, and internet buzzwords.
      - 哲学 (philosophy): wordlists about philosophy, for instance Hegel and Marxism.
      - 宗教 (religion): wordlists about Taoism, Buddhism, and Islam.
      - 历史 (history): wordlists about Chinese history, Japan's Warring States period, and diplomacy.
      - 其他 (other): a wordlist about ancient Chinese numerology.
    - 社会科学 (social sciences)
      - 法律 (law): wordlists about law.
      - 教育 (education): wordlists about the buildings of some universities, wordlists about textbooks, and lists of Chinese universities and famous American universities.
      - 金融 (finance): wordlists about finance.
      - 军事 (military): wordlists about military topics.
      - 政治 (politics): wordlists about Party and government offices, politics, and the official institutions of ancient China.
      - 其他 (other): wordlists about public relations, ethics, and anthropology.
    - 生活 (daily life): wordlists covering many areas of daily life.
    - 医学 (medicine): wordlists about medical science.
    - 艺术 (arts)
      - 书法篆刻 (calligraphy and seal carving): wordlists about sculpture and calligraphy.
      - 舞蹈 (dance): wordlists about dance and rhythmic gymnastics.
      - 戏剧 (drama): wordlists about drama.
      - 音乐 (music): wordlists about Chinese and Western music.
      - 其他 (other): wordlists about tea, sculpture, er ren zhuan, world heritage, and artists.
    - 娱乐 (entertainment)
      - 电影电视 (film and television): wordlists about science fiction films.
      - 动漫 (animation and comics): wordlists about some cartoons.
      - 流行音乐 (pop music): wordlists about the novel series A Song of Ice and Fire and about fashionable words and phrases.
      - 明星 (celebrities): wordlists about some famous people.
      - 汽车 (cars): wordlists about the automotive field.
      - 收藏 (collecting): wordlists about advertising.
      - 时尚品牌 (fashion brands): the directory is empty.
    - 运动休闲 (sports and leisure)
      - F1赛车 (F1 racing): the directory is empty.
      - 奥运 (Olympics): wordlists about the Olympics.
      - 垂钓 (fishing): wordlists about fishing.
      - 轮滑 (roller skating): a wordlist about roller skating.
      - 棋牌 (board and card games): wordlists about mahjong, Go, Chinese chess, and san guo sha.
      - 气功 (qigong): wordlists about qigong.
      - 球类 (ball games): wordlists about football, basketball, table tennis, golf, and badminton.
      - 杀人游戏 (the Mafia party game): the directory is empty.
      - 跆拳道 (taekwondo): wordlists about taekwondo.
      - 太极拳 (tai chi): wordlists about ba gua and tai ji quan.
      - 武术 (martial arts): wordlists about wu shu.
      - 自行车 (cycling): the directory is empty.
      - 其他 (other): wordlists about fencing, judo, wrestling, and yoga.
    - 自然科学 (natural sciences)
      - 化学 (chemistry): wordlists about chemistry.
      - 生物 (biology): wordlists about biology.
      - 数学 (mathematics): wordlists about mathematics.
      - 天文学 (astronomy): wordlists about astronomy.
      - 物理 (physics): wordlists about physics.
      - 其他 (other): wordlists about stones.
  - 2. Subdirectory: movie (many movie-related wordlists)
    - 电影 (movies): movie wordlists for mainland China, Hong Kong and Taiwan, Europe and America, and Asia.
    - 明星 (stars): movie star wordlists for mainland China, Hong Kong and Taiwan, Europe and America, and Asia.
  - 3. Subdirectory: movie-dict (wordlists of actors, directors, movie names, roles, and styles)
  - 4. Subdirectory: name (wordlists of famous people from mainland China, Hong Kong and Taiwan, Europe and America, and Asia)
  - 5. Subdirectory: NER (wordlists of English, Japanese, Korean, and Russian personal names)
  - 6. Subdirectory: Pinyin (a wordlist of duo yin zi, that is, characters with more than one reading)
  - 7. Subdirectory: VOD
    - 电视剧 (TV series): a wordlist of TV series.
    - 电影 (movies): a wordlist of movies.
    - 微电影 (micro films): a wordlist of micro films.
    - 音乐 (music): wordlists of famous songs from mainland China, Hong Kong and Taiwan, Europe and America, and Japan and South Korea.
    - 综艺 (variety shows): a wordlist of variety shows.
  - 8. Subdirectory: 领域术语 (domain terminology; wordlists about computing, economics, travel, sports, and medicine)
  - 9. Subdirectory: 语言学词库 (linguistic lexicon)
    - 基础名词 (basic nouns): nouns for persons, abstract nouns, nature, man-made things, and fashion.
    - 语言学词汇类别 (linguistic word categories): all grammatical vocabulary.
- 2. Directory: /nfs/corpus/data/corpora/lenvxx/data/text/nlpcorpus/nlp_corpus/dict4VOD
  - This directory contains wordlists of movie distribution companies, film awards, film festivals, and actors' names, plus a Chinese-English comparison table.
- 3. Directory: /nfs/corpus/data/corpora/lenvxx/data/text/nlpcorpus/nlp_corpus/ChinaDivision
  - This directory contains 4 wordlists, one for each of 4 levels (province names, city names, region names, street names).