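Affiliations: [1] Library of Nanjing University of Finance and Economics, Nanjing 210046; [2] College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095
Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), 2011, No. 6, pp. 618-625 (8 pages)
Funding: National Social Science Fund of China, Key Project "Research on Intelligent Technologies for the Collation and Development of Cultural Classics" (No. 08ATQ002); Ministry of Education Humanities and Social Sciences Fund project "Research on Automatic Word Segmentation and Index Compilation for Agricultural Ancient Books" (No. 08JA870006)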
Abstract: This study segments ancient Chinese texts by combining three methods: segmentation markers, a segmentation dictionary, and N-gram statistics. The segmentation results are then filtered by substring comparison, neighboring-word filtering, high-frequency word filtering, and low-frequency word filtering. Tests were run on two corpora: 12 ancient agricultural books and 379 titles of the Local Chronicles of Guangdong: Products (《广东方志物产》). From the 12 agricultural books, 1,164 known words (about 31% of the total vocabulary) and 2,530 unregistered (out-of-vocabulary) words (69%) were identified. From the 379 Local Chronicles of Guangdong: Products titles, 6,314 known words (8% of the total vocabulary) and 75,438 unregistered words (92%) were identified. Words occurring more than 10 times number 8,044 (10% of all words), and words occurring more than 20 times number 3,760 (5% of the total vocabulary). Analysis of the segmentation results for the 379 titles shows that when the frequency rank lies in the interval (2000, 8000), the product of rank and frequency is roughly constant at 23,000,000. This result indicates that Zipf's law also holds for ancient Chinese texts.
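The rank-times-frequency check mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `rank_frequency_products` and the toy token list are assumptions made for illustration only, and the input is taken to be an already segmented list of words.

```python
from collections import Counter


def rank_frequency_products(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first.

    Hypothetical helper: under Zipf's law the last column stays roughly
    constant over a range of ranks.
    """
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically so the
    # ranking is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ordered, start=1)
    ]


# Toy data (assumed, not taken from the paper's corpora).
tokens = ["rice"] * 6 + ["tea"] * 3 + ["silk"] * 2
for row in rank_frequency_products(tokens):
    print(row)
```

On a real corpus one would inspect the product column only within a chosen rank interval, as the abstract does for ranks between 2000 and 8000.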