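Affiliations: [1] Department of Linguistics, Central China Normal University, Wuhan 430079; [2] School of Computer Science, Wuhan University, Wuhan 430072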
Source: Computer Science (《计算机科学》), 2014, No. 4, pp. 263-268 (6 pages)
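Funding: Supported by the Humanities and Social Sciences Research Project of the Ministry of Education, "Research on Chinese Web Page Semantic Retrieval Technology Integrating Logical Reasoning and Word-Sense Matching" (10YJA740120), and the Humanities and Social Sciences Research Project of the Hubei Provincial Department of Education, "Research on Chinese Web Page Retrieval Methods Based on Semantic Understanding" (2010b032).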
Abstract: Counting N-gram character strings in very large Chinese text corpora is a challenge for Chinese information processing. Cici, a Chinese text statistics program, efficiently counts and retrieves N-gram string frequencies for such corpora. Statistics over Chinese corpora of different sizes show that the number of distinct N-gram Chinese character strings is largest when N is 6, and that the number of N-gram strings in a corpus can be estimated accurately from the average length and the number of its "sentences". Because most character strings occur no more than 10 times in a corpus, frequency information is stored and sorted in segments: strings with a frequency of at most 10 are stored separately, while strings with a frequency above 10 are sorted and stored segment by segment. The strings are thus written to 13 separate files according to their frequency, and only the frequently used strings are sorted, which speeds up counting dramatically. Because physical memory is limited, a very large corpus should first be divided into blocks of about 20 MB; each block is counted separately and the per-block results are then merged. With this algorithm, very large corpora can be counted efficiently on a personal computer.
Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]
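
The block-wise counting and frequency-based partitioning described in the abstract can be illustrated with a minimal Python sketch. It is not Cici's implementation, which the record does not describe in detail. The function names, the treatment of a "sentence" as a maximal run of CJK ideographs, the line-based block splitting, and the in-memory `Counter` merge are all assumptions made for illustration; the actual layout of the 13 files is not given in the abstract and is not reproduced here.

```python
import re
from collections import Counter

# Assumption: a "sentence" is a maximal run of CJK Unified Ideographs,
# so punctuation, Latin text and line breaks all end a run.
HAN_RUN = re.compile(r"[\u4e00-\u9fff]+")


def count_ngrams(text, n):
    """Count every n-character string inside each run of Chinese characters."""
    counts = Counter()
    for run in HAN_RUN.findall(text):
        for i in range(len(run) - n + 1):
            counts[run[i:i + n]] += 1
    return counts


def iter_blocks(lines, max_chars):
    """Group whole lines into blocks of roughly max_chars characters.

    Blocks are cut only at line boundaries. Because a line break already
    ends a run, no n-gram is lost or split by the blocking.
    """
    block, size = [], 0
    for line in lines:
        if block and size + len(line) > max_chars:
            yield "\n".join(block)
            block, size = [], 0
        block.append(line)
        size += len(line)
    if block:
        yield "\n".join(block)


def count_corpus(lines, n, max_chars):
    """Count each block separately, then merge the per-block results."""
    total = Counter()
    for block in iter_blocks(lines, max_chars):
        total.update(count_ngrams(block, n))  # Counter.update adds counts
    return total


def partition_by_frequency(counts, threshold=10):
    """Split strings into unsorted low-frequency buckets and a sorted rest.

    Strings with frequency <= threshold go into one bucket per frequency
    and are left unsorted; only the strings above the threshold are sorted
    (by descending frequency, then by string).
    """
    low = {freq: [] for freq in range(1, threshold + 1)}
    high = []
    for gram, freq in counts.items():
        if freq <= threshold:
            low[freq].append(gram)
        else:
            high.append((gram, freq))
    high.sort(key=lambda item: (-item[1], item[0]))
    return low, high
```

For a corpus held as a list of lines, `count_corpus(lines, n=6, max_chars=20_000_000)` would approximate the suggested 20 MB blocking (in characters rather than bytes), and `partition_by_frequency` then separates the many rare strings from the few that need sorting. A tool working at real scale would presumably write per-block results to disk and merge them there instead of keeping one in-memory counter.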