N-gram Chinese Characters Counting for Huge Text Corpora (Cited by: 3)


Authors: 余一骄 [1], 刘芹 [2]

Affiliations: [1] Department of Linguistics, Central China Normal University, Wuhan 430079; [2] School of Computer Science, Wuhan University, Wuhan 430072

Source: Computer Science, 2014, Issue 4, pp. 263-268 (6 pages)

Funding: Ministry of Education Humanities and Social Sciences Research Project "Chinese Web Page Semantic Retrieval Combining Logical Reasoning and Word Sense Matching" (10YJA740120); Hubei Provincial Department of Education Humanities and Social Sciences Research Project "Chinese Web Page Retrieval Methods Based on Semantic Understanding" (2010b032)

Abstract: Counting N-gram string frequencies over very large Chinese text corpora is a challenge for Chinese information processing; the tool Cici was developed to perform such counting and retrieval efficiently. Experiments on corpora of different sizes show that the number of distinct N-gram Chinese character strings peaks at N = 6, and the total number of N-gram strings in a corpus can be estimated accurately from the average sentence length and the number of sentences. Since most character strings occur fewer than 10 times in a corpus, frequency information is stored and sorted in segments: strings with frequency no greater than 10 are stored separately (in 13 files according to frequency), and only the higher-frequency strings are segment-sorted, which speeds up counting dramatically. Because physical memory is limited, a large corpus should first be split into blocks of roughly 20 MB; each block is counted separately, and the per-block results are then merged. The algorithm counts very large corpora efficiently on a personal computer.
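The block-then-merge counting and the frequency-segmented storage described in the abstract can be sketched as follows. This is a minimal illustration, not the actual Cici implementation: the function names (`count_ngrams`, `merge_block_counts`, `segment_by_frequency`) are assumptions, sentence splitting is simplified to a few common delimiters, and the paper's 20 MB disk blocks are simulated by passing smaller text chunks.

```python
from collections import Counter

def count_ngrams(text, n):
    """Count all n-character substrings (n-grams) in one text block.

    Sentences are split on common Chinese end punctuation so that
    n-grams do not cross sentence boundaries (simplified assumption).
    """
    counts = Counter()
    for sentence in text.replace("！", "。").replace("？", "。").split("。"):
        for i in range(len(sentence) - n + 1):
            counts[sentence[i:i + n]] += 1
    return counts

def merge_block_counts(block_counts):
    """Merge per-block counters, as the paper merges per-block statistics."""
    total = Counter()
    for c in block_counts:
        total.update(c)
    return total

def segment_by_frequency(counts, threshold=10):
    """Segmented storage strategy from the abstract: strings at or below
    the threshold are kept unsorted, and only the higher-frequency
    strings are sorted (here, descending by frequency)."""
    low = {s: f for s, f in counts.items() if f <= threshold}
    high = sorted(((s, f) for s, f in counts.items() if f > threshold),
                  key=lambda kv: -kv[1])
    return low, high
```

In a full implementation each block's counter would be spilled to disk and the merge done file-by-file, since the whole corpus cannot fit in memory; the in-memory `Counter` merge above only mirrors the logical structure of that step.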

Keywords: Chinese characters; N-gram; corpus; sorting

Classification: TP391.1 (Automation and Computer Technology: Computer Application Technology)

 
