A MapReduce Parallel Method for Generating Word-Class Co-occurrence Frequencies (cited by: 1)

Parallel Implementation for Co-occurrence Statistics with MapReduce Model

Source: Journal of Chongqing University of Technology (Natural Science), 2013, No. 11, pp. 53-57, 64 (6 pages)

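Funding: National Natural Science Foundation of China (61171141); Guangdong Province Industry-University-Research Provincial-Ministerial Cooperation Special Fund (2012B091100448)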

Abstract: Corpora are used ever more widely in natural language processing (NLP), and computing word-class co-occurrence frequencies is one of the research topics in this area. Based on the computational characteristics of word-class co-occurrence, this article presents parallel methods built on the MapReduce programming model, namely the pairs and stripes methods [1]. Although the stripes approach clearly outperforms the pairs approach, it runs into memory overflow when the vocabulary is very large. To address this defect, the article proposes partitioning the vocabulary: the input vocabulary is split into parts, and this step can itself be carried out as a MapReduce preprocessing job. Experimental results show that MapReduce parallelism improves the efficiency and performance of word-class co-occurrence frequency statistics over massive corpora.
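The abstract names the pairs and stripes methods and a vocabulary-splitting fix without giving details, so the following is only a minimal in-memory sketch of those ideas, not the authors' Hadoop implementation. It assumes that two tokens co-occur when they appear in the same sentence, and it assumes that the vocabulary split works by running one stripes pass per vocabulary block so that no stripe grows beyond the block size. All function names (`pairs_map`, `stripes_map`, `split_vocabulary`, `run_partitioned_stripes`, and so on) are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def pairs_map(tokens):
    """Pairs mapper: emit ((w, u), 1) for every ordered pair of positions."""
    for w, u in permutations(tokens, 2):
        yield (w, u), 1


def stripes_map(tokens, allowed=None):
    """Stripes mapper: emit (w, {u: count}), one associative array per position.

    If `allowed` is given, only neighbours inside that vocabulary block are
    kept, which bounds the size of every stripe (assumed splitting scheme).
    """
    for i, w in enumerate(tokens):
        neighbours = tokens[:i] + tokens[i + 1:]
        if allowed is not None:
            neighbours = [u for u in neighbours if u in allowed]
        if neighbours:
            yield w, Counter(neighbours)


def shuffle(emitted):
    """Group emitted values by key, standing in for the MapReduce shuffle."""
    groups = defaultdict(list)
    for key, value in emitted:
        groups[key].append(value)
    return groups


def run_pairs(corpus):
    """Pairs job: the reducer sums the 1s emitted for each (w, u) key."""
    emitted = (kv for sentence in corpus for kv in pairs_map(sentence))
    return {key: sum(values) for key, values in shuffle(emitted).items()}


def run_stripes(corpus, allowed=None):
    """Stripes job: the reducer merges all stripes of a word element-wise."""
    emitted = (kv for sentence in corpus
               for kv in stripes_map(sentence, allowed))
    result = {}
    for w, stripes in shuffle(emitted).items():
        total = Counter()
        for stripe in stripes:
            total.update(stripe)  # Counter.update adds counts
        result[w] = total
    return result


def split_vocabulary(corpus, k):
    """Split the sorted vocabulary into k blocks (round-robin, an assumption)."""
    vocab = sorted({w for sentence in corpus for w in sentence})
    return [set(vocab[i::k]) for i in range(k)]


def run_partitioned_stripes(corpus, k):
    """Run one stripes job per vocabulary block and merge the partial results."""
    merged = defaultdict(Counter)
    for block in split_vocabulary(corpus, k):
        for w, stripe in run_stripes(corpus, allowed=block).items():
            merged[w].update(stripe)
    return dict(merged)


def flatten(stripes):
    """Convert {w: {u: c}} into {(w, u): c} for comparison with pairs output."""
    return {(w, u): c for w, stripe in stripes.items()
            for u, c in stripe.items()}
```

In this sketch the pairs mapper emits one key-value record per co-occurrence, while the stripes mapper emits one record per token position, which is the usual explanation for the lower shuffle volume of stripes; the price is that each stripe must fit in memory, which is the problem the vocabulary split is meant to bound. Calling `flatten(run_partitioned_stripes(corpus, k))` on a small corpus should give the same counts as `run_pairs(corpus)`.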

Keywords: corpus; word-class co-occurrence frequency; natural language processing; MapReduce; Hadoop

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
