检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
作 者:龚永罡[1] 田润琳 廉小亲[1] 夏天[2] Gong Yonggang;Tian Runlin;Lian Xiaoqin;Xia Tian(School of Computer and Information Engineering,Beijing Technology and Business University,Beijing 100024,China;School of Information Resource Management,Renmin University of China,Beijing 100872,China)
机构地区:[1]北京工商大学计算机与信息工程学院,北京100024 [2]中国人民大学信息资源管理学院,北京100872
出 处:《电子技术应用》2019年第5期70-73,77,共5页Application of Electronic Technique
基 金:国家重点研发计划项目(2017YFC0820100)
摘 要:大规模语料库的训练是使用三元N-gram算法进行中文文本自动查错中一个重要的基础工作。面对新媒体平台每日高达百万篇需处理的语料信息,单一节点的三元N-gram语言模型词库的构建存在计算瓶颈。在深入研究三元N-gram算法的基础上,提出了基于MapReduce计算模型的三元N-gram并行化算法的思想。MapReduce计算模型中,将运算任务平均分配到m个节点,三元N-gram算法在Map函数部分的主要任务是计算局部字词分别与其前两个字词搭配出现的次数,Reduce函数部分的主要任务是合并Map部分统计字词搭配出现的次数,生成全局统计结果。实验结果表明,运行在Hadoop集群上的基于MapReduce的三元N-gram并行化算法具有很好的运算性和可扩展性,对于每日120亿字的训练语料数据集,集群环境下该算法得到训练结果的速率更接近于线性。The training of large-scale corpora is an important basic work for the automatic detection of Chinese texts using the trigram N-gram algorithm. Faced with up to one million pieces of data to be processed by the new media platform per day, there is a computational bottleneck in the construction of a single-node trigram N-gram language model lexicon. Based on the deep research of the trigram N-gram algorithm, the idea of trigram N-gram parallelization algorithm based on MapReduce programming model is proposed. In the MapReduce programming model, the arithmetic tasks are evenly distributed to m nodes. The main task of the trigram N-gram algorithm in the Map function part is to calculate the number of times the local words are matched with the first two words, while the main part of the Reduce function, its task is to merge the number of occurrences of the statistical word matching in the Map part to generate global statistical results. The experimental results show that the MapReduce-based trigram N-gram parallelization algorithm running on Hadoop clusters has good performance and scalability. For a 12 billion word-per-day training corpus data set, the algorithm is obtained in a cluster environment. The rate of training results is more linear.
关 键 词:中文文本查错 三元N-gram算法 MapReduce计算模型 并行化算法 HADOOP集群 语料库
分 类 号:TP301.6[自动化与计算机技术—计算机系统结构]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:216.73.216.171