基于N-gram的双向匹配中文分词方法 (cited by: 12)

Bi-Direction Matching Chinese Word Segmentation Based on N-gram Statistical Model

Authors: FENG Li-zhou (凤丽洲); YANG Gui-jun (杨贵军); XU Xue (徐雪); XU Yu-hui (徐玉慧)

Affiliations: [1] School of Statistics, Tianjin University of Finance and Economics, Tianjin 300222, China; [2] School of Science, Tianjin University of Commerce, Tianjin 300134, China; [3] China United Network Communication Group Co., Ltd., Qingdao Branch, Qingdao 266000, Shandong, China

Source: Journal of Applied Statistics and Management (《数理统计与管理》), 2020, No. 4, pp. 633-643 (11 pages)

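Funding: National Social Science Fund of China, Youth Project (18CTJ008); Natural Science Foundation of Tianjin, Youth Project (18JCQNJC69600); National Natural Science Foundation of China, General Program (11471239); National Statistical Science Research Program, Key Projects (2017LZ25, 2017LZ05); National Statistical Science Research Program, General Project (2018LY50); Tianjin Social Science Planning Key Project (TJTJ19-001).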

Abstract: Basic words better express the essential information contained in Chinese text and are better suited to subsequent text mining. Motivated by this, the paper proposes a bi-directional matching Chinese word segmentation method based on an N-gram model. By fully mining the word-frequency information of the training corpus, it gives an iterative segmentation method for compound words that resolves the ambiguous segmentation of long words in maximum-matching segmentation, and it selects the optimal segmentation sequence with an N-gram language model. In addition, because the precision P is strongly affected by word length and is therefore not robust, the paper brings the total number of characters in correctly segmented words into the characterization of segmentation performance and proposes a new evaluation metric, Pn, which effectively avoids the influence of word length on the evaluation of segmentation precision. Experiments on two corpora from the international Chinese natural language processing bakeoff organized by SIGHAN show that, compared with the traditional N-gram Chinese word segmentation method, the proposed method effectively improves precision P, recall R, Pn, and F1 while maintaining segmentation efficiency.
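The general idea of bi-directional matching with N-gram selection can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy corpus, the bigram order, the add-one smoothing, and the `max_len` window are all assumptions made for illustration, and the paper's iterative splitting of compound words and its Pn metric are not reproduced here. The sketch runs forward and backward maximum matching and, when the two disagree, keeps the candidate with the higher bigram log-probability.

```python
import math
from collections import Counter

def forward_max_match(text, vocab, max_len):
    """Greedy left-to-right longest-match segmentation."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # single characters are a fallback
                words.append(piece)
                i += size
                break
    return words

def backward_max_match(text, vocab, max_len):
    """Greedy right-to-left longest-match segmentation."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]

def train(sentences):
    """Count unigrams and bigrams over pre-segmented training sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        prev = "<s>"
        unigrams[prev] += 1
        for word in sentence:
            unigrams[word] += 1
            bigrams[(prev, word)] += 1
            prev = word
    return unigrams, bigrams

def bigram_log_prob(words, unigrams, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability (smoothing is an assumption)."""
    score, prev = 0.0, "<s>"
    for word in words:
        score += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
        prev = word
    return score

def segment(text, vocab, unigrams, bigrams, max_len=4):
    """Return the matching result that the bigram model scores higher."""
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)
    if fwd == bwd:
        return fwd
    size = len(vocab) + 1  # +1 for the sentence-start symbol
    return max((fwd, bwd), key=lambda ws: bigram_log_prob(ws, unigrams, bigrams, size))

# Hypothetical toy training corpus, invented for this example only.
corpus = [
    ["研究", "生命", "的", "起源"],
    ["研究生", "的", "生活"],
    ["生命", "的", "起源"],
]
vocab = {word for sentence in corpus for word in sentence}
unigrams, bigrams = train(corpus)
result = segment("研究生命的起源", vocab, unigrams, bigrams)
```

On this toy input, forward matching greedily takes the long word 研究生 and leaves a stray 命, backward matching recovers 研究 / 生命, and the bigram score decides between the two candidates. A real system would train the counts on a large segmented corpus and would likely use a stronger smoothing scheme.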

Keywords: N-gram model; word segmentation ambiguity; evaluation metric; bi-directional matching

Classification: O212 [Science: Probability Theory and Mathematical Statistics]
