Authors: 凤丽洲 (FENG Li-zhou); 杨贵军 (YANG Gui-jun); 徐雪 (XU Xue); 徐玉慧 (XU Yu-hui)
Affiliations: [1] School of Statistics, Tianjin University of Finance and Economics, Tianjin 300222, China; [2] School of Science, Tianjin University of Commerce, Tianjin 300134, China; [3] China United Network Communication Group Co., Ltd., Qingdao Branch, Qingdao, Shandong 266000, China
Source: Journal of Applied Statistics and Management (《数理统计与管理》), 2020, No. 4, pp. 633-643 (11 pages)
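Funding: National Social Science Fund of China, Youth Project (18CTJ008); Tianjin Natural Science Foundation, Youth Project (18JCQNJC69600); National Natural Science Foundation of China, General Program (11471239); National Statistical Science Research Program, Key Projects (2017LZ25, 2017LZ05); National Statistical Science Research Program, General Project (2018LY50); Tianjin Social Science Planning Key Project (TJTJ19-001).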
Abstract: Basic words convey the essential information in Chinese text more directly and are better suited to downstream text mining. Motivated by this, the paper proposes a bidirectional matching Chinese word segmentation method based on an N-gram model. The method exploits word frequency information from the training corpus to define an iterative segmentation procedure for compound words, which addresses the ambiguous segmentation of long words in maximum matching, and it uses an N-gram language model to select the optimal segmentation sequence. In addition, because the precision P is strongly affected by word length and is therefore not robust, the paper brings the total number of characters in correctly segmented words into the evaluation and proposes a new metric, Pn, which avoids the influence of word length on the assessment of segmentation precision. Experiments on two corpora from the international Chinese natural language processing bakeoff organized by SIGHAN show that, compared with the traditional N-gram Chinese word segmentation method, the proposed method improves precision P, recall R, Pn, and F1 while maintaining segmentation efficiency.
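The general idea of bidirectional matching with N-gram selection can be illustrated with a minimal sketch. It is not the authors' implementation: it omits the iterative segmentation of compound words and the Pn metric, neither of which the abstract specifies in detail. It assumes a bigram model with add-one smoothing, and the function names, the `max_len` limit, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter

def forward_max_match(text, vocab, max_len=4):
    """Scan left to right, taking the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # single characters are a fallback
                words.append(piece)
                i += size
                break
    return words

def backward_max_match(text, vocab, max_len=4):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]

def count_ngrams(sentences):
    """Count unigrams and bigrams over pre-segmented training sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        prev = "<s>"
        unigrams[prev] += 1
        for w in sent:
            unigrams[w] += 1
            bigrams[(prev, w)] += 1
            prev = w
    return unigrams, bigrams

def bigram_log_prob(words, unigrams, bigrams, vocab_size):
    """Log-probability of a word sequence under an add-one smoothed bigram model."""
    prev, total = "<s>", 0.0
    for w in words:
        total += math.log((bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size))
        prev = w
    return total

def segment(text, vocab, unigrams, bigrams):
    """Return whichever of the two candidate segmentations the bigram model prefers."""
    candidates = [forward_max_match(text, vocab), backward_max_match(text, vocab)]
    vocab_size = len(vocab) + 1  # +1 for the sentence-start symbol
    return max(candidates,
               key=lambda ws: bigram_log_prob(ws, unigrams, bigrams, vocab_size))

# Toy training corpus (hypothetical), already segmented.
corpus = [["研究", "生命", "起源"], ["研究", "生命"], ["研究生", "学习"]]
vocab = {w for sent in corpus for w in sent}
unigrams, bigrams = count_ngrams(corpus)
print(segment("研究生命起源", vocab, unigrams, bigrams))
```

On this toy input, forward matching yields 研究生 / 命 / 起源 and backward matching yields 研究 / 生命 / 起源. The bigram score prefers the latter because the pairs (研究, 生命) and (生命, 起源) both occur in the training data.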
Classification: O212 [Science: Probability Theory and Mathematical Statistics]