Authors: YANG Huanhuan (杨欢欢); ZHAO Shuliang (赵书良) [1,2,3]; LI Wenbin (李文斌); WU Yongliang (武永亮); TIAN Guoqiang (田国强)
Affiliations: [1] College of Computer and Cyber Security, Hebei Normal University, Shijiazhuang 050024, China; [2] Hebei Provincial Engineering Research Center for Supply Chain Big Data Analytics & Data Security, Hebei Normal University, Shijiazhuang 050024, China; [3] Hebei Provincial Key Laboratory of Network & Information Security, Hebei Normal University, Shijiazhuang 050024, China; [4] College of Information Engineering, Hebei GEO University, Shijiazhuang 050031, China; [5] School of Mathematical Sciences, Hebei Normal University, Shijiazhuang 050024, China
Source: Journal of Data Acquisition and Processing (《数据采集与处理》), 2020, No. 3, pp. 458-473 (16 pages)
Funding: Supported by Major Projects of the National Social Science Fund of China (13&ZD091, 18ZDA200).
Abstract: Quality Phrase mining is the process of extracting meaningful phrases from a text corpus, and it underlies tasks such as document summarization and information retrieval. Existing unsupervised phrase mining methods, however, produce low-quality candidate phrases and assign equal weight to every Quality Phrase feature. This paper proposes a Quality Phrase mining method based on statistical features. The method combines frequent N-Gram mining, compositionality constraints on multi-word phrases, and spell checking of single-word phrases to ensure the quality of candidate phrases. A public knowledge base is used to attach category labels to the candidate phrases, which makes it possible to assign weights to the Quality Phrase features, and a penalty factor that accounts for the mutual influence between features adjusts the weight ratio. Candidate phrases are then ranked by the score of a feature weighting function, and the Quality Phrases are extracted from the ranking. Experimental results show that the proposed method significantly improves the precision of phrase mining: compared with the best existing unsupervised phrase mining method, precision, recall, and F1-Score improve by 5.97%, 1.77%, and 4.02%, respectively.
Keywords: text mining; Quality Phrase; statistical features; candidate phrases; feature weighting
Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
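
Illustrative sketch (not from the paper): the abstract names frequent N-Gram mining as the first step of candidate generation and a feature weighting function as the ranking step. The Python below shows one simple way those two ideas might look. The function names, the whitespace tokenizer, the `min_support` threshold, and the feature names and weights are all assumptions made for illustration; the record does not give the paper's actual features, weights, or penalty factor, so none of them are reproduced here.

```python
from collections import Counter


def frequent_ngrams(docs, max_n=3, min_support=2):
    """Count every n-gram (1 <= n <= max_n) across the documents and
    keep those occurring at least `min_support` times.

    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_support}


def weighted_score(features, weights):
    """Weighted sum of feature values.

    `features` and `weights` map hypothetical feature names to numbers;
    they stand in for whatever statistical features a real system uses.
    """
    return sum(weights[name] * value for name, value in features.items())


docs = [
    "data mining is fun",
    "data mining is useful",
    "text mining",
]
candidates = frequent_ngrams(docs, max_n=3, min_support=2)

# Rank candidates with two toy features: relative frequency and length.
max_count = max(candidates.values())
weights = {"frequency": 0.6, "length": 0.4}  # hypothetical weights
ranked = sorted(
    candidates,
    key=lambda g: weighted_score(
        {"frequency": candidates[g] / max_count, "length": len(g) / 3},
        weights,
    ),
    reverse=True,
)
for gram in ranked:
    print(" ".join(gram), candidates[gram])
```

In this toy corpus, n-grams that appear only once (such as "text mining") are filtered out by the support threshold before any scoring happens, which is the role frequent N-Gram mining plays in keeping the candidate set small.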