Quality Phrase Mining Method Based on Statistic Features  Cited by: 4


Authors: YANG Huanhuan; ZHAO Shuliang [1,2,3]; LI Wenbin; WU Yongliang; TIAN Guoqiang

Affiliations: [1] College of Computer and Cyber Security, Hebei Normal University, Shijiazhuang 050024, China; [2] Hebei Provincial Engineering Research Center for Supply Chain Big Data Analytics & Data Security, Hebei Normal University, Shijiazhuang 050024, China; [3] Key Laboratory of Network & Information Security, Hebei Normal University, Shijiazhuang 050024, China; [4] College of Information Engineering, Hebei GEO University, Shijiazhuang 050031, China; [5] School of Mathematical Sciences, Hebei Normal University, Shijiazhuang 050024, China

Source: Journal of Data Acquisition and Processing (《数据采集与处理》), 2020, No. 3, pp. 458-473 (16 pages)

Funding: Supported by Major Projects of the National Social Science Fund of China (13&ZD091, 18ZDA200).

Abstract: Quality Phrase mining is the process of extracting meaningful phrases from a text corpus, and it underlies tasks such as document summarization and information retrieval. Existing unsupervised phrase mining methods, however, produce low-quality candidate phrases and distribute the feature weights of Quality Phrases evenly. This paper proposes a Quality Phrase mining method based on statistical features. The method combines frequent N-Gram mining, compositionality constraints on multi-word phrases, and spell checking of single-word phrases to ensure the quality of candidate phrases. A public knowledge base is introduced to attach category labels to the candidate phrases, which makes it possible to assign feature weights for Quality Phrases, and a penalty factor adjusts the weight ratio to account for the mutual influence between features. Candidate phrases are then ranked by the score of a feature weighting function, and Quality Phrases are extracted from the ranking. Experimental results show that the method significantly improves the precision of phrase mining. Compared with the best unsupervised phrase mining method, precision, recall, and F1-Score improve by 5.97%, 1.77%, and 4.02%, respectively.
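As a rough illustration of two steps named in the abstract (frequent N-Gram mining, then ranking candidates by a weighted feature score), a minimal Python sketch might look like the following. It is not the authors' implementation: the function names, the two features (`frequency`, `length`), the weight values, and the `min_support` threshold are all assumptions chosen for the example, and the compositionality constraints, spell checking, knowledge-base labels, and penalty factor described in the abstract are left out.

```python
from collections import Counter


def frequent_ngrams(docs, max_n=3, min_support=2):
    """Count contiguous word n-grams (n = 1..max_n) and keep those
    that occur at least min_support times across the corpus."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_support}


def weighted_score(features, weights):
    """Weighted sum of feature values (weights here are hypothetical)."""
    return sum(weights[name] * value for name, value in features.items())


docs = [
    "support vector machine is a classifier",
    "a support vector machine separates classes",
    "the machine learns",
]
max_n = 3
candidates = frequent_ngrams(docs, max_n=max_n, min_support=2)

# Hypothetical features: normalized frequency and normalized phrase length.
weights = {"frequency": 0.6, "length": 0.4}
max_count = max(candidates.values())
ranked = sorted(
    (
        (gram, weighted_score(
            {"frequency": count / max_count, "length": len(gram) / max_n},
            weights,
        ))
        for gram, count in candidates.items()
    ),
    key=lambda item: item[1],
    reverse=True,
)

for gram, score in ranked:
    print(" ".join(gram), round(score, 3))
```

In this toy corpus the trigram "support vector machine" ranks first because the assumed length feature rewards longer phrases; with different weights the ordering would change, which is the kind of sensitivity that the weight assignment described in the abstract is meant to address.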

Keywords: text mining; Quality Phrase; statistical features; candidate phrases; feature weighting

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
