Research on TFIDF Keyword Extraction Algorithm Based on Multiple Factors (融合多因素的TFIDF关键词提取算法研究)

Cited by: 26

Authors: NIU Yong-jie (牛永洁) [1]; TIAN Cheng-long (田成龙) (School of Mathematics & Computer, Yan'an University, Yan'an 716000, China)

Affiliation: [1] School of Mathematics & Computer, Yan'an University (延安大学数学与计算机学院)

Source: Computer Technology and Development (《计算机技术与发展》), 2019, No. 7, pp. 80-83 (4 pages)

Funding: National Social Science Fund of China (18BTQ042); National Undergraduate Innovation and Entrepreneurship Training Program (201710719024)

Abstract: To extract keywords from text more accurately and quickly, the text is first cleaned to remove noisy data and then segmented into words. After stop words are removed, five factors are considered together: word position, part of speech, word relevance, word length, and word span. These factors are combined with the classic TFIDF keyword extraction algorithm, each with its own weight, to produce a final weight for every word, and the five words with the highest weights are taken as the keywords of the text. Using 8045 news articles from 《红色中华》 (Red China) provided by the university library as source data, the proposed algorithm, the classic TFIDF algorithm, and expert annotation are compared on three metrics: precision, recall, and F1. The proposed algorithm outperforms the classic TFIDF algorithm on all three metrics and comes close to expert annotation.
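The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives neither the factor formulas nor the weight values, so every constant (`POS_BONUS`, `DEFAULT_POS_BONUS`, `TITLE_BONUS`), every function name, and the multiplicative way of combining factors are assumptions made for illustration. The sketch treats "word position" as presence in the title, expects text that has already been cleaned, segmented, and stripped of stop words, and leaves out the word relevance factor entirely.

```python
import math
from collections import Counter

# Hypothetical factor weights; the abstract does not give the actual values.
POS_BONUS = {"n": 1.0, "v": 0.6}  # assumed part-of-speech weights (noun, verb)
DEFAULT_POS_BONUS = 0.3           # assumed weight for any other part of speech
TITLE_BONUS = 2.0                 # assumed boost for words that occur in the title


def idf(term, corpus):
    """Smoothed inverse document frequency over a list of token lists."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log((1 + len(corpus)) / (1 + df)) + 1.0


def span(term, tokens):
    """Fraction of the document between a term's first and last occurrence."""
    positions = [i for i, t in enumerate(tokens) if t == term]
    return (positions[-1] - positions[0] + 1) / len(tokens)


def extract_keywords(tokens, pos_tags, title_tokens, corpus, k=5):
    """Rank words by TF-IDF scaled by illustrative factor weights; return top k."""
    counts = Counter(tokens)
    max_len = max(len(t) for t in counts)
    scores = {}
    for term, count in counts.items():
        tfidf = (count / len(tokens)) * idf(term, corpus)
        position = TITLE_BONUS if term in title_tokens else 1.0
        pos = POS_BONUS.get(pos_tags.get(term), DEFAULT_POS_BONUS)
        length = len(term) / max_len
        scores[term] = (
            tfidf * position * pos * (1 + length) * (1 + span(term, tokens))
        )
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


def precision_recall_f1(predicted, expert):
    """Compare extracted keywords against expert-annotated keywords."""
    hits = len(set(predicted) & set(expert))
    p = hits / len(predicted) if predicted else 0.0
    r = hits / len(expert) if expert else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

A call such as `extract_keywords(tokens, pos_tags, title_tokens, corpus)` returns up to five words, matching the top-5 selection in the abstract, and `precision_recall_f1` scores that list against an expert-annotated list. Multiplying the factors is only one possible design; a weighted sum would fit the abstract's description equally well.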

Keywords: TFIDF algorithm; word position; part of speech; word relevance; word length; word span

Classification number: TP301.6 [Automation and Computer Technology: Computer System Architecture]

 
