Research on News Keyword Extraction Based on the TF-IDF-MP Algorithm  Cited by: 8


Authors: Cao Yiqin (曹义亲) [1]; Sheng Wuping (盛武平); Zhou Huixiang (周会祥)

Affiliation: [1] School of Software, East China Jiaotong University, Nanchang 330013, Jiangxi, China

Source: Journal of East China Jiaotong University (《华东交通大学学报》), 2021, No. 1, pp. 122-130 (9 pages)

Funding: National Natural Science Foundation of China (61967006).

Abstract: The TF-IDF algorithm uses term frequency and inverse document frequency to judge the importance of words in an article, but it discriminates poorly between categories. To improve classification, the TF-IDF-MP algorithm is proposed. First, the documents in the corpus are annotated by paragraph, and the jieba word segmentation tool is used to segment the text and tag parts of speech. Next, the number of times a feature word occurs in a single document is compared with the average number of times that word occurs across all documents in the corpus, and an improved Sigmoid function is used to adjust the feature word's weight. At the same time, different position weights are assigned according to the importance of the paragraph position in the document. After the feature words are ranked by weight, a Naive Bayes classifier is used to classify the documents. Experimental results show that, when applied to news classification, TF-IDF-MP improves precision, recall, and F1 score over TF-IDF and related improved algorithms.

Keywords: text classification; keyword extraction; TF-IDF; term-frequency averaging; position weighting

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
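
The weighting scheme summarized in the abstract can be illustrated with a minimal Python sketch. This record does not give the paper's formulas, so everything specific here is an assumption made for illustration: the logistic form in `sigmoid_adjust`, the smoothed IDF, the position weights `(1.5, 1.0, 1.2)` for first, middle, and last paragraphs, and the function names themselves. The input is assumed to be already tokenized (in practice the tokens and part-of-speech tags would come from jieba), and the Naive Bayes classification step is not shown.

```python
import math
from collections import Counter


def sigmoid_adjust(count, mean_count):
    """Assumed adjustment: a plain logistic of (count - corpus mean).

    Returns 0.5 when a word occurs exactly as often as its corpus average,
    and approaches 1 as it occurs more often than average.
    """
    return 1.0 / (1.0 + math.exp(-(count - mean_count)))


def tfidf_mp_scores(corpus, position_weights=(1.5, 1.0, 1.2)):
    """Score words per document.

    corpus: list of documents; a document is a list of paragraphs;
            a paragraph is a list of tokens.
    position_weights: assumed (first, middle, last) paragraph weights.
    """
    first_w, middle_w, last_w = position_weights
    n_docs = len(corpus)
    doc_counts = [Counter(tok for para in doc for tok in para) for doc in corpus]

    doc_freq = Counter()   # number of documents containing each word
    total = Counter()      # total occurrences of each word in the corpus
    for counts in doc_counts:
        doc_freq.update(counts.keys())
        total.update(counts)
    mean_count = {word: total[word] / n_docs for word in total}

    all_scores = []
    for doc, counts in zip(corpus, doc_counts):
        n_tokens = sum(counts.values())
        last_index = len(doc) - 1

        # Each word takes the largest weight among the paragraphs it appears in.
        position = {}
        for i, para in enumerate(doc):
            if i == 0:
                weight = first_w
            elif i == last_index:
                weight = last_w
            else:
                weight = middle_w
            for tok in para:
                position[tok] = max(position.get(tok, 0.0), weight)

        scores = {}
        for word, count in counts.items():
            tf = count / n_tokens
            idf = math.log((n_docs + 1) / (doc_freq[word] + 1)) + 1.0  # smoothed
            adjust = sigmoid_adjust(count, mean_count[word])
            scores[word] = tf * idf * adjust * position[word]
        all_scores.append(scores)
    return all_scores


def top_keywords(scores, k):
    """Return the k highest-scoring words of one document."""
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]


# Toy corpus: two short "news" documents, already split into paragraphs and tokens.
corpus = [
    [["stocks", "rally", "stocks"], ["market", "news"], ["stocks", "close"]],
    [["team", "wins", "match"], ["news", "team"]],
]
scores = tfidf_mp_scores(corpus)
print(top_keywords(scores[0], 2))
print(top_keywords(scores[1], 2))
```

In this toy corpus the word "news" appears in both documents at its average rate, so both its IDF and its Sigmoid adjustment are low, while a word that is frequent in one document and appears in its opening paragraph is pushed to the top of that document's ranking.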

 
