Authors: Cao Yiqin (曹义亲) [1]; Sheng Wuping (盛武平); Zhou Huixiang (周会祥) (School of Software, East China Jiaotong University, Nanchang 330013, China)
Source: Journal of East China Jiaotong University (《华东交通大学学报》), 2021, No. 1, pp. 122-130 (9 pages)
Funding: National Natural Science Foundation of China (61967006).
Abstract: The TF-IDF algorithm uses term frequency and inverse document frequency to judge the importance of words in a document, but it discriminates poorly between categories. To improve classification, the TF-IDF-MP algorithm is proposed. First, the documents in the corpus are annotated by paragraph, and the jieba tool is used to segment the text and tag parts of speech. Next, the number of times a feature word occurs in a single document is compared with its average number of occurrences across all documents in the corpus, and an improved Sigmoid function adjusts the feature word's weight accordingly. At the same time, different position weights are assigned according to the importance of the paragraph position within the document. Finally, the feature words are sorted by weight and a Naive Bayes classifier classifies the documents. Experimental results show that, when applied to news classification, TF-IDF-MP improves precision, recall, and F1 score over TF-IDF and related improved algorithms.
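The abstract does not give the formulas, so the weighting it describes can only be sketched under assumptions. The Python example below is an illustration of the general idea, not the authors' TF-IDF-MP implementation: a plain logistic function stands in for the paper's improved Sigmoid, the values in `POSITION_WEIGHT` and the paragraph role names are hypothetical, and the text is assumed to be already segmented into tokens (for example with jieba). The Naive Bayes classification step is omitted.

```python
import math
from collections import Counter

# Hypothetical position weights per paragraph role (assumed values, not from the paper).
POSITION_WEIGHT = {"title": 2.0, "first": 1.5, "body": 1.0, "last": 1.2}


def sigmoid(x):
    """Plain logistic function, a stand-in for the paper's improved Sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def term_weights(corpus):
    """Score terms per document.

    corpus: list of documents; each document is a list of
    (paragraph_role, tokens) pairs, with tokens already segmented.
    Returns one {term: weight} dict per document.
    """
    n_docs = len(corpus)
    # Raw term counts for each document.
    doc_counts = [Counter(tok for _, toks in doc for tok in toks) for doc in corpus]

    df = Counter()     # number of documents containing each term
    total = Counter()  # total occurrences of each term in the corpus
    for counts in doc_counts:
        df.update(counts.keys())
        total.update(counts)

    results = []
    for doc, counts in zip(corpus, doc_counts):
        length = sum(counts.values())
        scores = {}
        for term, count in counts.items():
            tf = count / length
            # Offset by 1 so a term found in every document keeps a nonzero weight.
            idf = math.log(n_docs / df[term]) + 1.0
            # Compare the in-document count with the corpus-wide mean count.
            mean_count = total[term] / n_docs
            freq_factor = sigmoid(count - mean_count)
            # Use the most important paragraph position in which the term appears.
            pos = max(POSITION_WEIGHT[role] for role, toks in doc if term in toks)
            scores[term] = tf * idf * freq_factor * pos
        results.append(scores)
    return results
```

Under these assumptions, a term that occurs more often than its corpus average and appears in a title paragraph receives a larger weight than a term that appears once in the body of every document.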
Keywords: text classification; keyword extraction; TF-IDF; term-frequency averaging; position weighting
Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]