Text categorization framework based on improved TF-IDF and k-nearest neighbor  Cited by: 10


Authors: GONG Jing [1], HUANG Xin-yang [2]

Affiliations: [1] Department of Public Basic Courses, Hunan Polytechnic of Environment and Biology, Hengyang 421005, Hunan, China; [2] College of Computer Science, University of South China, Hengyang 421001, Hunan, China

Source: Computer Engineering and Design (《计算机工程与设计》), 2018, No. 5, pp. 1340-1344, 1349 (6 pages)

Funding: National Natural Science Foundation of China (61300234); Hunan Provincial Department of Education fund project (12C1056)

Abstract: To obtain more accurate and stable text categorization results, a text categorization method based on k-nearest neighbor (k-NN) and improved term frequency-inverse document frequency (TF-IDF) is proposed. The framework consists of five modules: the document module, the graphical user interface (GUI) module, the pre-processing module, the k-NN & TF-IDF module, and the similarity measurement module. For weight acquisition, feature words in different positions are assigned different coefficients, and a weight matrix is constructed to reflect the importance and distribution of the feature words. On the programming side, query efficiency is optimized by executing revised Language Integrated Query (LINQ) queries. Experimental results show that, compared with other classification methods, the proposed method has certain advantages in classification accuracy, recall, and F1 measure. The influence of the classifier on the overall framework is also discussed; the experimental results show that the k-NN classifier is more suitable for text classification than the SVM classifier.
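The weighting and classification steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the position coefficients, the exact TF-IDF variant, or the similarity measure, so the values in `POSITION_COEFF`, the smoothed IDF formula, and the use of cosine similarity are all assumptions, and every function name is hypothetical. The LINQ reference suggests the original code targeted a .NET language; Python is used here only for illustration.

```python
import math
from collections import Counter

# ASSUMPTION: the abstract says feature words in different positions receive
# different coefficients but gives no values; these numbers are illustrative.
POSITION_COEFF = {"title": 3.0, "body": 1.0}


def term_freqs(doc):
    """Position-weighted term counts for a document given as {position: [tokens]}."""
    counts = Counter()
    for position, tokens in doc.items():
        for tok in tokens:
            counts[tok] += POSITION_COEFF[position]
    return counts


def build_weight_matrix(docs):
    """Return (vocab, idf, rows); rows[i][j] is the weight of term vocab[j] in docs[i]."""
    tfs = [term_freqs(d) for d in docs]
    vocab = sorted({t for tf in tfs for t in tf})
    n = len(docs)
    df = {t: sum(1 for tf in tfs if t in tf) for t in vocab}
    # ASSUMPTION: smoothed IDF, one common variant; the paper's formula may differ.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    rows = [[tf.get(t, 0.0) * idf[t] for t in vocab] for tf in tfs]
    return vocab, idf, rows


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def knn_classify(query, docs, labels, k=3):
    """Label the query by majority vote among its k most similar training documents."""
    vocab, idf, rows = build_weight_matrix(docs)
    qtf = term_freqs(query)
    qvec = [qtf.get(t, 0.0) * idf[t] for t in vocab]
    ranked = sorted(range(len(docs)), key=lambda i: cosine(qvec, rows[i]), reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy corpus, invented purely for illustration.
DOCS = [
    {"title": ["football"], "body": ["match", "goal", "team"]},
    {"title": ["basketball"], "body": ["team", "score", "match"]},
    {"title": ["stock"], "body": ["market", "price", "bank"]},
    {"title": ["bank"], "body": ["loan", "market", "rate"]},
]
LABELS = ["sports", "sports", "finance", "finance"]

if __name__ == "__main__":
    query = {"title": ["football"], "body": ["team", "goal"]}
    print(knn_classify(query, DOCS, LABELS, k=3))
```

In this sketch a term in the title contributes three times as much to a document's vector as the same term in the body, which is one simple way to let the weight matrix reflect both the importance and the distribution of feature words.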

Keywords: text categorization; k-NN; classifier; weight matrix; optimization

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
