A Comparative Study of Relevant Classification Techniques in Automatic Classification for Two Categories with Similar Contents: Taking E271 and E712.51 in the Chinese Library Classification as Examples (Cited by: 5)

Authors: Li Xiangdong [1,2]; Ruan Tao (School of Information Management, Wuhan University; Center for Electronic Commerce Research and Development, Wuhan University)

Affiliations: [1] School of Information Management, Wuhan University; [2] Center for Electronic Commerce Research and Development, Wuhan University

Source: Library Journal, 2018, No. 6, pp. 11-21, 30 (12 pages)

Abstract: This paper studies machine-learning-based automatic classification (two-class classification) for two categories in the Chinese Library Classification whose contents are highly similar. Taking the bibliographic records of categories E271 and E712.51 as the two-class corpus, it compares the classification performance of representative techniques at each major stage of the classification process: the feature selection methods CHI, IG and MI; the feature weighting schemes TF and TF*IDF; and the classification algorithms KNN, NB and SVM. The comparison provides baseline data for future, targeted research on the automatic classification of highly similar categories in the Chinese Library Classification. The experimental results show that, among the feature selection methods, CHI and IG perform best while MI is somewhat weaker, although MI improves markedly once the number of features exceeds 4,000. Among the classification algorithms, NB performs best when combined with MI feature selection, SVM performs better under CHI and IG feature selection, and KNN is inferior to both. For feature weighting, TF outperforms TF*IDF in most cases, but the outcome is easily affected by the classification algorithm, the number of features and the feature selection method. Combinations of these techniques can handle the automatic classification of similar categories, but their performance varies, so the techniques need to be adapted to similar-category classification in order to further improve classification performance.
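As a rough illustration of the experimental setup described in the abstract (not the authors' actual implementation), the following Python sketch combines one feature selection method, one weighting scheme and one classifier using scikit-learn. The names docs and labels, the value k=4000, n_neighbors=15, and the use of mutual_info_classif as a stand-in for MI (scikit-learn has no separate IG implementation) are all illustrative assumptions; the bibliographic text is assumed to be pre-segmented into space-separated tokens.

# A minimal sketch, assuming scikit-learn and a corpus of pre-segmented
# bibliographic records from E271 and E712.51; names and parameters below
# (docs, labels, k=4000, n_neighbors=15) are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

def build_pipeline(score_func, weighting, classifier, k=4000):
    """One combination: feature selection -> optional TF*IDF weighting -> classifier."""
    steps = [
        ("count", CountVectorizer()),              # raw term frequencies (TF)
        ("select", SelectKBest(score_func, k=k)),  # keep the k highest-scoring features
    ]
    if weighting == "tfidf":
        steps.append(("tfidf", TfidfTransformer()))  # re-weight selected features as TF*IDF
    steps.append(("clf", classifier))
    return Pipeline(steps)

# The three classifiers compared in the paper.
classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=15),
    "NB": MultinomialNB(),
    "SVM": LinearSVC(),
}
# docs: list of space-separated token strings; labels: 0 for E271, 1 for E712.51.
# for name, clf in classifiers.items():
#     pipe = build_pipeline(chi2, "tf", clf)
#     scores = cross_val_score(pipe, docs, labels, cv=10, scoring="f1")
#     print(name, scores.mean())
# Passing mutual_info_classif instead of chi2 approximates the MI setting.

Looping over the feature selection functions, the two weighting settings and the three classifiers in this sketch would reproduce the kind of grid of technique combinations whose performance the paper compares.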

Keywords: two-class classification; Chinese Library Classification; feature selection; feature weighting; text classification

Classification Number: G254.1 [Culture and Science - Library Science]

 
