Authors: Li Xiangdong (李湘东) [1,2]; Ruan Tao (阮涛) (School of Information Management, Wuhan University; Center for Electronic Commerce Research and Development, Wuhan University)
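Affiliations: [1] School of Information Management, Wuhan University; [2] Center for Electronic Commerce Research and Development, Wuhan University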
Source: Library Journal (《图书馆杂志》), 2018, No. 6, pp. 11-21, 30 (12 pages in total)
Abstract: This paper studies machine-learning-based automatic classification (binary classification) of two categories in the Chinese Library Classification (CLC) whose contents are extremely similar. Using the bibliographic records of CLC classes E271 and E712.51 as the objects of binary classification, it compares the classification performance of representative techniques at each main stage of the classification process: the feature selection methods CHI, IG and MI; the feature weighting schemes TF and TF*IDF; and the classification algorithms KNN, NB and SVM. The results are intended as baseline data for future targeted research on automatically classifying highly similar CLC classes. For feature selection, CHI and IG perform better and MI is somewhat weaker, although MI improves markedly once the number of features reaches 4000 or more. For classification algorithms, NB performs well with MI feature selection, SVM performs better still with CHI and IG feature selection, and KNN is worse than both. For feature weighting, TF outperforms TF*IDF in most cases, but the outcome is sensitive to the classification algorithm, the number of features, and the feature selection method. Combined, the techniques at each stage can handle automatic classification of similar classes, but their performance varies, so they need to be adapted specifically to similar-class classification in order to improve classification performance further.
Keywords: binary classification; Chinese Library Classification (《中国图书馆分类法》); feature selection; feature weighting; text classification