A Study of Hybrid Automatic Classification of Multiple Document Types Based on HowNet  Cited by: 4

A New Automatic Categorization Method with Documents Based on HowNet


Authors: Li Xiangdong (李湘东)[1,2], Liu Kang (刘康)[1], Ding Cong (丁丛)[1], Gao Fan (高凡)[1]

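Affiliations: [1] School of Information Management, Wuhan University, Wuhan 430072; [2] Center for Studies of Information Resources, Wuhan University, Wuhan 430072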

Source: New Technology of Library and Information Service (《现代图书情报技术》), 2016, No. 2, pp. 59-66 (8 pages)

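Funding: One of the research outcomes of the National Social Science Fund of China project "Research on Automatic Classification of Multi-Type Text Digital Resources" (Grant No. 15BTQ066)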

Abstract: [Objective] To address problems such as feature mismatch that arise when documents belong to different types, and to improve classification performance on the texts to be classified. [Methods] Texts of a different document type from the texts to be classified serve as the training set of the corpus, and the third-party resource HowNet is introduced to extend their semantic features. [Results] Classification experiments were run on four document types: web pages, books, non-academic periodicals, and academic journals. Compared with classification without feature extension, the method improved results by 1.2% to 11.0% (the Chinese abstract reports this gain as classification accuracy, the English abstract as F-measure). [Limitations] Not every document type was tested on a publicly available corpus, so the generality of the method and the objectivity of the experimental results need further verification. [Conclusions] The results indicate that the method is feasible and practical. It can eliminate, to varying degrees, the semantic differences between document types, and it improves automatic text classification through two routes: corpus construction and feature extension.
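The idea of semantic feature extension in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names neither the classifier nor how HowNet is queried, so the concept table `TOY_CONCEPTS`, the functions `expand_features`, `overlap`, and `classify`, and the overlap-based scoring are all assumptions made for illustration. The table entries are invented and are not taken from HowNet.

```python
from collections import Counter

# Hypothetical stand-in for a HowNet lookup: word -> shared concept labels.
# These entries are invented for illustration and are not taken from HowNet.
TOY_CONCEPTS = {
    "physician": ["concept:medical"],
    "doctor": ["concept:medical"],
    "clinic": ["concept:medical"],
    "hospital": ["concept:medical"],
    "stock": ["concept:finance"],
    "shares": ["concept:finance"],
    "bank": ["concept:finance"],
    "lender": ["concept:finance"],
}


def expand_features(tokens, concepts=TOY_CONCEPTS):
    """Return a bag of words with each token's concept labels added."""
    bag = Counter(tokens)
    for tok in tokens:
        for concept in concepts.get(tok, []):
            bag[concept] += 1
    return bag


def overlap(a, b):
    """Count shared feature occurrences (multiset intersection size)."""
    return sum((a & b).values())


def classify(tokens, training, expand=True):
    """Return the label whose training bag overlaps most, or None if nothing overlaps."""
    featurize = expand_features if expand else Counter
    doc = featurize(tokens)
    scores = {
        label: overlap(doc, featurize(words))
        for label, words in training.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Training text from one document type (say, web pages) ...
training = {
    "health": ["doctor", "hospital"],
    "finance": ["bank", "stock"],
}
# ... and a test document from another type that uses different vocabulary.
test_doc = ["physician", "clinic"]

print(classify(test_doc, training, expand=False))  # no shared surface words
print(classify(test_doc, training, expand=True))   # matched through shared concepts
```

Without extension the test document shares no surface words with either class, so no label is assigned. With extension, the shared concept label links the two vocabularies, which is the kind of cross-type mismatch the abstract describes.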

Keywords: third-party resource; HowNet; feature extension; semantic differences

Classification No.: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
