Research on Using Unlabeled Text to Improve the Performance of Centroid-based Classification Algorithms

Affiliation: [1] Department of Computer and Information Science, Fujian University of Technology, Fuzhou, Fujian 350014

Source: Computer Knowledge and Technology (《电脑知识与技术（过刊）》, back issues), 2007, No. 16, pp. 1125-1126, 1169 (3 pages)

Abstract: Centroid-based classification is an efficient family of algorithms for text categorization, but it needs a large number of labeled (training) documents to train a classifier that categorizes accurately. In practice, labeled documents are scarce because manual labeling is tedious and costly, whereas unlabeled documents are abundant on the web. We therefore propose two algorithms, ONUC and OFFUC, that improve centroid-based classification to remedy the sharp drop in its performance when training data is insufficient. Because centroid-based classifiers are sensitive to outliers, training documents that lie far from the center of their category are excluded from consideration (edge trimming). Experiments show that, when the number of labeled documents is very small, OFFUC and ONUC outperform both the plain centroid-based classifier and classical semi-supervised methods.
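The abstract does not spell out the ONUC and OFFUC procedures, so the following Python sketch only illustrates the general ideas it names: a centroid classifier, edge trimming of documents far from their category center, and the use of unlabeled documents to refine the centroids. The self-training loop, the function names, and the `keep_ratio`, `threshold`, and `rounds` parameters are all assumptions made for illustration and should not be read as the authors' algorithms.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def centroid(vectors):
    """Unit-length mean of a non-empty list of vectors."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return normalize(mean)

def trimmed_centroid(vectors, keep_ratio=0.8):
    """Edge trimming (assumed form): drop the documents least similar to a
    first-pass centroid, then recompute the centroid from the rest."""
    first = centroid(vectors)
    ranked = sorted(vectors, key=lambda v: cosine(v, first), reverse=True)
    keep = max(1, math.ceil(keep_ratio * len(ranked)))
    return centroid(ranked[:keep])

def train(pools, keep_ratio=0.8):
    """pools maps each label to a list of document vectors."""
    return {label: trimmed_centroid([normalize(v) for v in vecs], keep_ratio)
            for label, vecs in pools.items()}

def classify(centroids, v):
    """Assign v to the category whose centroid is most similar."""
    v = normalize(v)
    return max(centroids, key=lambda label: cosine(centroids[label], v))

def self_train(labeled, unlabeled, rounds=3, threshold=0.9, keep_ratio=0.8):
    """Hypothetical self-training loop: unlabeled documents that match a
    centroid with similarity >= threshold are added to that category."""
    pools = {label: [normalize(v) for v in vecs] for label, vecs in labeled.items()}
    remaining = [normalize(v) for v in unlabeled]
    centroids = train(pools, keep_ratio)
    for _ in range(rounds):
        still_unlabeled, added = [], False
        for v in remaining:
            label = classify(centroids, v)
            if cosine(centroids[label], v) >= threshold:
                pools[label].append(v)
                added = True
            else:
                still_unlabeled.append(v)
        remaining = still_unlabeled
        if not added:
            break
        centroids = train(pools, keep_ratio)
    return centroids

# Toy term-weight vectors: one labeled document per category.
labeled = {"sports": [[1.0, 0.0, 0.0]], "tech": [[0.0, 1.0, 0.0]]}
unlabeled = [[0.9, 0.1, 0.0], [0.1, 0.95, 0.05], [0.5, 0.5, 0.0]]
centroids = self_train(labeled, unlabeled)
print(classify(centroids, [0.8, 0.2, 0.1]))
```

In this toy run the two unlabeled vectors that lie close to a category are absorbed into it, while the ambiguous third vector stays below the threshold and is left out. Whether ONUC and OFFUC select unlabeled documents in this way cannot be determined from the abstract alone.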

Keywords: centroid-based classification; text categorization; unlabeled documents; labeled documents

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
