Research on Using Unlabeled Text to Improve the Performance of Centroid-based Classification Algorithms


Authors: He Yao (何尧) [1], Zhang Shunmiao (张顺淼) [1]

Affiliation: [1] Department of Computer and Information Science, Fujian University of Technology (福建工程学院), Fuzhou, Fujian 350014

Source: Computer Knowledge and Technology (《电脑知识与技术(过刊)》, back issues), 2007, No. 16, pp. 1125-1126, 1169 (3 pages)

Abstract: Centroid-based classification is an efficient family of algorithms for text categorization, but it needs a large number of training (labeled) documents to build an accurate classifier. In practice, labeled documents are scarce because manual labeling is tedious and costly, whereas unlabeled documents are abundant on the web. We therefore propose two improved algorithms, ONUC and OFFUC, to remedy the sharp drop in performance that centroid-based classification suffers when training data is insufficient. Because centroid-based classification is sensitive to outliers, training items that lie far from the center of their category are excluded from consideration (edge trimming). Experiments show that, compared with ordinary centroid-based classification and with other classic semi-supervised algorithms, the proposed algorithms achieve better performance when very few labeled documents are available.

Keywords: centroid-based classification; text categorization; unlabeled documents; labeled documents

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
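The abstract does not spell out the steps of ONUC or OFFUC, so the following is only a minimal sketch of the general idea it describes: a centroid classifier that trims outlying documents and then refines its centroids with unlabeled documents. It is a generic self-training loop, not the authors' algorithms. The function names, the `keep_ratio` and `rounds` parameters, and the choice of cosine similarity over unit-length vectors are all assumptions made for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def cosine(a, b):
    """Cosine similarity; both inputs are assumed to be unit-length already."""
    return sum(x * y for x, y in zip(a, b))

def centroid(vectors):
    """Unit-length mean of a non-empty list of vectors."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return normalize(mean)

def trimmed_centroid(vectors, keep_ratio=0.8):
    """Assumed form of 'edge trimming': drop the documents least similar
    to the initial centroid, then recompute the centroid from the rest."""
    c = centroid(vectors)
    ranked = sorted(vectors, key=lambda v: cosine(v, c), reverse=True)
    keep = max(1, math.ceil(keep_ratio * len(ranked)))
    return centroid(ranked[:keep])

def classify(doc, centroids):
    """Return the label whose centroid is most similar to doc."""
    return max(centroids, key=lambda label: cosine(doc, centroids[label]))

def fit_with_unlabeled(labeled, unlabeled, keep_ratio=0.8, rounds=3):
    """Hypothetical self-training loop, not ONUC/OFFUC.

    labeled:   dict mapping label -> list of document vectors
    unlabeled: list of document vectors
    """
    labeled = {k: [normalize(v) for v in vs] for k, vs in labeled.items()}
    unlabeled = [normalize(v) for v in unlabeled]
    centroids = {k: trimmed_centroid(vs, keep_ratio) for k, vs in labeled.items()}
    for _ in range(rounds):
        # Start each round from the labeled documents only.
        pools = {k: list(vs) for k, vs in labeled.items()}
        # Assign every unlabeled document to its currently nearest centroid.
        for doc in unlabeled:
            pools[classify(doc, centroids)].append(doc)
        # Recompute trimmed centroids from labeled plus pseudo-labeled documents.
        centroids = {k: trimmed_centroid(vs, keep_ratio) for k, vs in pools.items()}
    return centroids
```

With one labeled vector per class and a handful of unlabeled vectors, `fit_with_unlabeled` returns a dictionary of centroids that can be passed to `classify` together with a normalized query vector. Re-deriving the pools from the labeled set on every round keeps an early wrong pseudo-label from becoming permanent, which is one possible design choice among several.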

 
