A Controllable Feature Selection Algorithm Based on Poisson Estimates (一种基于泊松估计的可控特征选择算法)


Authors: Gao Yingfan (高影繁)[1], Wang Huilin (王惠临)[1]

Affiliation: [1] Institute of Scientific and Technical Information of China, Beijing 100038

Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), 2010, No. 3, pp. 408-413 (6 pages)

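Funding: Supported by a Key Project of the National Science and Technology Support Program of the 11th Five-Year Plan (2006BAH03B02) and by the National Social Science Fund of China (06BTQ030)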

Abstract: Feature selection is one of the key techniques in text categorization. This paper proposes a Controllable Feature Selection Algorithm Based on Poisson Estimates (CFSPE). The algorithm uses document frequency estimated under a Poisson assumption as the measure of the semantic information a feature carries, and it draws its idea of controllable feature selection from rate distortion theory in the communications field. Comparative experiments on the Reuters-21578 news corpus show that the Poisson-estimate-based feature selection algorithm outperforms the semantics-based WN algorithm and the statistics-based IG and Chi2 algorithms. Moreover, when the feature omission rate is taken as the rate distortion function and a lower bound is set for the classifier's effectiveness measure, any desired classification accuracy can be obtained by adjusting the feature omission rate. Because its idea stems from rate distortion theory in the communications field, the algorithm is also a new attempt at cross-field integration.

Keywords: Poisson estimates; semantic features; rate distortion theory; controllable feature selection

Classification code: TP301.6 [Automation and Computer Technology: Computer System Architecture]
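
The abstract names two ingredients, a Poisson-based estimate of document frequency and a tunable feature omission rate, but does not give formulas. The sketch below is only an illustration under stated assumptions, not the authors' CFSPE implementation: it assumes the textbook Poisson model, in which a term with collection frequency cf over N documents is expected to appear in N(1 - e^(-cf/N)) documents, and it assumes (hypothetically) that the ratio of expected to observed document frequency is used as the score. The function names `poisson_expected_df`, `score_terms`, and `select_features` are made up for this example.

```python
import math
from collections import Counter


def poisson_expected_df(cf, n_docs):
    """Document frequency expected under a Poisson model with rate cf / n_docs."""
    return n_docs * (1.0 - math.exp(-cf / n_docs))


def score_terms(docs):
    """Score each term by expected_df / observed_df (an assumed scoring rule).

    docs is a list of token lists. Terms that cluster in few documents
    have observed_df below the Poisson expectation and score above 1.
    """
    n_docs = len(docs)
    cf = Counter()  # collection frequency: total occurrences of a term
    df = Counter()  # document frequency: number of documents containing it
    for tokens in docs:
        cf.update(tokens)
        df.update(set(tokens))
    return {t: poisson_expected_df(cf[t], n_docs) / df[t] for t in cf}


def select_features(scores, omission_rate):
    """Keep the top-scoring terms, omitting the given fraction of features."""
    if not 0.0 <= omission_rate < 1.0:
        raise ValueError("omission_rate must be in [0, 1)")
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    keep = max(1, math.ceil(len(ranked) * (1.0 - omission_rate)))
    return ranked[:keep]


docs = [
    ["oil", "oil", "oil", "the"],
    ["the", "price"],
    ["the"],
    ["the", "price"],
]
selected = select_features(score_terms(docs), omission_rate=0.5)
```

In this toy corpus "oil" is concentrated in a single document, so it scores highest, while "the" is spread over every document and is the first term omitted. Raising `omission_rate` shrinks the feature set; in the setting the abstract describes, one would tune that rate while checking the classifier's effectiveness measure against the chosen lower bound.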

 
