DICV: A Study Framework of Text Categorization


Authors: Li Gang (李纲)[1], Xia Chenxi (夏晨曦)[1]

Affiliation: [1] Center for Studies of Information Resources, Wuhan University, Wuhan 430072

Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), 2007, No. 6, pp. 803-807 (5 pages)

Funding: Supported by the National Natural Science Foundation of China (Grant No. 70673070).

Abstract: A text categorization experiment involves several steps: preparing the document collection, text indexing, feature dimensionality reduction, classification, and performance evaluation. Each step offers many candidate methods, and every choice affects the final result. To compare different algorithms for the same step, the other steps must therefore use identical methods so that the algorithms run under the same conditions. This paper proposes DICV, a study framework for text categorization that consists of four modules: core data, text indexing, classification algorithm, and visualization interface. The design focuses on two points. (1) It extracts a unified model of text categorization and provides an interface for each step; any algorithm that implements the interface can be plugged into the framework through simple configuration. Researchers can thus choose among text indexing, feature dimensionality reduction, and classification algorithms, or add new document collections and algorithms, without building a text categorization system from scratch. (2) It automatically records the algorithms, parameters, and results of each step. The recorded choices and intermediate results can be reused in later studies, which avoids unnecessary repeated work and improves the efficiency of text categorization research.
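The two design points can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names (`Indexer`, `Classifier`, `Experiment`), the `params` attribute, and the log layout are all assumptions made for illustration, and the feature dimensionality reduction step is omitted for brevity. Each step sits behind an interface, and the experiment object records the algorithm, parameters, and result of every step as it runs.

```python
from abc import ABC, abstractmethod
from collections import Counter
import math


class Indexer(ABC):
    """Hypothetical interface for the text indexing step."""
    params = {}

    @abstractmethod
    def index(self, text):
        """Return a term -> weight mapping for one document."""


class Classifier(ABC):
    """Hypothetical interface for the classification step."""
    params = {}

    @abstractmethod
    def fit(self, vectors, labels): ...

    @abstractmethod
    def predict(self, vector): ...


class TermFrequencyIndexer(Indexer):
    """Raw term counts over whitespace-separated, lowercased tokens."""

    def index(self, text):
        return dict(Counter(text.lower().split()))


def cosine(a, b):
    """Cosine similarity between two sparse term -> weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * \
        math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


class NearestCentroidClassifier(Classifier):
    """Assigns the label whose mean training vector is most similar."""

    def fit(self, vectors, labels):
        sums, counts = {}, Counter(labels)
        for vec, label in zip(vectors, labels):
            centroid = sums.setdefault(label, Counter())
            for term, weight in vec.items():
                centroid[term] += weight
        self.centroids = {
            label: {t: w / counts[label] for t, w in c.items()}
            for label, c in sums.items()
        }

    def predict(self, vector):
        return max(self.centroids,
                   key=lambda label: cosine(vector, self.centroids[label]))


class Experiment:
    """Runs the steps in order and logs algorithm, parameters, and result."""

    def __init__(self, indexer, classifier):
        self.indexer, self.classifier = indexer, classifier
        self.log = []

    def _record(self, step, algorithm, params, result):
        self.log.append({"step": step, "algorithm": algorithm,
                         "params": dict(params), "result": result})

    def run(self, train_docs, train_labels, test_docs, test_labels):
        vectors = [self.indexer.index(d) for d in train_docs]
        self._record("indexing", type(self.indexer).__name__,
                     self.indexer.params, {"documents": len(vectors)})

        self.classifier.fit(vectors, train_labels)
        predictions = [self.classifier.predict(self.indexer.index(d))
                       for d in test_docs]
        self._record("classification", type(self.classifier).__name__,
                     self.classifier.params, {"predictions": predictions})

        hits = sum(p == t for p, t in zip(predictions, test_labels))
        accuracy = hits / len(test_labels)
        self._record("evaluation", "accuracy", {}, {"accuracy": accuracy})
        return accuracy
```

Under these assumptions, comparing two classifiers means constructing two `Experiment` objects that share the same indexer and differ only in the classifier argument, which keeps the other steps identical as the abstract requires. The `log` list then holds one entry per step that a later run could inspect or reuse.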

Keywords: text categorization; text indexing; feature dimensionality reduction

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
