Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), 2007, Issue 6, pp. 803-807 (5 pages)
Funding: Supported by the National Natural Science Foundation of China (Grant No. 70673070).
Abstract: A text categorization experiment involves several steps: preparing the document set, text indexing, feature dimensionality reduction, classification, and performance evaluation. Each step can be carried out with many different methods, and every choice affects the final result. To compare the performance of alternative algorithms for one step, the other steps must use the same methods so that the algorithms run under identical conditions. This paper proposes DICV, a research framework for text categorization consisting of four modules: core data, text indexing, classification algorithm, and visualization interface. The design focuses on two points. (1) It extracts a uniform model of text categorization and provides an interface for each step; any algorithm that implements the interface can be plugged into the framework through simple configuration. Researchers can therefore choose among text indexing, feature reduction, and classification algorithms, or add new document sets and algorithms, without building a text categorization system from scratch. (2) It automatically records the algorithms, parameters, and results of each step, so that the researchers' choices and the intermediate results are logged and can be reused in later studies. This avoids unnecessary duplicated work and improves the efficiency of text categorization research.
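The two design points in the abstract (one interface per step, plus automatic logging of algorithms, parameters, and results) can be illustrated with a minimal sketch. This is not the DICV implementation: the names `Step`, `REGISTRY`, `run_experiment`, the three toy algorithms, and the sample documents are all hypothetical and were chosen only to show the idea.

```python
from abc import ABC, abstractmethod
from collections import Counter


class Step(ABC):
    """Hypothetical common interface for every pluggable algorithm."""

    def __init__(self, **params):
        self.params = params

    @abstractmethod
    def run(self, data):
        """Transform the output of the previous step."""


class TermFrequencyIndexer(Step):
    """Text indexing: map each (text, label) pair to (term counts, label)."""

    def run(self, data):
        return [(Counter(text.lower().split()), label) for text, label in data]


class TopTermsReducer(Step):
    """Feature reduction: keep only the k most frequent terms overall."""

    def run(self, data):
        totals = Counter()
        for counts, _ in data:
            totals.update(counts)
        keep = {term for term, _ in totals.most_common(self.params["k"])}
        return [({t: c for t, c in counts.items() if t in keep}, label)
                for counts, label in data]


class OverlapClassifier(Step):
    """Toy classifier: pick the label whose term profile overlaps most.

    It is fitted and applied on the same documents, purely for illustration.
    """

    def run(self, data):
        profiles = {}
        for counts, label in data:
            profiles.setdefault(label, Counter()).update(counts)

        def predict(counts):
            return max(sorted(profiles),
                       key=lambda lb: sum(min(c, profiles[lb][t])
                                          for t, c in counts.items()))

        return [(predict(counts), label) for counts, label in data]


# Algorithms are looked up by name, so a configuration can swap them freely.
REGISTRY = {cls.__name__: cls
            for cls in (TermFrequencyIndexer, TopTermsReducer, OverlapClassifier)}


def run_experiment(config, documents):
    """Run the configured steps in order and log each step automatically."""
    log, data = [], documents
    for step_name, algorithm, params in config:
        data = REGISTRY[algorithm](**params).run(data)
        log.append({"step": step_name, "algorithm": algorithm,
                    "params": params, "result": data})
    return data, log


# Made-up sample documents and configuration (assumptions, not from the paper).
DOCS = [("the match ended in a draw", "sport"),
        ("the striker scored a goal", "sport"),
        ("the bank raised interest rates", "finance"),
        ("markets fell as rates rose", "finance")]
CONFIG = [("indexing", "TermFrequencyIndexer", {}),
          ("reduction", "TopTermsReducer", {"k": 20}),
          ("classification", "OverlapClassifier", {})]

predictions, log = run_experiment(CONFIG, DOCS)
accuracy = sum(pred == gold for pred, gold in predictions) / len(predictions)
```

Because each entry in `log` stores the algorithm name, its parameters, and the intermediate result, a later experiment that changes only the classifier could in principle reuse the logged indexing and reduction output instead of recomputing it, which is the kind of reuse the abstract describes.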
CLC number: TP391 [Automation and Computer Technology: Computer Application Technology]