检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
作 者:王涛[1] 李舟军[2] 胡小华[3] 颜跃进[1] 陈火旺[1]
机构地区:[1]国防科学技术大学计算机学院,长沙410073 [2]北京航空航天大学计算机学院,北京100083 [3]德雷塞尔大学信息科学与技术学院
出 处:《计算机学报》2007年第8期1244-1250,共7页Chinese Journal of Computers
基 金:国家自然科学基金(60573057)资助~~
摘 要:数据流具有数据持续到达、到达速度快、数据规模巨大等特点,这些都给数据流挖掘领域的研究工作带来了新挑战,而其中分类算法更是当前的研究热点.Domingos等在VFDT中利用Hoeffding不等式很好地解决了在数据流上进行单遍扫描获取高精度决策树的问题.Gama等对VFDT进行扩展并实现了VFDTc,使系统能够处理连续属性.Peng等在传统数据挖掘环境下提出了基于模糊理论的连续属性平滑离散化方法.基于前述工作,作者设计并实现了一种基于线索化排序二叉树的增量模糊决策树分类算法fVFDT,其主要贡献有如下4点:(1)第一次设计并实现了数据流上的基于线索化二叉排序树(TBST)的连续属性处理方法.相比VFDT,fVFDT的样本插入时间复杂度由O(n2)降低到O(nlogn).当新样本到达时,VFDTc需要更新O(logn)个属性节点,而fVFDT只需要更新相应的一个节点即可;(2)改进了VFDTc连续属性的最佳划分节点选取的计算方法,使其时间复杂度由O(nlogn)降低到O(n);(3)根据Fayyad等的研究成果,相比VFDTc,fVFDT只需从更少的备选划分节点中选取最佳节点,备选划分节点数由O(n)降低到O(logn);(4)改进了传统数据挖掘环境下的基于模糊理论的连续属性平滑离散化方法,有效地处理了噪声数据,很好地提高了分类精度.Decision tree classification is a well-studied problem in data mining. Recently, there has been much interest in mining data streams. Domingos and Hulten have presented a one-pass algorithm. Their system, VFDT, uses Hoeffding inequality to achieve a probabilistic bound on the accuracy of the tree constructed. Gama et al. have extended VFDT in two directions. Their system VFDTc can deal with continuous data and use more powerful classification techniques at tree leaves. Peng et al. present soft discretization method to solve continuous attributes in data mining. This paper revisits this problem and implemented a system fVFDT on top of VFDT and VFDTc. It has the following four contributions.- (1) It presents a threaded binary search trees (TBST) approach for efficiently handling continuous attributes. It builds a threaded binary search tree, and its processing time for values inserting is O(n log n), while VFDT's processing time is O(n^2). When a new example arrives, VFDTc need update O(logn) attribute tree nodes, but fVFDT just need update one necessary node. (2) It improves the method of getting the best split-test point of a given continuous attribute. Comparing to the method used in VFDTc, it improves from O(nlogn) to O(n) in processing time. (3) Comparing to VFDTc, fVFDT's candidate split-test number decrease from O(n) to O(logn). (4) It uses soft discretization method in data streams mining to solve the problem of noise data.
关 键 词:数据流 线索化二叉排序树 连续属性 模糊离散化 增量 VFDT
分 类 号:TP181[自动化与计算机技术—控制理论与控制工程]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:216.73.216.40