An Efficient Incremental Fuzzy Decision Tree Classification Algorithm for Data Stream Mining (cited by: 18)

An Incremental Fuzzy Decision Tree Classification Method for Data Streams Mining Based on Threaded Binary Search Trees

Authors: Wang Tao (王涛) [1], Li Zhoujun (李舟军) [2], Hu Xiaohua (胡小华) [3], Yan Yuejin (颜跃进) [1], Chen Huowang (陈火旺) [1]

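Affiliations: [1] School of Computer Science, National University of Defense Technology, Changsha 410073; [2] School of Computer Science, Beihang University, Beijing 100083; [3] College of Information Science and Technology, Drexel University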

Source: Chinese Journal of Computers (《计算机学报》), 2007, No. 8, pp. 1244-1250 (7 pages)

Funding: Supported by the National Natural Science Foundation of China (60573057)

Abstract: Data streams are characterized by continuous arrival, high arrival rates, and very large volume, all of which pose new challenges for data stream mining; classification algorithms in particular are an active research topic. Domingos and Hulten's one-pass system VFDT uses the Hoeffding inequality to build a high-accuracy decision tree in a single scan of the stream, with a probabilistic bound on the accuracy of the resulting tree. Gama et al. extended VFDT in two directions: their system VFDTc handles continuous attributes and uses more powerful classification techniques at the tree leaves. Peng et al. proposed a soft discretization method, based on fuzzy theory, for continuous attributes in conventional (non-streaming) data mining. Building on this work, the authors design and implement fVFDT, an incremental fuzzy decision tree classification algorithm based on threaded binary search trees (TBST), on top of VFDT and VFDTc. It makes the following four contributions:
(1) It presents the first TBST-based method for handling continuous attributes on data streams. Compared with VFDT, the time complexity of inserting sample values drops from O(n^2) to O(n log n). When a new example arrives, VFDTc must update O(log n) attribute tree nodes, whereas fVFDT updates only the single node concerned.
(2) It improves the computation VFDTc uses to select the best split-test point of a continuous attribute, reducing the time complexity from O(n log n) to O(n).
(3) Following the results of Fayyad et al., fVFDT selects the best split point from fewer candidates than VFDTc: the number of candidate split-test points drops from O(n) to O(log n).
(4) It adapts the fuzzy-theory-based soft discretization method from conventional data mining to data streams, which handles noisy data effectively and improves classification accuracy.
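The ideas behind the Hoeffding test and contributions (1) and (2) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `hoeffding_bound`, `Node`, `ThreadedBST`, and `best_split` are hypothetical, the threads are modeled as explicit predecessor/successor links, the tree is left unbalanced (so the logarithmic insertion cost holds only in the expected case), and the split criterion is assumed to be information gain. Contributions (3) and (4) are not covered.

```python
import math
from collections import Counter


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean lies
    within epsilon of the mean observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


class Node:
    def __init__(self, key, label, prev, nxt):
        self.key = key
        self.counts = Counter([label])  # class counts for this exact value
        self.left = None
        self.right = None
        self.prev = prev  # thread to the in-order predecessor
        self.next = nxt   # thread to the in-order successor


class ThreadedBST:
    """Unbalanced BST over one continuous attribute (illustrative only)."""

    def __init__(self):
        self.root = None
        self.head = None         # node holding the smallest value
        self.totals = Counter()  # class counts over all inserted examples

    def insert(self, key, label):
        self.totals[label] += 1
        if self.root is None:
            self.root = self.head = Node(key, label, None, None)
            return
        cur = self.root
        while True:
            if key == cur.key:
                cur.counts[label] += 1  # only this one node is updated
                return
            if key < cur.key:
                if cur.left is None:
                    node = Node(key, label, cur.prev, cur)
                    cur.left = node
                    break
                cur = cur.left
            else:
                if cur.right is None:
                    node = Node(key, label, cur, cur.next)
                    cur.right = node
                    break
                cur = cur.right
        # Splice the new node into the in-order thread chain.
        if node.prev is not None:
            node.prev.next = node
        else:
            self.head = node
        if node.next is not None:
            node.next.prev = node


def entropy(counts, total):
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_split(tree):
    """One pass along the threads in sorted order, keeping running class
    counts for the left side; returns (cut_point, information_gain)."""
    n = sum(tree.totals.values())
    base = entropy(tree.totals, n)
    left, n_left = Counter(), 0
    best_cut, best_gain = None, 0.0
    node = tree.head
    while node is not None and node.next is not None:
        left.update(node.counts)
        n_left += sum(node.counts.values())
        right = tree.totals - left  # Counter subtraction keeps positive counts
        n_right = n - n_left
        gain = (base - (n_left / n) * entropy(left, n_left)
                - (n_right / n) * entropy(right, n_right))
        if gain > best_gain:
            best_cut, best_gain = (node.key + node.next.key) / 2, gain
        node = node.next
    return best_cut, best_gain
```

In a VFDT-style learner, a leaf would be split once the gain of the best attribute exceeds that of the runner-up by more than `hoeffding_bound(R, delta, n)`, where `R` is the range of the gain measure (log2 of the number of classes for information gain). Because each value's class counts live in a single node and the threads give sorted order directly, the split search needs no sorting and no per-insert updates along the root path.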

Keywords: data streams; threaded binary search tree; continuous attributes; fuzzy discretization; incremental; VFDT

Classification code: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]
