Mining Frequent Subtree on Paging XML Data Stream  (Cited by: 2)

Authors: Lei Xiangxin (雷向欣)[1], Yang Zhiying (杨智应)[2], Huang Shaoyin (黄少寅)[3], Hu Yunfa (胡运发)[4]

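Affiliations: [1] Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237; [2] Department of Computer Science and Technology, Shanghai Maritime University, Shanghai 200135; [3] Guanghua School of Management, Peking University, Beijing 100871; [4] School of Computer Science, Fudan University, Shanghai 200433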

Source: Journal of Computer Research and Development (《计算机研究与发展》), 2012, No. 9, pp. 1926-1936 (11 pages)

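Funding: National Natural Science Foundation of China (60402008); Shanghai Science and Technology Commission Innovation Action Plan Fund (08170511300); Shanghai Maritime University Science and Technology Fund (2009445654); East China University of Science and Technology Teaching Reform Fund (YH0126115)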

Abstract: With the widespread use of XML data streams, discovering knowledge by mining them has both theoretical and practical value. Compared with other kinds of frequent pattern mining, frequent subtree mining over large XML documents and unbounded XML data streams faces particular difficulties: an XML data stream cannot be parsed in memory as a whole, and mining the stream in segments must take the semi-structured nature of XML data into account. To address these problems, Tmlist, a model for mining frequent subtrees over a paged XML data stream, is proposed. Tmlist divides the XML data stream into pages, manages cross-page nodes and the cross-page growth of frequent candidate subtrees, and mines frequent subtrees page by page. Frequent candidate subtrees grow by adding frequent candidate nodes on their rightmost path, proceeding from shallow to deep according to the level of the root node, which avoids repeated recursive growth of subtrees rooted at lower-level nodes. Each frequent candidate subtree is identified jointly by its topological sequence and its rightmost path, so growing a subtree requires no matching against the subtree prefix, which eliminates the cost of storing and matching prefix nodes. Frequent candidate subtrees are filtered page by page against a page minimum support, and the support of each subtree is decayed, and the subtree pruned, according to a page decay factor. Within a controllable error bound, Tmlist reduces the memory consumed by frequent subtree mining and improves memory utilization and mining efficiency.
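Illustration (not from the paper): the abstract does not give the formulas Tmlist uses, so the following is only a minimal sketch of the general idea of page-wise filtering with decayed support. It assumes each candidate subtree is already reduced to a hashable key, and the function name `mine_pages` and the parameters `page_min_support`, `decay`, and `prune_below` are hypothetical stand-ins for the page minimum support and page decay factor mentioned above.

```python
from collections import Counter

def mine_pages(pages, page_min_support=2, decay=0.5, prune_below=1.0):
    """Sketch of page-wise candidate filtering with decayed support.

    pages: iterable of lists of hashable keys, each key standing in for one
    occurrence of a candidate subtree on that page (an assumption; the paper
    identifies subtrees by topological sequence plus rightmost path).
    """
    support = {}  # candidate key -> decayed support accumulated so far
    for page in pages:
        # Decay the support carried over from earlier pages.
        for key in list(support):
            support[key] *= decay
        # Count occurrences on the current page only.
        counts = Counter(page)
        for key, n in counts.items():
            # A new candidate is admitted only if it meets the page minimum
            # support; an already tracked candidate always accumulates.
            if key in support or n >= page_min_support:
                support[key] = support.get(key, 0.0) + n
        # Prune candidates whose decayed support fell below the threshold.
        for key in [k for k, s in support.items() if s < prune_below]:
            del support[key]
    return support

pages = [["a/b", "a/b", "a/c"], ["a/b", "a/c", "a/c"], ["a/d"]]
result = mine_pages(pages)
```

Because low-support candidates are discarded per page rather than kept for the whole stream, memory stays bounded at the price of an approximation error, which is the trade-off the abstract describes in general terms.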

Keywords: XML; data stream; paging; frequent subtree; data mining

Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]

 
