多段支持度数据挖掘算法研究被引量：23

A Data Mining Algorithm Based on Calculating Multi Segment Support

出　　处：《计算机学报》2001年第6期661-665,共5页Chinese Journal of Computers

基　　金：国家自然科学基金!(6 98730 19);吉林省科委应用基础基金! (19990 5 2 8)资助

摘　　要：在基于相联规则的数据挖掘算法中 ,Apriori等算法最为著名 .它分为两个主要步骤 :(1)通过多趟扫描数据库求解出频繁项集 ;(2 )利用频繁项集生成规则 .随后的许多算法都沿用 Apriori中“频繁项集的子集必为频繁项集”的思想 ,在频繁项集 Lk- 1 上进行 JOIN运算构成潜在 k项集 Ck.由于数据库和 Ck 的规模较大 ,需要相当大的计算量才能生成频繁项集 .Apriori Tid算法给每个事务增加了一个唯一标识 Tid ,其特点是只扫描一趟数据库 ,其余趟扫描 (如第 k趟扫描 )均在相应的数据集 Ck上进行 .由于数据规模改变不大 ,各算法的效率差别并不明显 .该文提出分段计算支持度的思想 ,是把一个项集的支持度分段计算 ,每一个段记录该项集在相应规模事务中出现的频度 ,从而构成一个支持度向量 .由于有了项集的多段支持度 ,可以推测出该项集能否包含在更大规模的频繁项集中 ,采用这种算法既提高了在扫描数据库过程中的信息获取率 ,又能及时剔除超集不是频繁项集的项集 ,进一步缩减了潜在项集的规模 .在数据集扫描过程中 ,按文中定理 1的思想调整数据集。Among the studies of KDD, R.Agrawal had presented a theory of association rules based on the basket data, which is the famous algorithm Apriori for data mining. The algorithm is executed in two steps, the large itemsets are generated at first and the set of rules generated afterwards. The algorithms presented by others thereafter still use the ideas of Apriori, that is any subset of a large itemset must also be large. Extending the large ( k-1 ) itemsets L k-1 using JOIN operation generates the set of candidate k itemsets C k . The generation of the large itemsets takes up a large amount of calculation, because scale of database is large and also the C k . The algorithm AprioriTid has set an identifier Tid for each transaction and the database is scanned only once, other scans (e.g. kth scan) are executed at corresponding data set C k . But the efficiency increased by the algorithm AprioriTid is not evidently because the difference of scale of database and data set C k is small. A new algorithm using multi segment for support is presented in this paper. The support of an itemset is divided into a lot of segments, and the counting for the different scale of transactions are recorded in corresponding segments. We can predict whether an itemset may be contained in a large itemset of lager scale, because the algorithm can calculate the multi segment support for the itemsets in one scan. This algorithm not only enhances the information gain ratio in database scanning, but also can find that some supersets are not the members of the large itemsets in advance, and to reduce the size of candidate itemsets by deleting these itemsets. In order to increase the efficiency of generating the large itemsets, the reduction of the scale of data set in each scan is according to the theorem 1 in this paper. A performance comparison of this algorithm and Apriori is given at the end of the paper.

关键词：数据挖掘相联规则算法频繁项集多段支持度数据库

分类号：TP311.13[自动化与计算机技术—计算机软件与理论]

参考文献：

正在载入数据...

二级参考文献：

正在载入数据...

耦合文献：

正在载入数据...

引证文献：

正在载入数据...

二级引证文献：

正在载入数据...

同被引文献：

正在载入数据...

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

多段支持度数据挖掘算法研究被引量：23

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

高级检索检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

多段支持度数据挖掘算法研究 被引量：23

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

用户登录

高级检索检索式检索

多段支持度数据挖掘算法研究被引量：23