检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
作 者:李雄飞[1] 苑森淼[1] 董立岩[1] 全勃[1]
出 处:《计算机学报》2001年第6期661-665,共5页Chinese Journal of Computers
基 金:国家自然科学基金!(6 98730 19);吉林省科委应用基础基金! (19990 5 2 8)资助
摘 要:在基于相联规则的数据挖掘算法中 ,Apriori等算法最为著名 .它分为两个主要步骤 :(1)通过多趟扫描数据库求解出频繁项集 ;(2 )利用频繁项集生成规则 .随后的许多算法都沿用 Apriori中“频繁项集的子集必为频繁项集”的思想 ,在频繁项集 Lk- 1 上进行 JOIN运算构成潜在 k项集 Ck.由于数据库和 Ck 的规模较大 ,需要相当大的计算量才能生成频繁项集 .Apriori Tid算法给每个事务增加了一个唯一标识 Tid ,其特点是只扫描一趟数据库 ,其余趟扫描 (如第 k趟扫描 )均在相应的数据集 Ck上进行 .由于数据规模改变不大 ,各算法的效率差别并不明显 .该文提出分段计算支持度的思想 ,是把一个项集的支持度分段计算 ,每一个段记录该项集在相应规模事务中出现的频度 ,从而构成一个支持度向量 .由于有了项集的多段支持度 ,可以推测出该项集能否包含在更大规模的频繁项集中 ,采用这种算法既提高了在扫描数据库过程中的信息获取率 ,又能及时剔除超集不是频繁项集的项集 ,进一步缩减了潜在项集的规模 .在数据集扫描过程中 ,按文中定理 1的思想调整数据集 。Among the studies of KDD, R.Agrawal had presented a theory of association rules based on the basket data, which is the famous algorithm Apriori for data mining. The algorithm is executed in two steps, the large itemsets are generated at first and the set of rules generated afterwards. The algorithms presented by others thereafter still use the ideas of Apriori, that is any subset of a large itemset must also be large. Extending the large ( k-1 ) itemsets L k-1 using JOIN operation generates the set of candidate k itemsets C k . The generation of the large itemsets takes up a large amount of calculation, because scale of database is large and also the C k . The algorithm AprioriTid has set an identifier Tid for each transaction and the database is scanned only once, other scans (e.g. kth scan) are executed at corresponding data set C k . But the efficiency increased by the algorithm AprioriTid is not evidently because the difference of scale of database and data set C k is small. A new algorithm using multi segment for support is presented in this paper. The support of an itemset is divided into a lot of segments, and the counting for the different scale of transactions are recorded in corresponding segments. We can predict whether an itemset may be contained in a large itemset of lager scale, because the algorithm can calculate the multi segment support for the itemsets in one scan. This algorithm not only enhances the information gain ratio in database scanning, but also can find that some supersets are not the members of the large itemsets in advance, and to reduce the size of candidate itemsets by deleting these itemsets. In order to increase the efficiency of generating the large itemsets, the reduction of the scale of data set in each scan is according to the theorem 1 in this paper. A performance comparison of this algorithm and Apriori is given at the end of the paper.
关 键 词:数据挖掘 相联规则 算法 频繁项集 多段支持度 数据库
分 类 号:TP311.13[自动化与计算机技术—计算机软件与理论]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:216.73.216.117