检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
机构地区:[1]云南大学信息学院,昆明650091
出 处:《计算机研究与发展》2007年第10期1774-1781,共8页Journal of Computer Research and Development
基 金:国家自然科学基金项目(60463004)
摘 要:KNN-join是一种新近才提出的操作,它在数据挖掘中有着广泛的应用.利用KNN-join的"一次一个集合"的性质,一些数据挖掘任务,例如分类、例外挖掘和聚类等,就会更加容易地进行.MuX和Goreder则是两种专为KNN-join设计的算法.为了综合利用这两种方法的优点,一种新的KNN-join并行处理方法——pgi-distance(parallel grid index-distance)——被提了出来.pgi-distance使用双层结构,可以对I/O和CPU进行同时优化;基于距离的索引能够让它更好地适应数据维度和分布的变化.由于采用的是各DBMS厂商广泛支持的B+树索引,这让pgi-distance得以成为一种更为实用的KNN-join处理方法.在合成数据集和真实数据集上的测试也表明pgi-distance是实用的和高效的.The KNN-join has extensive applications where high-dimensional data is involved,because it is powerful and can provide more meaningful results than range-join.With its set-a-time nature,the KNN-join is able to facilitate many data mining tasks such as classification,outlier detection and clustering.MuX and Goreder are two algorithms specially designed for KNN-join.Since the MuX makes use of R-tree to reduce CPU cost,it suffers like all of the algorithms based on R-tree.In order to optimize both I?O and CPU cost,the Goreder applies a grid on data space.With the difference of data distribution,however,it is difficult to determine the cell's appropriate granularity.In the consequence,the Goreder tends to perform poorly.In this paper,a novel parallel method—pgi-distance(parallel grid index-distance)—is proposed,which combines the advantages of MuX and Goreder.On the one hand,the pgi-distance utilizes a two-tier structure to optimize I?O and CPU time separately.On the other hand,the index based on distance makes the pgi-distance adapted well to data dimensionality and data distribution.Especially the index in the pgi-distance is B+-tree,which is supported by all commercial DBMS.So,this method becomes more applicable than that based on R-tree.Extensive experiments on both synthetic datasets and real datasets are conducted,and the results illustrate that pgi-distance is efficient.
关 键 词:KNN-join 数据挖掘 分类 基于距离的索引 B+树
分 类 号:TP311.13[自动化与计算机技术—计算机软件与理论]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:216.73.216.38