Survey on DBSCAN Acceleration Algorithms for Large Scale Data

Cited by: 4


Authors: Chen Yewang[1,2,5,6,7], Cao Hailu, Chen Yi, Kang Zhao[3], Lei Zhen, Du Jixiang[1,6]

Affiliations: [1] College of Computer Science and Technology, Huaqiao University, Xiamen, Fujian 361021; [2] Beijing Key Laboratory of Big Data Technology for Food Safety (Beijing Technology and Business University), Beijing 100048; [3] School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731; [4] State Key Laboratory of Pattern Recognition (Institute of Automation, Chinese Academy of Sciences), Beijing 100190; [5] Xiamen Key Laboratory of Data Security and Blockchain Technology (Huaqiao University), Xiamen, Fujian 361021; [6] Fujian Key Laboratory of Big Data Intelligence and Security (Huaqiao University), Xiamen, Fujian 361021; [7] Jiangsu Provincial Key Laboratory for Computer Information Processing Technology (Soochow University), Suzhou, Jiangsu 215006

Source: Journal of Computer Research and Development (《计算机研究与发展》), 2023, No. 9, pp. 2028-2047 (20 pages)

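Funding: National Natural Science Foundation of China (61673186, 71771094, 61876068, 61972010); Guiding Project of the Fujian Province Science and Technology Plan (2021H0019); Natural Science Foundation of Fujian Province (2020J05059, 2021J01317).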

Abstract: DBSCAN (density-based spatial clustering of applications with noise) is one of the most widely used and studied density clustering algorithms, owing to its simplicity and ease of implementation. However, its time complexity is high (O(n^2)), largely because computing densities involves a great number of redundant distance computations, so it cannot handle large-scale data. Accelerating DBSCAN to make it suitable for big data environments has therefore become a research hotspot, and much fruitful work has emerged. In terms of acceleration goals, this work can be broadly divided into two categories: reducing redundant computation and parallelization. In terms of specific acceleration techniques, it can be divided into six main categories: distributed computing, sampling, approximation, fast nearest-neighbor search, space partitioning, and GPU acceleration. Following this classification, the survey reviews and cross-compares existing work in depth and finds that fusion algorithms combining multiple techniques outperform those that rely on a single acceleration technique; that approximation, parallelization, and distributed computing are currently the most effective means of accelerating DBSCAN; and that high-dimensional data remain difficult to handle. In addition, applications of fast DBSCAN in a number of fields are tracked and reported. Finally, future directions for the field are discussed.
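To make the "redundant distance computations" point concrete, the following is a minimal illustrative sketch and is not taken from the surveyed paper. It contrasts the all-pairs eps-neighborhood search behind the O(n^2) cost with a simple uniform-grid variant of the space-partitioning idea. The function names `brute_force_neighbors` and `grid_neighbors` are hypothetical, and the choice of a cell side equal to eps is an assumption made for this example.

```python
import math
from collections import defaultdict
from itertools import product


def brute_force_neighbors(points, eps):
    """Baseline: compare every pair of points (O(n^2) distance computations)."""
    n = len(points)
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if math.dist(points[i], points[j]) <= eps:
                neighbors[i].append(j)
    return neighbors


def grid_neighbors(points, eps):
    """Illustrative space partitioning (an assumption, not the paper's method):
    hash points into cells of side eps, then compare each point only with
    points in its own cell and the adjacent cells."""
    dim = len(points[0])
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(math.floor(c / eps) for c in p)].append(idx)

    # Any point within eps differs by at most one cell index per coordinate.
    offsets = list(product((-1, 0, 1), repeat=dim))
    neighbors = [[] for _ in points]
    for cell, members in cells.items():
        candidates = []
        for off in offsets:
            key = tuple(c + o for c, o in zip(cell, off))
            candidates.extend(cells.get(key, ()))
        for i in members:
            neighbors[i] = sorted(
                j for j in candidates if math.dist(points[i], points[j]) <= eps
            )
    return neighbors


def core_points(neighbors, min_pts):
    """A point is a core point if its eps-neighborhood (itself included)
    contains at least min_pts points."""
    return [i for i, nb in enumerate(neighbors) if len(nb) >= min_pts]
```

Both functions return the same neighborhoods, but the grid version skips distance computations between points in distant cells. The number of adjacent cells it inspects is 3^d for d dimensions, which gives one intuition for why grid-style partitioning degrades on high-dimensional data.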

Keywords: fast DBSCAN; density clustering; clustering algorithm; big data; data mining

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
