Locality-sensitive hashing K-means algorithm for large-scale datasets (cited by: 2)

Authors: WEI Feng; MA Long

Affiliations: [1] CCTEG China Coal Research Institute, Beijing 100013, China; [2] National Key Lab of Coal Resources High Efficient Mining and Clean Utilization, Beijing 100013, China

Source: Journal of Mine Automation (《工矿自动化》), 2023, No. 3, pp. 53-62 (10 pages)

Funding: National Key Research and Development Program of China (2021YFB3201905).

Abstract: Efficient processing of large datasets is a key enabler of intelligent coal mine construction, including intelligent safety monitoring and intelligent mining. To address the insufficient clustering efficiency and accuracy of the K-means algorithm on large datasets, an efficient K-means clustering algorithm based on locality-sensitive hashing (LSH) is proposed. First, LSH is used to optimize the sampling process: a data grouping algorithm, LSH-G, divides the large dataset into subgroups and removes noise points from the dataset. Second, LSH-G is used to optimize the subgroup division step of the density biased sampling (DBS) algorithm, yielding a data group sampling algorithm, LSH-GD, whose sample set reflects the distribution of the original dataset more faithfully. Finally, the K-means algorithm clusters the generated sample set, so that useful data can be mined efficiently from large datasets at low time complexity. Experimental results show that the cascade of 10 AND operations and 8 OR operations is the optimal cascade combination, giving the smallest sum of squared errors of class centers (SSEC). On artificial datasets, compared with K-means based on multi-layer simple random sampling (M-SRS), K-means based on DBS, and K-means based on grid density biased sampling (G-DBS), K-means based on LSH-GD improves clustering accuracy by 56.63%, 54.59%, and 25.34% on average, and clustering efficiency by 27.26%, 16.81%, and 7.07% on average, respectively. On UCI standard datasets, K-means based on LSH-GD achieves the best SSEC and the best CPU time consumption (CPU-C).
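The pipeline summarized in the abstract (LSH grouping, noise removal, density biased sampling, then K-means on the sample) can be illustrated with a minimal pure-Python sketch. This is not the authors' LSH-G or LSH-GD implementation, which the abstract does not spell out. The random-projection hash family, the bucket width `width`, the noise threshold `min_size`, the exponent `e`, and every function name below are assumptions chosen for illustration. The grouping step applies only an AND-composed signature; the OR stage appears only in `cascade_collision_prob`, which evaluates the standard amplification formula 1 - (1 - p^k)^L, here read with k = 10 and L = 8 as one possible interpretation of the reported cascade.

```python
import math
import random
from collections import defaultdict


def cascade_collision_prob(p, n_and=10, n_or=8):
    """Collision probability of an AND-then-OR cascade for base probability p."""
    return 1.0 - (1.0 - p ** n_and) ** n_or


def make_hash(dim, width, rng):
    """One random-projection hash (assumed family): floor((a . x + b) / width)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)
    return lambda x: math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / width)


def lsh_groups(points, n_and, width, min_size, rng):
    """Bucket points by an AND-composed signature; drop tiny buckets as noise."""
    hashes = [make_hash(len(points[0]), width, rng) for _ in range(n_and)]
    buckets = defaultdict(list)
    for x in points:
        buckets[tuple(h(x) for h in hashes)].append(x)
    return [g for g in buckets.values() if len(g) >= min_size]


def density_biased_sample(groups, n_samples, e, rng):
    """Draw from each group a quota proportional to len(group) ** e (assumed rule)."""
    weights = [len(g) ** e for g in groups]
    total = sum(weights)
    sample = []
    for g, w in zip(groups, weights):
        quota = min(len(g), max(1, round(n_samples * w / total)))
        sample.extend(rng.sample(g, quota))
    return sample


def kmeans(points, k, n_iter, rng):
    """Plain Lloyd iterations on the sample; returns the list of centers."""
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda c: math.dist(x, centers[c]))
            clusters[nearest].append(x)
        # Keep the old center when a cluster receives no points.
        centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return centers


# Toy usage on two synthetic 2-D blobs (data and parameters are illustrative).
rng = random.Random(0)
data = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
        for cx, cy in [(0, 0), (5, 5)] for _ in range(500)]
groups = lsh_groups(data, n_and=2, width=1.0, min_size=3, rng=rng)
sample = density_biased_sample(groups, n_samples=100, e=0.5, rng=rng)
centers = kmeans(sample, k=2, n_iter=10, rng=rng)
```

The toy run uses only two AND hashes because a 10-hash signature on 1,000 two-dimensional points would leave almost every point alone in its bucket. The setting of 10 AND and 8 OR operations reported above was tuned on the paper's own datasets and need not transfer to other data.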

Keywords: smart mine; large-scale dataset; K-means clustering; locality-sensitive hashing; noise point screening; density biased sampling

Classification code: TD67 [Mining Engineering: Mine Electromechanical Engineering]
