Analysis of a Multi-dimensional Data De-duplication Clustering Algorithm Based on Sparse Autoencoding


Authors: XUE Li-xiang [1], GAO Li-jie [1], LI Zhan-bo [2]

Affiliations: [1] College of Information Engineering, Zhengzhou University of Science and Technology, Zhengzhou, Henan 450064, China; [2] Network Management Center, Zhengzhou University, Zhengzhou, Henan 450001, China

Source: Computer Simulation (《计算机仿真》), 2024, No. 3, pp. 542-547 (6 pages)

Abstract: As information technology develops, data volumes and data types keep growing. High-dimensional data sets that contain many duplicate records make it hard to extract useful information. To address this problem, the paper proposes a multi-dimensional data clustering algorithm based on an improved sparse autoencoder. The algorithm has two parts: data processing and cluster analysis. In data processing, the greedy layer-wise principle of S-SAE first reduces the high-dimensional data set to groups of 6-dimensional data. A mapped-value matching mechanism then cleans duplicate data from the reduced data set, replacing each cleaned value with 0. The processed data are then fed to the K-Means++ clustering algorithm for cluster analysis. The result is the TS-SAE-K-Means++ multi-dimensional data clustering model, whose optimal parameter settings are obtained through optimization analysis. Simulation comparisons against different baseline algorithm combinations show that TS-SAE-K-Means++ outperforms the other combinations on both the clustering silhouette coefficient S and the model's F1 value. This indicates that the proposed algorithm has some advantage in extracting useful information from high-dimensional data.
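The second and third stages described in the abstract (duplicate cleaning, then K-Means++ clustering, scored by the silhouette coefficient S) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the data have already been reduced to 6 dimensions, so the S-SAE stage is omitted. The function names (`zero_out_duplicates`, `kmeans_pp_init`, `kmeans`, `mean_silhouette`) are hypothetical, and matching rows after rounding is only a stand-in for the paper's mapped-value matching mechanism, whose details the abstract does not give.

```python
import math
import random


def zero_out_duplicates(rows, decimals=6):
    """Keep the first occurrence of each row; replace repeats with zeros.

    Rounding to `decimals` places is an assumed stand-in for the paper's
    mapped-value matching, which the abstract does not specify.
    """
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(round(v, decimals) for v in row)
        if key in seen:
            cleaned.append([0.0] * len(row))
        else:
            seen.add(key)
            cleaned.append(list(row))
    return cleaned


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """K-Means++ seeding: draw each new center with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:  # every point coincides with an existing center
            centers.append(list(rng.choice(points)))
        else:
            centers.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centers


def kmeans(points, k, iters=50, seed=0):
    """Lloyd iterations started from K-Means++ seeds."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave an empty cluster's center where it was
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def mean_silhouette(points, labels):
    """Mean silhouette coefficient S using Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.sqrt(sq_dist(p, q)) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.sqrt(sq_dist(p, q)) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)
```

A typical call would be `labels, centers = kmeans(zero_out_duplicates(reduced_rows), k)` followed by `mean_silhouette(...)` on the same rows, where `reduced_rows` and `k` are placeholders. Whether the zeroed rows should take part in clustering or be filtered out first is not stated in the abstract, so the sketch leaves that choice to the caller.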

Keywords: improved sparse autoencoder; clustering algorithm; evaluation metrics

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
