Simulation of Redundant Point Filtering Algorithm for Mass Log Data Based on HDFS (基于HDFS的海量日志数据冗余点过滤算法仿真)  Cited by: 2

Authors: JIA Wen-gang (贾文钢) [1], GAO Jin-tao (高锦涛)

Affiliations: [1] College of Information Engineering, Inner Mongolia University of Technology, Hohhot, Inner Mongolia 010051, China; [2] Inner Mongolia Special Equipment Inspection Institute (Inner Mongolia Autonomous Region Special Equipment Inspection), Hohhot, Inner Mongolia 010051, China

Source: Computer Simulation (《计算机仿真》), 2021, No. 12, pp. 241-244, 249 (5 pages)

Funding: Scientific Research Project of Inner Mongolia University of Technology (ZY201902).

Abstract: Existing algorithms for filtering redundant data points lack a step that extracts and classifies the features of those points, which leads to poor filtering efficiency, low accuracy, and excessive storage overhead. This paper therefore designs a redundant point filtering algorithm for massive log data based on HDFS. Within the HDFS architecture, the data sampling time series is used to obtain the features of redundant data points, and these features are classified to improve filtering efficiency. The reduction rate (based on the ratio of the number of data bytes with redundant features to the number of ordinary bytes before filtering) and the misjudgment rate are then calculated to reduce storage overhead. To improve accuracy and elimination performance, the overall similarity is computed from the prominent features of the redundant points, and the redundant points are finally filtered out with a mean shift transfer function. Experimental results show that the proposed algorithm achieves higher filtering efficiency, higher accuracy, and lower storage overhead.

Keywords: redundant data points; redundant features; reduction rate calculation; mean shift transfer function

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
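The abstract names mean shift as the mechanism that groups similar points before the redundant ones are dropped, but it does not give the paper's transfer function. The sketch below is therefore only a generic one-dimensional mean shift with a flat kernel, not the authors' method: the function names `mean_shift_modes` and `filter_redundant`, the `bandwidth` parameter, and the rule "keep the first value per mode" are all assumptions made for illustration.

```python
from typing import List


def mean_shift_modes(values: List[float], bandwidth: float,
                     max_iter: int = 100, tol: float = 1e-6) -> List[float]:
    """Generic flat-kernel mean shift (illustrative, not the paper's function).

    Each value is repeatedly moved to the mean of all input values lying
    within `bandwidth` of its current position, until it stops moving.
    """
    modes = []
    for v in values:
        x = v
        for _ in range(max_iter):
            neighbours = [p for p in values if abs(p - x) <= bandwidth]
            new_x = sum(neighbours) / len(neighbours)
            converged = abs(new_x - x) < tol
            x = new_x
            if converged:
                break
        modes.append(x)
    return modes


def filter_redundant(values: List[float], bandwidth: float) -> List[float]:
    """Keep the first value seen for each mode (assumed redundancy rule).

    Values whose mode lies within bandwidth / 2 of an already seen mode
    are treated as redundant and dropped.
    """
    modes = mean_shift_modes(values, bandwidth)
    kept: List[float] = []
    seen_modes: List[float] = []
    for value, mode in zip(values, modes):
        if any(abs(mode - s) <= bandwidth / 2 for s in seen_modes):
            continue  # same cluster as an earlier value
        seen_modes.append(mode)
        kept.append(value)
    return kept
```

For example, `filter_redundant([1.0, 1.1, 0.9, 5.0, 5.2, 9.0], bandwidth=1.0)` keeps one value from each of the three clusters. In a real log pipeline the scalar values would be replaced by feature vectors extracted from the sampling time series, and the bandwidth would need tuning against the misjudgment rate.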

 
