Design of multi-outlier data detection algorithm based on isolation forest  Cited by: 2


Author: LI Jiajun (李加军) (School of Data Science, Guangzhou Huashang College, Guangzhou 511399, China)

Affiliation: [1] School of Data Science, Guangzhou Huashang College, Guangzhou, Guangdong 511399, China

Source: Modern Electronics Technique (《现代电子技术》), 2024, No. 5, pp. 139-142 (4 pages)

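Funding: Guangzhou Huashang College intramural mentorship research project "Research on big-data-driven competitiveness evaluation of e-commerce enterprises" (2022HSDS12).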

Abstract: Accurately identifying anomalous outlier data helps ensure the accuracy of large-scale data in applications. To this end, a multi-outlier data detection algorithm based on isolation forest is designed. First, the symbolic aggregate approximation algorithm is used to process the multi-condition time series of large-scale data. The similarity of these time series is then analyzed by calculating the Euclidean distance, and a weighted adjustment method is applied to adjust the similarity curves and eliminate abnormal data, which completes the cleaning of the large-scale data. The cleaned data are used to construct isolation trees that together form an isolation forest. The data to be tested are fed into the isolation forest, and outliers are detected by calculating the distance from each data sample point to the root node of each tree. Experimental results show that the algorithm effectively detects outlier data and achieves high detection accuracy on large-scale data.
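The detection stage described in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so this sketch assumes the standard isolation-forest formulation: the "distance to the root node" is read as the path length from the root to the leaf that isolates a sample, and the score is 2^(-E[h(x)]/c(n)). The function names, the parameters (`n_trees`, `sample_size`, `seed`), and the synthetic data are all hypothetical, and the cleaning stage (symbolic aggregation, Euclidean similarity, weighted adjustment) is not reproduced here.

```python
import math
import random

def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # H(n-1) approximated with Euler's constant
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random threshold."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # simplification: stop if the chosen feature cannot be split
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(point, node):
    """Number of edges from the root to the leaf reached by the point,
    plus a correction for the points left unsplit in that leaf."""
    depth = 0
    while "dim" in node:
        node = node["left"] if point[node["dim"]] < node["split"] else node["right"]
        depth += 1
    return depth + c_factor(node["size"])

def fit_forest(data, n_trees=100, sample_size=64, seed=0):
    """Build n_trees isolation trees, each on a random subsample (needs >= 2 points)."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(data, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, sample_size

def anomaly_score(point, forest):
    """Score in (0, 1); shorter average paths give scores closer to 1."""
    trees, sample_size = forest
    mean_depth = sum(path_length(point, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / c_factor(sample_size))

# Hypothetical usage: synthetic "cleaned" data, then score a typical and a distant point.
data_rng = random.Random(1)
cleaned = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(300)]
forest = fit_forest(cleaned)
print("typical point:", anomaly_score((0.0, 0.0), forest))
print("distant point:", anomaly_score((8.0, 8.0), forest))
```

A point far from the bulk of the data is separated after only a few random splits, so its average path is short and its score is higher than that of a typical point. The threshold above which a sample is declared an outlier is a design choice that the abstract does not specify.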

Keywords: isolation tree; isolation forest; outlier; large-scale data; anomaly detection; similarity measurement; data cleaning; time series

Classification number: TN99-34 [Electronics and Telecommunications: Signal and Information Processing]

 
