Authors: 张凯棋 (ZHANG Kai-qi), 宋亦静 (SONG Yi-jing), 陈鑫 (CHEN Xin) (School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China)
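Affiliation: [1] School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan, Shanxi 030024, China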
Source: 《计算机技术与发展》 (Computer Technology and Development), 2023, No. 11, pp. 20-27 (8 pages)
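Funding: Shanxi Province Basic Research Program (202103021223267); Science and Technology Innovation Program of Higher Education Institutions in Shanxi Province (2021L297); Taiyuan University of Science and Technology Scientific Research Start-up Fund (20212053, 20222107).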
Abstract: Attribute grouping is an effective technique in high-dimensional outlier detection because it alleviates the "curse of dimensionality". However, existing attribute-grouping outlier detection methods fail to reflect either the differences among attribute groups or the degree to which each group deviates, which seriously degrades the effectiveness and performance of high-dimensional outlier detection. This paper uses the cumulative sum of information entropy to characterize the differences among attribute groups and proposes an outlier detection method for categorical data based on attribute group weights. First, an attribute group deviation factor is defined from data pattern frequencies and code lengths and used as the criterion for merging attribute groups; it captures the degree of deviation of each attribute group and improves search efficiency during attribute grouping. Second, attribute group weights are defined from the cumulative sum of information entropy, reflecting the differences among attribute groups. Third, the outlier score function is redefined in terms of these weights, and an attribute group weight-based outlier detection algorithm for categorical data is proposed. Finally, experiments on UCI, NTU, KEEL, and synthetic datasets show that the algorithm achieves high detection accuracy and efficiency and scales well, making it applicable to outlier detection on high-dimensional, massive categorical datasets.
Keywords: outlier detection; attribute grouping; categorical data; attribute group weight; deviation factor
Classification number: TP311.13 [Automation and Computer Technology / Computer Software and Theory]
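Illustrative note: the abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea it describes, not the authors' algorithm. It assumes that (a) an attribute group's weight is the sum of the Shannon entropies of its attributes, normalized across groups, and (b) a record's outlier score is the weighted sum, over groups, of the code length `-log2(frequency)` of the record's value pattern within each group. The function names `entropy`, `group_weights`, and `outlier_scores`, the fixed grouping passed in by the caller, and the direction of the weighting are all hypothetical choices made for illustration; the deviation-factor-based merging of groups is not modeled.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def group_weights(rows, groups):
    """Assumed weighting: sum of per-attribute entropies, normalized to sum to 1."""
    sums = [
        sum(entropy([row[i] for row in rows]) for i in group)
        for group in groups
    ]
    total = sum(sums)
    if total == 0:
        # Every attribute is constant: fall back to uniform weights.
        return [1 / len(groups)] * len(groups)
    return [s / total for s in sums]


def outlier_scores(rows, groups):
    """Assumed score: weighted sum of pattern code lengths across groups."""
    weights = group_weights(rows, groups)
    n = len(rows)
    scores = [0.0] * n
    for w, group in zip(weights, groups):
        # The value pattern of each record restricted to this attribute group.
        patterns = [tuple(row[i] for i in group) for row in rows]
        freq = Counter(patterns)
        for k, p in enumerate(patterns):
            # Rare patterns have long code lengths and raise the score.
            scores[k] += w * -log2(freq[p] / n)
    return scores


if __name__ == "__main__":
    # Toy data: three identical records and one that differs everywhere.
    rows = [("a", "x", "p"), ("a", "x", "p"), ("a", "x", "p"), ("b", "y", "q")]
    groups = [(0, 1), (2,)]  # hypothetical, hand-picked attribute groups
    for row, score in zip(rows, outlier_scores(rows, groups)):
        print(row, round(score, 3))
```

In this toy example the last record receives the highest score because its value pattern is rare in every group. Whether the paper gives higher-entropy groups larger or smaller weights cannot be determined from the abstract.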