检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
作 者:水泽农 张星宇 沙朝锋[1] SHUI Ze-nong;ZHANG Xing-yu;SHA Chao-feng(School of Computer Science,Fudan University,Shanghai 200433,China)
机构地区:[1]复旦大学计算机科学技术学院,上海200433
出 处:《计算机科学》2021年第7期105-111,共7页Computer Science
基 金:国家重点研发计划(2018YFB0904503)。
摘 要:离群点或异常检测是数据挖掘和机器学习等领域的研究热点之一,研究人员已提出了多种离群点检测方法,并将其应用于入侵检测和异常交易检测等问题。但多数离群点检测方法主要针对表数据或时间序列数据等,无法直接应用于离群文档检测。现有基于相近性的离群文档检测方法一般用文档与整个文档集的距离来衡量离群性,无法发现基于局部考量的离群文档,而且采用欧几里德距离可能无法刻画出文档间的语义相近性。基于概率模型的离群文档检测方法过于复杂,并且同样只从全局来定义文档的离群值。针对这些问题,文中提出了一种新的基于相近性的离群文档检测方法。该方法引入最优输运距离,基于利用文档词嵌入向量的语义信息,在文档之间使用最优输运算法以度量距离,并利用LDA主题模型对文本进行层级抽象,通过最优输运算法算出主题之间的距离后,再计算文档距离,文中基于这两种最优运输距离计算文档与它的k近邻文档之间的距离来衡量该文档的离群程度。该方法从局部视角来定义文档的离群性,所采用的文档距离能体现文档之间的语义相近性。在两个开源数据集上进行了较细致的对比实验,实验结果显示,所提方法在多个指标上优于基准离群文档检测方法;还检验了基于k近邻离群文档定义的有效性以及k值的选取对结果的影响。Outlier or anomaly detection is one of the research hotspots in areas such as data mining and machine learning,and researchers have proposed a variety of outlier detection methods that can be applied to problems such as intrusion detection and anomalous transaction detection.However,most outlier detection methods mainly target tabular data or time series data,etc.and cannot be directly applied to outlier document detection.Existing outlier detection methods based on proximity generally measure proximity by the distance of a document to the entire document set,failing to find outliers based on local considerations,and may not be able to characterize semantic proximity between documents using Euclidean distance.Probabilistic model-based outlier do-cument detection methods are too complex and define document outliers only globally.In response to these questions,this paper proposes a new proximity-based outlier document detection method where we measure the outlier of a document by the distance between the document and its k-nearest neighbor document.We introduce the optimal transport algorithm to calculate the distance between documents,based on the semantic information of the document obtained from word embedding vector and the topic model.The method defines document outliers from a local perspective,using document distances that reflect the semantic proxi-mity between documents.This paper conducts extensive experiments on two open source document datasets,and the results show that the proposed methods outperform the benchmark outlier document detection methods in terms of four evaluation metrics.Experiments also demonstrate the effectiveness of proposal of k-nearest neighbor based outliers and the impact of value k.
关 键 词:离群文档检测 最优输运 词搬动距离 层次型最优主题输运
分 类 号:TP311[自动化与计算机技术—计算机软件与理论]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:216.73.216.7