Outlier Document Detection via Optimal Transport and k-Nearest Neighbors  Cited by: 1


Authors: SHUI Ze-nong, ZHANG Xing-yu, SHA Chao-feng [1]

Affiliation: [1] School of Computer Science, Fudan University, Shanghai 200433, China

Source: Computer Science (《计算机科学》), 2021, No. 7, pp. 105-111 (7 pages)

Funding: National Key Research and Development Program of China (2018YFB0904503).

Abstract: Outlier (anomaly) detection is an active research topic in data mining and machine learning. Researchers have proposed many outlier detection methods and applied them to problems such as intrusion detection and anomalous transaction detection. However, most of these methods target tabular or time-series data and cannot be applied directly to outlier document detection. Existing proximity-based outlier document detection methods generally measure outlierness by the distance between a document and the entire document collection, so they cannot find documents that are outliers only from a local perspective, and the Euclidean distance they use may fail to capture semantic proximity between documents. Methods based on probabilistic models are overly complex and likewise define outlierness only globally. To address these problems, this paper proposes a new proximity-based outlier document detection method. The method introduces optimal transport distances: using the semantic information in the documents' word embedding vectors, it measures the distance between documents with an optimal transport algorithm; it also uses the LDA topic model to build a hierarchical abstraction of the text, computing distances between topics by optimal transport and then deriving document distances from them. Based on these two optimal transport distances, the distance between a document and its k-nearest-neighbor documents is used to measure that document's degree of outlierness. The method thus defines outlierness from a local perspective, and its document distances reflect the semantic proximity between documents. Detailed comparative experiments on two open-source document datasets show that the proposed method outperforms baseline outlier document detection methods on four evaluation metrics. The experiments also examine the effectiveness of the k-nearest-neighbor-based definition of outlier documents and the influence of the choice of k.
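The scoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: each document is a list of toy 2-D "word vectors" with uniform word weights, the transport cost is approximated with entropy-regularised Sinkhorn iterations rather than an exact solver, and the outlier score is taken to be the mean distance to the k nearest documents (the abstract does not say how the k distances are aggregated). The LDA-based hierarchical distance is not covered. All function names are hypothetical.

```python
import math

def sinkhorn_cost(a, b, cost, reg=0.1, n_iter=200):
    """Approximate the optimal transport cost between histograms a and b.

    Assumption: entropy-regularised Sinkhorn iterations stand in for an
    exact optimal transport solver.
    """
    n, m = len(a), len(b)
    # Gibbs kernel derived from the ground-cost matrix.
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan is u[i] * kernel[i][j] * v[j]; return its total cost.
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))

def document_distance(doc_x, doc_y):
    """Transport-based distance between two bags of word vectors.

    Assumption: uniform word weights and Euclidean ground cost.
    """
    a = [1.0 / len(doc_x)] * len(doc_x)
    b = [1.0 / len(doc_y)] * len(doc_y)
    cost = [[math.dist(x, y) for y in doc_y] for x in doc_x]
    return sinkhorn_cost(a, b, cost)

def knn_outlier_scores(docs, k):
    """Score each document by its mean distance to its k nearest documents.

    Assumption: the mean is used as the aggregate; a larger score means
    the document is more isolated from its local neighbourhood.
    """
    scores = []
    for i, doc in enumerate(docs):
        dists = sorted(document_distance(doc, other)
                       for j, other in enumerate(docs) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Toy data: three near-identical documents and one far away from them.
docs = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 0.1), (1.0, 0.1)],
    [(0.1, 0.0), (0.9, 0.0)],
    [(3.0, 3.0), (4.0, 3.0)],
]
scores = knn_outlier_scores(docs, k=2)
most_outlying = scores.index(max(scores))
```

In this toy setting the fourth document should receive the highest score, because even its nearest neighbours are far away in the embedding space. With real data, the toy vectors would be replaced by pretrained word embeddings and the Sinkhorn routine by whichever optimal transport solver is preferred.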

Keywords: outlier document detection; optimal transport; word mover's distance; hierarchical optimal topic transport

Classification number: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
