A Text Similarity Measure Based on Heterogeneous Information Network (基于异质信息网络的文本相似性度量方法)


Authors: MA Qiuwei (马秋微), ZHAO Shuliang (赵书良)[1,2,3], ZHAO Yan (赵妍)

Affiliations: [1] College of Computer and Cyber Security, Hebei Normal University, Shijiazhuang, Hebei 050024, China; [2] Hebei Provincial Engineering Research Center for Supply Chain Big Data Analytics & Data Security, Shijiazhuang, Hebei 050024, China; [3] Hebei Provincial Key Laboratory of Cyber and Information Security, Shijiazhuang, Hebei 050024, China

Source: Journal of Chinese Information Processing (《中文信息学报》), 2023, Issue 9, pp. 108-120 (13 pages)

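Funding: National Social Science Fund of China (13&ZD091, 18ZDA200); Hebei Province Key Research and Development Program (20370301D); Hebei Normal University Major Key Technology Research Project (L2020K01).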

Abstract: Text similarity measurement has a broad impact on text-based classification, clustering, and ranking. Most existing text similarity measures rely on a single granularity of text features and ignore the structured information contained in unstructured text data. This paper recasts text similarity measurement as node similarity measurement in a weighted heterogeneous information network, and uses the structural and semantic properties of meta-paths to measure the explicit semantic similarity of texts, which makes the results more accurate and more interpretable. First, a world knowledge base is used to expand the granularity of text features and to construct a weighted text heterogeneous information network, so that unstructured text is represented as a structured heterogeneous information network. Second, meta-paths are mined, and a meta-path-based ω-PageRank-Nibble subgraph partitioning algorithm is proposed to obtain a local graph containing a given set of text nodes. From this local graph, the commuting matrix of each specific meta-path is computed and stored, which reduces the time and space cost of the subsequent similarity measurement. Finally, the AllPathSim coupled similarity measure is proposed to measure the similarity of text-type nodes. For graph pruning, partitioning a subgraph with the meta-path-based ω-PageRank-Nibble algorithm significantly reduces time and space costs compared with processing the whole graph. For similarity measurement, compared with the best contemporaneous measure for nodes of the same type, AllPathSim improves the correlation coefficient of the measurement results by 6.1% and 6.9% on the 20NG and GCAT datasets, respectively.
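The abstract does not define AllPathSim, so the sketch below is only an illustration of the general idea it builds on: computing a commuting matrix for a meta-path and scoring same-type nodes from it. It uses the standard PathSim score as a stand-in, not the authors' measure. The toy weight matrix `W`, the Document-Term-Document meta-path, and all function names are assumptions introduced here for illustration.

```python
# Toy weighted heterogeneous network: 3 documents x 4 terms.
# W[i][j] is a hypothetical weight of term j in document i (for example TF-IDF).
W = [
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def commuting_matrix(adjacencies):
    """Multiply the adjacency matrices along a meta-path, left to right."""
    result = adjacencies[0]
    for adj in adjacencies[1:]:
        result = matmul(result, adj)
    return result


def pathsim(m, i, j):
    """Standard PathSim score for a symmetric meta-path with commuting matrix m."""
    denom = m[i][i] + m[j][j]
    return 2.0 * m[i][j] / denom if denom else 0.0


# Meta-path Document-Term-Document: M = W * W^T (assumed for this example).
M = commuting_matrix([W, transpose(W)])
sim_01 = pathsim(M, 0, 1)  # documents 0 and 1 share weighted terms
sim_02 = pathsim(M, 0, 2)  # documents 0 and 2 share no terms
```

Computing `M` once and reusing it for every document pair mirrors the abstract's point about storing the commuting matrix to save time in later similarity queries; restricting `W` to a local subgraph first would shrink the matrices further.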

Keywords: similarity measure; weighted heterogeneous information network; meta-path; text mining

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
