中文异构百科知识库实体对齐被引量：8

Entity alignment of Chinese heterogeneous encyclopedia knowledge base

出　　处：《计算机应用》2016年第7期1881-1886,1898,共7页journal of Computer Applications

基　　金：国家自然科学基金资助项目(61573292;61572407);中央高校基本科研业务费专项(2682015CX070)~~

摘　　要：针对传统实体对齐方法在中文异构网络百科实体对齐任务中效果不够显著的问题,提出一种基于实体属性与上下文主题特征相结合的实体对齐方法。首先,基于百度百科及互动百科数据构造中文异构百科知识库,通过统计方法构造资源描述框架模式(RDFS)词表,对实体属性进行规范化;其次,抽取实体上下文信息,对其进行中文分词后,利用主题模型对上下文建模并通过吉布斯采样法求解模型参数,计算出主题-单词概率矩阵,提取特征词集合及对应特征矩阵;然后,利用最长公共子序列(LCS)算法判定实体属性相似度,当相似度位于下界与上界之间时,进一步结合百科类实体上下文主题特征进行判定;最后,依据标准方法构造了一个异构中文百科实体对齐数据集进行仿真实验。实验结果表明,与经典的属性相似度算法、属性加权算法、上下文词频特征模型及主题模型算法进行比较,所提出的实体对齐算法在人物领域和影视领域的准确率、召回率与综合指标F值分别达到97.8%、88.0%、92.6%和98.6%、73.0%、83.9%,比其他方法均有较大的提高。实验结果验证了在构建中文异构百科知识库场景中,所提算法可以有效提升中文百科实体对齐效果,可应用到具有上下文信息的实体对齐任务中。Aiming at the problem that the traditional entity alignment algorithm may lead to bad performance in entity alignment task of Chinese heterogeneous encyclopedia knowledge base, an entity alignment method based on entity attributes and the features of context topics was proposed. First, a Chinese heterogeneous encyclopedia knowledge base was constructed based on Baidu encyclopedia and Hudong encyclopedia data. Next, the Resource Description Framework Schema（ RDFS）vocabulary list was made to normalize the entity attributes. Then the entity context information was extracted and the Chinese word segmentation was used on the contexts. The contexts were modelled by using the topic model and the parameters were computed by Gibbs sampling method. After that the topic-word probability matrix, the characteristic word collection and the corresponding feature matrix were calculated. Last, the Longest Common Subsequence（ LCS） algorithm was used to compute the entity attribute similarity. When the similarity was between the lower and the upper bounds, the topic features of the entities＇ context were combined to resolve the entity alignment problem. Finally, according to the standard method, an entity alignment data set of Chinese heterogeneous encyclopedia was constructed for simulation experiments. In comparison with the traditional property similarity algorithm, weighted-property algorithm, context term frequency feature model and topic model algorithm, the experimental results show that the proposed method achieves 97. 8% accuracy, 88. 0% recall, 92. 6% F-score in people class and 98. 6% accuracy, 73. 0% recall, 83. 9% F-score in movie class. It outperformed the other entity alignment algorithms. The experimental results also indicate that the proposed method can improve the entity alignment results in constructing the Chinese heterogeneous encyclopedia knowledge base, and it can be applied to the entity alignment tasks with context information.

关键词：知识库实体对齐主题模型资源描述框架模式最长公共子序列算法

分类号：TP391.1[自动化与计算机技术—计算机应用技术]

参考文献：

正在载入数据...

二级参考文献：

正在载入数据...

耦合文献：

正在载入数据...

引证文献：

正在载入数据...

二级引证文献：

正在载入数据...

同被引文献：

正在载入数据...

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

中文异构百科知识库实体对齐被引量：8

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

高级检索检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

中文异构百科知识库实体对齐 被引量：8

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

用户登录

高级检索检索式检索

中文异构百科知识库实体对齐被引量：8