基于模式感知元分块技术的Web实体解析算法

Web Entity Resolution Algorithm Based on Schema-Aware Meta-Blocking Technology

出　　处：《数据挖掘》2020年第1期16-29,共14页Hans Journal of Data Mining

摘　　要：实体解析ER (Entity Resolution)是识别一个或多个数据源中同一实体记录。对于在多数据源中直接比较每对记录计算复杂度较大的问题,通常采用分块的方法。由于在Web数据源中大部分是模式未知的,通常采用元分块技术,虽然减少了丢失可能的匹配,但是增加了在同一块中放置不匹配实体记录的可能性。为此提出了一种基于局部敏感哈希的属性匹配归纳法从多个Web大数据集中对属性进行匹配划分,去除了属性间冗余的比较;然后通过一种基于聚合熵加权图的元分块技术,来提高Web数据源的分块质量,去除了分块中实体记录之间多余的比较,降低了算法的复杂度。最后采用实际数据集进行实验验证了该算法的有效性。Entity Resolution is the identification of the same Entity record in one or more data sources. For problems with high complexity of directly comparing each pair of records in multiple data sources, chunking is usually adopted. Since most of the Schema are unknown in Web data sources, Me-ta-blocking techniques are commonly used, which reduce the possibility of missing matches but in-crease the possibility of placing mismatched entity records in the same block. To solve the above problems, an attribute matching induction method based on locally sensitive hashing is proposed to conduct attribute’s matching division from multiple Web big data sets to remove redundant com-parison among attributes. Then, a block technique based on aggregation entropy weighted graph is used to improve the block quality of Web data sets, remove redundant comparisons in the blocks and reduce the complexity of the algorithm. Finally, the effectiveness of the algorithm is verified by experiments with actual data sets.

关键词：实体解析聚合熵元分块局部敏感哈希

分类号：TP3[自动化与计算机技术—计算机科学与技术]

参考文献：

正在载入数据...

二级参考文献：

正在载入数据...

耦合文献：

正在载入数据...

引证文献：

正在载入数据...

二级引证文献：

正在载入数据...

同被引文献：

正在载入数据...

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

基于模式感知元分块技术的Web实体解析算法

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

高级检索检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

基于模式感知元分块技术的Web实体解析算法

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

用户登录

高级检索检索式检索