Authors: 韦海浪, 李贵[1], 李征宇[1], 韩子扬[1], 曹科研
Source: 《数据挖掘》 (Hans Journal of Data Mining), 2020, No. 1, pp. 16-29 (14 pages)
Abstract: Entity resolution (ER) identifies records that refer to the same entity within one or more data sources. Because directly comparing every pair of records across multiple data sources is computationally expensive, blocking is usually applied. Since most Web data sources have unknown schemas, meta-blocking techniques are commonly used; they reduce the chance of missing possible matches but increase the chance of placing non-matching entity records in the same block. To address this, an attribute-matching induction method based on locality-sensitive hashing is proposed, which partitions attributes from multiple large Web datasets into matched groups and removes redundant comparisons between attributes. A meta-blocking technique based on an aggregate-entropy weighted graph is then used to improve the blocking quality for Web data sources, remove redundant comparisons between entity records within blocks, and reduce the complexity of the algorithm. Finally, experiments on real datasets verify the effectiveness of the algorithm.
Classification: TP3 [Automation and Computer Technology - Computer Science and Technology]
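
The blocking and meta-blocking steps named in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes plain token blocking over schema-agnostic records, weights each record pair by the number of blocks the two records share (instead of the aggregate-entropy weighting the abstract describes), and prunes pairs below the mean weight. The function names and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(records):
    """Map each token to the set of record ids whose values contain it."""
    blocks = defaultdict(set)
    for rid, record in records.items():
        # Attribute names are ignored, so no schema knowledge is needed.
        for value in record.values():
            for token in str(value).lower().split():
                blocks[token].add(rid)
    # A block with a single record yields no comparisons, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def blocking_graph(blocks):
    """Weight each record pair by the number of blocks the pair shares.

    Assumption: a simple common-block count stands in for the
    entropy-based weighting mentioned in the abstract.
    """
    weights = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            weights[(a, b)] += 1
    return dict(weights)


def prune_edges(weights):
    """Keep only pairs whose weight is at least the mean edge weight."""
    if not weights:
        return {}
    mean = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= mean}


# Hypothetical records from two sources with different attribute names.
records = {
    "a1": {"name": "Apple iPhone 11", "color": "black"},
    "b1": {"title": "iPhone 11 Apple", "colour": "black"},
    "b2": {"title": "Samsung Galaxy S10", "colour": "black"},
}
candidates = prune_edges(blocking_graph(token_blocking(records)))
print(candidates)  # {('a1', 'b1'): 4}
```

In this sample, the shared token "black" alone would put all three records in one block and force three comparisons; after pruning, only the pair `('a1', 'b1')` remains as a candidate match.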