The Parallel Entity Resolution Algorithm Based on Learning in Data Warehouse (数据仓库下基于学习的并行实体解析算法研究)


Authors: Liu Ye (刘叶), Wu Sheng (吴晟)[1], Wu Xingjiao (吴兴蛟), Zhou Haihe (周海河)[1], Li Yingna (李英娜)[1], Zhang Jing (张晶)[1]

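Affiliation: [1] Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China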

Source: Software Guide (《软件导刊》), 2018, No. 2, pp. 19-22, 27 (5 pages)

Abstract: Traditional entity resolution algorithms run on a single machine and rely on manually set attribute weights and thresholds. This makes the recognition result heavily dependent on human experience and makes fast, effective processing of massive data difficult. To address this, the study uses the MapReduce computing model on the Hadoop framework in a multi-node distributed environment. A network is adjusted iteratively to learn the intrinsic relationships among attributes, together with parameters such as the attribute weights and the threshold. The resulting model is then validated on a real data set stored in a Hive data warehouse. Experiments were run with 5,000 and with 9,000 records, and the results show that the learning-based parallel entity resolution algorithm achieves high accuracy, recall, and F1 values. The authors conclude that the algorithm can process massive data quickly and effectively, reduces the errors introduced by manual experience, and improves both the accuracy of the recognition result and the efficiency of recognition.

Keywords: data warehouse; data quality; entity resolution; self-learning; parallel computing

Classification number: TP312 [Automation and Computer Technology: Computer Software and Theory]
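
The abstract describes learning attribute weights and a match threshold instead of setting them by hand, then scoring the result by recall and F1. The sketch below illustrates that idea on a single machine only. It is not the authors' network and it does not use MapReduce, Hadoop, or Hive: a single-layer logistic model stands in for the learning step, the learned bias plays the role of the threshold, and every function name here (`similarity_vector`, `train`, `is_match`, `precision_recall_f1`) is a hypothetical helper introduced for illustration.

```python
import math
from difflib import SequenceMatcher


def similarity_vector(rec_a, rec_b, attributes):
    """Per-attribute string similarity in [0, 1] for one candidate record pair."""
    return [SequenceMatcher(None, rec_a[k], rec_b[k]).ratio() for k in attributes]


def train(pairs, labels, epochs=2000, lr=0.5):
    """Learn one weight per attribute plus a bias (the learned threshold).

    Assumption: plain stochastic gradient descent on the logistic loss,
    used here as a stand-in for the network described in the abstract.
    """
    weights, bias = [0.0] * len(pairs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(pairs, labels):
            z = sum(w * v for w, v in zip(weights, x)) + bias
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def is_match(x, weights, bias):
    """Two records are declared the same entity when the weighted score is positive."""
    return sum(w * v for w, v in zip(weights, x)) + bias > 0.0


def precision_recall_f1(predicted, actual):
    """Standard pairwise precision, recall, and F1 for match decisions."""
    tp = sum(bool(p and a) for p, a in zip(predicted, actual))
    fp = sum(bool(p and not a) for p, a in zip(predicted, actual))
    fn = sum(bool(not p and a) for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

In a distributed setting, one plausible arrangement is for each mapper to compute similarity vectors for its share of candidate pairs while the learned weights are broadcast to all nodes; the record does not say how the authors divided the work, so that arrangement is an assumption as well.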

 
