Authors: CHEN Huafeng (陈华凤), DONG Yongquan (董永权) [2], YANG Haolin (杨昊霖), ZHANG Guoxi (张国玺)
Affiliations: [1] Department of Information Construction and Management, Jiangsu Normal University, Xuzhou 221116, China; [2] College of Computer Science and Technology, Jiangsu Normal University, Xuzhou 221116, China
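Source: Journal of Data Acquisition and Processing (《数据采集与处理》), 2023, No. 3, pp. 629-642 (14 pages)
Funding: National Natural Science Foundation of China (61872168); Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX20_2382).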
Abstract: Truth discovery is one of the challenging research hotspots in the field of data integration. Traditional methods infer the truth from the interactions between data sources and observed values, and therefore lack sufficient feature information. Deep learning-based methods can extract features effectively, but their performance depends on a large amount of manual annotation, and large numbers of high-quality truth labels are difficult to obtain in practical applications. To overcome these problems, this paper proposes an unsupervised truth discovery method based on multi-feature fusion (MFUTD). First, ensemble learning is used to assign "truth" labels without supervision. Then, the pre-trained BERT model and one-hot encoding are used to obtain the semantic features and the interaction features of the observed values, respectively. Finally, the multiple features of the observed values are fused and combined with their "truth" labels to construct an initial training set, and a truth prediction model is trained by self-training. Experimental results on two real-world data sets show that the proposed method achieves higher truth discovery accuracy than existing methods.
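Keywords: Web data integration; semi-supervised learning; data cleaning; truth discovery; data source quality
Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]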