Authors: RAO Weixiong [1]; GAO Hongye; LIN Cheng; ZHAO Qinpei [1]; YE Feng
Affiliations: [1] School of Software Engineering, Tongji University, Shanghai 201804, China; [2] National Key Laboratory for Complex Systems Simulation, Beijing 100101, China
Source: Journal of Tongji University (Natural Science), 2022, No. 10, pp. 1392-1404 (13 pages)
Funding: Shanghai Municipal Science and Technology Major Project (2021SHZDZX0100); Fundamental Research Funds for the Central Universities.
Abstract: To enable interoperability between different data management systems, this paper proposes a multi-source heterogeneous data governance framework based on semi-supervised learning, and uses it to design, implement, and test an automated method for aligning unstructured data with structured data. Named entity recognition (NER) first converts the unstructured data into structured data; a string-similarity-based method and a supervised-learning-based method are then each used for schema matching on the structured data. A semi-supervised learning method matches and fuses the structured data with the corresponding entity records in the database. Finally, natural language processing (NLP) techniques and deep learning methods impute missing values in the fused dataset. Results show that, after alignment, the F1-scores on the paper dataset and the video metadata dataset reach 89.70% and 96.50%, respectively. After missing value imputation on different attributes, the imputation accuracy exceeds 78%, substantially higher than that of the baseline methods.
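Keywords: semi-supervised learning; data governance; multi-source heterogeneous data; missing value imputation; named entity recognition (NER)
Classification No.: TP391 [Automation and Computer Technology: Computer Application Technology]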
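The abstract names string-similarity-based schema matching as one step but does not say which similarity measure or threshold the authors use. The following is only a minimal sketch of the general idea, assuming Python's standard `difflib.SequenceMatcher` as the similarity measure, a hypothetical threshold of 0.6, and made-up column names; it is not the paper's implementation.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two column names."""
    # Light normalization (an assumption): lowercase, underscores to spaces.
    norm_a = a.lower().replace("_", " ").strip()
    norm_b = b.lower().replace("_", " ").strip()
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def match_schema(source_cols, target_cols, threshold=0.6):
    """Map each source column to its most similar target column.

    A source column whose best score falls below the (hypothetical)
    threshold is mapped to None, meaning no match was found.
    """
    mapping = {}
    for src in source_cols:
        best = max(target_cols, key=lambda tgt: similarity(src, tgt))
        mapping[src] = best if similarity(src, best) >= threshold else None
    return mapping


# Hypothetical column names, for illustration only.
extracted = ["paper_title", "author_name", "pub_year"]
database = ["title", "authors", "year", "venue"]
print(match_schema(extracted, database))
```

A purely lexical matcher of this kind misses columns whose names differ but whose meanings agree, which may be one reason the abstract pairs it with a supervised-learning-based method.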