Authors: ZHANG Lei (张磊); JIAO Jing (焦晶); LI Bo-xin (李勃昕); ZHOU Yan-jie (周延杰)
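Affiliations: [1] School of Information, Xi'an University of Finance and Economics, Xi'an 710100, China; [2] School of Economics & Management, Northwest University, Xi'an 710127, China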
Source: Journal of Jilin University: Engineering and Technology Edition (《吉林大学学报(工学版)》), 2024, No. 9, pp. 2631-2637 (7 pages)
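Funding: China (Xi'an) Silk Road Research Institute (中国(西安)丝绸之路研究院) vertical project (2019HZ02); China (Xi'an) Silk Road Research Institute vertical project (2017SY05); Xi'an University of Finance and Economics horizontal project (2022250).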
Abstract: Semi-structured data is highly heterogeneous and very large in volume, and data from different sources follows inconsistent structures, so extraction accuracy and completeness tend to be low. To address this, the paper combines machine learning and deep learning and proposes an extraction algorithm for large-volume semi-structured data. Principal component analysis, a machine learning method, is used to reduce the dimensionality of the data. Building on the deep learning Transformer network architecture, the embedding layer, the encoder-decoder layers, and the encoder layer are each modified, yielding two extraction algorithms: one for recognizing named entities in the data and one for extracting relations between data entities. Together they carry out the extraction of large-volume semi-structured data. Test results show that the proposed algorithm extracts data correctly to a high degree: the minimum number of invalid data items extracted is only 4, extraction complexity is low, and timeliness is high. Ablation experiments on F-value and extraction time indicate that combining the two techniques matters for data extraction: the F-value stays above 92 throughout, and extraction time is reduced to within 125 ms. The approach is therefore feasible and offers a useful means of improving operational efficiency and optimizing resource allocation.
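The abstract reports an F-value above 92 but does not say how it is computed. As a minimal sketch, assuming the F-value is the usual balanced F1 score (the harmonic mean of precision and recall) expressed as a percentage, it could be derived from extraction counts as shown here. The function name `f_value` and the sample counts are hypothetical and do not come from the paper.

```python
def f_value(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the F1 score as a percentage.

    Assumed definition (not confirmed by the paper): harmonic mean of
    precision and recall, scaled to the range 0-100.
    """
    predicted = true_positives + false_positives   # items the extractor returned
    actual = true_positives + false_negatives      # items that should have been returned
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0  # avoid division by zero when nothing was extracted correctly
    precision = true_positives / predicted
    recall = true_positives / actual
    return 100.0 * 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts: 950 correct items, 4 invalid items extracted, 60 items missed.
    print(round(f_value(950, 4, 60), 2))
```

Under this definition, extracting few invalid items keeps precision high, while missed items lower recall, so both kinds of error pull the F-value down.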