Large-capacity semi-structured data extraction algorithm combining machine learning and deep learning


Authors: ZHANG Lei (张磊); JIAO Jing (焦晶); LI Bo-xin (李勃昕); ZHOU Yan-jie (周延杰) (School of Information, Xi'an University of Finance and Economics, Xi'an 710100, China; School of Economics & Management, Northwest University, Xi'an 710127, China)

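Affiliations: [1] School of Information, Xi'an University of Finance and Economics, Xi'an 710100; [2] School of Economics & Management, Northwest University, Xi'an 710127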

Source: Journal of Jilin University: Engineering and Technology Edition, 2024, No. 9, pp. 2631-2637 (7 pages)

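Funding: China (Xi'an) Silk Road Research Institute vertical project (2019HZ02); China (Xi'an) Silk Road Research Institute vertical project (2017SY05); Xi'an University of Finance and Economics horizontal project (2022250).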

Abstract: Semi-structured data are highly heterogeneous and voluminous, and their structure differs from source to source, so data extraction tends to suffer from low accuracy and completeness. To address this, the paper tightly integrates machine learning and deep learning and proposes an extraction algorithm for large-capacity semi-structured data. Principal component analysis, a machine learning method, is used to reduce the dimensionality of the data. Building on the Transformer network architecture from deep learning, the embedding layer, the encoder-decoder layers, and the encoder layer are each modified to obtain two extraction algorithms, one for named entity recognition and one for entity relation extraction, which together perform the extraction. In testing, the proposed algorithm extracts data correctly with notable effectiveness: the minimum number of invalid data items extracted is only 4, extraction complexity is low, and timeliness is high. Ablation experiments on F-value and extraction time indicate that combining the two techniques matters: the F-value stays above 92 throughout, and extraction time falls to within 125 ms. The authors conclude that the algorithm is feasible and offers an important means of improving operational efficiency and optimizing resource allocation.
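The abstract names principal component analysis as the dimensionality-reduction step but gives no implementation details. The following is only a minimal sketch of generic PCA, not the authors' code: the function name `pca`, the use of power iteration with deflation, and the toy records are all assumptions made for illustration. A real pipeline would more likely call a library routine such as scikit-learn's `PCA`.

```python
import math
import random


def pca(rows, k, iters=200, seed=0):
    """Project rows onto their top-k principal components.

    rows: list of equal-length lists of floats (one list per record).
    Returns (projected_rows, components). Hypothetical helper for illustration.
    """
    n, d = len(rows), len(rows[0])

    # Center every feature on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix (d x d).
    cov = [
        [sum(x[a] * x[b] for x in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]

    rng = random.Random(seed)
    components = []
    for _ in range(k):
        # Power iteration: repeated multiplication by cov converges to the
        # eigenvector with the largest remaining eigenvalue.
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(c * c for c in w))
            if norm < 1e-12:  # no variance left in any direction
                break
            v = [c / norm for c in w]

        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        eigval = sum(
            v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)
        )
        components.append(v)

        # Deflation: remove the found direction before the next search.
        cov = [
            [cov[a][b] - eigval * v[a] * v[b] for b in range(d)]
            for a in range(d)
        ]

    projected = [
        [sum(x[j] * c[j] for j in range(d)) for c in components]
        for x in centered
    ]
    return projected, components


# Toy records whose second feature is exactly twice the first, so a single
# component captures all of the variance.
records = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
reduced, axes = pca(records, k=1)
```

Here each two-feature record is reduced to a single coordinate along the direction of greatest variance; how many components the paper retains is not stated in the abstract.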

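Keywords: semi-structured data; machine learning; data capacity dimensionality reduction; deep learning; named entity recognition; entity relation extraction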

Classification number: G255 [Cultural Sciences: Library Science]

 
