Information extraction of financial announcement based on document structure and deep learning  Cited by: 10

Authors: HUANG Sheng [1,2], WANG Bo-bo, ZHU Jing

Affiliations: [1] School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; [2] Key Laboratory of Optical Communications and Networking, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; [3] Data Center, Shenzhen Securities Information Limited Company, Shenzhen, Guangdong 518000, China

Source: Computer Engineering and Design, 2020, No. 1, pp. 115-121 (7 pages)

Funding: National Natural Science Foundation of China (61371096)

Abstract: To address the difficulty of extracting structured data from financial announcements quickly and efficiently, an information extraction method based on document structure and a Bi-LSTM-CRF network model is proposed. A custom document structure tree generation algorithm is defined, and rules are used to extract the required node information from the tree. Local sentence rules based on information-sentence trigger words are constructed to extract the information sentences that contain structured field information. The extraction of structured field information is treated as a sequence labeling problem: a domain knowledge dictionary is added during word segmentation, and a Bi-LSTM-CRF neural network model is built to recognize field information. Experimental results show that the method supports structured information extraction across multiple announcement types, with average F1 scores above 91% for both information sentence extraction and field information extraction, which verifies its feasibility and practicality in product business.
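The first two stages described in the abstract (building a document structure tree, then filtering information sentences by trigger words) can be illustrated with a minimal Python sketch. This is not the authors' algorithm, which the abstract does not specify. The numbered-heading pattern, the `Node` class, the function names, and the sample announcement text are all assumptions made for illustration only.

```python
import re
from dataclasses import dataclass, field

# Assumption: headings look like "1 Title" or "1.2 Title"; depth = number of parts.
# Real announcements would need richer heading rules than this single pattern.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+\S")


@dataclass
class Node:
    title: str
    level: int
    text: list = field(default_factory=list)      # body paragraphs under this heading
    children: list = field(default_factory=list)  # nested sub-headings


def heading_level(line):
    """Return the heading depth of a line, or None for body text."""
    match = HEADING.match(line)
    return match.group(1).count(".") + 1 if match else None


def build_tree(lines):
    """Build a document structure tree using a stack of currently open headings."""
    root = Node("ROOT", 0)
    stack = [root]
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        level = heading_level(line)
        if level is None:
            stack[-1].text.append(line)  # body text belongs to the innermost heading
            continue
        while stack[-1].level >= level:  # close headings at the same or deeper level
            stack.pop()
        node = Node(line, level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def find_nodes(node, keyword):
    """Rule-based node lookup: collect nodes whose title contains the keyword."""
    hits = [node] if keyword in node.title else []
    for child in node.children:
        hits.extend(find_nodes(child, keyword))
    return hits


def information_sentences(node, triggers):
    """Keep only the sentences of a node that contain at least one trigger word."""
    sentences = []
    for paragraph in node.text:
        for sentence in re.split(r"(?<=[.;])\s+", paragraph):
            if any(trigger in sentence for trigger in triggers):
                sentences.append(sentence)
    return sentences


# Invented sample announcement, for illustration only.
lines = [
    "1 Overview",
    "General remarks.",
    "1.1 Repurchase plan",
    "The company plans to repurchase shares. "
    "The repurchase price shall not exceed 10.5 yuan per share.",
    "2 Risk warning",
    "The plan is subject to uncertainty.",
]
root = build_tree(lines)
plan = find_nodes(root, "Repurchase plan")[0]
selected = information_sentences(plan, ["repurchase price"])
```

In this sketch, the sentences selected by trigger words would then be passed to the sequence labeling stage, where a Bi-LSTM-CRF tagger marks the field values inside each sentence.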

Keywords: announcement; information extraction; neural network; document structure tree; sequence labeling

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
