Multi-type Streaming Document Structure Recognition Based on Clustering and Bidirectional Gated Recurrent Unit-Conditional Random Field


Authors: WANG Juan (王娟); LI Ning (李宁)[1]; JIANG Yu-tong (姜雨彤)[1]; TIAN Ying-ai (田英爱)[1]

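Affiliation: [1] Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China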

Source: Science Technology and Engineering (《科学技术与工程》), 2021, No. 17, pp. 7208-7216 (9 pages)

Funding: National Natural Science Foundation of China (61672105).

Abstract: Structure recognition for streaming documents plays an important role in automatic document typesetting and optimization, information retrieval, and related fields. Previous work on streaming document structure recognition has focused mainly on academic papers, with little research on other document types such as official documents and reports. To address this gap, documents are first grouped by clustering. On that basis, a structure recognition method based on the bidirectional gated recurrent unit-conditional random field (BIGRU-CRF) is proposed for each document category, in order to solve the problem of multi-type document structure recognition. Experimental results show that the method not only improves structure recognition for academic papers but also recognizes the structure of other document types well.
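The two-stage idea in the abstract (cluster the documents, then apply a category-specific recognizer) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: this record does not state which clustering algorithm or document features the authors use, so plain k-means over hypothetical numeric feature vectors is swapped in, and the per-cluster `taggers` are placeholder callables standing in for trained BIGRU-CRF models rather than implementations of them.

```python
from math import dist
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical stand-in for a trained per-cluster BIGRU-CRF model:
# takes a document's paragraphs and returns one structure label each.
Tagger = Callable[[List[str]], List[str]]


def nearest(point: Vector, centroids: List[List[float]]) -> int:
    """Return the index of the centroid closest to `point` (Euclidean)."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def kmeans(points: List[Vector], k: int, iters: int = 20) -> List[List[float]]:
    """Plain Lloyd's k-means (an assumed choice, not taken from the paper).

    The first k points seed the centroids so the result is deterministic.
    """
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group;
        # an empty group keeps its previous centroid.
        updated = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:  # assignments have stabilized
            break
        centroids = updated
    return centroids


def recognize(
    doc_features: Vector,
    paragraphs: List[str],
    centroids: List[List[float]],
    taggers: Dict[int, Tagger],
) -> Tuple[int, List[str]]:
    """Route a document to its cluster's tagger and return (cluster, labels)."""
    cluster = nearest(doc_features, centroids)
    return cluster, taggers[cluster](paragraphs)
```

In use, `kmeans` would be fit once on feature vectors of a training corpus, one tagger would be trained per resulting cluster, and `recognize` would then be called for each new document. The design point is only that the category decision happens before sequence labeling, so each tagger sees documents of a single layout style.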

Keywords: streaming document; structure recognition; clustering; multi-type document

CLC number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
