Research on Information Extraction Methods for Historical Classics from the Perspective of Digital Humanities (cited by: 6)

Authors: HAN Lifan (韩立帆); JI Zijing (季紫荆); CHEN Zirui (陈子睿); WANG Xin (王鑫) [1,2]

Affiliations: [1] College of Intelligence and Computing, Tianjin University, Tianjin 300350, China; [2] Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin 300350, China

Source: Big Data Research (《大数据》), 2022, No. 6, pp. 26-39 (14 pages)

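Funding: Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (No. 2020AAA0108504); National Natural Science Foundation of China (No. 61972275).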

Abstract: Digital humanities aims to apply modern computing and network technology to traditional humanities research. Historical classics written in classical Chinese are an important basis for historical research and study, but classical Chinese differs substantially from modern vernacular Chinese in both grammar and word meaning, which makes these texts hard to read and understand. To address this problem, a method based on a pre-trained model is proposed for extracting knowledge such as entities and relations from historical classics, so that the rich information contained in these texts can be obtained effectively. The model first replaces BERT's original pre-training tasks with multi-level pre-training tasks in order to capture semantic information fully, and adds structures such as convolutional layers and sentence-level aggregation on top of the BERT model to further refine the generated word representations. Then, to address the scarcity of annotated classical Chinese data, a crowdsourcing system for annotating historical classics was built. It was used to obtain high-quality, large-scale entity and relation data and to construct a classical Chinese knowledge extraction dataset, which serves both to evaluate the model and to fine-tune it. Experiments on the constructed dataset and on the GulianNER dataset demonstrate the effectiveness of the proposed model.
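The abstract mentions adding convolutional layers and sentence-level aggregation on top of BERT to refine word representations, but gives no details. The following is only a minimal sketch of that general idea, not the authors' architecture. It assumes the token vectors stand in for BERT outputs, that the convolution uses one fixed scalar kernel per position (a trained layer would learn weight matrices), and that the aggregation is mean pooling concatenated onto each token. The names `conv1d_same`, `mean_pool`, and `refine` are hypothetical.

```python
from typing import List

Vector = List[float]


def conv1d_same(tokens: List[Vector], kernel: List[float]) -> List[Vector]:
    """Convolve along the token axis with zero padding (output length = input length).

    Assumption: the same scalar kernel is applied to every feature dimension.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel width must be odd for 'same' padding")
    half = len(kernel) // 2
    dim = len(tokens[0]) if tokens else 0
    out: List[Vector] = []
    for i in range(len(tokens)):
        vec = [0.0] * dim
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(tokens):  # positions outside the sentence act as zeros
                for d in range(dim):
                    vec[d] += weight * tokens[j][d]
        out.append(vec)
    return out


def mean_pool(tokens: List[Vector]) -> Vector:
    """Sentence-level aggregation, assumed here to be a plain mean over tokens."""
    count = len(tokens)
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / count for d in range(dim)]


def refine(tokens: List[Vector], kernel: List[float]) -> List[Vector]:
    """Return each convolved token vector concatenated with the sentence vector."""
    if not tokens:
        return []
    local = conv1d_same(tokens, kernel)
    sentence = mean_pool(local)
    return [vec + sentence for vec in local]


# Three toy 2-dimensional "token embeddings" (placeholders, not real BERT outputs).
toy_tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]
refined = refine(toy_tokens, kernel=[0.5, 1.0, 0.5])
```

Each refined vector here has twice the input dimension: the first half carries local context from neighbouring tokens, the second half carries the same sentence-level summary for every token.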

Keywords: historical classics; pre-trained model; information extraction; crowdsourcing mechanism

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]
