Authors: Wu Shuai (武帅); Yang Xiuzhang (杨秀璋); He Lin (何琳)[1]; Gong Zuoquan (公佐权)
Affiliations: [1] College of Information Management, Nanjing Agricultural University, Nanjing 210095, China; [2] Guizhou Big Data Academy, Guizhou University, Guiyang 550025, China; [3] School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China; [4] School of Information, Guizhou University of Finance and Economics, Guiyang 550025, China
Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》), 2024, No. 12, pp. 136-148 (13 pages)
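Funding: One of the research outcomes of the Major Program of the National Social Science Fund of China (Grant No. 22&ZD262) and the Guizhou Provincial Science and Technology Program (Grant No. 黔科合基础[2020]1Y279).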
Abstract: [Objective] To advance digital humanities research, this study designs a high-precision method for recognizing entity words in ancient texts that exploits the features of their complex sentence structures. [Methods] Trigger words and relation words serve as the key feature words for identifying entity words, and sentence-pattern feature templates are designed around them. A Bert-BiLSTM-MHA-CRF model is constructed according to the characteristics of ancient texts. The syntactic features are then fused with the Bert-BiLSTM-MHA-CRF model to achieve deep, fine-grained named entity recognition on ancient texts. [Results] The proposed model achieves an F1 score of 0.88 on the test set with conventional sample annotation and 0.83 on the test set with small-sample annotation; on the transfer-learning test sets, the F1 scores are 0.79 (The Book of Songs, 《诗经》), 0.81 (Master Lü's Spring and Autumn Annals, 《吕氏春秋》), and 0.85 (Discourses of the States, 《国语》). [Limitations] The syntactic feature templates were designed from a single ancient book only, and the semantic information mining does not consider character-structure features such as phonetic annotations and radicals. [Conclusions] The proposed method also recognizes named entities in ancient texts accurately in the small-sample annotation and transfer-learning experiments, providing relatively high-quality corpus data for digital humanities research tasks.
Keywords: pre-trained model; ancient texts; named entity recognition; Bert-BiLSTM-MHA-CRF; syntactic features
Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]