Research on Restricted-Domain Relation Extraction from Ancient Books and Its Application Based on Large Language Model Technology


Authors: Liu Chang[1], Zhang Qi, Wang Dongbo[1], Shen Si, Wu Mengcheng, Liu Liu, Su Yushi

Affiliations: [1] College of Information Management, Nanjing Agricultural University, Nanjing 211800; [2] School of Economics and Management, Shanxi University, Taiyuan 030006; [3] School of Economics & Management, Nanjing University of Science & Technology, Nanjing 210094

Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), 2025, No. 2, pp. 200-219 (20 pages)

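Funding: Major Program of the National Social Science Fund of China, "Research on the Construction and Application of a Cross-Language Knowledge Base of Ancient Chinese Classics" (21&ZD331).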

Abstract: Automatically extracting and structuring fine-grained knowledge units from ancient texts provides a data foundation for digital humanities research on ancient books, such as group biographies and historical maps. Extraction methods based on discriminative models are severely constrained by the semantic complexity of classical Chinese and the scarcity of training samples, which limits both extraction performance and domain transfer; the field urgently needs the support of generative artificial intelligence. This study explores restricted-domain relation extraction for ancient texts based on large language models, together with methods for automatically generating high-quality training corpora. By comparing the impact of different prompt templates on extraction performance, it shows that fine-tuning contributes significantly to model performance. Using the ChatGPT4 API service and combining self-instruct, chain-of-thought, and human feedback, the authors synthesize a restricted-domain relation extraction dataset for ancient texts. After data augmentation, the approach achieves F1 scores of 56.07% and 30.50% on two ancient-text relation extraction datasets, and its transfer ability is significantly better than that of both models trained on the full data. The study also explores the joint use of a self-instruct model and an automatic evaluation model to synthesize training corpora and evaluation information, and trains models on the synthetic data, which effectively alleviates the shortage of training data. The results indicate that using large language models to extract relation triples and synthesize training data can significantly reduce the labor cost of restricted-domain relation extraction and helps improve the efficiency of knowledge graph construction for ancient texts.
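The abstract reports triple-level F1 scores for restricted-domain extraction. As a rough illustration only, the sketch below shows one common way such a score can be computed: predicted triples are first filtered against a closed relation schema, then matched exactly against gold triples. The relation names, the `restrict` and `triple_f1` helpers, and the sample triples are all hypothetical; the abstract does not specify the paper's schema or its exact scoring protocol.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

# Hypothetical closed relation set; the paper's actual schema is not
# given in the abstract.
ALLOWED_RELATIONS = {"father_of", "born_in", "holds_office"}


def restrict(triples: Iterable[Triple],
             allowed: Set[str] = ALLOWED_RELATIONS) -> Set[Triple]:
    """Keep only triples whose relation belongs to the predefined schema."""
    return {t for t in triples if t[1] in allowed}


def triple_f1(predicted: Iterable[Triple],
              gold: Iterable[Triple]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over relation triples."""
    pred = restrict(predicted)
    gold_set = set(gold)
    correct = len(pred & gold_set)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative data only, not taken from the paper's datasets.
gold = [("Kongzi", "born_in", "Lu"), ("Kong Li", "father_of", "Kong Ji")]
predicted = [
    ("Kongzi", "born_in", "Lu"),         # correct
    ("Kongzi", "teacher_of", "Yan Hui"), # dropped: relation outside schema
    ("Kong Li", "born_in", "Lu"),        # in schema but not in gold
]
print(triple_f1(predicted, gold))
```

Filtering before scoring reflects the "restricted domain" setting, in which a generative model may emit relations outside the target schema; whether the paper penalizes or discards such outputs is not stated in the abstract.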

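Keywords: large language models; ancient book intelligence; restricted-domain relation extraction; AI-generated data; digital humanities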

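Classification: H109.2 [Language and Writing: Chinese]; TP18 [Automation and Computer Technology: Control Theory and Control Engineering]; TP391.1 [Automation and Computer Technology: Control Science and Engineering]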

 
