Intelligent Completion of Ancient Texts Based on Pre-trained Language Models (基于预训练语言模型的古籍文本智能补全研究)  Cited by: 2


Authors: Li Jiajun (李嘉俊), Ming Can (明灿), Guo Zhihao (郭志浩), Qian Tieyun (钱铁云)[1,2], Peng Zhiyong (彭智勇), Wang Xiaoguang (王晓光)[2,3], Li Xuhui (李旭晖)[2,3], Li Jing (李静)[2,4]

Affiliations: [1] School of Computer Science, Wuhan University, Wuhan 430072, China; [2] Intellectual Computing Laboratory for Cultural Heritage, Wuhan University, Wuhan 430072, China; [3] School of Information Management, Wuhan University, Wuhan 430072, China; [4] School of History, Wuhan University, Wuhan 430072, China

Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》), 2024, No. 5, pp. 59-67 (9 pages)

Funding: This work is one of the research outcomes of a Major Project of the National Social Science Fund of China (Grant No. 21&ZD334).

Abstract: [Objective] This paper proposes a new method for completing ancient texts based on pre-trained language models. It uses representations obtained from pre-trained language models at different semantic levels and for simplified and traditional Chinese characters to build a mixture-of-experts system and a simplified-traditional Chinese fusion model. [Methods] We designed a mixture-of-experts model for transmitted texts and a simplified-traditional Chinese fusion model for excavated texts, integrating and exploiting the capabilities of the underlying models in each scenario to further improve completion performance. [Results] On self-constructed datasets of transmitted and excavated texts, the models achieved completion accuracies of 70.14% and 57.13%, respectively. [Limitations] The study relies only on natural language processing. Future work could use multimodal techniques that combine computer vision with natural language processing, integrating image and semantic information, which may yield better results. [Conclusions] The proposed models achieve relatively high accuracy on the constructed datasets of transmitted and excavated texts and offer a competitive approach to ancient text completion.

Keywords: digitization of ancient books; pre-trained language models; mixture-of-experts system

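Classification codes: G350 [Culture and Science: Information Science]; TP391 [Automation and Computer Technology: Computer Application Technology]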

 
