Chinese-Tibetan Bilingual Named Entity Recognition for Traditional Tibetan Festivals  Cited by: 3


Authors: Deng Yuyang; Wu Dan [1,2]

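Affiliations: [1] School of Information Management, Wuhan University, Wuhan 430072, China [2] Center for Studies of Human-Computer Interaction and User Behavior, Wuhan University, Wuhan 430072, China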

Source: Data Analysis and Knowledge Discovery, 2023, No. 7, pp. 125-135 (11 pages)

Funding: One of the research outputs of a Major Project of the National Social Science Fund of China (Grant No. 19ZDA341).

Abstract: [Objective] This paper examines how pre-trained language models perform on a low-resource language, in support of building Tibetan knowledge graphs and semantic retrieval. [Methods] We collected Chinese and Tibetan text on traditional Tibetan festivals from news websites including People's Daily Online and its Tibetan edition. We then compared several pre-trained language models against word embeddings on named entity recognition in the Chinese-Tibetan bilingual setting, and analyzed how two feature-processing layers of the NER model (the BiLSTM layer and the CRF layer) affect the results. [Results] Compared with word embeddings, the Chinese and Tibetan pre-trained language models improved F1 on this task by 0.0108 and 0.0590, respectively. In particular, when entities are scarce, the pre-trained models extract more textual information than word embeddings and shorten training time by 40%. [Limitations] The Tibetan and Chinese data are not parallel corpora, and the Tibetan data contain fewer entities than the Chinese data. [Conclusions] Pre-trained language models are effective not only on Chinese text but also on Tibetan, a low-resource language.

Keywords: named entity recognition; traditional Tibetan culture; pre-trained language models

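Classification: TP391 [Automation and Computer Technology: Computer Application Technology]; G350 [Automation and Computer Technology: Computer Science and Technology]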
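
The abstract reports its results as F1 differences (0.0108 for Chinese, 0.0590 for Tibetan) but does not say how F1 was computed. As a point of reference only, the sketch below shows one common convention for NER: entity-level F1 over BIO-tagged sequences, where a predicted entity counts as correct only if its type and both boundaries match the gold annotation. The BIO scheme, the tag names `FES` and `LOC`, and the function names are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Optional, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence.

    Assumption: an I- tag that does not continue an open entity of the
    same type is ignored. Some scorers treat it as an entity start instead.
    """
    spans: Set[Span] = set()
    start: Optional[int] = None
    etype: Optional[str] = None
    # The trailing "O" is a sentinel that closes an entity ending the sentence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged entity-level F1 over a corpus of tagged sentences."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = extract_spans(g_tags), extract_spans(p_tags)
        tp += len(g & p)  # exact match on type and both boundaries
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical tags: FES = festival, LOC = location.
    gold = [["B-FES", "I-FES", "O", "B-LOC"]]
    pred = [["B-FES", "I-FES", "O", "O"]]
    print(entity_f1(gold, pred))
```

Under this convention a single missed entity in a small test set moves F1 noticeably, which is worth keeping in mind when reading the Tibetan result given the stated limitation that the Tibetan data contain fewer entities.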

 
