Study on Identification of Tibetan News Element Based on RoBERTa-BiLSTM-CRF


Authors: Xiangqian (香前), Caizang-Tai (才藏太) [1,2,3], LI Cuo (李措)

Affiliations: [1] School of Computer Science, Qinghai Normal University, Xining 810016, Qinghai, China; [2] Key Laboratory of Tibetan Information Processing, Ministry of Education, Xining 810008, Qinghai, China; [3] The State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining 810008, Qinghai, China

Source: Plateau Science Research (《高原科学研究》), 2024, No. 4, pp. 108-114 (7 pages)

Funding: National Social Science Fund of China project (23BYY078).

Abstract: News element recognition is the process of extracting key information entities such as time, location, people, organizations, and events from news texts, and it serves as the foundation for news content analysis. While significant progress has been made in Chinese news element recognition, few studies have addressed Tibetan news, and the existing element classification systems are rather coarse, making it difficult to comprehensively cover the key information in Tibetan news reports. Therefore, in this paper, the element classification of Tibetan news is refined into 10 categories. To address the challenges of Tibetan news texts, such as unclear word boundaries, numerous out-of-vocabulary words, and word polysemy, we also propose a Tibetan news element recognition method based on RoBERTa-BiLSTM-CRF. This method first encodes Tibetan news texts using the RoBERTa pre-trained language model, then extracts features through BiLSTM and a self-attention mechanism, and finally employs conditional random fields for sequence labeling to complete the recognition and classification of news elements. Experiments conducted on our self-built dataset (Tibetan news) demonstrate the effectiveness of this method, which achieves an F1 score of 88.8%.
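The final stage named in the abstract, sequence labeling with a conditional random field, can be illustrated with a minimal Viterbi decoding sketch. This is not the authors' implementation: the tag set (`O`, `B-LOC`, `I-LOC`), the emission scores, and the transition scores are hypothetical values chosen for illustration, and the abstract does not list the paper's 10 element categories. In the described pipeline the per-token emission scores would come from the RoBERTa, BiLSTM, and self-attention layers; here they are hard-coded.

```python
from typing import Dict, List, Tuple

NEG_INF = float("-inf")


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence under a linear-chain CRF.

    emissions[t][tag] is the score of `tag` at token t; transitions[(a, b)]
    is the score of moving from tag a to tag b. A transition missing from
    the dictionary is treated as forbidden.
    """
    if not emissions:
        return []
    # best[t][tag]: best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            prev_best, prev_score = tags[0], NEG_INF
            for prev in tags:
                s = best[-1][prev] + transitions.get((prev, cur), NEG_INF)
                if s > prev_score:
                    prev_best, prev_score = prev, s
            scores[cur] = prev_score + emissions[t][cur]
            pointers[cur] = prev_best
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical tag set and scores, for illustration only
TAGS = ["O", "B-LOC", "I-LOC"]
TRANSITIONS = {(a, b): 0.0 for a in TAGS for b in TAGS}
del TRANSITIONS[("O", "I-LOC")]  # BIO constraint: I-LOC cannot follow O

EMISSIONS = [
    {"O": 2.0, "B-LOC": 0.5, "I-LOC": 0.0},
    {"O": 0.2, "B-LOC": 1.0, "I-LOC": 1.5},
    {"O": 0.1, "B-LOC": 0.0, "I-LOC": 2.0},
]

decoded = viterbi_decode(EMISSIONS, TRANSITIONS, TAGS)
print(decoded)  # ['O', 'B-LOC', 'I-LOC']
```

Picking the highest-scoring tag for each token independently would give `O, I-LOC, I-LOC`, which violates the BIO scheme. Decoding over the whole sequence with transition scores avoids this, which is a common reason for placing a CRF layer on top of a neural encoder.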

Keywords: Tibetan; news elements; recognition; deep learning; RoBERTa

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
