Recognition of Cotton Pests and Diseases Named Entities Based on RoBERTa Multi-feature Fusion  Cited by: 2

Original title: 基于RoBERTa多特征融合的棉花病虫害命名实体识别


Authors: LI Dongya (李东亚); BAI Tao (白涛)[1,2,3]; XIANG Huimin (香慧敏)[4]; DAI Shuo (戴硕); WANG Zhenlu (王震鲁); CHEN Zhen (陈珍)

Affiliations: [1] College of Computer and Information Engineering, Xinjiang Agricultural University, Urumqi 830052, Xinjiang, China; [2] Intelligent Agriculture Engineering Research Center of the Ministry of Education, Urumqi 830052, Xinjiang, China; [3] Xinjiang Agricultural Informatization Engineering Technology Research Center, Urumqi 830052, Xinjiang, China; [4] Xinjiang Science and Technology College, Urumqi 830049, Xinjiang, China

Source: Journal of Henan Agricultural Sciences (《河南农业科学》), 2024, No. 2, pp. 152-161 (10 pages)

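Funding: Science and Technology Innovation 2030 Major Project of the Ministry of Science and Technology (2022ZD0115800); Major Science and Technology Special Project of Xinjiang Uygur Autonomous Region (2022A02011-4); Basic Scientific Research Funds for Universities of Xinjiang Uygur Autonomous Region, research project (XJEDU2022J009).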

Abstract: Text corpus data on cotton pests and diseases are scarce, the domain lacks a Chinese named entity recognition corpus, and the entities themselves are complex, diverse in type, and unevenly distributed. To address these problems, a Chinese entity recognition corpus, CDIPNER, covering 11 categories of cotton pest and disease entities was constructed, and a named entity recognition model based on RoBERTa multi-feature fusion was proposed. The model uses the RoBERTa pre-trained model, which has a stronger masked-learning capability, to produce character-level embedding vectors. BiLSTM and IDCNN models then jointly extract feature vectors, capturing the temporal and spatial features of the text, respectively. A multi-head self-attention mechanism fuses the extracted feature vectors, and finally the CRF algorithm generates the predicted sequence. On cotton pest and disease text, the model achieved a precision of 96.60%, a recall of 95.76%, and an F1 score of 96.18% for named entity recognition; it also performed well on public datasets such as ResumeNER. These results indicate that the model can effectively recognize cotton pest and disease named entities and has a certain generalization ability.
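
The abstract reports precision, recall, and F1 but does not state the evaluation protocol. The following is a minimal sketch, assuming entity-level exact-match scoring over BIO tags (a common convention for NER, not confirmed by the source). The function names and the tag types `PEST` and `DIS` are hypothetical; the 11 CDIPNER categories are not listed in the abstract. The sketch also shows that the reported F1 is consistent with the harmonic mean of the reported precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit for both, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def extract_spans(tags):
    """Collect (start, end, type) entity spans from a BIO tag sequence."""
    spans, start, label = set(), None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def precision_recall_f1(gold_tags, pred_tags):
    """Entity-level scores: a prediction counts only if span and type both match."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    return p, r, f1_score(p, r)


# Hypothetical tags, for illustration only.
gold = ["B-PEST", "I-PEST", "O", "B-DIS", "I-DIS", "I-DIS"]
pred = ["B-PEST", "I-PEST", "O", "B-DIS", "I-DIS", "O"]
print(precision_recall_f1(gold, pred))

# Consistency check on the figures reported in the abstract.
print(round(f1_score(96.60, 95.76), 2))
```

In the toy example the second predicted entity is one character too short, so under exact matching only one of the two entities counts as correct.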

Keywords: cotton; pests and diseases; RoBERTa model; named entity recognition; multi-feature fusion; multi-head attention mechanism

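Classification: S126 [Agricultural Sciences: Basic Agricultural Sciences]; TP391.1 [Automation and Computer Technology: Computer Application Technology]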

 
