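Affiliation: [1] College of Information and Electrical Engineering, China Agricultural University, Beijing 100083
Source: Transactions of the Chinese Society for Agricultural Machinery (《农业机械学报》), 2020, Issue S02, pp. 335-343 (9 pages)
Funding: National Key Research and Development Program of China (2016YFD0300710).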
Abstract: Chinese named entity recognition in the agricultural diseases and pests domain (CNERADP) plays an important role in agricultural natural language processing tasks such as relation extraction, agricultural knowledge graph construction, and agricultural question answering. It still faces several problems: inherent semantic information is lost, local contextual features are easily overlooked, and long-distance dependencies are captured poorly, all of which lower accuracy and robustness. To address these problems, a model for Chinese agricultural diseases and pests named entity recognition with joint radical-embedding and self-attention (RSADP) was proposed. First, the model integrates radical embeddings into the character embeddings as input to enrich semantic information; three feature-extraction strategies were designed for the radical embeddings, namely a convolutional neural network (CNN), a bidirectional long short-term memory network (BiLSTM), and CNN-BiLSTM. Second, multiple CNN layers with different window sizes extract local contextual features at different scales. Third, on top of the global sequence features extracted by the BiLSTM layer, a self-attention mechanism further strengthens the model's ability to capture longer-distance dependencies. Finally, a conditional random field (CRF) jointly identifies entity boundaries and entity categories. Experiments were carried out on AgCNER, a self-built corpus of agricultural diseases and pests containing 11 categories and 24715 annotated samples. At the macro level, RSADP achieved precision, recall, and F1 values of 94.16%, 94.47%, and 94.32%, respectively. By category, it reached F1 values of 95.81%, 97.76%, and 97.23% on the easily identified entities crop, disease, and pest, and it kept F1 above 86% on entities that are hard to recognize, such as weed and pathogen. The results show that the proposed model recognizes named entities of agricultural diseases and pests effectively without feature engineering, with higher recognition accuracy than the other models compared and a degree of generalization ability.
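The self-attention step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention with no learned query, key, or value projections, and the function names `softmax` and `self_attention` as well as the toy input are hypothetical. Each position's output is a weighted average of all positions in the sequence, which is what lets distant characters influence one another directly.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(hidden):
    """Scaled dot-product self-attention without learned projections.

    hidden: list of T vectors (lists of d floats), standing in for
    per-character BiLSTM outputs (an assumption for illustration).
    Returns (outputs, weights), where weights[i][j] is how much
    position i attends to position j.
    """
    dim = len(hidden[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in hidden:
        # Similarity of this position to every position in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in hidden]
        row = softmax(scores)
        weights.append(row)
        # Weighted average of all position vectors.
        outputs.append([sum(w * vec[i] for w, vec in zip(row, hidden))
                        for i in range(dim)])
    return outputs, weights


# Toy sequence of three 2-dimensional vectors (made-up values).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(hidden)
```

In a full model the inputs would first pass through learned linear projections, and the attention outputs would feed the CRF layer; both are omitted here to keep the sketch self-contained.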
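Keywords: agricultural diseases and pests; named entity recognition; radical embedding; self-attention mechanism; bidirectional long short-term memory network; convolutional neural network
Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]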