Adverse Drug Reaction Detection Combined with Data Augmentation and Semi-supervised Learning

Cited by: 4

Authors: SHE Zhaoyang, YAN Xin [1,2], XU Guangyi, CHEN Wei [1,2], DENG Zhongying

Affiliations: [1] Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504, China; [2] Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650504, China; [3] Yunnan Nantian Electronic Information Industry Co., Ltd., Kunming 650041, China

Source: Computer Engineering (《计算机工程》), 2022, Issue 6, pp. 314-320 (7 pages)

Funding: National Natural Science Foundation of China (61462055, 61562049).

Abstract: Research on adverse drug reaction (ADR) detection currently relies mainly on English corpora. Chinese medical social media datasets, which suffer from a scarcity of labeled data, are seldom used, so research on Chinese medical social media remains limited. To address the scarcity of labeled data, this study proposes an ADR detection method that combines data augmentation and semi-supervised learning. The pre-trained ERNIE model is used to obtain word vectors for the text, a BiLSTM with an attention mechanism learns the vector representation of the text, and a classification layer consisting of a fully connected layer and a softmax function produces the classification label. The unlabeled data are first augmented several times. The classification model is then used to obtain a low-entropy label, computed as the weighted average of the predictions for the original sample and its augmented samples, and this label is shared by those samples as their pseudo-label. The pseudo-labeled data are then mixed with the manually labeled data. A Mixup layer is added between the encoding and classification layers of the classification model, and Mixup is applied in the text vector space to interpolate the mixed samples, which increases the number of training samples. By combining data augmentation with semi-supervised learning, the method makes full use of both labeled and unlabeled data to detect ADRs. Experimental results show that the method does not require a large amount of labeled data, alleviates the impact of insufficient labeled data on detection results, and effectively improves the performance of the ADR detection model.
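The two semi-supervised steps named in the abstract, pseudo-labeling from a weighted average of predictions and Mixup interpolation in the text vector space, can be illustrated with a minimal sketch. The abstract does not give the formulas, so several details here are assumptions borrowed from common MixMatch-style practice rather than the authors' confirmed method: the temperature-based `sharpen` function used to lower entropy, the default `temperature=0.5`, the Beta(`alpha`, `alpha`) sampling of the mixing coefficient, and taking `max(lam, 1 - lam)`. All function names are hypothetical.

```python
import random


def weighted_average(predictions, weights):
    """Weighted average of several class-probability vectors of equal length."""
    total = sum(weights)
    n_classes = len(predictions[0])
    return [
        sum(w * p[k] for p, w in zip(predictions, weights)) / total
        for k in range(n_classes)
    ]


def sharpen(probs, temperature=0.5):
    """Lower the entropy of a distribution (assumed form: p ** (1 / T), renormalized)."""
    powered = [p ** (1.0 / temperature) for p in probs]
    norm = sum(powered)
    return [p / norm for p in powered]


def guess_pseudo_label(predictions, weights, temperature=0.5):
    """Pseudo-label shared by an unlabeled sample and its augmented copies.

    `predictions` holds the classifier outputs for the original sample and
    each augmentation; `weights` are their (assumed) averaging weights.
    """
    return sharpen(weighted_average(predictions, weights), temperature)


def sample_lambda(alpha=0.75, rng=random):
    """Draw a mixing coefficient from Beta(alpha, alpha), kept at or above 0.5 (assumption)."""
    lam = rng.betavariate(alpha, alpha)
    return max(lam, 1.0 - lam)


def mixup(h_a, y_a, h_b, y_b, lam):
    """Interpolate two encoded text vectors and their (soft) labels."""
    h = [lam * a + (1.0 - lam) * b for a, b in zip(h_a, h_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return h, y


if __name__ == "__main__":
    # Predictions for one unlabeled post and two augmented versions (made-up numbers).
    preds = [[0.6, 0.4], [0.7, 0.3], [0.5, 0.5]]
    pseudo = guess_pseudo_label(preds, weights=[1.0, 1.0, 1.0])
    # Mix a labeled sample's encoding with the pseudo-labeled one.
    mixed_h, mixed_y = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], pseudo, sample_lambda())
    print(pseudo, mixed_h, mixed_y)
```

In a full model the vectors passed to `mixup` would be the outputs of the encoding layer (ERNIE, BiLSTM, and attention), and the interpolated vector would be fed to the classification layer; plain lists stand in for those tensors here.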

Keywords: medical social media; adverse drug reaction; data augmentation; semi-supervised learning; pre-trained language model

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
