Authors: 郑承宇 (ZHENG Cheng-yu); 王新 (WANG Xin)[1]; 王婷 (WANG Ting); 徐权峰 (XU Quan-feng)
Affiliation: [1] School of Mathematics and Computer Science, Yunnan Minzu University, Kunming 650500, Yunnan, China
Source: Computer Technology and Development (《计算机技术与发展》), 2022, No. 4, pp. 28-33 (6 pages)
Funding: National Natural Science Foundation of China (61363022); Scientific Research Fund of the Yunnan Provincial Department of Education (2021Y670).
Abstract: To address the sparse semantics and excessively high dimensionality of medical text, a multi-label medical text classification algorithm based on transfer learning and ensemble learning, named TLCM (Trans-LSTM-CNN-Multi), is proposed. First, the multi-layer bidirectional Transformer structure inside the ALBERT (A Lite BERT) model is trained on a large-scale corpus to obtain dynamic character-level vector representations of general-domain text. Then, a target data set from the medical field is used, through transfer learning and model fine-tuning, to enhance the text semantics of the ALBERT pre-trained language model for the medical domain. On this basis, the semantically enhanced model obtained through transfer learning is fed into a Bi-LSTM-CNN ensemble learning module to further extract the important information features of the medical text content. Finally, a multi-label text classifier is constructed on a binary cross-entropy loss function to classify the medical text. Experimental results show that the TLCM algorithm, by combining transfer learning and ensemble learning, effectively improves classification performance on medical text, reaching an overall F1 value of 91.8% on a Chinese health question data set.
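The final step of the abstract, a multi-label classifier built on a binary cross-entropy loss, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-sample averaging over labels, and the 0.5 decision threshold are assumptions chosen for illustration, and the logits stand in for whatever the Bi-LSTM-CNN module would output. Each label is treated as an independent yes/no decision, which is what allows one health question to carry several labels at once.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_cross_entropy(logits, targets, eps=1e-12):
    """Mean binary cross-entropy over the labels of one sample.

    logits  -- one raw score per label (hypothetical model output)
    targets -- one 0/1 indicator per label
    """
    total = 0.0
    for z, y in zip(logits, targets):
        # Clamp the probability so log() never receives 0.
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)


def predict_labels(logits, threshold=0.5):
    """Assign every label whose probability reaches the threshold (assumed 0.5)."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


# Example: three candidate labels for a single question.
scores = [2.0, -1.0, 0.5]
print(predict_labels(scores))
print(binary_cross_entropy(scores, [1, 0, 1]))
```

Because each label has its own sigmoid rather than sharing a softmax, the predicted probabilities need not sum to one, so any number of labels can be active for the same text.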
Keywords: transfer learning; ensemble learning; ALBERT; Bi-LSTM-CNN; medical text; health questions
Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]