Authors: Lhamo-kyap (拉毛杰); Pema-Tseden (万玛才旦); Yongtso (拥措); GAO Xing (高兴); Nyima-Trashi (尼玛扎西)
Affiliations: [1] School of Information Science and Technology, Tibet University, Lhasa 850000, China [2] Key Laboratory of Tibetan Information Technology and Artificial Intelligence of Tibet Autonomous Region, Tibet University, Lhasa 850000, China [3] Engineering Research Center of the Ministry of Education of Tibetan Information Technology, Tibet University, Lhasa 850000, China
Source: Plateau Science Research (《高原科学研究》), 2025, No. 1, pp. 105-118 (14 pages)
Funding: National Science and Technology Major Project on New Generation Artificial Intelligence (2022ZD0116101); Project of the Science and Technology Department of Tibet Autonomous Region (XZ202401JD0010); Lhasa Science and Technology Program Project (LSKJ20250X).
Abstract: Current Tibetan named entity recognition models, when applied to entity recognition in the Tibetan medicine domain, suffer from limited transferability and generalization, insufficient capture of semantic associations, and ambiguous entity boundaries. To address these problems, this paper proposes a Tibetan medicine named entity recognition model (TM-ATD) that integrates adversarial training and iterated dilated convolutions. First, a syllable-annotated dataset, TibetanAI_YUTOK_NER, is constructed from the "Four Medical Tantras" (《四部医典》). Second, a pre-trained model encodes Tibetan syllables into features, and adversarial training generates adversarial samples to improve the robustness and generalization of the model. A Bidirectional Long Short-Term Memory (BiLSTM) network captures sequence dependencies. Iterated Dilated Convolutions (IDC) capture the contextual information and global features of the text, and a multi-head self-attention mechanism improves the understanding of local context, strengthening entity boundary information and semantic associations. Finally, a Conditional Random Field (CRF) decodes the optimal label sequence. Experimental results show that the method achieves F1 scores of 76.59% on the Tibetan medicine dataset and 54.91% on the Tibetan dataset TibetanAI_NER, improvements of 3.03% and 0.77%, respectively, over the baseline model.
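Keywords: Tibetan medicine; named entity recognition; pre-trained model; adversarial training; dilated convolution
Classification code: P237 [Astronomy and Earth Sciences: Photogrammetry and Remote Sensing]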
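
The abstract says that iterated dilated convolutions capture broad context. A minimal pure-Python sketch of why stacking them widens the receptive field is given below. It is an illustration only, not the authors' TM-ATD implementation: the function names, the dilation pattern (1, 1, 2), the four iterations, and the all-ones kernel are assumptions, since the record does not state the paper's actual settings.

```python
def dilated_conv1d(seq, kernel, dilation):
    """Same-length 1D dilated convolution with zero padding, followed by ReLU.

    Assumes an odd kernel width so the kernel is centered on each position.
    """
    half = len(kernel) // 2
    out = []
    for t in range(len(seq)):
        total = 0.0
        for k, w in enumerate(kernel):
            idx = t + (k - half) * dilation  # taps are spaced `dilation` apart
            if 0 <= idx < len(seq):
                total += w * seq[idx]
        out.append(max(total, 0.0))  # ReLU
    return out


def iterated_dilated_block(seq, kernel, dilations=(1, 1, 2), iterations=4):
    """Apply the same block of dilated convolutions repeatedly (shared kernel).

    The dilation pattern and iteration count are hypothetical defaults.
    """
    for _ in range(iterations):
        for d in dilations:
            seq = dilated_conv1d(seq, kernel, d)
    return seq


def receptive_field(kernel_width, dilations=(1, 1, 2), iterations=4):
    """Number of input positions that can influence one output position."""
    return 1 + iterations * sum((kernel_width - 1) * d for d in dilations)


if __name__ == "__main__":
    # Feed a single impulse and count how many output positions it reaches.
    impulse = [0.0] * 101
    impulse[50] = 1.0
    out = iterated_dilated_block(impulse, kernel=[1.0, 1.0, 1.0])
    touched = sum(1 for v in out if v > 0)
    print(touched, receptive_field(3))
```

With these assumed settings, twelve width-3 layers reach 33 positions. Undilated layers of the same width and depth would reach 25, so the dilation widens the context without adding parameters.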