Study on a Multi-Granularity Semantic Matching Model for Tibetan Text Based on TMS-BERT  Cited by: 2


Authors: YANG Jin [1]; ZHU Yunfei [1]; CHEN Chen [1]; A Yongqiang [2]

Affiliations: [1] College of Cyber Security, Sichuan University, Chengdu, Sichuan 610065, China; [2] School of Information Science and Technology, Tibet University, Lhasa, Tibet 850000, China

Source: Plateau Science Research, 2023, No. 2, pp. 84-92 (9 pages)

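Funding: National Natural Science Foundation of China (62162057, 61872254); Sichuan Science and Technology Program (2021JDRC0004); Key Laboratory of Information Network Security, Ministry of Public Security (C20606); National Undergraduate Innovation Training Program (202210694032).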

Abstract: This paper proposes a multi-granularity semantic matching model for Tibetan text based on TMS-BERT (Tibetan Multi-granularity Semantic matching-BERT). To suit the characteristics of Tibetan text, the model builds multi-granularity feature vectors from a mixture of syllabic characters, words, and phrases, which preserves the semantic features of Tibetan and alleviates the curse of dimensionality that affects traditional Tibetan text matching models. Building on the Transformer's bidirectional encoding and self-attention mechanism, a model for detecting semantic similarity between Tibetan texts is trained on a large Tibetan corpus, which overcomes the low detection accuracy of traditional text matching models. A total of 71,904 Tibetan sentence pairs were collected from social media platforms and news websites for training and evaluation. The model reaches a precision of 95.33% and an accuracy of 94.33%. Its accuracy is 3.68% higher than that of the traditional BERT model, 12.39% higher than that of the traditional word-vector model fastText, and 27.35% higher than that of a traditional text similarity model.
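The mixed syllable, word, and phrase granularity described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that syllables are delimited by the Tibetan tsheg mark (U+0F0B) and that clauses end with a shad (U+0F0D), and it uses syllable n-grams as a rough stand-in for word-level and phrase-level units, since the abstract does not say how words and phrases are segmented. The function names `syllables` and `multi_granularity` are hypothetical.

```python
TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (tsheg)
SHAD = "\u0f0d"   # Tibetan clause-final mark (shad)


def syllables(text):
    """Split Tibetan text into syllables on tsheg, dropping shad and whitespace."""
    cleaned = text.replace(SHAD, " ")
    parts = []
    for chunk in cleaned.split():
        # A trailing tsheg yields an empty piece, which is filtered out.
        parts.extend(piece for piece in chunk.split(TSHEG) if piece)
    return parts


def multi_granularity(text, max_n=3):
    """Return syllable n-grams for n = 1..max_n.

    Assumption: n = 1 approximates syllable units, while n = 2 and n = 3
    approximate word and phrase units. A real system would use a Tibetan
    word segmenter instead of raw n-grams.
    """
    syls = syllables(text)
    units = {}
    for n in range(1, max_n + 1):
        units[n] = [
            TSHEG.join(syls[i:i + n]) for i in range(len(syls) - n + 1)
        ]
    return units


if __name__ == "__main__":
    for n, grams in multi_granularity("བོད་ཡིག་སློབ་སྦྱོང་།").items():
        print(n, grams)
```

Units from all three levels could then be mapped to embeddings and combined before being passed to a BERT-style encoder. How TMS-BERT actually combines them is not stated in the abstract.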

Keywords: Tibetan information processing; multi-granularity; text semantic matching; BERT

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
