Research on Tibetan Automatic Word Segmentation Method Based on ASBC Model  Cited by: 1

Original title: 基于ASBC模型的藏文自动分词方法研究

Authors: YIN Zonghe (尹宗鹤); NIMA Ciren (尼玛次仁); YU Tao (于韬) [1,2,3]; YONG Cuo (拥措)

Affiliations: [1] College of Information Science and Technology, Tibet University, Lhasa 850000 [2] Engineering Research Center of Tibetan Information Technology, Ministry of Education, Tibet University, Lhasa 850000 [3] Key Laboratory of Tibetan Information Technology Artificial Intelligence of Tibet Autonomous Region, Lhasa 850000

Source: Computer & Digital Engineering (《计算机与数字工程》), 2023, No. 6, pp. 1227-1230, 1237 (5 pages)

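Funding: Supported by the Ministry of Science and Technology Key R&D Program special project (No. 2017YFB1402202); the Tibet Autonomous Region Science and Technology Innovation Base independent R&D project (No. XZ2021HR002G); and the Tibet University graduate "High-Level Talent Training Program" project (No. 2020-GSP-S174).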

Abstract: Tibetan word segmentation is a prerequisite for Tibetan natural language processing, and its quality affects the downstream tasks. With the rise of neural networks, deep learning methods combined with pre-trained language models have become the mainstream in word segmentation research. To address the limited semantic information captured by traditional neural networks, this paper builds an ALBERT pre-trained language model from a large-scale Tibetan corpus and introduces a Tibetan syllable feature fusion method. On this basis it proposes a deep learning Tibetan word segmentation model that combines ALBERT pre-training and syllable feature fusion with a bidirectional long short-term memory network and a conditional random field (ALBERT-Syllable-BiLSTM-CRF, ASBC). Experiments on multi-topic datasets mainly verify the effectiveness of the ALBERT pre-trained language model and of syllable feature fusion for Tibetan word segmentation; the final model shows a clear improvement in segmentation performance.

Keywords: Tibetan; automatic word segmentation; pre-training; ALBERT; syllable feature fusion

Classification No.: TP391 [Automation and Computer Technology / Computer Application Technology]
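
The abstract treats segmentation as a labeling problem over Tibetan syllables, with a CRF layer producing the final tag sequence. The Python sketch below illustrates only the two steps around such a model: splitting text into syllables at the tsheg mark (U+0F0B) and grouping syllables into words from per-syllable tags. It is not the authors' implementation. The B/M/E/S tag scheme, the function names `split_syllables` and `decode_bmes`, and the hand-written tags are all assumptions for illustration; the abstract does not state which tag set the ASBC model uses.

```python
TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (tsheg)


def split_syllables(text):
    """Split Tibetan text into syllables at each tsheg, dropping empty pieces."""
    return [s for s in text.split(TSHEG) if s]


def decode_bmes(syllables, tags):
    """Group syllables into words from per-syllable tags.

    Assumed scheme (hypothetical, not taken from the paper):
    B = word begin, M = word middle, E = word end, S = single-syllable word.
    """
    if len(syllables) != len(tags):
        raise ValueError("need exactly one tag per syllable")
    words, current = [], []
    for syllable, tag in zip(syllables, tags):
        # A new word starts at B or S; flush any unfinished word first.
        if tag in ("B", "S") and current:
            words.append(current)
            current = []
        current.append(syllable)
        # A word closes at E or S.
        if tag in ("E", "S"):
            words.append(current)
            current = []
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return [TSHEG.join(word) for word in words]


# Illustrative input; the tags are written by hand, not produced by a model.
syllables = split_syllables("བོད་ཡིག་ཡིན")
words = decode_bmes(syllables, ["B", "E", "S"])
```

In a full pipeline the tag list would come from the tagger rather than being written by hand. The sketch also leaves punctuation such as the shad attached to the neighboring syllable, which a real preprocessing step would handle separately.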

 
