Sanskrit-Tibetan text recognition based on deep learning


Authors: 才让叁智, 仁增多杰 [1,4,5], 多拉, 索南尖措 / TSHERING Bsamvgub; RENZENG Duojie; DOL La; SUONAN Jiancuo (School of Information Science and Technology, Tibet University, Lhasa 850000, China; Department of Chinese Language and Literature, Northwest Minzu University, Lanzhou 730030, China; National and Local Joint Engineering Research Center for Tibetan Information Technology, Tibet University, Lhasa 850000, China; State Key Laboratory of Tibetan Intelligent Information Processing and Application, Tibet University, Lhasa 850000, China; State Key Laboratory of Tibetan Intelligent Information Processing and Application, Qinghai Normal University, Xining 810016, China)

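Institutions: [1] School of Information Science and Technology, Tibet University, Lhasa, Tibet 850000; [2] Department of Chinese Language and Literature, Northwest Minzu University, Lanzhou, Gansu 730030; [3] National and Local Joint Engineering Research Center for Tibetan Information Technology, Tibet University, Lhasa, Tibet 850000; [4] State Key Laboratory of Tibetan Intelligent Information Processing and Application (jointly established by province and ministry), Lhasa, Tibet 850000; [5] State Key Laboratory of Tibetan Intelligent Information Processing and Application, Qinghai Normal University, Xining, Qinghai 810008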

Source: Journal of Xiamen University: Natural Science (《厦门大学学报(自然科学版)》), 2024, No. 6, pp. 1059-1066 (8 pages)

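Funding: National Natural Science Foundation of China (62266037); Natural Science Foundation of the Tibet Autonomous Region (XZ202101ZR0108G); Tibet University Everest Discipline Construction Program (zf22002001); Tibet University Research Cultivation Fund (ZDCZJH19-19); Central Government Funds for Guiding Local Science and Technology Development, Department of Science and Technology of the Tibet Autonomous Region (XZ202102YD0018C).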

Abstract: [Objective] Sanskrit-Tibetan text recognition is an important preliminary step for research on automatic sorting, lexical analysis, and automatic proofreading. Current rule-based recognition methods have a number of problems, including the inability to recognize short Sanskrit words effectively. [Methods] On a self-built Sanskrit-Tibetan text recognition dataset, three approaches are compared experimentally: a method based on a bidirectional long short-term memory network (Bi-LSTM) and self-attention, a method based on the pre-trained language model CINO, and rule-based methods. Their recognition results are analyzed and the best-performing method is selected. [Results] The recognition model based on Bi-LSTM and the self-attention mechanism reaches a macro accuracy, recall, and F1 value of 98.09%, 99.22%, and 98.65%, respectively, outperforming the multilingual pre-trained model CINO and the other three rule-based methods. [Conclusions] When Tibetan character representation models based on skip-gram, CBOW, and GloVe are trained on the same small-scale, duplicate-free dataset, CBOW yields better character representations than the other two. Given the same training data, the model based on Bi-LSTM and self-attention outperforms the multilingual pre-trained model CINO and also outperforms the rule-based Sanskrit-Tibetan text recognition models.
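
The macro-averaged scores quoted in the abstract can be illustrated with a small sketch. It is not the authors' evaluation code. The label names `SKT` and `TIB`, the toy label sequences, and the function names are all hypothetical. Reading the first reported figure as a macro precision is also an assumption, since the abstract calls it "macro accuracy". It does fit the numbers: the harmonic mean of 98.09 and 99.22 is about 98.65, which matches the reported F1 value.

```python
from collections import Counter


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def macro_scores(gold, pred):
    """Macro-averaged precision, recall and F1 over all labels seen.

    Each label's score is computed on its own and the scores are then
    averaged with equal weight, so a rare class counts as much as a
    frequent one. Macro F1 here is the mean of per-label F1 values,
    which is one common convention, not necessarily the paper's.
    """
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it was wrong
            fn[g] += 1  # missed an instance of gold label g

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1_from(precision, recall))

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data (hypothetical): one Tibetan token is mislabeled as Sanskrit.
gold = ["TIB", "TIB", "TIB", "SKT", "SKT", "TIB"]
pred = ["TIB", "TIB", "SKT", "SKT", "SKT", "TIB"]
macro_p, macro_r, macro_f1 = macro_scores(gold, pred)
print(f"macro P={macro_p:.4f} R={macro_r:.4f} F1={macro_f1:.4f}")

# Consistency check on the figures reported in the abstract.
print(round(f1_from(98.09, 99.22), 2))
```

Macro averaging suits a task like this one when Sanskrit tokens are much rarer than Tibetan ones, because a micro average would be dominated by the majority class.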

Keywords: Tibetan information processing; Sanskrit-Tibetan text recognition; character representation; STTRM_BS model

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
