Tibetan text normalization method


Authors: LHAKPA Dondrub; ZHAXI Duoji; ZHU Jie

Affiliations: [1] School of Information Science and Technology, Tibet University, Lhasa 850000, China; [2] Tibet Informatization Collaborative Innovation Center Jointly Built by the Province and the Ministry, Lhasa 850000, China

Source: Journal of Jilin University (Engineering and Technology Edition), 2024, No. 12, pp. 3577-3588 (12 pages)

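Funding: National Natural Science Foundation project (62406256); Ministry of Education Humanities and Social Sciences Research Project (21YJCZH059); 2025 Tibet Autonomous Region Natural Science Foundation project (ZRKX2025000068); Tibet University research project for in-service doctoral candidates and incoming postdoctoral researchers (zbds202326); Tibet University Cultivation Program project (ZDQMJH20-09).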

Abstract: Modern Tibetan text takes complex, varied, and often non-standard written forms, which degrades the performance of speech synthesis systems. To address this, the paper proposes a Tibetan text normalization method designed to be easy to maintain and extend. First, it analyzes in depth how Tibetan marker symbols and non-Tibetan special symbols from other languages appear in Tibetan text, and classifies the special symbols by their features. Second, based on the resulting categories, it establishes 15 kinds of writing rules for converting special symbols into Tibetan. Finally, using 13,490 sentences as experimental data, it applies a Tibetan grapheme-to-phoneme conversion test to identify and check the validity of special symbols and Tibetan syllables in the text, then normalizes the sentences that contain special symbols by rule matching. The experimental results show that the omission rate of Tibetan phoneme transcription was as high as 4.69% before normalization and fell to 0.01% after normalization, and that the accuracy of Tibetan text normalization reached 99%.

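Keywords: computer application technology; Tibetan text analysis; text normalization; speech synthesis; special symbols; grapheme-to-phoneme conversion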

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
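
The rule-matching step described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the paper's rule set. The single rule reads ASCII digits one at a time with Tibetan number words, which is a simplification of how numbers are actually read. The names `RULES`, `normalize`, and `omission_rate` are hypothetical. The omission measure is only a crude proxy: it counts characters outside the Unicode Tibetan block, whereas the paper measures omissions in phoneme transcription.

```python
import re

TSHEG = "\u0f0b"  # Tibetan syllable delimiter

# Illustrative digit-by-digit readings (assumption: not the paper's rules).
DIGIT_WORDS = {
    "0": "ཀླད་ཀོར", "1": "གཅིག", "2": "གཉིས", "3": "གསུམ", "4": "བཞི",
    "5": "ལྔ", "6": "དྲུག", "7": "བདུན", "8": "བརྒྱད", "9": "དགུ",
}


def read_digits(match: re.Match) -> str:
    """Spell out each ASCII digit as a Tibetan word, tsheg-delimited."""
    return TSHEG.join(DIGIT_WORDS[d] for d in match.group(0)) + TSHEG


# Ordered rule table: (pattern, replacement). New symbol classes are
# supported by appending entries, which keeps the rules easy to extend.
RULES = [
    (re.compile(r"[0-9]+"), read_digits),
]


def normalize(text: str) -> str:
    """Apply every rule in order to the input text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


def is_tibetan(ch: str) -> bool:
    """True if the character lies in the Unicode Tibetan block."""
    return "\u0f00" <= ch <= "\u0fff"


def omission_rate(text: str) -> float:
    """Share of non-space characters outside the Tibetan block.

    A rough stand-in for units that a grapheme-to-phoneme step
    would be unable to transcribe.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if not is_tibetan(c)) / len(chars)


if __name__ == "__main__":
    sample = "ལོ" + TSHEG + "2024"
    print(omission_rate(sample), omission_rate(normalize(sample)))
```

Keeping the rules in a data table instead of hard-coding them in the control flow is one way to get the maintainability and extensibility the abstract emphasizes: a new symbol category becomes one more `(pattern, replacement)` entry.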

 
