A Chinese-Tibetan phrase translation corpus based on semantic information enrichment


Authors: CHANG Run; CHEN Bo; ZHAO Xiaobing [1,2] (College of Information Engineering, Minzu University of China, Beijing 100081, P.R. China; National Language Resource Monitoring & Research Center of Minority Languages, 100081, P.R. China)

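Affiliations: [1] College of Information Engineering, Minzu University of China, Beijing 100081; [2] National Language Resource Monitoring & Research Center of Minority Languages, Beijing 100081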

Source: China Scientific Data (Chinese and English online edition), 2024, Issue 4, pp. 39-45 (7 pages)

Funding: Major Program of the National Social Science Fund of China (22&ZD035).

Abstract: Machine translation is a key task in natural language processing and plays an increasingly prominent role in promoting political, economic, and cultural exchange. Between high-resource languages such as Chinese and English, machine translation quality is already close to that of human translation. For low-resource languages such as Tibetan, however, accuracy still needs improvement because large-scale, publicly available parallel corpora are lacking. In Chinese-Tibetan translation, existing systems often translate phrases inaccurately, because phrases (abbreviations, for example) are short yet carry deep semantic information. To help translation models better capture and convey this semantic information, this paper constructs a Chinese-Tibetan phrase translation corpus based on semantic information enrichment, containing 7,000 Chinese-Tibetan phrase pairs. The original phrase pairs come from the Tibetan Language and Writing Network of Tibet (西藏藏语言文字网). The enriched semantic information comprises a Chinese definition of each Chinese phrase and an example sentence containing the phrase; both were generated by a large language model and then proofread by professionals. The release of this dataset is of value for advancing Chinese-Tibetan information processing.
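The abstract names four pieces of information per entry: the Chinese phrase, its Tibetan translation, a Chinese definition, and a Chinese example sentence containing the phrase. The following is a minimal sketch of how such a record might be represented and turned into an enriched source string for a translation model. The field names, the `PhraseEntry` class, the `build_source_text` helper, and the `[SEP]` separator are all illustrative assumptions; the abstract does not describe the dataset's actual file format or how the authors feed the enrichment to a model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhraseEntry:
    """One hypothetical corpus record; field names are assumptions."""
    zh_phrase: str      # Chinese source phrase
    bo_phrase: str      # Tibetan translation of the phrase
    zh_definition: str  # Chinese definition of the phrase
    zh_example: str     # Chinese example sentence containing the phrase


def build_source_text(entry: PhraseEntry, sep: str = " [SEP] ") -> str:
    """Join the phrase with its enrichment fields into one input string.

    The separator is an assumed convention, not taken from the paper.
    """
    # The abstract states that example sentences contain the phrase,
    # so a record that violates this is rejected.
    if entry.zh_phrase not in entry.zh_example:
        raise ValueError("example sentence must contain the phrase")
    return sep.join([entry.zh_phrase, entry.zh_definition, entry.zh_example])


if __name__ == "__main__":
    # Placeholder values only; these are not real corpus entries.
    entry = PhraseEntry(
        zh_phrase="PHRASE",
        bo_phrase="TIBETAN_TRANSLATION",
        zh_definition="DEFINITION",
        zh_example="an example sentence using PHRASE in context",
    )
    print(build_source_text(entry))
```

Keeping the Tibetan side out of the source string reflects the usual split between model input and target; whether the definition and example are concatenated, used as separate training pairs, or applied in some other way is left open here.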

Keywords: machine translation; low-resource; Tibetan; phrase; semantic information; dataset

Classification No.: TP391.2 [Automation and Computer Technology: Computer Application Technology]

 
