Authors: 常润 (CHANG Run), 陈波 (CHEN Bo), 赵小兵 (ZHAO Xiaobing) [1,2] (College of Information Engineering, Minzu University of China, Beijing 100081, P.R. China; National Language Resource Monitoring & Research Center of Minority Languages, 100081, P.R. China)
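Affiliations: [1] College of Information Engineering, Minzu University of China, Beijing 100081; [2] National Language Resource Monitoring & Research Center of Minority Languages, Beijing 100081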
Source: 《中国科学数据(中英文网络版)》 (China Scientific Data), 2024, No. 4, pp. 39-45 (7 pages)
Funding: Major Project of the National Social Science Fund of China (22&ZD035).
Abstract: Machine translation is a key task in natural language processing, and its role in promoting political, economic, and cultural exchange is increasingly significant. Between high-resource languages such as Chinese and English, machine translation has nearly reached the quality of human translation. For low-resource languages such as Tibetan, however, accuracy still needs improvement because large-scale, publicly available parallel corpora are lacking. In Chinese-Tibetan translation, existing systems often translate phrases inaccurately, because phrases (abbreviations, for example) are short yet carry deep semantic information. To help translation models better capture and convey this semantic information, this paper constructs a Chinese-Tibetan phrase translation corpus based on semantic information enrichment. The corpus contains 7,000 Chinese-Tibetan phrase pairs. The original phrase data is sourced from 西藏藏语言文字网 (the Tibetan Language and Writing Network of Tibet). The enriched semantic information consists of a Chinese definition of each Chinese phrase and example sentences containing the phrase; both were generated by large language models and then proofread by professionals. The publication of this dataset is of value for advancing Chinese-Tibetan information processing.
Classification: TP391.2 [Automation and Computer Technology: Computer Application Technology]