Chinese-Vietnamese pseudo-parallel corpus generation based on monolingual language model

Cited by: 2

Authors: JIA Chengxun; LAI Hua [1,2]; YU Zhengtao [1,2]; WEN Yonghua [1,2]; YU Zhiqiang

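Affiliations: [1] Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650504, China; [2] Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming, Yunnan 650500, China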

Source: Journal of Computer Applications (《计算机应用》), 2021, No. 6, pp. 1652-1658 (7 pages)

Funding: National Natural Science Foundation of China (61672271, 61732005, 61761026, 61762056, 61866020); National Key Research and Development Program of China (2019QY1801).

Abstract: Neural machine translation performs well on resource-rich languages, but data scarcity leaves it performing poorly on low-resource language pairs such as Chinese-Vietnamese. One of the most effective ways to alleviate this problem is to generate pseudo-parallel data from existing resources. To exploit the availability of monolingual data, the proposed method builds on back-translation. First, a language model trained on a large amount of monolingual data is fused with the neural machine translation model. Then, during back-translation, the language model contributes language characteristics to the generated output, yielding more well-formed, higher-quality pseudo-parallel data. Finally, the generated corpus is added to the original small-scale corpus to train the final translation model. Experimental results on the Chinese-Vietnamese translation task show that, compared with ordinary back-translation, pseudo-parallel data generated with the fused language model improves the BLEU (BiLingual Evaluation Understudy) score of Chinese-Vietnamese neural machine translation by 1.41 percentage points.
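The three steps in the abstract can be illustrated with a minimal sketch. The abstract does not say how the language model is fused with the translation model, so the code below assumes shallow fusion (a weighted sum of log-probabilities at each decoding step), which may differ from the paper's actual method. All names (`fuse_log_probs`, `best_token`, `build_training_corpus`, `lm_weight`) are hypothetical, and the probabilities are toy values rather than results from the paper.

```python
import math
from typing import Callable, Dict, List, Tuple


def fuse_log_probs(
    nmt_log_probs: Dict[str, float],
    lm_log_probs: Dict[str, float],
    lm_weight: float = 0.3,
) -> Dict[str, float]:
    """Assumed shallow fusion: score(y) = log p_nmt(y) + lm_weight * log p_lm(y)."""
    floor = math.log(1e-9)  # back-off score for tokens the LM does not list
    return {
        token: log_p + lm_weight * lm_log_probs.get(token, floor)
        for token, log_p in nmt_log_probs.items()
    }


def best_token(scores: Dict[str, float]) -> str:
    """Greedy choice of the highest-scoring candidate token."""
    return max(scores, key=scores.get)


def build_training_corpus(
    real_pairs: List[Tuple[str, str]],
    mono_sentences: List[str],
    back_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Append (synthetic, monolingual) pseudo-pairs to the small real corpus."""
    pseudo_pairs = [(back_translate(s), s) for s in mono_sentences]
    return list(real_pairs) + pseudo_pairs


# Toy decoding step: the NMT model slightly prefers "a",
# while the monolingual LM strongly prefers "b".
nmt = {"a": math.log(0.5), "b": math.log(0.4)}
lm = {"a": math.log(0.1), "b": math.log(0.8)}

without_lm = best_token(fuse_log_probs(nmt, lm, lm_weight=0.0))
with_lm = best_token(fuse_log_probs(nmt, lm, lm_weight=0.3))

# Toy corpus assembly with a stand-in back-translation function.
corpus = build_training_corpus(
    [("src", "tgt")], ["mono"], back_translate=lambda s: "bt(" + s + ")"
)
```

In this toy step the language model overturns the translation model's marginal preference, which is the intended effect of fusion: candidates that are more fluent under the monolingual model are favored when generating the synthetic side of each pseudo-pair. The weight `lm_weight` is an assumed hyperparameter that would need tuning on held-out data.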

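Keywords: Chinese-Vietnamese neural machine translation; data augmentation; pseudo-parallel data; monolingual data; language model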

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
