Authors: JIA Chengxun (贾承勋); LAI Hua (赖华)[1,2]; YU Zhengtao (余正涛)[1,2]; WEN Yonghua (文永华)[1,2]; YU Zhiqiang (于志强)
Affiliations: [1] Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650504, China; [2] Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming, Yunnan 650500, China
Source: Journal of Computer Applications (《计算机应用》), 2021, No. 6, pp. 1652-1658 (7 pages)
Funding: National Natural Science Foundation of China (61672271, 61732005, 61761026, 61762056, 61866020); National Key Research and Development Program of China (2019QY1801).
Abstract: Neural machine translation achieves good results on resource-rich languages, but data scarcity causes it to perform poorly on low-resource language pairs such as Chinese-Vietnamese. One of the most effective ways to alleviate this problem is to use existing resources to generate pseudo-parallel data. Considering the availability of monolingual data, the proposed approach builds on back-translation. First, a language model trained on a large amount of monolingual data is fused with the neural machine translation model. Then, during back-translation, language characteristics are incorporated through the language model, so that the generated pseudo-parallel data is more well-formed and of higher quality. Finally, the generated corpus is added to the original small-scale corpus to train the final translation model. Experimental results on Chinese-Vietnamese translation tasks show that, compared with ordinary back-translation, the pseudo-parallel data generated with the fused language model improves the BiLingual Evaluation Understudy (BLEU) score of Chinese-Vietnamese neural machine translation by 1.41 percentage points.
Keywords: Chinese-Vietnamese neural machine translation; data augmentation; pseudo-parallel data; monolingual data; language model
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
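
The abstract says a language model is fused with the translation model during back-translation, but it does not say which fusion scheme is used. As one possible reading, the sketch below assumes shallow fusion, where each candidate token's translation-model log-probability is combined with a weighted language-model log-probability at every decoding step. The function names, the weight `lm_weight`, and the toy distributions are all hypothetical and are not taken from the paper.

```python
import math

# Assumption: shallow fusion (log-linear interpolation). The record does not
# confirm that the paper uses this scheme; it is shown only as an illustration.
UNSEEN_LOG_PROB = math.log(1e-9)  # floor for tokens the language model has not scored


def shallow_fusion_scores(nmt_log_probs, lm_log_probs, lm_weight=0.3):
    """Return fused scores: log p_nmt(token) + lm_weight * log p_lm(token)."""
    return {
        token: nmt_lp + lm_weight * lm_log_probs.get(token, UNSEEN_LOG_PROB)
        for token, nmt_lp in nmt_log_probs.items()
    }


def pick_next_token(nmt_log_probs, lm_log_probs, lm_weight=0.3):
    """Greedy choice of the next token under the fused score."""
    scores = shallow_fusion_scores(nmt_log_probs, lm_log_probs, lm_weight)
    return max(scores, key=scores.get)


# Toy next-token distributions (hypothetical values for one decoding step).
nmt = {"tok_a": math.log(0.40), "tok_b": math.log(0.45), "tok_c": math.log(0.15)}
lm = {"tok_a": math.log(0.70), "tok_b": math.log(0.10), "tok_c": math.log(0.20)}

without_lm = pick_next_token(nmt, lm, lm_weight=0.0)  # translation model alone
with_lm = pick_next_token(nmt, lm, lm_weight=0.3)     # language model shifts the choice
print(without_lm, with_lm)
```

In this toy step the translation model alone slightly prefers `tok_b`, while the fused score prefers `tok_a` because the language model finds it far more fluent. In a back-translation pipeline, a decoder using such a fused score would produce the synthetic side of each pseudo-parallel pair, which is then added to the original corpus.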