Similarity-based Curriculum Learning for Multilingual Neural Machine Translation  (Cited by: 2)


Authors: YU Dong [1], XIE Wan-ying, GU Shu-hao, FENG Yang [2,3]

Affiliations: [1] College of Information Sciences, Beijing Language and Culture University, Beijing 100083, China; [2] Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; [3] University of Chinese Academy of Sciences, Beijing 100049, China

Source: Computer Science (计算机科学), 2022, No. 1, pp. 24-30 (7 pages)

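Funding: Youth Fund Project of Humanities and Social Sciences Research of the Ministry of Education of China (19YJCZH230); Graduate Innovation Fund of Beijing Language and Culture University (20YCX138).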

Abstract: In recent years, multilingual neural machine translation (MNMT) with a single model has attracted wide attention. However, most existing methods simply mix the corpora of all languages into one training set and do not exploit the relatedness and similarity among languages, which has been shown to help multilingual translation. In addition, training involves many languages and a large amount of data, which makes it difficult and time-consuming. To address these two problems, this paper proposes a curriculum learning method based on language similarity to improve the overall performance and convergence speed of MNMT. Specifically, two similarity criteria are proposed: singular vector canonical correlation analysis (SVCCA) is used to rank different languages (inter-language), and cosine similarity is used to rank different sentences within a particular language (intra-language). Furthermore, the paper proposes a curriculum learning strategy that uses the validation-set loss as the criterion for switching curricula, so that training on the whole data set becomes training on a sequence of curricula, which reduces training difficulty. The method fills a gap in applying curriculum learning strategies to MNMT. Experiments on balanced and unbalanced IWSLT multilingual data sets and on the Europarl corpus show that the proposed method outperforms the multilingual baseline translation systems and reduces training time by up to 64%.
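The two ideas named in the abstract, ranking sentences by cosine similarity and switching curricula based on validation loss, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the reference vector, the most-similar-first ordering, and the patience-based switching rule are all assumptions made here for illustration, and the SVCCA language ranking is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_sentences(sent_vecs, reference_vec):
    """Return sentence indices sorted from most to least similar to reference_vec.

    Assumption: sentences are already embedded as vectors and compared against
    a single reference vector; the abstract does not specify either detail.
    """
    return sorted(range(len(sent_vecs)),
                  key=lambda i: cosine(sent_vecs[i], reference_vec),
                  reverse=True)


def run_curriculum(curricula, train_step, validate, patience=2, max_steps=1000):
    """Train on one curriculum at a time, moving on when validation loss stalls.

    Assumption: "stalls" means `patience` consecutive validation checks without
    improvement; the abstract only says validation loss is the switching criterion.
    Returns the curriculum index used at each training step.
    """
    schedule = []
    steps = 0
    for idx, data in enumerate(curricula):
        best = float("inf")
        stale = 0
        while steps < max_steps:
            train_step(data)      # hypothetical callback: one update on this curriculum
            steps += 1
            schedule.append(idx)
            loss = validate()     # hypothetical callback: current validation loss
            if loss < best:
                best, stale = loss, 0
            else:
                stale += 1
                if stale >= patience:
                    break         # switch to the next curriculum
    return schedule
```

In this sketch each curriculum would be a subset of the training data ordered by the similarity rankings, and `train_step` and `validate` would wrap a real translation model.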

Keywords: Machine translation; Multilingual; Curriculum learning; Similarity evaluation; Language ranking; Sentence ranking

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

 
