Research on self-training neural machine translation based on monolingual priority sampling (cited by: 1)

Authors: ZHANG Xiaoyan, PANG Lei, DU Xiaofeng[1], LU Tianbo, XIA Yamei (School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China)

Affiliation: [1] School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China

Source: Journal on Communications, 2024, Issue 4, pp. 65-72 (8 pages)

Funding: National Natural Science Foundation of China (No. 62162060).

Abstract: To enhance the performance of neural machine translation (NMT) and mitigate the damage that high-uncertainty monolingual data causes to the NMT model during self-training, a self-training NMT model based on priority sampling is proposed. First, syntactic dependency trees are constructed via dependency analysis, and the importance of each token in the monolingual data is computed. Next, a monolingual lexicon is built, and a priority is defined for each token based on its importance and uncertainty. Finally, priorities are computed for the monolingual sentences and sampling is carried out according to these priorities, producing a synthetic parallel dataset that serves as training input for the student NMT model. Experimental results on a large-scale subset of the WMT English-German dataset show that the proposed model effectively improves NMT translation quality and mitigates the harm that excessive uncertainty causes to the model.
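The sampling step described above can be sketched in code. The snippet below is a minimal illustration, not the paper's actual implementation: `sentence_priority` is a hypothetical scoring rule (an importance-weighted average of per-token uncertainty; the paper's exact formula may differ), and `priority_sample` draws monolingual sentences without replacement with probability proportional to priority, using the standard exponential-key trick for weighted sampling.

```python
import math
import random

def sentence_priority(token_importances, token_uncertainties):
    """Combine per-token importance and uncertainty into one sentence-level
    priority. Here: importance-weighted average of uncertainty (illustrative
    choice, not the paper's exact definition)."""
    total = sum(token_importances)
    if total == 0:
        return 0.0
    weighted = sum(w * u for w, u in zip(token_importances, token_uncertainties))
    return weighted / total

def priority_sample(sentences, priorities, k, seed=0):
    """Sample k sentences without replacement, with probability proportional
    to priority, via exponential sort keys: key = -ln(U)/p, keep k smallest.
    Zero-priority sentences get an infinite key and are never preferred."""
    rng = random.Random(seed)
    keyed = []
    for sent, p in zip(sentences, priorities):
        key = -math.log(rng.random()) / p if p > 0 else float("inf")
        keyed.append((key, sent))
    keyed.sort(key=lambda kv: kv[0])
    return [sent for _, sent in keyed[:k]]
```

In a self-training loop, the selected sentences would then be translated by the teacher model to synthesize the parallel training set for the student.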

Keywords: machine translation; data augmentation; self-training; uncertainty; syntactic dependency

CLC number: TP391 [Automation and Computer Technology - Computer Application Technology]

 
