Construction of Mongolian-Chinese pseudo-parallel corpus enhanced by noisy data


Authors: TIAN Yonghong (田永红); ZHANG Junjin (章钧津); SONG Zheyu (宋哲煜)

Affiliation: [1] College of Data Science and Application, Inner Mongolia University of Technology, Hohhot 010080, Inner Mongolia, China

Source: Computer Engineering & Science (《计算机工程与科学》), 2025, No. 4, pp. 751-760 (10 pages)

Funding: National Natural Science Foundation of China (62466043).

Abstract: Neural machine translation (NMT), the mainstream approach to machine translation, performs well on general translation tasks. However, its translation quality depends on large-scale parallel corpora, and for low-resource languages the shortage of such corpora is a major obstacle. Data augmentation can effectively alleviate data scarcity, so this paper constructs a pseudo-parallel corpus by introducing noisy data into back translation. First, the corpus is preprocessed. Second, both plain back translation and back translation combined with noisy data are performed. Third, text similarity matching is applied. Finally, the two back translation variants are compared. Results on the experimental dataset show that back translation combined with noisy data effectively improves low-resource machine translation: its BLEU score is 1.10% higher than with back translation alone and 1.96% higher than without back translation.

Keywords: data augmentation; noisy data; text similarity matching; corpus preprocessing

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
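
The abstract's pipeline (back translation, noise injection, similarity matching) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' method: the noise operations (word dropout, blanking, bounded local shuffle) are common choices for noisy back translation, the noise is applied to the synthetic source side, `back_translate` is a hypothetical callable standing in for a trained target-to-source model, and token-level Jaccard overlap stands in for the paper's unspecified similarity measure.

```python
import random

BLANK = "<BLANK>"  # hypothetical placeholder token, not taken from the paper


def add_noise(tokens, rng, p_drop=0.1, p_blank=0.1, max_shift=3):
    """Apply word dropout, blanking, and a bounded local shuffle (assumed noise types)."""
    # 1) Drop each token with probability p_drop; keep at least one if possible.
    kept = [t for t in tokens if rng.random() >= p_drop] or tokens[:1]
    # 2) Replace each surviving token with a placeholder with probability p_blank.
    blanked = [BLANK if rng.random() < p_blank else t for t in kept]
    # 3) Local shuffle: sort by index plus uniform jitter in [0, max_shift],
    #    so a token moves at most about max_shift positions.
    keys = [i + rng.uniform(0, max_shift) for i in range(len(blanked))]
    order = sorted(range(len(blanked)), key=keys.__getitem__)
    return [blanked[i] for i in order]


def jaccard(a, b):
    """Token-set Jaccard similarity; a stand-in for the paper's similarity matching."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def build_pseudo_parallel(target_sentences, back_translate, rng,
                          min_sim=0.5, **noise_kwargs):
    """Pair each target sentence with a noised synthetic source sentence."""
    pairs = []
    for tgt in target_sentences:
        src = back_translate(tgt)                  # synthetic source tokens
        noisy = add_noise(src, rng, **noise_kwargs)
        # Discard pairs whose source was distorted too far by the noise.
        if jaccard(src, noisy) >= min_sim:
            pairs.append((noisy, tgt))
    return pairs


if __name__ == "__main__":
    # Toy stand-in for a real back-translation model: it only splits on spaces.
    toy_back_translate = lambda sentence: sentence.split()
    corpus = ["this is a toy target sentence", "another short example"]
    for noisy_src, tgt in build_pseudo_parallel(corpus, toy_back_translate,
                                                random.Random(0)):
        print(noisy_src, "->", tgt)
```

In a real setup the resulting pairs would be mixed with the genuine parallel corpus before training, and the noise rates and similarity threshold would need tuning on held-out data.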

 
