Machine Translation of Chinese Long Sentences Based on Recognition of Implicit Period and Comma

(Original Chinese title: 基于隐性句逗号识别的汉语长句机器翻译)


Authors: Feng Wen-he (冯文贺); Li Man-jia (李熳佳); Zhang Wen-juan (张文娟)

Affiliations: [1] Lab of Language Engineering and Computing, Center for Linguistics and Applied Linguistics, Guangdong University of Foreign Studies, Guangzhou 510420, China; [2] School of Computer Science and Engineering, Guangzhou Institute of Science and Technology, Guangzhou 510420, China

Source: Foreign Language Research (《外语学刊》), 2025, No. 1, pp. 39-46 (8 pages)

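Funding: This work is an interim result of the Ministry of Education Humanities and Social Sciences Fund project "Structural Discourse Quality Evaluation of Chinese-English Machine Translation" (24YJA740014); the Ministry of Education Humanities and Social Sciences Fund project "Automatic Construction of a Main-Subordinate Aligned Corpus of Chinese-English Complex Sentences for Machine Translation" (22YJCZH091); and the Guangdong Provincial Department of Education GK Characteristic Innovation Project "Structural Discourse Quality Evaluation of Machine Translation" (2023WTSCXO17).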

Abstract: Long-sentence translation has long been a difficult problem for machine translation. Observing that a considerable number of commas and periods in Chinese text can be converted into each other, this paper proposes the concepts of "implicit period" and "implicit comma" and implements their automatic recognition, so that long Chinese sentences can be turned into short sentences for Chinese-English machine translation. First, an implicit period and comma dataset is constructed by combining manual construction with semi-supervised learning, and recognition methods based on pre-trained models are implemented; Hierarchical BERT, which performs best, is adopted as the model for the subsequent application. Next, a Chinese-English machine translation method based on implicit period and comma recognition is implemented. Experiments with pre-trained machine translation models on public news and literature translation test corpora show that, for the English translation of long Chinese sentences, the proposed method improves BLEU overall relative to the baseline translation, and that on relatively robust machine translation models the gain becomes more pronounced as sentences get longer.
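The split-then-translate idea in the abstract can be illustrated with a minimal sketch. It assumes that recognition is available as a callable that decides, for each comma, whether it is an implicit period; the paper uses Hierarchical BERT for this step, whereas the toy rule below is only a stand-in. All names here (`split_at_implicit_periods`, `toy_is_implicit_period`, `translate_long_sentence`) are hypothetical and do not come from the paper, and the translator is passed in as a plain function rather than a specific model.

```python
from typing import Callable, List

COMMA = ","   # full-width Chinese comma
PERIOD = "。"  # full-width Chinese period

# Hypothetical classifier interface: (sentence, comma_index) -> bool
Classifier = Callable[[str, int], bool]


def split_at_implicit_periods(sentence: str, is_implicit_period: Classifier) -> List[str]:
    """Turn each comma judged to be an implicit period into a period and split there."""
    pieces: List[str] = []
    start = 0
    for i, ch in enumerate(sentence):
        if ch == COMMA and is_implicit_period(sentence, i):
            pieces.append(sentence[start:i] + PERIOD)
            start = i + 1
    tail = sentence[start:]
    if tail:
        pieces.append(tail)
    return pieces


def toy_is_implicit_period(sentence: str, comma_index: int) -> bool:
    """Toy stand-in for a trained model (assumption): treat a comma as an
    implicit period when the next clause opens with an explicit subject pronoun."""
    return sentence.startswith(("我们", "他们", "她们"), comma_index + 1)


def translate_long_sentence(
    sentence: str,
    is_implicit_period: Classifier,
    translate: Callable[[str], str],
) -> str:
    """Translate each short sentence separately and join the results."""
    short_sentences = split_at_implicit_periods(sentence, is_implicit_period)
    return " ".join(translate(s) for s in short_sentences)
```

In practice `translate` would wrap a pre-trained Chinese-English model and `is_implicit_period` a trained recognizer; keeping both as parameters lets the splitting logic be checked on its own.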

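Keywords: machine translation; long-sentence translation; implicit period and comma; Chinese long sentences; comma recognition; intra-sentence punctuation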

Classification number: H08 [Language and Script: Linguistics]

 
