Research on Construction Methods of Machine Translation-Oriented Multilingual Patent Corpus (多语言专利机器翻译平行语料构建方法研究)  Cited by: 1


Authors: CAO Jingcheng (曹竟成); WU Xiaoqian (邬小倩); WANG Qian (王倩); SUN Xiaoyu (孙小宇); DENG Huijuan (邓汇娟) (China Patent Information Center, Beijing 100044)

Affiliation: [1] China Patent Information Center (中国专利信息中心), Beijing 100044

Source: China Invention & Patent (《中国发明与专利》), 2022, No. 6, pp. 70-75, 80 (7 pages)

Abstract: Neural machine translation (NMT) is intrinsically a data-driven technology: large-scale, high-quality corpus resources are the foundation of a high-performance multilingual NMT system, so corpus construction is crucial. Motivated by the need to expand the training corpora of existing patent machine translation engines and to build patent corpus resources for specific language directions, this paper studies three methods for constructing multilingual patent parallel corpora: one based on the standard BLEU4 algorithm, one based on pseudo-data construction, and one based on patent family data. It analyzes and summarizes the strengths, weaknesses, and applicable scenarios of each method, with the aim of finding reliable schemes for constructing multilingual patent parallel corpora and effectively expanding existing patent corpus resources.

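Keywords: multilingual parallel corpus construction; intermediate-language (pivot) matching; standard BLEU4 algorithm; pseudo-data construction; patent families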

Classification No.: TP391.2 [Automation and Computer Technology: Computer Application Technology]
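The abstract and keywords name a standard BLEU4 algorithm together with intermediate-language matching, but this record does not describe the procedure itself. The following is only a minimal sketch of one way such matching could work, under stated assumptions: two bilingual corpora share a pivot language (English here), sentences are whitespace-tokenized, and a pair is kept when the unsmoothed sentence-level BLEU-4 score between the two pivot sides reaches a threshold. The function names `bleu4` and `match_by_pivot`, the `threshold` parameter, and its default value are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Unsmoothed sentence-level BLEU-4 of a candidate against one reference."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clipped counts: an n-gram is credited at most as often as it
        # occurs in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, any zero precision gives 0
        log_precision_sum += math.log(matched / total)
    # Brevity penalty for candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_precision_sum / 4)


def match_by_pivot(corpus_a, corpus_b, threshold=0.9):
    """Pair sentences from two corpora whose pivot-language sides agree.

    corpus_a and corpus_b are lists of (sentence, pivot_sentence) tuples.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    pairs = []
    for src_a, pivot_a in corpus_a:
        best, best_score = None, 0.0
        for src_b, pivot_b in corpus_b:
            score = bleu4(pivot_a.split(), pivot_b.split())
            if score > best_score:
                best, best_score = src_b, score
        if best is not None and best_score >= threshold:
            pairs.append((src_a, best, best_score))
    return pairs
```

For example, given a Chinese-English list and a Japanese-English list, `match_by_pivot(zh_en, ja_en)` would return Chinese-Japanese candidate pairs whose English sides are near-identical. The exhaustive comparison is quadratic in corpus size, so a real pipeline would likely need indexing or blocking first; that choice is outside what this record states.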

 
