Sentence Alignment Method Based on BERT and Multi-similarity Fusion

Cited by: 6

Authors: Liu Wenbin (刘文斌); He Yanqing (何彦青)[1]; Wu Zhenfeng (吴振峰); Dong Cheng (董诚)[1] (Institute of Scientific and Technical Information of China, Beijing 100038, China)

Affiliation: [1] Institute of Scientific and Technical Information of China, Beijing 100038, China

Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》), 2021, No. 7, pp. 48-58 (11 pages)

Funding: Supported by the Key Work Project of the Institute of Scientific and Technical Information of China (Grant No. ZD2020-18).

Abstract: [Objective] This paper proposes a method for automatically aligning bilingual sentences, providing technical support for building bilingual parallel corpora, cross-language information retrieval, and other natural language processing tasks. [Methods] BERT pre-training is introduced into the sentence alignment method: features are extracted with a bidirectional Transformer, and each token's semantics are represented by the sum of three embeddings (position embeddings, token embeddings, and segment embeddings). The source sentence and its translation, as well as the target sentence and its translation, are then measured bidirectionally, and three similarity measures (BLEU score, cosine similarity, and Manhattan distance) are fused to produce the final sentence alignment. [Results] The method was evaluated on two tasks. In the parallel corpus filtering task, recall reached 97.84%; in the comparable corpus filtering task, precision reached 99.47%, 98.31%, and 95.00% at noise ratios of 20%, 50%, and 90%, respectively. [Limitations] The text representation and similarity calculation could be further improved with richer semantic representations. [Conclusions] The proposed method outperforms the baseline systems on both parallel corpus filtering and comparable corpus filtering, and can be used to obtain large-scale, high-quality parallel corpora.
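The abstract describes scoring sentence pairs by translating in both directions and fusing a BLEU score, a cosine similarity, and a Manhattan distance. The following Python sketch illustrates one way such a fused score could be computed for a translated candidate against a reference sentence. The bert-base-multilingual-cased checkpoint, the mean-pooling of BERT token vectors, the mapping of Manhattan distance to a similarity, and the fusion weights are all assumptions for illustration and are not taken from the paper.

```python
# A minimal sketch of the multi-similarity fusion idea from the abstract,
# NOT the authors' released implementation. Assumptions for illustration:
# the bert-base-multilingual-cased checkpoint, mean-pooling of BERT token
# vectors as a sentence embedding, the 1/(1+d) mapping of Manhattan distance
# to a similarity, and the fusion weights.
import torch
from transformers import BertModel, BertTokenizer
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")
model.eval()


def sentence_vector(text: str) -> torch.Tensor:
    """Mean-pool the last hidden layer of BERT as a sentence embedding."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)            # shape (768,)


def fused_similarity(candidate: str, reference: str,
                     w_bleu: float = 0.4, w_cos: float = 0.4,
                     w_man: float = 0.2) -> float:
    """Fuse BLEU, cosine similarity and a Manhattan-distance-based similarity."""
    # BLEU between the translated candidate and the reference sentence.
    bleu = sentence_bleu([reference.split()], candidate.split(),
                         smoothing_function=SmoothingFunction().method1)
    v1, v2 = sentence_vector(candidate), sentence_vector(reference)
    cos = torch.nn.functional.cosine_similarity(v1, v2, dim=0).item()
    # Map the unbounded Manhattan distance into (0, 1].
    man = 1.0 / (1.0 + torch.sum(torch.abs(v1 - v2)).item())
    return w_bleu * bleu + w_cos * cos + w_man * man


# Example: score a machine-translated source-side sentence against a
# target-side candidate (the translation step itself is omitted here).
print(fused_similarity("the cat sits on the mat", "the cat is on the mat"))
```

In the bidirectional setup described in the abstract, such a score would be computed both for the translation of the source sentence against the target sentence and for the translation of the target sentence against the source sentence, and the two directions would then be combined.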

Keywords: BERT; machine translation; sentence alignment; parallel corpus; multi-similarity fusion

Classification Number: G351 [Culture & Science / Information Science]

 
