Authors: Liu Wenbin; He Yanqing[1]; Wu Zhenfeng; Dong Cheng[1] (Institute of Scientific and Technical Information of China, Beijing 100038, China)
Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》), 2021, No. 7, pp. 48-58 (11 pages)
Funding: One of the research outputs of a key work project of the Institute of Scientific and Technical Information of China (Project No. ZD2020-18).
Abstract: [Objective] This paper proposes a method for automatically aligning bilingual sentences, aiming to provide technical support for constructing bilingual parallel corpora, cross-language information retrieval, and other natural language processing tasks. [Methods] BERT pre-training is introduced into the sentence alignment method, and features are extracted with a bidirectional Transformer. Each word is represented by the sum of three vectors (position embeddings, token embeddings, and segment embeddings), which together capture its semantic information. The method then measures similarity in both directions, comparing the source language sentence with its translation and the target language sentence with its translation, and combines three measures (BLEU score, cosine similarity, and Manhattan distance) to align sentences. [Results] Two tasks were used to verify the effectiveness of the method. In the parallel corpus filtering task, the recall was 97.84%. In the comparable corpus filtering task, the precision was 99.47%, 98.31%, and 95.00% at noise ratios of 20%, 50%, and 90%, respectively. [Limitations] The text vectorization and similarity calculation could be further improved with more semantically expressive representations. [Conclusions] The proposed method outperforms the baseline systems in both the parallel corpus filtering and comparable corpus filtering tasks, and can produce large-scale, high-quality parallel corpora.
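The abstract names the three measures that are combined but not the fusion formula. As a rough illustration only, the sketch below fuses a precomputed BLEU score with the cosine similarity and Manhattan distance of two sentence vectors. The equal weights, the 1 / (1 + d) mapping from distance to similarity, and the function names are all assumptions made for this example, not the authors' method; the sentence vectors would in practice come from a BERT encoder, which is omitted here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def manhattan_distance(u, v):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def fused_score(bleu, u, v, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of three similarity signals.

    Assumption: the weights and the distance-to-similarity mapping are
    hypothetical choices for illustration; the paper's formula is not
    given in the abstract.
    """
    cos = cosine_similarity(u, v)
    # Map a distance in [0, inf) to a similarity in (0, 1].
    manhattan_sim = 1.0 / (1.0 + manhattan_distance(u, v))
    w_bleu, w_cos, w_man = weights
    return w_bleu * bleu + w_cos * cos + w_man * manhattan_sim


if __name__ == "__main__":
    # Toy vectors standing in for sentence embeddings.
    source_vec = [0.2, 0.7, 0.1]
    translation_vec = [0.25, 0.65, 0.1]
    print(fused_score(bleu=0.6, u=source_vec, v=translation_vec))
```

For the bidirectional measurement described in the abstract, one option would be to compute such a score once per direction and average the two; a candidate pair would then be kept when the result exceeds a threshold tuned on held-out data.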