English Adhesion Word Restoration Based on the Transformer Model (cited by: 1)


Authors: Zhu Xinyang; Chi Chengying[1]; Zhan Xuegang[1]

Affiliation: [1] School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan 114000, Liaoning, China

Source: Computer Applications and Software (《计算机应用与软件》), 2023, No. 8, pp. 45-49, 97 (6 pages in total)

Funding: National Natural Science Foundation of China, General Program (61672138).

Abstract: The performance of neural machine translation (NMT) depends on both the size and the quality of the training corpus. Analysis of English data shows that multiple words are sometimes run together without separating spaces; such strings are referred to here as adhesion words, and their presence degrades data quality. To improve data quality, adhesion words must be restored to independent words, that is, to a form in which words are separated by spaces. This paper proposes using a Transformer model to restore adhesion words. Three different strategies are applied to the data in the preprocessing stage. Experiments show that the strategy of applying word segmentation and BPE segmentation to the data performs best, reaching 95.5% accuracy on a real data set; adding a post-processing step on top of the Transformer model raises accuracy to 98.5%. The method is transferable and can be used for any language in which words are separated by spaces.
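The abstract does not say how training pairs were built or what the post-processing step does, so the following is only an illustrative sketch of the task framing, not the authors' method. It assumes that source/target pairs could be synthesized by randomly deleting spaces from clean sentences, and that one plausible sanity check on a model output is that it differs from the input only in whitespace. The names `make_adhesion_pair`, `is_consistent_restoration`, and `glue_prob` are hypothetical.

```python
import random


def make_adhesion_pair(sentence, glue_prob=0.3, rng=None):
    """Return (source, target) for a hypothetical seq2seq training pair.

    source: the sentence with some adjacent words glued together.
    target: the sentence with words separated by single spaces.
    This construction is an assumption, not taken from the paper.
    """
    rng = rng or random.Random()
    words = sentence.split()
    if not words:
        return "", ""
    pieces = [words[0]]
    for word in words[1:]:
        if rng.random() < glue_prob:
            pieces[-1] += word   # drop the space: creates an adhesion word
        else:
            pieces.append(word)  # keep the space
    return " ".join(pieces), " ".join(words)


def is_consistent_restoration(source, restored):
    """Check that `restored` differs from `source` only in whitespace.

    A restoration should only insert spaces, so both strings must be
    identical once all whitespace is removed. (Assumed check, not the
    paper's post-processing.)
    """
    return "".join(source.split()) == "".join(restored.split())


if __name__ == "__main__":
    src, tgt = make_adhesion_pair("the data quality is important", glue_prob=0.5)
    print(src, "->", tgt)
    print(is_consistent_restoration(src, tgt))
```

With a check of this kind, an output that changes or drops characters can be rejected and handled by some fallback; which fallback the paper uses, if any, is not stated in this record.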

Keywords: data quality; adhesion words; Bayes; Transformer model

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
