Authors: WANG Hai-bo (王海波), YU Li-li (余丽丽), WANG Hong-wei (王宏伟) [3]
Affiliations: [1] College of Biomedical Engineering & Instrument Science, Zhejiang University, Hangzhou, Zhejiang 310027, China; [2] College of Teacher Education, Zhejiang Normal University, Jinhua, Zhejiang 321004, China; [3] ZJU-UIUC Joint Institute, Zhejiang University, Haining, Zhejiang 314499, China
Source: Acta Electronica Sinica (《电子学报》), 2023, No. 10, pp. 2884-2893 (10 pages)
Funding: National Key Research and Development Program of China (No. 2020YFB1707803); Zhejiang University Research Funding Project (No. XY2021018).
Abstract: Token imbalance is a common phenomenon in neural machine translation (NMT) corpora. Evaluating the degree of token imbalance in an NMT corpus is important for improving corpus quality and translation performance. To address the shortcomings of existing token-imbalance measurements in both algorithm and tokenization scope, this paper proposes the Dispersion of Token Distribution (DTD) algorithm for computing the degree of token imbalance, and broadens the tokenization scope to evaluate corpora at three granularities: character, subword, and word. Experimental results show that the proposed algorithm substantially improves on previous studies in accuracy, validity, and robustness. The degree of token imbalance differs greatly across tokenization granularities: it is highest at character granularity, followed by subword granularity, and lowest at word granularity.
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]