Authors: YU Qing (于清)[1]; CHANG Le (常乐); XU Jian (徐健); LIU Tian-yi (刘天毅); LI Xiao-long (Academy of Information Science and Engineering, Xinjiang University, Urumqi 830046, China; School of Software, Xinjiang University, Urumqi 830008, China)
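Affiliations: [1] Academy of Information Science and Engineering, Xinjiang University, Urumqi 830046; [2] School of Software, Xinjiang University, Urumqi 830008; [3] Academy of Information Science and Engineering, Xinjiang University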
Source: Journal of Inner Mongolia University: Natural Science Edition (《内蒙古大学学报(自然科学版)》), 2018, No. 5, pp. 528-533 (6 pages)
Funding: National Natural Science Foundation of China (61562082)
Abstract: To improve the quality of Chinese-Uyghur machine translation in the medical domain, and to reduce the time and labor spent manually extracting and translating large numbers of medical terms, a bilingual terminology extraction method based on word vector representation is proposed and compared with traditional statistical phrase alignment extraction. First, the experimental data were prepared: a Chinese medical corpus of 45216 sentences was built, and manual translation yielded 23996 Uyghur sentences; 65394 Chinese medical vocabulary entries were collected manually, and translation yielded 31421 Uyghur terms; the Chinese corpus was word-segmented and the Uyghur corpus was morphologically segmented. Second, a bilingual terminology extraction experiment based on word vector representation was designed, with an accuracy of 25.12%; traditional statistical phrase alignment extraction was also applied to the Chinese-Uyghur medical parallel corpus, with an accuracy of 27.28%. The results show that the new method depends more heavily on a large parallel corpus, but that both methods help improve Chinese-Uyghur machine translation quality in the medical domain and help automate the extraction and translation of large numbers of medical terms.
Keywords: bilingual terminology extraction; word vectors; machine translation; parallel corpus; GIZA++
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
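Illustration: the abstract does not describe how the word-vector method pairs terms, so the following is only a minimal sketch of one common approach, not the authors' implementation. It assumes that Chinese and Uyghur term vectors already live in a shared vector space, pairs each Chinese term with the Uyghur term of highest cosine similarity, and scores the result against a gold dictionary. The term labels, the toy vectors, and the function names `cosine`, `extract_term_pairs`, and `accuracy` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extract_term_pairs(src_vecs, tgt_vecs):
    """Pair each source term with the target term whose vector is most similar."""
    return {
        src: max(tgt_vecs, key=lambda tgt: cosine(vec, tgt_vecs[tgt]))
        for src, vec in src_vecs.items()
    }


def accuracy(pairs, gold):
    """Fraction of extracted pairs that match the gold dictionary."""
    if not pairs:
        return 0.0
    correct = sum(1 for src, tgt in pairs.items() if gold.get(src) == tgt)
    return correct / len(pairs)


# Hypothetical toy data: placeholder labels and made-up 3-dimensional vectors,
# assumed to be in one shared space (an assumption, not stated in the abstract).
zh_vecs = {
    "zh_term_1": [0.9, 0.1, 0.0],
    "zh_term_2": [0.1, 0.8, 0.1],
    "zh_term_3": [0.3, 0.9, 0.2],  # deliberately close to the wrong target
}
ug_vecs = {
    "ug_term_1": [1.0, 0.0, 0.1],
    "ug_term_2": [0.0, 1.0, 0.0],
    "ug_term_3": [0.1, 0.1, 1.0],
}
gold = {"zh_term_1": "ug_term_1", "zh_term_2": "ug_term_2", "zh_term_3": "ug_term_3"}

pairs = extract_term_pairs(zh_vecs, ug_vecs)
print(pairs)
print(f"accuracy = {accuracy(pairs, gold):.2%}")
```

The accuracy figure here is the share of extracted pairs that agree with a reference dictionary, which is one plausible reading of the accuracy reported in the abstract; the record itself does not define the metric.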