Research of Bilingual Term Extraction Based on Parallel Corpus Related to Medical in Chinese and Uyghur  Cited by: 5


Authors: YU Qing [1]; CHANG Le; XU Jian; LIU Tian-yi; LI Xiao-long (Academy of Information Science and Engineering, Xinjiang University, Urumqi 830046, China; School of Software, Xinjiang University, Urumqi 830008, China)

Affiliations: [1] Academy of Information Science and Engineering, Xinjiang University, Urumqi 830046 [2] School of Software, Xinjiang University, Urumqi 830008

Source: Journal of Inner Mongolia University: Natural Science Edition, 2018, No. 5, pp. 528-533 (6 pages)

Funding: National Natural Science Foundation of China (61562082)

Abstract: To improve the quality of Chinese-Uyghur machine translation in the medical domain and to reduce the time and labor of manually extracting and translating large numbers of medical terms, a bilingual term extraction method based on word vector representations is proposed and compared with traditional statistical phrase alignment extraction. First, the experimental data were prepared: a Chinese medical corpus of 45216 sentences was built and manually translated, yielding 23996 Uyghur sentences; 65394 Chinese medical terms were collected manually and translated, yielding 31421 Uyghur terms; the Chinese corpus was word-segmented and the Uyghur corpus was morphologically segmented. Second, a bilingual term extraction experiment based on word vector representations was designed, reaching an accuracy of 25.12%; traditional statistical phrase alignment extraction was also applied to the Chinese-Uyghur medical parallel corpus, reaching an accuracy of 27.28%. The results indicate that the new method depends more heavily on a large parallel corpus, but that both methods help improve Chinese-Uyghur machine translation in the medical domain and help automate the extraction and translation of medical terms.
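The abstract does not spell out how the word-vector extraction step or the accuracy figure is computed. As a rough illustration only, the sketch below assumes that source and target term vectors already live in a shared embedding space, pairs each source term with its nearest target term by cosine similarity, and scores the result against a gold dictionary. The function names, the shared-space assumption, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def extract_term_pairs(src_vecs, tgt_vecs):
    """Pair every source term with the target term whose vector is most similar.

    Assumption: both dictionaries map a term to a vector in the SAME space.
    """
    pairs = {}
    for src_term, src_vec in src_vecs.items():
        pairs[src_term] = max(
            tgt_vecs, key=lambda tgt_term: cosine(src_vec, tgt_vecs[tgt_term])
        )
    return pairs


def accuracy(pairs, gold):
    """Fraction of gold source terms whose extracted translation matches the gold one."""
    if not gold:
        return 0.0
    correct = sum(1 for src, tgt in gold.items() if pairs.get(src) == tgt)
    return correct / len(gold)


# Toy, made-up vectors standing in for Chinese and Uyghur term embeddings.
src_vecs = {"zh_term_a": [1.0, 0.0], "zh_term_b": [0.0, 1.0]}
tgt_vecs = {
    "ug_term_a": [0.9, 0.1],
    "ug_term_b": [0.1, 0.9],
    "ug_term_c": [0.7, 0.7],
}
gold = {"zh_term_a": "ug_term_a", "zh_term_b": "ug_term_c"}

pairs = extract_term_pairs(src_vecs, tgt_vecs)
print(pairs, accuracy(pairs, gold))
```

In practice the vectors would come from embeddings trained on the segmented corpora, and some mapping between the two languages' spaces would be needed before nearest-neighbour search is meaningful; the record does not say how the authors handled either step.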

Keywords: bilingual term extraction; word vectors; machine translation; parallel corpus; GIZA++

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

 
