Building a Russian-Chinese comparable corpus based on Wikipedia and its comparability calculation (基于维基百科的俄汉可比语料库构建及可比度计算)  Cited by: 3


Authors: Yuan Wei (原伟), Yi Mianzhu (易绵竹)[2]

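Affiliations: [1] Postdoctoral Research Station, Shanghai International Studies University, Shanghai 200083; [2] Department of Language Engineering, PLA University of Foreign Languages, Luoyang, Henan 471003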

Source: Journal of Shandong University (Natural Science), 2017, No. 9, pp. 1-6 (6 pages)

Funding: National Social Science Fund of China (14CYY051); China Postdoctoral Science Foundation, General Program (2017M610268)

Abstract: Russian-Chinese corpus research urgently needs new breakthroughs in data sources, research angles, and applications. Owing to their inherent advantages and wide range of uses, comparable corpora have become a research hotspot in corpus linguistics and natural language processing, yet no study of Russian-Chinese comparable corpora has so far been reported in China. After reviewing existing work in China and abroad, this paper designs a method for constructing a Russian-Chinese comparable corpus from Wikipedia, develops a system for automatically acquiring comparable texts, and builds a document-aligned Russian-Chinese comparable corpus containing more than a million characters (words). Finally, the comparability of the corpus is evaluated using cross-language similarity calculation. The results show that the method can effectively obtain Russian-Chinese texts with high comparability, and that the resulting corpus can be used in Russian-Chinese translation, discourse analysis, and computational linguistics research.

Keywords: comparable corpus; Russian; Wikipedia

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
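
Illustrative note: the abstract mentions evaluating comparability through cross-language similarity calculation but does not give the formula. The following is only a minimal sketch of one common approach (dictionary-based translation followed by bag-of-words cosine similarity), not the authors' method. The toy dictionary `TOY_DICT`, the function names, and the assumption that both texts are already tokenized are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy Russian-to-Chinese dictionary (assumption, not from the paper).
# A real system would need a full bilingual lexicon or machine translation.
TOY_DICT = {
    "москва": "莫斯科",
    "столица": "首都",
    "россия": "俄罗斯",
    "город": "城市",
}


def translate_tokens(ru_tokens, dictionary):
    """Map Russian tokens to Chinese via the dictionary, dropping unknown words."""
    return [dictionary[t] for t in ru_tokens if t in dictionary]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector shares nothing with any document
    return dot / (norm_a * norm_b)


def comparability(ru_tokens, zh_tokens, dictionary=TOY_DICT):
    """Score in [0, 1]: cosine of the translated Russian tokens vs. the Chinese tokens."""
    translated = Counter(translate_tokens(ru_tokens, dictionary))
    return cosine(translated, Counter(zh_tokens))


# Pre-tokenized toy documents (assumed input format).
ru_doc = ["москва", "столица", "россия"]
zh_doc = ["莫斯科", "是", "俄罗斯", "首都"]
score = comparability(ru_doc, zh_doc)
```

In this sketch a document pair scores higher the more of its translated vocabulary overlaps; a practical pipeline would also need Russian lemmatization and Chinese word segmentation before the tokens reach `comparability`.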

 
