构建和剖析中英三元组可比语料库  被引量:5

Building and profiling Chinese-English 3-tuple comparable corpora

在线阅读下载全文

作  者:胡小鹏[1] 袁琦[1] 耿鑫辉[1] 朱姝[1] 

机构地区:[1]中国电子信息产业发展研究院,北京100044

出  处:《计算机工程与应用》2014年第13期153-157,186,共6页Computer Engineering and Applications

基  金:国家自然科学基金(No.61172101;No.61172102)

摘  要:由于受到翻译腔的影响,中英平行语料库存在固有的扭斜的语言模型。显然,用这样的语料库训练的机器翻译、跨语言检索等自然语言处理系统也承袭了扭斜的语言模型,严重影响到应用系统的性能。为了克服平行语料库固有的缺陷,提出构建和剖析中英三元组可比语料库的技术研究。这项研究采用可比语料库和语言自动剖析技术,使用统计和规则相结合的方法,对由本族英语、中式英语和标准中文三元素所组成的三元组可比语料库中的本族英语和中式英语进行统计分析。在此基础上,利用n-元词串、关键词簇等自动抽取技术挖掘基于本族语言模型的双语资源,实现改进和发展机器翻译等自然语言的处理应用。There exists inherent skewed language model in Chinese-English parallel corpus due to the influence of transla-tionese. Obviously, natural language processing systems trained with these corpora, including machine translation and cross-language information retrieval, will inherit the skewed language model, thus seriously degrading the performance of applications. To fix the inherent defaults in parallel corpus, this paper proposes a technical research on building and profiling Chinese-English 3-tuple comparable corpora. The study adopts comparable corpora and automatic language profiling technologies and applies a combined method of statistics and rules for statistical analysis on native English and Chinglish in 3-tuple comparable corpora that consists of native English, Chinglish and standard Chinese. Based on this, automatic extraction technologies, such as n-grams and key clusters, are used in the mining of native-language-based bilingual resources to improve and develop natural language processing applications such as machine translation.

关 键 词:三元组可比语料库 语言迁移 自动语言剖析 n-元词串 

分 类 号:TP391[自动化与计算机技术—计算机应用技术]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象