Research on Filtering Parallel Corpus: A Ranking Model (original Chinese title: 平行语料库处理初探:一种排序模型)

Cited by: 4

Authors: Chen Yidong (陈毅东)[1], Shi Xiaodong (史晓东)[1], Zhou Changle (周昌乐)[1]

Affiliation: [1] Department of Computer Science, Xiamen University, Xiamen, Fujian 361005, China

Source: Journal of Chinese Information Processing (《中文信息学报》), 2006, Issue B03, pp. 66-70 (5 pages)

Funding: National 863 Program of China (2004AA117010); National Natural Science Foundation of China (60373080)

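Abstract (translated from the Chinese): Over the past decade, statistical methods have received wide attention in machine translation and have gradually become the mainstream approach in machine translation research. A large-scale, high-quality bilingual parallel corpus is an essential foundation for building a high-quality statistical machine translation system. At present, most parallel corpora contain errors or noise, which greatly affect the performance of statistical machine translation systems. Filtering the sentence pairs of a corpus by hand is time-consuming and laborious. This paper studies a ranking model that helps address this problem. The model takes several factors into account, including language models, length information, and meaning correspondence. Because today's statistical machine translation systems all rely on word alignment information, word alignment factors are also incorporated into the model. The experimental results at the end of the paper show that the model performs well.

Abstract (original English): In the past ten years, statistical methods have become increasingly popular in machine translation research. The performance of a statistical machine translation (SMT) system depends on many aspects, such as the translation model, the search strategy, and the parallel corpus. In particular, the parallel corpus has become an essential resource for SMT systems. Many parallel corpora contain errors, and it is tiring and time-consuming to filter out bad sentence pairs. This paper presents a ranking model that helps deal with this problem. The model considers both syntactic and semantic features of sentence pairs. Since most current statistical machine translation models depend on word alignment, features related to word alignment information are also included. An experiment at the end of the paper shows that the model has promising performance.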

Keywords: parallel corpus; corpus processing; ranking; statistical machine translation

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
