Aligning English-Chinese Words Without Bilingual Dictionary

Cited by: 11


Authors: Lü Xueqiang (吕学强) [1], Wu Honglin (吴宏林) [2], Yao Tianshun (姚天顺) [2]

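Affiliations: [1] Institute of Computational Linguistics, School of Information Science and Technology, Peking University, Beijing 100871; [2] Institute of Computer Software and Theory, School of Information Science and Engineering, Northeastern University, Shenyang 110004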

Source: Chinese Journal of Computers (《计算机学报》), 2004, No. 8, pp. 1036-1045 (10 pages)

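Funding: Supported by the National Natural Science Foundation of China (60083006); the National "973" Key Basic Research Development Program (G19980305011); and the National "863" High-Tech Research and Development Program (2001AA114019; 2001AA114210; 2002AA11701008)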

Abstract: Aligning the two languages at each linguistic level is one of the basic ways of processing a bilingual corpus. Much research has been done on word alignment between Indo-European languages, but far less on English-Chinese alignment. This paper proposes a corpus-based model for English-Chinese word alignment that requires no bilingual dictionary. It formally represents natural-language sentences as sets and uses set intersection and set difference to perform word alignment, while also taking the effects of word order and repeated words into account. The model comprises a set of sub-models: the minimum intersection model, minimum difference model, hybrid model, mono-directional model, bi-directional model, union model, and surrounding model. The English→Chinese mono-directional model generates 1-m parallels, and the Chinese→English mono-directional model generates n-1 parallels. The union model and surrounding model then generate n-m parallels from the 1-m and n-1 parallels. Any two parallels generated for a sentence pair have an empty intersection, and the parallels themselves are minimal. The method can align both high-frequency and low-frequency words, and it tolerates Chinese word segmentation errors and unknown words. A defining characteristic of the model is that it needs almost no linguistic knowledge or linguistic resources, so the corpus-based method can be applied on its own. Experimental results show that the larger the homogeneous corpus, the higher the precision and recall of word alignment.
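The set-based idea in the abstract can be illustrated with a small sketch. This is not the paper's model: the abstract does not define the sub-models, so the function `candidate_translations`, the toy corpus, and the specific way intersection and difference are combined here are all assumptions made for illustration. Each sentence is treated as a plain set of words (word order and repeated words are ignored), the Chinese sides of all sentence pairs containing an English word are intersected, and Chinese words seen in pairs that lack the English word are then subtracted.

```python
from typing import List, Set, Tuple

# Hypothetical representation: each sentence pair is (English word set, Chinese word set).
SentencePair = Tuple[Set[str], Set[str]]


def candidate_translations(word: str, corpus: List[SentencePair]) -> Set[str]:
    """Return Chinese words that co-occur with `word` in every pair containing it
    (intersection), minus Chinese words that appear in pairs without it (difference)."""
    with_word = [zh for en, zh in corpus if word in en]
    if not with_word:
        return set()
    # Intersection: keep only Chinese words shared by all pairs containing `word`.
    candidates = set.intersection(*with_word)
    # Difference: drop Chinese words that also occur where `word` is absent.
    for en, zh in corpus:
        if word not in en:
            candidates -= zh
    return candidates


# Toy, pre-segmented corpus invented for this example.
corpus: List[SentencePair] = [
    ({"i", "like", "tea"}, {"我", "喜欢", "茶"}),
    ({"i", "like", "books"}, {"我", "喜欢", "书"}),
    ({"he", "drinks", "tea"}, {"他", "喝", "茶"}),
]

for w in ("tea", "books", "like"):
    print(w, candidate_translations(w, corpus))
```

In this toy corpus, "books" occurs only once, yet the difference step still isolates a single candidate, which hints at how set operations might help with low-frequency words. "like" stays ambiguous because it always co-occurs with "i"; resolving such cases would need more data or the additional sub-models the abstract lists.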

Keywords: natural language processing; bilingual corpus; word alignment; minimum intersection; minimum difference

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

 
