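Affiliations: [1] Institute of Computational Linguistics, School of Information Science and Technology, Peking University, Beijing 100871; [2] Institute of Computer Software and Theory, School of Information Science and Engineering, Northeastern University, Shenyang 110004
Source: Chinese Journal of Computers (《计算机学报》), 2004, No. 8, pp. 1036-1045 (10 pages)
Funding: Supported by the National Natural Science Foundation of China (60083006); the National "973" Key Basic Research Development Program (G19980305011); and the National "863" High-Tech Research and Development Program (2001AA114019; 2001AA114210; 2002AA11701008)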
Abstract: This paper proposes a corpus-based English-Chinese word alignment model that requires no bilingual dictionary. It formally represents natural-language sentences as sets and aligns words through set intersection and set difference, while also accounting for word order and repeated words. The model aligns low-frequency words as well as high-frequency words, and it tolerates unknown words and Chinese word segmentation errors. It needs almost no linguistic knowledge or linguistic resources, so the corpus-based method can be applied on its own. Experiments show that the larger the homogeneous corpus, the higher the precision and recall of word alignment.

One method of processing a bilingual corpus is to align the two languages at each linguistic level. Much research has been done on word alignment between Indo-European languages, but far less on English-Chinese alignment. This paper proposes a corpus-based model for word alignment between English and Chinese. It formalizes natural-language sentences as sets and uses the intersection and difference of those sets to implement word alignment. The effects of word order and repetition are also considered. The model comprises a set of sub-models: the minimum intersection model, minimum difference model, hybrid model, mono-directional model, bi-directional model, union model, and surrounding model. The English→Chinese mono-directional model is used to generate 1-m parallels, and the English-Chinese model is used to generate n-1 parallels. The union model and surrounding model are used to generate n-m parallels from the 1-m and n-1 parallels. The intersection of any two generated parallels in a sentence pair is empty, and the parallels themselves are minimal. The method can align both high-frequency and low-frequency words, and it is tolerant of Chinese word segmentation errors and unknown words. Its defining characteristic is that it needs little linguistic knowledge and few linguistic resources. Experimental results show that the larger the homogeneous corpus, the higher the precision and recall that can be obtained.
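The abstract describes sentences as sets and alignment as set intersection and difference, without giving the procedure itself. The following is a minimal sketch of that general idea only, not the authors' model: it ignores word order, repeated words, and all of the named sub-models. The names `CORPUS`, `intersect_candidates`, and `difference_candidates`, and the three toy sentence pairs, are hypothetical and chosen purely for illustration.

```python
from typing import List, Set, Tuple

# A sentence pair is (set of English words, set of segmented Chinese words).
SentencePair = Tuple[Set[str], Set[str]]

# Hypothetical toy corpus, assumed to be tokenized and segmented already.
CORPUS: List[SentencePair] = [
    ({"i", "like", "tea"}, {"我", "喜欢", "茶"}),
    ({"i", "like", "coffee"}, {"我", "喜欢", "咖啡"}),
    ({"he", "drinks", "tea"}, {"他", "喝", "茶"}),
]


def intersect_candidates(word: str, corpus: List[SentencePair]) -> Set[str]:
    """Intersect the Chinese sides of every pair whose English side contains `word`.

    Whatever survives the intersection co-occurs with `word` in all of those
    pairs, so it is a candidate translation. Returns an empty set if the word
    never occurs.
    """
    candidates = None
    for english, chinese in corpus:
        if word in english:
            candidates = set(chinese) if candidates is None else candidates & chinese
    return candidates if candidates is not None else set()


def difference_candidates(a: SentencePair, b: SentencePair) -> Tuple[Set[str], Set[str]]:
    """Subtract pair `b` from pair `a` on both sides.

    When two pairs share most of their words, the residue on the English side
    is a candidate match for the residue on the Chinese side.
    """
    return a[0] - b[0], a[1] - b[1]
```

In this toy corpus, `intersect_candidates("tea", CORPUS)` narrows to a single Chinese word because "tea" appears in two otherwise different sentences, while `intersect_candidates("like", CORPUS)` still leaves two candidates because "i" and "like" always co-occur. That is consistent with the abstract's observation that a larger homogeneous corpus improves precision and recall. `difference_candidates(CORPUS[0], CORPUS[1])` isolates one word on each side without needing the word to occur more than once, which is one way a difference operation could help with low-frequency words.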
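Keywords: natural language processing; bilingual corpus; word alignment; minimum intersection; minimum difference
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]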