Approach of Large-Scale Sentence Similarity Computation (大规模句子相似度计算方法)

Cited by: 6

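Affiliations: [1] Research Center of Computer and Language Information Engineering, Chinese Academy of Sciences, Beijing 100083 [2] Nanjing University of Science and Technology, Nanjing, Jiangsu 210094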

Source: Journal of Chinese Information Processing (《中文信息学报》), 2006, Issue B03, pp. 47-52 (6 pages)

Funding: National Natural Science Foundation of China (60502048, 60272088); National 863 Program (2002AA117010-02)

Abstract: Retrieving the translation examples most similar to a source-language (SL) sentence from a large-scale corpus, that is, computing sentence similarity, is one of the key problems of example-based machine translation (EBMT). This paper proposes a multi-layer sentence similarity computation approach. First, a small number of candidate translation examples are selected from the large-scale corpus on the basis of the surface features and information entropy of the words in the sentence. Second, the degree of generalization match between the input sentence and each candidate is computed. Finally, the sentence similarity is computed from the outcomes of the previous two steps. Experiments on the multi-strategy machine translation system IHSMTS show that, on a corpus of 200,000 English-Chinese sentence pairs, the approach retrieves similar sentences with a recall of 96% and a precision of 90%, which demonstrates its effectiveness.
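The two-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so every detail here is an assumption made for illustration. The function names (`word_weights`, `select_candidates`, `generalized_similarity`, `most_similar`) are hypothetical, the entropy-based word weight is approximated by the self-information of a word's document frequency, and "generalization matching" is reduced to replacing numbers with a `<NUM>` class tag before comparing token sequences.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def tokenize(sentence):
    """Lowercase a sentence and split it into word and number tokens."""
    return re.findall(r"[a-z]+|\d+(?:\.\d+)?", sentence.lower())


def word_weights(corpus):
    """Assumed weighting: self-information of each word's document frequency,
    so rarer words carry more weight. Not the paper's actual entropy formula."""
    n = len(corpus)
    df = Counter(w for s in corpus for w in set(tokenize(s)))
    return {w: -math.log2(c / (n + 1)) for w, c in df.items()}


def select_candidates(query, corpus, weights, k=3):
    """Stage 1: keep the k sentences with the highest weighted word overlap."""
    q = set(tokenize(query))
    default = math.log2(len(corpus) + 1)  # weight for words unseen in the corpus

    def score(sentence):
        s = set(tokenize(sentence))
        shared = sum(weights.get(w, default) for w in q & s)
        total = sum(weights.get(w, default) for w in q | s)
        return shared / total if total else 0.0

    return sorted(corpus, key=score, reverse=True)[:k]


def generalize(tokens):
    """Assumed generalization: map every number token to one class tag."""
    return ["<NUM>" if t[0].isdigit() else t for t in tokens]


def generalized_similarity(a, b):
    """Stage 2: sequence similarity of the generalized token lists, in [0, 1]."""
    return SequenceMatcher(
        None, generalize(tokenize(a)), generalize(tokenize(b))
    ).ratio()


def most_similar(query, corpus, k=3):
    """Run both stages and return the best-matching corpus sentence."""
    candidates = select_candidates(query, corpus, word_weights(corpus), k)
    return max(candidates, key=lambda s: generalized_similarity(query, s))
```

Calling `most_similar(query, corpus)` with a list of source-language sentences returns the stored sentence closest to `query`. The cheap first stage only narrows the corpus, so the costlier sequence comparison runs on `k` sentences rather than on all of them, which is the motivation the abstract gives for a layered design.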

Keywords: sentence similarity; example-based machine translation; multi-strategy machine translation; generalization matching

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
