Improved Relation-based Information Retrieval Technology  Cited by: 4


Authors: Li Yan[1], Wen Jian[1], Li Zhoujun[2]

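Affiliations: [1] School of Computer Science, National University of Defense Technology, Changsha 410073; [2] School of Computer Science and Engineering, Beihang University, Beijing 100083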

Source: Computer Science (《计算机科学》), 2008, No. 7, pp. 145-150 (6 pages)

Funding: Supported by the National Natural Science Foundation of China (60573057, 60473057, 90604007)

Abstract: Previous work has shown that existing relation-based information retrieval (RIR) techniques outperform term-based and concept-based IR techniques, but they still have an obvious shortcoming: they cannot make the relation itself explicit and can only state that concepts A and B form a related concept pair. This paper proposes an improved relation-based IR technique, IRIR (Improved Relation-based Information Retrieval), which makes the value and attributes of the relation explicit, integrates the concept pair and the relation into a triple expression, and obtains unknown information through the following matching method. Knowledge appearing in text is represented as R(relation)[First Concept, Second Concept]; a query beginning with an interrogative pronoun (such as "what") is expressed as R(relation)[First Concept, Unknown]; and a query beginning with an interrogative adverb (such as "how") is expressed as R(Unknown)[First Concept, Second Concept]. When the known parts of the triple expressions of a text and a query match, a value for the unknown part of the query is obtained. In this way, the technique can both provide QA (query answer)-like functionality and perform precise information retrieval. Building on the biomedical literature search engine developed by the DM & Bioinformatics Lab at Drexel University (2004 version, referred to as RIRS), we developed an experimental IR engine, IRIRS (Improved Relation-Based IR System), that implements the IRIR technique and its functions. The system uses two authoritative ontology resources, UMLS and WordNet, to determine concepts and relations, respectively. Experimental results on a test set of English reading comprehension from doctoral entrance examinations are satisfactory: IRIRS raises passage-level retrieval precision, MAPP (mean average passage precision), from 64.44% for RIRS to 74.28%. This indicates that applying the improved relation-based information retrieval technique in IR is well worth exploring.

One limitation of traditional relationship-based IR methods is that a relation is often recorded in binary form, such as R(First Term, Second Term), which carries only general information about a pair of terms that are semantically and syntactically related to each other. To tackle this problem, we explore an improved technique that uses triples in information retrieval for precision-focused biomedical literature search. In this paper, a triple is defined as a data structure that integrates a pair of concepts with a verb phrase, or sometimes a special noun, extracted from the sentence as the relation between the concepts in the pair; it stores both relation and concept information. Unlike the traditional relationship-based model, our model represents a document or a query by a set of triples, such as R(relation)[First Concept, Second Concept]. Since some semantic and syntactic exceptions occur in documents and queries, different types of triple should be permitted; e.g., the query "What does the mad cow disease come from?" has the triple R(come from)[First Concept(mad cow disease), Unknown]. Therefore, we can get the "answer" for the unknown item in a query if some documents have matching triples in the index. We apply an advanced ontology-based approach to extract generic concepts and their relations by using both UMLS and WordNet, and we have implemented a new approach to rank retrieved passages from the same or different documents, corresponding to the system performance measurement protocol in the TREC 2007 Genomics Track. A new version (which we call IRIRS) of the relation-based IR system developed by the DM & Bioinformatics Lab of Drexel University in 2004 (which we call RIRS) is then built for improved relation-based search in the area of biomedical literature IR and DM. We use IRIRS to improve the retrieval results on tests of English reading comprehension. The experiment shows promising performance of relation-based I
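
The triple matching described in the abstract can be illustrated with a minimal Python sketch. This is not the IRIRS implementation: the names `Triple`, `UNKNOWN`, and `answers` are hypothetical, the indexed triples are made-up illustrative data, and concepts are compared by exact string equality, whereas the paper relies on UMLS and WordNet to determine concepts and relations.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

UNKNOWN = None  # hypothetical marker for the slot the query asks about


@dataclass(frozen=True)
class Triple:
    """R(relation)[First Concept, Second Concept]; a slot set to UNKNOWN is unfilled."""
    relation: Optional[str]
    first: Optional[str]
    second: Optional[str]


def answers(query: Triple, indexed: Iterable[Triple]) -> List[str]:
    """Return candidate values for the query's single unknown slot.

    A document triple contributes a value only when every known slot of
    the query equals the corresponding slot of the document triple.
    """
    slots = ("relation", "first", "second")
    unknown = [s for s in slots if getattr(query, s) is UNKNOWN]
    if len(unknown) != 1:
        raise ValueError("query must have exactly one unknown slot")
    known = [s for s in slots if s != unknown[0]]
    return [
        getattr(t, unknown[0])
        for t in indexed
        if all(getattr(t, s) == getattr(query, s) for s in known)
    ]


# Made-up document triples, for illustration only.
index = [
    Triple("come from", "mad cow disease", "prion"),
    Triple("affect", "mad cow disease", "cattle"),
]

# "What ..." style query: R(come from)[mad cow disease, Unknown]
what_query = Triple("come from", "mad cow disease", UNKNOWN)
# "How ..." style query: R(Unknown)[mad cow disease, cattle]
how_query = Triple(UNKNOWN, "mad cow disease", "cattle")

print(answers(what_query, index))
print(answers(how_query, index))
```

In this sketch a "what" query leaves the second concept open and a "how" query leaves the relation open, mirroring the two query forms in the abstract. Ranking of the matched passages, which the paper also addresses, is not modeled here.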

Keywords: information retrieval; relation extraction; query analysis; triple structure

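Classification: TP391.3 [Automation and Computer Technology - Computer Application Technology]; TP391 [Automation and Computer Technology - Computer Science and Technology]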

 
