Authors: Kou Yue (寇月)[1], Li Dong (李冬)[2], Shen Derong (申德荣)[1], Yu Ge (于戈)[1], Nie Tiezheng (聂铁铮)[1]
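Affiliations: [1] College of Information Science and Engineering, Northeastern University, Shenyang 110004; [2] Business Software Division, Neusoft Group, Shenyang 110179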
Source: Journal of Computer Research and Development (《计算机研究与发展》), 2010, No. 5, pp. 858-865 (8 pages)
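Funding: National Natural Science Foundation of China (60673139; 60973021); National High Technology Research and Development Program of China ("863" Program) (2008AA01Z146); Fundamental Research Funds for the Central Universities (NO90304005)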
Abstract: As Web databases continue to grow, accessing the Deep Web is becoming the main way to acquire information. The large-scale unstructured content, heterogeneous results, and dynamic data of the Deep Web pose new challenges for entity extraction, so effectively extracting the entities contained in Deep Web result pages is a problem worth studying. By analyzing the characteristics of result pages, this paper presents a DOM-tree based entity extraction mechanism for the Deep Web (D-EEM). D-EEM is modeled as three levels: the expression level, the extraction level, and the collection level; region location and semantic annotation are the core components studied in the paper. D-EEM applies a DOM-tree based automatic entity extraction strategy that uses both the textual content and the hierarchical structure of the DOM tree to determine data regions and entity regions, which improves extraction accuracy. In addition, a semantic annotation method based on context distance and co-occurrence counts is proposed, which helps combine the extraction results from different data sources. Experiments verify the feasibility and effectiveness of the key techniques used in D-EEM; compared with other entity extraction strategies, D-EEM shows advantages in both extraction efficiency and extraction accuracy.
Keywords: entity extraction; DOM tree; Deep Web; data region location; entity region location
Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]
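
Illustrative sketch: the abstract describes locating a data region and its entity regions from the text and hierarchy of a DOM tree, but this record does not give the algorithm itself. The Python below is therefore only a generic repeated-sibling heuristic written under that assumption, not the D-EEM method. The names `Node`, `TreeBuilder`, `locate_regions`, and `SAMPLE` are hypothetical. It picks as the data region the node with the most text-bearing children that share the same structural signature, and treats those children as entity regions.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so we must not descend into them.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """Minimal DOM-like node (hypothetical helper, not from the paper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

    def signature(self):
        # Tag name plus the tags of direct children: a crude structural fingerprint.
        return (self.tag, tuple(c.tag for c in self.children))

    def full_text(self):
        # Own text followed by the text of all descendants.
        parts = [self.text.strip()] + [c.full_text() for c in self.children]
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def locate_regions(html):
    """Return (data_region, entity_regions) using the repeated-sibling heuristic."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best, best_score, best_sig = None, 1, None  # require at least 2 repeats
    for node in iter_nodes(builder.root):
        counts = Counter(c.signature() for c in node.children if c.full_text())
        if not counts:
            continue
        sig, score = counts.most_common(1)[0]
        if score > best_score:
            best, best_score, best_sig = node, score, sig
    if best is None:
        return None, []
    entities = [c for c in best.children
                if c.signature() == best_sig and c.full_text()]
    return best, entities


# Made-up result page used only to exercise the sketch.
SAMPLE = """
<html><body>
<div id="nav"><a href="/">Home</a></div>
<ul>
<li><b>Title A</b><span>Author A</span></li>
<li><b>Title B</b><span>Author B</span></li>
<li><b>Title C</b><span>Author C</span></li>
</ul>
</body></html>
"""

region, entities = locate_regions(SAMPLE)
for entity in entities:
    print(region.tag, "->", entity.full_text())
```

A signature built only from tag names is deliberately coarse; a real extractor would likely need tolerance for optional fields and would still have to label the extracted values, which is the role the abstract assigns to semantic annotation.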