Information Extraction of University Research Faculty Based on LCA Segmentation Algorithm (cited by: 3)

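Affiliations: [1] School of Computer Science, Wuhan University, Wuhan 430072; [2] School of Computer Science and Information Engineering, Hubei University, Wuhan 430062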

Source: Journal of Frontiers of Computer Science and Technology, 2016, Issue 6, pp. 761-772 (12 pages)

Funding: National Natural Science Foundation of China, No. 61202100; Open Fund of the State Key Laboratory of Software Engineering, No. SKLSE2012-09-20

Abstract: Existing information extraction methods for semi-structured Web pages usually assume that valid data share strong structural similarity: they divide a page into data records and data regions with similar characteristics and then extract from them. However, university faculty list pages are mostly written and filled in by hand rather than generated automatically from templates, so their structure is not rigorous. To address the weak structure of such pages, this paper proposes a faculty information extraction method based on an LCA (lowest common ancestor) segmentation algorithm, introduces the connection between LCA and the strength of semantic relatedness into Web page segmentation, and presents the concepts of basic semantic blocks and effective semantic blocks. After the page is converted into a DOM (document object model) tree and preprocessed, it is first divided into basic semantic blocks by searching upward for LCA nodes. The basic semantic blocks are then merged, using the characteristics of personnel information, into effective semantic blocks that contain complete personnel information. Finally, by aligning the effective semantic blocks, all relation-mapped faculty information on the current page is obtained. Experimental results show that, compared with the MDR (mining data records) algorithm, the proposed method maintains high precision and recall in the segmentation and extraction of a large number of real university faculty pages.
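The LCA idea the abstract relies on can be illustrated with a minimal Python sketch. This is not the authors' implementation: the sample markup, the `id` values, and the helper names `build_parent_map`, `path_to_root`, and `lowest_common_ancestor` are all hypothetical. The sketch only shows how the lowest common ancestor of two DOM nodes is found by walking upward, and why a deeper LCA suggests that two nodes belong to the same person's record.

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    """Map every element to its parent (ElementTree keeps no parent pointers)."""
    return {child: parent for parent in root.iter() for child in parent}


def path_to_root(node, parents):
    """Return the list [node, parent, grandparent, ..., root]."""
    path = [node]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path


def lowest_common_ancestor(a, b, parents):
    """Walk upward from b until reaching a node that is also on a's path to the root."""
    ancestors_of_a = set(path_to_root(a, parents))
    for node in path_to_root(b, parents):
        if node in ancestors_of_a:
            return node
    return None


# Hypothetical, well-formed sample page with two faculty entries.
SAMPLE = """
<div id="list">
  <div id="p1"><span id="n1">Alice</span><span id="e1">alice@example.edu</span></div>
  <div id="p2"><span id="n2">Bob</span><span id="e2">bob@example.edu</span></div>
</div>
""".strip()

root = ET.fromstring(SAMPLE)
parents = build_parent_map(root)
by_id = {el.get("id"): el for el in root.iter()}

# A name and an e-mail of the same person meet at that person's own block.
same_person = lowest_common_ancestor(by_id["n1"], by_id["e1"], parents)
# Fields of different people only meet higher up, at the list container.
different_people = lowest_common_ancestor(by_id["n1"], by_id["e2"], parents)
```

In this toy page, the name and e-mail of one person share the LCA `p1`, while fields from different people only share the outer `list` container. Real faculty pages are rarely well-formed XML, so an HTML parser would be needed in practice.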

Keywords: information extraction; lowest common ancestor (LCA); basic semantic block; effective semantic block; relation mapping

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
