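Affiliations: [1] School of Computer Science, Wuhan University, Wuhan 430072 [2] School of Computer Science and Information Engineering, Hubei University, Wuhan 430062
Source: Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), 2016, No. 6, pp. 761-772 (12 pages)
Funding: National Natural Science Foundation of China, No. 61202100; Open Fund of the State Key Laboratory of Software Engineering, No. SKLSE2012-09-20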
Abstract: Existing information extraction methods for semi-structured Web pages mostly assume that valid data share strong structural similarity; they divide a page into data records and data regions with similar characteristics and then extract from them. However, university faculty list pages are mostly written and filled in by hand rather than generated automatically from templates, so their structure is not rigorous. To address the weak structure of such pages, this paper proposes a faculty information extraction method based on an LCA (lowest common ancestor) segmentation algorithm. The method introduces the connection between LCA and the strength of semantic relatedness into Web page segmentation and presents the concepts of basic semantic blocks and effective semantic blocks. After the page is converted into a DOM (document object model) tree and preprocessed, it is first divided into basic semantic blocks by searching upward for LCA nodes. Next, the basic semantic blocks are merged, using the characteristics of faculty information, into effective semantic blocks that hold complete personnel information. Finally, the effective semantic blocks are aligned to obtain all relation-mapped faculty information on the current page. Experimental results show that, in the segmentation and extraction of a large number of real university faculty pages, the proposed method maintains high precision and recall compared with the MDR (mining data records) algorithm.
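Keywords: information extraction; lowest common ancestor (LCA); basic semantic block; effective semantic block; relation mapping
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]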
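
Note: the abstract relies on finding the lowest common ancestor (LCA) of nodes in a DOM tree. The sketch below only illustrates that single operation on a toy tree; it is not the paper's segmentation algorithm, and the `Node` class, the tag names, and the sample page are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical, simplified DOM node with a parent link."""
    tag: str
    text: str = ""
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add(self, child: "Node") -> "Node":
        """Attach child under this node and return the child."""
        child.parent = self
        self.children.append(child)
        return child


def ancestors(node: Optional[Node]) -> List[Node]:
    """Path from node up to the root, with node itself first."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def lowest_common_ancestor(a: Node, b: Node) -> Optional[Node]:
    """Deepest node that lies on both root paths, or None if the trees differ."""
    on_path_of_a = {id(n) for n in ancestors(a)}
    for n in ancestors(b):  # walk upward from b; first hit is the deepest
        if id(n) in on_path_of_a:
            return n
    return None


# Toy page (assumed layout): two hand-written faculty entries under <body>.
body = Node("body")
record1 = body.add(Node("div"))
name1 = record1.add(Node("span", "Name A"))
email1 = record1.add(Node("span", "a@example.edu"))
record2 = body.add(Node("div"))
name2 = record2.add(Node("span", "Name B"))

# Fields of the same entry meet low in the tree; fields of different
# entries only meet at <body>.
same_entry = lowest_common_ancestor(name1, email1)
different_entries = lowest_common_ancestor(name1, name2)
```

In this toy tree, `same_entry` is `record1` and `different_entries` is `body`. This is the intuition the abstract points to when it links LCA to the strength of semantic relatedness: text nodes whose LCA is deeper are more likely to belong together.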