Extracting Links for Volumes' Issue and Table of Contents Based on Web Page Segmentation and Link Features  Cited by: 1

Original title (Chinese): 基于网页分块和链接特征的卷期目录链接提取方法

Authors: Yu Hongtao (于洪涛), Wang Dongqing (王冬青)[1], Zhang Fuzhi (张付志)[1]

Affiliation: [1] School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004

Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), 2012, No. 7, pp. 686-693 (8 pages)

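Funding: Special Research Project on Rapid Sharing of Scientific Papers in the Network Era, Science and Technology Development Center of the Ministry of Education (20101333110013, 2011109); Natural Science Foundation of Hebei Province (F2011203219).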

Abstract: Traditional information extraction methods achieve low precision when extracting the links in a journal's volume and issue table of contents. To address this problem, this paper proposes an extraction method based on Web page segmentation and link features. First, taking the layout tags of the page's tag tree as the minimum granularity, an atomic page segmentation algorithm splits the page into several content blocks that are mutually independent and do not contain one another. Second, an atomic content block clustering algorithm, based on the subtree structure of the content blocks, merges similar content blocks to partition the page into semantic blocks. Finally, a link block identification algorithm combines link text similarity with a Bayes-based semantic analysis method to identify the volume and issue table-of-contents link area, from which the links are extracted. Experimental results show that the proposed method can effectively extract volume and issue table-of-contents links.
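
The abstract gives no implementation details, so the following is only a rough illustrative sketch of two of the ideas it names: grouping links by their innermost layout tag, and scoring each group by how similar its link texts are. The set of layout tags, the digit normalization, the use of `difflib.SequenceMatcher` as the similarity measure, and all function and class names are assumptions made for illustration, not the authors' algorithm. The clustering step and the Bayes-based semantic analysis are omitted entirely.

```python
import re
from difflib import SequenceMatcher
from html.parser import HTMLParser
from itertools import combinations

# Assumption: which tags count as "layout tags" is not stated in the abstract.
LAYOUT_TAGS = {"div", "table", "td", "ul", "ol"}


class AtomicBlockParser(HTMLParser):
    """Assign each link text to its innermost enclosing layout tag.

    Assumes reasonably well-nested HTML; this is a sketch, not a robust parser.
    """

    def __init__(self):
        super().__init__()
        self.stack = []    # ids of currently open layout blocks
        self.blocks = {}   # block id -> list of link texts
        self.next_id = 0
        self.in_link = False
        self.buf = []

    def handle_starttag(self, tag, attrs):
        if tag in LAYOUT_TAGS:
            self.stack.append(self.next_id)
            self.blocks[self.next_id] = []
            self.next_id += 1
        elif tag == "a":
            self.in_link = True
            self.buf = []

    def handle_endtag(self, tag):
        if tag in LAYOUT_TAGS and self.stack:
            self.stack.pop()
        elif tag == "a" and self.in_link:
            self.in_link = False
            text = "".join(self.buf).strip()
            if text and self.stack:
                self.blocks[self.stack[-1]].append(text)

    def handle_data(self, data):
        if self.in_link:
            self.buf.append(data)


def normalize(text):
    # Collapse digit runs so that "2012 No. 7" and "2012 No. 6" compare as equal.
    return re.sub(r"\d+", "0", text)


def block_similarity(texts):
    """Mean pairwise similarity of normalized link texts (0.0 for fewer than 2 links)."""
    if len(texts) < 2:
        return 0.0
    pairs = list(combinations([normalize(t) for t in texts], 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


def best_link_block(html):
    """Return (score, link_texts) for the block whose link texts are most alike."""
    parser = AtomicBlockParser()
    parser.feed(html)
    parser.close()
    scored = [(block_similarity(texts), texts) for texts in parser.blocks.values()]
    if not scored:
        return 0.0, []
    return max(scored, key=lambda item: item[0])
```

On a page where a navigation bar and an issue list sit in separate layout tags, the issue list would typically score higher because its link texts differ only in their numbers. A real system would need a further signal, such as the semantic analysis the abstract mentions, to tell issue lists apart from other uniform link lists like pagination.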

Keywords: Web page segmentation; link block; volume and issue table of contents; link extraction

Classification: TP393.092 [Automation and Computer Technology: Computer Application Technology]

 
