Affiliation: [1] School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004
Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), 2012, No. 7, pp. 686-693 (8 pages)
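Funding: Special research project on rapid sharing of scientific papers in the network era, Science and Technology Development Center of the Ministry of Education (20101333110013, 2011109); Natural Science Foundation of Hebei Province (F2011203219).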
Abstract: Traditional information extraction methods achieve low precision when extracting the links in a journal's volume/issue table of contents. To address this problem, this paper proposes a method for extracting these links based on Web page segmentation and link features. First, taking the layout tags of the page's tag tree as the minimum granularity, an atomic page segmentation algorithm splits the page into several independent, non-overlapping content blocks. Second, an atomic content block clustering algorithm, driven by the sub-tree structure of the content blocks, merges similar content blocks to partition the page into semantic blocks. Finally, a link block identification algorithm combines link text similarity with a Bayes-based semantic analysis method to identify the table-of-contents link area, from which the links are extracted. Experimental results show that the proposed method can effectively extract links from issues' tables of contents.
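The link-text similarity idea in the final step can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not say how similarity is measured, so the use of `difflib.SequenceMatcher`, the digit masking, the `0.8` threshold, and the names `normalize`, `block_similarity`, and `pick_issue_block` are all assumptions made for illustration, and the Bayes-based semantic analysis is omitted entirely.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    # Assumption: mask digit runs so "No. 7" and "No. 12" compare as equal.
    return re.sub(r"\d+", "#", text.strip().lower())


def block_similarity(link_texts):
    """Mean pairwise similarity of the normalized link texts in one block."""
    texts = [normalize(t) for t in link_texts]
    if len(texts) < 2:
        return 0.0
    pairs = list(combinations(texts, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


def pick_issue_block(blocks, threshold=0.8):
    """Return the index of the most self-similar block, or None if none passes."""
    scores = [block_similarity(b) for b in blocks]
    best = max(range(len(blocks)), key=scores.__getitem__, default=None)
    if best is None or scores[best] < threshold:
        return None
    return best


# Hypothetical content blocks, each given as the list of its link texts.
blocks = [
    ["Home", "About the journal", "Contact us"],
    ["Vol. 31 No. 7", "Vol. 31 No. 8", "Vol. 31 No. 12"],
]
print(pick_issue_block(blocks))
```

Volume and issue links tend to share one template and differ only in their numbers, which is why masking digits makes such a block stand out from navigation links in this sketch.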
CLC number: TP393.092 [Automation and Computer Technology: Computer Application Technology]