Research on text extraction algorithm based on structure similarity page clustering (基于结构相似网页聚类的正文提取算法研究)

Cited by: 2

Authors: WANG Haiyong [1]; FENG Zhaoxu; YANG Haibo; ZHANG Jindong (School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China)

Affiliation: [1] School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070

Source: Computer Engineering and Applications (《计算机工程与应用》), 2018, No. 11, pp. 122-127, 139 (7 pages)

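Funding: Natural Science Foundation of Gansu Province (No. 145RJZA086); Lanzhou Jiaotong University Science and Technology Support Fund (No. ZC2014003); Lanzhou Science and Technology Plan Project (No. 2013-3-79)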

Abstract: Web pages on the Internet are becoming increasingly diverse and complex, which makes information extraction more difficult. This paper proposes a text extraction algorithm based on clustering structurally similar Web pages. First, each "block" that makes up a page's front-end template is assigned a weight according to its contribution to the template. Second, the similarity of the corresponding blocks in two Web pages is computed, and the sum over all blocks of block similarity multiplied by block weight is taken as the similarity of the two pages. The algorithm takes into account the effect that large structural differences between pages have on text extraction: pages are clustered by their pairwise similarity, so that text extraction for the pages within one cluster is more accurate. Experimental results show that the method achieves higher accuracy and that every evaluation metric improves.
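The weighted-sum similarity and the clustering step described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the block names, the weight values in `BLOCK_WEIGHTS`, the representation of a block as a set of DOM tag paths, the Jaccard measure used for block similarity, and the single-linkage merge rule with a fixed `threshold` are all hypothetical choices made here, since the abstract does not specify them.

```python
from typing import Dict, List, Set

# Hypothetical block weights; the abstract does not give the actual values.
BLOCK_WEIGHTS: Dict[str, float] = {
    "header": 0.2, "nav": 0.2, "main": 0.5, "footer": 0.1,
}

# Assumed page model: block name -> set of DOM tag paths found in that block.
Page = Dict[str, Set[str]]


def block_similarity(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two tag-path sets (a stand-in measure)."""
    if not a and not b:
        return 1.0  # two empty blocks are treated as identical
    return len(a & b) / len(a | b)


def page_similarity(p: Page, q: Page,
                    weights: Dict[str, float] = BLOCK_WEIGHTS) -> float:
    """Sum over blocks of (block weight * block similarity)."""
    return sum(
        w * block_similarity(p.get(name, set()), q.get(name, set()))
        for name, w in weights.items()
    )


def cluster_pages(pages: List[Page], threshold: float) -> List[List[int]]:
    """Single-linkage agglomerative clustering over page indices.

    Two clusters are merged whenever any cross-cluster pair of pages has
    similarity >= threshold; merging repeats until no such pair remains.
    """
    clusters = [[i] for i in range(len(pages))]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(page_similarity(pages[a], pages[b]) >= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters
```

In this sketch, pages that end up in the same cluster share most of their template structure, so a single extraction rule could then be derived per cluster rather than per page.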

Keywords: text extraction; similarity; Document Object Model (DOM) tree; hierarchical clustering

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
