Authors: WANG Haiyong (王海涌)[1], FENG Zhaoxu (冯兆旭), YANG Haibo (杨海波), ZHANG Jindong (张津栋) (School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China)
Affiliation: [1] School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070
Source: Computer Engineering and Applications (《计算机工程与应用》), 2018, No. 11, pp. 122-127, 139 (7 pages)
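Funding: Natural Science Foundation of Gansu Province (No.145RJZA086); Lanzhou Jiaotong University Science and Technology Support Fund (No.ZC2014003); Lanzhou Science and Technology Plan Project (No.2013-3-79)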
Abstract: Web pages are becoming increasingly diverse and complex, which makes information extraction more difficult. To address this, the paper proposes a main-text extraction algorithm based on clustering structurally similar web pages. First, each "block" that makes up a page's front-end template is assigned a weight according to its contribution to the template. Second, the similarity of corresponding blocks in two pages is computed, and the sum over all blocks of block similarity multiplied by block weight is taken as the similarity of the two pages. The algorithm takes full account of the effect that structurally very different pages have on main-text extraction: pages are clustered by their pairwise similarity, so that extraction results for pages within the same cluster are more accurate. Experimental results show that the method achieves higher accuracy and that all evaluation metrics improve.
Keywords: main-text extraction; similarity; Document Object Model (DOM) tree; hierarchical clustering
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
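
The weighted page similarity and clustering step described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and not taken from the paper: the block names and weights in `BLOCK_WEIGHTS`, the use of Jaccard similarity over sets of tag paths as the per-block measure, the average-linkage merge rule, and the `threshold` value. Only the overall shape (page similarity as the sum of block weight times block similarity, followed by hierarchical clustering) comes from the record above.

```python
from itertools import combinations

# Hypothetical block weights; the abstract gives no actual values.
BLOCK_WEIGHTS = {"header": 0.2, "body": 0.6, "footer": 0.2}


def block_similarity(a, b):
    """Jaccard similarity of two sets of tag paths (an assumed measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def page_similarity(p, q, weights=BLOCK_WEIGHTS):
    """Sum over blocks of weight * similarity of the corresponding blocks."""
    return sum(
        w * block_similarity(p.get(name, set()), q.get(name, set()))
        for name, w in weights.items()
    )


def cluster_pages(pages, threshold=0.7):
    """Average-linkage agglomerative clustering, stopped at `threshold`.

    Returns a list of clusters, each a list of indices into `pages`.
    """
    clusters = [[i] for i in range(len(pages))]

    def linkage(c1, c2):
        sims = [page_similarity(pages[i], pages[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        # Pick the most similar pair of clusters (x < y by construction).
        x, y = max(
            combinations(range(len(clusters)), 2),
            key=lambda xy: linkage(clusters[xy[0]], clusters[xy[1]]),
        )
        if linkage(clusters[x], clusters[y]) < threshold:
            break  # no remaining pair is similar enough to merge
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy pages: each block is a set of made-up tag paths.
news_a = {
    "header": {"div/nav", "div/logo"},
    "body": {"div/article/h1", "div/article/p"},
    "footer": {"div/copyright"},
}
news_b = {
    "header": {"div/nav", "div/logo"},
    "body": {"div/article/h1", "div/article/p", "div/article/img"},
    "footer": {"div/copyright"},
}
forum = {
    "header": {"div/nav"},
    "body": {"ul/li/post", "ul/li/reply"},
    "footer": {"div/links"},
}

clusters = cluster_pages([news_a, news_b, forum])
```

With these toy inputs the two news-like pages share a cluster and the forum page stays on its own. A per-cluster extraction template could then be learned from each group, which is the motivation the abstract gives for clustering before extraction.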