Web content extraction based on improved content analysis algorithm  Cited by: 3


Authors: CHEN Ting-ting (陈婷婷); YAN Hua (严华) [1,2]; ZANG Jun (臧军)

Affiliations: [1] School of Electronics Information and Engineering, Sichuan University, Chengdu 610000, Sichuan, China; [2] Science and Technology on Electronic Information Control Laboratory, Chengdu 610000, Sichuan, China; [3] Jingmen Petroleum Transportation Office, Sinopec Pipeline Storage and Transportation Limited Company, Jingmen 448000, Hubei, China

Source: Computer Engineering and Design (《计算机工程与设计》), 2018, No. 4, pp. 1017-1021 (5 pages)

Funding: National Key Basic Research Development Program of China (973 Program), grant 2013CB328903-2

Abstract: The content analysis algorithm, namely the Readability algorithm, tends to lose parts of the main text, anchor text, and structured data (tables, lists) during content extraction. To address this, an improved web content extraction algorithm is proposed. Based on the structural characteristics of web page body text, the algorithm evaluates the text characteristics of non-p tag nodes on top of the original algorithm; it introduces the relative distance between nodes to filter out web page noise that has strong text characteristics; and it redefines the pruning scope to avoid over-pruning. Together, these changes mitigate the loss of in-body information in the Readability algorithm. Content extraction experiments were run on major Chinese blog, news, popular science, and professional websites. The results show that the proposed algorithm outperforms the Readability algorithm, with a content extraction accuracy above 95%.
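The abstract gives no formulas, so the following Python sketch only illustrates the general idea and should not be read as the authors' implementation. It assumes a simple Readability-style score (text length plus comma count), a hypothetical set of scored tags beyond `<p>`, and a hypothetical definition of relative distance as the gap in document order between a block and the top-scoring block, divided by the number of blocks. The names `BlockCollector`, `score`, and `extract`, and the thresholds `min_score` and `max_rel_dist`, are all illustrative assumptions.

```python
from html.parser import HTMLParser

# Tags treated as text-bearing blocks. Scoring tags other than <p> follows the
# abstract's idea; this particular tag set is an assumption.
BLOCK_TAGS = {"p", "div", "td", "li", "pre", "blockquote"}


class BlockCollector(HTMLParser):
    """Collect the text sitting directly inside each block tag, in document order."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # every block seen, ordered by its opening tag
        self.stack = []   # blocks that are currently open

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            block = {"tag": tag, "text": ""}
            self.stack.append(block)
            self.blocks.append(block)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            # Close the nearest open block with this tag, and anything nested in it.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.stack[i]["tag"] == tag:
                    del self.stack[i:]
                    break

    def handle_data(self, data):
        if self.stack:
            # Text belongs to the innermost open block (anchor text included).
            self.stack[-1]["text"] += data.strip()


def score(block):
    """Assumed text-characteristic score: text length plus number of commas."""
    text = block["text"]
    return len(text) + text.count(",") + text.count(",")


def extract(html, min_score=20, max_rel_dist=0.5):
    """Keep blocks that score well and lie close to the top-scoring block."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    blocks = [b for b in parser.blocks if b["text"]]
    if not blocks:
        return []
    scores = [score(b) for b in blocks]
    best = max(range(len(blocks)), key=scores.__getitem__)
    kept = []
    for i, block in enumerate(blocks):
        rel_dist = abs(i - best) / len(blocks)  # hypothetical relative distance
        if scores[i] >= min_score and rel_dist <= max_rel_dist:
            kept.append(block["text"])
    return kept
```

Under these assumptions, a long footer far from the main body is dropped even though it scores well on text length alone, while a table cell next to the body paragraphs is kept. The pruning-scope change mentioned in the abstract is not modeled here.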

Keywords: content analysis algorithm; Readability algorithm; data loss; relative node distance; content extraction

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
