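Affiliation: [1] Business College of Shanxi University (山西大学商务学院)
Source: Intelligent Computer and Applications (《智能计算机与应用》), 2017, No. 4, pp. 13-16, 20 (5 pages)
Funding: 2016 Research Fund of the Business College of Shanxi University (2016008)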
Abstract: This paper proposes an information extraction method for Web pages that combines tag paths with a row-block distribution function. The method first parses the Web page into a DOM tree, then prunes the tree using visual features and tag-filtering rules. Tag-path features are then used to roughly separate the page's main text from its noise content, and finally the row-block distribution function is applied to extract the main text. Experimental results show that the method effectively avoids both mistakenly deleting main text and failing to delete noise, so the extracted text is more accurate: precision reaches 91%, recall 95%, and the F-score 93%. The algorithm's accuracy on pages that contain many short texts still needs improvement.
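The final step named in the abstract, extraction by a row-block distribution function, can be illustrated roughly as follows. This record does not give the paper's actual function or parameters, so this is only a minimal sketch of the general idea: score each line by the amount of text in a small window of lines starting there, then keep the densest contiguous run. The names `block_lengths` and `extract_main_text`, and the parameters `block_size` and `threshold`, are hypothetical. The input is assumed to be page text already stripped of tags, one line per source line. The DOM pruning and tag-path steps are not shown.

```python
import re


def block_lengths(lines, block_size=3):
    """Row-block distribution: for each line index i, the number of
    non-whitespace characters in lines i .. i + block_size - 1."""
    sizes = [len(re.sub(r"\s+", "", ln)) for ln in lines]
    return [sum(sizes[i:i + block_size]) for i in range(len(sizes))]


def extract_main_text(lines, block_size=3, threshold=40):
    """Return the non-blank lines of the contiguous run of blocks whose
    lengths stay at or above `threshold` and whose total is largest.
    `threshold` is an assumed tuning value and should be positive."""
    dist = block_lengths(lines, block_size)
    best_score, best_start, best_end = 0, 0, 0
    start = None
    # A trailing 0 acts as a sentinel so a run reaching the last line is closed.
    for i, d in enumerate(dist + [0]):
        if d >= threshold and start is None:
            start = i                      # a dense run begins here
        elif d < threshold and start is not None:
            score = sum(dist[start:i])     # total density of the finished run
            if score > best_score:
                best_score, best_start, best_end = score, start, i
            start = None
    return [ln for ln in lines[best_start:best_end] if ln.strip()]
```

Because each block looks ahead `block_size` lines, short noise lines directly adjacent to the body can be pulled into the run, and a body made of many short lines may never reach the threshold. This is in line with the limitation on short-text pages noted in the abstract.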
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]