Automatic generation technology of wrapper based on DOM tree abstraction

Original title: 基于DOM树抽象的包装器自动生成技术

Authors: ZHANG Jiajun (张佳俊); WANG Yizhou (王一洲); CHEN Xing (陈星) [1,2]; ZHANG Ying (张颖)

Affiliations: [1] College of Mathematics and Computer Science, Fuzhou University, Fuzhou, Fujian 350108, China; [2] Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou, Fujian 350108, China; [3] National Engineering Research Center of Software Engineering, Peking University, Beijing 100871, China

Source: Journal of Computer Applications (《计算机应用》), 2018, Issue A01, pp. 150-154, 182 (6 pages in total)

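Funding: National Key R&D Program of China (2017YFB1002000); National Natural Science Foundation of China (61402111); Haixi Government Big Data Application Collaborative Innovation Center project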

Abstract: Traditional wrappers are defined by hand, and a separate wrapper must be written for each type of Web page, so maintaining wrappers is costly: once a page's style changes, its wrapper has to be redefined. To address the need to define and maintain wrappers manually, and the accuracy of existing methods, which still leaves room for improvement, this paper presents a feasible technique for automatic wrapper generation based on DOM tree abstraction. The technique consists of two parts: first, DOM tree abstraction for pages of the target type; second, target node locating and wrapper generation. It can generate wrappers automatically for many types of Web pages. Experiments were conducted on mainstream shopping websites (Jingdong, Amazon, Suning, Dangdang) and a mainstream book information website (Douban Books). The results show that the method's average precision and recall reach 96% and 99%, respectively.
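The abstract names the two stages but does not spell out the algorithm, so the following is only a minimal sketch of the general idea under illustrative assumptions, not the authors' method. It assumes that "abstraction" means keeping the indexed tag paths shared by all sample pages of one type, and that a "wrapper" is simply the shared path of a node whose text matches a labelled example value. The function names (`node_paths`, `abstract_tree`, `generate_wrapper`, `apply_wrapper`) and the sample pages are hypothetical, and the parser used here only accepts well-formed markup.

```python
import xml.etree.ElementTree as ET


def node_paths(root):
    """Map indexed tag paths such as 'html/body[1]/div[1]' to their elements."""
    paths = {}

    def walk(elem, path):
        paths[path] = elem
        counts = {}
        for child in elem:
            # Index siblings per tag so that repeated tags get distinct paths.
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, root.tag)
    return paths


def abstract_tree(pages):
    """Assumed abstraction step: keep only paths present in every sample page."""
    path_sets = [set(node_paths(ET.fromstring(page))) for page in pages]
    return set.intersection(*path_sets)


def generate_wrapper(sample_page, target_text, skeleton):
    """Assumed locating step: find the shared path whose text is the labelled value."""
    for path, elem in node_paths(ET.fromstring(sample_page)).items():
        if path in skeleton and (elem.text or "").strip() == target_text:
            return path
    return None


def apply_wrapper(page, wrapper):
    """Extract the text at the wrapper path from a new page of the same type."""
    elem = node_paths(ET.fromstring(page)).get(wrapper)
    if elem is None or elem.text is None:
        return None
    return elem.text.strip()


# Two hypothetical product pages of the same type; PAGE_A carries an extra ad node.
PAGE_A = "<html><body><div><h1>Book A</h1><span>10.00</span></div><p>ad</p></body></html>"
PAGE_B = "<html><body><div><h1>Book B</h1><span>25.50</span></div></body></html>"

skeleton = abstract_tree([PAGE_A, PAGE_B])
price_wrapper = generate_wrapper(PAGE_A, "10.00", skeleton)
print(price_wrapper)
print(apply_wrapper(PAGE_B, price_wrapper))
```

In this sketch the ad paragraph in `PAGE_A` drops out of the skeleton because `PAGE_B` lacks it, and the wrapper learned from one labelled price on `PAGE_A` is reused on `PAGE_B`. Real pages would need a lenient HTML parser and a more tolerant matching rule than exact path equality.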

Keywords: DOM; abstraction; information extraction; wrapper; automatic generation

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
