Research on a Web Page Duplicate-Detection Technique Based on Web Page Fingerprints  Cited by: 2

Research on NLP-based Page Fingerprint Seek Algorithm


Author: Wang Xijie (王希杰)[1]

Affiliation: [1] Anyang Normal University, Anyang, Henan 455000

Source: Computer Simulation (《计算机仿真》), 2011, No. 9, pp. 154-157 (4 pages)

Abstract: This paper studies the problem of detecting duplicate web pages. The traditional SCAM duplicate-detection algorithm decides whether two pages are duplicates by comparing how often a few keywords occur in each page. When a website contains similar pages, their keywords are very close, which leads to misjudgments and low detection accuracy. This paper proposes a duplicate-detection algorithm based on web page fingerprints: information retrieval techniques are used to extract the fingerprint of the page under test, and that fingerprint is then compared with the fingerprints stored in the page library to decide whether the page is a duplicate. This avoids the low accuracy that results from relying on only a few keywords. Experiments show that the fingerprint-based method accurately determines whether a page is a duplicate, improves the accuracy of web page information, and achieves satisfactory results.

Keywords: web page duplicate detection; keywords; web page fingerprint

Classification No.: TP391.3 [Automation and Computer Technology: Computer Application Technology]
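
The abstract describes a two-step flow: extract a fingerprint from the page under test, then compare it against the fingerprints in a page library. The abstract does not say how the fingerprint is built, so the sketch below substitutes a common stand-in and should not be read as the paper's method: the fingerprint is assumed to be a set of hashed word shingles, and two pages are compared by Jaccard similarity. The names `fingerprint`, `similarity`, `find_duplicate`, the shingle size `k`, and the `threshold` value are all hypothetical.

```python
import hashlib
import re


def fingerprint(text, k=3):
    """Return a set of 64-bit hashes, one per k-word shingle.

    Assumption: shingle hashing stands in for the paper's (unspecified)
    fingerprint extraction step.
    """
    words = re.findall(r"\w+", text.lower())
    if not words:
        return frozenset()
    if len(words) < k:
        shingles = [" ".join(words)]
    else:
        shingles = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    return frozenset(
        int.from_bytes(
            hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest(), "big"
        )
        for s in shingles
    )


def similarity(a, b):
    """Jaccard similarity of two fingerprints (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_duplicate(page_text, library, threshold=0.8):
    """Return the id of the best-matching library page, or None.

    `library` maps page ids to stored fingerprints; `threshold` is an
    assumed cut-off, not a value taken from the paper.
    """
    fp = fingerprint(page_text)
    best_id, best_score = None, threshold
    for page_id, stored_fp in library.items():
        score = similarity(fp, stored_fp)
        if score >= best_score:
            best_id, best_score = page_id, score
    return best_id


library = {"a": fingerprint("cheap flights to paris book now")}
print(find_duplicate("Cheap flights to Paris, book now!", library))
print(find_duplicate("now book cheap paris to flights", library))
```

The second query contains the same keywords with the same counts as the stored page but in a different order, so a pure keyword-count comparison would treat the two as identical, while the shingle-based fingerprint shares nothing with the stored one and the page is not reported as a duplicate.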

 
