Research on Deduplication Methods for Near-Mirror Web Pages

Research on Finding Near-replicas of Documents on the Web


Authors: Jian Chen#; Youqun Shi [1]; Ran Tao (College of Computer Science and Technology, Donghua University, Shanghai 201620, China) (#Email: jay-ch@126.com)

Affiliation: [1] College of Computer Science and Technology, Donghua University, Shanghai 201600

Source: Electrical Engineering and Automation (Chinese and English edition), 2016, No. 2, pp. 56-61 (6 pages)

Funding: Supported by a high-tech field project of the Shanghai "Science and Technology Innovation Action Plan" (Project No. 16511100903).

Abstract: The large number of near-mirror (near-duplicate) web pages on the Internet has become the biggest obstacle to obtaining useful information quickly. Researchers have proposed a variety of web page deduplication algorithms to address this problem, but the resistance of these algorithms to web page noise is unsatisfactory. To address this, the paper proposes a Simhash-based near-mirror web page deduplication algorithm built on long-sentence extraction: it extracts the long sentences in a document to sidestep web page noise, weakening the adverse effect of noise on the algorithm. Deduplication experiments on web pages from the Internet show that the improved algorithm effectively reduces the influence of noise and achieves high accuracy and recall.
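The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the sentence-length threshold, the feature weighting, or the hash function, so the threshold `min_len=25`, the character 3-gram features, the MD5-derived 64-bit hash, and the function names `long_sentences`, `simhash`, `fingerprint`, and `hamming` are all assumptions made for illustration.

```python
import hashlib
import re


def long_sentences(text, min_len=25):
    """Split text on sentence punctuation and line breaks; keep long sentences.

    Assumption: short fragments (navigation bars, footers) are treated as noise.
    """
    parts = re.split(r"[。！？；!?;.\n]+", text)
    return [p.strip() for p in parts if len(p.strip()) >= min_len]


def simhash(features, bits=64):
    """Compute a Simhash fingerprint from an iterable of string features."""
    weights = [0] * bits
    for feat in features:
        digest = hashlib.md5(feat.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i, w in enumerate(weights):
        if w > 0:
            fp |= 1 << i
    return fp


def fingerprint(text, min_len=25, n=3):
    """Fingerprint a page using character n-grams of its long sentences only."""
    feats = []
    for sentence in long_sentences(text, min_len):
        feats.extend(sentence[i:i + n] for i in range(len(sentence) - n + 1))
    return simhash(feats)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Two hypothetical pages: same body text, different navigation and footer noise.
body = ("Simhash maps a document to a compact fingerprint for comparison. "
        "Pages whose fingerprints differ in only a few bits are near-duplicates.")
page_a = "Home | News | Login\n" + body + "\nCopyright 2016"
page_b = "Sign up | About\n" + body + "\nContact us"

distance = hamming(fingerprint(page_a), fingerprint(page_b))
```

Because the short navigation and footer lines fall below the length threshold, both pages reduce to the same long sentences and their fingerprints coincide. In practice, two pages would be judged near-duplicates when the Hamming distance falls under some small threshold, which is likewise a tuning choice not stated in the abstract.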

Keywords: near-mirror web pages; Simhash; long-sentence extraction; noise avoidance

Classification number: TP393.092 [Automation and Computer Technology: Computer Application Technology]

 
