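Authors: Jian Chen# (陈剑); Youqun Shi (史有群)[1]; Ran Tao (陶然) (College of Computer Science and Technology, Donghua University, Shanghai 201620, China). #Email: jay-ch@126.com
Affiliation: [1] College of Computer Science and Technology, Donghua University, Shanghai 201600
Source: Electrical Engineering and Automation (《电气工程与自动化(中英文版)》, Chinese-English edition), 2016, No. 2, pp. 56-61 (6 pages)
Funding: Supported by a High-Tech Field project of the Shanghai "Science and Technology Innovation Action Plan" (Project No. 16511100903).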
Abstract: The large number of near-duplicate (approximate mirror) web pages on the Internet has become the biggest obstacle to retrieving useful information quickly. Researchers have proposed a variety of web page deduplication algorithms to address this problem, but these algorithms do not resist web page noise satisfactorily. This paper proposes a deduplication algorithm for near-duplicate web pages based on Simhash with long-sentence extraction: it extracts the long sentences from a document to avoid web page noise, which weakens the adverse effect of noise on the algorithm. Deduplication experiments on web pages from the Internet show that the improved algorithm effectively weakens the influence of noise and achieves high accuracy and recall.
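Keywords: near-duplicate (approximate mirror) web pages; Simhash; long-sentence extraction; noise avoidance
Classification No.: TP393.092 [Automation and Computer Technology: Computer Application Technology]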
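
The idea summarized in the abstract, keeping only long sentences before computing a Simhash fingerprint, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no parameters, so the sentence-length threshold `min_len`, the character 4-gram features, the MD5-based 64-bit hash, and the Hamming-distance threshold `max_distance` are all assumptions chosen for illustration, and every function name here is hypothetical.

```python
import hashlib
import re


def long_sentences(text, min_len=20):
    """Split text on sentence punctuation and line breaks; keep long sentences only.

    min_len is an assumed threshold, not a value taken from the paper.
    """
    parts = re.split(r"[。!?;.!?;\n]+", text)
    return [p.strip() for p in parts if len(p.strip()) >= min_len]


def shingles(sentence, k=4):
    """Character k-grams of a sentence (an assumed, language-agnostic feature choice)."""
    return [sentence[i:i + k] for i in range(max(1, len(sentence) - k + 1))]


def simhash(features, bits=64):
    """Standard Simhash: each feature votes +1 or -1 on every bit position."""
    votes = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def fingerprint(text, min_len=20):
    """Simhash fingerprint computed from the long sentences of a page only."""
    feats = [g for s in long_sentences(text, min_len) for g in shingles(s)]
    return simhash(feats)


def is_near_duplicate(page_a, page_b, min_len=20, max_distance=3):
    """Treat two pages as near-duplicates if their fingerprints are close.

    max_distance is an assumed threshold for illustration.
    """
    distance = hamming(fingerprint(page_a, min_len), fingerprint(page_b, min_len))
    return distance <= max_distance
```

In this sketch, short fragments such as navigation labels or copyright lines fall below `min_len` and never reach the hash, so two pages that share the same body text but differ in such fragments receive identical fingerprints. Real pages would also need HTML stripping before sentence splitting, which is omitted here.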