Authors: Xie Zhihao (谢志豪); Yang Xian (杨贤) (Mechanical and Electrical Engineering, Guangdong University of Technology, Guangzhou 510006, China)
Source: Modern Computer (《现代计算机》), 2024, No. 17, pp. 1-6, 12 (7 pages in total)
Funding: General Project of the Guangdong Province Philosophy and Social Sciences "13th Five-Year Plan" (GD20CTS07).
Abstract: To reduce the impact of websites on users while improving duplicate-removal capability, a novel deduplication scheme applicable to large websites is designed. First, text preprocessing techniques extract keywords and long-sentence feature codes from the main content of a web page. Second, the Simhash algorithm maps the feature codes to fingerprints, and an inverted index from keywords to documents is built. Finally, the keywords are used to quickly locate documents that are highly similar to the document under test, so that only the fingerprints of the test document and those similar documents need to be compared to decide whether the web page is a duplicate. The results show that the algorithm has a high recognition rate and good practicality.
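The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 64-bit fingerprint width, the use of MD5 as the per-feature hash, the Hamming-distance threshold of 3, and the names `simhash`, `hamming`, and `DedupIndex` are all assumptions made for illustration, and keyword and long-sentence extraction is assumed to have been done already.

```python
import hashlib
from collections import defaultdict


def simhash(features, bits=64):
    """Map a list of feature strings to a Simhash fingerprint (unweighted)."""
    weights = [0] * bits
    for feat in features:
        # Assumption: MD5 truncated to 64 bits serves as the per-feature hash.
        h = int.from_bytes(hashlib.md5(feat.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


class DedupIndex:
    """Hypothetical helper: inverted index (keyword -> doc ids) plus fingerprints."""

    def __init__(self, max_distance=3):
        self.max_distance = max_distance      # assumed duplicate threshold
        self.postings = defaultdict(set)      # keyword -> set of doc ids
        self.fingerprints = {}                # doc id -> Simhash fingerprint

    def add(self, doc_id, keywords, features):
        self.fingerprints[doc_id] = simhash(features)
        for kw in keywords:
            self.postings[kw].add(doc_id)

    def find_duplicates(self, keywords, features):
        """Compare fingerprints only against documents sharing a keyword."""
        fp = simhash(features)
        candidates = set()
        for kw in keywords:
            candidates |= self.postings.get(kw, set())
        return sorted(
            d for d in candidates
            if hamming(fp, self.fingerprints[d]) <= self.max_distance
        )
```

The inverted index narrows the comparison to documents that share at least one keyword with the page under test, so the fingerprint check runs over a small candidate set rather than the whole collection.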
Classification: TP393.092 [Automation and Computer Technology: Computer Application Technology]