A Web page de-duplication algorithm based on Fourier transform (Cited by: 2)

Finding near replicas of Web pages based on Fourier transform


Authors: 陈锦言 (Chen Jinyan) [1], 孙济洲 (Sun Jizhou) [1], 张亚平 (Zhang Yaping) [1]

Affiliation: [1] School of Computer Science and Technology, Tianjin University, Tianjin 300072, China

Source: Journal of Computer Applications, 2008, No. 4, pp. 948-950 (3 pages)

Abstract: Removing duplicate Web pages improves search-engine accuracy and reduces data storage space. Current text de-duplication algorithms are mainly based on keywords or semantic fingerprints, and both are prone to misjudgment when applied to Web pages. In the proposed method, a Karhunen-Loeve (K-L) expansion of the character relationship matrix maps each character to a numeric value, so that every page becomes a sequence of discrete values. A discrete Fourier transform of this sequence yields a vector of Fourier coefficients for each page, and the similarity between two pages is judged by comparing their coefficient vectors. Experimental results show that the method removes duplicate Web pages effectively.
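The abstract outlines a three-step pipeline: map each character to a numeric value via a K-L expansion of a character relationship matrix, apply a discrete Fourier transform to the resulting value sequence, and compare pages by their low-order Fourier coefficients. The sketch below is only an illustration of that pipeline, not the authors' implementation: it assumes the relationship matrix can be approximated by a character co-occurrence matrix, realizes the K-L expansion as a projection onto the principal eigenvector, and measures similarity as the Euclidean distance between truncated coefficient magnitudes. The helper names (char_values, fourier_signature, similarity) and parameters (window, n_coeff) are hypothetical.

```python
# Illustrative sketch of the de-duplication pipeline described in the abstract,
# NOT the paper's reference implementation. Assumptions: the "character
# relationship matrix" is approximated by a co-occurrence matrix, and the K-L
# expansion is realized as a projection onto its principal eigenvector.
import numpy as np

def char_values(docs, window=2):
    """Map each character to a scalar via a K-L-style expansion of a
    co-occurrence ("relationship") matrix built from the corpus."""
    chars = sorted({c for d in docs for c in d})
    idx = {c: i for i, c in enumerate(chars)}
    m = np.zeros((len(chars), len(chars)))
    for d in docs:
        for i, c in enumerate(d):
            for j in range(max(0, i - window), min(len(d), i + window + 1)):
                if i != j:
                    m[idx[c], idx[d[j]]] += 1.0
    # Principal eigenvector of the symmetric relationship matrix gives a
    # one-dimensional projection: one numeric value per character.
    _, vecs = np.linalg.eigh(m)
    principal = vecs[:, -1]
    return {c: principal[idx[c]] for c in chars}

def fourier_signature(doc, mapping, n_coeff=16):
    """Turn a page into a fixed-length vector of low-order DFT magnitudes."""
    seq = np.array([mapping.get(c, 0.0) for c in doc])
    spectrum = np.fft.fft(seq)
    return np.abs(spectrum[:n_coeff])  # keep only the first n_coeff coefficients

def similarity(sig_a, sig_b):
    """Smaller distance between signatures means more similar pages."""
    n = min(len(sig_a), len(sig_b))
    return np.linalg.norm(sig_a[:n] - sig_b[:n])

# Usage: near-duplicate pages should give a much smaller distance than unrelated ones.
docs = ["web page deduplication test page",
        "web page deduplication test page!",
        "a completely different document about cooking"]
mapping = char_values(docs)
sigs = [fourier_signature(d, mapping) for d in docs]
print(similarity(sigs[0], sigs[1]), similarity(sigs[0], sigs[2]))
```

Truncating each page to its first n_coeff Fourier coefficients is what provides the dimensionality reduction mentioned in the keywords: two pages are compared through short fixed-length vectors rather than their full character sequences.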

Keywords: Web page de-duplication; K-L expansion; Fourier transform; dimensionality reduction

Classification codes: TP391.1 [Automation and Computer Technology / Computer Application Technology]; TP393.09 [Automation and Computer Technology / Computer Science and Technology]

 
