A Web Page Deduplication Algorithm Based on Logical Paragraphs of Page Body Text and Long-Sentence Extraction  (Cited by: 1)

Detection and Elimination of Similar Web Pages Based on Logical Paragraphs and Extraction of Long Sentences

Authors: Zhang Xiaodi[1], Song Yuqing[1]

Affiliation: [1] Institute of Scientific and Technical Information, Jiangsu University, Zhenjiang 212013

Source: Library and Information Studies (《图书情报研究》), 2012, No. 2, pp. 41-45 (5 pages)

Abstract: Web page deduplication is an effective way to improve the quality of web retrieval. To address the shortcomings of existing deduplication algorithms and to make use of the structural features of web page body text, this paper proposes a deduplication algorithm based on logical paragraphs of the body text and long-sentence extraction. Using the user's retrieval keywords, the method maps the physical paragraph structure of the body text onto logical paragraphs, and then extracts the long sentences in those logical paragraphs as the page's feature code for judging whether two pages are similar. Experiments show that the method improves deduplication of short mirror web pages and near-mirror web pages.
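The abstract names the pipeline stages but gives no implementation details, so the following Python sketch is only one possible reading of it, not the authors' algorithm. Everything specific here is an assumption made for illustration: the grouping rule (a physical paragraph containing a keyword opens a new logical paragraph), taking the single longest sentence per logical paragraph, using `difflib.SequenceMatcher` as the sentence-similarity measure, the 0.8 threshold, and all function names.

```python
import re
from difflib import SequenceMatcher

# Simplification: split on common Chinese/English sentence terminators and
# newlines. This will also split on periods in abbreviations or decimals.
SENTENCE_END = re.compile(r"[。！？!?.\n]+")


def logical_paragraphs(text, keywords):
    """Group physical paragraphs (one per line) into logical paragraphs.

    Assumed rule: a paragraph containing any keyword starts a new logical
    paragraph; a paragraph without keywords is attached to the previous one.
    """
    groups = []
    for para in (p.strip() for p in text.split("\n")):
        if not para:
            continue
        if groups and not any(k in para for k in keywords):
            groups[-1] += "\n" + para
        else:
            groups.append(para)
    return groups


def long_sentences(paragraph, per_paragraph=1):
    """Return the longest sentence(s) of one logical paragraph."""
    sentences = [s.strip() for s in SENTENCE_END.split(paragraph) if s.strip()]
    return sorted(sentences, key=len, reverse=True)[:per_paragraph]


def feature_code(text, keywords, per_paragraph=1):
    """Feature code of a page: the long sentences of each logical paragraph."""
    return [
        s
        for p in logical_paragraphs(text, keywords)
        for s in long_sentences(p, per_paragraph)
    ]


def sentence_similarity(a, b):
    """Character-level similarity in [0, 1] (stand-in for the paper's measure)."""
    return SequenceMatcher(None, a, b).ratio()


def is_duplicate(text_a, text_b, keywords, threshold=0.8):
    """Judge two pages similar if their feature codes match closely on average."""
    code_a = feature_code(text_a, keywords)
    code_b = feature_code(text_b, keywords)
    if not code_a or not code_b:
        return False
    # Match every sentence of the shorter code to its best counterpart.
    shorter, longer = sorted((code_a, code_b), key=len)
    best = [max(sentence_similarity(s, t) for t in longer) for s in shorter]
    return sum(best) / len(best) >= threshold
```

A call such as `is_duplicate(page_a, page_b, ["deduplication"])` would then compare two body texts under one set of retrieval keywords. Because only a few long sentences per page are compared, differences in short boilerplate lines have little effect on the result, which is consistent with the abstract's emphasis on short mirror and near-mirror pages.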

Keywords: web page deduplication; logical paragraph; long-sentence extraction; sentence similarity

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
