Near Duplicated Web Pages Detection Based on Concept and Semantic Network  (Cited by: 15)

Authors: CAO Yu-Juan [1,2], NIU Zhen-Dong [1], ZHAO Kun [1], PENG Xue-Ping [1]

Affiliations: [1] School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China; [2] Beijing Aerospace Control Center, Beijing 100094, China

Source: Journal of Software (《软件学报》), 2011, No. 8, pp. 1816-1826 (11 pages)

Funding: National Natural Science Foundation of China (60803050, 60705022); Program for New Century Excellent Talents in University (NCET-06-0161)

Abstract: Search-engine result pages frequently present users with web pages of nearly identical content. To improve overall retrieval performance and user satisfaction, this paper proposes DWDCS (near-Duplicate Web pages Detection based on Concept and Semantic network), which improves the classic small-world-theory-based algorithm for extracting document keywords. First, document concepts are extracted and merged; this not only resolves the "expression difference" problem but also effectively reduces the complexity of the semantic network. The network is then analyzed through the geometric features of its structure, and the syntactic and structural information of the page is used to build feature vectors for computing document similarity. Because no corpus is required, the algorithm is inherently domain-independent. Experimental results show that, compared with the classic de-duplication algorithm I-Match and an algorithm relying solely on the lexical co-occurrence small-world model, DWDCS is highly robust to noise, achieving precision above 90% and recall above 85% in large-scale tests. Its linear time and space complexity and independence from any corpus make it effective in practical large-scale web page de-duplication.
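The pipeline the abstract outlines — build a co-occurrence network over document terms, pick out the structurally important nodes as key concepts, then compare documents by their concept vectors — can be illustrated with a minimal sketch. This is not the paper's actual DWDCS implementation: the window size, the use of plain degree as the structural importance measure, and binary concept vectors with cosine similarity are all simplifying assumptions standing in for the paper's concept merging and geometric network analysis.

```python
from collections import defaultdict
import math

def build_cooccurrence_graph(tokens, window=3):
    """Undirected co-occurrence network (small-world-style): an edge links
    two tokens appearing within `window` positions of each other."""
    graph = defaultdict(set)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[i] != tokens[j]:
                graph[tokens[i]].add(tokens[j])
                graph[tokens[j]].add(tokens[i])
    return graph

def key_concepts(graph, top_k=5):
    """Rank nodes by degree (a simple stand-in for the geometric/structural
    importance measures the paper analyzes) and keep the top_k."""
    return sorted(graph, key=lambda w: len(graph[w]), reverse=True)[:top_k]

def similarity(doc_a, doc_b, window=3, top_k=5):
    """Cosine similarity between the two documents' key-concept vectors."""
    ca = set(key_concepts(build_cooccurrence_graph(doc_a, window), top_k))
    cb = set(key_concepts(build_cooccurrence_graph(doc_b, window), top_k))
    vocab = sorted(ca | cb)
    va = [1 if w in ca else 0 for w in vocab]
    vb = [1 if w in cb else 0 for w in vocab]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(va)) * math.sqrt(sum(vb))
    return dot / norm if norm else 0.0
```

In a real near-duplicate detector the vectors would incorporate the syntactic and structural page features the abstract mentions, and a threshold on the similarity score would flag duplicate pairs; no corpus statistics are needed, which is what makes the approach domain-independent.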

Keywords: web page de-duplication algorithm; small-world network; near-duplicate web pages; mean square deviation

Classification Code: TP391 [Automation & Computer Technology — Computer Application Technology]

 
