Authors: Peng Tao (彭涛)[1,2], Meng Yu (孟宇)[3], Zuo Wanli (左万利)[1,2], Wang Ying (王英)[1,2], Hu Liang (胡亮)[1,2]
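Affiliations: [1] College of Computer Science and Technology, Jilin University, Changchun 130012; [2] Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education (Jilin University), Changchun 130012; [3] School of Civil and Environmental Engineering, University of Science and Technology Beijing, Beijing 100083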
Source: Journal of Computer Research and Development (《计算机研究与发展》), 2010, No. 4, pp. 628-637 (10 pages)
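Funding: National Natural Science Foundation of China (60903098; 60973040); Jilin Province Science and Technology Development Plan (20070533); Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (200801830021); Jilin University Basic Research Funds, Interdisciplinary and Innovation Project (200810025); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education (93K-17)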
Abstract: Because the Web environment is complex and Web pages often cover multiple topics, it is difficult to collect all pages relevant to a specific topic. An irrelevant page may link to a relevant one, so a topical crawler must traverse irrelevant pages to reach more relevant pages; this procedure is called tunneling. This paper presents research on tunneling techniques, together with a correction to previous results. Tunneling is divided into grey tunneling and black tunneling. For grey tunneling, a multi-topic Web page is split during crawling into a small number of content blocks that are processed individually, so that a topic-relevant block is not penalized because the page as a whole is irrelevant. For black tunneling, each irrelevant page in the tunnel is assigned a depth value according to the topical relevance of its parent page, and the page is kept or discarded according to that depth value, which broadens the region the topical crawler covers. Experimental results show that both methods achieve the expected effect, and the authors conclude that the approaches are effective, robust, and practical.
Keywords: topical crawling; grey tunneling; black tunneling; Web page segmentation; TARGET LENGTH
Classification No.: TP391 [Automation and Computer Technology: Computer Application Technology]
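
The two tunneling strategies summarized in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record gives neither the relevance function, the thresholds, nor the exact depth rule, so the term-overlap score, the `threshold` and `max_depth` values, the "parent depth + 1" rule, and all names (`Block`, `relevance`, `grey_tunnel_links`, `black_tunnel_depth`, `keep_page`) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One content block of a segmented Web page (hypothetical structure)."""
    text: str
    links: list[str]


def relevance(text: str, topic_terms: set[str]) -> float:
    """Toy relevance score: fraction of topic terms that occur in the text.

    Assumption: the paper's actual relevance measure is not given in the record.
    """
    if not topic_terms:
        return 0.0
    words = set(text.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def grey_tunnel_links(blocks: list[Block], topic_terms: set[str],
                      threshold: float = 0.5) -> list[str]:
    """Grey tunneling: score each block on its own, not the whole page.

    Links are collected only from blocks whose score reaches the
    (assumed) threshold, so a relevant block on an otherwise irrelevant
    page still contributes its links.
    """
    links: list[str] = []
    for block in blocks:
        if relevance(block.text, topic_terms) >= threshold:
            links.extend(block.links)
    return links


def black_tunnel_depth(page_is_relevant: bool, parent_depth: int) -> int:
    """Black tunneling: depth of a page inside a run of irrelevant pages.

    Assumed rule: a relevant page resets the depth to 0; an irrelevant
    page is one step deeper than its parent.
    """
    return 0 if page_is_relevant else parent_depth + 1


def keep_page(depth: int, max_depth: int = 2) -> bool:
    """Keep a page only while the tunnel is no deeper than max_depth (assumed)."""
    return depth <= max_depth
```

For example, with the topic terms `{"crawler", "tunneling"}`, a page made of one block about crawlers and one block of unrelated advertising yields only the first block's links, and an irrelevant page three hops away from the last relevant page is dropped under the assumed `max_depth` of 2.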