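Affiliation: [1] School of Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093
Source: Computer Engineering and Design (《计算机工程与设计》), 2008, No. 5, pp. 1088-1089, 1107 (3 pages)
Funding: Shanghai Municipal Youth Science Foundation for Higher Education Institutions (03SQ05)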
Abstract: In traditional weight-based web page filtering algorithms, keyword weights are mostly determined by term-frequency statistics. This approach does not express well how strongly a keyword characterizes a topic, and some websites can exploit it with anti-keyword-filtering strategies to evade detection. Building on the traditional approach, the paper sets up a weighted keyword matrix dictionary and, starting from association rules and using the same-category word definitions in a Chinese corpus, proposes an association filtering algorithm based on the mean weight of same-category words. Experimental results show that the algorithm filters more efficiently and copes well with the anti-keyword-filtering strategies of pornographic websites, with a particularly clear effect in separating pornographic web pages from medical ones.
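The abstract gives no formulas, so the following is only a minimal sketch of what scoring by the mean weight of same-category words could look like. It is not the paper's algorithm. The dictionary contents, the category names, the negative weights for the medical category, and the threshold are all hypothetical values chosen for illustration.

```python
from statistics import mean

# Hypothetical weighted keyword dictionary: category -> {keyword: weight}.
# Every keyword, weight, and category name here is an illustrative assumption.
WEIGHTED_DICT = {
    "cat_a": {"alpha": 0.9, "beta": 0.7},
    "cat_b": {"gamma": 0.8, "delta": 0.4},
    # Assumption: medical vocabulary lowers the score so that medical pages
    # sharing some keywords with blocked pages are less likely to be flagged.
    "medical": {"clinic": -0.6, "diagnosis": -0.8},
}


def category_means(tokens, dictionary=WEIGHTED_DICT):
    """Return, per category, the mean weight of the keywords found in tokens."""
    present = set(tokens)  # presence only, so repeating a word changes nothing
    means = {}
    for category, words in dictionary.items():
        matched = [weight for keyword, weight in words.items() if keyword in present]
        if matched:
            means[category] = mean(matched)
    return means


def page_score(tokens, dictionary=WEIGHTED_DICT):
    """Sum the per-category means into one page-level score."""
    return sum(category_means(tokens, dictionary).values())


def is_blocked(tokens, threshold=1.2, dictionary=WEIGHTED_DICT):
    """Flag a page whose score reaches the (hypothetical) threshold."""
    return page_score(tokens, dictionary) >= threshold
```

Because each category contributes a mean over distinct matched keywords rather than a raw count, padding a page with repeated or diluting words does not change its score, which is one plausible reading of why the abstract contrasts this approach with term-frequency weighting.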
Classification number: TP309 [Automation and Computer Technology: Computer Systems Architecture]