Research on an Objectionable-Webpage Filtering Algorithm Based on Weight Means  Cited by: 3

Study and realization of method to webpage filtrating based on weight equal value


Authors: Tang Jiangang, Wei Ran[1]

Affiliation: [1] College of Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093

Source: Computer Engineering and Design (《计算机工程与设计》), 2008, No. 5, pp. 1088-1089, 1107 (3 pages in total)

Funding: Shanghai Municipal Higher Education Youth Science Fund project (03SQ05)

Abstract: In traditional weight-based webpage filtering algorithms, keyword weights are mostly determined by word-frequency statistics. This approach does not express well how strongly a keyword characterizes a topic, and some websites can easily evade detection by using anti-keyword-filtering strategies. Building on the traditional approach, the paper sets up a weighted keyword matrix dictionary and, starting from association rules and applying the definition of same-class words from a Chinese corpus, proposes an association filtering algorithm based on the mean weight of same-class words. Experimental results show that the algorithm filters more efficiently and copes well with the anti-keyword-filtering strategies of pornographic websites, with a particularly clear effect in separating pornographic webpages from medical webpages.

Keywords: webpage filtering; keyword; matrix dictionary; association rule; weight mean

Classification number: TP309 [Automation and Computer Technology: Computer System Architecture]
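
The abstract gives no formulas, so the following is only a minimal sketch of how a "mean weight of same-class words" score might be computed. It is not the authors' algorithm. The dictionary layout, the placeholder keywords, the class names, the weights, the way class means are combined, and the 0.6 threshold are all assumptions chosen for illustration. Each distinct keyword is counted once, so repeating a word does not raise the score the way a word-frequency weight would.

```python
from collections import defaultdict

# Hypothetical weighted dictionary: keyword -> (same-class group, weight).
# All entries are illustrative placeholders, not data from the paper.
KEYWORD_DICT = {
    "explicit_a": ("explicit", 0.9),
    "explicit_b": ("explicit", 0.7),
    "solicit_a": ("solicitation", 0.8),
    "clinic_a": ("clinical", 0.1),
    "clinic_b": ("clinical", 0.2),
}


def class_mean_weights(tokens, dictionary):
    """Return the mean weight of the distinct matched keywords in each class."""
    matched = defaultdict(dict)
    for token in tokens:
        if token in dictionary:
            word_class, weight = dictionary[token]
            matched[word_class][token] = weight  # each keyword counts once
    return {
        word_class: sum(weights.values()) / len(weights)
        for word_class, weights in matched.items()
    }


def page_score(tokens, dictionary):
    """Average the per-class means over all matched classes (assumed rule)."""
    means = class_mean_weights(tokens, dictionary)
    if not means:
        return 0.0
    return sum(means.values()) / len(means)


def is_objectionable(tokens, dictionary, threshold=0.6):
    """Flag a page whose score reaches the (assumed) threshold."""
    return page_score(tokens, dictionary) >= threshold


if __name__ == "__main__":
    adult_page = ["explicit_a", "explicit_b", "solicit_a", "other"]
    medical_page = ["clinic_a", "clinic_b", "explicit_b", "clinic_a"]
    print(is_objectionable(adult_page, KEYWORD_DICT))
    print(is_objectionable(medical_page, KEYWORD_DICT))
```

In this toy setup, a page that matches mostly low-weight clinical terms is pulled below the threshold even if it contains one high-weight term, which is one plausible reading of how class means could help separate medical pages from pornographic ones.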

 
