AN EFFICIENT APPROACH TO COMMENT SPAM IDENTIFICATION  (Cited by: 1)


Authors: Yang Yuhang, Zhao Tiejun, Zheng Dequan, Yu Hao

Affiliation: MOE-MS Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology, Harbin 150001, China

Source: Journal of Electronics (China) [电子科学学刊(英文版)], 2009, No. 5, pp. 644-650 (7 pages)

Funding: Supported by the National Natural Science Foundation of China (No. 60736044, 60803094)

Abstract: This paper proposes a novel approach to comment spam identification based on content analysis. The approach uses three main features: the number of links, content repetitiveness, and text similarity. Content repetitiveness is determined by the length and frequency of the longest common substring, and text similarity is calculated using the vector space model. In preliminary experiments on Chinese and English comments, precision reached 93% and 82%, respectively. These results indicate the validity and language independence of the approach. Unlike conventional spam filtering approaches, the method requires no training, no rule sets, and no link relationships. It can handle new comments as well as existing ones.
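The three features named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no formulas, thresholds, or tokenization details, so the function names (`count_links`, `longest_common_substring`, `cosine_similarity`), the URL pattern, and the whitespace tokenizer are all assumptions made for illustration. Whitespace tokenization in particular would not suit Chinese text, where character n-grams or a word segmenter would be needed.

```python
import math
import re
from collections import Counter


def count_links(comment: str) -> int:
    """Count URLs in a comment (assumed pattern: http:// or https:// links)."""
    return len(re.findall(r"https?://\S+", comment))


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b.

    Classic dynamic programming: curr[j] holds the length of the common
    suffix of a[:i] and b[:j]. On ties, the first match found in a wins.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of term-frequency vectors (a basic vector space model).

    Assumption: lowercase whitespace tokens with raw counts as weights.
    """
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction in the vector space
    return dot / (norm_a * norm_b)
```

In a pipeline along these lines, a long substring that recurs across many comments would suggest repetitive content, and low similarity between a comment and the post it is attached to would suggest off-topic text. How the paper combines the three signals, and which texts it compares, is not stated in the abstract.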

Keywords: Comment spam; Automatic identification; Content analysis; Blog

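Classification codes: TP393.098 [Automation and Computer Technology: Computer Application Technology]; TP391 [Automation and Computer Technology: Computer Science and Technology]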

 
