Design of a Multi-modal Efficient Bad Web Page Clustering Algorithm Based on HDBSCAN


Authors: SHI Lei; DENG Guiying; ZHANG Heng[1]; LIU Yu; XIAO Jianfang[1] (China Internet Network Information Center, Beijing 100190, China)

Affiliation: [1] China Internet Network Information Center, Beijing 100190, China

Source: Microcomputer Applications (《微型电脑应用》), 2024, No. 6, pp. 242-246 (5 pages)

Abstract: Since the beginning of the 21st century, a large number of Web pages have been built on the Internet, providing people with all kinds of information. This has both accelerated information exchange and greatly reduced the cost of information circulation. At the same time, bad websites keep emerging in large numbers, yet the identification of bad Web pages still relies mostly on manual recognition, which cannot cope with their large-scale emergence. This paper therefore proposes a multi-modal, efficient bad Web page clustering algorithm based on HDBSCAN. HDBSCAN is first used to preliminarily cluster the images from bad Web pages. The preliminary clusters are then further classified and merged using multiple additional information elements, such as the text and the structure of the bad Web pages, so that similar Web pages are merged into one large, comprehensive image set. Experimental results show that, compared with HDBSCAN alone, the improved clustering algorithm improves clustering quality, yields better clustering results, and markedly improves the efficiency of handling bad websites.
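The abstract describes the second stage only in outline: preliminary image clusters are merged when the text and structure of their pages are similar. The sketch below is one possible reading of that step, not the authors' implementation. It assumes the initial labels already come from an HDBSCAN run (with -1 marking noise, as HDBSCAN implementations conventionally do), represents each page's text and structure as a hypothetical set of tokens, and uses Jaccard similarity with an illustrative threshold of 0.5. None of these choices is taken from the paper.

```python
from collections import defaultdict


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_clusters(labels, page_tokens, threshold=0.5):
    """Merge initial image clusters whose pooled page tokens are similar.

    labels      -- initial cluster label per page (-1 = noise)
    page_tokens -- set of text/structure tokens per page (hypothetical features)
    threshold   -- illustrative Jaccard cut-off, not a value from the paper
    """
    # Pool the tokens of all pages in each non-noise cluster.
    pooled = defaultdict(set)
    for label, tokens in zip(labels, page_tokens):
        if label != -1:
            pooled[label] |= tokens

    # Union-find over cluster ids; each cluster starts as its own root.
    parent = {c: c for c in pooled}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    # Link every pair of clusters whose pooled tokens are similar enough.
    ids = sorted(pooled)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if jaccard(pooled[a], pooled[b]) >= threshold:
                ra, rb = find(a), find(b)
                if ra != rb:
                    parent[max(ra, rb)] = min(ra, rb)

    # Noise points keep the label -1; others take their merged root id.
    return [find(label) if label != -1 else -1 for label in labels]


# Toy data: clusters 0 and 1 share most tokens, cluster 2 is unrelated.
labels = [0, 0, 1, 1, 2, -1]
page_tokens = [
    {"casino", "bonus", "div.banner"},
    {"casino", "bonus", "div.banner"},
    {"casino", "bonus", "div.banner", "login"},
    {"casino", "bonus", "login"},
    {"pharmacy", "pills", "table.list"},
    {"casino"},
]
merged = merge_clusters(labels, page_tokens)
print(merged)
```

In this toy input, clusters 0 and 1 have a pooled Jaccard similarity of 0.75 and are merged, while cluster 2 and the noise point are left unchanged. A real pipeline would likely replace the token sets with learned text and DOM-structure features and tune the threshold on labeled data.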

Keywords: HDBSCAN; multi-modal; bad Web pages; clustering

Classification code: TN91 [Electronics and Telecommunications: Communication and Information Systems]

 
