A Deep Web Tracker Detection Method with Coordinated Semantic and Co-Occurrence Features


Authors: YAN Jin (严瑾); DONG Kejun (董科军); LI Hongtao (李洪涛) [1]

Affiliation: [1] China Internet Network Information Center, Beijing 100190, China

Source: Frontiers of Data & Computing (《数据与计算发展前沿》), 2024, Issue 3, pp. 127-138 (12 pages)

Funding: National Key R&D Program of China, project "Key Information Analysis Technologies for Internet Infrastructure" (2022YFB3105003).

Abstract: [Objective] Web trackers embedded in websites collect user identifiers and browsing information during a user's visit. The collected information is used for purposes such as personalized recommendation services and website performance analysis. However, Web trackers may also leak the private information of Internet users. Letting users selectively turn Web tracking off or on is crucial to the healthy development of the Internet, and the automatic detection of Web trackers is the prerequisite and foundation for doing so. [Methods] By analyzing real-world data, this paper identifies important characteristics of Web trackers along two dimensions: the text semantics of URLs and their embedding association (i.e., co-occurrence). On this basis, it designs a deep-learning-based Web tracker detection method that fuses the association features and semantic features of URLs. Specifically, the method first constructs a bipartite graph of the embedding relationships between the websites that users visit directly and the URLs embedded in those websites, and extracts an embedding feature vector for each URL using the DeepWalk algorithm. Second, it extracts text semantic features of the URL strings using a pre-trained BERT model from the field of natural language processing. Finally, it uses an attention mechanism to aggregate the two types of features and a multi-layer perceptron to classify URLs and identify Web trackers. [Results] Experiments on real-world data show that, compared with existing methods, the proposed method improves detection accuracy, with an F1 score reaching 0.91. [Conclusions] The proposed deep-learning-based method relies only on tracker URLs and their embedding relationships in websites, achieves relatively high detection accuracy, and is easy to deploy.
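
The first and last steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It builds a site-to-URL bipartite graph from hypothetical `(site, embedded_url)` pairs, generates the truncated random walks that DeepWalk would pass to a skip-gram model (the skip-gram training itself is omitted), and fuses two feature vectors with a simple softmax attention. The function names, the example domains, the scaled dot-product form of the attention, and the placeholder vectors standing in for DeepWalk and BERT outputs are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def build_bipartite_graph(pairs):
    """Build an undirected graph linking each visited site to its embedded URLs."""
    graph = defaultdict(set)
    for site, url in pairs:
        site_node, url_node = ("site", site), ("url", url)
        graph[site_node].add(url_node)
        graph[url_node].add(site_node)
    return graph


def random_walks(graph, walk_length=6, walks_per_node=2, seed=0):
    """Generate truncated random walks (the node 'sentences' DeepWalk trains on)."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(graph):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                # Every node has at least one neighbor because nodes are
                # only created together with an edge.
                walk.append(rng.choice(sorted(graph[walk[-1]])))
            walks.append(walk)
    return walks


def attention_fuse(co_vec, sem_vec, query):
    """Fuse two equal-length vectors using softmax weights from dot-product scores."""
    scale = math.sqrt(len(query))
    scores = [sum(f * q for f, q in zip(vec, query)) / scale
              for vec in (co_vec, sem_vec)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    weights = [e / sum(exps) for e in exps]
    fused = [weights[0] * a + weights[1] * b for a, b in zip(co_vec, sem_vec)]
    return fused, weights


# Hypothetical embedding observations: (directly visited site, embedded URL).
pairs = [
    ("news.example", "https://tracker.example/pixel.gif"),
    ("shop.example", "https://tracker.example/pixel.gif"),
    ("shop.example", "https://cdn.example/app.js"),
]
graph = build_bipartite_graph(pairs)
walks = random_walks(graph)

# Placeholder vectors standing in for a DeepWalk embedding and a BERT embedding
# that have already been projected to the same dimension (an assumption).
co_vec = [0.2, -0.1, 0.4, 0.0]
sem_vec = [0.5, 0.3, -0.2, 0.1]
query = [1.0, 0.0, 0.5, -0.5]  # would be a learned parameter in a real model
fused, weights = attention_fuse(co_vec, sem_vec, query)
```

In a full pipeline, the walks would be used to train skip-gram embeddings, the semantic vector would come from a pre-trained BERT encoder applied to the URL string, and the fused vector would be passed to a multi-layer perceptron classifier.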

Keywords: Web tracker detection; association features; deep learning; pre-trained model; Domain Name System

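CLC number: TP393.09 [Automation and Computer Technology: Computer Application Technology]; TP18 [Automation and Computer Technology: Computer Science and Technology]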

 
