Authors: ZHANG Wen; SHENG Yingyi; ZHANG Xiaoqing; MENG Shengxiang; ZHOU Bei [1]; SHEN Jian [1]
Affiliation: [1] School of Computer Science and Engineering, Changshu Institute of Technology, Changshu 215500, Jiangsu, China
Source: Journal of Changshu Institute of Technology (《常熟理工学院学报》), 2022, No. 5, pp. 33-36 (4 pages)
Abstract: Leakage of sensitive personal information is one of the most frequent types of network security incident. It may endanger personal and property safety and damage citizens' reputation and health. This paper uses a web crawler to obtain web page content and attachments, extracts their body text, and builds full-text indexing and querying on ElasticSearch to detect sensitive personal information. Taking mobile phone numbers as an example, query efficiency was tested with different tokenizers and query methods. The tests show that building the full-text index with a custom tokenizer and querying with regular expressions is the most efficient of the tested approaches to detecting sensitive personal information.
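Keywords: web crawler; ElasticSearch; sensitive personal information leakage
Classification: TP399 [Automation and Computer Technology: Computer Application Technology]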
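
The abstract's approach (index with a custom tokenizer, then query with a regular expression) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the tokenizer and analyzer names, the `content` field, and the phone-number pattern are all assumptions, since the abstract does not give them. The request bodies use ElasticSearch's `pattern` tokenizer and `regexp` query; the helper function imitates the same matching locally so the idea can be tried without a cluster.

```python
import re

# Assumption: a mainland-China mobile number is 11 digits, starting with 1
# followed by a digit from 3 to 9. Illustrative only; not taken from the paper.
PHONE_PATTERN = "1[3-9][0-9]{9}"

# Hypothetical index settings. The `pattern` tokenizer with group 0 emits each
# whole match of the pattern as one token, so every run of digits (and hence a
# complete phone number) is indexed as a single term.
INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "digit_run_tokenizer": {
                    "type": "pattern",
                    "pattern": "[0-9]+",
                    "group": 0,
                }
            },
            "analyzer": {
                "digit_run_analyzer": {
                    "type": "custom",
                    "tokenizer": "digit_run_tokenizer",
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "content": {"type": "text", "analyzer": "digit_run_analyzer"}
        }
    },
}

# Regexp query body for the hypothetical `content` field. The regexp query is
# matched against whole indexed terms, which is why the tokenizer matters.
REGEXP_QUERY = {"query": {"regexp": {"content": {"value": PHONE_PATTERN}}}}


def find_phone_numbers(text: str) -> list[str]:
    """Local imitation: split text into digit runs, keep full-pattern matches."""
    tokens = re.findall(r"[0-9]+", text)
    return [token for token in tokens if re.fullmatch(PHONE_PATTERN, token)]
```

Calling `find_phone_numbers` on extracted page text returns only the digit runs that match the whole pattern, so longer numbers such as ID or card numbers are not reported as phone numbers.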