Method for deleting duplicate domain name data based on Simhash algorithm

Cited by: 4


Authors: Hou Kaimao; Han Qingmin; Wu Yunfeng; Huang Bing; Zhang Jiufa; Chai Chuchu

Affiliation: [1] The 6th Research Institute of China Electronics Corporation, Beijing 100083, China

Source: Information Technology and Network Security (《信息技术与网络安全》), 2022, No. 4, pp. 71-76 (6 pages)

Abstract: With the development of digital technology, the volume of data that must be transmitted and stored in every field has risen sharply. A large share of that data is duplicated, which raises the cost of using the data and lowers the efficiency of processing it. Domain name data is stored in large volumes and places very high demands on processing speed. To reduce the storage cost of domain name resolution systems and improve transmission efficiency, this paper builds on existing data deduplication techniques by introducing the Simhash algorithm. Drawing on the structural characteristics of domain name data, it modifies how the data is tokenized and how fingerprint values are computed, and proposes a Simhash-based method for removing duplicate domain name data. Experimental results show that, compared with traditional data deduplication techniques, the method removes duplicate domain name data more efficiently and has good practical value.
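The abstract names the building blocks (tokenization, fingerprint computation, duplicate removal) without giving details, so the following is only a minimal sketch of generic Simhash applied to domain names, not the authors' method. The label-level tokenization with a position prefix, the use of MD5 as the per-token hash, the 64-bit fingerprint width, the `max_distance` threshold, and all function names are assumptions made for illustration.

```python
import hashlib


def tokenize(domain: str) -> list[str]:
    """Split a domain into position-tagged labels, counted from the TLD.

    Assumption: tagging each label with its level keeps 'a.b.com' and
    'b.a.com' from producing the same token set.
    """
    labels = [label for label in domain.lower().rstrip(".").split(".") if label]
    return [f"{level}:{label}" for level, label in enumerate(reversed(labels))]


def simhash(tokens: list[str], bits: int = 64) -> int:
    """Compute an unweighted Simhash fingerprint over the tokens."""
    counts = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(domains: list[str], max_distance: int = 0) -> list[str]:
    """Keep the first domain of each group whose fingerprints are within
    max_distance bits of an already kept fingerprint."""
    kept: list[str] = []
    seen: list[int] = []
    for domain in domains:
        fingerprint = simhash(tokenize(domain))
        if any(hamming_distance(fingerprint, s) <= max_distance for s in seen):
            continue
        seen.append(fingerprint)
        kept.append(domain)
    return kept
```

With `max_distance=0` the sketch only drops entries whose normalized labels are identical (for example, differing in case or a trailing dot); a larger threshold would also merge near-duplicates, at the risk of discarding distinct domains. The linear scan over `seen` is kept for clarity and would need an index to reach the processing speeds the abstract is concerned with.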

Keywords: data deduplication; domain name; Simhash; data chunking

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
