Distributed Data Stream Clustering Based on LSH on Cloud Environments

Cited by: 3


Authors: Qu Wu (曲武)[1,2], Wang Lijun (王莉军)[3], Han Xiaoguang (韩晓光)[4]

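Affiliations: [1] Department of Computer Science and Technology, Tsinghua University, Beijing 100084; [2] Core Research Institute, Beijing Venustech Information Security Technology Co., Ltd., Beijing 10019; [3] Institute of Scientific and Technical Information of China, Beijing 100038; [4] School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083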

Source: Computer Science (《计算机科学》), 2014, No. 11, pp. 195-202 (8 pages)

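Funding: Supported by the National Basic Research Program of China ("973" Program) (2007CB310803); the Key Program of the National Natural Science Foundation of China (61035004); the National Natural Science Foundation of China (60875029); and the Postdoctoral Fund of the Ministry of Science and Technology of China (2013M541005)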

Abstract: In recent years, the wide application of computer and information processing technology in industrial production, information processing, and related fields has continuously produced large volumes of sequence data that evolve over time. These data form time-series data streams in applications such as internet news corpus analysis, network intrusion detection, stock market analysis, and sensor network data analysis. Real-time clustering of data streams is a hot topic in current data stream mining research. Because such streams are high-speed and large-scale and call for real-time analysis, the data must often be analyzed on the fly. Single-pass scanning algorithms can meet these requirements, but the lack of effective clustering algorithms for identifying and distinguishing patterns limits their effectiveness and scalability. To address these problems, we propose DLCStream, a distributed data stream clustering algorithm based on locality-sensitive hashing (LSH) for cloud environments. By introducing the Map-Reduce framework and the LSH mechanism, DLCStream can quickly find clustering patterns in a data stream. Theoretical analysis and experimental results show that, compared with the traditional data stream clustering framework CluStream, DLCStream has advantages in efficient parallel processing, scalability, and the quality of the clustering results.

Keywords: data stream clustering; locality-sensitive hashing; Map-Reduce framework; DLCStream algorithm

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
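
Illustrative sketch: The abstract describes combining locality-sensitive hashing with the Map-Reduce framework to find clustering patterns in a single pass. This record does not give the details of DLCStream, so the following is only a minimal sketch of the general idea and not the authors' algorithm. It assumes a p-stable (Gaussian projection) LSH family and a per-bucket summary of count and linear sum. All function names and parameters (`make_hashes`, `lsh_key`, `map_phase`, `reduce_phase`, `centroids`, `width`) are hypothetical and chosen for illustration.

```python
import math
import random


def make_hashes(dim, num_hashes, width, seed=0):
    """Draw random Gaussian projections for h(v) = floor((a . v + b) / width).

    Assumption: a p-stable LSH family; the paper may use a different one.
    """
    rng = random.Random(seed)
    return [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
        for _ in range(num_hashes)
    ]


def lsh_key(point, hashes, width):
    """Concatenate the individual hash values into one bucket key."""
    return tuple(
        math.floor((sum(a_i * x_i for a_i, x_i in zip(a, point)) + b) / width)
        for a, b in hashes
    )


def map_phase(points, hashes, width):
    """Map step: emit one (bucket key, point) pair per point, reading each point once."""
    for p in points:
        yield lsh_key(p, hashes, width), p


def reduce_phase(pairs):
    """Reduce step: fold every bucket into a (count, linear sum) summary."""
    summaries = {}
    for key, p in pairs:
        count, total = summaries.get(key, (0, [0.0] * len(p)))
        summaries[key] = (count + 1, [t + x for t, x in zip(total, p)])
    return summaries


def centroids(summaries):
    """Turn each bucket summary into a candidate cluster centre."""
    return {key: [t / n for t in total] for key, (n, total) in summaries.items()}


if __name__ == "__main__":
    stream = [(0.0, 0.1), (0.1, 0.0), (10.0, 10.1), (10.1, 10.0)]
    hashes = make_hashes(dim=2, num_hashes=2, width=4.0, seed=1)
    for key, centre in centroids(reduce_phase(map_phase(stream, hashes, 4.0))).items():
        print(key, centre)
```

Nearby points tend to share a bucket key, so the reduce step groups them without pairwise distance computations. In a real Map-Reduce deployment the two phases would run on separate workers, with the bucket key acting as the shuffle key. The bucket `width` and the number of hash functions trade off bucket purity against fragmentation and would need tuning for real data.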

 
