Minimum Epsilon-Kernel Computation for Large-Scale Data Processing  

在线阅读下载全文

作  者:Hong-Jie Guo Jian-Zhong Li Hong Gao 郭鸿杰;李建中;高宏(School of Computer Science and Technology,Harbin Institute of Technology,Harbin 150001,China;Faculty of Computer Science and Control Engineering,Shenzhen Institute of Advanced Technology,Chinese Academy of Sciences,Shenzhen 518000,China)

机构地区:[1]School of Computer Science and Technology,Harbin Institute of Technology,Harbin 150001,China [2]Faculty of Computer Science and Control Engineering,Shenzhen Institute of Advanced Technology,Chinese Academy of Sciences,Shenzhen 518000,China

出  处:《Journal of Computer Science & Technology》2022年第6期1398-1411,共14页计算机科学技术学报(英文版)

基  金:the National Natural Science Foundation of China under Grant Nos.61732003,61832003,61972110 and U19A2059;the National Key Research and Development Program of China under Grant No.2019YFB2101902;the CCF-Baidu Open Fund CCF-BAIDU under Grant No.OF2021011.

摘  要:Kernel is a kind of data summary which is elaborately extracted from a large dataset.Given a problem,the solution obtained from the kernel is an approximate version of the solution obtained from the whole dataset with a provable approximate ratio.It is widely used in geometric optimization,clustering,and approximate query processing,etc.,for scaling them up to massive data.In this paper,we focus on the minimumε-kernel(MK)computation that asks for a kernel of the smallest size for large-scale data processing.For the open problem presented by Wang et al.that whether the minimumε-coreset(MC)problem and the MK problem can be reduced to each other,we first formalize the MK problem and analyze its complexity.Due to the NP-hardness of the MK problem in three or higher dimensions,an approximate algorithm,namely Set Cover-Based Minimumε-Kernel algorithm(SCMK),is developed to solve it.We prove that the MC problem and the MK problem can be Turing-reduced to each other.Then,we discuss the update of MK under insertion and deletion operations,respectively.Finally,a randomized algorithm,called the Randomized Algorithm of Set Cover-Based Minimumε-Kernel algorithm(RA-SCMK),is utilized to further reduce the complexity of SCMK.The efficiency and effectiveness of SCMK and RA-SCMK are verified by experimental results on real-world and synthetic datasets.Experiments show that the kernel sizes of SCMK are 2x and 17.6x smaller than those of an ANN-based method on real-world and synthetic datasets,respectively.The speedup ratio of SCMK over the ANN-based method is 5.67 on synthetic datasets.RA-SCMK runs up to three times faster than SCMK on synthetic datasets.

关 键 词:approximate query processing KERNEL large-scale dataset NP-HARD 

分 类 号:TP311[自动化与计算机技术—计算机软件与理论]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象