Minimum Epsilon-Kernel Computation for Large-Scale Data Processing

作　　者：Hong-Jie Guo Jian-Zhong Li Hong Gao 郭鸿杰;李建中;高宏(School of Computer Science and Technology,Harbin Institute of Technology,Harbin 150001,China;Faculty of Computer Science and Control Engineering,Shenzhen Institute of Advanced Technology,Chinese Academy of Sciences,Shenzhen 518000,China)

机构地区：[1]School of Computer Science and Technology,Harbin Institute of Technology,Harbin 150001,China [2]Faculty of Computer Science and Control Engineering,Shenzhen Institute of Advanced Technology,Chinese Academy of Sciences,Shenzhen 518000,China

出　　处：《Journal of Computer Science & Technology》2022年第6期1398-1411,共14页计算机科学技术学报（英文版）

基　　金：the National Natural Science Foundation of China under Grant Nos.61732003,61832003,61972110 and U19A2059;the National Key Research and Development Program of China under Grant No.2019YFB2101902;the CCF-Baidu Open Fund CCF-BAIDU under Grant No.OF2021011.

摘　　要：Kernel is a kind of data summary which is elaborately extracted from a large dataset.Given a problem,the solution obtained from the kernel is an approximate version of the solution obtained from the whole dataset with a provable approximate ratio.It is widely used in geometric optimization,clustering,and approximate query processing,etc.,for scaling them up to massive data.In this paper,we focus on the minimumε-kernel(MK)computation that asks for a kernel of the smallest size for large-scale data processing.For the open problem presented by Wang et al.that whether the minimumε-coreset(MC)problem and the MK problem can be reduced to each other,we first formalize the MK problem and analyze its complexity.Due to the NP-hardness of the MK problem in three or higher dimensions,an approximate algorithm,namely Set Cover-Based Minimumε-Kernel algorithm(SCMK),is developed to solve it.We prove that the MC problem and the MK problem can be Turing-reduced to each other.Then,we discuss the update of MK under insertion and deletion operations,respectively.Finally,a randomized algorithm,called the Randomized Algorithm of Set Cover-Based Minimumε-Kernel algorithm(RA-SCMK),is utilized to further reduce the complexity of SCMK.The efficiency and effectiveness of SCMK and RA-SCMK are verified by experimental results on real-world and synthetic datasets.Experiments show that the kernel sizes of SCMK are 2x and 17.6x smaller than those of an ANN-based method on real-world and synthetic datasets,respectively.The speedup ratio of SCMK over the ANN-based method is 5.67 on synthetic datasets.RA-SCMK runs up to three times faster than SCMK on synthetic datasets.

关键词：approximate query processing KERNEL large-scale dataset NP-HARD

分类号：TP311[自动化与计算机技术—计算机软件与理论]

参考文献：

正在载入数据...

二级参考文献：

正在载入数据...

耦合文献：

正在载入数据...

引证文献：

正在载入数据...

二级引证文献：

正在载入数据...

同被引文献：

正在载入数据...

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

Minimum Epsilon-Kernel Computation for Large-Scale Data Processing

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

高级检索检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

Minimum Epsilon-Kernel Computation for Large-Scale Data Processing

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

用户登录

高级检索检索式检索