Distributed observation point classifier for big data with random sample partition

Authors: LI Xu; HE Yulin; CUI Laizhong; HUANG Zhexue [1,2]; PHILIPPE Fournier-Viger

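Affiliations: [1] Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, Guangdong 518107, China; [2] College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China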

Source: Journal of Computer Applications (《计算机应用》), 2024, No. 6, pp. 1727-1733 (7 pages)

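Funding: National Natural Science Foundation of China (61972261); Natural Science Foundation of Guangdong Province (2023A1515011667); Shenzhen Basic Research Program (JCYJ20220818100205012, JCYJ20210324093609026).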

Abstract: The Observation Point Classifier (OPC) is a supervised learning model that attempts to transform a linearly inseparable problem in the multi-dimensional sample space into a linearly separable problem in a one-dimensional distance space, and it is particularly effective for classifying high-dimensional data. To reduce the high training complexity that OPC exhibits on big data classification problems, a Distributed OPC (DOPC) based on the Random Sample Partition (RSP) of big data was designed under the Spark framework. First, RSP data blocks of the big data were generated in a distributed computing environment and converted into a Resilient Distributed Dataset (RDD). Second, a set of OPCs was trained collaboratively on the RSP data blocks; because the OPC on each RSP data block is trained independently, this step can be implemented efficiently in Spark. Finally, the OPCs trained on the RSP data blocks were integrated into a DOPC under the Spark framework to predict the class labels of new samples. Experiments on eight big datasets were conducted on a Spark cluster to validate the feasibility, rationality, and effectiveness of DOPC. The results show that DOPC achieves higher testing accuracy than single-machine OPC at lower computational cost, and that it generalizes better than the RSP-model-based Neural Network (NN), Decision Tree (DT), Naive Bayes (NB), and K-Nearest Neighbor (KNN) classifiers implemented under the Spark framework. These results indicate that DOPC is an efficient, low-cost supervised learning algorithm for big data classification problems.
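The three steps in the abstract (partition, independent per-block training, integration) can be sketched in plain Python as follows. This is a minimal single-machine illustration under stated assumptions, not the authors' implementation: the abstract does not give the OPC training procedure or the integration rule, so the base learner here is a hypothetical stand-in that maps each sample to its distance from one fixed observation point and classifies in that one-dimensional distance space, and the integration step is assumed to be a simple majority vote. All function names are illustrative.

```python
import math
import random
from collections import Counter


def random_sample_partition(samples, num_blocks, seed=0):
    """Shuffle the data once, then deal it into disjoint blocks."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def train_toy_opc(block, observation_point):
    """Stand-in base learner (assumption, not the paper's OPC):
    store, per class, the mean distance to the observation point."""
    totals = {}
    counts = Counter()
    for features, label in block:
        distance = math.dist(features, observation_point)
        totals[label] = totals.get(label, 0.0) + distance
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


def predict_toy_opc(model, observation_point, features):
    """Pick the class whose mean distance is closest in 1-D distance space."""
    distance = math.dist(features, observation_point)
    return min(model, key=lambda label: abs(model[label] - distance))


def predict_ensemble(models, observation_point, features):
    """Integrate the per-block models by majority vote (assumed rule)."""
    votes = Counter(
        predict_toy_opc(model, observation_point, features) for model in models
    )
    return votes.most_common(1)[0][0]


def make_rings(num_samples, seed=0):
    """Two concentric rings: not linearly separable in 2-D, but separable
    by a threshold on the distance to the origin."""
    rng = random.Random(seed)
    data = []
    for i in range(num_samples):
        label, radius = ("inner", 1.0) if i % 2 == 0 else ("outer", 3.0)
        angle = rng.uniform(0.0, 2.0 * math.pi)
        r = radius + rng.uniform(-0.2, 0.2)
        data.append(((r * math.cos(angle), r * math.sin(angle)), label))
    return data


observation_point = (0.0, 0.0)  # hypothetical choice for this toy data
blocks = random_sample_partition(make_rings(600), num_blocks=6)
# Each block is trained on its own, so this loop is trivially parallel.
models = [train_toy_opc(block, observation_point) for block in blocks]
print(predict_ensemble(models, observation_point, (0.0, 1.0)))
print(predict_ensemble(models, observation_point, (3.0, 0.0)))
```

Because no block needs data from any other block during training, the list comprehension over `blocks` is the part that a Spark implementation would presumably run as a per-partition operation on the RDD, with only the small trained models collected for the voting step.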

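Keywords: big data classification; distributed file system; random sample partition; observation point classifier; Spark computing framework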

Classification number: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]
