Distributed random vector functional link network with subspace-based local connections


Authors: YU Wanguo; YUAN Zhenhao; CHEN Jiaqi; HE Yulin

Affiliations: [1] College of Mathematics and Computer Science, Hebei Normal University for Nationalities, Chengde 067000, Hebei Province, P.R.China [2] College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, Guangdong Province, P.R.China [3] Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen 518107, Guangdong Province, P.R.China

Source: Journal of Shenzhen University (Science and Engineering), 2022, No. 6, pp. 675-683 (9 pages)

Funding: Shenzhen Basic Research Program, General Project (JCYJ20210324093609026).

Abstract: To address the poor generalization ability and high computational complexity of the random vector functional link (RVFL) network on large-scale data classification, a distributed RVFL network with subspace-based local connections (DRVFL-SLC) is designed and implemented on the Spark framework. First, to exploit the partition parallelism of the resilient distributed dataset (RDD), a random sample partition (RSP) operation is applied to the large-scale dataset stored in the Hadoop distributed file system (HDFS), so that each RSP data block corresponds to one partition of the RDD. An RSP data block is a subset of the data whose probability distribution is consistent with that of the whole dataset at a given significance level. Next, the mapPartitions transformation is invoked on the multi-partition RDD in the distributed environment to train the optimal RVFL-SLC network for each partition in parallel. Then, the collect action gathers the optimal RVFL-SLC networks of all partitions, and these are fused asymptotically into the DRVFL-SLC network, which gives an approximate solution to the big data classification problem. Finally, the feasibility and effectiveness of DRVFL-SLC are verified on a Spark cluster with 6 computing nodes, using 8 large-scale datasets with at least one million records each. The results show that DRVFL-SLC achieves a good speedup ratio, scalability, and scale growth, and generalizes better than an RVFL-SLC network trained on a single machine with the full data.
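The partition, train, and fuse pipeline described in the abstract can be illustrated with a minimal single-process sketch in plain Python. It is not the authors' implementation, and several parts are assumptions made only for illustration: `rsp_blocks` approximates an RSP by shuffling and dealing rows, with no significance test; the "subspace-based local connection" is modelled as each hidden node seeing a random subset of the input features; output weights come from ridge regression; and fusion is a plain average of the per-block output scores, standing in for the paper's asymptotic fusion. All function names and parameters are hypothetical. On Spark, the list comprehension that builds `models` would correspond to `mapPartitions` followed by `collect`.

```python
import math
import random


def rsp_blocks(rows, k, rng):
    """Shuffle rows and deal them into k blocks (assumed stand-in for RSP)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [a[i][:] + b[i][:] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        piv = m[c][c]
        m[c] = [v / piv for v in m[c]]
        for r in range(n):
            if r != c and m[r][c] != 0.0:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]


def features(x, hidden):
    """Direct link (raw inputs), locally connected hidden nodes, and a bias."""
    h = [math.tanh(b + sum(w * x[j] for j, w in zip(idx, ws)))
         for idx, ws, b in hidden]
    return list(x) + h + [1.0]


def train_block(block, n_classes, n_hidden, subspace, lam, seed):
    """Train one RVFL-style network on one block (hypothetical SLC variant)."""
    rng = random.Random(seed)
    d = len(block[0][0])
    hidden = []
    for _ in range(n_hidden):
        idx = rng.sample(range(d), subspace)  # assumed: random feature subspace
        ws = [rng.uniform(-1, 1) for _ in idx]
        hidden.append((idx, ws, rng.uniform(-1, 1)))
    D = [features(x, hidden) for x, _ in block]
    p = len(D[0])
    # Ridge normal equations: (D^T D + lam * I) beta = D^T Y, with one-hot Y.
    A = [[sum(r[i] * r[j] for r in D) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    B = [[sum(r[i] for r, (_, y) in zip(D, block) if y == c)
          for c in range(n_classes)] for i in range(p)]
    return hidden, solve(A, B)


def scores(model, x):
    hidden, beta = model
    f = features(x, hidden)
    return [sum(fi * bi[c] for fi, bi in zip(f, beta))
            for c in range(len(beta[0]))]


def predict(models, x):
    """Fuse block models by averaging their scores (assumed fusion rule)."""
    per_model = [scores(m, x) for m in models]
    avg = [sum(s[c] for s in per_model) / len(models)
           for c in range(len(per_model[0]))]
    return max(range(len(avg)), key=avg.__getitem__)


def make_data(n, rng):
    """Synthetic two-class data: Gaussian blobs centred at -2 and +2."""
    rows = []
    for _ in range(n):
        y = rng.randrange(2)
        center = 2.0 if y == 1 else -2.0
        rows.append(([rng.gauss(center, 1.0) for _ in range(4)], y))
    return rows


rng = random.Random(0)
train, test = make_data(600, rng), make_data(200, rng)
blocks = rsp_blocks(train, 4, rng)
# One model per block; on Spark this is the mapPartitions + collect step.
models = [train_block(b, 2, 6, 2, 1e-2, seed) for seed, b in enumerate(blocks)]
accuracy = sum(predict(models, x) == y for x, y in test) / len(test)
print(f"fused accuracy on held-out data: {accuracy:.3f}")
```

Because each block is trained independently and only the small per-block models are gathered, the training cost per worker depends on the block size rather than on the full dataset, which is the property the abstract relies on for speedup.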

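Keywords: artificial intelligence; random vector functional link network; subspace-based local connections; random sample partition; Hadoop distributed file system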

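Classification No.: TP311 [Automation and Computer Technology: Computer Software and Theory]; TP14 [Automation and Computer Technology: Computer Science and Technology]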

 
