Authors: WANG Rui [1]; HAN Rui [2]; JIA Yu-xiang [1]
Affiliations: [1] School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China; [2] Advanced Computer Systems Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
Source: Computer and Modernization (《计算机与现代化》), 2018, No. 11, pp. 119-126 (8 pages)
Abstract: Machine learning is indispensable for big data analysis, and larger data sets can improve model accuracy. However, complex machine learning algorithms urgently need distributed in-memory computing to meet their time and performance demands. Spark's distributed in-memory computing parallelizes algorithm execution, which helps machine learning algorithms process large data sets. This paper therefore implements nonlinear machine learning algorithms in the Spark distributed in-memory environment, including a multi-layer variable neural network, BPPGD SVM, and K-means, and optimizes them with respect to data compression, biased data sampling, and data loading. To run batch scripts with fully configured resources, a Spark ML scheduling framework is also implemented to schedule the optimized algorithms. Experimental results show that the three optimized algorithms reduce the average error by 40% and the average running time by 90%.
Keywords: data compression; biased sampling; stochastic gradient descent; neural network; support vector machine
Classification Code: TP183 (Automation and Computer Technology: Control Theory and Control Engineering)
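The data-parallel pattern behind a Spark K-means implementation, as described in the abstract, can be sketched as follows. This is a hypothetical standalone illustration in pure Python (no Spark dependency), not the paper's code: each "partition" computes partial cluster sums (the map step), and the partials are merged to update centroids (the reduce step), which is the structure Spark's RDD operations parallelize across workers.

```python
# Illustrative data-parallel K-means. Each "partition" computes partial
# cluster sums and counts (map step); partials are then merged to update
# the centroids (reduce step). Hypothetical sketch, not the paper's code.

def assign(point, centroids):
    """Index of the nearest centroid (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[k])))

def partial_sums(partition, centroids, k, dim):
    """Map step: per-partition coordinate sums and counts per cluster."""
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in partition:
        j = assign(point, centroids)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += point[d]
    return sums, counts

def kmeans(partitions, k, dim, iters=10):
    """Run K-means over a list of partitions (naive init: first k points)."""
    centroids = [list(p) for p in partitions[0][:k]]
    for _ in range(iters):
        # Each call here is independent -- on Spark this loop is what
        # would be distributed across workers.
        partials = [partial_sums(p, centroids, k, dim) for p in partitions]
        # Reduce step: merge partial sums/counts and recompute centroids.
        for j in range(k):
            total = sum(counts[j] for _, counts in partials)
            if total:
                centroids[j] = [sum(sums[j][d] for sums, _ in partials) / total
                                for d in range(dim)]
    return centroids
```

The key design point is that only the small (sums, counts) pairs cross partition boundaries, never the raw points, which is what makes the pattern communication-efficient at scale.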