Data-driven parallel incremental support vector machine learning algorithm based on the Hadoop framework

Cited by: 2

Authors: Pi Wenjun (邳文君), Gong Xiujun (宫秀军) [1,2]

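Affiliations: [1] School of Computer Science and Technology, Tianjin University, Tianjin 300350; [2] Tianjin Key Laboratory of Cognitive Computing and Application (Tianjin University), Tianjin 300350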

Source: Journal of Computer Applications (《计算机应用》), 2016, Issue 11, pp. 3044-3049 (6 pages)

Funding: National Natural Science Foundation of China (61170177); National 863 Program key project (2015AA020101); National 973 Program project (2013CB32930X)

Abstract: Traditional Support Vector Machine (SVM) algorithms struggle with large-scale training data. To address this, a data-driven Parallel Incremental Adaboost-SVM (PIASVM) learning algorithm based on Hadoop is proposed. Using an ensemble learning strategy, each local classifier processes one partition of the data, and the classification results are fused into a combined classifier. During incremental learning, weights characterize the spatial distribution properties of the samples, which are iteratively reweighted, and a forgetting factor is used to select newly added samples and to discard historical samples. An HBase-based controller component schedules the iterative procedure, persists intermediate results, and reduces the bandwidth pressure that iteration places on the original MapReduce framework. Experimental results on multiple data sets show that the proposed algorithm performs well in speedup, sizeup, and scaleup, and improves the ability of SVM to process large-scale data while maintaining classification accuracy.
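The abstract gives no formulas, so the following is only a minimal single-machine sketch of the three ideas it names (one local classifier per partition, a weighted combination of their votes, and a forgetting factor applied to historical samples), not the PIASVM implementation. Everything in it is an assumption made for illustration: the linear SVM trained by plain subgradient descent, the textbook AdaBoost vote weight, the multiplicative decay with a drop threshold in `forget_and_merge`, the synthetic data from `make_data`, and all function and parameter names. Python lists stand in for Hadoop partitions, and no HBase controller is modeled.

```python
import math
import random

def train_linear_svm(samples, weights, epochs=50, lr=0.1, lam=0.01):
    """Weighted linear SVM trained by subgradient descent on the hinge loss."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for (x, y), s in zip(samples, weights):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 shrinkage on every step; hinge update only when margin < 1.
            w = [wi * (1 - lr * lam) for wi in w]
            if margin < 1:
                w = [wi + lr * s * y * xi for wi, xi in zip(w, x)]
                b += lr * s * y
    return w, b

def predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

def local_classifier(samples, weights):
    """Train on one partition; return (model, AdaBoost-style vote weight)."""
    model = train_linear_svm(samples, weights)
    wrong = sum(s for (x, y), s in zip(samples, weights) if predict(model, x) != y)
    err = min(max(wrong / sum(weights), 1e-6), 1 - 1e-6)  # keep the log finite
    return model, 0.5 * math.log((1 - err) / err)

def ensemble_predict(ensemble, x):
    """Combine the local classifiers by a weighted vote."""
    score = sum(alpha * predict(model, x) for model, alpha in ensemble)
    return 1 if score >= 0 else -1

def forget_and_merge(history, new_samples, forgetting_factor=0.5, threshold=0.2):
    """Decay historical weights, drop samples below threshold, add new ones at 1.0."""
    decayed = [(sample, s * forgetting_factor) for sample, s in history]
    kept = [(sample, s) for sample, s in decayed if s >= threshold]
    return kept + [(sample, 1.0) for sample in new_samples]

def make_data(rng, n):
    """Two well-separated Gaussian clusters labelled -1 and +1 (synthetic)."""
    data = []
    for _ in range(n):
        y = rng.choice([-1, 1])
        data.append(((rng.gauss(2.0 * y, 0.5), rng.gauss(2.0 * y, 0.5)), y))
    return data

rng = random.Random(0)
partitions = [make_data(rng, 40) for _ in range(3)]   # stand-in for data splits
ensemble = [local_classifier(p, [1.0] * len(p)) for p in partitions]
test_set = make_data(rng, 100)
accuracy = sum(ensemble_predict(ensemble, x) == y for x, y in test_set) / len(test_set)

# Incremental step: age the old samples, merge a new batch, retrain on the result.
history = [(sample, 1.0) for p in partitions for sample in p]
history = forget_and_merge(history, make_data(rng, 20))
updated = local_classifier([s for s, _ in history], [w for _, w in history])
```

In a MapReduce setting, `local_classifier` would roughly correspond to the work done per partition and `ensemble_predict` to the fusion step. How the paper actually schedules iterations and stores intermediate results through HBase is not specified in this record.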

Keywords: Hadoop; HBase; support vector machine; incremental learning; ensemble learning; forgetting factor; controller component

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
