Research on multi classification logistic regression based on HBase

Cited by: 11


Authors: Liu Lizhi[1,2]; Deng Jieyi; Wu Yuntao[1,2]

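Affiliations: [1] Hubei Province Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan 430205, China; [2] School of Computer Science & Engineering, Wuhan Institute of Technology, Wuhan 430205, China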

Source: Application Research of Computers (计算机应用研究), 2018, No. 10, pp. 3007-3010 (4 pages)

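Funding: Natural Science Foundation of Hubei Province (2014CFB791); Outstanding Young and Middle-aged Science and Technology Innovation Team Program of Hubei Provincial Higher Education Institutions (T201206)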

Abstract: In a big data environment, the dataset used to train a multi-class logistic regression model may exceed the memory of the client that performs the computation. To address this, the paper proposes a chunk batch gradient descent (chunk BGD) algorithm for computing the coefficients of the regression model. After the training dataset is stored in HBase, a data chunk of suitable size, containing training samples and their result values, can be retrieved by setting the StartRow and StopRow parameters of the scan object. To avoid frequent RPC calls from client to server, each retrieved chunk can be iterated over multiple times, which speeds up convergence of the coefficients. Once a chunk has reached the specified number of iterations, the next chunk is retrieved in row-key order. This cycle repeats until the coefficients converge or a specified loop-control threshold is reached. A multi-class logistic regression problem can be converted into binary classification problems, so a result-value column must be defined in the training table for each class; combined with the training-sample column family, the chunk BGD algorithm then yields the regression coefficients for each class. Experimental results show that the resulting coefficients classify the test samples accurately.
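The chunked training loop described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: an in-memory list sorted by row key stands in for the HBase table, and the names `scan_chunk`, `chunk_bgd`, `train_one_vs_rest` and `predict`, as well as the learning rate, chunk size, iteration counts and toy data, are all assumptions made for the example.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def scan_chunk(table, start_row, chunk_size):
    """Hypothetical stand-in for an HBase scan: up to chunk_size rows with
    key >= start_row. `table` is a list of (row_key, features, result_columns)
    tuples sorted by row key."""
    return [row for row in table if row[0] >= start_row][:chunk_size]

def chunk_bgd(table, result_col, n_features, chunk_size=4,
              iters_per_chunk=20, max_passes=100, lr=0.1, tol=1e-6):
    """Fit one binary logistic model, iterating several times over each chunk."""
    w = [0.0] * (n_features + 1)              # w[0] is the intercept
    for _ in range(max_passes):               # loop-control threshold
        start_row, largest_step = "", 0.0
        while True:
            chunk = scan_chunk(table, start_row, chunk_size)
            if not chunk:
                break
            for _ in range(iters_per_chunk):  # reuse the chunk already fetched
                grad = [0.0] * len(w)
                for _key, x, results in chunk:
                    xb = [1.0] + list(x)
                    p = sigmoid(sum(wi * xi for wi, xi in zip(w, xb)))
                    err = p - results[result_col]
                    for j, xj in enumerate(xb):
                        grad[j] += err * xj
                for j in range(len(w)):
                    step = lr * grad[j] / len(chunk)
                    w[j] -= step
                    largest_step = max(largest_step, abs(step))
            # Smallest key strictly greater than the last key fetched.
            start_row = chunk[-1][0] + "\0"
        if largest_step < tol:                # coefficients stopped moving
            break
    return w

def train_one_vs_rest(table, classes, n_features, **kwargs):
    """One binary model per class, each using that class's result column."""
    return {c: chunk_bgd(table, c, n_features, **kwargs) for c in classes}

def predict(weights, x):
    """Return the class whose binary model gives the highest probability."""
    xb = [1.0] + list(x)
    return max(weights, key=lambda c: sigmoid(
        sum(wi * xi for wi, xi in zip(weights[c], xb))))

# Toy table (made-up data): three 2-D clusters, interleaved by row key so
# that every chunk contains a mix of classes.
CLASSES = ["a", "b", "c"]
POINTS = {"a": [(0, 0), (1, 0), (0, 1), (1, 1)],
          "b": [(4, 0), (5, 0), (4, 1), (5, 1)],
          "c": [(0, 4), (1, 4), (0, 5), (1, 5)]}
TABLE = []
for i in range(12):
    label = CLASSES[i % 3]
    TABLE.append(("row%03d" % i, POINTS[label][i // 3],
                  {c: 1.0 if c == label else 0.0 for c in CLASSES}))
```

Calling `train_one_vs_rest(TABLE, CLASSES, n_features=2)` returns one coefficient vector per class, and `predict` picks the class with the largest score. Against a real HBase table, `scan_chunk` would be replaced by a client-side scan bounded by start and stop row keys; that part is deliberately left out here.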

Keywords: chunk batch gradient descent; multi-class classification; logistic regression; big data; HBase

Classification code: TP301.6 [Automation and Computer Technology: Computer System Architecture]

 
