Adaptive Synchronous Parallel Strategy in Distributed Machine Learning


Authors: WANG Xiao-xiao; ZHU Xiao-juan (School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan 232001, China)

Affiliation: School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan 232001, China

Source: Journal of Liaodong University (Natural Science Edition), 2024, No. 4, pp. 283-290

Funding: Key Project of Natural Science Research in Universities of Anhui Province (KJ2020A0300)

Abstract: In distributed machine learning, resource heterogeneity and resource instability easily give rise to the straggler problem, which makes it difficult for parallel strategies to balance synchronization lag against stale gradients, resulting in high synchronization overhead and reduced overall training efficiency. To address this, an adaptive synchronous parallel strategy for distributed machine learning is proposed. First, stragglers are identified from the parameter versions of the compute nodes and their training delay times. Second, the parameter server determines the state of each compute node by comparing the version difference between the newest and oldest parameters against a threshold. Finally, based on the mini-batch stochastic gradient descent algorithm, different global model parameter update rules are applied to adaptively regulate compute nodes in different states. Experimental results show that, compared with other parallel strategies, the proposed strategy reduces convergence time by 9.61% to 41.15% and improves accuracy by up to 3.29%.
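
The following is a minimal, hypothetical sketch (in Python) of the kind of logic the abstract describes: a parameter server that records each worker's parameter version and last contact time, classifies a worker as a straggler by comparing the newest-oldest version gap and the delay against thresholds, and applies a state-dependent mini-batch SGD update. All names (ParameterServer, pull, push) and the staleness-discount rule are illustrative assumptions, not the authors' published implementation.

    # Hypothetical sketch of the version-gap check and state-dependent update
    # described in the abstract; names and the discount rule for stale
    # gradients are assumptions, not the paper's actual algorithm.
    import time
    import numpy as np

    class ParameterServer:
        def __init__(self, dim, version_threshold=4, delay_threshold=2.0, lr=0.01):
            self.weights = np.zeros(dim)      # global model parameters
            self.global_version = 0           # incremented on every global update
            self.worker_version = {}          # parameter version each worker last pulled
            self.last_seen = {}               # wall-clock time of each worker's last pull
            self.version_threshold = version_threshold
            self.delay_threshold = delay_threshold
            self.lr = lr

        def pull(self, worker_id):
            """Worker fetches the current parameters; record its version and time."""
            self.worker_version[worker_id] = self.global_version
            self.last_seen[worker_id] = time.time()
            return self.weights.copy()

        def _is_straggler(self, worker_id):
            """Flag a worker whose parameter version lags the newest pulled version
            by more than the threshold, or whose training delay is too long."""
            newest = max(self.worker_version.values())
            oldest = self.worker_version[worker_id]
            too_stale = (newest - oldest) > self.version_threshold
            too_slow = (time.time() - self.last_seen[worker_id]) > self.delay_threshold
            return too_stale or too_slow

        def push(self, worker_id, gradient):
            """Apply a mini-batch SGD update whose weight depends on the
            worker's state (normal vs. straggler)."""
            if self._is_straggler(worker_id):
                # Assumed rule: discount a stale gradient by its version gap
                # instead of blocking the whole cluster on the slow worker.
                gap = self.global_version - self.worker_version[worker_id]
                scale = 1.0 / (1.0 + gap)
            else:
                scale = 1.0
            self.weights -= self.lr * scale * gradient
            self.global_version += 1

In such a scheme, each worker calls pull, computes a gradient on its mini-batch, and calls push; only the update rule changes with the node's state, so fast workers are never forced to wait for stragglers.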

Keywords: distributed machine learning; straggler nodes; parameter server; synchronization strategy

CLC Number: TP391.41 (Automation and Computer Technology: Computer Application Technology)

 
