Authors: Shen-Yi ZHAO, Chang-Wei SHI, Yin-Peng XIE, Wu-Jun LI
Source: Science China (Information Sciences), 2024, Issue 11, pp. 73-87 (15 pages)
Funding: Supported by the National Key R&D Program of China (Grant No. 2020YFA0713901), the National Natural Science Foundation of China (Grant Nos. 61921006, 62192783), and the Fundamental Research Funds for the Central Universities (Grant No. 020214380108).
Abstract: Stochastic gradient descent (SGD) and its variants have been the dominating optimization methods in machine learning. Compared with SGD with small-batch training, SGD with large-batch training can better utilize the computational power of current multi-core systems such as graphics processing units (GPUs) and can reduce the number of communication rounds in distributed training settings. Thus, SGD with large-batch training has attracted considerable attention. However, existing empirical results showed that large-batch training typically leads to a drop in generalization accuracy. Hence, how to guarantee the generalization ability in large-batch training becomes a challenging task. In this paper, we propose a simple yet effective method, called stochastic normalized gradient descent with momentum (SNGM), for large-batch training. We prove that with the same number of gradient computations, SNGM can adopt a larger batch size than momentum SGD (MSGD), which is one of the most widely used variants of SGD, to converge to an ε-stationary point. Empirical results on deep learning verify that when adopting the same large batch size, SNGM can achieve better test accuracy than MSGD and other state-of-the-art large-batch training methods.
Keywords: non-convex problems; large-batch training; stochastic normalized gradient descent; momentum
Classification: TP181 [Automation and Computer Technology - Control Theory and Control Engineering]
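The abstract above describes SNGM only at a high level and does not reproduce the paper's exact update rule. Below is a minimal, hypothetical NumPy sketch of one common way to combine gradient normalization with momentum; the function name sngm_step, the hyperparameters lr and beta, and the toy quadratic objective are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sngm_step(w, m, grad, lr=0.01, beta=0.9, eps=1e-12):
    """One illustrative normalized-gradient-with-momentum update (assumed form):
    the mini-batch gradient is rescaled to unit norm before entering the momentum
    buffer, so a single large-batch gradient cannot produce an overly large step."""
    g_hat = grad / (np.linalg.norm(grad) + eps)  # normalize the stochastic gradient
    m = beta * m + g_hat                         # momentum accumulation on the normalized gradient
    w = w - lr * m                               # parameter update
    return w, m

# Toy usage on f(w) = 0.5 * ||w||^2 with noisy gradients standing in for mini-batch gradients.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
m = np.zeros_like(w)
for _ in range(300):
    grad = w + 0.01 * rng.normal(size=5)  # stand-in for a large-batch stochastic gradient
    w, m = sngm_step(w, m, grad)
print(np.linalg.norm(w))  # norm of the final iterate; much smaller than the starting norm
```

Because every step has bounded length regardless of the gradient's magnitude, such normalized updates are one plausible way to keep large-batch training stable; the paper's theoretical guarantees, however, apply to its own precise formulation rather than to this sketch.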