An automatic learning rate decay strategy for stochastic gradient descent optimization methods in neural networks
Kang Wang, Yong Dou, Tao Sun (corresponding author), Peng Qiao, Dong Wen
National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, Changsha, China
Correspondence: Tao Sun, National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, 410073 Changsha, China. Email: [email protected]
Abstract
Stochastic gradient descent (SGD) optimization methods play a vital role in training neural networks and are attracting growing attention across science and engineering applications of intelligent systems. The choice of learning rate strongly affects the convergence rate of SGD-type methods. Current learning rate adjustment strategies face two main problems: (1) traditional learning rate decay schedules are tuned manually over the training iterations, and the small learning rates they produce slow the convergence of neural network training; (2) adaptive methods (e.g., Adam) often generalize poorly. To alleviate these issues, we propose a novel automatic learning rate decay strategy for SGD optimization methods in neural networks. Based on the observation that the upper bound of the convergence rate is minimized at a given iteration with respect to the current learning rate, we first derive an expression for the current learning rate in terms of the historical learning rates; only one extra parameter needs to be initialized to generate automatically decreasing learning rates during training. The proposed approach is applied to the SGD and momentum SGD optimization algorithms, and its convergence is established with a concrete theoretical proof. Numerical simulations are conducted on the MNIST and CIFAR-10 data sets with different neural networks. Experimental results show that our algorithm outperforms existing classical ones, achieving a faster convergence rate, better stability, and better generalization in neural network training. It also lays a foundation for large-scale parallel search of initial parameters in intelligent systems.
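To make the general idea concrete, the following is a minimal sketch of how an automatically decreasing learning rate driven by a single extra hyperparameter can be wrapped around plain (momentum) SGD training in PyTorch. This is not the update rule derived in the paper: the schedule lr_t = base_lr / (1 + decay * t) and the names AutoDecaySGD, base_lr, and decay are illustrative assumptions introduced here for exposition only.

# Illustrative sketch only: one extra hyperparameter `decay` drives an
# automatically decreasing learning rate for SGD. The 1/(1 + decay * t)
# schedule below is a generic placeholder, NOT the paper's derived expression.
import torch
import torch.nn as nn

class AutoDecaySGD:
    """Wraps torch.optim.SGD and lowers the learning rate before every step."""

    def __init__(self, params, base_lr=0.1, decay=1e-3, momentum=0.0):
        self.base_lr = base_lr      # initial learning rate
        self.decay = decay          # the single extra hyperparameter
        self.step_count = 0
        self.opt = torch.optim.SGD(params, lr=base_lr, momentum=momentum)

    def step(self):
        # Compute the decayed learning rate for this iteration and apply it.
        lr_t = self.base_lr / (1.0 + self.decay * self.step_count)
        for group in self.opt.param_groups:
            group["lr"] = lr_t
        self.opt.step()
        self.step_count += 1

    def zero_grad(self):
        self.opt.zero_grad()

# Toy usage on random data with a small fully connected network.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = AutoDecaySGD(model.parameters(), base_lr=0.1, decay=1e-3, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(128, 20)
y = torch.randint(0, 2, (128,))
for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

Under these assumptions, the only quantity a user tunes beyond the initial learning rate is the single decay parameter, which mirrors the paper's claim that merely one extra parameter needs to be initialized to produce the decreasing schedule.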
CONFLICTS OF INTEREST
The authors declare no conflicts of interest.