An automatic learning rate decay strategy for stochastic gradient descent optimization methods in neural networks
Kang Wang, Yong Dou, Tao Sun (corresponding author), Peng Qiao, Dong Wen
National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, Changsha, China
Correspondence: Tao Sun, National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, 410073 Changsha, China. Email: [email protected]
Abstract
Stochastic gradient descent (SGD) optimization methods play a vital role in training neural networks and are attracting growing attention across science and engineering applications of intelligent systems. The choice of learning rate strongly affects the convergence rate of SGD-type methods. Current learning rate adjustment strategies face two main problems: (1) traditional learning rate decay schedules are tuned manually over the training iterations, and the small learning rates they produce slow the convergence of neural network training; (2) adaptive methods (e.g., Adam) often generalize poorly. To alleviate these issues, we propose a novel automatic learning rate decay strategy for SGD optimization methods in neural networks. Based on the observation that the upper bound of the convergence rate is minimized at a given iteration with respect to the current learning rate, we first derive an expression for the current learning rate in terms of the historical learning rates; only one extra parameter needs to be initialized to generate automatically decreasing learning rates during training. The proposed approach is applied to the SGD and momentum SGD optimization algorithms, and its convergence is established with a concrete theoretical proof. Numerical simulations are conducted on the MNIST and CIFAR-10 data sets with different neural networks. Experimental results show that our algorithm outperforms existing classical ones, achieving a faster convergence rate, better stability, and better generalization in neural network training. It also lays a foundation for large-scale parallel search of initial parameters in intelligent systems.
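To make the general idea concrete, the following is a minimal sketch of how an automatically decreasing learning rate driven by a single extra hyperparameter can be wrapped around plain (momentum) SGD training in PyTorch. This is not the update rule derived in the paper: the schedule lr_t = base_lr / (1 + decay * t) and the names AutoDecaySGD, base_lr, and decay are illustrative assumptions introduced here for exposition only.

# Illustrative sketch only: one extra hyperparameter `decay` drives an
# automatically decreasing learning rate for SGD. The 1/(1 + decay * t)
# schedule below is a generic placeholder, NOT the paper's derived expression.
import torch
import torch.nn as nn

class AutoDecaySGD:
    """Wraps torch.optim.SGD and lowers the learning rate before every step."""

    def __init__(self, params, base_lr=0.1, decay=1e-3, momentum=0.0):
        self.base_lr = base_lr      # initial learning rate
        self.decay = decay          # the single extra hyperparameter
        self.step_count = 0
        self.opt = torch.optim.SGD(params, lr=base_lr, momentum=momentum)

    def step(self):
        # Compute the decayed learning rate for this iteration and apply it.
        lr_t = self.base_lr / (1.0 + self.decay * self.step_count)
        for group in self.opt.param_groups:
            group["lr"] = lr_t
        self.opt.step()
        self.step_count += 1

    def zero_grad(self):
        self.opt.zero_grad()

# Toy usage on random data with a small fully connected network.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = AutoDecaySGD(model.parameters(), base_lr=0.1, decay=1e-3, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(128, 20)
y = torch.randint(0, 2, (128,))
for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

Under these assumptions, the only quantity a user tunes beyond the initial learning rate is the single decay parameter, which mirrors the paper's claim that merely one extra parameter needs to be initialized to produce the decreasing schedule.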
CONFLICTS OF INTEREST
The authors declare no conflicts of interest.