Abstract
Gradient descent is widely used for large-scale optimization problems in machine learning; in particular, it now plays a major role in computing and updating the connection weights of neural networks in deep learning. However, many gradient-based optimization methods contain sensitive hyper-parameters that require extensive configuration effort. In this paper, we present a novel adaptive mechanism called adaptive exponential decay rate (AEDR). AEDR uses an adaptive exponential decay rate rather than a fixed, preconfigured one, which allows us to eliminate one otherwise tuning-sensitive hyper-parameter. AEDR computes the exponential decay rate adaptively from the moving averages of both the gradients and the squared gradients over time. The mechanism is then applied to Adadelta and Adam, reducing the number of their hyper-parameters to a single one to be tuned. We use a long short-term memory network and LeNet to demonstrate how the learning rate adapts dynamically. We show promising results compared with other state-of-the-art methods on four data sets: IMDB (movie reviews), SemEval-2016 (sentiment analysis in Twitter), CIFAR-10, and Pascal VOC-2012.
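To make the idea concrete, the sketch below shows an Adam-style update in which the exponential decay rate is derived on the fly from the moving averages of the gradients and squared gradients instead of being fixed in advance. This is a minimal illustration under stated assumptions, not the paper's exact method: the specific rule used to set the rate (the beta formula) is an assumed form, since the abstract does not give the AEDR formula.

    import numpy as np

    def init_state(theta):
        # Running moving averages of gradients (m) and squared gradients (v).
        return {"m": np.zeros_like(theta), "v": np.zeros_like(theta), "t": 0}

    def aedr_adam_step(theta, grad, state, lr=1e-3, eps=1e-8):
        # One Adam-style step where the exponential decay rate is computed
        # adaptively from the running averages instead of being fixed.
        # NOTE: the formula for beta below is an illustrative assumption,
        # not the paper's exact AEDR rule.
        m, v, t = state["m"], state["v"], state["t"] + 1

        # Adaptive decay rate: close to 1 when the first-moment estimate
        # explains most of the second moment (smooth gradients), smaller
        # when gradients are noisy.
        beta = float(np.clip(np.mean(m ** 2) / (np.mean(v) + eps), 0.1, 0.999))

        # Moving averages with the adaptive rate replacing Adam's fixed beta1/beta2.
        m = beta * m + (1.0 - beta) * grad
        v = beta * v + (1.0 - beta) * grad ** 2

        # Bias correction and parameter update, as in standard Adam.
        m_hat = m / (1.0 - beta ** t)
        v_hat = v / (1.0 - beta ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)

        state.update(m=m, v=v, t=t)
        return theta, state

    # Example: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
    theta = np.array([1.0, -2.0])
    state = init_state(theta)
    for _ in range(100):
        theta, state = aedr_adam_step(theta, 2.0 * theta, state, lr=0.05)

In this illustrative form, only the learning rate remains to be tuned, mirroring the reduction to a single hyper-parameter described above.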
Acknowledgements
The work was supported by the Fundamental Research Funds for the Central Universities (No. XDJK2017D059), the Scientific and Technological Research Program of Chongqing University of Education (Nos. KY2016TZ02 and 2017XJPT07), and the Key Research Program of Chongqing Education Science 13th Five-Year Plan 2017 (No. 2017-GX-139). Li Li is the corresponding author of the paper.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Zhang, J., Hu, F., Li, L. et al. An adaptive mechanism to achieve learning rate dynamically. Neural Comput & Applic 31, 6685–6698 (2019). https://doi.org/10.1007/s00521-018-3495-0