An adaptive mechanism to achieve learning rate dynamically

  • Original Article
  • Published in Neural Computing and Applications

Abstract

Gradient descent is prevalent in large-scale optimization problems in machine learning; in particular, it now plays a major role in computing and correcting the connection strengths of neural networks in deep learning. However, many gradient-based optimization methods involve sensitive hyper-parameters that require extensive configuration. In this paper, we present a novel adaptive mechanism called the adaptive exponential decay rate (AEDR). AEDR replaces the fixed, preconfigured exponential decay rate with an adaptive one, which allows us to eliminate one otherwise sensitive hyper-parameter. The decay rate is computed adaptively from the moving averages of both the gradients and the squared gradients over time. Applying the mechanism to Adadelta and Adam reduces the number of hyper-parameters to be tuned in each to a single one. We use long short-term memory networks and LeNet to demonstrate how the learning rate adapts dynamically, and we show promising results compared with other state-of-the-art methods on four data sets: IMDB (movie reviews), SemEval-2016 (sentiment analysis in Twitter), CIFAR-10, and Pascal VOC-2012.
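The abstract only sketches the mechanism, so the following is a minimal, illustrative Python sketch of an Adam-style step in which the fixed decay rates (beta1/beta2) are replaced by a rate derived from the running averages themselves. The specific formula for the adaptive rate used here (a bounded ratio of the squared gradient average to the average of squared gradients) is an assumption made for illustration, not the exact AEDR update defined in the paper, and the function name aedr_adam_step is hypothetical.

```python
import numpy as np

def aedr_adam_step(param, grad, m, v, lr=0.001, eps=1e-8):
    """One Adam-like update whose exponential decay rate is derived adaptively
    from the running averages instead of being a fixed constant.

    NOTE: illustrative sketch only. The paper defines the exact AEDR formula;
    here the decay rate is assumed to be the bounded ratio m^2 / v, which lies
    in [0, 1), grows when gradients are consistent, and shrinks when they are noisy.
    """
    # Hypothetical adaptive decay rate: elementwise ratio of the squared mean
    # gradient to the mean squared gradient, clipped away from 1 for safety.
    beta = np.clip((m * m) / (v + eps), 0.0, 1.0 - 1e-3)

    # Exponential moving averages of gradients and squared gradients,
    # driven by the adaptive decay rate instead of fixed beta1/beta2.
    m = beta * m + (1.0 - beta) * grad
    v = beta * v + (1.0 - beta) * grad * grad

    # Adam-style parameter update; lr is the single remaining hyper-parameter.
    param = param - lr * m / (np.sqrt(v) + eps)
    return param, m, v
```

With zero-initialized m and v, the adaptive rate starts at 0, so the averages begin from the raw gradient, and the learning rate lr remains the only hyper-parameter to tune in this sketch.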




Notes

  1. http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

  2. http://alt.qcri.org/semeval2016/task4/.

  3. http://www.cs.toronto.edu/~kriz/cifar.html.

  4. http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html.



Acknowledgements

The work was supported by the Fundamental Research Funds for the Central Universities (No. XDJK2017D059), the Scientific and Technological Research Program of the Chongqing University of Education (Nos. KY2016TZ02 and 2017XJPT07), and the Key Research Program of the Chongqing Education Science 13th Five-Year Plan 2017 (No. 2017-GX-139). Li Li is the corresponding author of the paper.

Author information


Corresponding author

Correspondence to Li Li.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.


About this article


Cite this article

Zhang, J., Hu, F., Li, L. et al. An adaptive mechanism to achieve learning rate dynamically. Neural Comput & Applic 31, 6685–6698 (2019). https://doi.org/10.1007/s00521-018-3495-0

