Abstract
Reward functions are an essential component of many robot learning methods. Defining such functions, however, remains difficult in many practical applications. For tasks such as grasping, no reliable success measures are available, and defining reward functions by hand requires extensive task knowledge and often leads to undesired emergent behavior. We introduce a framework in which the robot simultaneously learns an action policy and a model of the reward function by actively querying a human expert for ratings. We represent the reward model with a Gaussian process and evaluate several classical acquisition functions (AFs) from the Bayesian optimization literature in this context. Furthermore, we present a novel AF, expected policy divergence. We demonstrate our method on a robot grasping task and show that the learned reward function generalizes to a similar task. Additionally, we evaluate the proposed novel AF on a real-robot pendulum swing-up task.
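To make the active-querying idea concrete, the following is a minimal illustrative sketch, not the authors' implementation: a Gaussian process models the reward over rollout outcomes, and a simple uncertainty-based acquisition rule, standing in for the AFs studied in the paper, selects which outcome to send to the expert for a rating. The helper `query_expert` and the two-dimensional outcome space are hypothetical placeholders, and the sketch uses scikit-learn rather than any library from the paper.

```python
# Illustrative sketch of active reward learning with a GP reward model.
# Assumptions: outcomes are 2-D vectors, query_expert() is a stand-in for a
# human rating, and the acquisition is plain predictive-uncertainty maximization.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def query_expert(outcome):
    # Placeholder: in the real setting a human expert rates the rollout outcome.
    return -np.sum((outcome - 0.5) ** 2)

rng = np.random.default_rng(0)
X_rated = rng.uniform(size=(3, 2))                      # outcomes rated so far
y_rated = np.array([query_expert(x) for x in X_rated])

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)

for _ in range(10):
    gp.fit(X_rated, y_rated)                            # refit the reward model
    candidates = rng.uniform(size=(50, 2))              # outcomes of new rollouts
    mean, std = gp.predict(candidates, return_std=True)
    pick = np.argmax(std)                               # query the most uncertain outcome
    X_rated = np.vstack([X_rated, candidates[pick]])
    y_rated = np.append(y_rated, query_expert(candidates[pick]))

print("Expert ratings collected:", len(y_rated))
```

The sketch only illustrates the query loop; in the paper the acquisition function must additionally trade off the informativeness of a query against the cost of asking the human expert.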
Acknowledgments
The authors gratefully acknowledge the support of the European Union Projects #FP7-ICT-270327 (Complacs) and #FP7-ICT-2013-10 (3rd Hand).
Additional information
This is one of several papers published in Autonomous Robots comprising the “Special Issue on Robotics Science and Systems”.
Cite this article
Daniel, C., Kroemer, O., Viering, M. et al. Active reward learning with a novel acquisition function. Auton Robot 39, 389–405 (2015). https://doi.org/10.1007/s10514-015-9454-z