# Reinforcement Learning and Control

Methods for training agents to make sequences of decisions by rewarding desired behaviors and punishing undesired ones, used in robotics, game playing, and more.

#### TibGM: A Transferable and Information-Based Graphical Model Approach for Reinforcement Learning

Tameem Adel, Adrian Weller, June 2019. (In 36th International Conference on Machine Learning). Long Beach.

Abstract▼ URL

One of the challenges to reinforcement learning (RL) is scalable transferability among complex tasks. Incorporating a graphical model (GM), along with the rich family of related methods, as a basis for RL frameworks provides potential to address issues such as transferability, generalisation and exploration. Here we propose a flexible GM-based RL framework which leverages efficient inference procedures to enhance generalisation and transfer power. In our proposed transferable and information-based graphical model framework ‘TibGM’, we show the equivalence between our mutual information-based objective in the GM, and an RL consolidated objective consisting of a standard reward maximisation target and a generalisation/transfer objective. In settings where there is a sparse or deceptive reward signal, our TibGM framework is flexible enough to incorporate exploration bonuses depicting intrinsic rewards. We empirically verify improved performance and exploration power.

#### Policy search for learning robot control using sparse data

B. Bischoff, D. Nguyen-Tuong, D. van Hoof, A. McHutchon, Carl Edward Rasmussen, A. Knoll, M. P. Deisenroth, 2014. (In IEEE International Conference on Robotics and Automation). Hong Kong, China. IEEE. **DOI**: 10.1109/ICRA.2014.6907422.

Abstract▼ URL

In many complex robot applications, such as grasping and manipulation, it is difficult to program desired task solutions beforehand, as robots are within an uncertain and dynamic environment. In such cases, learning tasks from experience can be a useful alternative. To obtain a sound learning and generalization performance, machine learning, especially, reinforcement learning, usually requires sufficient data. However, in cases where only little data is available for learning, due to system constraints and practical issues, reinforcement learning can act suboptimally. In this paper, we investigate how model-based reinforcement learning, in particular the probabilistic inference for learning control method (PILCO), can be tailored to cope with the case of sparse data to speed up learning. The basic idea is to include further prior knowledge into the learning process. As PILCO is built on the probabilistic Gaussian processes framework, additional system knowledge can be incorporated by defining appropriate prior distributions, e.g. a linear mean Gaussian prior. The resulting PILCO formulation remains in closed form and analytically tractable. The proposed approach is evaluated in simulation as well as on a physical robot, the Festo Robotino XT. For the robot evaluation, we employ the approach for learning an object pick-up task. The results show that by including prior knowledge, policy learning can be sped up in presence of sparse data.

#### A minimum relative entropy principle for adaptive control in linear quadratic regulators

Daniel A. Braun, Pedro A. Ortega, 2010. (In Proceedings of the 7th international conference on informatics in control, automation and robotics).

Abstract▼

The design of optimal adaptive controllers is usually based on heuristics, because solving Bellman’s equations over information states is notoriously intractable. Approximate adaptive controllers often rely on the principle of certainty-equivalence where the control process deals with parameter point estimates as if they represented “true” parameter values. Here we present a stochastic control rule instead where controls are sampled from a posterior distribution over a set of probabilistic input-output models and the true model is identified by Bayesian inference. This allows reformulating the adaptive control problem as an inference and sampling problem derived from a minimum relative entropy principle. Importantly, inference and action sampling both work forward in time and hence such a Bayesian adaptive controller is applicable on-line. We demonstrate the improved performance that can be achieved by such an approach for linear quadratic regulator examples.

#### Path Integral Control and Bounded Rationality

Daniel A. Braun, Pedro A. Ortega, Evangelos Theodorou, Stefan Schaal, 2011. (In 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning).

Abstract▼ URL

Path integral methods have recently been shown to be applicable to a very general class of optimal control problems. Here we examine the path integral formalism from a decision-theoretic point of view, since an optimal controller can always be regarded as an instance of a perfectly rational decision-maker that chooses its actions so as to maximize its expected utility. The problem with perfect rationality is, however, that finding optimal actions is often very difficult due to prohibitive computational resource costs that are not taken into account. In contrast, a bounded rational decision-maker has only limited resources and therefore needs to strike some compromise between the desired utility and the required resource costs. In particular, we suggest an information-theoretic measure of resource costs that can be derived axiomatically. As a consequence we obtain a variational principle for choice probabilities that trades off maximizing a given utility criterion and avoiding resource costs that arise due to deviating from initially given default choice probabilities. The resulting bounded rational policies are in general probabilistic. We show that the solutions found by the path integral formalism are such bounded rational policies. Furthermore, we show that the same formalism generalizes to discrete control problems, leading to linearly solvable bounded rational control policies in the case of Markov systems. Importantly, Bellman’s optimality principle is not presupposed by this variational principle, but it can be derived as a limit case. This suggests that the information theoretic formalization of bounded rationality might serve as a general principle in control design that unifies a number of recently reported approximate optimal control methods both in the continuous and discrete domain.

#### Evaluating Model-Based Planning and Planner Amortization for Continuous Control

Arunkumar Byravan, Leonard Hasenclever, Piotr Trochim, Mehdi Mirza, Alessandro Davide Ialongo, Yuval Tassa, Jost Tobias Springenberg, Abbas Abdolmaleki, Nicolas Heess, Josh Merel, Martin Riedmiller, 2022. (In 10th International Conference on Learning Representations).

Abstract▼ URL

There is a widespread intuition that model-based control methods should be able to surpass the data efficiency of model-free approaches. In this paper we attempt to evaluate this intuition on various challenging locomotion tasks. We take a hybrid approach, combining model predictive control (MPC) with a learned model and model-free policy learning; the learned policy serves as a proposal for MPC. We show that MPC with learned proposals and models (trained on the fly or transferred from related tasks) can significantly improve performance and data efficiency with respect to model-free methods. However, we find that well-tuned model-free agents are strong baselines even for high DoF control problems. Finally, we show that it is possible to distil a model-based planner into a policy that amortizes the planning computation without any loss of performance.

#### Lazily Adapted Constant Kinky Inference for Nonparametric Regression and Model-Reference Adaptive Control

Jan-Peter Calliess, 2016. (arXiv).

Abstract▼ URL

Techniques known as Nonlinear Set Membership prediction, Lipschitz Interpolation or Kinky Inference are approaches to machine learning that utilise presupposed Lipschitz properties to compute inferences over unobserved function values. Provided a bound on the true best Lipschitz constant of the target function is known a priori they offer convergence guarantees as well as bounds around the predictions. Considering a more general setting that builds on Hölder continuity relative to pseudo-metrics, we propose an online method for estimating the Hoelder constant online from function value observations that possibly are corrupted by bounded observational errors. Utilising this to compute adaptive parameters within a kinky inference rule gives rise to a nonparametric machine learning method, for which we establish strong universal approximation guarantees. That is, we show that our prediction rule can learn any continuous function in the limit of increasingly dense data to within a worst-case error bound that depends on the level of observational uncertainty. We apply our method in the context of nonparametric model-reference adaptive control (MRAC). Across a range of simulated aircraft roll-dynamics and performance metrics our approach outperforms recently proposed alternatives that were based on Gaussian processes and RBF-neural networks. For discrete-time systems, we provide stability guarantees for our learning-based controllers both for the batch and the online learning setting.

#### Nonlinear Set Membership Regression with Adaptive Hyper-Parameter Estimation for Online Learning and Control

Jan-Peter Calliess, Stephen Roberts, Carl Edward Rasmussen, Jan Maciejowski, 2018. (In Proceedings of the European Control Conference).

Abstract▼ URL

Methods known as Lipschitz Interpolation or Nonlinear Set Membership regression have become established tools for nonparametric system-identification and data-based control. They utilise presupposed Lipschitz properties to compute inferences over unobserved function values. Unfortunately, they rely on the a priori knowledge of a Lipschitz constant of the underlying target function which serves as a hyperparameter. We propose a closed-form estimator of the Lipschitz constant that is robust to bounded observational noise in the data. The merger of Lipschitz Interpolation with the new hyperparameter estimator gives a new nonparametric machine learning method for which we derive online learning convergence guarantees. Furthermore, we apply our learning method to model-reference adaptive control and provide a convergence guarantee on the closed-loop dynamics. In a simulated flight manoeuvre control scenario, we compare the performance of our approach to recently proposed alternative learning-based controllers.

#### Stochastic Flows and Geometric Optimization on the Orthogonal Group

Krzysztof Choromanski, David Cheikhi, Jared Davis, Valerii Likhosherstov, Achille Nazaret, Achraf Bahamou, Xingyou Song, Mrugank Akarte, Jack Parker-Holder, Jacob Bergquist, Yuan Gao, Aldo Pacchiano, Tamas Sarlos, Adrian Weller, Vikas Sindhwani, 2020. (In 37th International Conference on Machine Learning).

Abstract▼ URL

We present a new class of stochastic, geometrically-driven optimization algorithms on the orthogonal group O(d) and naturally reductive homogeneous manifolds obtained from the action of the rotation group SO(d). We theoretically and experimentally demonstrate that our methods can be applied in various fields of machine learning including deep, convolutional and recurrent neural networks, reinforcement learning, normalizing flows and metric learning. We show an intriguing connection between efficient stochastic optimization on the orthogonal group and graph theory (e.g. matching problem, partition functions over graphs, graph-coloring). We leverage the theory of Lie groups and provide theoretical results for the designed class of algorithms. We demonstrate broad applicability of our methods by showing strong performance on the seemingly unrelated tasks of learning world models to obtain stable policies for the most difficult Humanoid agent from OpenAI Gym and improving convolutional neural networks.

#### Distributional Reinforcement Learning with Quantile Regression

Will Dabney, Mark Rowland, Marc G. Bellemare, Rémi Munos, February 2018. (In 32nd AAAI Conference on Artificial Intelligence). New Orleans.

Abstract▼ URL

In reinforcement learning an agent interacts with the environment by taking actions and observing the next state and reward. When sampled probabilistically, these state transitions, rewards, and actions can all induce randomness in the observed long-term return. Traditionally, reinforcement learning algorithms average over this randomness to estimate the value function. In this paper, we build on recent work advocating a distributional approach to reinforcement learning in which the distribution over returns is modeled explicitly instead of only estimating the mean. That is, we examine methods of learning the value distribution instead of the value function. We give results that close a number of gaps between the theoretical and algorithmic results given by Bellemare, Dabney, and Munos (2017). First, we extend existing results to the approximate distribution setting. Second, we present a novel distributional reinforcement learning algorithm consistent with our theoretical formulation. Finally, we evaluate this new algorithm on the Atari 2600 games, observing that it significantly outperforms many of the recent improvements on DQN, including the related distributional algorithm C51.

#### Efficient Reinforcement Learning using Gaussian Processes

Marc Peter Deisenroth, 2010. Karlsruhe Institute of Technology, Karlsruhe, Germany.

Abstract▼ URL

In many research areas, including control and medical applications, we face decision-making problems where data are limited and/or the underlying generative process is complicated and partially unknown. In these scenarios, we can profit from algorithms that learn from data and aid decision making. Reinforcement learning (RL) is a general computational approach to experience-based goal-directed learning for sequential decision making under uncertainty. However, RL often lacks efficiency in terms of the number of required trials when no task-specific knowledge is available. This lack of efficiency makes RL often inapplicable to (optimal) control problems. Thus, a central issue in RL is to speed up learning by extracting more information from available experience. The contributions of this dissertation are threefold: 1. We propose PILCO, a fully Bayesian approach for efficient RL in continuous-valued state and action spaces when no expert knowledge is available. PILCO is based on well-established ideas from statistics and machine learning. PILCO’s key ingredient is a probabilistic dynamics model learned from data, which is implemented by a Gaussian process (GP). The GP carefully quantifies knowledge by a probability distribution over plausible dynamics models. By averaging over all these models during long-term planning and decision making, PILCO takes uncertainties into account in a principled way and, therefore, reduces model bias, a central problem in model-based RL. 2. Due to its generality and efficiency, PILCO can be considered a conceptual and practical approach to jointly learning models and controllers when expert knowledge is difficult to obtain or simply not available. For this scenario, we investigate PILCO’s properties its applicability to challenging real and simulated nonlinear control problems. For example, we consider the tasks of learning to swing up a double pendulum attached to a cart or to balance a unicycle with five degrees of freedom. Across all tasks we report unprecedented automation and an unprecedented learning efficiency for solving these tasks. 3. As a step toward pilco’s extension to partially observable Markov decision processes, we propose a principled algorithm for robust filtering and smoothing in GP dynamic systems. Unlike commonly used Gaussian filters for nonlinear systems, it does neither rely on function linearization nor on finite-sample representations of densities. Our algorithm profits from exact moment matching for predictions while keeping all computations analytically tractable. We present experimental evidence that demonstrates the robustness and the advantages of our method over unscented Kalman filters, the cubature Kalman filter, and the extended Kalman filter.

#### Gaussian Processes for Data-Efficient Learning in Robotics and Control

Marc Peter Deisenroth, Dieter Fox, Carl Edward Rasmussen, 2015. (IEEE Transactions on Pattern Analysis and Machine Intelligence). **DOI**: 10.1109/TPAMI.2013.218.

Abstract▼

Autonomous learning has been a promising direction in control and robotics for more than a decade since data-driven learning allows to reduce the amount of engineering knowledge, which is otherwise required. However, autonomous reinforcement learning (RL) approaches typically require many interactions with the system to learn controllers, which is a practical limitation in real systems, such as robots, where many interactions can be impractical and time consuming. To address this problem, current learning approaches typically require task-speciﬁc knowledge in form of expert demonstrations, realistic simulators, pre-shaped policies, or speciﬁc knowledge about the underlying dynamics. In this article, we follow a different approach and speed up learning by extracting more information from data. In particular, we learn a probabilistic, non-parametric Gaussian process transition model of the system. By explicitly incorporating model uncertainty into long-term planning and controller learning our approach reduces the effects of model errors, a key problem in model-based learning. Compared to state-of-the art RL our model-based policy search method achieves an unprecedented speed of learning. We demonstrate its applicability to autonomous learning in real robot and control tasks.

#### Bayesian Inference for Efficient Learning in Control

Marc Peter Deisenroth, Carl Edward Rasmussen, June 2009. (In Multidisciplinary Symposium on Reinforcement Learning). Montréal, QC, Canada.

Abstract▼ URL

In contrast to humans or animals, artificial learners often require more trials when learning motor control tasks solely based on experience. Efficient autonomous learners will reduce the amount of engineering required to solve control problems. By using probabilistic forward models, we can employ two key ingredients of biological learning systems to speed up artificial learning. We present a consistent and coherent Bayesian framework that allows for efficient autonomous experience-based learning. We demonstrate the success of our learning algorithm by applying it to challenging nonlinear control problems in simulation and in hardware.

#### Efficient Reinforcement Learning for Motor Control

Marc Peter Deisenroth, Carl Edward Rasmussen, September 2009. (In 10th International PhD Workshop on Systems and Control). Hluboká nad Vltavou, Czech Republic.

Abstract▼ URL

Artificial learners often require many more trials than humans or animals when learning motor control tasks in the absence of expert knowledge. We implement two key ingredients of biological learning systems, generalization and incorporation of uncertainty into the decision-making process, to speed up artificial learning. We present a coherent and fully Bayesian framework that allows for efficient artificial learning in the absence of expert knowledge. The success of our learning framework is demonstrated on challenging nonlinear control problems in simulation and in hardware.

#### PILCO: A Model-Based and Data-Efficient Approach to Policy Search

Marc Peter Deisenroth, Carl Edward Rasmussen, 2011. (In 28th International Conference on Machine Learning).

Abstract▼ URL

In this paper, we introduce PILCO, a practical, data-efficient model-based policy search method. PILCO reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way. By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, PILCO can cope with very little data and facilitates learning from scratch in only a few trials. Policy evaluation is performed in closed form using state-of-the-art approximate inference. Furthermore, policy gradients are computed analytically for policy improvement. We report unprecedented learning efficiency on challenging and high-dimensional control tasks.

**Comment:** web site

#### Learning to Control a Low-Cost Manipulator using Data-Efficient Reinforcement Learning

Marc Peter Deisenroth, Carl Edward Rasmussen, Dieter Fox, June 2011. (In 9th International Conference on Robotics: Science & Systems). Los Angeles, CA, USA.

Abstract▼ URL

Over the last years, there has been substantial progress in robust manipulation in unstructured environments. The long-term goal of our work is to get away from precise, but very expensive robotic systems and to develop affordable, potentially imprecise, self-adaptive manipulator systems that can interactively perform tasks such as playing with children. In this paper, we demonstrate how a low-cost off-the-shelf robotic system can learn closed-loop policies for a stacking task in only a handful of trials - from scratch. Our manipulator is inaccurate and provides no pose feedback. For learning a controller in the work space of a Kinect-style depth camera, we use a model-based reinforcement learning technique. Our learning method is data efficient, reduces model bias, and deals with several noise sources in a principled way during long-term planning. We present a way of incorporating state-space constraints into the learning process and analyze the learning gain by exploiting the sequential structure of the stacking task.

**Comment:** project site

#### Model-Based Reinforcement Learning with Continuous States and Actions

Marc Peter Deisenroth, Carl Edward Rasmussen, Jan Peters, April 2008. (In Proceedings of the 16th European Symposium on Artificial Neural Networks (ESANN 2008)). Bruges, Belgium.

Abstract▼ URL

Finding an optimal policy in a reinforcement learning (RL) framework with continuous state and action spaces is challenging. Approximate solutions are often inevitable. GPDP is an approximate dynamic programming algorithm based on Gaussian process (GP) models for the value functions. In this paper, we extend GPDP to the case of unknown transition dynamics. After building a GP model for the transition dynamics, we apply GPDP to this model and determine a continuous-valued policy in the entire state space. We apply the resulting controller to the underpowered pendulum swing up. Moreover, we compare our results on this RL task to a nearly optimal discrete DP solution in a fully known environment.

#### Gaussian process dynamic programming

Marc Peter Deisenroth, Carl Edward Rasmussen, Jan Peters, March 2009. (Neurocomputing). Elsevier B. V.. **DOI**: 10.1016/j.neucom.2008.12.019.

Abstract▼ URL

Reinforcement learning (RL) and optimal control of systems with continuous states and actions require approximation techniques in most interesting cases. In this article, we introduce Gaussian process dynamic programming (GPDP), an approximate value function-based RL algorithm. We consider both a classic optimal control problem, where problem-specific prior knowledge is available, and a classic RL problem, where only very general priors can be used. For the classic optimal control problem, GPDP models the unknown value functions with Gaussian processes and generalizes dynamic programming to continuous-valued states and actions. For the RL problem, GPDP starts from a given initial state and explores the state space using Bayesian active learning. To design a fast learner, available data have to be used efficiently. Hence, we propose to learn probabilistic models of the a priori unknown transition dynamics and the value functions on the fly. In both cases, we successfully apply the resulting continuous-valued controllers to the under-actuated pendulum swing up and analyze the performances of the suggested algorithms. It turns out that GPDP uses data very efficiently and can be applied to problems, where classic dynamic programming would be cumbersome.

**Comment:** code.

#### A Comparison of Human and Agent Reinforcement Learning in Partially Observable Domains

Finale Doshi-Velez, Zoubin Ghahramani, 2011. (In 33rd Annual Meeting of the Cognitive Science Society). Boston, MA.

Abstract▼ URL

It is commonly stated that reinforcement learning (RL) algorithms learn slower than humans. In this work, we investigate this claim using two standard problems from the RL literature. We compare the performance of human subjects to RL techniques. We find that context—the meaningfulness of the observations—–plays a significant role in the rate of human RL. Moreover, without contextual information, humans often fare much worse than classic algorithms. Comparing the detailed responses of humans and RL algorithms, we also find that humans appear to employ rather different strategies from standard algorithms, even in cases where they had indistinguishable performance to them. Our research both sheds light on human RL and provides insights for improving RL algorithms.

#### Leave no Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning

Benjamin Eysenbach, Shixiang Gu, Julian Ibarz, Sergey Levine, Apr 2018. (In 6th International Conference on Learning Representations). Vancouver CANADA.

Abstract▼ URL

Deep reinforcement learning algorithms can learn complex behavioral skills, but real-world application of these methods requires a large amount of experience to be collected by the agent. In practical settings, such as robotics, this involves repeatedly attempting a task, resetting the environment between each attempt. However, not all tasks are easily or automatically reversible. In practice, this learning process requires extensive human intervention. In this work, we propose an autonomous method for safe and efficient reinforcement learning that simultaneously learns a forward and reset policy, with the reset policy resetting the environment for a subsequent attempt. By learning a value function for the reset policy, we can automatically determine when the forward policy is about to enter a non-reversible state, providing for uncertainty-aware safety aborts. Our experiments illustrate that proper use of the reset policy can greatly reduce the number of manual resets required to learn a task, can reduce the number of unsafe actions that lead to non-reversible states, and can automatically induce a curriculum.

**Comment:** [Video]

#### Learning Nonlinear Dynamical Systems Using an EM Algorithm

Zoubin Ghahramani, Sam T. Roweis, 1998. (In NIPS). Edited by Michael J. Kearns, Sara A. Solla, David A. Cohn. The MIT Press. **ISBN**: 0-262-11245-0.

Abstract▼ URL

The Expectation Maximization (EM) algorithm is an iterative procedure for maximum likelihood parameter estimation from data sets with missing or hidden variables. It has been applied to system identification in linear stochastic state-space models, where the state variables are hidden from the observer and both the state and the parameters of the model have to be estimated simultaneously [9]. We present a generalization of the EM algorithm for parameter estimation in nonlinear dynamical systems. The “expectation” step makes use of Extended Kalman Smoothing to estimate the state, while the “maximization” step re-estimates the parameters using these uncertain state estimates. In general, the nonlinear maximization step is difficult because it requires integrating out the uncertainty in the states. However, if Gaussian radial basis function (RBF) approximators are used to model the nonlinearities, the integrals become tractable and the maximization step can be solved via systems of linear equations.

#### Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates

Shixiang Gu, Ethan Holly, Timothy Lillicrap, Sergey Levine, May 2017. (In IEEE International Conference on Robotics and Automation). SINGAPORE.

Abstract▼ URL

Reinforcement learning holds the promise of enabling autonomous robots to learn large repertoires of behavioral skills with minimal human intervention. However, robotic applications of reinforcement learning often compromise the autonomy of the learning process in favor of achieving training times that are practical for real physical systems. This typically involves introducing hand-engineered policy representations and human-supplied demonstrations. Deep reinforcement learning alleviates this limitation by training general-purpose neural network policies, but applications of direct deep reinforcement learning algorithms have so far been restricted to simulated settings and relatively simple tasks, due to their apparent high sample complexity. In this paper, we demonstrate that a recent deep reinforcement learning algorithm based on off-policy training of deep Q-functions can scale to complex 3D manipulation tasks and can learn deep neural network policies efficiently enough to train on real physical robots. We demonstrate that the training times can be further reduced by parallelizing the algorithm across multiple robots which pool their policy updates asynchronously. Our experimental evaluation shows that our method can learn a variety of 3D manipulation skills in simulation and a complex door opening skill on real robots without any prior demonstrations or manually designed representations.

**Comment:** [Google Blogpost] [MIT Technology Review] [Video]

#### Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic

Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Sergey Levine, April 2017. (In 5th International Conference on Learning Representations). Toulon France.

Abstract▼ URL

Model-free deep reinforcement learning (RL) methods have been successful in a wide variety of simulated domains. However, a major obstacle facing deep RL in the real world is their high sample complexity. Batch policy gradient methods offer stable learning, but at the cost of high variance, which often requires large batches. TD-style methods, such as off-policy actor-critic and Q-learning, are more sample-efficient but biased, and often require costly hyperparameter sweeps to stabilize. In this work, we aim to develop methods that combine the stability of policy gradients with the efficiency of off-policy RL. We present Q-Prop, a policy gradient method that uses a Taylor expansion of the off-policy critic as a control variate. Q-Prop is both sample efficient and stable, and effectively combines the benefits of on-policy and off-policy methods. We analyze the connection between Q-Prop and existing model-free algorithms, and use control variate theory to derive two variants of Q-Prop with conservative and aggressive adaptation. We show that conservative Q-Prop provides substantial gains in sample efficiency over trust region policy optimization (TRPO) with generalized advantage estimation (GAE), and improves stability over deep deterministic policy gradient (DDPG), the state-of-the-art on-policy and off-policy methods, on OpenAI Gym’s MuJoCo continuous control environments.

#### Interpolated Policy Gradient: Merging On-Policy and Off-Policy Policy Gradient Estimation for Deep Reinforcement Learning

Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Bernhard Schölkopf, Sergey Levine, Dec 2017. (In Advances in Neural Information Processing Systems 31). Long Beach USA.

Abstract▼ URL

Off-policy model-free deep reinforcement learning methods using previously collected data can improve sample efficiency over on-policy policy gradient techniques. On the other hand, on-policy algorithms are often more stable and easier to use. This paper examines, both theoretically and empirically, approaches to merging on- and off-policy updates for deep reinforcement learning. Theoretical results show that off-policy updates with a value function estimator can be interpolated with on-policy policy gradient updates whilst still satisfying performance bounds. Our analysis uses control variate methods to produce a family of policy gradient algorithms, with several recently proposed algorithms being special cases of this family. We then provide an empirical comparison of these techniques with the remaining algorithmic details fixed, and show how different mixing of off-policy gradient estimates with on-policy samples contribute to improvements in empirical performance. The final algorithm provides a generalization and unification of existing deep policy gradient techniques, has theoretical guarantees on the bias introduced by off-policy updates, and improves on the state-of-the-art model-free deep RL methods on a number of OpenAI Gym continuous control benchmarks.

#### Continuous Deep Q-Learning with Model-based Acceleration

Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, Sergey Levine, June 2016. (In 33rd International Conference on Machine Learning). New York USA.

Abstract▼ URL

Model-free reinforcement learning has been successfully applied to a range of challenging problems, and has recently been extended to handle large neural network policies and value functions. However, the sample complexity of model-free algorithms, particularly when using high-dimensional function approximators, tends to limit their applicability to physical systems. In this paper, we explore algorithms and representations to reduce the sample complexity of deep reinforcement learning for continuous control tasks. We propose two complementary techniques for improving the efficiency of such algorithms. First, we derive a continuous variant of the Q-learning algorithm, which we call normalized adantage functions (NAF), as an alternative to the more commonly used policy gradient and actor-critic methods. NAF representation allows us to apply Q-learning with experience replay to continuous tasks, and substantially improves performance on a set of simulated robotic control tasks. To further improve the efficiency of our approach, we explore the use of learned models for accelerating model-free reinforcement learning. We show that iteratively refitted local linear models are especially effective for this, and demonstrate substantially faster learning on domains where such models are applicable.

#### Reinforcement Learning with Reference Tracking Control in Continuous State Spaces

Joseph Hall, Carl Edward Rasmussen, Jan Maciejowski, 2011. (In Proceedings of 50th IEEE Conference on Decision and Control and European Control Conference).

Abstract▼ URL

The contribution described in this paper is an algorithm for learning nonlinear, reference tracking, control policies given no prior knowledge of the dynamical system and limited interaction with the system through the learning process. Concepts from the field of reinforcement learning, Bayesian statistics and classical control have been brought together in the formulation of this algorithm which can be viewed as a form indirect self tuning regulator. On the task of reference tracking using the inverted pendulum it was shown to yield generally improved performance on the best controller derived from the standard linear quadratic method using only 30 s of total interaction with the system. Finally, the algorithm was shown to work on the double pendulum proving its ability to solve nontrivial control tasks.

#### Modelling and Control of Nonlinear Systems using Gaussian Processes with Partial Model Information

Joseph Hall, Carl Edward Rasmussen, Jan Maciejowski, 2012. (In 51st IEEE Conference on Decision and Control).

Abstract▼ URL

Gaussian processes are gaining increasing popularity among the control community, in particular for the modelling of discrete time state space systems. However, it has not been clear how to incorporate model information, in the form of known state relationships, when using a Gaussian process as a predictive model. An obvious example of known prior information is position and velocity related states. Incorporation of such information would be beneficial both computationally and for faster dynamics learning. This paper introduces a method of achieving this, yielding faster dynamics learning and a reduction in computational effort from O(Dn2) to O((D-F)n2) in the prediction stage for a system with D states, F known state relationships and n observations. The effectiveness of the method is demonstrated through its inclusion in the PILCO learning algorithm with application to the swing-up and balance of a torque-limited pendulum and the balancing of a robotic unicycle in simulation.

#### Successor Uncertainties: exploration and uncertainty in temporal difference learning

David Janz, Jiri Hron, Przemyslaw Mazur, José Miguel Hernández-Lobato, Katja Hofmann, Sebastian Tschiatschek, 2019. (NeurIPS).

Abstract▼ URL

Posterior sampling for reinforcement learning (PSRL) is an effective method for balancing exploration and exploitation in reinforcement learning. Randomised value functions (RVF) can be viewed as a promising approach to scaling PSRL. However, we show that most contemporary algorithms combining RVF with neural network function approximation do not possess the properties which make PSRL effective, and provably fail in sparse reward problems. Moreover, we find that propagation of uncertainty, a property of PSRL previously thought important for exploration, does not preclude this failure. We use these insights to design Successor Uncertainties (SU), a cheap and easy to implement RVF algorithm that retains key properties of PSRL. SU is highly effective on hard tabular exploration benchmarks. Furthermore, on the Atari 2600 domain, it surpasses human performance on 38 of 49 games tested (achieving a median human normalised score of 2.09), and outperforms its closest RVF competitor, Bootstrapped DQN, on 36 of those.

#### Sequence Tutor: Conservative fine-tuning of sequence generation models with KL-control

Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, Jose Miguel Hernndez Lobato, Richard E. Turner, Douglas Eck, Aug 2017. (In 34th International Conference on Machine Learning). Sydney AUSTRALIA.

Abstract▼ URL

This paper proposes a general method for improving the structure and quality of sequences generated by a recurrent neural network (RNN), while maintaining information originally learned from data, as well as sample diversity. An RNN is first pre-trained on data using maximum likelihood estimation (MLE), and the probability distribution over the next token in the sequence learned by this model is treated as a prior policy. Another RNN is then trained using reinforcement learning (RL) to generate higher-quality outputs that account for domain-specific incentives while retaining proximity to the prior policy of the MLE RNN. To formalize this objective, we derive novel off-policy RL methods for RNNs from KL-control. The effectiveness of the approach is demonstrated on two applications; 1) generating novel musical melodies, and 2) computational molecular generation. For both problems, we show that the proposed method improves the desired properties and structure of the generated sequences, while maintaining information learned from data.

**Comment:** [MIT Technology Review] [Video]

#### Gaussian process model based predictive control

Juš Kocijan, Roderick Murray-Smith, Carl Edward Rasmussen, Agathe Girard, 2004. (In American Control Conference). (Proceedings of the ACC 2004). Boston, MA.

Abstract▼ URL

Gaussian process models provide a probabilistic non-parametric modelling approach for black-box identi cation of non-linear dynamic systems. The Gaussian processes can highlight areas of the input space where prediction quality is poor, due to the lack of data or its complexity, by indicating the higher variance around the predicted mean. Gaussian process models contain noticeably less coef cients to be optimised. This paper illustrates possible application of Gaussian process models within model-based predictive control. The extra information provided within Gaussian process model is used in predictive control, where optimisation of control signal takes the variance information into account. The predictive control principle is demonstrated on control of pH process benchmark.

#### Predictive control with Gaussian process models

Juš Kocijan, Roderick Murray-Smith, Carl Edward Rasmussen, Bojan Likar, 2003. (In IEEE Region 8 Eurocon 2003: Computer as a Tool). Edited by B. Zajc, M. Tkal.

Abstract▼ URL

This paper describes model-based predictive control based on Gaussian processes. Gaussian process models provide a probabilistic non-parametric modelling approach for black-box identification of non-linear dynamic systems. It offers more insight in variance of obtained model response, as well as fewer parameters to determine than other models. The Gaussian processes can highlight areas of the input space where prediction quality is poor, due to the lack of data or its complexity, by indicating the higher variance around the predicted mean. This property is used in predictive control, where optimisation of control signal takes the variance information into account. The predictive control principle is demonstrated on a simulated example of nonlinear system.

#### Assessing Approximations for Gaussian Process Classification

Malte Kuß, Carl Edward Rasmussen, April 2006. (In Advances in Neural Information Processing Systems 18). Edited by Y. Weiss, B. Schölkopf, J. Platt. Cambridge, MA, USA. Whistler, BC, Canada. The MIT Press.

Abstract▼ URL

Gaussian processes are attractive models for probabilistic classification but unfortunately exact inference is analytically intractable. We compare Laplace’s method and Expectation Propagation (EP) focusing on marginal likelihood estimates and predictive performance. We explain theoretically and corroborate empirically that EP is superior to Laplace. We also compare to a sophisticated MCMC scheme and show that EP is surprisingly accurate.

#### Learning-based Nonlinear Model Predictive Control

Daniel Limon, Jan-Peter Calliess, Jan Maciejowski, July 2017. (In IFAC 2017 World Congress). Toulouse, France. **DOI**: 10.1016/j.ifacol.2017.08.1050.

Abstract▼

This paper presents stabilizing Model Predictive Controllers (MPC) in which prediction models are inferred from experimental data of the inputs and outputs of the plant. Using a nonparametric machine learning technique called LACKI, the estimated (possibly nonlinear) model function together with an estimation of Hoelder constant is provided. Based on these, a number of predictive controllers with stability guaranteed by design are proposed. Firstly, the case when the prediction model is estimated off- line is considered and robust stability and recursive feasibility is ensured by using tightened constraints in the optimisation problem. This controller has been extended to the more interesting and complex case: the online learning of the model, where the new data collected from feedback is added to enhance the prediction model. A on-line learning MPC based on a double sequence of predictions is proposed. Stability of the online learning MPC is proved. These controllers are illustrated by simulation.

#### Iterative Amortized Policy Optimization

Joseph Marino, Alexandre Piche, Alessandro Davide Ialongo, Yisong Yue, 2021. (In Advances in Neural Information Processing Systems 34). Edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, J. Wortman Vaughan. Curran Associates, Inc..

Abstract▼ URL

Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control, enabling the estimation and sampling of high-value actions. From the variational inference perspective on RL, policy networks, when used with entropy or KL regularization, are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly. However, direct amortized mappings can yield suboptimal policy estimates and restricted distributions, limiting performance and exploration. Given this perspective, we consider the more flexible class of iterative amortized optimizers. We demonstrate that the resulting technique, iterative amortized policy optimization, yields performance improvements over direct amortization on benchmark continuous control tasks.

#### Bayesian Learning for Data-Efficient Control

Rowan McAllister, 2016. University of Cambridge, Department of Engineering, Cambridge, UK.

Abstract▼ URL

Applications to learn control of unfamiliar dynamical systems with increasing autonomy are ubiquitous. From robotics, to finance, to industrial processing, autonomous learning helps obviate a heavy reliance on experts for system identification and controller design. Often real world systems are nonlinear, stochastic, and expensive to operate (e.g. slow, energy intensive, prone to wear and tear). Ideally therefore, nonlinear systems can be identified with minimal system interaction. This thesis considers data efficient autonomous learning of control of nonlinear, stochastic systems. Data efficient learning critically requires probabilistic modelling of dynamics. Traditional control approaches use deterministic models, which easily overfit data, especially small datasets. We use probabilistic Bayesian modelling to learn systems from scratch, similar to the PILCO algorithm, which achieved unprecedented data efficiency in learning control of several benchmarks. We extend PILCO in three principle ways. First, we learn control under significant observation noise by simulating a filtered control process using a tractably analytic framework of Gaussian distributions. In addition, we develop the `latent variable belief Markov decision process’ when filters must predict under real-time constraints. Second, we improve PILCO’s data efficiency by directing exploration with predictive loss uncertainty and Bayesian optimisation, including a novel approximation to the Gittins index. Third, we take a step towards data efficient learning of high-dimensional control using Bayesian neural networks (BNN). Experimentally we show although filtering mitigates adverse effects of observation noise, much greater performance is achieved when optimising controllers with evaluations faithful to reality: by simulating closed-loop filtered control if executing closed-loop filtered control. Thus, controllers are optimised w.r.t. how they are used, outperforming filters applied to systems optimised by unfiltered simulations. We show directed exploration improves data efficiency. Lastly, we show BNN dynamics models are almost as data efficient as Gaussian process models. Results show data efficient learning of high-dimensional control is possible as BNNs scale to high-dimensional state inputs.

#### Data-Efficient Reinforcement Learning in Continuous State-Action Gaussian-POMDPs

Rowan McAllister, Carl Edward Rasmussen, December 2017. (In Advances in Neural Information Processing Systems 31). Long Beach, California.

Abstract▼ URL

We present a data-efficient reinforcement learning method for continuous state-action systems under significant observation noise. Data-efficient solutions under small noise exist, such as PILCO which learns the cartpole swing-up task in 30s. PILCO evaluates policies by planning state-trajectories using a dynamics model. However, PILCO applies policies to the observed state, therefore planning in observation space. We extend PILCO with filtering to instead plan in belief space, consistent with partially observable Markov decisions process (POMDP) planning. This enables data-efficient learning under significant observation noise, outperforming more naive methods such as post-hoc application of a filter to policies optimised by the original (unfiltered) PILCO algorithm. We test our method on the cartpole swing-up task, which involves nonlinear dynamics and requires nonlinear control.

#### Nonlinear Modelling and Control using Gaussian Processes

Andrew McHutchon, 2014. University of Cambridge, Department of Engineering, Cambridge, UK.

Abstract▼ URL

In many scientific disciplines it is often required to make predictions about how a system will behave or to deduce the correct control values to elicit a particular desired response. Efficiently solving both of these tasks relies on the construction of a model capturing the system’s operation. In the most interesting situations, the model needs to capture strongly nonlinear effects and deal with the presence of uncertainty and noise. Building models for such systems purely based on a theoretical understanding of underlying physical principles can be infeasibly complex and require a large number of simplifying assumptions. An alternative is to use a data-driven approach, which builds a model directly from observations. A powerful and principled approach to doing this is to use a Gaussian Process (GP). In this thesis we start by discussing how GPs can be applied to data sets which have noise affecting their inputs. We present the “Noisy Input GP”, which uses a simple local-linearisation to refer the input noise into heteroscedastic output noise, and compare it to other methods both theoretically and empirically. We show that this technique leads to a effective model for nonlinear functions with input and output noise. We then consider the broad topic of GP state space models for application to dynamical systems. We discuss a very wide variety of approaches for using GPs in state space models, including introducing a new method based on moment-matching, which consistently gave the best performance. We analyse the methods in some detail including providing a systematic comparison between approximate-analytic and particle methods. To our knowledge such a comparison has not been provided before in this area. Finally, we investigate an automatic control learning framework, which uses Gaussian Processes to model a system for which we wish to design a controller. Controller design for complex systems is a difficult task and thus a framework which allows an automatic design directly from data promises to be extremely useful. We demonstrate that the previously published framework cannot cope with the presence of observation noise but that the introduction of a state space model dramatically improves its performance. This contribution, along with some other suggested improvements opens the door for this framework to be used in real-world applications.

#### Adaptive, Cautious, Predictive control with Gaussian Process Priors

Roderick Murray-Smith, Daniel Sbarbaro, Carl Edward Rasmussen, Agathe Girard, August 2003. (In IFAC SYSID 2003). Edited by P. Van den Hof, B. Wahlberg, S. Weiland. (Proceedings of the 13th IFAC Symposium on System Identification). Oxford, UK. Rotterdam, The Netherlands. Elsevier Science Ltd.

Abstract▼ URL

Nonparametric Gaussian Process models, a Bayesian statistics approach, are used to implement a nonlinear adaptive control law. Predictions, including propagation of the state uncertainty are made over a k-step horizon. The expected value of a quadratic cost function is minimised, over this prediction horizon, without ignoring the variance of the model predictions. The general method and its main features are illustrated on a simulation example.

#### A Unified Framework for Resource-Bounded Agents Interacting with an Unknown Environment

Pedro A. Ortega, 2011. Department of Engineering, University of Cambridge,

Abstract▼ URL

The aim of this thesis is to present a mathematical framework for conceptualizing and constructing adaptive autonomous systems under resource constraints. The first part of this thesis contains a concise presentation of the foundations of classical agency: namely the formalizations of decision making and learning. Decision making includes: (a) subjective expected utility (SEU) theory, the framework of decision making under uncertainty; (b) the maximum SEU principle to choose the optimal solution; and (c) its application to the design of autonomous systems, culminating in the Bellman optimality equations. Learning includes: (a) Bayesian probability theory, the theory for reasoning under uncertainty that extends logic; and (b) Bayes-Optimal agents, the application of Bayesian probability theory to the design of optimal adaptive agents. Then, two major problems of the maximum SEU principle are highlighted: (a) the prohibitive computational costs and (b) the need for the causal precedence of the choice of the policy. The second part of this thesis tackles the two aforementioned problems. First, an information-theoretic notion of resources in autonomous systems is established. Second, a framework for resource-bounded agency is introduced. This includes: (a) a maximum bounded SEU principle that is derived from a set of axioms of utility; (b) an axiomatic model of probabilistic causality, which is applied for the formalization of autonomous systems having uncertainty over their policy and environment; and (c) the Bayesian control rule, which is derived from the maximum bounded SEU principle and the model of causality, implementing a stochastic adaptive control law that deals with the case where autonomous agents are uncertain about their policy and environment.

#### A conversion between utility and information

Pedro A. Ortega, Daniel A. Braun, 2010. (In The third conference on artificial general intelligence). Paris. Atlantis Press.

Abstract▼ URL

Rewards typically express desirabilities or preferences over a set of alternatives. Here we propose that rewards can be defined for any probability distribution based on three desiderata, namely that rewards should be real- valued, additive and order-preserving, where the later implies that more probable events should also be more desirable. Our main result states that rewards are then uniquely determined by the negative information content. To analyze stochastic processes, we define the utility of a realization as its reward rate. Under this interpretation, we show that the expected utility of a stochastic process is its negative entropy rate. Furthermore, we apply our results to analyze agent-environment interactions. We show that the expected utility that will actually be achieved by the agent is given by the negative cross-entropy from the input-output (I/O) distribution of the coupled interaction system and the agent’s I/O distribution. Thus, our results allow for an information-theoretic interpretation of the notion of utility and the characterization of agent-environment interactions in terms of entropy dynamics.

#### A Bayesian rule for adaptive control based on causal interventions

Pedro A. Ortega, Daniel A. Braun, 2010. (In The third conference on artificial general intelligence). Paris. Atlantis Press.

Abstract▼ URL

Explaining adaptive behavior is a central problem in artificial intelligence research. Here we formalize adaptive agents as mixture distributions over sequences of inputs and outputs (I/O). Each distribution of the mixture constitutes a “possible world”, but the agent does not know which of the possible worlds it is actually facing. The problem is to adapt the I/O stream in a way that is compatible with the true world. A natural measure of adaptation can be obtained by the Kullback Leibler (KL) divergence between the I/O distribution of the true world and the I/O distribution expected by the agent that is uncertain about possible worlds. In the case of pure input streams, the Bayesian mixture provides a well-known solution for this problem. We show, however, that in the case of I/O streams this solution breaks down, because outputs are issued by the agent itself and require a different probabilistic syntax as provided by intervention calculus. Based on this calculus, we obtain a Bayesian control rule that allows modeling adaptive behavior with mixture distributions over I/O streams. This rule might allow for a novel approach to adaptive control based on a minimum KL-principle.

#### A minimum relative entropy principle for learning and acting

Pedro A. Ortega, Daniel A. Braun, 2010. (Journal of Artificial Intelligence Research). **DOI**: 10.1613/jair.3062.

Abstract▼ URL

This paper proposes a method to construct an adaptive agent that is univemacmacrsal with respect to a given class of experts, where each expert is designed specifically for a particular environment. This adaptive control problem is formalized as the problem of minimizing the relative entropy of the adaptive agent from the expert that is most suitable for the unknown environment. If the agent is a passive observer, then the optimal solution is the well-known Bayesian predictor. However, if the agent is active, then its past actions need to be treated as causal interventions on the I/O stream rather than normal probability conditions. Here it is shown that the solution to this new variational problem is given by a stochastic controller called the Bayesian control rule, which implements adaptive behavior as a mixture of experts. Furthermore, it is shown that under mild assumptions, the Bayesian control rule converges to the control law of the most suitable expert.

#### An axiomatic formalization of bounded rationality based on a utility-information equivalence

Pedro A. Ortega, Daniel A. Braun, 2010. Dept. of Engineering, University of Cambridge,

Abstract▼ URL

Classic decision-theory is based on the maximum expected utility (MEU) principle, but crucially ignores the resource costs incurred when determining optimal decisions. Here we propose an axiomatic framework for bounded decision-making that considers resource costs. Agents are formalized as probability measures over input-output streams. We postulate that any such probability measure can be assigned a corresponding conjugate utility function based on three axioms: utilities should be real-valued, additive and monotonic mappings of probabilities. We show that these axioms enforce a unique conversion law between utility and probability (and thereby, information). Moreover, we show that this relation can be characterized as a variational principle: given a utility function, its conjugate probability measure maximizes a free utility functional. Transformations of probability measures can then be formalized as a change in free utility due to the addition of new constraints expressed by a target utility function. Accordingly, one obtains a criterion to choose a probability measure that trades off the maximization of a target utility function and the cost of the deviation from a reference distribution. We show that optimal control, adaptive estimation and adaptive control problems can be solved this way in a resource-efficient way. When resource costs are ignored, the MEU principle is recovered. Our formalization might thus provide a principled approach to bounded rationality that establishes a close link to information theory.

#### Information, Utility and Bounded Rationality

Pedro A. Ortega, Daniel A. Braun, 2011. (In The fourth conference on artificial general intelligence). Springer-Verlag. Lecture Notes on Artificial Intelligence.

Abstract▼ URL

Perfectly rational decision-makers maximize expected utility, but crucially ignore the resource costs incurred when determining optimal actions. Here we employ an axiomatic framework for bounded rational decision-making based on a thermodynamic interpretation of resource costs as information costs. This leads to a variational free utility principle akin to thermodynamical free energy that trades off utility and information costs. We show that bounded optimal control solutions can be derived from this variational principle, which leads in general to stochastic policies. Furthermore, we show that risk-sensitive and robust (minimax) control schemes fall out naturally from this framework if the environment is considered as a bounded rational and perfectly rational opponent, respectively. When resource costs are ignored, the maximum expected utility principle is recovered.

#### Reinforcement Learning and the Bayesian Control Rule

Pedro A. Ortega, Daniel A. Braun, Simon Godsill, 2011. (In The fourth conference on artificial general intelligence). Springer-Verlag. Lecture Notes on Artificial Intelligence.

Abstract▼ URL

We present an actor-critic scheme for reinforcement learning in complex domains. The main contribution is to show that planning and I/O dynamics can be separated such that an intractable planning problem reduces to a simple multi-armed bandit problem, where each lever stands for a potentially arbitrarily complex policy. Furthermore, we use the Bayesian control rule to construct an adaptive bandit player that is universal with respect to a given class of optimal bandit players, thus indirectly constructing an adaptive agent that is universal with respect to a given class of policies.

#### PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos

Paavo Parmas, Carl Edward Rasmussen, Jan Peters, Kenji Doya, 2018. (In 35th International Conference on Machine Learning).

Abstract▼ URL

Previously, the exploding gradient problem has been explained to be central in deep learning and model-based reinforcement learning, because it causes numerical issues and instability in optimization. Our experiments in model-based reinforcement learning imply that the problem is not just a numerical issue, but it may be caused by a fundamental chaos-like nature of long chains of nonlinear computations. Not only do the magnitudes of the gradients become large, the direction of the gradients becomes essentially random. We show that reparameterization gradients suffer from the problem, while likelihood ratio gradients are robust. Using our insights, we develop a model-based policy search framework, Probabilistic Inference for Particle-Based Policy Search (PIPPS), which is easily extensible, and allows for almost arbitrary models and policies, while simultaneously matching the performance of previous data-efficient learning algorithms. Finally, we invent the total propagation algorithm, which efficiently computes a union over all pathwise derivative depths during a single backwards pass, automatically giving greater weight to estimators with lower variance, sometimes improving over reparameterization gradients by 106 times.

#### Sample and Feedback Efficient Hierarchical Reinforcement Learning from Human Preferences

Robert Pinsler, Riad Akrour, Takayuki Osa, Jan Peters, Gerhard Neumann, May 2018. (In IEEE International Conference on Robotics and Automation). Brisbane, Australia.

Abstract▼ URL

While reinforcement learning has led to promising results in robotics, defining an informative reward function is challenging. Prior work considered including the human in the loop to jointly learn the reward function and the optimal policy. Generating samples from a physical robot and requesting human feedback are both taxing efforts for which efficiency is critical. We propose to learn reward functions from both the robot and the human perspectives to improve on both efficiency metrics. Learning a reward function from the human perspective increases feedback efficiency by assuming that humans rank trajectories according to a low-dimensional outcome space. Learning a reward function from the robot perspective circumvents the need for a dynamics model while retaining the sample efficiency of model-based approaches. We provide an algorithm that incorporates bi-perspective reward learning into a general hierarchical reinforcement learning framework and demonstrate the merits of our approach on a toy task and a simulated robot grasping task.

#### Factored Contextual Policy Search with Bayesian Optimization

Robert Pinsler, Peter Karkus, Andras Kupcsik, David Hsu, Wee Sun Lee, May 2019. (In IEEE International Conference on Robotics and Automation). Montreal, Canada.

Abstract▼ URL

Scarce data is a major challenge to scaling robot learning to truly complex tasks, as we need to generalize locally learned policies over different task contexts. Contextual policy search offers data-efficient learning and generalization by explicitly conditioning the policy on a parametric context space. In this paper, we further structure the contextual policy representation. We propose to factor contexts into two components: target contexts that describe the task objectives, e.g. target position for throwing a ball; and environment contexts that characterize the environment, e.g. initial position or mass of the ball. Our key observation is that experience can be directly generalized over target contexts. We show that this can be easily exploited in contextual policy search algorithms. In particular, we apply factorization to a Bayesian optimization approach to contextual policy search both in sampling-based and active learning settings. Our simulation results show faster learning and better generalization in various robotic domains. See our supplementary video: https://youtu.be/MNTbBAOufDY.

#### Temporal Difference Models: Model-Free Deep RL for Model-Based Control

Vitchyr Pong, Shixiang Gu, Murtaza Dalal, Sergey Levine, Apr 2018. (In 6th International Conference on Learning Representations). Vancouver CANADA.

Abstract▼ URL

Model-free reinforcement learning (RL) has been proven to be a powerful, general tool for learning complex behaviors. However, its sample efficiency is often impractically large for solving challenging real-world problems, even for off-policy algorithms such as Q-learning. A limiting factor in classic model-free RL is that the learning signal consists only of scalar rewards, ignoring much of the rich information contained in state transition tuples. Model-based RL uses this information, by training a predictive model, but often does not achieve the same asymptotic performance as model-free RL due to model bias. We introduce temporal difference models (TDMs), a family of goal-conditioned value functions that can be trained with model-free learning and used for model-based control. TDMs combine the benefits of model-free and model-based RL: they leverage the rich information in state transitions to learn very efficiently, while still attaining asymptotic performance that exceeds that of direct model-based RL methods. Our experimental results show that, on a range of continuous control tasks, TDMs provide a substantial improvement in efficiency compared to state-of-the-art model-based and model-free methods.

#### Probabilistic Inference for Fast Learning in Control

Carl Edward Rasmussen, Marc Peter Deisenroth, November 2008. (In Recent Advances in Reinforcement Learning). Edited by S. Girgin, M. Loth, R. Munos, P. Preux, D. Ryabko. Villeneuve d'Ascq, France. Springer-Verlag. Lecture Notes in Computer Science (LNCS).

Abstract▼ URL

We provide a novel framework for very fast model-based reinforcement learning in continuous state and action spaces. The framework requires probabilistic models that explicitly characterize their levels of confidence. Within this framework, we use flexible, non-parametric models to describe the world based on previously collected experience. We demonstrate learning on the cart-pole problem in a setting where we provide very limited prior knowledge about the task. Learning progresses rapidly, and a good policy is found after only a hand-full of iterations.

**Comment:** videos and more. slides.

#### Gaussian processes in reinforcement learning

Carl Edward Rasmussen, Malte Kuß, December 2004. (In Advances in Neural Information Processing Systems 16). Edited by S. Thrun, L.K. Saul, B. Schölkopf. Cambridge, MA, USA. The MIT Press.

Abstract▼ URL

We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.

#### An Analysis of Categorical Distributional Reinforcement Learning

Mark Rowland, Marc G. Bellemare, Will Dabney, Rémi Munos, Yee Whye Teh, April 2018. (In 21st International Conference on Artificial Intelligence and Statistics). Playa Blanca, Lanzarote, Canary Islands.

Abstract▼ URL

Distributional approaches to value-based reinforcement learning model the entire distribution of returns, rather than just their expected values, and have recently been shown to yield state-of-the-art empirical performance. This was demonstrated by the recently proposed C51 algorithm, based on categorical distributional reinforcement learning (CDRL) [Bellemare et al., 2017]. However, the theoretical properties of CDRL algorithms are not yet well understood. In this paper, we introduce a framework to analyse CDRL algorithms, establish the importance of the projected distributional Bellman operator in distributional RL, draw fundamental connections between CDRL and the Cramér distance, and give a proof of convergence for sample-based categorical distributional reinforcement learning algorithms.

#### Simultaneous Localization and Mapping with Sparse Extended Information Filters

Sebastian Thrun, Yufeng Liu, Daphne Koller, Andrew Y. Ng, Zoubin Ghahramani, Hugh F. Durrant-Whyte, 2004. (I. J. Robotic Res.).

Abstract▼ URL

This paper describes a scalable algorithm for the simultaneous mapping and localization (SLAM) problem. SLAM is the problem of acquiring a map of a static environment with a mobile robot. The vast majority of SLAM algorithms are based on the extended Kalman filter (EKF). This paper advocates an algorithm that relies on the dual of the EKF, the extended information filter (EIF). We show that when represented in the information form, map posteriors are dominated by a small number of links that tie together nearby features in the map. This insight is developed into a sparse variant of the EIF, called the sparse extended information filters (SEIF). SEIFs represent maps by graphical networks of features that are locally interconnected, where links represent relative information between pairs of nearby features, as well as information about the robot’s pose relative to the map. We show that all essential update equations in SEIFs can be executed in constant time, irrespective of the size of the map. We also provide empirical results obtained for a benchmark data set collected in an outdoor environment, and using a multi-robot mapping simulation.