## Publications

#### Perfusion Quantification using Gaussian Process Deconvolution

Irene K. Andersen, Anna Szymkowiak, Carl Edward Rasmussen, L. G. Hanson, J. R. Marstrand, H. B. W. Larsson, Lars Kai Hansen, 2002. (Magnetic Resonance in Medicine). **DOI**: 10.1002/mrm.10213.

The quantification of perfusion using dynamic susceptibility contrast MR imaging requires deconvolution to obtain the residual impulse-response function (IRF). Here, a method using a Gaussian process for deconvolution, GPD, is proposed. The fact that the IRF is smooth is incorporated as a constraint in the method. The GPD method, which automatically estimates the noise level in each voxel, has the advantage that model parameters are optimized automatically. The GPD is compared to singular value decomposition (SVD) using a common threshold for the singular values and to SVD using a threshold optimized according to the noise level in each voxel. The comparison is carried out using artificial data as well as using data from healthy volunteers. It is shown that GPD is comparable to SVD with a variable optimized threshold when determining the maximum of the IRF, which is directly related to the perfusion. GPD provides a better estimate of the entire IRF. As the signal-to-noise ratio increases or the time resolution of the measurements increases, GPD is shown to be superior to SVD. This is also found for large distribution volumes.

#### Understanding Probabilistic Sparse Gaussian Process Approximations

Matthias Stephan Bauer, Mark van der Wilk, Carl Edward Rasmussen, 2016. (In Advances in Neural Information Processing Systems 29).

Good sparse approximations are essential for practical inference in Gaussian Processes as the computational cost of exact methods is prohibitive for large datasets. The Fully Independent Training Conditional (FITC) and the Variational Free Energy (VFE) approximations are two recent popular methods. Despite superficial similarities, these approximations have surprisingly different theoretical properties and behave differently in practice. We thoroughly investigate the two methods for regression both analytically and through illustrative examples, and draw conclusions to guide practical application.

**Comment:** arXiv

#### Policy search for learning robot control using sparse data

B. Bischoff, D. Nguyen-Tuong, D. van Hoof, A. McHutchon, Carl Edward Rasmussen, A. Knoll, M. P. Deisenroth, 2014. (In IEEE International Conference on Robotics and Automation). Hong Kong, China. IEEE. **DOI**: 10.1109/ICRA.2014.6907422.

In many complex robot applications, such as grasping and manipulation, it is difficult to program desired task solutions beforehand, as robots are within an uncertain and dynamic environment. In such cases, learning tasks from experience can be a useful alternative. To obtain a sound learning and generalization performance, machine learning, especially, reinforcement learning, usually requires sufficient data. However, in cases where only little data is available for learning, due to system constraints and practical issues, reinforcement learning can act suboptimally. In this paper, we investigate how model-based reinforcement learning, in particular the probabilistic inference for learning control method (PILCO), can be tailored to cope with the case of sparse data to speed up learning. The basic idea is to include further prior knowledge into the learning process. As PILCO is built on the probabilistic Gaussian processes framework, additional system knowledge can be incorporated by defining appropriate prior distributions, e.g. a linear mean Gaussian prior. The resulting PILCO formulation remains in closed form and analytically tractable. The proposed approach is evaluated in simulation as well as on a physical robot, the Festo Robotino XT. For the robot evaluation, we employ the approach for learning an object pick-up task. The results show that by including prior knowledge, policy learning can be sped up in presence of sparse data.

#### Rates of Convergence for Sparse Variational Gaussian Process Regression

David R Burt, Carl Edward Rasmussen, Mark van der Wilk, 2019. (arXiv).

Excellent variational approximations to Gaussian process posteriors have been developed which avoid the $O(N^3)$ scaling with dataset size $N$. They reduce the computational cost to $O(NM^2)$, with $M \ll N$ being the number of inducing variables, which summarise the process. While the computational cost seems to be linear in $N$, the true complexity of the algorithm depends on how $M$ must increase to ensure a certain quality of approximation. We address this by characterising the behavior of an upper bound on the KL divergence to the posterior. We show that with high probability the KL divergence can be made arbitrarily small by growing $M$ more slowly than $N$. A particular case of interest is that for regression with normally distributed inputs in $D$-dimensions with the popular Squared Exponential kernel, $M = O(\log^D N)$ is sufficient. Our results show that as datasets grow, Gaussian process posteriors can truly be approximated cheaply, and provide a concrete rule for how to increase $M$ in continual learning scenarios.
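
The scaling argument in this abstract can be illustrated with a rough operation count. The sketch below is not from the paper: the function names and the constant `c` in the rule for M are assumptions made for illustration, and the counts ignore all constant factors.

```python
import math


def inducing_points(n, d, c=1.0):
    """Hypothetical growth rule M = ceil(c * (log n)^d).

    The constant c is an assumption; the abstract only states the
    order of growth, not a specific constant.
    """
    return max(1, math.ceil(c * math.log(n) ** d))


def exact_cost(n):
    """Order-of-magnitude operation count for exact GP regression, N^3."""
    return n ** 3


def sparse_cost(n, m):
    """Order-of-magnitude operation count for a sparse approximation, N * M^2."""
    return n * m ** 2


if __name__ == "__main__":
    # Print N, the assumed M, and the ratio of exact to sparse cost.
    for n in (10**3, 10**4, 10**5, 10**6):
        m = inducing_points(n, d=2)
        print(n, m, exact_cost(n) / sparse_cost(n, m))
```

Under these assumptions M stays in the hundreds while N grows to a million, which is the sense in which the sparse cost remains close to linear in N.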

#### Convergence of Sparse Variational Inference in Gaussian Processes Regression

David R. Burt, Carl Edward Rasmussen, Mark van der Wilk, 2020. (Journal of Machine Learning Research).

Gaussian processes are distributions over functions that are versatile and mathematically convenient priors in Bayesian modelling. However, their use is often impeded for data with large numbers of observations, $N$, due to the cubic (in $N$) cost of matrix operations used in exact inference. Many solutions have been proposed that rely on $M \ll N$ inducing variables to form an approximation at a cost of $O(NM^2)$. While the computational cost appears linear in $N$, the true complexity depends on how $M$ must scale with $N$ to ensure a certain quality of the approximation. In this work, we investigate upper and lower bounds on how $M$ needs to grow with $N$ to ensure high quality approximations. We show that we can make the KL-divergence between the approximate model and the exact posterior arbitrarily small for a Gaussian-noise regression model with $M \ll N$. Specifically, for the popular squared exponential kernel and $D$-dimensional Gaussian distributed covariates, $M = O((\log N)^D)$ suffice and a method with an overall computational cost of $O(N (\log N)^{2D} (\log \log N)^2)$ can be used to perform inference.

#### Manifold Gaussian Processes for Regression

Roberto Calandra, Jan Peters, Carl Edward Rasmussen, Marc Peter Deisenroth, 2016. (In International Joint Conference on Neural Networks).

Off-the-shelf Gaussian Process (GP) covariance functions encode smoothness assumptions on the structure of the function to be modeled. To model complex and nondifferentiable functions, these smoothness assumptions are often too restrictive. One way to alleviate this limitation is to find a different representation of the data by introducing a feature space. This feature space is often learned in an unsupervised way, which might lead to data representations that are not useful for the overall regression task. In this paper, we propose Manifold Gaussian Processes, a novel supervised method that jointly learns a transformation of the data into a feature space and a GP regression from the feature space to observed space. The Manifold GP is a full GP and allows to learn data representations, which are useful for the overall regression task. As a proof-of-concept, we evaluate our approach on complex non-smooth functions where standard GPs perform poorly, such as step functions and robotics tasks with contacts.

#### Nonlinear Set Membership Regression with Adaptive Hyper-Parameter Estimation for Online Learning and Control

Jan-Peter Calliess, Stephen Roberts, Carl Edward Rasmussen, Jan Maciejowski, 2018. (In Proceedings of the European Control Conference).

Methods known as Lipschitz Interpolation or Nonlinear Set Membership regression have become established tools for nonparametric system-identification and data-based control. They utilise presupposed Lipschitz properties to compute inferences over unobserved function values. Unfortunately, they rely on the a priori knowledge of a Lipschitz constant of the underlying target function which serves as a hyperparameter. We propose a closed-form estimator of the Lipschitz constant that is robust to bounded observational noise in the data. The merger of Lipschitz Interpolation with the new hyperparameter estimator gives a new nonparametric machine learning method for which we derive online learning convergence guarantees. Furthermore, we apply our learning method to model-reference adaptive control and provide a convergence guarantee on the closed-loop dynamics. In a simulated flight manoeuvre control scenario, we compare the performance of our approach to recently proposed alternative learning-based controllers.
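
For readers unfamiliar with Lipschitz Interpolation, the basic prediction rule for scalar inputs can be sketched as follows. This is a minimal illustration assuming noise-free data; `naive_lipschitz_estimate` is a hypothetical stand-in that simply takes the largest observed slope, and it is not the noise-robust estimator proposed in the paper.

```python
def lipschitz_predict(x, xs, ys, lipschitz):
    """Predict f(x) as the midpoint of the tightest Lipschitz bounds.

    Every observation (xi, yi) confines f(x) to the interval
    [yi - L * |x - xi|, yi + L * |x - xi|]; intersecting these
    intervals gives the lower and upper bounds returned here.
    """
    upper = min(y + lipschitz * abs(x - xi) for xi, y in zip(xs, ys))
    lower = max(y - lipschitz * abs(x - xi) for xi, y in zip(xs, ys))
    return 0.5 * (upper + lower), lower, upper


def naive_lipschitz_estimate(xs, ys):
    """Largest observed slope between distinct inputs (assumes no noise)."""
    slopes = [
        abs(ys[i] - ys[j]) / abs(xs[i] - xs[j])
        for i in range(len(xs))
        for j in range(i)
        if xs[i] != xs[j]
    ]
    return max(slopes) if slopes else 0.0


if __name__ == "__main__":
    xs, ys = [0.0, 1.0, 2.0], [0.0, 1.0, 0.0]
    L = naive_lipschitz_estimate(xs, ys)
    print(lipschitz_predict(0.5, xs, ys, L))
    print(lipschitz_predict(3.0, xs, ys, L))
```

The width of the returned interval grows with distance from the data, which is what gives the method its built-in error bounds.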

#### Lazily Adapted Constant Kinky Inference for non-parametric regression and model-reference adaptive control

Jan-Peter Calliess, Stephen J. Roberts, Carl Edward Rasmussen, Jan Maciejowski, 2020. (Automatica). **DOI**: 10.1016/j.automatica.2020.109216.

Techniques known as Nonlinear Set Membership prediction or Lipschitz Interpolation are approaches to supervised machine learning that utilise presupposed Lipschitz properties to perform inference over unobserved function values. Provided a bound on the true best Lipschitz constant of the target function is known a priori, they offer convergence guarantees, as well as bounds around the predictions. Considering a more general setting that builds on Lipschitz continuity, we propose an online method for estimating the Lipschitz constant online from function value observations that are possibly corrupted by bounded noise. Utilising this as a data-dependent hyper-parameter gives rise to a nonparametric machine learning method, for which we establish strong universal approximation guarantees. That is, we show that our prediction rule can learn any continuous function on compact support in the limit of increasingly dense data, up to a worst-case error that can be bounded by the level of observational error. We also consider applications of our nonparametric regression method to learning-based control. For a class of discrete-time settings, we establish convergence guarantees on the closed-loop tracking error of our online learning-based controllers. To provide evidence that our method can be beneficial not only in theory but also in practice, we apply it in the context of nonparametric model-reference adaptive control (MRAC). Across a range of simulated aircraft roll-dynamics and performance metrics our approach outperforms recently proposed alternatives that were based on Gaussian processes and RBF-neural networks.

#### Gaussian Processes for time-marked time-series data

John P. Cunningham, Zoubin Ghahramani, Carl Edward Rasmussen, 2012. (In 15th International Conference on Artificial Intelligence and Statistics).

In many settings, data is collected as multiple time series, where each recorded time series is an observation of some underlying dynamical process of interest. These observations are often time-marked with known event times, and one desires to do a range of standard analyses. When there is only one time marker, one simply aligns the observations temporally on that marker. When multiple time-markers are present and are at different times on different time series observations, these analyses are more difficult. We describe a Gaussian Process model for analyzing multiple time series with multiple time markings, and we test it on a variety of data.

#### Gaussian Processes for Data-Efficient Learning in Robotics and Control

Marc Peter Deisenroth, Dieter Fox, Carl Edward Rasmussen, 2015. (IEEE Transactions on Pattern Analysis and Machine Intelligence). **DOI**: 10.1109/TPAMI.2013.218.

Autonomous learning has been a promising direction in control and robotics for more than a decade since data-driven learning allows to reduce the amount of engineering knowledge, which is otherwise required. However, autonomous reinforcement learning (RL) approaches typically require many interactions with the system to learn controllers, which is a practical limitation in real systems, such as robots, where many interactions can be impractical and time consuming. To address this problem, current learning approaches typically require task-specific knowledge in form of expert demonstrations, realistic simulators, pre-shaped policies, or specific knowledge about the underlying dynamics. In this article, we follow a different approach and speed up learning by extracting more information from data. In particular, we learn a probabilistic, non-parametric Gaussian process transition model of the system. By explicitly incorporating model uncertainty into long-term planning and controller learning our approach reduces the effects of model errors, a key problem in model-based learning. Compared to state-of-the-art RL our model-based policy search method achieves an unprecedented speed of learning. We demonstrate its applicability to autonomous learning in real robot and control tasks.

#### Approximate Dynamic Programming with Gaussian Processes

Marc Peter Deisenroth, Jan Peters, Carl Edward Rasmussen, June 2008. (In 2008 American Control Conference (ACC 2008)). Seattle, WA, USA.

In general, it is difficult to determine an optimal closed-loop policy in nonlinear control problems with continuous-valued state and control domains. Hence, approximations are often inevitable. The standard method of discretizing states and controls suffers from the curse of dimensionality and strongly depends on the chosen temporal sampling rate. The paper introduces Gaussian Process Dynamic Programming (GPDP). In GPDP, value functions in the Bellman recursion of the dynamic programming algorithm are modeled using Gaussian processes. GPDP returns an optimal state-feedback for a finite set of states. Based on these outcomes, we learn a possibly discontinuous closed-loop policy on the entire state space by switching between two independently trained Gaussian processes.

**Comment:** code.

#### Bayesian Inference for Efficient Learning in Control

Marc Peter Deisenroth, Carl Edward Rasmussen, June 2009. (In Multidisciplinary Symposium on Reinforcement Learning). Montréal, QC, Canada.

In contrast to humans or animals, artificial learners often require more trials when learning motor control tasks solely based on experience. Efficient autonomous learners will reduce the amount of engineering required to solve control problems. By using probabilistic forward models, we can employ two key ingredients of biological learning systems to speed up artificial learning. We present a consistent and coherent Bayesian framework that allows for efficient autonomous experience-based learning. We demonstrate the success of our learning algorithm by applying it to challenging nonlinear control problems in simulation and in hardware.

#### Efficient Reinforcement Learning for Motor Control

Marc Peter Deisenroth, Carl Edward Rasmussen, September 2009. (In 10th International PhD Workshop on Systems and Control). Hluboká nad Vltavou, Czech Republic.

Artificial learners often require many more trials than humans or animals when learning motor control tasks in the absence of expert knowledge. We implement two key ingredients of biological learning systems, generalization and incorporation of uncertainty into the decision-making process, to speed up artificial learning. We present a coherent and fully Bayesian framework that allows for efficient artificial learning in the absence of expert knowledge. The success of our learning framework is demonstrated on challenging nonlinear control problems in simulation and in hardware.

#### PILCO: A Model-Based and Data-Efficient Approach to Policy Search

Marc Peter Deisenroth, Carl Edward Rasmussen, 2011. (In 28th International Conference on Machine Learning).

In this paper, we introduce PILCO, a practical, data-efficient model-based policy search method. PILCO reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way. By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, PILCO can cope with very little data and facilitates learning from scratch in only a few trials. Policy evaluation is performed in closed form using state-of-the-art approximate inference. Furthermore, policy gradients are computed analytically for policy improvement. We report unprecedented learning efficiency on challenging and high-dimensional control tasks.

**Comment:** web site

#### Learning to Control a Low-Cost Manipulator using Data-Efficient Reinforcement Learning

Marc Peter Deisenroth, Carl Edward Rasmussen, Dieter Fox, June 2011. (In 9th International Conference on Robotics: Science & Systems). Los Angeles, CA, USA.

Over the last years, there has been substantial progress in robust manipulation in unstructured environments. The long-term goal of our work is to get away from precise, but very expensive robotic systems and to develop affordable, potentially imprecise, self-adaptive manipulator systems that can interactively perform tasks such as playing with children. In this paper, we demonstrate how a low-cost off-the-shelf robotic system can learn closed-loop policies for a stacking task in only a handful of trials - from scratch. Our manipulator is inaccurate and provides no pose feedback. For learning a controller in the work space of a Kinect-style depth camera, we use a model-based reinforcement learning technique. Our learning method is data efficient, reduces model bias, and deals with several noise sources in a principled way during long-term planning. We present a way of incorporating state-space constraints into the learning process and analyze the learning gain by exploiting the sequential structure of the stacking task.

**Comment:** project site

#### Model-Based Reinforcement Learning with Continuous States and Actions

Marc Peter Deisenroth, Carl Edward Rasmussen, Jan Peters, April 2008. (In Proceedings of the 16th European Symposium on Artificial Neural Networks (ESANN 2008)). Bruges, Belgium.

Finding an optimal policy in a reinforcement learning (RL) framework with continuous state and action spaces is challenging. Approximate solutions are often inevitable. GPDP is an approximate dynamic programming algorithm based on Gaussian process (GP) models for the value functions. In this paper, we extend GPDP to the case of unknown transition dynamics. After building a GP model for the transition dynamics, we apply GPDP to this model and determine a continuous-valued policy in the entire state space. We apply the resulting controller to the underpowered pendulum swing up. Moreover, we compare our results on this RL task to a nearly optimal discrete DP solution in a fully known environment.

#### Gaussian process dynamic programming

Marc Peter Deisenroth, Carl Edward Rasmussen, Jan Peters, March 2009. (Neurocomputing). Elsevier B. V. **DOI**: 10.1016/j.neucom.2008.12.019.

Reinforcement learning (RL) and optimal control of systems with continuous states and actions require approximation techniques in most interesting cases. In this article, we introduce Gaussian process dynamic programming (GPDP), an approximate value function-based RL algorithm. We consider both a classic optimal control problem, where problem-specific prior knowledge is available, and a classic RL problem, where only very general priors can be used. For the classic optimal control problem, GPDP models the unknown value functions with Gaussian processes and generalizes dynamic programming to continuous-valued states and actions. For the RL problem, GPDP starts from a given initial state and explores the state space using Bayesian active learning. To design a fast learner, available data have to be used efficiently. Hence, we propose to learn probabilistic models of the a priori unknown transition dynamics and the value functions on the fly. In both cases, we successfully apply the resulting continuous-valued controllers to the under-actuated pendulum swing up and analyze the performances of the suggested algorithms. It turns out that GPDP uses data very efficiently and can be applied to problems, where classic dynamic programming would be cumbersome.

**Comment:** code.

#### Robust Filtering and Smoothing with Gaussian Processes

Marc Peter Deisenroth, Ryan D. Turner, Marco F. Huber, Uwe D. Hanebeck, Carl Edward Rasmussen, 2012. (IEEE Transactions on Automatic Control). **DOI**: 10.1109/TAC.2011.2179426.

We propose a principled algorithm for robust Bayesian filtering and smoothing in nonlinear stochastic dynamic systems when both the transition function and the measurement function are described by nonparametric Gaussian process (GP) models. GPs are gaining increasing importance in signal processing, machine learning, robotics, and control for representing unknown system functions by posterior probability distributions. This modern way of “system identification” is more robust than finding point estimates of a parametric function representation. Our principled filtering/smoothing approach for GP dynamic systems is based on analytic moment matching in the context of the forward-backward algorithm. Our numerical evaluations demonstrate the robustness of the proposed approach in situations where other state-of-the-art Gaussian filters and smoothers can fail.

#### Clustering Protein Sequence and Structure Space with Infinite Gaussian Mixture Models

Ananya Dubey, Seungwoo Hwang, Claudia Rangel, Carl Edward Rasmussen, Zoubin Ghahramani, David L. Wild, 2004. (In Pacific Symposium on Biocomputing). Edited by Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany A. Jung, Teri E. Klein. World Scientific. **ISBN**: 981-238-598-3.

We describe a novel approach to the problem of automatically clustering protein sequences and discovering protein families, subfamilies etc., based on the theory of infinite Gaussian mixture models. This method allows the data itself to dictate how many mixture components are required to model it, and provides a measure of the probability that two proteins belong to the same cluster. We illustrate our methods with application to three data sets: globin sequences, globin sequences with known three-dimensional structures and G-protein coupled receptor sequences. The consistency of the clusters indicates that our method is producing biologically meaningful results, which provide a very good indication of the underlying families and subfamilies. With the inclusion of secondary structure and residue solvent accessibility information, we obtain a classification of sequences of known structure which both reflects and extends their SCOP classifications. A supplementary web site containing larger versions of the figures is available at http://public.kgi.edu/~wid/PSB04/index.html

#### Clustering Protein Sequence and Structure Space with Infinite Gaussian Mixture Models

A. Dubey, S. Hwang, C. Rangel, Carl Edward Rasmussen, Zoubin Ghahramani, David L. Wild, 2004. (In Pacific Symposium on Biocomputing 2004). (Pacific Symposium on Biocomputing 2004; Vol. 9). Singapore. The Big Island of Hawaii. World Scientific Publishing.

We describe a novel approach to the problem of automatically clustering protein sequences and discovering protein families, subfamilies etc., based on the theory of infinite Gaussian mixture models. This method allows the data itself to dictate how many mixture components are required to model it, and provides a measure of the probability that two proteins belong to the same cluster. We illustrate our methods with application to three data sets: globin sequences, globin sequences with known three-dimensional structures and G-protein coupled receptor sequences. The consistency of the clusters indicates that our method is producing biologically meaningful results, which provide a very good indication of the underlying families and subfamilies. With the inclusion of secondary structure and residue solvent accessibility information, we obtain a classification of sequences of known structure which reflects and extends their SCOP classifications.

#### Additive Gaussian Processes

David Duvenaud, Hannes Nickisch, Carl Edward Rasmussen, 2011. (In Advances in Neural Information Processing Systems 24). Granada, Spain.

We introduce a Gaussian process model of functions which are additive. An additive function is one which decomposes into a sum of low-dimensional functions, each depending on only a subset of the input variables. Additive GPs generalize both Generalized Additive Models, and the standard GP models which use squared-exponential kernels. Hyperparameter learning in this model can be seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but tractable parameterization of the kernel function, which allows efficient evaluation of all input interaction terms, whose number is exponential in the input dimension. The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks.
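
The two extremes that additive GPs generalize can be sketched with plain kernel functions. The code below is an illustration only: the function names and default hyperparameters are assumptions, and it does not implement the paper's parameterization over all interaction orders. It shows a first-order additive kernel (a sum of one-dimensional kernels, as in a Generalized Additive Model) next to the full-order product, which is the standard squared-exponential kernel.

```python
import math


def se_kernel_1d(a, b, lengthscale=1.0):
    """One-dimensional squared-exponential kernel."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def first_order_additive_kernel(x, y, lengthscales=None, variance=1.0):
    """Sum of one-dimensional kernels, one per input dimension."""
    if lengthscales is None:
        lengthscales = [1.0] * len(x)
    return variance * sum(
        se_kernel_1d(a, b, l) for a, b, l in zip(x, y, lengthscales)
    )


def full_order_product_kernel(x, y, lengthscales=None, variance=1.0):
    """Product of one-dimensional kernels: the usual multivariate SE kernel."""
    if lengthscales is None:
        lengthscales = [1.0] * len(x)
    value = variance
    for a, b, l in zip(x, y, lengthscales):
        value *= se_kernel_1d(a, b, l)
    return value


if __name__ == "__main__":
    x, y = [0.0, 0.0], [1.0, 1.0]
    print(first_order_additive_kernel(x, y))
    print(full_order_product_kernel(x, y))
```

A function drawn from the first kernel is a sum of functions of single inputs, while the second allows interactions among all inputs at once.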

#### Prediction on Spike Data Using Kernel Algorithms

Jan Eichhorn, Andreas S. Tolias, Alexander Zien, Malte Kuß, Carl Edward Rasmussen, Jason Weston, Nikos K. Logothetis, Bernhard Schölkopf, 2004. (In Advances in Neural Information Processing Systems 16). Edited by Sebastian Thrun, Lawrence K. Saul, Bernhard Schölkopf. Cambridge, MA, USA. Vancouver, BC, Canada. The MIT Press.

We report and compare the performance of different learning algorithms based on data from cortical recordings. The task is to predict the orientation of visual stimuli from the activity of a population of simultaneously recorded neurons. We compare several ways of improving the coding of the input (i.e., the spike data) as well as of the output (i.e., the orientation), and report the results obtained using different kernel algorithms.

#### Semi-supervised kernel regression using whitened function classes

Matthias O. Franz, Younghee Kwon, Carl Edward Rasmussen, Bernhard Schölkopf, 2004. (In Lecture Notes in Computer Science (LNCS)). Edited by C. E. Rasmussen, H. H. Bülthoff, M. A. Giese, B. Schölkopf. (Pattern Recognition, Proceedings of the 26th DAGM Symposium). Berlin, Germany. Springer.

The use of non-orthonormal basis functions in ridge regression leads to an often undesired non-isotropic prior in function space. In this study, we investigate an alternative regularization technique that results in an implicit whitening of the basis functions by penalizing directions in function space with a large prior variance. The regularization term is computed from unlabelled input data that characterizes the input distribution. Tests on two datasets using polynomial basis functions showed an improved average performance compared to standard ridge regression.

#### Variational Gaussian Process State-Space Models

Roger Frigola, Yutian Chen, Carl Edward Rasmussen, 2014. (In Advances in Neural Information Processing Systems 27). Edited by Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, K.Q. Weinberger.

State-space models have been successfully used for more than fifty years in different areas of science and engineering. We present a procedure for efficient variational Bayesian learning of nonlinear state-space models based on sparse Gaussian processes. The result of learning is a tractable posterior over nonlinear dynamical systems. In comparison to conventional parametric models, we offer the possibility to straightforwardly trade off model capacity and computational cost whilst avoiding overfitting. Our main algorithm uses a hybrid inference approach combining variational Bayes and sequential Monte Carlo. We also present stochastic variational inference and online learning approaches for fast learning with long time series.

#### Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, Carl Edward Rasmussen, 2013. (In Advances in Neural Information Processing Systems 26). Edited by L. Bottou, C.J.C. Burges, Z. Ghahramani, M. Welling, K.Q. Weinberger. Curran Associates, Inc.

State-space models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning in nonlinear nonparametric state-space models. We place a Gaussian process prior over the transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. However, to enable efficient inference, we marginalize over the dynamics of the model and instead infer directly the joint smoothing distribution through the use of specially tailored Particle Markov Chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. We make use of sparse Gaussian process models to greatly reduce the computational complexity of the approach.

#### Identification of Gaussian Process State-Space Models with Particle Stochastic Approximation EM

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, Carl Edward Rasmussen, 2014. (In Proceedings of the 19th World Congress of the International Federation of Automatic Control (IFAC)).

Gaussian process state-space models (GP-SSMs) are a very flexible family of models of nonlinear dynamical systems. They comprise a Bayesian nonparametric representation of the dynamics of the system and additional (hyper-)parameters governing the properties of this nonparametric representation. The Bayesian formalism enables systematic reasoning about the uncertainty in the system dynamics. We present an approach to maximum likelihood identification of the parameters in GP-SSMs, while retaining the full nonparametric description of the dynamics. The method is based on a stochastic approximation version of the EM algorithm that employs recent developments in particle Markov chain Monte Carlo for efficient identification.

#### Integrated Pre-Processing for Bayesian Nonlinear System Identification with Gaussian Processes

Roger Frigola, Carl Edward Rasmussen, 2013. (In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on).

We introduce GP-FNARX: a new model for nonlinear system identification based on a nonlinear autoregressive exogenous model (NARX) with filtered regressors (F) where the nonlinear regression problem is tackled using sparse Gaussian processes (GP). We integrate data pre-processing with system identification into a fully automated procedure that goes from raw data to an identified model. Both pre-processing parameters and GP hyper-parameters are tuned by maximizing the marginal likelihood of the probabilistic model. We obtain a Bayesian model of the system’s dynamics which is able to report its uncertainty in regions where the data is scarce. The automated approach, the modeling of uncertainty and its relatively low computational cost make GP-FNARX a good candidate for applications in robotics and adaptive control.

#### Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models

Yarin Gal, Mark van der Wilk, Carl Rasmussen, 2014. (In Advances in Neural Information Processing Systems 27). Edited by Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, K.Q. Weinberger. Curran Associates, Inc.

Gaussian processes (GPs) are a powerful tool for probabilistic inference over functions. They have been applied to both regression and non-linear dimensionality reduction, and offer desirable properties such as uncertainty estimates, robustness to over-fitting, and principled ways for tuning hyper-parameters. However the scalability of these models to big datasets remains an active topic of research. We introduce a novel re-parametrisation of variational inference for sparse GP regression and latent variable models that allows for an efficient distributed algorithm. This is done by exploiting the decoupling of the data given the inducing points to re-formulate the evidence lower bound in a Map-Reduce setting. We show that the inference scales well with data and computational resources, while preserving a balanced distribution of the load among the nodes. We further demonstrate the utility in scaling Gaussian processes to big data. We show that GP performance improves with increasing amounts of data in regression (on flight data with 2 million records) and latent variable modelling (on MNIST). The results show that GPs perform better than many common models often used for big data.

#### Deep Convolutional Networks as shallow Gaussian Processes

Adrià Garriga-Alonso, Carl Edward Rasmussen, Laurence Aitchison, 2019. (In International Conference on Learning Representations (ICLR)).

We show that the output of a (residual) convolutional neural network (CNN) with an appropriate prior over the weights and biases is a Gaussian process (GP) in the limit of infinitely many convolutional filters, extending similar results for dense networks. For a CNN, the equivalent kernel can be computed exactly and, unlike “deep kernels”, has very few parameters: only the hyperparameters of the original CNN. Further, we show that this kernel has two properties that allow it to be computed efficiently; the cost of evaluating the kernel for a pair of images is similar to a single forward pass through the original CNN with only one filter per layer. The kernel equivalent to a 32-layer ResNet obtains 0.84% classification error on MNIST, a new record for GPs with a comparable number of parameters.

#### Gaussian Process priors with uncertain inputs — application to multiple-step ahead time series forecasting

Agathe Girard, Carl Edward Rasmussen, Joaquin Quiñonero-Candela, Roderick Murray-Smith, December 2003. (In Advances in Neural Information Processing Systems 15). Edited by S. Becker, S. Thrun, K. Obermayer. Cambridge, MA, USA. The MIT Press.

We consider the problem of multi-step ahead prediction in time series analysis using the non-parametric Gaussian process model. $k$-step ahead forecasting of a discrete-time non-linear dynamic system can be performed by doing repeated one-step ahead predictions. For a state-space model of the form $y_t = f(y_{t-1}, \ldots, y_{t-L})$, the prediction of $y$ at time $t + k$ is based on the point estimates of the previous outputs. In this paper, we show how, using an analytical Gaussian approximation, we can formally incorporate the uncertainty about intermediate regressor values, thus updating the uncertainty on the current prediction.
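
As an illustration of the baseline this abstract starts from, the following is a minimal sketch of naive iterated forecasting, in which each one-step-ahead point prediction is fed back as an input. It is not the paper's method: it ignores the uncertainty on intermediate regressors, which is exactly what the paper's Gaussian approximation adds. The kernel choice, the hyperparameter values, the toy sine series and all function names are assumptions made for the example.

```python
import numpy as np

def se_kernel(A, B, ell=1.0, sf2=1.0):
    """Squared-exponential kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / ell**2)

def fit_gp(X, y, noise=1e-2):
    """Precompute alpha = (K + noise * I)^{-1} y for the posterior mean."""
    K = se_kernel(X, X) + noise * np.eye(len(X))
    return np.linalg.solve(K, y)

def iterate_forecast(X, alpha, last_window, k):
    """Naive k-step forecast: feed each point prediction back as an input."""
    window = np.array(last_window, dtype=float)
    preds = []
    for _ in range(k):
        mean = se_kernel(window[None, :], X) @ alpha   # one-step-ahead mean
        preds.append(float(mean[0]))
        window = np.append(window[1:], mean[0])        # slide the lag window
    return np.array(preds)

# Toy series and lagged regressors with L = 3 (illustrative values only).
L = 3
series = np.sin(0.3 * np.arange(80))
X = np.stack([series[i:i + L] for i in range(len(series) - L)])
y = series[L:]
alpha = fit_gp(X, y)
forecast = iterate_forecast(X, alpha, series[-L:], k=5)
```

Because only the mean is propagated, any predictive variance reported at step $k$ by this scheme would not reflect the uncertainty accumulated over the earlier steps.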

#### A Choice Model with Infinitely Many Latent Features

Dilan Görür, Frank Jäkel, Carl Edward Rasmussen, June 2006. (In 23rd International Conference on Machine Learning). Edited by W. W. Cohen, Andrew Moore. New York, NY, USA. Pittsburgh, PA, USA. ACM Press. **DOI**: 10.1145/1143844.1143890.

Elimination by aspects (EBA) is a probabilistic choice model describing how humans decide between several options. The options from which the choice is made are characterized by binary features and associated weights. For instance, when choosing which mobile phone to buy, the features to consider may be: long lasting battery, color screen, etc. Existing methods for inferring the parameters of the model assume pre-specified features. However, the features that lead to the observed choices are not always known. Here, we present a non-parametric Bayesian model to infer the features of the options and the corresponding weights from choice data. We use the Indian buffet process (IBP) as a prior over the features. Inference using Markov chain Monte Carlo (MCMC) in conjugate IBP models has been previously described. The main contribution of this paper is an MCMC algorithm for the EBA model that can also be used in inference for other non-conjugate IBP models; this may broaden the use of IBP priors considerably.

#### Dirichlet Process Gaussian Mixture Models: Choice of the base distribution

Dilan Görür, Carl Edward Rasmussen, July 2010. (Journal of Computer Science and Technology). Beijing, China. Science Press. **DOI**: 10.1007/s11390-010-9355-8.

In the Bayesian mixture modeling framework it is possible to infer the necessary number of components to model the data and therefore it is unnecessary to explicitly restrict the number of components. Nonparametric mixture models sidestep the problem of finding the “correct” number of mixture components by assuming infinitely many components. In this paper Dirichlet process mixture (DPM) models are cast as infinite mixture models and inference using Markov chain Monte Carlo is described. The specification of the priors on the model parameters is often guided by mathematical and practical convenience. The primary goal of this paper is to compare the choice of conjugate and non-conjugate base distributions on a particular class of DPM models which is widely used in applications, the Dirichlet process Gaussian mixture model (DPGMM). We compare computational efficiency and modeling performance of DPGMM defined using a conjugate and a conditionally conjugate base distribution. We show that better density models can result from using a wider class of priors with no or only a modest increase in computational effort.
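
One standard way to see how a Dirichlet process prior avoids fixing the number of components is the Chinese restaurant process view of the induced partition. The sketch below samples only from that prior over cluster assignments. It is not the paper's MCMC sampler and involves no Gaussian likelihood or base distribution. The function name `sample_crp` and the concentration value are assumptions for illustration.

```python
import random

def sample_crp(n, alpha, rng):
    """Draw table assignments for n customers from a Chinese restaurant process."""
    counts = []        # counts[j] = number of customers at table j
    assignments = []
    for _ in range(n):
        # Existing table j has weight counts[j]; a new table has weight alpha.
        weights = counts + [alpha]
        j = rng.choices(range(len(weights)), weights=weights)[0]
        if j == len(counts):
            counts.append(1)   # open a new table (a new mixture component)
        else:
            counts[j] += 1
        assignments.append(j)
    return assignments, counts

assignments, counts = sample_crp(n=100, alpha=1.0, rng=random.Random(0))
```

The number of occupied tables, `len(counts)`, is random and grows slowly with `n`, which is the sense in which the number of components is inferred rather than specified.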

#### Modelling Spikes with Mixtures of Factor Analysers

Dilan Görür, Carl Edward Rasmussen, Andreas S. Tolias, Fabian Sinz, Nikos K. Logothetis, September 2004. (In DAGM 2004). Edited by C. E. Rasmussen, H. H. Bülthoff, B. Schölkopf, M. A. Giese. (Pattern Recognition: Proceedings of the 26th DAGM Symposium). Berlin, Germany. Tübingen, Germany. Springer. Lecture Notes in Computer Science (LNCS).

Identifying the action potentials of individual neurons from extracellular recordings, known as spike sorting, is a challenging problem. We consider the spike sorting problem using a generative model, mixtures of factor analysers, which concurrently performs clustering and feature extraction. The most important advantage of this method is that it quantifies the certainty with which the spikes are classified. This can be used as a means for evaluating the quality of clustering and therefore spike isolation. Using this method, nearly simultaneously occurring spikes can also be modelled, which is a hard task for many spike sorting methods. Furthermore, modelling the data with a generative model allows us to generate simulated data.

#### Reinforcement Learning with Reference Tracking Control in Continuous State Spaces

Joseph Hall, Carl Edward Rasmussen, Jan Maciejowski, 2011. (In Proceedings of 50th IEEE Conference on Decision and Control and European Control Conference).

The contribution described in this paper is an algorithm for learning nonlinear, reference tracking, control policies given no prior knowledge of the dynamical system and limited interaction with the system through the learning process. Concepts from the fields of reinforcement learning, Bayesian statistics and classical control have been brought together in the formulation of this algorithm, which can be viewed as a form of indirect self-tuning regulator. On the task of reference tracking using the inverted pendulum it was shown to yield generally improved performance over the best controller derived from the standard linear quadratic method using only 30 s of total interaction with the system. Finally, the algorithm was shown to work on the double pendulum, proving its ability to solve nontrivial control tasks.

#### Modelling and Control of Nonlinear Systems using Gaussian Processes with Partial Model Information

Joseph Hall, Carl Edward Rasmussen, Jan Maciejowski, 2012. (In 51st IEEE Conference on Decision and Control).

Gaussian processes are gaining increasing popularity among the control community, in particular for the modelling of discrete time state space systems. However, it has not been clear how to incorporate model information, in the form of known state relationships, when using a Gaussian process as a predictive model. An obvious example of known prior information is position and velocity related states. Incorporation of such information would be beneficial both computationally and for faster dynamics learning. This paper introduces a method of achieving this, yielding faster dynamics learning and a reduction in computational effort from $O(Dn^2)$ to $O((D-F)n^2)$ in the prediction stage for a system with $D$ states, $F$ known state relationships and $n$ observations. The effectiveness of the method is demonstrated through its inclusion in the PILCO learning algorithm with application to the swing-up and balance of a torque-limited pendulum and the balancing of a robotic unicycle in simulation.

#### Pruning from adaptive regularization

Lars Kai Hansen, Carl Edward Rasmussen, 1994. (Neural Computation). The MIT Press.

Inspired by the recent upsurge of interest in Bayesian methods we consider adaptive regularization. A generalization-based scheme for adaptation of regularization parameters is introduced and compared to Bayesian regularization. We show that pruning arises naturally within both adaptive regularization schemes. As a model example we have chosen the simplest possible: estimating the mean of a random variable with known variance. Marked similarities are found between the two methods in that they both involve a “noise limit”, below which they regularize with infinite weight decay, i.e., they prune. However, pruning is not always beneficial. We show explicitly that both methods in some cases may increase the generalization error. This corresponds to situations where the underlying assumptions of the regularizer are poorly matched to the environment.

#### Bayesian modelling of fMRI time series

Pedro A. d. F. R. Højen-Sørensen, Carl Edward Rasmussen, Lars Kai Hansen, 2000. (In Advances in Neural Information Processing Systems 12). Edited by Todd K. Leen, Sara A. Solla, Klaus-Robert Müller. The MIT Press.

We present a Hidden Markov Model (HMM) for inferring the hidden psychological state (or neural activity) during single trial fMRI activation experiments with blocked task paradigms. Inference is based on Bayesian methodology, using a combination of analytical and a variety of Markov Chain Monte Carlo (MCMC) sampling techniques. The advantage of this method is that detection of short time learning effects between repeated trials is possible since inference is based only on single trial experiments.

#### Non-Factorised Variational Inference in Dynamical Systems

Alessandro Davide Ialongo, Mark van der Wilk, James Hensman, Carl Edward Rasmussen, December 2018. (In First Symposium on Advances in Approximate Bayesian Inference). Montreal.

We focus on variational inference in dynamical systems where the discrete time transition function (or evolution rule) is modelled by a Gaussian process. The dominant approach so far has been to use a factorised posterior distribution, decoupling the transition function from the system states. This is not exact in general and can lead to an overconfident posterior over the transition function as well as an overestimation of the intrinsic stochasticity of the system (process noise). We propose a new method that addresses these issues and incurs no additional computational costs.

#### Overcoming Mean-Field Approximations in Recurrent Gaussian Process Models

Alessandro Davide Ialongo, Mark van der Wilk, James Hensman, Carl Edward Rasmussen, June 2019. (In 36th International Conference on Machine Learning). Long Beach.

We identify a new variational inference scheme for dynamical systems whose transition function is modelled by a Gaussian process. Inference in this setting has either employed computationally intensive MCMC methods, or relied on factorisations of the variational posterior. As we demonstrate in our experiments, the factorisation between latent system states and transition function can lead to a miscalibrated posterior and to learning unnecessarily large noise terms. We eliminate this factorisation by explicitly modelling the dependence between state trajectories and the Gaussian process posterior. Samples of the latent states can then be tractably generated by conditioning on this representation. The method we obtain (VCDT: variationally coupled dynamics and trajectories) gives better predictive performance and more calibrated estimates of the transition function, yet maintains the same time and space complexities as mean-field methods. Code is available at: https://github.com/ialong/GPt.

#### Closed-form Inference and Prediction in Gaussian Process State-Space Models

Alessandro Davide Ialongo, Mark van der Wilk, Carl Edward Rasmussen, December 2017. (In NIPS Time Series Workshop 2017). Long Beach.

We examine an analytic variational inference scheme for the Gaussian Process State Space Model (GPSSM), a probabilistic model for system identification and time-series modelling. Our approach performs variational inference over both the system states and the transition function. We exploit Markov structure in the true posterior, as well as an inducing point approximation, to achieve linear time complexity in the length of the time series. Contrary to previous approaches, no Monte Carlo sampling is required: inference is cast as a deterministic optimisation problem. In a number of experiments, we demonstrate the ability to model non-linear dynamics in the presence of both process and observation noise as well as to impute missing information (e.g. velocities from raw positions through time), to de-noise, and to estimate the underlying dimensionality of the system. Finally, we also introduce a closed-form method for multi-step prediction, and a novel criterion for assessing the quality of our approximate posterior.

#### A case based comparison of identification with neural network and Gaussian process models

Juš Kocijan, Blaž Banko, Bojan Likar, Agathe Girard, Roderick Murray-Smith, Carl Edward Rasmussen, 2003. (In IFAC International Conference on Intelligent Control Systems and Signal Processing).

In this paper an alternative approach to black-box identification of non-linear dynamic systems is compared with the more established approach of using artificial neural networks. The Gaussian process prior approach is a representative of non-parametric modelling approaches. It was compared on a pH process modelling case study. The purpose of modelling was to use the model for control design. The comparison revealed that even though Gaussian process models can be effectively used for modelling dynamic systems, caution has to be exercised when signals are selected.

#### Gaussian process model based predictive control

Juš Kocijan, Roderick Murray-Smith, Carl Edward Rasmussen, Agathe Girard, 2004. (In American Control Conference). (Proceedings of the ACC 2004). Boston, MA.

Gaussian process models provide a probabilistic non-parametric modelling approach for black-box identification of non-linear dynamic systems. The Gaussian processes can highlight areas of the input space where prediction quality is poor, due to the lack of data or its complexity, by indicating the higher variance around the predicted mean. Gaussian process models contain noticeably fewer coefficients to be optimised. This paper illustrates possible application of Gaussian process models within model-based predictive control. The extra information provided within the Gaussian process model is used in predictive control, where optimisation of the control signal takes the variance information into account. The predictive control principle is demonstrated on control of a pH process benchmark.

#### Predictive control with Gaussian process models

Juš Kocijan, Roderick Murray-Smith, Carl Edward Rasmussen, Bojan Likar, 2003. (In IEEE Region 8 Eurocon 2003: Computer as a Tool). Edited by B. Zajc, M. Tkal.

This paper describes model-based predictive control based on Gaussian processes. Gaussian process models provide a probabilistic non-parametric modelling approach for black-box identification of non-linear dynamic systems. They offer more insight into the variance of the obtained model response, as well as fewer parameters to determine than other models. The Gaussian processes can highlight areas of the input space where prediction quality is poor, due to the lack of data or its complexity, by indicating the higher variance around the predicted mean. This property is used in predictive control, where optimisation of the control signal takes the variance information into account. The predictive control principle is demonstrated on a simulated example of a nonlinear system.

#### Approximate Inference for Robust Gaussian Process Regression

Malte Kuß, Tobias Pfingsten, Lehel Csatò, Carl Edward Rasmussen, 2005. Max Planck Institute for Biological Cybernetics, Tübingen, Germany.

Gaussian process (GP) priors have been successfully used in non-parametric Bayesian regression and classification models. Inference can be performed analytically only for the regression model with Gaussian noise. For all other likelihood models inference is intractable and various approximation techniques have been proposed. In recent years expectation-propagation (EP) has been developed as a general method for approximate inference. This article provides a general summary of how expectation-propagation can be used for approximate inference in Gaussian process models. Furthermore we present a case study describing its implementation for a new robust variant of Gaussian process regression. To gain further insights into the quality of the EP approximation we present experiments in which we compare to results obtained by Markov chain Monte Carlo (MCMC) sampling.

#### Assessing Approximate Inference for Binary Gaussian Process Classification

Malte Kuß, Carl Edward Rasmussen, 2005. (Journal of Machine Learning Research).

Gaussian process priors can be used to define flexible, probabilistic classification models. Unfortunately exact Bayesian inference is analytically intractable and various approximation techniques have been proposed. In this work we review and compare Laplace’s method and Expectation Propagation for approximate Bayesian inference in the binary Gaussian process classification model. We present a comprehensive comparison of the approximations, their predictive performance and marginal likelihood estimates to results obtained by MCMC sampling. We explain theoretically and corroborate empirically the advantages of Expectation Propagation compared to Laplace’s method.

#### Assessing Approximations for Gaussian Process Classification

Malte Kuß, Carl Edward Rasmussen, April 2006. (In Advances in Neural Information Processing Systems 18). Edited by Y. Weiss, B. Schölkopf, J. Platt. Cambridge, MA, USA. Whistler, BC, Canada. The MIT Press.

Gaussian processes are attractive models for probabilistic classification but unfortunately exact inference is analytically intractable. We compare Laplace’s method and Expectation Propagation (EP) focusing on marginal likelihood estimates and predictive performance. We explain theoretically and corroborate empirically that EP is superior to Laplace. We also compare to a sophisticated MCMC scheme and show that EP is surprisingly accurate.

#### Approximate inference for Fully Bayesian Gaussian process Regression

Vidhi Lalchand, Carl Edward Rasmussen, 2020. (In 2nd Symposium on Advances in Approximate Bayesian Inference).

Learning in Gaussian Process models occurs through the adaptation of hyperparameters of the mean and the covariance function. The classical approach entails maximizing the marginal likelihood yielding fixed point estimates (an approach called Type II maximum likelihood or ML-II). An alternative learning procedure is to infer the posterior over hyperparameters in a hierarchical specification of GPs we call Fully Bayesian Gaussian Process Regression (GPR). This work considers two approximation schemes for the intractable hyperparameter posterior: 1) Hamiltonian Monte Carlo (HMC) yielding a sampling based approximation and 2) Variational Inference (VI) where the posterior over hyperparameters is approximated by a factorized Gaussian (mean-field) or a full-rank Gaussian accounting for correlations between hyperparameters. We analyse the predictive performance for fully Bayesian GPR on a range of benchmark data sets.
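
To make the ML-II baseline concrete, here is a minimal sketch of the GP log marginal likelihood and a point estimate of one hyperparameter obtained by maximizing it. It assumes a zero-mean GP with a squared-exponential kernel on 1-D inputs and uses a crude grid search in place of gradient-based optimization. The function name, toy data and grid values are assumptions, and the HMC and VI schemes studied in the paper are not shown.

```python
import numpy as np

def log_marginal_likelihood(X, y, ell, sf2, sn2):
    """log p(y | X, theta) for a zero-mean GP with an SE kernel and Gaussian noise."""
    d2 = (X[:, None] - X[None, :]) ** 2            # squared distances, 1-D inputs
    K = sf2 * np.exp(-0.5 * d2 / ell**2) + sn2 * np.eye(len(X))
    Lc = np.linalg.cholesky(K)
    alpha = np.linalg.solve(Lc.T, np.linalg.solve(Lc, y))   # K^{-1} y
    return (-0.5 * y @ alpha
            - np.log(np.diag(Lc)).sum()            # equals -0.5 * log|K|
            - 0.5 * len(X) * np.log(2 * np.pi))

# ML-II by crude grid search over the lengthscale (toy data, illustrative only).
rng = np.random.default_rng(0)
X = np.linspace(0, 5, 30)
y = np.sin(X) + 0.1 * rng.standard_normal(30)
grid = np.array([0.1, 0.3, 1.0, 3.0, 10.0])
scores = np.array([log_marginal_likelihood(X, y, ell, 1.0, 0.01) for ell in grid])
ell_ml2 = grid[scores.argmax()]
```

A fully Bayesian treatment would instead place a prior on `ell` (and the other hyperparameters) and average predictions over the resulting posterior, rather than committing to the single value `ell_ml2`.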

#### Sparse Spectrum Gaussian Process Regression

Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, Aníbal Figueiras-Vidal, June 2010. (Journal of Machine Learning Research).

We present a new sparse Gaussian Process (GP) model for regression. The key novel idea is to sparsify the *spectral representation* of the GP. This leads to a simple, practical algorithm for regression tasks. We compare the achievable trade-offs between predictive accuracy and computational requirements, and show that these are typically superior to existing state-of-the-art sparse approximations. We discuss both the weight space and function space representations, and note that the new construction implies priors over functions which are always stationary, and can approximate any covariance function in this class.
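
The weight-space view mentioned in the abstract can be sketched as Bayesian linear regression on a finite set of trigonometric basis functions. In the sketch below the spectral points are sampled once from the spectral density of a squared-exponential kernel and kept fixed. That is a simplifying assumption: it is not a reproduction of the paper's training procedure. Angular frequencies are used, and all names and values are illustrative.

```python
import numpy as np

def ss_features(X, S, sf2=1.0):
    """Trigonometric features whose inner products approximate a stationary kernel."""
    proj = X @ S.T                                   # (n, m) projections x . s_r
    m = S.shape[0]
    return np.sqrt(sf2 / m) * np.hstack([np.cos(proj), np.sin(proj)])

def ss_predict_mean(X, y, Xs, S, sf2=1.0, sn2=0.01):
    """Posterior mean of Bayesian linear regression on the features."""
    Phi, Phis = ss_features(X, S, sf2), ss_features(Xs, S, sf2)
    A = Phi.T @ Phi + sn2 * np.eye(Phi.shape[1])     # (2m, 2m) system, not (n, n)
    return Phis @ np.linalg.solve(A, Phi.T @ y)

rng = np.random.default_rng(0)
ell, m = 1.0, 50
S = rng.standard_normal((m, 1)) / ell                # spectral points for an SE kernel
X = np.linspace(-3, 3, 200)[:, None]
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(200)
mean = ss_predict_mean(X, y, X, S)
```

Since each feature pair contributes $\cos(s_r^\top(x - x'))$ to the implied covariance, the resulting prior depends only on $x - x'$, which is why the construction always yields stationary priors.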

#### Data-Efficient Reinforcement Learning in Continuous State-Action Gaussian-POMDPs

Rowan McAllister, Carl Edward Rasmussen, December 2017. (In Advances in Neural Information Processing Systems 30). Long Beach, California.

We present a data-efficient reinforcement learning method for continuous state-action systems under significant observation noise. Data-efficient solutions under small noise exist, such as PILCO, which learns the cartpole swing-up task in 30 s. PILCO evaluates policies by planning state-trajectories using a dynamics model. However, PILCO applies policies to the observed state, therefore planning in observation space. We extend PILCO with filtering to instead plan in belief space, consistent with partially observable Markov decision process (POMDP) planning. This enables data-efficient learning under significant observation noise, outperforming more naive methods such as post-hoc application of a filter to policies optimised by the original (unfiltered) PILCO algorithm. We test our method on the cartpole swing-up task, which involves nonlinear dynamics and requires nonlinear control.

#### Gaussian Process Training with Input Noise

Andrew McHutchon, Carl Edward Rasmussen, 2011. (In Advances in Neural Information Processing Systems 24). Edited by J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, K.Q. Weinberger. Granada, Spain. Curran Associates, Inc.

In standard Gaussian Process regression input locations are assumed to be noise free. We present a simple yet effective GP model for training on input points corrupted by i.i.d. Gaussian noise. To make computations tractable we use a local linear expansion about each input point. This allows the input noise to be recast as output noise proportional to the squared gradient of the GP posterior mean. The input noise hyperparameters are trained alongside other hyperparameters by the usual method of maximisation of the marginal likelihood, and allow estimation of the noise levels on each input dimension. Training uses an iterative scheme, which alternates between optimising the hyperparameters and calculating the posterior gradient. Analytic predictive moments can then be found for Gaussian distributed test points. We compare our model to others over a range of different regression problems and show that it improves over current methods.
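
The local linear expansion described above can be written out as follows. This is a sketch in assumed notation, not copied from the paper: $x$ is the observed (noisy) input, $\tilde{x}$ the true input, $\Sigma_x$ a diagonal input-noise covariance, $\sigma_y^2$ the output-noise variance, and $\partial_{\bar f}(x)$ the gradient of the GP posterior mean at $x$.

```latex
\begin{aligned}
x &= \tilde{x} + \epsilon_x, \qquad
  \epsilon_x \sim \mathcal{N}(0, \Sigma_x), \qquad
  \epsilon_y \sim \mathcal{N}(0, \sigma_y^2), \\
y &= f(\tilde{x}) + \epsilon_y = f(x - \epsilon_x) + \epsilon_y
   \approx f(x) - \epsilon_x^\top \partial_{\bar f}(x) + \epsilon_y, \\
\operatorname{Var}\left[\, y \mid f(x) \,\right]
  &\approx \sigma_y^2 + \partial_{\bar f}(x)^\top \Sigma_x \, \partial_{\bar f}(x).
\end{aligned}
```

The second term is the input noise recast as an input-dependent output noise: it is large where the posterior mean is steep and vanishes where it is flat.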

#### Adaptive, Cautious, Predictive control with Gaussian Process Priors

Roderick Murray-Smith, Daniel Sbarbaro, Carl Edward Rasmussen, Agathe Girard, August 2003. (In IFAC SYSID 2003). Edited by P. Van den Hof, B. Wahlberg, S. Weiland. (Proceedings of the 13th IFAC Symposium on System Identification). Oxford, UK. Rotterdam, The Netherlands. Elsevier Science Ltd.

Nonparametric Gaussian Process models, a Bayesian statistics approach, are used to implement a nonlinear adaptive control law. Predictions, including propagation of the state uncertainty, are made over a k-step horizon. The expected value of a quadratic cost function is minimised, over this prediction horizon, without ignoring the variance of the model predictions. The general method and its main features are illustrated on a simulation example.
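
The role of the predictive variance in such a cost can be seen from a standard identity. The symbols below are assumed for illustration: $r$ is a reference value, and $y$ is a predicted output with mean $\mu$ and variance $\sigma^2$. The cost actually used in the paper may contain further terms, such as a penalty on the control signal.

```latex
\mathbb{E}\left[(r - y)^2\right]
  = \left(r - \mathbb{E}[y]\right)^2 + \operatorname{Var}[y]
  = (r - \mu)^2 + \sigma^2 .
```

Minimising this expectation therefore penalises control actions that lead into regions where the model is uncertain, which is the sense in which the controller is cautious.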

#### Approximations for Binary Gaussian Process Classification

Hannes Nickisch, Carl Edward Rasmussen, October 2008. (Journal of Machine Learning Research).

We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches.

#### Gaussian Mixture Modeling with Gaussian Process Latent Variable Models

Hannes Nickisch, Carl Edward Rasmussen, September 2010. (In Proceedings of the 32nd DAGM Symposium on Pattern Recognition). Darmstadt, Germany. Springer. Lecture Notes in Computer Science (LNCS). **DOI**: 10.1007/978-3-642-15986-2_28.

Density modeling is notoriously difficult for high dimensional data. One approach to the problem is to search for a lower dimensional manifold which captures the main characteristics of the data. Recently, the Gaussian Process Latent Variable Model (GPLVM) has successfully been used to find low dimensional manifolds in a variety of complex data. The GPLVM consists of a set of points in a low dimensional latent space, and a stochastic map to the observed space. We show how it can be interpreted as a density model in the observed space. However, the GPLVM is not trained as a density model and therefore yields bad density estimates. We propose a new training strategy and obtain improved generalisation performance and better density estimates in comparative evaluations on several benchmark data sets.

#### Benchmarking the neural linear model for regression

Sebastian W. Ober, Carl Edward Rasmussen, 2019. (In 2nd Symposium on Advances in Approximate Bayesian Inference).

The neural linear model is a simple adaptive Bayesian linear regression method that has recently been used in a number of problems ranging from Bayesian optimization to reinforcement learning. Despite its apparent successes in these settings, to the best of our knowledge there has been no systematic exploration of its capabilities on simple regression tasks. In this work we characterize these on the UCI datasets, a popular benchmark for Bayesian regression models, as well as on the recently introduced UCI “gap” datasets, which are better tests of out-of-distribution uncertainty. We demonstrate that the neural linear model is a simple method that shows generally good performance on these tasks, but at the cost of requiring good hyperparameter tuning.

#### The promises and pitfalls of deep kernel learning

Sebastian W. Ober, Carl Edward Rasmussen, Mark van der Wilk, 2021. (In 37th Conference on Uncertainty in Artificial Intelligence).

Deep kernel learning (DKL) and related techniques aim to combine the representational power of neural networks with the reliable uncertainty estimates of Gaussian processes. One crucial aspect of these models is an expectation that, because they are treated as Gaussian process models optimized using the marginal likelihood, they are protected from overfitting. However, we identify situations where this is not the case. We explore this behavior, explain its origins and consider how it applies to real datasets. Through careful experimentation on the UCI, CIFAR-10, and the UTKFace datasets, we find that the overfitting from overparameterized maximum marginal likelihood, in which the model is “somewhat Bayesian”, can in certain scenarios be worse than that from not being Bayesian at all. We explain how and when DKL can still be successful by investigating optimization dynamics. We also find that failures of DKL can be rectified by a fully Bayesian treatment, which leads to the desired performance improvements over standard neural networks and Gaussian processes.

#### Active Learning of Model Evidence Using Bayesian Quadrature

Michael A. Osborne, David Duvenaud, Roman Garnett, Carl Edward Rasmussen, Stephen J. Roberts, Zoubin Ghahramani, December 2012. (In Advances in Neural Information Processing Systems 25). Lake Tahoe, California, USA.

Numerical integration is a key component of many problems in scientific computing, statistical modelling, and machine learning. Bayesian Quadrature is a model-based method for numerical integration which, relative to standard Monte Carlo methods, offers increased sample efficiency and a more robust estimate of the uncertainty in the estimated integral. We propose a novel Bayesian Quadrature approach for numerical integration when the integrand is non-negative, such as the case of computing the marginal likelihood, predictive distribution, or normalising constant of a probabilistic model. Our approach approximately marginalises the quadrature model’s hyperparameters in closed form, and introduces an active learning scheme to optimally select function evaluations, as opposed to using Monte Carlo samples. We demonstrate our method on both a number of synthetic benchmarks and a real scientific problem from astronomy.
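
For orientation, the following is a minimal sketch of plain Bayesian Quadrature, the model-based baseline the abstract builds on. It places a GP with a squared-exponential kernel on the integrand and integrates the posterior mean against a standard normal density in closed form. It does not implement the paper's contributions (the non-negativity treatment, hyperparameter marginalisation, or active selection of evaluations). The fixed grid of nodes, the lengthscale and the function name are assumptions.

```python
import numpy as np

def bayesian_quadrature(f, nodes, ell=1.0, jitter=1e-8):
    """Vanilla BQ estimate of the integral of f(x) * N(x; 0, 1) over the real line."""
    diff = nodes[:, None] - nodes[None, :]
    K = np.exp(-0.5 * diff**2 / ell**2) + jitter * np.eye(len(nodes))
    # Kernel mean z_i = integral of k(x_i, x) N(x; 0, 1) dx, closed form for SE kernels.
    z = np.sqrt(ell**2 / (ell**2 + 1)) * np.exp(-0.5 * nodes**2 / (ell**2 + 1))
    weights = np.linalg.solve(K, z)        # quadrature weights K^{-1} z
    return weights @ f(nodes)

nodes = np.linspace(-4, 4, 17)
estimate = bayesian_quadrature(lambda x: np.exp(-0.5 * x**2), nodes)
```

For this test integrand the exact value is $1/\sqrt{2}$, which gives a simple check on the estimate.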

#### PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos

Paavo Parmas, Carl Edward Rasmussen, Jan Peters, Kenji Doya, 2018. (In 35th International Conference on Machine Learning).

Previously, the exploding gradient problem has been explained to be central in deep learning and model-based reinforcement learning, because it causes numerical issues and instability in optimization. Our experiments in model-based reinforcement learning imply that the problem is not just a numerical issue, but it may be caused by a fundamental chaos-like nature of long chains of nonlinear computations. Not only do the magnitudes of the gradients become large, the direction of the gradients becomes essentially random. We show that reparameterization gradients suffer from the problem, while likelihood ratio gradients are robust. Using our insights, we develop a model-based policy search framework, Probabilistic Inference for Particle-Based Policy Search (PIPPS), which is easily extensible, and allows for almost arbitrary models and policies, while simultaneously matching the performance of previous data-efficient learning algorithms. Finally, we invent the total propagation algorithm, which efficiently computes a union over all pathwise derivative depths during a single backwards pass, automatically giving greater weight to estimators with lower variance, sometimes improving over reparameterization gradients by $10^6$ times.
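
The two gradient estimators contrasted in the abstract can be illustrated on a one-dimensional toy problem: estimating $\frac{d}{d\mu}\,\mathbb{E}[f(x)]$ for $x \sim \mathcal{N}(\mu, 1)$ with $f(x) = x^2$, where the exact answer is $2\mu$. This is only a sketch of the estimators themselves under assumed toy settings. It involves no dynamics model or policy and does not implement PIPPS or total propagation.

```python
import numpy as np

def gradient_estimates(mu, n, rng):
    """Two Monte Carlo estimates of d/dmu E[f(x)], x ~ N(mu, 1), with f(x) = x**2."""
    eps = rng.standard_normal(n)
    x = mu + eps
    rp = 2.0 * x                     # reparameterization: df/dx * dx/dmu, dx/dmu = 1
    lr = x**2 * (x - mu)             # likelihood ratio: f(x) * d log N(x; mu, 1) / dmu
    return rp.mean(), lr.mean(), rp.var(), lr.var()

rp_grad, lr_grad, rp_var, lr_var = gradient_estimates(
    1.0, 200_000, np.random.default_rng(0)
)
```

On this smooth single-step problem the reparameterization estimator has the lower variance. The abstract's observation concerns long chains of nonlinear computations, where the pathwise derivatives inside the reparameterization estimator can grow very large while the likelihood ratio estimator does not differentiate through the chain.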

#### Model-based design analysis and yield optimization

Tobias Pfingsten, Daniel Herrmann, Carl Edward Rasmussen, 2006. (IEEE Transactions on Semiconductor Manufacturing). **DOI**: 10.1109/TSM.2006.883589.

Fluctuations are inherent to any fabrication process. Integrated circuits and microelectromechanical systems are particularly affected by these variations, and due to high-quality requirements the effect on the devices’ performance has to be understood quantitatively. In recent years, it has become possible to model the performance of such complex systems on the basis of design specifications, and model-based sensitivity analysis has made its way into industrial engineering. We show how an efficient Bayesian approach, using a Gaussian process prior, can replace the commonly used brute-force Monte Carlo scheme, making it possible to apply the analysis to computationally costly models. We introduce a number of global, statistically justified sensitivity measures for design analysis and optimization. Two models of integrated systems serve us as case studies to introduce the analysis and to assess its convergence properties. We show that the Bayesian Monte Carlo scheme can save costly simulation runs and can ensure a reliable accuracy of the analysis.

**Comment:** Winner of the 2006 Best Paper Award for the journal.

#### Propagation of Uncertainty in Bayesian Kernel Models - Application to Multiple-Step Ahead Forecasting

Joaquin Quiñonero-Candela, Agathe Girard, Jan Larsen, Carl Edward Rasmussen, April 2003. (In ICASSP 2003). (IEEE International Conference on Acoustics, Speech and Signal Processing). Hong Kong.

The object of Bayesian modelling is the predictive distribution, which in a forecasting scenario enables improved estimates of forecasted values and their uncertainties. In this paper we focus on reliably estimating the predictive mean and variance of forecasted values using Bayesian kernel based models such as the Gaussian Process and the Relevance Vector Machine. We derive novel analytic expressions for the predictive mean and variance for Gaussian kernel shapes under the assumption of a Gaussian input distribution in the static case, and of a recursive Gaussian predictive density in iterative forecasting. The capability of the method is demonstrated for forecasting of time-series and compared to approximate methods.

#### Propagation of Uncertainty in Bayesian Kernel Models - Application to Multiple-Step Ahead Forecasting

Joaquin Quiñonero-Candela, Agathe Girard, Jan Larsen, Carl Edward Rasmussen, 2003. (In NNSP 2003). Edited by C. Molina, T. Adali, J. Larsen, M. Van Hulle, S. C. Douglas, J. Rouat. (Proceedings of 2003 IEEE International Workshop on Neural Networks for Signal Processing). Piscataway, New Jersey. Toulouse. IEEE Press.

The object of Bayesian modelling is the predictive distribution, which in a forecasting scenario enables improved estimates of forecasted values and their uncertainties. In this paper we focus on reliably estimating the predictive mean and variance of forecasted values using Bayesian kernel based models such as the Gaussian Process and the Relevance Vector Machine. We derive novel analytic expressions for the predictive mean and variance for Gaussian kernel shapes under the assumption of a Gaussian input distribution in the static case, and of a recursive Gaussian predictive density in iterative forecasting. The capability of the method is demonstrated for forecasting of time-series and compared to approximate methods.

**Comment:** Electronic version of Quiñonero-Candela, Girard, Larsen and Rasmussen, 2003 which should have been presented at ICASSP 03, but was cancelled due to bird flu epidemic.

#### Prediction at an uncertain input for Gaussian processes and Relevance Vector Machines - Application to multiple-step ahead time-series prediction

Joaquin Quiñonero-Candela, Agathe Girard, Carl Edward Rasmussen, 2003. Institute for Mathematical Modelling, DTU.

**Comment:** techreport

#### A Unifying View of Sparse Approximate Gaussian Process Regression

Joaquin Quiñonero-Candela, Carl Edward Rasmussen, 2005. (Journal of Machine Learning Research).

We provide a new unifying view, including all existing proper probabilistic sparse approximations for Gaussian process regression. Our approach relies on expressing the effective prior which the methods are using. This allows new insights to be gained, and highlights the relationship between existing methods. It also allows for a clear theoretically justified ranking of the closeness of the known approximations to the corresponding full GPs. Finally we point directly to designs of new better sparse approximations, combining the best of the existing strategies, within attractive computational constraints.

#### Analysis of Some Methods for Reduced Rank Gaussian Process Regression

Joaquin Quiñonero-Candela, Carl Edward Rasmussen, 2005. (In Switching and Learning in Feedback Systems). Edited by R. Murray-Smith, R. Shorten. Berlin, Heidelberg. Springer.

While there is strong motivation for using Gaussian Processes (GPs) due to their excellent performance in regression and classification problems, their computational complexity makes them impractical when the size of the training set exceeds a few thousand cases. This has motivated the recent proliferation of a number of cost-effective approximations to GPs, both for classification and for regression. In this paper we analyze one popular approximation to GPs for regression: the reduced rank approximation. While generally GPs are equivalent to infinite linear models, we show that Reduced Rank Gaussian Processes (RRGPs) are equivalent to finite sparse linear models. We also introduce the concept of degenerate GPs and show that they correspond to inappropriate priors. We show how to modify the RRGP to prevent it from being degenerate at test time. Training RRGPs consists both in learning the covariance function hyperparameters and the support set. We propose a method for learning hyperparameters for a given support set. We also review the Sparse Greedy GP (SGGP) approximation (Smola and Bartlett, 2001), which is a way of learning the support set for given hyperparameters based on approximating the posterior. We propose an alternative method to the SGGP that has better generalization capabilities. Finally we make experiments to compare the different ways of training a RRGP. We provide some Matlab code for learning RRGPs.

#### Evaluating Predictive Uncertainty Challenge

Joaquin Quiñonero-Candela, Carl Edward Rasmussen, Fabian Sinz, Olivier Bousquet, Bernhard Schölkopf, April 2006. (In Machine Learning Challenges. Evaluating predictive uncertainty, visual object classification and recognising textual entailment. First PASCAL Machine Learning Challenges Workshop). Edited by J. Quiñonero-Candela, I. Dagan, B. Magnini, F. d’Alché-Buc. Berlin, Germany. Southampton, United Kingdom. Springer. Lecture Notes in Computer Science (LNCS). **DOI**: 10.1007/11736790_1.

This Chapter presents the PASCAL Evaluating Predictive Uncertainty Challenge, introduces the contributed Chapters by the participants who obtained outstanding results, and provides a discussion with some lessons to be learnt. The Challenge was set up to evaluate the ability of Machine Learning algorithms to provide good “probabilistic predictions”, rather than just the usual “point predictions” with no measure of uncertainty, in regression and classification problems. Participants had to compete on a number of regression and classification tasks, and were evaluated by both traditional losses that only take into account point predictions and losses we proposed that evaluate the quality of the probabilistic predictions.

#### Approximation Methods for Gaussian Process Regression

Joaquin Quiñonero-Candela, Carl Edward Rasmussen, Christopher K. I. Williams, September 2007. (In Large-Scale Kernel Machines). Edited by L. Bottou, O. Chapelle, D. DeCoste, J. Weston. Cambridge, MA, USA. The MIT Press. Neural Information Processing.

A wealth of computationally efficient approximation methods for Gaussian process regression have been recently proposed. We give a unifying overview of sparse approximations, following Quiñonero-Candela and Rasmussen (2005), and a brief review of approximate matrix-vector multiplication methods.

**Comment:** book

#### The Infinite Gaussian Mixture Model

Carl Edward Rasmussen, 2000. (In Advances in Neural Information Processing Systems 12). Edited by Todd K. Leen, Sara A. Solla, Klaus-Robert Müller. The MIT Press.

In a Bayesian mixture model it is not necessary a priori to limit the number of components to be finite. In this paper an infinite Gaussian mixture model is presented which neatly sidesteps the difficult problem of finding the “right” number of mixture components. Inference in the model is done using an efficient parameter-free Markov Chain that relies entirely on Gibbs sampling.

#### Gaussian Processes to Speed up Hybrid Monte Carlo for Expensive Bayesian Integrals

Carl Edward Rasmussen, 2003. (In Bayesian Statistics 7). Edited by J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, M. West. Oxford University Press.

Hybrid Monte Carlo (HMC) is often the method of choice for computing Bayesian integrals that are not analytically tractable. However the success of this method may require a very large number of evaluations of the (un-normalized) posterior and its partial derivatives. In situations where the posterior is computationally costly to evaluate, this may lead to an unacceptable computational load for HMC. I propose to use a Gaussian Process model of the (log of the) posterior for most of the computations required by HMC. Within this scheme only occasional evaluation of the actual posterior is required to guarantee that the samples generated have exactly the desired distribution, even if the GP model is somewhat inaccurate. The method is demonstrated on a 10 dimensional problem, where 200 evaluations suffice for the generation of 100 roughly independent points from the posterior. Thus, the proposed scheme allows Bayesian treatment of models with posteriors that are computationally demanding, such as models involving computer simulation.

#### Gaussian Processes in Machine Learning

Carl Edward Rasmussen, 2004. (In Advanced Lectures on Machine Learning: ML Summer Schools 2003, Canberra, Australia, February 2 - 14, 2003, Tübingen, Germany, August 4 - 16, 2003, Revised Lectures). Edited by Olivier Bousquet, Ulrike von Luxburg, Gunnar Rätsch. Heidelberg. Springer-Verlag. Lecture Notes in Computer Science (LNCS).

We give a basic introduction to Gaussian Process regression models. We focus on understanding the role of the stochastic process and how it is used to define a distribution over functions. We present the simple equations for incorporating training data and examine how to learn the hyperparameters using the marginal likelihood. We explain the practical advantages of Gaussian Process and end with conclusions and a look at the current trends in GP work.

**Comment:** Copyright by Springer, springerlink

#### A practical Monte Carlo implementation of Bayesian learning

Carl Edward Rasmussen, 1996. (In Advances in Neural Information Processing Systems 8). Edited by D. S. Touretzky, M. C. Mozer, M. E. Hasselmo. Cambridge, MA., USA. The MIT Press.

A practical method for Bayesian training of feed-forward neural networks using sophisticated Monte Carlo methods is presented and evaluated. In reasonably small amounts of computer time this approach outperforms other state-of-the-art methods on 5 data-limited tasks from real world domains.

#### Evaluation of Gaussian Processes and other Methods for non-linear Regression

Carl Edward Rasmussen, 1996. University of Toronto, Department of Computer Science, Toronto, CANADA.

This thesis develops two Bayesian learning methods relying on Gaussian processes and a rigorous statistical approach for evaluating such methods. In these experimental designs the sources of uncertainty in the estimated generalisation performances due to both variation in training and test sets are accounted for. The framework allows for estimation of generalisation performance as well as statistical tests of significance for pairwise comparisons. Two experimental designs are recommended and supported by the DELVE software environment. Two new non-parametric Bayesian learning methods relying on Gaussian process priors over functions are developed. These priors are controlled by hyperparameters which set the characteristic length scale for each input dimension. In the simplest method, these parameters are fit from the data using optimization. In the second, fully Bayesian method, a Markov chain Monte Carlo technique is used to integrate over the hyperparameters. One advantage of these Gaussian process methods is that the priors and hyperparameters of the trained models are easy to interpret. The Gaussian process methods are benchmarked against several other methods, on regression tasks using both real data and data generated from realistic simulations. The experiments show that small datasets are unsuitable for benchmarking purposes because the uncertainties in performance measurements are large. A second set of experiments provide strong evidence that the bagging procedure is advantageous for the Multivariate Adaptive Regression Splines (MARS) method. The simulated datasets have controlled characteristics which make them useful for understanding the relationship between properties of the dataset and the performance of different methods. The dependency of the performance on available computation time is also investigated. 
It is shown that a Bayesian approach to learning in multi-layer perceptron neural networks achieves better performance than the commonly used early stopping procedure, even for reasonably short amounts of computation time. The Gaussian process methods are shown to consistently outperform the more conventional methods.

#### Modeling and Visualizing Uncertainty in Gene Expression Clusters Using Dirichlet Process Mixtures

Carl Edward Rasmussen, Bernhard J. de la Cruz, Zoubin Ghahramani, David L. Wild, 2009. (IEEE/ACM Transactions on Computational Biology and Bioinformatics). **DOI**: 10.1109/TCBB.2007.70269. **ISSN**: 1545-5963.

Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data, little attention has been paid to uncertainty in the results obtained. Dirichlet process mixture (DPM) models provide a nonparametric Bayesian alternative to the bootstrap approach to modeling uncertainty in gene expression clustering. Most previously published applications of Bayesian model-based clustering methods have been to short time series data. In this paper, we present a case study of the application of nonparametric Bayesian clustering methods to the clustering of high-dimensional non-time-series gene expression data using full Gaussian covariances. We use the probability that two genes belong to the same cluster in a DPM model as a measure of the similarity of these gene expression profiles. Conversely, this probability can be used to define a dissimilarity measure, which, for the purposes of visualization, can be input to one of the standard linkage algorithms used for hierarchical clustering. Biologically plausible results are obtained from the Rosetta compendium of expression profiles which extend previously published cluster analyses of this data.

#### Probabilistic Inference for Fast Learning in Control

Carl Edward Rasmussen, Marc Peter Deisenroth, November 2008. (In Recent Advances in Reinforcement Learning). Edited by S. Girgin, M. Loth, R. Munos, P. Preux, D. Ryabko. Villeneuve d’Ascq, France. Springer-Verlag. Lecture Notes in Computer Science (LNCS).

We provide a novel framework for very fast model-based reinforcement learning in continuous state and action spaces. The framework requires probabilistic models that explicitly characterize their levels of confidence. Within this framework, we use flexible, non-parametric models to describe the world based on previously collected experience. We demonstrate learning on the cart-pole problem in a setting where we provide very limited prior knowledge about the task. Learning progresses rapidly, and a good policy is found after only a handful of iterations.

**Comment:** videos and more. slides.

#### Occam’s Razor

Carl Edward Rasmussen, Zoubin Ghahramani, December 2001. (In Advances in Neural Information Processing Systems 13). Edited by T. G. Dietterich, T. Leen, V. Tresp. Cambridge, MA, USA. The MIT Press.

The Bayesian paradigm apparently only sometimes gives rise to Occam’s Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam’s Razor at work.

#### Infinite Mixtures of Gaussian Process Experts

Carl Edward Rasmussen, Zoubin Ghahramani, December 2002. (In Advances in Neural Information Processing Systems 14). Edited by T. G. Dietterich, S. Becker, Z. Ghahramani. Cambridge, MA, USA. The MIT Press.

We present an extension to the Mixture of Experts (ME) model, where the individual experts are Gaussian Process (GP) regression models. Using an input-dependent adaptation of the Dirichlet Process, we implement a gating network for an infinite number of Experts. Inference in this model may be done efficiently using a Markov Chain relying on Gibbs sampling. The model allows the effective covariance function to vary with the inputs, and may handle large datasets — thus potentially overcoming two of the biggest hurdles with GP models. Simulations show the viability of this approach.

#### Bayesian Monte Carlo

Carl Edward Rasmussen, Zoubin Ghahramani, December 2003. (In Advances in Neural Information Processing Systems 15). Edited by S. Becker, S. Thrun, K. Obermayer. Cambridge, MA, USA. The MIT Press.

We investigate Bayesian alternatives to classical Monte Carlo methods for evaluating integrals. Bayesian Monte Carlo (BMC) allows the incorporation of prior knowledge, such as smoothness of the integrand, into the estimation. In a simple problem we show that this outperforms any classical importance sampling method. We also attempt more challenging multidimensional integrals involved in computing marginal likelihoods of statistical models (a.k.a. partition functions and model evidences). We find that Bayesian Monte Carlo outperformed Annealed Importance Sampling, although for very high dimensional problems or problems with massive multimodality BMC may be less adequate. One advantage of the Bayesian approach to Monte Carlo is that samples can be drawn from any distribution. This allows for the possibility of active design of sample points so as to maximise information gain.

#### Gaussian processes in reinforcement learning

Carl Edward Rasmussen, Malte Kuß, December 2004. (In Advances in Neural Information Processing Systems 16). Edited by S. Thrun, L. K. Saul, B. Schölkopf. Cambridge, MA, USA. The MIT Press.

We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.

#### The DELVE manual

Carl Edward Rasmussen, Radford M. Neal, Geoffrey E. Hinton, Drew van Camp, Mike Revow, Zoubin Ghahramani, Rafal Kustra, Robert Tibshirani, 1996.

DELVE – Data for Evaluating Learning in Valid Experiments – is a collection of datasets from many sources, an environment within which this data can be used to assess the performance of methods for learning relationships from data, and a repository for the results of such experiments.

**Comment:** The delve website.

#### Gaussian Processes for Machine Learning (GPML) Toolbox

Carl Edward Rasmussen, Hannes Nickisch, December 2010. (Journal of Machine Learning Research).

The GPML toolbox provides a wide range of functionality for Gaussian process (GP) inference and prediction. GPs are specified by mean and covariance functions; we offer a library of simple mean and covariance functions and mechanisms to compose more complex ones. Several likelihood functions are supported including Gaussian and heavy-tailed for regression as well as others suitable for classification. Finally, a range of inference methods is provided, including exact and variational inference, Expectation Propagation, and Laplace’s method dealing with non-Gaussian likelihoods and FITC for dealing with large regression tasks.

**Comment:** Toolbox available from here. Implements algorithms from Rasmussen and Williams, 2006.

#### Healing the Relevance Vector Machine through Augmentation

Carl Edward Rasmussen, Joaquin Quiñonero-Candela, 2005. (In 22nd International Conference on Machine Learning). Edited by L. De Raedt, S. Wrobel. Bonn, Germany.

The Relevance Vector Machine (RVM) is a sparse approximate Bayesian kernel method. It provides full predictive distributions for test cases. However, the predictive uncertainties have the unintuitive property, that *they get smaller the further you move away from the training cases*. We give a thorough analysis. Inspired by the analogy to non-degenerate Gaussian Processes, we suggest augmentation to solve the problem. The purpose of the resulting model, RVM*, is primarily to corroborate the theoretical and experimental analysis. Although RVM* could be used in practical applications, it is no longer a truly sparse model. Experiments show that sparsity comes at the expense of worse predictive distributions.

#### Gaussian Processes for Machine Learning

Carl Edward Rasmussen, Christopher K. I. Williams, 2006. The MIT Press.

Gaussian processes (GPs) provide a principled, practical, probabilistic approach to learning in kernel machines. GPs have received increased attention in the machine-learning community over the past decade, and this book provides a long-needed systematic and unified treatment of theoretical and practical aspects of GPs in machine learning. The treatment is comprehensive and self-contained, targeted at researchers and students in machine learning and applied statistics.

**Comment:** Winner of the 2009 DeGroot Prize. Book web page, chapters and entire book pdf. GPML Toolbox.

#### Presynaptic and postsynaptic competition in models for the development of neuromuscular connections

Carl Edward Rasmussen, David J. Willshaw, 1993. (Biological Cybernetics). Springer. **DOI**: 10.1007/BF00198773.

In the establishment of connections between nerve and muscle there is an initial stage when each muscle fibre is innervated by several different motor axons. Withdrawal of connections then takes place until each fibre has contact from just a single axon. The evidence suggests that the withdrawal process involves competition between nerve terminals. We examine in formal models several types of competitive mechanism that have been proposed for this phenomenon. We show that a model which combines competition for a presynaptic resource with competition for a postsynaptic resource is superior to others. This model accounts for many anatomical and physiological findings and has a biologically plausible implementation. Intrinsic withdrawal appears to be a side effect of the competitive mechanism rather than a separate non-competitive feature. The model’s capabilities are confirmed by theoretical analysis and full scale computer simulations.

#### Gaussian Process Change Point Models

Yunus Saatçi, Ryan Turner, Carl Edward Rasmussen, June 2010. (In 27th International Conference on Machine Learning). Haifa, Israel.

We combine Bayesian online change point detection with Gaussian processes to create a nonparametric time series model which can handle change points. The model can be used to locate change points in an online manner; and, unlike other Bayesian online change point detection algorithms, is applicable when temporal correlations in a regime are expected. We show three variations on how to apply Gaussian processes in the change point context, each with their own advantages. We present methods to reduce the computational burden of these models and demonstrate them on several real world data sets.

#### Ensembling geophysical models with Bayesian Neural Networks

Ushnish Sengupta, Matt Amos, J. Scott Hosking, Carl Edward Rasmussen, Matthew P. Juniper, Paul J. Young, 2021. (In Advances in Neural Information Processing Systems 34).

Ensembles of geophysical models improve projection accuracy and express uncertainties. We develop a novel data-driven ensembling strategy for combining geophysical models using Bayesian Neural Networks, which infers spatiotemporally varying model weights and bias while accounting for heteroscedastic uncertainties in the observations. This produces more accurate and uncertainty-aware projections without sacrificing interpretability. Applied to the prediction of total column ozone from an ensemble of 15 chemistry-climate models, we find that the Bayesian neural network ensemble (BayNNE) outperforms existing ensembling methods, achieving a 49.4% reduction in RMSE for temporal extrapolation, and a 67.4% reduction in RMSE for polar data voids, compared to a weighted mean. Uncertainty is also well-characterized, with 90.6% of the data points in our extrapolation validation dataset lying within 2 standard deviations and 98.5% within 3 standard deviations.

#### Kernel Identification Through Transformers

Fergus Simpson, Ian Davies, Vidhi Lalchand, Alessandro Vullo, Nicolas Durrande, Carl Edward Rasmussen, 2021. (In Advances in Neural Information Processing Systems 34).

Kernel selection plays a central role in determining the performance of Gaussian Process (GP) models, as the chosen kernel determines both the inductive biases and prior support of functions under the GP prior. This work addresses the challenge of constructing custom kernel functions for high-dimensional GP regression models. Drawing inspiration from recent progress in deep learning, we introduce a novel approach named KITT: Kernel Identification Through Transformers. KITT exploits a transformer-based architecture to generate kernel recommendations in under 0.1 seconds, which is several orders of magnitude faster than conventional kernel search algorithms. We train our model using synthetic data generated from priors over a vocabulary of known kernels. By exploiting the nature of the self-attention mechanism, KITT is able to process datasets with inputs of arbitrary dimension. We demonstrate that kernels chosen by KITT yield strong performance over a diverse collection of regression benchmarks.

#### Marginalised Gaussian Processes with Nested Sampling

Fergus Simpson, Vidhi Lalchand, Carl Edward Rasmussen, 2021. (In Advances in Neural Information Processing Systems 34). Curran Associates, Inc.

Gaussian Process models are a rich distribution over functions with inductive biases controlled by a kernel function. Learning occurs through optimisation of the kernel hyperparameters using the marginal likelihood as the objective. This work proposes nested sampling as a means of marginalising kernel hyperparameters, because it is a technique that is well-suited to exploring complex, multi-modal distributions. We benchmark against Hamiltonian Monte Carlo on time-series and two-dimensional regression tasks, finding that a principled approach to quantifying hyperparameter uncertainty substantially improves the quality of prediction intervals.

#### Learning Depth From Stereo

Fabian Sinz, Joaquin Quiñonero-Candela, Gökhan H. Bakir, Carl Edward Rasmussen, Matthias O. Franz, September 2004. (In 26th DAGM Symposium). Edited by C. E. Rasmussen, H. H. Bülthoff, B. Schölkopf, M. A. Giese. (Pattern Recognition: Proceedings of the 26th DAGM Symposium). Berlin, Germany. Tübingen, Germany. Springer. Lecture Notes in Computer Science (LNCS).

We compare two approaches to the problem of estimating the depth of a point in space from observing its image position in two different cameras: 1. The classical photogrammetric approach explicitly models the two cameras and estimates their intrinsic and extrinsic parameters using a tedious calibration procedure; 2. A generic machine learning approach where the mapping from image to spatial coordinates is directly approximated by a Gaussian Process regression. Our results show that the generic learning approach, in addition to simplifying the procedure of calibration, can lead to higher depth accuracies than classical calibration although no specific domain knowledge is used.

#### Warped Gaussian Processes

Edward Snelson, Carl Edward Rasmussen, Zoubin Ghahramani, December 2004. (In Advances in Neural Information Processing Systems 16). Edited by S. Thrun, L. Saul, B. Schölkopf. Cambridge, MA, USA. The MIT Press. **ISBN**: 0-262-20152-6.

We generalise the Gaussian process (GP) framework for regression by learning a nonlinear transformation of the GP outputs. This allows for non-Gaussian processes and non-Gaussian noise. The learning algorithm chooses a nonlinear transformation such that transformed data is well-modelled by a GP. This can be seen as including a preprocessing transformation as an integral part of the probabilistic modelling problem, rather than as an ad-hoc step. We demonstrate on several real regression problems that learning the transformation can lead to significantly better performance than using a regular GP, or a GP with a fixed transformation.

#### Derivative observations in Gaussian Process models of dynamic systems

Ercan Solak, Roderick Murray-Smith, William E. Leithead, Douglas Leith, Carl Edward Rasmussen, December 2003. (In Advances in Neural Information Processing Systems 15). Edited by S. Becker, S. Thrun, K. Obermayer. Cambridge, MA, USA. The MIT Press.

Gaussian processes provide an approach to nonparametric modelling which allows a straightforward combination of function and derivative observations in an empirical model. This is of particular importance in identification of nonlinear dynamic systems from experimental data. 1) It allows us to combine derivative information, and associated uncertainty with normal function observations into the learning and inference process. This derivative information can be in the form of priors specified by an expert or identified from perturbation data close to equilibrium. 2) It allows a seamless fusion of multiple local linear models in a consistent manner, inferring consistent models and ensuring that integrability constraints are met. 3) It improves dramatically the computational efficiency of Gaussian process models for dynamic system identification, by summarising large quantities of near-equilibrium data by a handful of linearisations, reducing the training set size - traditionally a problem for Gaussian process models.

#### The Need for Open Source Software in Machine Learning

Sören Sonnenburg, Mikio L. Braun, Cheng Soon Ong, Samy Bengio, Leon Bottou, Geoffrey Holmes, Yann LeCun, Klaus-Robert Müller, Fernando Pereira, Carl Edward Rasmussen, Gunnar Rätsch, Bernhard Schölkopf, Alexander Smola, Pascal Vincent, Jason Weston, Robert C. Williamson, October 2007. (Journal of Machine Learning Research).

Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for diverse applications. However, the true potential of these methods is not realized, since existing implementations are not openly shared, resulting in software with low usability, and weak interoperability. We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. Additionally, we outline the problems authors are faced with when trying to publish algorithmic implementations of machine learning methods. We believe that a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community.

#### Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees

Alexander Terenin, David R. Burt, Artem Artemev, Seth Flaxman, Mark van der Wilk, Carl Edward Rasmussen, Hong Ge, 2022. (arXiv).

As Gaussian processes mature, they are increasingly being deployed as part of larger machine learning and decision-making systems, for instance in geospatial modeling, Bayesian optimization, or in latent Gaussian models. Within a system, the Gaussian process model needs to perform in a stable and reliable manner to ensure it interacts correctly with other parts of the system. In this work, we study the numerical stability of scalable sparse approximations based on inducing points. We derive sufficient and in certain cases necessary conditions on the inducing points for the computations performed to be numerically stable. For low-dimensional tasks such as geospatial modeling, we propose an automated method for computing inducing points satisfying these conditions. This is done via a modification of the cover tree data structure, which is of independent interest. We additionally propose an alternative sparse approximation for regression with a Gaussian likelihood which trades off a small amount of performance to further improve stability. We evaluate the proposed techniques on a number of examples, showing that, in geospatial settings, sparse approximations with guaranteed numerical stability often perform comparably to those without.

#### Deep Structured Mixtures of Gaussian Processes

Martin Trapp, Robert Peharz, Franz Pernkopf, Carl Edward Rasmussen, August 2020. (In 23rd International Conference on Artificial Intelligence and Statistics). Online.

Gaussian Processes (GPs) are powerful non-parametric Bayesian regression models that allow exact posterior inference, but exhibit high computational and memory costs. In order to improve scalability of GPs, approximate posterior inference is frequently employed, where a prominent class of approximation techniques is based on local GP experts. However, local-expert techniques proposed so far are either not well-principled, come with limited approximation guarantees, or lead to intractable models. In this paper, we introduce deep structured mixtures of GP experts, a stochastic process model which i) allows exact posterior inference, ii) has attractive computational and memory costs, and iii) when used as GP approximation, captures predictive uncertainties consistently better than previous expert-based approximations. In a variety of experiments, we show that deep structured mixtures have a low approximation error and often perform competitively with or outperform prior work.

#### System Identification in Gaussian Process Dynamical Systems

Ryan Turner, Marc Peter Deisenroth, Carl Edward Rasmussen, December 2009. (In NIPS Workshop on Nonparametric Bayes). Edited by Dilan Görür. Whistler, BC, Canada.

**Comment:** poster.

#### State-Space Inference and Learning with Gaussian Processes

Ryan Turner, Marc Peter Deisenroth, Carl Edward Rasmussen, May 13–15 2010. (In 13th International Conference on Artificial Intelligence and Statistics). Edited by Yee Whye Teh, Mike Titterington. Chia Laguna, Sardinia, Italy. W & CP.

State-space inference and learning with Gaussian processes (GPs) is an unsolved problem. We propose a new, general methodology for inference and learning in nonlinear state-space models that are described probabilistically by non-parametric GP models. We apply the expectation maximization algorithm to iterate between inference in the latent state-space and learning the parameters of the underlying GP dynamics model.

**Comment:** poster.

#### Model Based Learning of Sigma Points in Unscented Kalman Filtering

Ryan Turner, Carl Edward Rasmussen, August 2010. (In Machine Learning for Signal Processing (MLSP 2010)). Edited by Samuel Kaski, David J. Miller, Erkki Oja, Antti Honkela. Kittilä, Finland. **ISBN**: 978-1-4244-7876-7.

The unscented Kalman filter (UKF) is a widely used method in control and time series applications. The UKF suffers from arbitrary parameters necessary for a step known as sigma point placement, causing it to perform poorly in nonlinear problems. We show how to treat sigma point placement in a UKF as a learning problem in a model based view. We demonstrate that learning to place the sigma points correctly from data can make sigma point collapse much less likely. Learning can result in a significant increase in predictive performance over default settings of the parameters in the UKF and other filters designed to avoid the problems of the UKF, such as the GP-ADF. At the same time, we maintain a lower computational complexity than the other methods. We call our method UKF-L.

#### Model based learning of sigma points in unscented Kalman filtering

Ryan D. Turner, Carl Edward Rasmussen, 2012. (Neurocomputing). **DOI**: 10.1016/j.neucom.2011.07.029.

The unscented Kalman filter (UKF) is a widely used method in control and time series applications. The UKF suffers from arbitrary parameters necessary for sigma point placement, potentially causing it to perform poorly in nonlinear problems. We show how to treat sigma point placement in a UKF as a learning problem in a model based view. We demonstrate that learning to place the sigma points correctly from data can make sigma point collapse much less likely. Learning can result in a significant increase in predictive performance over default settings of the parameters in the UKF and other filters designed to avoid the problems of the UKF, such as the GP-ADF. At the same time, we maintain a lower computational complexity than the other methods. We call our method UKF-L.

#### Adaptive Sequential Bayesian Change Point Detection

Ryan Turner, Yunus Saatçi, Carl Edward Rasmussen, December 2009. (In NIPS Workshop on Temporal Segmentation). Edited by Zaïd Harchaoui. Whistler, BC, Canada.

Real-world time series are often nonstationary with respect to the parameters of some underlying prediction model (UPM). Furthermore, it is often desirable to adapt the UPM to incoming regime changes as soon as possible, necessitating sequential inference about change point locations. A Bayesian algorithm for online change point detection (BOCPD) has been introduced recently by Adams and MacKay (2007). In this algorithm, uncertainty about the last change point location is updated sequentially, and is integrated out to make online predictions robust to parameter changes. BOCPD requires a set of fixed hyper-parameters which allow the user to fully specify the hazard function for change points and the prior distribution over the parameters of the UPM. In practice, finding the “right” hyper-parameters can be quite difficult. We therefore extend BOCPD by introducing hyper-parameter learning, without sacrificing the online nature of the algorithm. Hyper-parameter learning is performed by optimizing the marginal likelihood of the BOCPD model, a closed-form quantity which can be computed sequentially. We illustrate performance on three real-world datasets.

#### Gaussian processes for regression

Chris K. I. Williams, Carl Edward Rasmussen, 1996. (In Advances in Neural Information Processing Systems 8). Edited by D. S. Touretzky, M. C. Mozer, M. E. Hasselmo. Cambridge, MA., USA. The MIT Press.

The Bayesian analysis of neural networks is difficult because a simple prior over weights implies a complex prior over functions. We investigate the use of a Gaussian process prior over functions, which permits the predictive Bayesian analysis for fixed values of hyperparameters to be carried out exactly using matrix operations. Two methods, using optimization and averaging (via Hybrid Monte Carlo) over hyperparameters, have been tested on a number of challenging problems and have produced excellent results.

#### Convolutional Gaussian Processes

Mark van der Wilk, Carl Edward Rasmussen, James Hensman, 2017. (In Advances in Neural Information Processing Systems 30).

We present a practical way of introducing convolutional structure into Gaussian processes, making them more suited to high-dimensional inputs like images. The main contribution of our work is the construction of an inter-domain inducing point approximation that is well-tailored to the convolutional kernel. This allows us to gain the generalisation benefit of a convolutional kernel, together with fast but accurate posterior inference. We investigate several variations of the convolutional kernel, and apply it to MNIST and CIFAR-10, which have both been known to be challenging for Gaussian processes. We also show how the marginal likelihood can be used to find an optimal weighting between convolutional and RBF kernels to further improve performance. We hope that this illustration of the usefulness of a marginal likelihood will help automate discovering architectures in larger models.

**Comment:** arXiv

#### Observations on the Nyström Method for Gaussian Process Prediction

Christopher K. I. Williams, Carl Edward Rasmussen, Anton Schwaighofer, Volker Tresp, 2002. University of Edinburgh.

A number of methods for speeding up Gaussian Process (GP) prediction have been proposed, including the Nyström method of Williams and Seeger (2001). In this paper we focus on two issues: (1) the relationship of the Nyström method to the Subset of Regressors method (Poggio and Girosi, 1990; Luo and Wahba, 1997) and (2) understanding in what circumstances the Nyström approximation would be expected to provide a good approximation to exact GP regression.