David received an MPhil in Machine Learning, Speech and Language Technology from Cambridge University in 2018. Prior to this, David received a BA in Mathematics from Williams College. David began his PhD in 2018 and is supervised by Prof. Carl Rasmussen. David is funded by a Dr. Herchel Smith Fellowship. David is interested in Bayesian nonparametrics and approximate inference.

## Publications

#### Tighter Bounds on the Log Marginal Likelihood of Gaussian Process Regression Using Conjugate Gradients

Artem Artemev, David R. Burt, Mark van der Wilk, 2021. (In 38th International Conference on Machine Learning).

Abstract▼ URL

We propose a lower bound on the log marginal likelihood of Gaussian process regression models that can be computed without matrix factorisation of the full kernel matrix. We show that approximate maximum likelihood learning of model parameters by maximising our lower bound retains many benefits of the sparse variational approach while reducing the bias introduced into hyperparameter learning. The basis of our bound is a more careful analysis of the log-determinant term appearing in the log marginal likelihood, as well as using the method of conjugate gradients to derive tight lower bounds on the term involving a quadratic form. Our approach is a step forward in unifying methods relying on lower bound maximisation (e.g. variational methods) and iterative approaches based on conjugate gradients for training Gaussian processes. In experiments, we show improved predictive performance with our model for a comparable amount of training time compared to other conjugate gradient based approaches.

#### Scalable Approximate Inference and Model Selection in Gaussian Process Regression

David R. Burt, 2022. University of Cambridge, Department of Engineering, Cambridge, UK.

Abstract▼ URL

Models with Gaussian process priors and Gaussian likelihoods are one of only a handful of Bayesian models where inference can be performed without the need for approximation. However, a frequent criticism of these models from practitioners of Bayesian machine learning is that they are challenging to scale to large datasets due to the need to compute a large kernel matrix and perform standard linear-algebraic operations with this matrix. This limitation has driven decades of research in both statistics and machine learning seeking to scale Gaussian process regression models to ever-larger datasets. This thesis builds on this line of research. We focus on the problem of approximate inference and model selection with approximate maximum marginal likelihood as applied to Gaussian process regression. Our discussion is guided by three questions: Does an approximation work on a range of models and datasets? Can you verify that an approximation has worked on a given dataset? Is an approximation easy for a practitioner to use? While we are far from the first to ask these questions, we offer new insights into each question in the context of Gaussian process regression. In the first part of this thesis, we focus on sparse variational Gaussian process regression (Titsias, 2009). We provide new diagnostics for inference with this method that can be used as practical guides for practitioners trying to balance computation and accuracy with this approximation. We then provide an asymptotic analysis that highlights properties of the model and dataset that are sufficient for this approximation to perform reliable inference with a small computational cost. This analysis builds on an approach laid out in Burt (2018), as well as on similar guarantees in the kernel ridge regression literature. In the second part of this thesis, we consider iterative methods, especially the method of conjugate gradients, as applied to Gaussian process regression (Gibbs and MacKay, 1997). We primarily focus on improving the reliability of approximate maximum marginal likelihood when using these approximations. We investigate how the method of conjugate gradients and related approaches can be used to derive bounds on quantities related to the log marginal likelihood. This idea can be used to improve the speed and stability of model selection with these approaches, making them easier to use in practice.

#### Understanding variational inference in function-space

David R. Burt, Sebastian W. Ober, Adrià Garriga-Alonso, Mark van der Wilk, 2021. (In 3rd Symposium on Advances in Approximate Bayesian Inference).

Abstract▼ URL

Recent work has attempted to directly approximate the ‘function-space’ or predictive posterior distribution of Bayesian models, without approximating the posterior distribution over the parameters. This is appealing in e.g. Bayesian neural networks, where we only need the former, and the latter is hard to represent. In this work, we highlight some advantages and limitations of employing the Kullback-Leibler divergence in this setting. For example, we show that minimizing the KL divergence between a wide class of parametric distributions and the posterior induced by a (non-degenerate) Gaussian process prior leads to an ill-defined objective function. Then, we propose (featurized) Bayesian linear regression as a benchmark for ‘function-space’ inference methods that directly measures approximation quality. We apply this methodology to assess aspects of the objective function and inference scheme considered in Sun et al. (2018), emphasizing the quality of approximation to Bayesian inference as opposed to predictive performance.

#### Rates of Convergence for Sparse Variational Gaussian Process Regression

David R Burt, Carl Edward Rasmussen, Mark van der Wilk, 2019. (arXiv).

Abstract▼ URL

Excellent variational approximations to Gaussian process posteriors have been developed which avoid the O(N3) scaling with dataset size N. They reduce the computational cost to O(NM2), with M ≪N being the number of inducing variables, which summarise the process. While the computational cost seems to be linear in N, the true complexity of the algorithm depends on how M must increase to ensure a certain quality of approximation. We address this by characterising the behavior of an upper bound on the KL divergence to the posterior. We show that with high probability the KL divergence can be made arbitrarily small by growing M more slowly than N. A particular case of interest is that for regression with normally distributed inputs in D-dimensions with the popular Squared Exponential kernel, M=O(logDN) is sufficient. Our results show that as datasets grow, Gaussian process posteriors can truly be approximated cheaply, and provide a concrete rule for how to increase M in continual learning scenarios.

#### Convergence of Sparse Variational Inference in Gaussian Processes Regression

David R. Burt, Carl Edward Rasmussen, Mark van der Wilk, 2020. (Journal of Machine Learning Research).

Abstract▼ URL

Gaussian processes are distributions over functions that are versatile and mathematically convenient priors in Bayesian modelling. However, their use is often impeded for data with large numbers of observations, N, due to the cubic (in N) cost of matrix operations used in exact inference. Many solutions have been proposed that rely on M 2). While the computational cost appears linear in N, the true complexity depends on how M must scale with N to ensure a certain quality of the approximation. In this work, we investigate upper and lower bounds on how M needs to grow with N to ensure high quality approximations. We show that we can make the KL-divergence between the approximate model and the exact posterior arbitrarily small for a Gaussian-noise regression model with M D) suffice and a method with an overall computational cost of O(N(log N)2D(log log N)2) can be used to perform inference.

#### Wide Mean-Field Bayesian Neural Networks Ignore the Data

Beau Coker, Wessel P. Bruinsma, David R. Burt, Weiwei Pan, Finale Doshi-Velez, 2022. (In 25th International Conference on Artificial Intelligence and Statistics).

Abstract▼ URL

Bayesian neural networks (BNNs) combine the expressive power of deep learning with the advantages of Bayesian formalism. In recent years, the analysis of wide, deep BNNs has provided theoretical insight into their priors and posteriors. However, we have no analogous insight into their posteriors under approximate inference. In this work, we show that mean-field variational inference entirely fails to model the data when the network width is large and the activation function is odd. Specifically, for fully-connected BNNs with odd activation functions and a homoscedastic Gaussian likelihood, we show that the optimal mean-field variational posterior predictive (i.e., function space) distribution converges to the prior predictive distribution as the width tends to infinity. We generalize aspects of this result to other likelihoods. Our theoretical results are suggestive of underfitting behavior previously observered in BNNs. While our convergence bounds are non-asymptotic and constants in our analysis can be computed, they are currently too loose to be applicable in standard training regimes. Finally, we show that the optimal approximate posterior need not tend to the prior if the activation function is not odd, showing that our statements cannot be generalized arbitrarily.

#### How Tight Can PAC-Bayes Be in the Small Data Regime?

Andrew Y. K. Foong, Wessel P. Bruinsma, David R. Burt, Richard E. Turner, 2021. (In Advances in Neural Information Processing Systems 34). Curran Associates, Inc..

Abstract▼ URL

In this paper, we investigate the question: Given a small number of datapoints, for example N = 30, how tight can PAC-Bayes and test set bounds be made? For such small datasets, test set bounds adversely affect generalisation performance by withholding data from the training procedure. In this setting, PAC-Bayes bounds are especially attractive, due to their ability to use all the data to simultaneously learn a posterior and bound its generalisation risk. We focus on the case of i.i.d. data with a bounded loss and consider the generic PAC-Bayes theorem of Germain et al. While their theorem is known to recover many existing PAC-Bayes bounds, it is unclear what the tightest bound derivable from their framework is. For a fixed learning algorithm and dataset, we show that the tightest possible bound coincides with a bound considered by Catoni; and, in the more natural case of distributions over datasets, we establish a lower bound on the best bound achievable in expectation. Interestingly, this lower bound recovers the Chernoff test set bound if the posterior is equal to the prior. Moreover, to illustrate how tight these bounds can be, we study synthetic one-dimensional classification tasks in which it is feasible to meta-learn both the prior and the form of the bound to numerically optimise for the tightest bounds possible. We ind that in this simple, controlled scenario, PAC-Bayes bounds are competitive with comparable, commonly used Chernoff test set bounds. However, the sharpest test set bounds still lead to better guarantees on the generalisation error than the PAC-Bayes bounds we consider.

#### On the Expressiveness of Approximate Inference in Bayesian Neural Networks

Andrew Foong, David Burt, Yingzhen Li, Richard Turner, 2020. (In Advances in Neural Information Processing Systems 34).

#### Bandit optimisation of functions in the Matérn kernel RKHS

David Janz, David Burt, Javier Gonzalez, 2020. (In 23rd International Conference on Artificial Intelligence and Statistics).

Abstract▼ URL

We consider the problem of optimising functions in the reproducing kernel Hilbert space (RKHS) of a Matérn kernel with smoothness parameter u over the domain [0,1]^d under noisy bandit feedback. Our contribution, the π-GP-UCB algorithm, is the first practical approach with guaranteed sublinear regret for all u gt;1 and d ≥ 1. Empirical validation suggests better performance and drastically improved computational scalablity compared with its predecessor, Improved GP-UCB.

#### Sparse Gaussian Process Hyperparameters: Optimize or Integrate?

Vidhi Lalchand, Wessel P. Bruinsma, David R. Burt, Carl E. Rasmussen, 2022. (In nips36).

Abstract▼ URL

The kernel function and its hyperparameters are the central model selection choice in a Gaussian process [Rasmussen and Williams, 2006]. Typically, the hyperparameters of the kernel are chosen by maximising the marginal likelihood, an approach known as Type-II maximum likelihood (ML-II). However, ML-II does not account for hyperparameter uncertainty, and it is well-known that this can lead to severely biased estimates and an underestimation of predictive uncertainty. While there are several works which employ a fully Bayesian characterisation of GPs, relatively few propose such approaches for the sparse GPs paradigm. In this work we propose an algorithm for sparse Gaussian process regression which leverages MCMC to sample from the hyperparameter posterior within the variational inducing point framework of [Titsias, 2009]. This work is closely related to Hensman et al. [2015b], but side-steps the need to sample the inducing points, thereby significantly improving sampling efficiency in the Gaussian likelihood case. We compare this scheme against natural baselines in literature along with stochastic variational GPs (SVGPs) along with an extensive computational analysis.

#### Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees

Alexander Terenin, David R. Burt, Artem Artemev, Seth Flaxman, Mark van der Wilk, Carl Edward Rasmussen, Hong Ge, 2022. (arXiv).

Abstract▼ URL

As Gaussian processes mature, they are increasingly being deployed as part of larger machine learning and decision-making systems, for instance in geospatial modeling, Bayesian optimization, or in latent Gaussian models. Within a system, the Gaussian process model needs to perform in a stable and reliable manner to ensure it interacts correctly with other parts the system. In this work, we study the numerical stability of scalable sparse approximations based on inducing points. We derive sufficient and in certain cases necessary conditions on the inducing points for the computations performed to be numerically stable. For low-dimensional tasks such as geospatial modeling, we propose an automated method for computing inducing points satisfying these conditions. This is done via a modification of the cover tree data structure, which is of independent interest. We additionally propose an alternative sparse approximation for regression with a Gaussian likelihood which trades off a small amount of performance to further improve stability. We evaluate the proposed techniques on a number of examples, showing that, in geospatial settings, sparse approximations with guaranteed numerical stability often perform comparably to those without.