Alex Matthews

I have recently taken up a position as a Research Scientist at DeepMind. This is my old university page.

I am a postdoctoral Research Associate in the University of Cambridge Machine Learning Group. I work with Zoubin Ghahramani, with whom I also completed my PhD. I have done work on Gaussian processes. In the area of approximate Bayesian inference, I have worked on variational methods and Markov chain Monte Carlo methods. Most recently, I have worked on probabilistic deep learning.

*A full list of my publications can be found on my Google Scholar page.*

I am one of the founding developers of GPflow, a Gaussian process library built on TensorFlow. I have also contributed C++ linear algebra ops to TensorFlow itself, for which I won a Google open source software award.

My doctoral thesis can be found here.

My GitHub page can be found here. I sometimes Tweet about statistics and machine learning. A recent version of my CV is available.

An introductory-level talk on my Gaussian process work can be found here; the slides are also available.

As an undergraduate I studied Natural Sciences at the University of Cambridge, specialising in theoretical physics. My fourth year project, which was later published, studied scattering in the fractional quantum Hall effect with Nigel Cooper. After that I worked in industry for Navetas Energy Management, a University of Oxford spin-out company which applies machine learning to the problem of home energy disaggregation.

## Publications

#### MCMC for Variationally Sparse Gaussian Processes

James Hensman, Alexander G D G Matthews, Maurizio Filippone, Zoubin Ghahramani, December 2015. (In Advances in Neural Information Processing Systems 28). Montreal, Canada.

Gaussian process (GP) models form a core part of probabilistic machine learning. Considerable research effort has been made into attacking three issues with GP models: how to compute efficiently when the number of data is large; how to approximate the posterior when the likelihood is not Gaussian and how to estimate covariance function parameter posteriors. This paper simultaneously addresses these, using a variational approximation to the posterior which is sparse in support of the function but otherwise free-form. The result is a Hybrid Monte-Carlo sampling scheme which allows for a non-Gaussian approximation over the function values and covariance parameters simultaneously, with efficient computations based on inducing-point sparse GPs. Code to replicate each experiment in this paper will be available shortly.

#### Scalable Variational Gaussian Process Classification

James Hensman, Alexander G D G Matthews, Zoubin Ghahramani, May 2015. (In 18th International Conference on Artificial Intelligence and Statistics). San Diego, California, USA.

Gaussian process classification is a popular method with a number of appealing properties. We show how to scale the model within a variational inducing point framework, out-performing the state of the art on benchmark datasets. Importantly, the variational formulation can be exploited to allow classification in problems with millions of data points, as we demonstrate in experiments.

#### Variational Bayesian dropout: pitfalls and fixes

Jiri Hron, Alexander G. D. G. Matthews, Zoubin Ghahramani, 2018. (ICML).

Dropout, a stochastic regularisation technique for training of neural networks, has recently been reinterpreted as a specific type of approximate inference algorithm for Bayesian neural networks. The main contribution of the reinterpretation is in providing a theoretical framework useful for analysing and extending the algorithm. We show that the proposed framework suffers from several issues; from undefined or pathological behaviour of the true posterior related to use of improper priors, to an ill-defined variational objective due to singularity of the approximating distribution relative to the true posterior. Our analysis of the improper log uniform prior used in variational Gaussian dropout suggests the pathologies are generally irredeemable, and that the algorithm still works only because the variational formulation annuls some of the pathologies. To address the singularity issue, we proffer Quasi-KL (QKL) divergence, a new approximate inference objective for approximation of high-dimensional distributions. We show that motivations for variational Bernoulli dropout based on discretisation and noise have QKL as a limit. Properties of QKL are studied both theoretically and on a simple practical example which shows that the QKL-optimal approximation of a full rank Gaussian with a degenerate one naturally leads to the Principal Component Analysis solution.

#### Classification using log Gaussian Cox processes

Alexander G. D. G. Matthews, Zoubin Ghahramani, 2014. (arXiv preprint arXiv:1405.4141).

McCullagh and Yang (2006) suggest a family of classification algorithms based on Cox processes. We further investigate the log Gaussian variant which has a number of appealing properties. Conditioned on the covariates, the distribution over labels is given by a type of conditional Markov random field. In the supervised case, computation of the predictive probability of a single test point scales linearly with the number of training points and the multiclass generalization is straightforward. We show new links between the supervised method and classical nonparametric methods. We give a detailed analysis of the pairwise graph representable Markov random field, which we use to extend the model to semi-supervised learning problems, and propose an inference method based on graph min-cuts. We give the first experimental analysis on supervised and semi-supervised datasets and show good empirical performance.

#### Comparing lower bounds on the entropy of mixture distributions for use in variational inference

Alexander G. D. G. Matthews, James Hensman, Zoubin Ghahramani, December 2014. (In NIPS workshop on Advances in Variational Inference). Montreal, Canada.

#### On Sparse Variational methods and the Kullback-Leibler divergence between stochastic processes

Alexander G D G Matthews, James Hensman, Richard E. Turner, Zoubin Ghahramani, May 2016. (In 19th International Conference on Artificial Intelligence and Statistics). Cadiz, Spain.

The variational framework for learning inducing variables (Titsias, 2009a) has had a large impact on the Gaussian process literature. The framework may be interpreted as minimizing a rigorously defined Kullback-Leibler divergence between the approximating and posterior processes. To our knowledge this connection has thus far gone unremarked in the literature. In this paper we give a substantial generalization of the literature on this topic. We give a new proof of the result for infinite index sets which allows inducing points that are not data points and likelihoods that depend on all function values. We then discuss augmented index sets and show that, contrary to previous works, marginal consistency of augmentation is not enough to guarantee consistency of variational inference with the original model. We then characterize an extra condition where such a guarantee is obtainable. Finally we show how our framework sheds light on interdomain sparse approximations and sparse approximations for Cox processes.

#### Gaussian process behaviour in wide deep neural networks

Alexander G. D. G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, Zoubin Ghahramani, 2018. (ICLR).

Whilst deep neural networks have shown great empirical success, there is still much work to be done to understand their theoretical properties. In this paper, we study the relationship between random, wide, fully connected, feedforward networks with more than one hidden layer and Gaussian processes with a recursive kernel definition. We show that, under broad conditions, as we make the architecture increasingly wide, the implied random function converges in distribution to a Gaussian process, formalising and extending existing results by Neal (1996) to deep networks. To evaluate convergence rates empirically, we use maximum mean discrepancy. We then compare finite Bayesian deep networks from the literature to Gaussian processes in terms of the key predictive quantities of interest, finding that in some cases the agreement can be very close. We discuss the desirability of Gaussian process behaviour and review non-Gaussian alternative models from the literature.
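As a rough illustration of what a "recursive kernel definition" can look like, here is a minimal sketch for the special case of ReLU activations, where the layer-to-layer expectation has a known closed form (the arc-cosine kernel). The function name `relu_nngp_kernel`, the choice of ReLU, and the variance values `sigma_w2` and `sigma_b2` are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def relu_nngp_kernel(X, depth, sigma_w2=2.0, sigma_b2=0.1):
    """Limiting kernel of a wide fully connected ReLU network (illustrative).

    X        : (n, d) array of inputs.
    depth    : number of hidden layers to recurse through.
    sigma_w2 : assumed weight variance (scaled by 1 / fan-in).
    sigma_b2 : assumed bias variance.
    """
    d = X.shape[1]
    # Kernel of the first layer's pre-activations.
    K = sigma_b2 + sigma_w2 * (X @ X.T) / d
    for _ in range(depth):
        std = np.sqrt(np.diag(K))
        norm = np.outer(std, std)
        # Clip to guard against rounding pushing the cosine outside [-1, 1].
        cos_theta = np.clip(K / norm, -1.0, 1.0)
        theta = np.arccos(cos_theta)
        # Closed-form E[relu(u) relu(v)] for jointly Gaussian (u, v).
        expected = norm * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2 * np.pi)
        K = sigma_b2 + sigma_w2 * expected
    return K


rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
K = relu_nngp_kernel(X, depth=3)
```

Each pass through the loop maps the kernel of one layer's pre-activations to the kernel of the next, which is the sense in which the definition is recursive.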

#### Sample-then-optimise posterior sampling for Bayesian linear models

Alexander G. D. G. Matthews, Jiri Hron, Richard E. Turner, Zoubin Ghahramani, 2017. (AABI (NeurIPS workshop)).

In modern machine learning it is common to train models which have an extremely high intrinsic capacity. The results obtained are often initialization dependent, are different for disparate optimizers and in some cases have no explicit regularization. This raises difficult questions about generalization. A natural approach to questions of generalization is a Bayesian one. There is therefore a growing literature attempting to understand how Bayesian posterior inference could emerge from the complexity of modern practice, even without having such a procedure as the stated goal. In this work we consider a simple special case where exact Bayesian posterior sampling emerges from sampling (cf. initialization) and then gradient descent. Specifically, for a Bayesian linear model, if we parameterize it in terms of a deterministic function of an isotropic normal prior, then the action of sampling from the prior followed by first order optimization of the squared loss will give a posterior sample. Although the assumptions are stronger than many real problems, it still exhibits the challenging properties of redundant model capacity and a lack of explicit regularizers, along with initialization and optimizer dependence. It is therefore an interesting controlled test case. Given its simplicity, the method itself may turn out to be of independent interest from our original goal.
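The idea can be sketched numerically in a simplified setting. The example below assumes a noiseless, overparameterized linear model with an isotropic standard normal prior on the parameters; the names `Phi`, `theta0` and `sample_then_optimise`, the problem sizes, and the step-size rule are choices made for this illustration rather than details from the paper. Under these assumptions, gradient descent on the squared loss started from a prior sample converges to the projection of that sample onto the set of interpolating solutions, which is distributed according to the exact posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
n_data, n_params = 5, 20  # more parameters than observations
Phi = rng.standard_normal((n_data, n_params))  # hypothetical design matrix
y = rng.standard_normal(n_data)                # hypothetical noiseless targets


def sample_then_optimise(theta0, Phi, y, n_steps=5000):
    """Run plain gradient descent on 0.5 * ||Phi @ theta - y||^2 from theta0."""
    # Step size 1 / L, where L is the largest eigenvalue of Phi.T @ Phi.
    step = 1.0 / np.linalg.norm(Phi, ord=2) ** 2
    theta = theta0.copy()
    for _ in range(n_steps):
        theta -= step * Phi.T @ (Phi @ theta - y)
    return theta


# Step 1: sample from the prior (this plays the role of initialization).
theta0 = rng.standard_normal(n_params)
# Step 2: optimise the squared loss with a first order method.
theta_gd = sample_then_optimise(theta0, Phi, y)

# Closed-form posterior sample built from the same prior draw:
# shift theta0 by the minimum-norm correction that fits the data.
theta_exact = theta0 + np.linalg.pinv(Phi) @ (y - Phi @ theta0)
```

Comparing `theta_gd` with `theta_exact` shows the two agree up to optimisation error, and repeating the procedure with fresh draws of `theta0` yields independent posterior samples in this simplified setting.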