# Approximate Inference and Scalable Algorithms

For all but the simplest statistical models, exact learning and inference are computationally intractable. Approximate inference methods make it possible to learn realistic models from large data sets. Generally, approximate inference methods trade off computation time for accuracy. Some of the major classes of approximate inference methods include Markov chain Monte Carlo methods, variational methods and related algorithms such as Expectation Propagation.

#### TibGM: A Transferable and Information-Based Graphical Model Approach for Reinforcement Learning

Tameem Adel, Adrian Weller, June 2019. (In 36th International Conference on Machine Learning). Long Beach.

Abstract▼ URL

One of the challenges to reinforcement learning (RL) is scalable transferability among complex tasks. Incorporating a graphical model (GM), along with the rich family of related methods, as a basis for RL frameworks provides potential to address issues such as transferability, generalisation and exploration. Here we propose a flexible GM-based RL framework which leverages efficient inference procedures to enhance generalisation and transfer power. In our proposed transferable and information-based graphical model framework ‘TibGM’, we show the equivalence between our mutual information-based objective in the GM, and an RL consolidated objective consisting of a standard reward maximisation target and a generalisation/transfer objective. In settings where there is a sparse or deceptive reward signal, our TibGM framework is flexible enough to incorporate exploration bonuses depicting intrinsic rewards. We empirically verify improved performance and exploration power.

#### Gauged Mini-Bucket Elimination for Approximate Inference

Sungsoo Ahn, Michael Chertkov, Jinwoo Shin, Adrian Weller, April 2018. (In 21st International Conference on Artificial Intelligence and Statistics). Playa Blanca, Lanzarote, Canary Islands.

Abstract▼ URL

Computing the partition function Z of a discrete graphical model is a fundamental inference challenge. Since this is computationally intractable, variational approximations are often used in practice. Recently, so-called gauge transformations were used to improve variational lower bounds on Z. In this paper, we propose a new gauge-variational approach, termed WMBE-G, which combines gauge transformations with the weighted mini-bucket elimination (WMBE) method. WMBE-G can provide both upper and lower bounds on Z, and is easier to optimize than the prior gauge-variational algorithm. We show that WMBE-G strictly improves the earlier WMBE approximation for symmetric models including Ising models with no magnetic field. Our experimental results demonstrate the effectiveness of WMBE-G even for generic, nonsymmetric models.

#### Bucket renormalization for approximate inference

Sungsoo Ahn, Michael Chertkov, Adrian Weller, Jinwoo Shin, 2018. (In 35th International Conference on Machine Learning).

Abstract▼ URL

Probabilistic graphical models are a key tool in machine learning applications. Computing the partition function, i.e., normalizing constant, is a fundamental task of statistical inference but it is generally computationally intractable, leading to extensive study of approximation methods. Iterative variational methods are a popular and successful family of approaches. However, even state of the art variational methods can return poor results or fail to converge on difficult instances. In this paper, we instead consider computing the partition function via sequential summation over variables. We develop robust approximate algorithms by combining ideas from mini-bucket elimination with tensor network and renormalization group methods from statistical physics. The resulting “convergence-free” methods show good empirical performance on both synthetic and real-world benchmark models, even for difficult instances.

#### Deep kernel processes

Laurence Aitchison, Adam X. Yang, Sebastian W. Ober, 2021. (In 38th International Conference on Machine Learning).

Abstract▼ URL

We define deep kernel processes in which positive definite Gram matrices are progressively transformed by nonlinear kernel functions and by sampling from (inverse) Wishart distributions. Remarkably, we find that deep Gaussian processes (DGPs), Bayesian neural networks (BNNs), infinite BNNs, and infinite BNNs with bottlenecks can all be written as deep kernel processes. For DGPs the equivalence arises because the Gram matrix formed by the inner product of features is Wishart distributed, and as we show, standard isotropic kernels can be written entirely in terms of this Gram matrix — we do not need knowledge of the underlying features. We define a tractable deep kernel process, the deep inverse Wishart process, and give a doubly-stochastic inducing-point variational inference scheme that operates on the Gram matrices, not on the features, as in DGPs. We show that the deep inverse Wishart process gives superior performance to DGPs and infinite BNNs on fully-connected baselines.

#### Depth Uncertainty in Neural Networks

Javier Antorán, James Urquhart Allingham, José Miguel Hernández-Lobato, 2020. (In Advances in Neural Information Processing Systems 33). Edited by Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, Hsuan-Tien Lin.

Abstract▼ URL

Existing methods for estimating uncertainty in deep learning tend to require multiple forward passes, making them unsuitable for applications where computational resources are limited. To solve this, we perform probabilistic reasoning over the depth of neural networks. Different depths correspond to subnetworks which share weights and whose predictions are combined via marginalisation, yielding model uncertainty. By exploiting the sequential structure of feed-forward networks, we are able to both evaluate our training objective and make predictions with a single forward pass. We validate our approach on real-world regression and image classification tasks. Our approach provides uncertainty calibration, robustness to dataset shift, and accuracies competitive with more computationally expensive baselines.

**Comment:** Code

#### Adapting the Linearised Laplace Model Evidence for Modern Deep Learning

Javier Antorán, David Janz, James Urquhart Allingham, Erik A. Daxberger, Riccardo Barbano, Eric T. Nalisnick, José Miguel Hernández-Lobato, 2022. (In 39th International Conference on Machine Learning). Edited by Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, Sivan Sabato. PMLR. Proceedings of Machine Learning Research.

Abstract▼ URL

The linearised Laplace method for estimating model uncertainty has received renewed attention in the Bayesian deep learning community. The method provides reliable error bars and admits a closed-form expression for the model evidence, allowing for scalable selection of model hyperparameters. In this work, we examine the assumptions behind this method, particularly in conjunction with model selection. We show that these interact poorly with some now-standard tools of deep learning–stochastic approximation methods and normalisation layers–and make recommendations for how to better adapt this classic method to the modern setting. We provide theoretical support for our recommendations and validate them empirically on MLPs, classic CNNs, residual networks with and without normalisation layers, generative autoencoders and transformers.

#### Partitioned Variational Inferece: A Framework for Probabilistic Federated Learning

Matthew Ashman, Thang D. Bui, Cuong V. Nguyen, Efstratios Markou, Adrian Weller, Siddharth Swaroop, Richard E. Turner, 2022.

Abstract▼ URL

The proliferation of computing devices has brought about an opportunity to deploy machine learning models on new problem domains using previously inaccessible data. Traditional algorithms for training such models often require data to be stored on a single machine with compute performed by a single node, making them unsuitable for decentralised training on multiple devices. This deficiency has motivated the development of federated learning algorithms, which allow multiple data owners to train collaboratively and use a shared model whilst keeping local data private. However, many of these algorithms focus on obtaining point estimates of model parameters, rather than probabilistic estimates capable of capturing model uncertainty, which is essential in many applications. Variational inference (VI) has become the method of choice for fitting many modern probabilistic models. In this paper we introduce partitioned variational inference (PVI), a general framework for performing VI in the federated setting. We develop new supporting theory for PVI, demonstrating a number of properties that make it an attractive choice for practitioners; use PVI to unify a wealth of fragmented, yet related literature; and provide empirical results that showcase the effectiveness of PVI in a variety of federated settings.

#### Sparse Gaussian process variational autoencoders

Matthew Ashman, Jonny So, Will Tebbutt, Vincent Fortuin, Michael Pearce, Richard E. Turner, 2020.

Abstract▼ URL

Large, multi-dimensional spatio-temporal datasets are omnipresent in modern science and engineering. An effective framework for handling such data are Gaussian process deep generative models (GP-DGMs), which employ GP priors over the latent variables of DGMs. Existing approaches for performing inference in GP-DGMs do not support sparse GP approximations based on inducing points, which are essential for the computational efficiency of GPs, nor do they handle missing data – a natural occurrence in many spatio-temporal datasets – in a principled manner. We address these shortcomings with the development of the sparse Gaussian process variational autoencoder (SGP-VAE), characterised by the use of partial inference networks for parameterising sparse GP approximations. Leveraging the benefits of amortised variational inference, the SGP-VAE enables inference in multi-output sparse GPs on previously unobserved data with no additional training. The SGP-VAE is evaluated in a variety of experiments where it outperforms alternative approaches including multi-output GPs and structured VAEs.

#### Lost Relatives of the Gumbel Trick

Matej Balog, Nilesh Tripuraneni, Zoubin Ghahramani, Adrian Weller, August 2017. (In 34th International Conference on Machine Learning). Sydney, Australia.

Abstract▼ URL

The Gumbel trick is a method to sample from a discrete probability distribution, or to estimate its normalizing partition function. The method relies on repeatedly applying a random perturbation to the distribution in a particular way, each time solving for the most likely configuration. We derive an entire family of related methods, of which the Gumbel trick is one member, and show that the new methods have superior properties in several settings with minimal additional computational cost. In particular, for the Gumbel trick to yield computational benefits for discrete graphical models, Gumbel perturbations on all configurations are typically replaced with so-called low-rank perturbations. We show how a subfamily of our new methods adapts to this setting, proving new upper and lower bounds on the log partition function and deriving a family of sequential samplers for the Gibbs distribution. Finally, we balance the discussion by showing how the simpler analytical form of the Gumbel trick enables additional theoretical results.

#### Understanding Probabilistic Sparse Gaussian Process Approximations

Matthias Stephan Bauer, Mark van der Wilk, Carl Edward Rasmussen, 2016. (In Advances in Neural Information Processing Systems 29).

Abstract▼ URL

Good sparse approximations are essential for practical inference in Gaussian Processes as the computational cost of exact methods is prohibitive for large datasets. The Fully Independent Training Conditional (FITC) and the Variational Free Energy (VFE) approximations are two recent popular methods. Despite superficial similarities, these approximations have surprisingly different theoretical properties and behave differently in practice. We thoroughly investigate the two methods for regression both analytically and through illustrative examples, and draw conclusions to guide practical application.

**Comment:** arXiv

#### On sparsity and overcompleteness in image models

Pietro Berkes, Richard E. Turner, Maneesh Sahani, 2008. (In nips20). Edited by J. C. Platt, D. Koller, Y. Singer, S. Roweis. mit.

Abstract▼ URL

Computational models of visual cortex, and in particular those based on sparse coding, have enjoyed much recent attention. Despite this currency, the question of how sparse or how over-complete a sparse representation should be, has gone without principled answer. Here, we use Bayesian model-selection methods to address these questions for a sparse-coding model based on a Student-t prior. Having validated our methods on toy data, we find that natural images are indeed best modelled by extremely sparse distributions; although for the Student-t prior, the associated optimal basis size is only modestly over-complete.

#### A Structured Model of Video Reproduces Primary Visual Cortical Organisation

Pietro Berkes, Richard E. Turner, Maneesh Sahani, 09 2009. (PLoS Computational Biology). Public Library of Science. **DOI**: 10.1371/journal.pcbi.1000495.

Abstract▼ URL

The visual system must learn to infer the presence of objects and features in the world from the images it encounters, and as such it must, either implicitly or explicitly, model the way these elements interact to create the image. Do the response properties of cells in the mammalian visual system reflect this constraint? To address this question, we constructed a probabilistic model in which the identity and attributes of simple visual elements were represented explicitly and learnt the parameters of this model from unparsed, natural video sequences. After learning, the behaviour and grouping of variables in the probabilistic model corresponded closely to functional and anatomical properties of simple and complex cells in the primary visual cortex (V1). In particular, feature identity variables were activated in a way that resembled the activity of complex cells, while feature attribute variables responded much like simple cells. Furthermore, the grouping of the attributes within the model closely parallelled the reported anatomical grouping of simple cells in cat V1. Thus, this generative model makes explicit an interpretation of complex and simple cells as elements in the segmentation of a visual scene into basic independent features, along with a parametrisation of their moment-by-moment appearances. We speculate that such a segmentation may form the initial stage of a hierarchical system that progressively separates the identity and appearance of more articulated visual elements, culminating in view-invariant object recognition.

#### Modelling Non-Smooth Signals with Complex Spectral Structure

Wessel P. Bruinsma, Martin Tegnér, Richard E. Turner, 2022. (In 25th International Conference on Artificial Intelligence and Statistics).

Abstract▼ URL

The Gaussian Process Convolution Model (GPCM; Tobar et al., 2015a) is a model for signals with complex spectral structure. A significant limitation of the GPCM is that it assumes a rapidly decaying spectrum: it can only model smooth signals. Moreover, inference in the GPCM currently requires (1) a mean-field assumption, resulting in poorly calibrated uncertainties, and (2) a tedious variational optimisation of large covariance matrices. We redesign the GPCM model to induce a richer distribution over the spectrum with relaxed assumptions about smoothness: the Causal Gaussian Process Convolution Model (CGPCM) introduces a causality assumption into the GPCM, and the Rough Gaussian Process Convolution Model (RGPCM) can be interpreted as a Bayesian nonparametric generalisation of the fractional Ornstein–Uhlenbeck process. We also propose a more effective variational inference scheme, going beyond the mean-field assumption: we design a Gibbs sampler which directly samples from the optimal variational solution, circumventing any variational optimisation entirely. The proposed variations of the GPCM are validated in experiments on synthetic and real-world data, showing promising results.

#### Deep Gaussian Processes for Regression using Approximate Expectation Propagation

Thang D. Bui, Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Yingzhen Li, Richard E. Turner, June 2016. (In 33rd International Conference on Machine Learning). New York, USA.

Abstract▼ URL

Deep Gaussian processes (DGPs) are multi-layer hierarchical generalisations of Gaussian processes (GPs) and are formally equivalent to neural networks with multiple, infinitely wide hidden layers. DGPs are nonparametric probabilistic models and as such are arguably more flexible, have a greater capacity to generalise, and provide better calibrated uncertainty estimates than alternative deep models. This paper develops a new approximate Bayesian learning scheme that enables DGPs to be applied to a range of medium to large scale regression problems for the first time. The new method uses an approximate Expectation Propagation procedure and a novel and efficient extension of the probabilistic backpropagation algorithm for learning. We evaluate the new method for non-linear regression on eleven real-world datasets, showing that it always outperforms GP regression and is almost always better than state-of-the-art deterministic and sampling-based approximate inference methods for Bayesian neural networks. As a by-product, this work provides a comprehensive analysis of six approximate Bayesian methods for training neural networks.

#### Streaming sparse Gaussian process approximations

Thang D. Bui, Cuong V. Nguyen, Richard E. Turner, December 2017. (In Advances in Neural Information Processing Systems 31). Long Beach, California, USA.

Abstract▼ URL

Sparse approximations for Gaussian process models provide a suite of methods that enable these models to be deployed in large data regime and enable analytic intractabilities to be sidestepped. However, the field lacks a principled method to handle streaming data in which the posterior distribution over function values and the hyperparameters are updated in an online fashion. The small number of existing approaches either use suboptimal hand-crafted heuristics for hyperparameter learning, or suffer from catastrophic forgetting or slow updating when new data arrive. This paper develops a new principled framework for deploying Gaussian process probabilistic models in the streaming setting, providing principled methods for learning hyperparameters and optimising pseudo-input locations. The proposed framework is experimentally validated using synthetic and real-world datasets.

**Comment:** The first two authors contributed equally.

#### Tree-structured Gaussian Process Approximations

Thang D. Bui, Richard E. Turner, 2014. (In Advances in Neural Information Processing Systems 28). Edited by Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, K.Q. Weinberger. Curran Associates, Inc..

Abstract▼

Gaussian process regression can be accelerated by constructing a small pseudo-dataset to summarize the observed data. This idea sits at the heart of many approximation schemes, but such an approach requires the number of pseudo-datapoints to be scaled with the range of the input space if the accuracy of the approximation is to be maintained. This presents problems in time-series settings or in spatial datasets where large numbers of pseudo-datapoints are required since computation typically scales quadratically with the pseudo-dataset size. In this paper we devise an approximation whose complexity grows linearly with the number of pseudo-datapoints. This is achieved by imposing a tree or chain structure on the pseudo-datapoints and calibrating the approximation using a Kullback-Leibler (KL) minimization. Inference and learning can then be performed efficiently using the Gaussian belief propagation algorithm. We demonstrate the validity of our approach on a set of challenging regression tasks including missing data imputation for audio and spatial datasets. We trace out the speed-accuracy trade-off for the new method and show that the frontier dominates those obtained from a large number of existing approximation techniques.

#### A Unifying Framework for Gaussian Process Pseudo-Point Approximations using Power Expectation Propagation

Thang D. Bui, Josiah Yan, Richard E. Turner, 2017. (Journal of Machine Learning Research).

Abstract▼ URL

Gaussian processes (GPs) are flexible distributions over functions that enable high-level assumptions about unknown functions to be encoded in a parsimonious, flexible and general way. Although elegant, the application of GPs is limited by computational and analytical intractabilities that arise when data are sufficiently numerous or when employing non-Gaussian models. Consequently, a wealth of GP approximation schemes have been developed over the last 15 years to address these key limitations. Many of these schemes employ a small set of pseudo data points to summarise the actual data. In this paper we develop a new pseudo-point approximation framework using Power Expectation Propagation (Power EP) that unifies a large number of these pseudo-point approximations. Unlike much of the previous venerable work in this area, the new framework is built on standard methods for approximate inference (variational free-energy, EP and Power EP methods) rather than employing approximations to the probabilistic generative model itself. In this way all of the approximation is performed at `inference time' rather than at`

modelling time’, resolving awkward philosophical and empirical questions that trouble previous approaches. Crucially, we demonstrate that the new framework includes new pseudo-point approximation methods that outperform current approaches on regression and classification tasks.

#### Understanding variational inference in function-space

David R. Burt, Sebastian W. Ober, Adrià Garriga-Alonso, Mark van der Wilk, 2021. (In 3rd Symposium on Advances in Approximate Bayesian Inference).

Abstract▼ URL

Recent work has attempted to directly approximate the ‘function-space’ or predictive posterior distribution of Bayesian models, without approximating the posterior distribution over the parameters. This is appealing in e.g. Bayesian neural networks, where we only need the former, and the latter is hard to represent. In this work, we highlight some advantages and limitations of employing the Kullback-Leibler divergence in this setting. For example, we show that minimizing the KL divergence between a wide class of parametric distributions and the posterior induced by a (non-degenerate) Gaussian process prior leads to an ill-defined objective function. Then, we propose (featurized) Bayesian linear regression as a benchmark for ‘function-space’ inference methods that directly measures approximation quality. We apply this methodology to assess aspects of the objective function and inference scheme considered in Sun et al. (2018), emphasizing the quality of approximation to Bayesian inference as opposed to predictive performance.

#### Rates of Convergence for Sparse Variational Gaussian Process Regression

David R Burt, Carl Edward Rasmussen, Mark van der Wilk, 2019. (arXiv).

Abstract▼ URL

Excellent variational approximations to Gaussian process posteriors have been developed which avoid the O(N3) scaling with dataset size N. They reduce the computational cost to O(NM2), with M ≪N being the number of inducing variables, which summarise the process. While the computational cost seems to be linear in N, the true complexity of the algorithm depends on how M must increase to ensure a certain quality of approximation. We address this by characterising the behavior of an upper bound on the KL divergence to the posterior. We show that with high probability the KL divergence can be made arbitrarily small by growing M more slowly than N. A particular case of interest is that for regression with normally distributed inputs in D-dimensions with the popular Squared Exponential kernel, M=O(logDN) is sufficient. Our results show that as datasets grow, Gaussian process posteriors can truly be approximated cheaply, and provide a concrete rule for how to increase M in continual learning scenarios.

#### Convergence of Sparse Variational Inference in Gaussian Processes Regression

David R. Burt, Carl Edward Rasmussen, Mark van der Wilk, 2020. (Journal of Machine Learning Research).

Abstract▼ URL

Gaussian processes are distributions over functions that are versatile and mathematically convenient priors in Bayesian modelling. However, their use is often impeded for data with large numbers of observations, N, due to the cubic (in N) cost of matrix operations used in exact inference. Many solutions have been proposed that rely on M 2). While the computational cost appears linear in N, the true complexity depends on how M must scale with N to ensure a certain quality of the approximation. In this work, we investigate upper and lower bounds on how M needs to grow with N to ensure high quality approximations. We show that we can make the KL-divergence between the approximate model and the exact posterior arbitrarily small for a Gaussian-noise regression model with M D) suffice and a method with an overall computational cost of O(N(log N)2D(log log N)2) can be used to perform inference.

#### Lazily Adapted Constant Kinky Inference for non-parametric regression and model-reference adaptive control

Jan-Peter Calliess, Stephen J. Roberts, Carl Edward Rasmussen, Jan Maciejowski, 2020. (Automatica). **DOI**: 10.1016/j.automatica.2020.109216.

Abstract▼

Techniques known as Nonlinear Set Membership prediction or Lipschitz Interpolation are approaches to supervised machine learning that utilise presupposed Lipschitz properties to perform inference over unobserved function values. Provided a bound on the true best Lipschitz constant of the target function is known a priori, they offer convergence guarantees, as well as bounds around the predictions. Considering a more general setting that builds on Lipschitz continuity, we propose an online method for estimating the Lipschitz constant online from function value observations that are possibly corrupted by bounded noise. Utilising this as a data-dependent hyper-parameter gives rise to a nonparametric machine learning method, for which we establish strong universal approximation guarantees. That is, we show that our prediction rule can learn any continuous function on compact support in the limit of increasingly dense data, up to a worst-case error that can be bounded by the level of observational error. We also consider applications of our nonparametric regression method to learning-based control. For a class of discrete-time settings, we establish convergence guarantees on the closed-loop tracking error of our online learning-based controllers. To provide evidence that our method can be beneficial not only in theory but also in practice, we apply it in the context of nonparametric model-reference adaptive control (MRAC). Across a range of simulated aircraft roll-dynamics and performance metrics our approach outperforms recently proposed alternatives that were based on Gaussian processes and RBF-neural networks.

#### Optimal Client Sampling for Federated Learning

Wenlin Chen, Samuel Horváth, Peter Richtárik, August 2022. (Transactions on Machine Learning Research).

Abstract▼ URL

It is well understood that client-master communication can be a primary bottleneck in federated learning (FL). In this work, we address this issue with a novel client subsampling scheme, where we restrict the number of clients allowed to communicate their updates back to the master node. In each communication round, all participating clients compute their updates, but only the ones with important updates communicate back to the master. We show that importance can be measured using only the norm of the update and give a formula for optimal client participation. This formula minimizes the distance between the full update, where all clients participate, and our limited update, where the number of participating clients is restricted. In addition, we provide a simple algorithm that approximates the optimal formula for client participation, which allows for secure aggregation and stateless clients, and thus does not compromise client privacy. We show both theoretically and empirically that for Distributed SGD (DSGD) and Federated Averaging (FedAvg), the performance of our approach can be close to full participation and superior to the baseline where participating clients are sampled uniformly. Moreover, our approach is orthogonal to and compatible with existing methods for reducing communication overhead, such as local methods and communication compression methods.

**Comment:** arXiv

#### Stochastic Flows and Geometric Optimization on the Orthogonal Group

Krzysztof Choromanski, David Cheikhi, Jared Davis, Valerii Likhosherstov, Achille Nazaret, Achraf Bahamou, Xingyou Song, Mrugank Akarte, Jack Parker-Holder, Jacob Bergquist, Yuan Gao, Aldo Pacchiano, Tamas Sarlos, Adrian Weller, Vikas Sindhwani, 2020. (In 37th International Conference on Machine Learning).

Abstract▼ URL

We present a new class of stochastic, geometrically-driven optimization algorithms on the orthogonal group O(d) and naturally reductive homogeneous manifolds obtained from the action of the rotation group SO(d). We theoretically and experimentally demonstrate that our methods can be applied in various fields of machine learning including deep, convolutional and recurrent neural networks, reinforcement learning, normalizing flows and metric learning. We show an intriguing connection between efficient stochastic optimization on the orthogonal group and graph theory (e.g. matching problem, partition functions over graphs, graph-coloring). We leverage the theory of Lie groups and provide theoretical results for the designed class of algorithms. We demonstrate broad applicability of our methods by showing strong performance on the seemingly unrelated tasks of learning world models to obtain stable policies for the most difficult Humanoid agent from OpenAI Gym and improving convolutional neural networks.

#### The Geometry of Random Features

Krzysztof Choromanski, Mark Rowland, Tamas Sarlos, Vikas Sindhwani, Richard E. Turner, Adrian Weller, April 2018. (In 21st International Conference on Artificial Intelligence and Statistics). Playa Blanca, Lanzarote, Canary Islands.

Abstract▼ URL

We present an in-depth examination of the effectiveness of radial basis function kernel (beyond Gaussian) estimators based on orthogonal random feature maps. We show that orthogonal estimators outperform state-of-the-art mechanisms that use iid sampling under weak conditions for tails of the associated Fourier distributions. We prove that for the case of many dimensions, the superiority of the orthogonal transform can be accurately measured by a property we define called the charm of the kernel, and that orthogonal random features provide optimal (in terms of mean squared error) kernel estimators. We provide the first theoretical results which explain why orthogonal random features outperform unstructured on downstream tasks such as kernel ridge regression by showing that orthogonal random features provide kernel algorithms with better spectral properties than the previous state-of-the-art. Our results enable practitioners more generally to estimate the benefits from applying orthogonal transforms.

#### The unreasonable effectiveness of structured random orthogonal embeddings

Krzysztof Choromanski, Mark Rowland, Adrian Weller, December 2017. (In Advances in Neural Information Processing Systems 31). Long Beach, California.

Abstract▼ URL

We examine a class of embeddings based on structured random matrices with orthogonal rows which can be applied in many machine learning applications including dimensionality reduction and kernel approximation. For both the Johnson-Lindenstrauss transform and the angular kernel, we show that we can select matrices yielding guaranteed improved performance in accuracy and/or speed compared to earlier methods. We introduce matrices with complex entries which give significant further accuracy improvement. We provide geometric and Markov chain-based perspectives to help understand the benefits, and empirical results which suggest that the approach is helpful in a wider range of applications.

#### Wide Mean-Field Bayesian Neural Networks Ignore the Data

Beau Coker, Wessel P. Bruinsma, David R. Burt, Weiwei Pan, Finale Doshi-Velez, 2022. (In 25th International Conference on Artificial Intelligence and Statistics).

Abstract▼ URL

Bayesian neural networks (BNNs) combine the expressive power of deep learning with the advantages of Bayesian formalism. In recent years, the analysis of wide, deep BNNs has provided theoretical insight into their priors and posteriors. However, we have no analogous insight into their posteriors under approximate inference. In this work, we show that mean-field variational inference entirely fails to model the data when the network width is large and the activation function is odd. Specifically, for fully-connected BNNs with odd activation functions and a homoscedastic Gaussian likelihood, we show that the optimal mean-field variational posterior predictive (i.e., function space) distribution converges to the prior predictive distribution as the width tends to infinity. We generalize aspects of this result to other likelihoods. Our theoretical results are suggestive of underfitting behavior previously observered in BNNs. While our convergence bounds are non-asymptotic and constants in our analysis can be computed, they are currently too loose to be applicable in standard training regimes. Finally, we show that the optimal approximate posterior need not tend to the prior if the activation function is not odd, showing that our statements cannot be generalized arbitrarily.

#### Derivation of Expectation Propagation for "Fast Gaussian process methods for point process intensity estimation"

J. P. Cunningham, 2008. Stanford University,

Abstract▼ URL

We derive the Expectation Propagation algorithm updates for approximating the posterior distribution on intensity in a conditionally inhomogeneous gamma interval process with a Gaussian Process prior (GP IGIP), a model which appeared in Cunningham, Shenoy, Sahani (2008) ICML.

#### Bayesian Deep Learning via Subnetwork Inference

Erik A. Daxberger, Eric T. Nalisnick, James Urquhart Allingham, Javier Antorán, José Miguel Hernández-Lobato, 2021. (In 32nd International Conference on Machine Learning). Edited by Marina Meila, Tong Zhang. PMLR. Proceedings of Machine Learning Research.

Abstract▼ URL

The Bayesian paradigm has the potential to solve core issues of deep neural networks such as poor calibration and data inefficiency. Alas, scaling Bayesian inference to large weight spaces often requires restrictive approximations. In this work, we show that it suffices to perform inference over a small subset of model weights in order to obtain accurate predictive posteriors. The other weights are kept as point estimates. This subnetwork inference framework enables us to use expressive, otherwise intractable, posterior approximations over such subsets. In particular, we implement subnetwork linearized Laplace: We first obtain a MAP estimate of all weights and then infer a full-covariance Gaussian posterior over a subnetwork. We propose a subnetwork selection strategy that aims to maximally preserve the model’s predictive uncertainty. Empirically, our approach is effective compared to ensembles and less expressive posterior approximations over full networks.

#### Approximate Dynamic Programming with Gaussian Processes

Marc Peter Deisenroth, Jan Peters, Carl Edward Rasmussen, June 2008. (In 2008 American Control Conference (ACC 2008)). Seattle, WA, USA.

Abstract▼ URL

In general, it is difficult to determine an optimal closed-loop policy in nonlinear control problems with continuous-valued state and control domains. Hence, approximations are often inevitable. The standard method of discretizing states and controls suffers from the curse of dimensionality and strongly depends on the chosen temporal sampling rate. The paper introduces Gaussian Process Dynamic Programming (GPDP). In GPDP, value functions in the Bellman recursion of the dynamic programming algorithm are modeled using Gaussian processes. GPDP returns an optimal state-feedback for a finite set of states. Based on these outcomes, we learn a possibly discontinuous closed-loop policy on the entire state space by switching between two independently trained Gaussian processes.

**Comment:** code.

#### Large Scale Non-parametric Inference: Data Parallelisation in the Indian Buffet Process

Finale Doshi-Velez, David Knowles, Shakir Mohamed, Zoubin Ghahramani, December 2009. (In Advances in Neural Information Processing Systems 23). Cambridge, MA, USA. The MIT Press.

Abstract▼ URL

Nonparametric Bayesian models provide a framework for flexible probabilistic modelling of complex datasets. Unfortunately, the high-dimensional averages required for Bayesian methods can be slow, especially with the unbounded representations used by nonparametric models. We address the challenge of scaling Bayesian inference to the increasingly large datasets found in real-world applications. We focus on parallelisation of inference in the Indian Buffet Process (IBP), which allows data points to have an unbounded number of sparse latent features. Our novel MCMC sampler divides a large data set between multiple processors and uses message passing to compute the global likelihoods and posteriors. This algorithm, the first parallel inference scheme for IBP-based models, scales to datasets orders of magnitude larger than have previously been possible.

#### On the Expressiveness of Approximate Inference in Bayesian Neural Networks

Andrew Foong, David Burt, Yingzhen Li, Richard Turner, 2020. (In Advances in Neural Information Processing Systems 34).

#### Deep Classifiers with Label Noise Modeling and Distance Awareness

Vincent Fortuin, Mark Collier, Florian Wenzel, James Urquhart Allingham, Jeremiah Zhe Liu, Dustin Tran, Balaji Lakshminarayanan, Jesse Berent, Rodolphe Jenatton, Effrosyni Kokiopoulou, 2022. (Transactions on Machine Learning Research).

Abstract▼ URL

Uncertainty estimation in deep learning has recently emerged as a crucial area of interest to advance reliability and robustness in safety-critical applications. While there have been many proposed methods that either focus on distance-aware model uncertainties for out-of-distribution detection or on input-dependent label uncertainties for in-distribution calibration, both of these types of uncertainty are often necessary. In this work, we propose the HetSNGP method for jointly modeling the model and data uncertainty. We show that our proposed model affords a favorable combination between these two types of uncertainty and thus outperforms the baseline methods on some challenging out-of-distribution datasets, including CIFAR-100C, ImageNet-C, and ImageNet-A. Moreover, we propose HetSNGP Ensemble, an ensembled version of our method which additionally models uncertainty over the network parameters and outperforms other ensemble baselines.

**Comment:** Code

#### Bayesian neural network priors revisited

Vincent Fortuin, Adrià Garriga-Alonso, Sebastian W. Ober, Florian Wenzel, Gunnar Rätsch, Richard E. Turner, Mark van der Wilk, Laurence Aitchison, 2022. (In 10th International Conference on Learning Representations).

Abstract▼ URL

Isotropic Gaussian priors are the de facto standard for modern Bayesian neural network inference. However, it is unclear whether these priors accurately reflect our true beliefs about the weight distributions or give optimal performance. To find better priors, we study summary statistics of neural network weights in networks trained using stochastic gradient descent (SGD). We find that convolutional neural network (CNN) and ResNet weights display strong spatial correlations, while fully connected networks (FCNNs) display heavy-tailed weight distributions. We show that building these observations into priors can lead to improved performance on a variety of image classification datasets. Surprisingly, these priors mitigate the cold posterior effect in FCNNs, but slightly increase the cold posterior effect in ResNets.

#### Bayesian generalised ensemble Markov chain Monte Carlo

Jes Frellsen, Ole Winther, Zoubin Ghahramani, Jesper Ferkinghoff-Borg, May 2016. (In 19th International Conference on Artificial Intelligence and Statistics). Cadiz, Spain.

Abstract▼

Bayesian generalised ensemble (BayesGE) is a new method that addresses two major drawbacks of standard Markov chain Monte Carlo algorithms for inference in high-dimensional probability models: inapplicability to estimate the partition function, and poor mixing properties. BayesGE uses a Bayesian approach to iteratively update the belief about the density of states (distribution of the log likelihood under the prior) for the model, with the dual purpose of enhancing the sampling efficiency and make the estimation of the partition function tractable. We benchmark BayesGE on Ising and Potts systems and show that it compares favourably to existing state-of-the-art methods.

#### Bayesian Time Series Learning with Gaussian Processes

Roger Frigola, 2015. University of Cambridge, Department of Engineering, Cambridge, UK.

Abstract▼ URL

The analysis of time series data is important in fields as disparate as the social sciences, biology, engineering or econometrics. In this dissertation, we present a number of algorithms designed to learn Bayesian nonparametric models of time series. The goal of these kinds of models is twofold. First, they aim at making predictions which quantify the uncertainty due to limitations in the quantity and the quality of the data. Second, they are flexible enough to model highly complex data whilst preventing overfitting when the data does not warrant complex models. We begin with a unifying literature review on time series models based on Gaussian processes. Then, we centre our attention on the Gaussian Process State-Space Model (GP-SSM): a Bayesian nonparametric generalisation of discrete-time nonlinear state-space models. We present a novel formulation of the GP-SSM that offers new insights into its properties. We then proceed to exploit those insights by developing new learning algorithms for the GP-SSM based on particle Markov chain Monte Carlo and variational inference. Finally, we present a filtered nonlinear auto-regressive model with a simple, robust and fast learning algorithm that makes it well suited to its application by non-experts on large datasets. Its main advantage is that it avoids the computationally expensive (and potentially difficult to tune) smoothing step that is a key part of learning nonlinear state-space models.

#### Latent Gaussian Processes for Distribution Estimation of Multivariate Categorical Data

Yarin Gal, Yutian Chen, Zoubin Ghahramani, 2015. (In Proceedings of the 32nd International Conference on Machine Learning (ICML-15)).

Abstract▼ URL

Multivariate categorical data occur in many applications of machine learning. One of the main difficulties with these vectors of categorical variables is sparsity. The number of possible observations grows exponentially with vector length, but dataset diversity might be poor in comparison. Recent models have gained significant improvement in supervised tasks with this data. These models embed observations in a continuous space to capture similarities between them. Building on these ideas we propose a Bayesian model for the unsupervised task of distribution estimation of multivariate categorical data. We model vectors of categorical variables as generated from a non-linear transformation of a continuous latent space. Non-linearity captures multi-modality in the distribution. The continuous representation addresses sparsity. Our model ties together many existing models, linking the linear categorical latent Gaussian model, the Gaussian process latent variable model, and Gaussian process classification. We derive inference for our model based on recent developments in sampling based variational inference. We show empirically that the model outperforms its linear and discrete counterparts in imputation tasks of sparse data.

#### Improving the Gaussian Process Sparse Spectrum Approximation by Representing Uncertainty in Frequency Inputs

Yarin Gal, Richard Turner, 2015. (In Proceedings of the 32nd International Conference on Machine Learning (ICML-15)).

Abstract▼ URL

Standard sparse pseudo-input approximations to the Gaussian process (GP) cannot handle complex functions well. Sparse spectrum alternatives attempt to answer this but are known to over-fit. We suggest the use of variational inference for the sparse spectrum approximation to avoid both issues. We model the covariance function with a finite Fourier series approximation and treat it as a random variable. The random covariance function has a posterior, on which a variational distribution is placed. The variational distribution transforms the random covariance function to fit the data. We study the properties of our approximate inference, compare it to alternative ones, and extend it to the distributed and stochastic domains. Our approximation captures complex functions better than standard approaches and avoids over-fitting.

#### Propagation Algorithms for Variational Bayesian Learning

Zoubin Ghahramani, Matthew J. Beal, 2000. (In NIPS). Edited by Michael J. Kearns, Sara A. Solla, David A. Cohn. The MIT Press. **ISBN**: 0-262-11245-0.

Abstract▼ URL

Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set.

#### Variational Learning for Switching State-Space Models

Zoubin Ghahramani, Geoffrey E. Hinton, 2000. (Neural Computation).

Abstract▼ URL

We introduce a new statistical model for time series which iteratively segments data into regimes with approximately linear dynamics and learns the parameters of each of these linear regimes. This model combines and generalizes two of the most widely used stochastic time series models—hidden Markov models and linear dynamical systems—and is closely related to models that are widely used in the control and econometrics literatures. It can also be derived by extending the mixture of experts neural network (Jacobs et al, 1991) to its fully dynamical version, in which both expert and gating networks are recurrent. Inferring the posterior probabilities of the hidden states of this model is computationally intractable, and therefore the exact Expectation Maximization (EM) algorithm cannot be applied. However, we present a variational approximation that maximizes a lower bound on the log likelihood and makes use of both the forward–backward recursions for hidden Markov models and the Kalman filter recursions for linear dynamical systems. We tested the algorithm both on artificial data sets and on a natural data set of respiration force from a patient with sleep apnea. The results suggest that variational approximations are a viable method for inference and learning in switching state-space models.

#### Scaling Multidimensional Gaussian Processes using Projected Additive Approximations

E. Gilboa, Yunus Saatçi, John P. Cunningham, 2013. (In 30th International Conference on Machine Learning).

Abstract▼ URL

Exact Gaussian Process (GP) regression has O(N3) runtime for data size N, making it intractable for large N. Many algorithms for improving GP scaling approximate the covariance with lower rank matrices. Other work has exploited structure inherent in particular covariance functions, including GPs with implied Markov structure, and equispaced inputs (both enable O(N) runtime). However, these GP advances have not been extended to the multidimensional input setting, despite the preponderance of multidimensional applications. This paper introduces and tests novel extensions of structured GPs to multidimensional inputs. We present new methods for additive GPs, showing a novel connection between the classic backﬁtting method and the Bayesian framework. To achieve optimal accuracy-complexity tradeoff, we extend this model with a novel variant of projection pursuit regression. Our primary result – projection pursuit Gaussian Process Regression – shows orders of magnitude speedup while preserving high accuracy. The natural second and third steps include non-Gaussian observations and higher dimensional equispaced grid methods. We introduce novel techniques to address both of these necessary directions. We thoroughly illustrate the power of these three advances on several datasets, achieving close performance to the naive Full GP at orders of magnitude less cost.

#### Meta-learning probabilistic inference for prediction

Jonathan Gordon, John Bronskill, Matthias Bauer, Sebastian Nowozin, Richard Turner, April 2019. (In 7th International Conference on Learning Representations). New Orleans.

Abstract▼ URL

This paper introduces a new framework for data efficient and versatile learning. Specifically: 1) We develop ML-PIP, a general framework for Meta-Learning approximate Probabilistic Inference for Prediction. ML-PIP extends existing probabilistic interpretations of meta-learning to cover a broad class of methods. 2) We introduce , an instance of the framework employing a flexible and versatile amortization network that takes few-shot learning datasets as inputs, with arbitrary numbers of shots, and outputs a distribution over task-specific parameters in a single forward pass. Versa substitutes optimization at test time with forward passes through inference networks, amortizing the cost of inference and relieving the need for second derivatives during training. 3) We evaluate on benchmark datasets where the method sets new state-of-the-art results, and can handle arbitrary number of shots, and for classification, arbitrary numbers of classes at train and test time. The power of the approach is then demonstrated through a challenging few-shot ShapeNet view reconstruction task.

#### Convolutional Conditional Neural Processes

Jonathan Gordon, Wessel Bruinsma, Andrew Y. K. Foong, James Requeima, Yann Dubois, Richard Turner, April 2020. (In 8th International Conference on Learning Representations). Adis Ababa.

Abstract▼ URL

We introduce the Convolutional Conditional Neural Process (ConvCNP), a new member of the Neural Process family that models translation equivariance in the data. Translation equivariance is an important inductive bias for many learning problems including time series modelling, spatial data, and images. The model embeds data sets into an infinite-dimensional function space, as opposed to finite-dimensional vector spaces. To formalize this notion, we extend the theory of neural representations of sets to include functional representations, and demonstrate that any translation-equivariant embedding can be represented using a convolutional deep-set. We evaluate ConvCNPs in several settings, demonstrating that they achieve state-of-the-art performance compared to existing NPs. We demonstrate that building in translation equivariance enables zero-shot generalization to challenging, out-of-domain tasks.

#### Neural Adaptive Sequential Monte Carlo

Shixiang Gu, Zoubin Ghahramani, Richard E. Turner, Dec 2015. (In Advances in Neural Information Processing Systems 29). Montréal CANADA.

Abstract▼ URL

Sequential Monte Carlo (SMC), or particle filtering, is a popular class of methods for sampling from an intractable target distribution using a sequence of simpler intermediate distributions. Like other importance sampling-based methods, performance is critically dependent on the proposal distribution: a bad proposal can lead to arbitrarily inaccurate estimates of the target distribution. This paper presents a new method for automatically adapting the proposal using an approximation of the Kullback-Leibler divergence between the true posterior and the proposal distribution. The method is very flexible, applicable to any parameterised proposal distribution and it supports online and batch variants. We use the new framework to adapt powerful proposal distributions with rich parameterisations based upon neural networks leading to Neural Adaptive Sequential Monte Carlo (NASMC). Experiments indicate that NASMC significantly improves inference in a non-linear state space model outperforming adaptive proposal methods including the Extended Kalman and Unscented Particle Filters. Experiments also indicate that improved inference translates into improved parameter learning when NASMC is used as a subroutine of Particle Marginal Metropolis Hastings. Finally we show that NASMC is able to train a neural network-based deep recurrent generative model achieving results that compete with the state-of-the-art for polymorphic music modelling. NASMC can be seen as bridging the gap between adaptive SMC methods and the recent work in scalable, black-box variational inference.

#### Gaussian Process Volatility Model

Yue Wu, José Miguel Hernández-Lobato, Zoubin Ghahramani, December 2014. (In Advances in Neural Information Processing Systems 28). Montreal, Canada.

Abstract▼ URL

The prediction of time-changing variances is an important task in the modeling of financial data. Standard econometric models are often limited as they assume rigid functional relationships for the evolution of the variance. Moreover, functional parameters are usually learned by maximum likelihood, which can lead to overfitting. To address these problems we introduce GP-Vol, a novel non-parametric model for time-changing variances based on Gaussian Processes. This new model can capture highly flexible functional relationships for the variances. Furthermore, we introduce a new online algorithm for fast inference in GP-Vol. This method is much faster than current offline inference procedures and it avoids overfitting problems by following a fully Bayesian approach. Experiments with financial data show that GP-Vol performs significantly better than current standard alternatives.

#### MuProp: Unbiased Backpropagation for Stochastic Neural Networks

Shixiang Gu, Sergey Levine, Ilya Sutskever, Andriy Mnih, May 2016. (In 4th International Conference on Learning Representations). San Juan PUERTO RICO.

Abstract▼ URL

Deep neural networks are powerful parametric models that can be trained efficiently using the backpropagation algorithm. Stochastic neural networks combine the power of large parametric functions with that of graphical models, which makes it possible to learn very complex distributions. However, as backpropagation is not directly applicable to stochastic networks that include discrete sampling operations within their computational graph, training such networks remains difficult. We present MuProp, an unbiased gradient estimator for stochastic networks, designed to make this task easier. MuProp improves on the likelihood-ratio estimator by reducing its variance using a control variate based on the first-order Taylor expansion of a mean-field network. Crucially, unlike prior attempts at using backpropagation for training stochastic networks, the resulting estimator is unbiased and well behaved. Our experiments on structured output prediction and discrete latent variable modeling demonstrate that MuProp yields consistently good performance across a range of difficult tasks.

#### Pruning from adaptive regularization

Lars Kai Hansen, Carl Edward Rasmussen, 1994. (Neural Computation). The MIT Press.

Abstract▼ URL

Inspired by the recent upsurge of interest in Bayesian methods we consider adaptive regularization. A generalization based scheme for adaptation of regularization parameters is introduced and compared to Bayesian regularization. We show that pruning arises naturally within both adaptive regularization schemes. As model example we have chosen the simplest possible: estimating the mean of a random variable with known variance. Marked similarities are found between the two methods in that they both involve a “noise limit”, below which they regularize with infinite weight decay, i.e., they prune. However, pruning is not always beneficial. We show explicitly that both methods in some cases may increase the generalization error. This corresponds to situations where the underlying assumptions of the regularizer are poorly matched to the environment.

#### MCMC for Variationally Sparse Gaussian Processes

James Hensman, Alexander G D G Matthews, Maurizio Filippone, Zoubin Ghahramani, December 2015. (In Advances in Neural Information Processing Systems 28). Montreal, Canada.

Abstract▼ URL

Gaussian process (GP) models form a core part of probabilistic machine learning. Considerable research effort has been made into attacking three issues with GP models: how to compute efficiently when the number of data is large; how to approximate the posterior when the likelihood is not Gaussian and how to estimate covariance function parameter posteriors. This paper simultaneously addresses these, using a variational approximation to the posterior which is sparse in support of the function but otherwise free-form. The result is a Hybrid Monte-Carlo sampling scheme which allows for a non-Gaussian approximation over the function values and covariance parameters simultaneously, with efficient computations based on inducing-point sparse GPs. Code to replicate each experiment in this paper will be available shortly.

#### Scalable Variational Gaussian Process Classification

James Hensman, Alexander G D G Matthews, Zoubin Ghahramani, May 2015. (In 18th International Conference on Artificial Intelligence and Statistics). San Diego, California, USA.

Abstract▼ URL

Gaussian process classification is a popular method with a number of appealing properties. We show how to scale the model within a variational inducing point framework, out-performing the state of the art on benchmark datasets. Importantly, the variational formulation an be exploited to allow classification in problems with millions of data points, as we demonstrate in experiments.

#### Learning Feature Selection Dependencies in Multi-task Learning

Daniel Hernández-Lobato, José Miguel Hernández-Lobato, December 2013. (In Advances in Neural Information Processing Systems 27). Lake Tahoe, California, USA.

Abstract▼ URL

A probabilistic model based on the horseshoe prior is proposed for learning de- pendencies in the process of identifying relevant features for prediction. Exact inference is intractable in this model. However, expectation propagation offers an approximate alternative. Because the process of estimating feature selection dependencies may suffer from over-fitting in the model proposed, additional data from a multi-task learning scenario are considered for induction. The same model can be used in this setting with few modifications. Furthermore, the assumptions made are less restrictive than in other multi-task methods: The different tasks must share feature selection dependencies, but can have different relevant features and model coefficients. Experiments with real and synthetic data show that this model performs better than other multi-task alternatives from the literature. The experiments also show that the model is able to induce suitable feature selection dependencies for the problems considered, only from the training data.

#### Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Pierre Dupont, July 2013. (Journal of Machine Learning Research).

Abstract▼ URL

We describe a Bayesian method for group feature selection in linear regression problems. The method is based on a generalized version of the standard spike-and-slab prior distribution which is often used for individual feature selection. Exact Bayesian inference under the prior considered is infeasible for typical regression problems. However, approximate inference can be carried out efficiently using Expectation Propagation (EP). A detailed analysis of the generalized spike-and-slab prior shows that it is well suited for regression problems that are sparse at the group level. Furthermore, this prior can be used to introduce prior knowledge about specific groups of features that are a priori believed to be more relevant. An experimental evaluation compares the performance of the proposed method with those of group LASSO, Bayesian group LASSO, automatic relevance determination and additional variants used for group feature selection. The results of these experiments show that a model based on the generalized spike-and-slab prior and the EP algorithm has state-of-the-art prediction performance in the problems analyzed. Furthermore, this model is also very useful to carry out sequential experimental design (also known as active learning), where the data instances that are most informative are iteratively included in the training set, reducing the number of instances needed to obtain a particular level of prediction accuracy.

#### Predictive Entropy Search for Efficient Global Optimization of Black-box Functions

José Miguel Hernández-Lobato, Matthew W. Hoffman, Zoubin Ghahramani, December 2014. (In Advances in Neural Information Processing Systems 28). Montreal, Canada.

Abstract▼ URL

We propose a novel information-theoretic approach for Bayesian optimization called Predictive Entropy Search (PES). At each iteration, PES selects the next evaluation point that maximizes the expected information gained with respect to the global maximum. PES codifies this intractable acquisition function in terms of the expected reduction in the differential entropy of the predictive distribution. This reformulation allows PES to obtain approximations that are both more accurate and efficient than other alternatives such as Entropy Search (ES). Furthermore, PES can easily perform a fully Bayesian treatment of the model hyperparameters while ES cannot. We evaluate PES in both synthetic and realworld applications, including optimization problems in machine learning, finance, biotechnology, and robotics. We show that the increased accuracy of PES leads to significant gains in optimization performance.

#### Stochastic Inference for Scalable Probabilistic Modeling of Binary Matrices

José Miguel Hernández-Lobato, Neil Houlsby, Zoubin Ghahramani, 2013. (In NIPS Workshop on Randomized Methods for Machine Learning).

Abstract▼ URL

Fully observed large binary matrices appear in a wide variety of contexts. To model them, probabilistic matrix factorization (PMF) methods are an attractive solution. However, current batch algorithms for PMF can be inefficient since they need to analyze the entire data matrix before producing any parameter updates. We derive an efficient stochastic inference algorithm for PMF models of fully observed binary matrices. Our method exhibits faster convergence rates than more expensive batch approaches and has better predictive performance than scalable alternatives. The proposed method includes new data subsampling strategies which produce large gains over standard uniform subsampling. We also address the task of automatically selecting the size of the minibatches of data and we propose an algorithm that adjusts this hyper-parameter in an online manner.

#### Black-Box Alpha Divergence Minimization

José Miguel Hernández-Lobato, Yingzhen Li, Mark Rowland, Thang D. Bui, Daniel Hernández-Lobato, Richard E. Turner, June 2016. (In 33rd International Conference on Machine Learning). New York USA.

Abstract▼ URL

Black-box alpha (BB-α) is a new approximate inference method based on the minimization of α-divergences. BB-α scales to large datasets because it can be implemented using stochastic gradient descent. BB-α can be applied to complex probabilistic models with little effort since it only requires as input the likelihood function and its gradients. These gradients can be easily obtained using automatic differentiation. By changing the divergence parameter α, the method is able to interpolate between variational Bayes (VB) (α→ 0) and an algorithm similar to expectation propagation (EP) (α = 1). Experiments on probit regression and neural network regression and classification problems show that BB-αwith non-standard settings of α, such as α = 0.5, usually produces better predictions than with α→ 0 (VB) or α = 1 (EP).

#### Gaussian Process Conditional Copulas with Applications to Financial Time Series

José Miguel Hernández-Lobato, James Robert Lloyds, Daniel Hernández-Lobato, December 2013. (In Advances in Neural Information Processing Systems 27). Lake Tahoe, California, USA.

Abstract▼ URL

The estimation of dependencies between multiple variables is a central problem in the analysis of financial time series. A common approach is to express these dependencies in terms of a copula function. Typically the copula function is assumed to be constant but this may be inaccurate when there are covariates that could have a large influence on the dependence structure of the data. To account for this, a Bayesian framework for the estimation of conditional copulas is proposed. In this framework the parameters of a copula are non-linearly related to some arbitrary conditioning variables. We evaluate the ability of our method to predict time-varying dependencies on several equities and currencies and observe consistent performance gains compared to static copula models and other time-varying copula methods.

#### A Scalable Gibbs Sampler for Probabilistic Entity Linking

Neil Houlsby, Massimiliano Ciaramita, 2014. (In 36th European Conference on Information Retrieval). Springer.

Abstract▼ URL

Entity linking involves labeling phrases in text with their referent entities, such as Wikipedia or Freebase entries. This task is challenging due to the large number of possible entities, in the millions, and heavy-tailed mention ambiguity. We formulate the problem in terms of probabilistic inference within a topic model, where each topic is associated with a Wikipedia article. To deal with the large number of topics we propose a novel efficient Gibbs sampling scheme which can also incorporate side information, such as the Wikipedia graph. This conceptually simple probabilistic approach achieves state-of-the-art performance in entity-linking on the Aida-CoNLL dataset.

#### Optimally-Weighted Herding is Bayesian Quadrature

Ferenc Huszár, David Duvenaud, July 2012. (In 28th Conference on Uncertainty in Artificial Intelligence). Catalina Island, California.

Abstract▼ URL

Herding and kernel herding are deterministic methods of choosing samples which summarise a probability distribution. A related task is choosing samples for estimating integrals using Bayesian quadrature. We show that the criterion minimised when selecting samples in kernel herding is equivalent to the posterior variance in Bayesian quadrature. We then show that sequential Bayesian quadrature can be viewed as a weighted version of kernel herding which achieves performance superior to any other weighted herding method. We demonstrate empirically a rate of convergence faster than O(1/N). Our results also imply an upper bound on the empirical error of the Bayesian quadrature estimate.

#### Variational Inference in Dynamical Systems

Alessandro Davide Ialongo, 2022. University of Cambridge, Department of Engineering, Cambridge, UK. **DOI**: https://doi.org/10.17863/CAM.91368.

Abstract▼ URL

Dynamical systems are a powerful formalism to analyse the world around us. Many datasets are sequential in nature, and can be described by a discrete time evolution law. We are interested in approaching the analysis of such datasets from a probabilistic perspective. We would like to maintain justified beliefs about quantities which, though useful in explaining the behaviour of a system, may not be observable, as well as about the system’s evolution itself, especially in regimes we have not yet observed in our data. The framework of statistical inference gives us the tools to do so, yet, for many systems of interest, performing inference exactly is not computationally or analytically tractable. The contribution of this thesis, then, is twofold: first, we uncover two sources of bias in existing variational inference methods applied to dynamical systems in general, and state space models whose transition function is drawn from a Gaussian process (GPSSM) in particular. We show bias can derive from assuming posteriors in non-linear systems to be jointly Gaussian, and from assuming that we can sever the dependence between latent states and transition function in state space model posteriors. Second, we propose methods to address these issues, undoing the resulting biases. We do this without compromising on computational efficiency or on the ability to scale to larger datasets and higher dimensions, compared to the methods we rectify. One method, the Markov Autoregressive Flow (Markov AF) addresses the Gaussian assumption, by providing a more flexible class of posteriors, based on normalizing flows, which can be easily evaluated, sampled, and optimised. The other method, Variationally Coupled Dynamics and Trajectories (VCDT), tackles the factorisation assumption, leveraging sparse Gaussian processes and their variational representation to reintroduce dependence between latent states and the transition function at no extra computational cost. Since the objective of inference is to maintain calibrated beliefs, if we employed approximations which are significantly biased in non-linear, noisy systems, or when there is little data available, we would have failed in our objective, as those are precisely the regimes in which uncertainty quantification is all the more important. Hence we think it is essential, if we wish to act optimally on such beliefs, to uncover, and, if possible, to correct, all sources of systematic bias in our inference methods.

#### Non-Factorised Variational Inference in Dynamical Systems

Alessandro Davide Ialongo, Mark van der Wilk, James Hensman, Carl Edward Rasmussen, December 2018. (In First Symposium on Advances in Approximate Bayesian Inference). Montreal.

Abstract▼ URL

We focus on variational inference in dynamical systems where the discrete time transition function (or evolution rule) is modelled by a Gaussian process. The dominant approach so far has been to use a factorised posterior distribution, decoupling the transition function from the system states. This is not exact in general and can lead to an overconfident posterior over the transition function as well as an overestimation of the intrinsic stochasticity of the system (process noise). We propose a new method that addresses these issues and incurs no additional computational costs.

#### Overcoming Mean-Field Approximations in Recurrent Gaussian Process Models

Alessandro Davide Ialongo, Mark van der Wilk, James Hensman, Carl Edward Rasmussen, June 2019. (In 36th International Conference on Machine Learning). Long Beach.

Abstract▼ URL

We identify a new variational inference scheme for dynamical systems whose transition function is modelled by a Gaussian process. Inference in this setting has either employed computationally intensive MCMC methods, or relied on factorisations of the variational posterior. As we demonstrate in our experiments, the factorisation between latent system states and transition function can lead to a miscalibrated posterior and to learning unnecessarily large noise terms. We eliminate this factorisation by explicitly modelling the dependence between state trajectories and the Gaussian process posterior. Samples of the latent states can then be tractably generated by conditioning on this representation. The method we obtain (VCDT: variationally coupled dynamics and trajectories) gives better predictive performance and more calibrated estimates of the transition function, yet maintains the same time and space complexities as mean-field methods. Code is available at: https://github.com/ialong/GPt.

#### Closed-form Inference and Prediction in Gaussian Process State-Space Models

Alessandro Davide Ialongo, Mark van der Wilk, Carl Edward Rasmussen, December 2017. (In NIPS Time Series Workshop 2017). Long Beach.

Abstract▼ URL

We examine an analytic variational inference scheme for the Gaussian Process State Space Model (GPSSM) - a probabilistic model for system identification and time-series modelling. Our approach performs variational inference over both the system states and the transition function. We exploit Markov structure in the true posterior, as well as an inducing point approximation to achieve linear time complexity in the length of the time series. Contrary to previous approaches, no Monte Carlo sampling is required: inference is cast as a deterministic optimisation problem. In a number of experiments, we demonstrate the ability to model non-linear dynamics in the presence of both process and observation noise as well as to impute missing information (e.g. velocities from raw positions through time), to de-noise, and to estimate the underlying dimensionality of the system. Finally, we also introduce a closed-form method for multi-step prediction, and a novel criterion for assessing the quality of our approximate posterior.

#### Categorical Reparametrization with Gumble-Softmax

Eric Jang, Shixiang Gu, Ben Poole, April 2017. (In 5th International Conference on Learning Representations). Toulon FRANCE.

Abstract▼ URL

Categorical variables are a natural choice for representing discrete structure in the world. However, stochastic neural networks rarely use categorical latent variables due to the inability to backpropagate through samples. In this work, we present an efficient gradient estimator that replaces the non-differentiable sample from a categorical distribution with a differentiable sample from a novel Gumbel-Softmax distribution. This distribution has the essential property that it can be smoothly annealed into a categorical distribution. We show that our Gumbel-Softmax estimator outperforms state-of-the-art gradient estimators on structured output prediction and unsupervised generative modeling tasks with categorical latent variables, and enables large speedups on semi-supervised classification.

#### Scalable Gaussian Process Variational Autoencoders

Metod Jazbec, Matt Ashman, Vincent Fortuin, Michael Pearce, Stephan Mandt, Gunnar Rätsch, 13–15 Apr 2021. (In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics). Edited by Arindam Banerjee, Kenji Fukumizu. Proceedings of Machine Learning Research. Proceedings of Machine Learning Research.

Abstract▼ URL

Conventional variational autoencoders fail in modeling correlations between data points due to their use of factorized priors. Amortized Gaussian process inference through GP-VAEs has led to significant improvements in this regard, but is still inhibited by the intrinsic complexity of exact GP inference. We improve the scalability of these methods through principled sparse inference approaches. We propose a new scalable GP-VAE model that outperforms existing approaches in terms of runtime and memory footprint, is easy to implement, and allows for joint end-to-end optimization of all components.

#### An Introduction to Variational Methods for Graphical Models

Michael I. Jordan, Zoubin Ghahramani, Tommi Jaakkola, Lawrence K. Saul, 1999. (Machine Learning).

Abstract▼ URL

This paper presents a tutorial introduction to the use of variational methods for inference and learning in graphical models (Bayesian networks and Markov random fields). We present a number of examples of graphical models, including the QMR-DT database, the sigmoid belief network, the Boltzmann machine, and several variants of hidden Markov models, in which it is infeasible to run exact inference algorithms. We then introduce variational methods, which exploit laws of large numbers to transform the original graphical model into a simplified graphical model in which inference is efficient. Inference in the simpified model provides bounds on probabilities of interest in the original model. We describe a general framework for generating variational transformations based on convex duality. Finally we return to the examples and demonstrate how variational algorithms can be formulated in each case.

#### Message Passing Algorithms for the Dirichlet Diffusion Tree

David A. Knowles, Jurgen Van Gael, Zoubin Ghahramani, 2011. (In 28th International Conference on Machine Learning).

Abstract▼ URL

We demonstrate efficient approximate inference for the Dirichlet Diffusion Tree (Neal, 2003), a Bayesian nonparametric prior over tree structures. Although DDTs provide a powerful and elegant approach for modeling hierarchies they haven’t seen much use to date. One problem is the computational cost of MCMC inference. We provide the first deterministic approximate inference methods for DDT models and show excellent performance compared to the MCMC alternative. We present message passing algorithms to approximate the Bayesian model evidence for a specific tree. This is used to drive sequential tree building and greedy search to find optimal tree structures, corresponding to hierarchical clusterings of the data. We demonstrate appropriate observation models for continuous and binary data. The empirical performance of our method is very close to the computationally expensive MCMC alternative on a density estimation problem, and significantly outperforms kernel density estimators.

**Comment:** web site

#### Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget

Anoop Korattikara, Yutian Chen, Max Welling, June 2014. (In 31st International Conference on Machine Learning). Beijing, China.

Abstract▼ URL

Can we make Bayesian posterior MCMC sampling more efficient when faced with very large datasets? We argue that computing the likelihood for N datapoints in the Metropolis-Hastings (MH) test to reach a single binary decision is computationally inefficient. We introduce an approximate MH rule based on a sequential hypothesis test that allows us to accept or reject samples with high confidence using only a fraction of the data required for the exact MH rule. While this method introduces an asymptotic bias, we show that this bias can be controlled and is more than offset by a decrease in variance due to our ability to draw more samples per unit of time.

**Comment:** supplementary

#### Assessing Approximate Inference for Binary Gaussian Process Classification

Malte Kuß, Carl Edward Rasmussen, 2005. (Journal of Machine Learning Research).

Abstract▼ URL

Gaussian process priors can be used to define flexible, probabilistic classification models. Unfortunately exact Bayesian inference is analytically intractable and various approximation techniques have been proposed. In this work we review and compare Laplace’s method and Expectation Propagation for approximate Bayesian inference in the binary Gaussian process classification model. We present a comprehensive comparison of the approximations, their predictive performance and marginal likelihood estimates to results obtained by MCMC sampling. We explain theoretically and corroborate empirically the advantages of Expectation Propagation compared to Laplace’s method.

#### Approximate Inference for the Loss-Calibrated Bayesian

Simon Lacoste-Julien, Ferenc Huszár, Zoubin Ghahramani, April 2011. (In 14th International Conference on Artificial Intelligence and Statistics). Edited by Geoff Gordon, David Dunson. Fort Lauderdale, FL, USA. Journal of Machine Learning Research.

Abstract▼ URL

We consider the problem of approximate inference in the context of Bayesian decision theory. Traditional approaches focus on approximating general properties of the posterior, ignoring the decision task – and associated losses – for which the posterior could be used. We argue that this can be suboptimal and propose instead to *loss-calibrate* the approximate inference methods with respect to the decision task at hand. We present a general framework rooted in Bayesian decision theory to analyze approximate inference from the perspective of losses, opening up several research directions. As a first loss-calibrated approximate inference attempt, we propose an EM-like algorithm on the Bayesian posterior risk and show how it can improve a standard approach to Gaussian process classification when losses are asymmetric.

#### Sparse Gaussian Process Hyperparameters: Optimize or Integrate?

Vidhi Lalchand, Wessel P. Bruinsma, David R. Burt, Carl E. Rasmussen, 2022. (In nips36).

Abstract▼ URL

The kernel function and its hyperparameters are the central model selection choice in a Gaussian process [Rasmussen and Williams, 2006]. Typically, the hyperparameters of the kernel are chosen by maximising the marginal likelihood, an approach known as Type-II maximum likelihood (ML-II). However, ML-II does not account for hyperparameter uncertainty, and it is well-known that this can lead to severely biased estimates and an underestimation of predictive uncertainty. While there are several works which employ a fully Bayesian characterisation of GPs, relatively few propose such approaches for the sparse GPs paradigm. In this work we propose an algorithm for sparse Gaussian process regression which leverages MCMC to sample from the hyperparameter posterior within the variational inducing point framework of [Titsias, 2009]. This work is closely related to Hensman et al. [2015b], but side-steps the need to sample the inducing points, thereby significantly improving sampling efficiency in the Gaussian likelihood case. We compare this scheme against natural baselines in literature along with stochastic variational GPs (SVGPs) along with an extensive computational analysis.

#### Dropout Inference in Bayesian Neural Networks with Alpha-divergences

Yingzhen Li, Yarin Gal, Aug 2017. (In 34th International Conference on Machine Learning). Sydney AUSTRALIA.

Abstract▼ URL

To obtain uncertainty estimates with real-world Bayesian deep learning models, practical inference approximations are needed. Dropout variational inference (VI) for example has been used for machine vision and medical applications, but VI can severely underestimates model uncertainty. Alpha-divergences are alternative divergences to VI’s KL objective, which are able to avoid VI’s uncertainty underestimation. But these are hard to use in practice: existing techniques can only use Gaussian approximating distributions, and require existing models to be changed radically, thus are of limited use for practitioners. We propose a re-parametrisation of the alpha-divergence objectives, deriving a simple inference technique which, together with dropout, can be easily implemented with existing models by simply changing the loss of the model. We demonstrate improved uncertainty estimates and accuracy compared to VI in dropout networks. We study our model’s epistemic uncertainty far away from the data using adversarial images, showing that these can be distinguished from non-adversarial images by examining our model’s uncertainty.

#### Stochastic Expectation Propagation

Yingzhen Li, José Miguel Hernández-Lobato, Richard E. Turner, Dec 2015. (In Advances in Neural Information Processing Systems 28). Montréal CANADA.

Abstract▼ URL

Expectation propagation (EP) is a deterministic approximation algorithm that is often used to perform approximate Bayesian parameter learning. EP approximates the full intractable posterior distribution through a set of local-approximations that are iteratively refined for each datapoint. EP can offer analytic and computational advantages over other approximations, such as Variational Inference (VI), and is the method of choice for a number of models. The local nature of EP appears to make it an ideal candidate for performing Bayesian learning on large models in large-scale datasets settings. However, EP has a crucial limitation in this context: the number approximating factors needs to increase with the number of data-points, N, which often entails a prohibitively large memory overhead. This paper presents an extension to EP, called stochastic expectation propagation (SEP), that maintains a global posterior approximation (like VI) but updates it in a local way (like EP ). Experiments on a number of canonical learning problems using synthetic and real-world datasets indicate that SEP performs almost as well as full EP, but reduces the memory consumption by a factor of N. SEP is therefore ideally suited to performing approximate Bayesian learning in the large model, large dataset setting.

#### Rényi Divergence Variational Inference

Yingzhen Li, Richard E. Turner, Dec 2016. (In Advances in Neural Information Processing Systems 29). Barcelona SPAIN.

Abstract▼ URL

This paper introduces the variational Rényi bound (VR) that extends traditional variational inference to Rényi’s alpha-divergences. This new family of variational methods unifies a number of existing approaches, and enables a smooth interpolation from the evidence lower-bound to the log (marginal) likelihood that is controlled by the value of alpha that parametrises the divergence. The reparameterization trick, Monte Carlo approximation and stochastic optimisation methods are deployed to obtain a tractable and unified framework for optimisation. We further consider negative alpha values and propose a novel variational inference method as a new special case in the proposed framework. Experiments on Bayesian neural networks and variational auto-encoders demonstrate the wide applicability of the VR bound.

#### Gradient Estimators for Implicit Models

Yingzhen Li, Richard E. Turner, May 2018. (In Sixth International Conference on Learning Representations). Vancouver CANADA.

Abstract▼ URL

Implicit models, which allow for the generation of samples but not for point-wise evaluation of probabilities, are omnipresent in real-world problems tackled by machine learning and a hot topic of current research. Some examples include data simulators that are widely used in engineering and scientific research, generative adversarial networks (GANs) for image synthesis, and hot-off-the-press approximate inference techniques relying on implicit distributions. The majority of existing approaches to learning implicit models rely on approximating the intractable distribution or optimisation objective for gradient- based optimisation, which is liable to produce inaccurate updates and thus poor models. This paper alleviates the need for such approximations by proposing the Stein gradient estimator, which directly estimates the score function of the implicitly defined distribution. The efficacy of the proposed estimator is empirically demonstrated by examples that include meta-learning for approximate inference and entropy regularised GANs that provide improved sample diversities.

#### Gaussian Process Vine Copulas for Multivariate Dependence

David Lopez-Paz, José Miguel Hernández-Lobato, Zoubin Ghahramani, June 2013. (In 30th International Conference on Machine Learning). Atlanta, Georgia, USA.

Abstract▼ URL

Copulas allow to learn marginal distributions separately from the multivariate dependence structure (copula) that links them together into a density function. Vine factorizations ease the learning of high-dimensional copulas by constructing a hierarchy of conditional bivariate copulas. However, to simplify inference, it is common to assume that each of these conditional bivariate copulas is independent from its conditioning variables. In this paper, we relax this assumption by discovering the latent functions that specify the shape of a conditional copula given its conditioning variables We learn these functions by following a Bayesian approach based on sparse Gaussian processes with expectation propagation for scalable, approximate inference. Experiments on real-world datasets show that, when modeling all conditional dependencies, we obtain better estimates of the underlying copula of the data.

#### Occlusive Components Analysis

Jörg Lücke, Richard E. Turner, Maneesh Sahani, Marc Henniges, 2009. (In Advances in Neural Information Processing Systems 22). Edited by Y Bengio, D Schuurmans, J Lafferty, C K I Williams, A Culotta. mit.

Abstract▼ URL

We study unsupervised learning in a probabilistic generative model for occlusion. The model uses two types of latent variables: one indicates which objects are present in the image, and the other how they are ordered in depth. This depth order then determines how the positions and appearances of the objects present, specified in the model parameters, combine to form the image. We show that the object parameters can be learnt from an unlabelled set of images in which objects occlude one another. Exact maximum-likelihood learning is intractable. However, we show that tractable approximations to Expectation Maximization (EM) can be found if the training images each contain only a small number of objects on average. In numerical experiments it is shown that these approximations recover the correct set of object parameters. Experiments on a novel version of the bars test using colored bars, and experiments on more realistic data, show that the algorithm performs well in extracting the generating causes. Experiments based on the standard bars benchmark test for object learning show that the algorithm performs well in comparison to other recent component extraction approaches. The model and the learning algorithm thus connect research on occlusion with the research field of multiple-causes component extraction methods.

#### Iterative Amortized Policy Optimization

Joseph Marino, Alexandre Piche, Alessandro Davide Ialongo, Yisong Yue, 2021. (In Advances in Neural Information Processing Systems 34). Edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, J. Wortman Vaughan. Curran Associates, Inc..

Abstract▼ URL

Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control, enabling the estimation and sampling of high-value actions. From the variational inference perspective on RL, policy networks, when used with entropy or KL regularization, are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly. However, direct amortized mappings can yield suboptimal policy estimates and restricted distributions, limiting performance and exploration. Given this perspective, we consider the more flexible class of iterative amortized optimizers. We demonstrate that the resulting technique, iterative amortized policy optimization, yields performance improvements over direct amortization on benchmark continuous control tasks.

#### Comparing lower bounds on the entropy of mixture distributions for use in variational inference

Alexander G. D. G Matthews, James Hensman, Zoubin Ghahramani, December 2014. (In NIPS workshop on Advances in Variational Inference). Montreal, Canada.

#### On Sparse Variational methods and the Kullback-Leibler divergence between stochastic processes

Alexander G D G Matthews, James Hensman, Richard E. Turner, Zoubin Ghahramani, May 2016. (In 19th International Conference on Artificial Intelligence and Statistics). Cadiz, Spain.

Abstract▼ URL

The variational framework for learning inducing variables (Titsias, 2009a) has had a large impact on the Gaussian process literature. The framework may be interpreted as minimizing a rigorously defined Kullback-Leibler divergence between the approximating and posterior processes. To our knowledge this connection has thus far gone unremarked in the literature. In this paper we give a substantial generalization of the literature on this topic. We give a new proof of the result for infinite index sets which allows inducing points that are not data points and likelihoods that depend on all function values. We then discuss augmented index sets and show that, contrary to previous works, marginal consistency of augmentation is not enough to guarantee consistency of variational inference with the original model. We then characterize an extra condition where such a guarantee is obtainable. Finally we show how our framework sheds light on interdomain sparse approximations and sparse approximations for Cox processes.

#### Addressing Bias in Active Learning with Depth Uncertainty Networks… or Not

Chelsea Murray, James Urquhart Allingham, Javier Antorán, José Miguel Hernández-Lobato, 2021. (In I (Still) Can't Believe It's Not Better! Workshop at NeurIPS 2021, Virtual Workshop, December 13, 2021). Edited by Melanie F. Pradier, Aaron Schein, Stephanie L. Hyland, Francisco J. R. Ruiz, Jessica Zosa Forde. PMLR. Proceedings of Machine Learning Research.

Abstract▼ URL

Farquhar et al. [2021] show that correcting for active learning bias with underparameterised models leads to improved downstream performance. For overparameterised models such as NNs, however, correction leads either to decreased or unchanged performance. They suggest that this is due to an “overfitting bias” which offsets the active learning bias. We show that depth uncertainty networks operate in a low overfitting regime, much like underparameterised models. They should therefore see an increase in performance with bias correction. Surprisingly, they do not. We propose that this negative result, as well as the results Farquhar et al. [2021], can be explained via the lens of the bias-variance decomposition of generalisation error.

#### Variational Continual Learning

Cuong V. Nguyen, Yingzhen Li, Thang D. Bui Richard E. Turner, May 2018. (In Sixth International Conference on Learning Representations). Vancouver CANADA.

Abstract▼ URL

This paper develops variational continual learning (VCL), a simple but general framework for continual learning that fuses online variational inference (VI) and recent advances in Monte Carlo VI for neural networks. The framework can successfully train both deep discriminative models and deep generative models in complex continual learning settings where existing tasks evolve over time and entirely new tasks emerge. Experimental results show that variational continual learning outperforms state-of-the-art continual learning methods on a variety of tasks, avoiding catastrophic forgetting in a fully automatic way.

#### Approximations for Binary Gaussian Process Classification

Hannes Nickisch, Carl Edward Rasmussen, October 2008. (Journal of Machine Learning Research).

Abstract▼ URL

We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches.

#### Global inducing point variational posteriors for Bayesian neural networks and deep Gaussian processes

Sebastian W. Ober, Laurence Aitchison, 2021. (In 38th International Conference on Machine Learning).

Abstract▼ URL

We consider the optimal approximate posterior over the top-layer weights in a Bayesian neural network for regression, and show that it exhibits strong dependencies on the lower-layer weights. We adapt this result to develop a correlated approximate posterior over the weights at all layers in a Bayesian neural network. We extend this approach to deep Gaussian processes, unifying inference in the two model classes. Our approximate posterior uses learned “global” inducing points, which are defined only at the input layer and propagated through the network to obtain inducing inputs at subsequent layers. By contrast, standard “local”, inducing point methods from the deep Gaussian process literature optimise a separate set of inducing inputs at every layer, and thus do not model correlations across layers. Our method gives state-of-the-art performance for a variational Bayesian method, without data augmentation or tempering, on CIFAR-10 of 86.7%, which is comparable to SGMCMC without tempering but with data augmentation (88% in Wenzel et al. 2020).

#### A variational approximate posterior for the deep Wishart process

Sebastian W. Ober, Laurence Aitchison, 2021. (In Advances in Neural Information Processing Systems 34).

Abstract▼ URL

Recent work introduced deep kernel processes as an entirely kernel-based alternative to NNs (Aitchison et al. 2020). Deep kernel processes flexibly learn good top-layer representations by alternately sampling the kernel from a distribution over positive semi-definite matrices and performing nonlinear transformations. A particular deep kernel process, the deep Wishart process (DWP), is of particular interest because its prior can be made equivalent to deep Gaussian process (DGP) priors for kernels that can be expressed entirely in terms of Gram matrices. However, inference in DWPs has not yet been possible due to the lack of sufficiently flexible distributions over positive semi-definite matrices. Here, we give a novel approach to obtaining flexible distributions over positive semi-definite matrices by generalising the Bartlett decomposition of the Wishart probability density. We use this new distribution to develop an approximate posterior for the DWP that includes dependency across layers. We develop a doubly-stochastic inducing-point inference scheme for the DWP and show experimentally that inference in the DWP can improve performance over doing inference in a DGP with the equivalent prior.

#### Practical Deep Learning with Bayesian Principles

Kazuki Osawa, Siddharth Swaroop, Anirudh Jain, Runa Eschenhagen, Richard E. Turner, Rio Yokota, Mohammad Emtiyaz Khan, 2019. (In Advances in Neural Information Processing Systems 33).

Abstract▼ URL

Bayesian methods promise to fix many shortcomings of deep learning, but they are impractical and rarely match the performance of standard methods, let alone improve them. In this paper, we demonstrate practical training of deep networks with natural-gradient variational inference. By applying techniques such as batch normalisation, data augmentation, and distributed training, we achieve similar performance in about the same number of epochs as the Adam optimiser, even on large datasets such as ImageNet. Importantly, the benefits of Bayesian principles are preserved: predictive probabilities are well-calibrated, uncertainties on out-of-distribution data are improved, and continual-learning performance is boosted. This work enables practical deep learning while preserving benefits of Bayesian principles. A PyTorch implementation is available as a plug-and-play optimiser.

#### Predictive automatic relevance determination by expectation propagation

Yuan (Alan) Qi, Thomas P. Minka, Rosalind W. Picard, Zoubin Ghahramani, 2004. (In ICML). Edited by Carla E. Brodley. Association for Computing Machinery. ACM International Conference Proceeding Series.

Abstract▼ URL

In many real-world classification problems the input contains a large number of potentially irrelevant features. This paper proposes a new Bayesian framework for determining the relevance of input features. This approach extends one of the most successful Bayesian methods for feature selection and sparse learning, known as Automatic Relevance Determination (ARD). ARD finds the relevance of features by optimizing the model marginal likelihood, also known as the evidence. We show that this can lead to overfitting. To address this problem, we propose Predictive ARD based on estimating the predictive performance of the classifier. While the actual leave-one-out predictive performance is generally very costly to compute, the expectation propagation (EP) algorithm proposed by Minka provides an estimate of this predictive performance as a side-effect of its iterations. We exploit this in our algorithm to do feature selection, and to select data points in a sparse Bayesian kernel classifier. Moreover, we provide two other improvements to previous algorithms, by replacing Laplace’s approximation with the generally more accurate EP, and by incorporating the fast optimization algorithm proposed by Faul and Tipping. Our experiments show that our method based on the EP estimate of predictive performance is more accurate on test data than relevance determination by optimizing the evidence.

#### A Unifying View of Sparse Approximate Gaussian Process Regression

Joaquin Quiñonero-Candela, Carl Edward Rasmussen, 2005. (Journal of Machine Learning Research).

Abstract▼ URL

We provide a new unifying view, including all existing proper probabilistic sparse approximations for Gaussian process regression. Our approach relies on expressing the effective prior which the methods are using. This allows new insights to be gained, and highlights the relationship between existing methods. It also allows for a clear theoretically justified ranking of the closeness of the known approximations to the corresponding full GPs. Finally we point directly to designs of new better sparse approximations, combining the best of the existing strategies, within attractive computational constraints.

#### Challenges and Pitfalls of Bayesian Unlearning

Ambrish Rawat, James Requeima, Wessel Bruinsma, Richard Turner, 2022. (In ICML 2022 Workshop on Updatable Machine Learning (UpML)).

Abstract▼ URL

Machine unlearning refers to the task of removing a subset of training data, thereby removing its contributions to a trained model. Approximate unlearning are one class of methods for this task which avoid the need to retrain the model from scratch on the retained data. Bayes’ rule can be used to cast approximate unlearning as an inference problem where the objective is to obtain the updated posterior by dividing out the likelihood of deleted data. However this has its own set of challenges as one often doesn’t have access to the exact posterior of the model parameters. In this work we examine the use of the Laplace approximation and Variational Inference to obtain the updated posterior. With a neural network trained for a regression task as the guiding example, we draw insights on the applicability of Bayesian unlearning in practical scenarios.

#### Scaling the Indian Buffet Process via Submodular Maximization

Colorado Reed, Zoubin Ghahramani, 2013. (In ICML). JMLR.org. JMLR Proceedings.

Abstract▼ URL

Inference for latent feature models is inherently difficult as the inference space grows exponentially with the size of the input data and number of latent features. In this work, we use Kurihara & Welling (2008)’s maximization-expectation framework to perform approximate MAP inference for linear-Gaussian latent feature models with an Indian Buffet Process (IBP) prior. This formulation yields a submodular function of the features that corresponds to a lower bound on the model evidence. By adding a constant to this function, we obtain a nonnegative submodular function that can be maximized via a greedy algorithm that obtains at least a one-third approximation to the optimal solution. Our inference method scales linearly with the size of the input data, and we show the efficacy of our method on the largest datasets currently analyzed using an IBP model.

#### On the Convergence of Bound Optimization Algorithms

Ruslan Salakhutdinov, Sam T. Roweis, Zoubin Ghahramani, 2003. (In UAI). Edited by Christopher Meek, Uffe Kjærulff. Morgan Kaufmann. **ISBN**: 0-127-05664-5.

Abstract▼ URL

Many practitioners who use EM and related algorithms complain that they are sometimes slow. When does this happen, and what can be done about it? In this paper, we study the general class of bound optimization algorithms - including EM, Iterative Scaling, Non-negative Matrix Factorization, CCCP - and their relationship to direct optimization algorithms such as gradientbased methods for parameter learning. We derive a general relationship between the updates performed by bound optimization methods and those of gradient and second-order methods and identify analytic conditions under which bound optimization algorithms exhibit quasi-Newton behavior, and under which they possess poor, first-order convergence. Based on this analysis, we consider several specific algorithms, interpret and analyze their convergence properties and provide some recipes for preprocessing input to these algorithms to yield faster convergence behavior. We report empirical results supporting our analysis and showing that simple data preprocessing can result in dramatically improved performance of bound optimizers in practice.

#### Optimization with EM and Expectation-Conjugate-Gradient

Ruslan Salakhutdinov, Sam T. Roweis, Zoubin Ghahramani, 2003. (In ICML). Edited by Tom Fawcett, Nina Mishra. AAAI Press. **ISBN**: 1-57735-189-4.

Abstract▼ URL

We show a close relationship between the Expectation- Maximization (EM) algorithm and direct optimization algorithms such as gradientbased methods for parameter learning. We identify analytic conditions under which EM exhibits Newton-like behavior, and conditions under which it possesses poor, first-order convergence. Based on this analysis, we propose two novel algorithms for maximum likelihood estimation of latent variable models, and report empirical results showing that, as predicted by theory, the proposed new algorithms can substantially outperform standard EM in terms of speed of convergence in certain cases.

#### Compact Approximations to Bayesian Predictive Distributions

Edward Snelson, Zoubin Ghahramani, August 2005. (In 22nd International Conference on Machine Learning). Bonn, Germany. Omnipress.

Abstract▼ URL

We provide a general framework for learning precise, compact, and fast representations of the Bayesian predictive distribution for a model. This framework is based on minimizing the KL divergence between the true predictive density and a suitable compact approximation. We consider various methods for doing this, both sampling based approximations, and deterministic approximations such as expectation propagation. These methods are tested on a mixture of Gaussians model for density estimation and on binary linear classification, with both synthetic data sets for visualization and several real data sets. Our results show significant reductions in prediction time and memory footprint.

#### The Complexity of Inference in Latent Dirichlet Allocation

David Sontag, Daniel M. Roy, 2011. (In Advances in Neural Information Processing Systems 24). Cambridge, MA, USA. The MIT Press.

Abstract▼ URL

We consider the computational complexity of probabilistic inference in Latent Dirichlet Allocation (LDA). First, we study the problem of finding the maximum a posteriori (MAP) assignment of topics to words, where the document’s topic distribution is integrated out. We show that, when the effective number of topics per document is small, exact inference takes polynomial time. In contrast, we show that, when a document has a large number of topics, finding the MAP assignment of topics to words in LDA is NP-hard. Next, we consider the problem of finding the MAP topic distribution for a document, where the topic-word assignments are integrated out. We show that this problem is also NP-hard. Finally, we briefly discuss the problem of sampling from the posterior, showing that this is NP-hard in one restricted setting, but leaving open the general question.

#### Latent space variational Bayes

J.M. Sung, Z. Ghahramani, S.Y. Bang, November 2008. (IEEE Transactions on Pattern Analysis and Machine Intelligence). IEEE.

Abstract▼ URL

Variational Bayesian Expectation-Maximization (VBEM), an approximate inference method for probabilistic models based on factorizing over latent variables and model parameters, has been a standard technique for practical Bayesian inference. In this paper, we introduce a more general approximate inference framework for conjugate-exponential family models, which we call Latent-Space Variational Bayes (LSVB). In this approach, we integrate out the model parameters in an exact way, leaving only the latent variables. It can be shown that the LSVB approach gives better estimates of the model evidence as well as the distribution over the latent variables than the VBEM approach, but, in practice, the distribution over the latent variables has to be approximated. As a practical implementation, we present a First-order LSVB (FoLSVB) algorithm to approximate the distribution over the latent variables. From this approximate distribution, one can also estimate the model evidence and the posterior over the model parameters. The FoLSVB algorithm is directly comparable to the VBEM algorithm and has the same computational complexity. We discuss how LSVB generalizes the recently proposed collapsed variational methods to general conjugate-exponential families. Examples based on mixtures of Gaussians and mixtures of Bernoullis with synthetic and real-world data sets are used to illustrate some advantages of our method over VBEM.

#### Second-order latent space variational Bayes for approximate Bayesian inference

J.M. Sung, Z. Ghahramani, S.Y. Bang, December 2008. (IEEE Signal Processing Letters). IEEE.

Abstract▼ URL

In this letter, we consider a variational approximate Bayesian inference framework, latent-space variational Bayes (LSVB), in the general context of conjugate-exponential family models with latent variables. In the LSVB approach, we integrate out model parameters in an exact way and then perform the variational inference over only the latent variables. It can be shown that LSVB can achieve better estimates of the model evidence as well as the distribution over the latent variables than the popular variational Bayesian expectation-maximization (VBEM). However, the distribution over the latent variables in LSVB has to be approximated in practice. As an approximate implementation of LSVB, we propose a second-order LSVB (SoLSVB) method. In particular, VBEM can be derived as a special case of a first-order approximation in LSVB. SoLSVB can capture higher order statistics neglected in VBEM and can therefore achieve a better approximation. Examples of Gaussian mixture models are used to illustrate the comparison between our method and VBEM, demonstrating the improvement.

#### Advances in Software and Spatio-Temporal Modelling with Gaussian Processes

Will Tebbutt, 2022. University of Cambridge, Department of Engineering,

Abstract▼ URL

This thesis concerns the use of Gaussian processes (GPs) as distributions over unknown functions in Machine Learning and probabilistic modeling. GPs have been found to have utility in a wide range of applications owing to their flexibility, interpretability, and tractability. I advance their use in three directions. Firstly, the abstractions upon which software is built for their use in practice. In modern GP software libraries such as GPML, GPy, GPflow, and GPyTorch, the kernel is undoubtedly the dominant abstraction. While it remains highly successful it of course has limitations, and I propose to address some of these through a complementary abstraction: affine transformations of GPs. Specifically I show how a collection of GPs, and affine transformations thereof, can themselves be treated as a single GP. This in turn leads to a design for software, including exact and approximate inference algorithms. I demonstrate the utility of this software through a collection of worked examples, focussing on models which are more cleanly and easily expressed using this new software. Secondly, I develop a new scalable approximate inference algorithm for a class of GPs commonly utilised in spatio-temporal problems. This is a setting in which GPs excel, for example enabling the incorporation of important inductive biases, and observations made at arbitrary points in time and space. However, the computation required to perform exact inference and learning in GPs scales cubically in the number of observations, necessitating approximation, to which end I combine two important complementary classes of approximation: pseudo-point and Markovian. The key contribution is the insight that a simple and useful way to combine them turns out to be well-justified. This resolves an open question in the literature, provides new insight into existing work, and a new family of approximations. The efficacy of an important member of this family is demonstrated empirically. Finally I develop a GP model and associated approximate inference techniques for the prediction of sea surface temperatures (SSTs) on decadal time scales, which are relevant when taking planning decisions which consider resilience to climate change. There remains a large degree of uncertainty as to the state of the climate on such time scales, but it is thought to be possible to reduce this by exploiting the predictability of natural variability in the climate. The developed GP-based model incorporates a key assumption used by the existing statistical models employed for decadal prediction, thus retaining a valuable inductive bias, while offering several advantages. Amongst these is the lack of need for spatial aggregation of data, which is especially relevant when data are sparse, as is the case with historical ocean SST data. In summary, this thesis contributes to the practical use of GPs through a set of abstractions that are useful in the design of software, algorithms for approximate inference in spatial-temporal settings, and their use in decadal climate prediction.

#### Combining pseudo-point and state space approximations for sum-separable Gaussian Processes

Will Tebbutt, Arno Solin, Richard E. Turner, 2021. (In Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence). Edited by Cassio de Campos, Marloes H. Maathuis. PMLR. Proceedings of Machine Learning Research.

Abstract▼ URL

Gaussian processes (GPs) are important probabilistic tools for inference and learning in spatio-temporal modelling problems such as those in climate science and epidemiology. However, existing GP approximations do not simultaneously support large numbers of off-the-grid spatial data-points and long time-series which is a hallmark of many applications. Pseudo-point approximations, one of the gold-standard methods for scaling GPs to large data sets, are well suited for handling off-the-grid spatial data. However, they cannot handle long temporal observation horizons effectively reverting to cubic computational scaling in the time dimension. State space GP approximations are well suited to handling temporal data, if the temporal GP prior admits a Markov form, leading to linear complexity in the number of temporal observations, but have a cubic spatial cost and cannot handle off-the-grid spatial data. In this work we show that there is a simple and elegant way to combine pseudo-point methods with the state space GP approximation framework to get the best of both worlds. The approach hinges on a surprising conditional independence property which applies to space–time separable GPs. We demonstrate empirically that the combined approach is more scalable and applicable to a greater range of spatio-temporal problems than either method on its own.

#### Leader stochastic gradient descent (LSGD) for distributed training of deep learning models

Yunfei Teng, Wenbo Gao, Francois Chalus, Anna Choromanska, Donald Goldfarb, Adrian Weller, December 2019. (In Advances in Neural Information Processing Systems 33). Vancouver.

Abstract▼ URL

We consider distributed optimization under communication constraints for training deep learning models. We propose a new algorithm, whose parameter updates rely on two forces: a regular gradient step, and a corrective direction dictated by the currently best-performing worker (leader). Our method differs from the parameter-averaging scheme EASGD in a number of ways: (i) our objective formulation does not change the location of stationary points compared to the original optimization problem; (ii) we avoid convergence decelerations caused by pulling local workers descending to different local minima to each other (i.e. to the average of their parameters); (iii) our update by design breaks the curse of symmetry (the phenomenon of being trapped in poorly generalizing sub-optimal solutions in symmetric non-convex landscapes); and (iv) our approach is more communication efficient since it broadcasts only parameters of the leader rather than all workers. We provide theoretical analysis of the batch version of the proposed algorithm, which we call Leader Gradient Descent (LGD), and its stochastic variant (LSGD). Finally, we implement an asynchronous version of our algorithm and extend it to the multi-leader setting, where we form groups of workers, each represented by its own local leader (the best performer in a group), and update each worker with a corrective direction comprised of two attractive forces: one to the local, and one to the global leader (the best performer among all workers). The multi-leader setting is well-aligned with current hardware architecture, where local workers forming a group lie within a single computational node and different groups correspond to different nodes. For training convolutional neural networks, we empirically demonstrate that our approach compares favorably to state-of-the-art baselines.

#### Learning Stationary Time Series using Gaussian Process with Nonparametric Kernels

Felipe Tobar, Thang D. Bui, Richard E. Turner, Dec 2015. (In Advances in Neural Information Processing Systems 29). Montréal CANADA.

Abstract▼ URL

We introduce the Gaussian Process Convolution Model (GPCM), a two-stage nonparametric generative procedure to model stationary signals as the convolution between a continuous-time white-noise process and a continuous-time linear filter drawn from Gaussian process. The GPCM is a continuous-time nonparametricwindow moving average process and, conditionally, is itself a Gaussian process with a nonparametric kernel defined in a probabilistic fashion. The generative model can be equivalently considered in the frequency domain, where the power spectral density of the signal is specified using a Gaussian process. One of the main contributions of the paper is to develop a novel variational freeenergy approach based on inter-domain inducing variables that efficiently learns the continuous-time linear filter and infers the driving white-noise process. In turn, this scheme provides closed-form probabilistic estimates of the covariance kernel and the noise-free signal both in denoising and prediction scenarios. Additionally, the variational inference procedure provides closed-form expressions for the approximate posterior of the spectral density given the observed data, leading to new Bayesian nonparametric approaches to spectrum estimation. The proposed GPCM is validated using synthetic and real-world signals.

#### Statistical Models for Natural Sounds

Richard E. Turner, 2010. Gatsby Computational Neuroscience Unit, UCL,

Abstract▼ URL

It is important to understand the rich structure of natural sounds in order to solve important tasks, like automatic speech recognition, and to understand auditory processing in the brain. This thesis takes a step in this direction by characterising the statistics of simple natural sounds. We focus on the statistics because perception often appears to depend on them, rather than on the raw waveform. For example the perception of auditory textures, like running water, wind, fire and rain, depends on summary-statistics, like the rate of falling rain droplets, rather than on the exact details of the physical source. In order to analyse the statistics of sounds accurately it is necessary to improve a number of traditional signal processing methods, including those for amplitude demodulation, time-frequency analysis, and sub-band demodulation. These estimation tasks are ill-posed and therefore it is natural to treat them as Bayesian inference problems. The new probabilistic versions of these methods have several advantages. For example, they perform more accurately on natural signals and are more robust to noise, they can also fill-in missing sections of data, and provide error-bars. Furthermore, free-parameters can be learned from the signal. Using these new algorithms we demonstrate that the energy, sparsity, modulation depth and modulation time-scale in each sub-band of a signal are critical statistics, together with the dependencies between the sub-band modulators. In order to validate this claim, a model containing co-modulated coloured noise carriers is shown to be capable of generating a range of realistic sounding auditory textures. Finally, we explored the connection between the statistics of natural sounds and perception. We demonstrate that inference in the model for auditory textures qualitatively replicates the primitive grouping rules that listeners use to understand simple acoustic scenes. This suggests that the auditory system is optimised for the statistics of natural sounds.

#### Probabilistic Amplitude Demodulation

Richard E. Turner, M Sahani, 2007. (In 7th International Conference on Independent Component Analysis and Signal Separation).

Abstract▼ URL

Auditory scene analysis is extremely challenging. One approach, perhaps that adopted by the brain, is to shape useful representations of sounds on prior knowledge about their statistical structure. For example, sounds with harmonic sections are common and so time-frequency representations are efficient. Most current representations concentrate on the shorter components. Here, we propose representations for structures on longer time-scales, like the phonemes and sentences of speech. We decompose a sound into a product of processes, each with its own characteristic time-scale. This demodulation cascade relates to classical amplitude demodulation, but traditional algorithms fail to realise the representation fully. A new approach, probabilistic amplitude demodulation, is shown to out-perform the established methods, and to easily extend to representation of a full demodulation cascade.

#### Modeling natural sounds with modulation cascade processes

Richard E. Turner, Maneesh Sahani, 2008. (In nips20). Edited by J. C. Platt, D. Koller, Y. Singer, S. Roweis. mit.

Abstract▼ URL

Natural sounds are structured on many time-scales. A typical segment of speech, for example, contains features that span four orders of magnitude: Sentences (∼1s); phonemes (∼10−1 s); glottal pulses (∼ 10−2s); and formants (∼ 10−3s). The auditory system uses information from each of these time-scales to solve complicated tasks such as auditory scene analysis [1]. One route toward understanding how auditory processing accomplishes this analysis is to build neuroscience-inspired algorithms which solve similar tasks and to compare the properties of these algorithms with properties of auditory processing. There is however a discord: Current machine-audition algorithms largely concentrate on the shorter time-scale structures in sounds, and the longer structures are ignored. The reason for this is two-fold. Firstly, it is a difficult technical problem to construct an algorithm that utilises both sorts of information. Secondly, it is computationally demanding to simultaneously process data both at high resolution (to extract short temporal information) and for long duration (to extract long temporal information). The contribution of this work is to develop a new statistical model for natural sounds that captures structure across a wide range of time-scales, and to provide efficient learning and inference algorithms. We demonstrate the success of this approach on a missing data task.

#### Statistical inference for single- and multi-band probabilistic amplitude demodulation.

Richard E. Turner, Maneesh Sahani, 2010. (In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)).

Abstract▼ URL

Amplitude demodulation is an ill-posed problem and so it is natural to treat it from a Bayesian viewpoint, inferring the most likely carrier and envelope under probabilistic constraints. One such treatment is Probabilistic Amplitude Demodulation (PAD), which, whilst computationally more intensive than traditional approaches, offers several advantages. Here we provide methods for estimating the uncertainty in the PAD-derived envelopes and carriers, and for learning free-parameters like the time-scale of the envelope. We show how the probabilistic approach can naturally handle noisy and missing data. Finally, we indicate how to extend the model to signals which contain multiple modulators and carriers.

#### Two problems with variational expectation maximisation for time-series models

Richard E. Turner, Maneesh Sahani, 2011. (In Bayesian Time series models). Edited by D. Barber, T. Cemgil, S. Chiappa. Cambridge University Press.

Abstract▼ URL

Variational methods are a key component of the approximate inference and learning toolbox. These methods fill an important middle ground, retaining distributional information about uncertainty in latent variables, unlike maximum a posteriori methods (MAP), and yet generally requiring less computational time than Monte Carlo Markov Chain methods. In particular the variational Expectation Maximisation (vEM) and variational Bayes algorithms, both involving variational optimisation of a free-energy, are widely used in time-series modelling. Here, we investigate the success of vEM in simple probabilistic time-series models. First we consider the inference step of vEM, and show that a consequence of the well-known compactness property of variational inference is a failure to propagate uncertainty in time, thus limiting the usefulness of the retained distributional information. In particular, the uncertainty may appear to be smallest precisely when the approximation is poorest. Second, we consider parameter learning and analytically reveal systematic biases in the parameters found by vEM. Surprisingly, simpler variational approximations (such a mean-field) can lead to less bias than more complicated structured approximations.

#### Clamping Variables and Approximate Inference

Adrian Weller, Tony Jebara, 2014. (In Advances in Neural Information Processing Systems 28). Edited by Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, K.Q. Weinberger. Curran Associates, Inc..

Abstract▼ URL

It was recently proved using graph covers (Ruozzi, 2012) that the Bethe partition function is upper bounded by the true partition function for a binary pairwise model that is attractive. Here we provide a new, arguably simpler proof from first principles. We make use of the idea of clamping a variable to a particular value. For an attractive model, we show that summing over the Bethe partition functions for each sub-model obtained after clamping any variable can only raise (and hence improve) the approximation. In fact, we derive a stronger result that may have other useful implications. Repeatedly clamping until we obtain a model with no cycles, where the Bethe approximation is exact, yields the result. We also provide a related lower bound on a broad class of approximate partition functions of general pairwise multi-label models that depends only on the topology. We demonstrate that clamping a few wisely chosen variables can be of practical value by dramatically reducing approximation error.

**Comment:** Supplementary Material

#### Tree-based inference for Dirichlet process mixtures

Yang Xu, Katherine A. Heller, Zoubin Ghahramani, April 2009. (In 12th International Conference on Artificial Intelligence and Statistics). Edited by D. van Dyk, M. Welling. Clearwater Beach, FL, USA. Microtome Publishing (paper), Journal of Machine Learning Research (online). **Note**: ISSN 1938-7228.

Abstract▼ URL

The Dirichlet process mixture (DPM) is a widely used model for clustering and for general nonparametric Bayesian density estimation. Unfortunately, like in many statistical models, exact inference in a DPM is intractable, and approximate methods are needed to perform efficient inference. While most attention in the literature has been placed on Markov chain Monte Carlo (MCMC) [1, 2, 3], variational Bayesian (VB) [4] and collapsed variational methods [5], [6] recently introduced a novel class of approximation for DPMs based on Bayesian hierarchical clustering (BHC). These tree-based combinatorial approximations efficiently sum over exponentially many ways of partitioning the data and offer a novel lower bound on the marginal likelihood of the DPM [6]. In this paper we make the following contributions: (1) We show empirically that the BHC lower bounds are substantially tighter than the bounds given by VB [4] and by collapsed variational methods [5] on synthetic and real datasets. (2) We also show that BHC offers a more accurate predictive performance on these datasets. (3) We further improve the tree-based lower bounds with an algorithm that efficiently sums contributions from alternative trees. (4) We present a fast approximate method for BHC. Our results suggest that our combinatorial approximate inference methods and lower bounds may be useful not only in DPMs but in other models as well.

#### Continuous Relaxations for Discrete Hamiltonian Monte Carlo

Yichuan Zhang, Charles A. Sutton, Amos J. Storkey, Zoubin Ghahramani, 2012. (In NIPS). Edited by Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, Kilian Q. Weinberger.

Abstract▼ URL

Continuous relaxations play an important role in discrete optimization, but have not seen much use in approximate probabilistic inference. Here we show that a general form of the Gaussian Integral Trick makes it possible to transform a wide class of discrete variable undirected models into fully continuous systems. The continuous representation allows the use of gradient-based Hamiltonian Monte Carlo for inference, results in new ways of estimating normalization constants (partition functions), and in general opens up a number of new avenues for inference in difficult discrete systems. We demonstrate some of these continuous relaxation inference algorithms on a number of illustrative problems.

#### Scalable Infomin Learning

Yanzhi Chen, Weihao Sun, Yingzhen Li, Adrian Weller, 2022. (In Advances in Neural Information Processing Systems).

Abstract▼ URL

The task of infomin learning aims to learn a representation with high utility while being uninformative about a specified target, with the latter achieved by minimising the mutual information between the representation and the target. It has broad applications, ranging from training fair prediction models against protected attributes, to unsupervised learning with disentangled representations. Recent works on infomin learning mainly use adversarial training, which involves training a neural network to estimate mutual information or its proxy and thus is slow and difficult to optimise. Drawing on recent advances in slicing techniques, we propose a new infomin learning approach, which uses a novel proxy metric to mutual information. We further derive an accurate and analytically computable approximation to this proxy metric, thereby removing the need of constructing neural network-based mutual information estimators. Compared to baselines, experiments on algorithmic fairness, disentangled representation learning and domain adaptation verify that our method can more effectively remove unwanted information with limited time budget.

#### Distributed Inference for Dirichlet Process Mixture Models

Hong Ge, Yutian Chen, Moquan Wan, Zoubin Ghahramani, 07–09 Jul 2015. (In Proceedings of the 32nd International Conference on Machine Learning). Edited by Francis Bach, David Blei. Lille, France. PMLR. Proceedings of Machine Learning Research.

Abstract▼ URL

Bayesian nonparametric mixture models based on the Dirichlet process (DP) have been widely used for solving problems like clustering, density estimation and topic modelling. These models make weak assumptions about the underlying process that generated the observed data. Thus, when more data are collected, the complexity of these models can change accordingly. These theoretical properties often lead to superior predictive performance when compared to traditional finite mixture models. However, despite the increasing amount of data available, the application of Bayesian nonparametric mixture models is so far limited to relatively small data sets. In this paper, we propose an efficient distributed inference algorithm for the DP and the HDP mixture model. The proposed method is based on a variant of the slice sampler for DPs. Since this sampler does not involve a pre-determined truncation, the stationary distribution of the sampling algorithm is unbiased. We provide both local thread-level and distributed machine-level parallel implementations and study the performance of this sampler through an extensive set of experiments on image and text data. When compared to existing inference algorithms, the proposed method exhibits state-of-the-art accuracy and strong scalability with up to 512 cores.

#### Turing: A Language for Flexible Probabilistic Inference

Hong Ge, Kai Xu, Zoubin Ghahramani, 09–11 Apr 2018. (In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics). Edited by Amos Storkey, Fernando Perez-Cruz. PMLR. Proceedings of Machine Learning Research.

Abstract▼ URL

Probabilistic programming promises to simplify and democratize probabilistic machine learning, but successful probabilistic programming systems require flexible, generic and efficient inference engines. In this work, we present a system called Turing for building MCMC algorithms for probabilistic programming inference. Turing has a very simple syntax and makes full use of the numerical capabilities in the Julia programming language, including all implemented probability distributions, and automatic differentiation. Turing supports a wide range of popular Monte Carlo algorithms, including Hamiltonian Monte Carlo (HMC), HMC with No-U-Turns (NUTS), Gibbs sampling, sequential Monte Carlo (SMC), and several particle MCMC (PMCMC) samplers. Most importantly, Turing inference is composable: it combines MCMC operations on subsets of variables, for example using a combination of an HMC engine and a particle Gibbs (PG) engine. We explore several combinations of inference methods with the aim of finding approaches that are both efficient and universal, i.e. applicable to arbitrary probabilistic models. NUTS—a popular variant of HMC that adapts Hamiltonian simulation path length automatically, although quite powerful for exploring differentiable target distributions, is however not universal. We identify some failure modes for the NUTS engine, and demonstrate that composition of PG (for discrete variables) and NUTS (for continuous variables) can be useful when the NUTS engine is either not applicable, or simply does not work well. Our aim is to present Turing and its composable inference engines to the world and encourage other researchers to build on this system to help advance the field of probabilistic machine learning.

#### Couplings for Multinomial Hamiltonian Monte Carlo

Kai Xu, Tor Erlend Fjelde, Charles Sutton, Hong Ge, 13–15 Apr 2021. (In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics). Edited by Arindam Banerjee, Kenji Fukumizu. PMLR. Proceedings of Machine Learning Research.

Abstract▼ URL

Hamiltonian Monte Carlo (HMC) is a popular sampling method in Bayesian inference. Recently, Heng & Jacob (2019) studied Metropolis HMC with couplings for unbiased Monte Carlo estimation, establishing a generic parallelizable scheme for HMC. However, in practice a different HMC method, multinomial HMC, is considered as the go-to method, e.g. as part of the no-U-turn sampler. In multinomial HMC, proposed states are not limited to end-points as in Metropolis HMC; instead points along the entire trajectory can be proposed. In this paper, we establish couplings for multinomial HMC, based on optimal transport for multinomial sampling in its transition. We prove an upper bound for the meeting time – the time it takes for the coupled chains to meet – based on the notion of local contractivity. We evaluate our methods using three targets: 1,000 dimensional Gaussians, logistic regression and log-Gaussian Cox point processes. Compared to Heng & Jacob (2019), coupled multinomial HMC generally attains a smaller meeting time, and is more robust to choices of step sizes and trajectory lengths, which allows re-use of existing adaptation methods for HMC. These improvements together paves the way for a wider and more practical use of coupled HMC methods.