
## Gaussian Processes and Kernel Methods

Gaussian processes are non-parametric distributions useful for doing Bayesian inference and learning on unknown functions. They can be used for non-linear regression, time-series modelling, classification, and many other problems.
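As a rough illustration of the non-linear regression use case, the following minimal sketch computes the exact posterior of a zero-mean Gaussian process on a toy 1-D problem. The squared-exponential kernel, the hyperparameter values, and the function names `sq_exp_kernel` and `gp_posterior` are assumptions chosen for illustration and are not taken from any of the publications listed here.

```python
import numpy as np

def sq_exp_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=1e-2):
    """Posterior mean and variance of the latent function at x_test."""
    K = sq_exp_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = sq_exp_kernel(x_train, x_test)
    L = np.linalg.cholesky(K)                      # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(sq_exp_kernel(x_test, x_test)) - np.sum(v ** 2, axis=0)
    return mean, var

# Toy data (illustrative assumption): noiseless samples of sin(x).
x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_train = np.sin(x_train)
x_test = np.array([0.0, 5.0])   # one point inside the data, one far outside
mean, var = gp_posterior(x_train, y_train, x_test)
```

The posterior variance is small near the observations and reverts towards the prior variance far from them, which is the behaviour that makes Gaussian processes attractive for uncertainty-aware regression.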

Vincent Dutordoir, Alan Saul, Zoubin Ghahramani, and Fergus Simpson.
**Neural diffusion
processes**.
In *arXiv*, Online, Apr 2022.

**Abstract:** Gaussian
processes provide an elegant framework for specifying prior and posterior
distributions over functions. They are, however, also computationally
expensive, and limited by the expressivity of their covariance function. We
propose Neural Diffusion Processes (NDPs), a novel approach based upon
diffusion models, that learn to sample from distributions over functions.
Using a novel attention block, we can incorporate properties of stochastic
processes, such as exchangeability, directly into the NDP's architecture. We
empirically show that NDPs are able to capture functional distributions that
are close to the true Bayesian posterior of a Gaussian process. This enables
a variety of downstream tasks, including hyperparameter marginalisation and
Bayesian optimisation.

Iskander Azangulov, Andrei Smolensky, Alexander Terenin, and Viacheslav
Borovitskiy.
**Stationary kernels and Gaussian
processes on Lie groups and their homogeneous spaces I: the compact
case**.
*arXiv*, 2022.

**Abstract:** Gaussian processes are
arguably the most important model class in spatial statistics. They encode
prior information about the modeled function and can be used for exact or
approximate Bayesian inference. In many applications, particularly in
physical sciences and engineering, but also in areas such as geostatistics
and neuroscience, invariance to symmetries is one of the most fundamental
forms of prior information one can consider. The invariance of a Gaussian
process' covariance to such symmetries gives rise to the most natural
generalization of the concept of stationarity to such spaces. In this work,
we develop constructive and practical techniques for building stationary
Gaussian processes on a very large class of non-Euclidean spaces arising in
the context of symmetries. Our techniques make it possible to (i) calculate
covariance kernels and (ii) sample from prior and posterior Gaussian
processes defined on such spaces, both in a practical manner. This work is
split into two parts, each involving different technical considerations: part
I studies compact spaces, while part II studies non-compact spaces possessing
certain structure. Our contributions make the non-Euclidean Gaussian process
models we study compatible with well-understood computational techniques
available in standard Gaussian process software packages, thereby making them
accessible to practitioners.

Wessel P. Bruinsma, Martin Tegnér, and Richard E. Turner.
**Modelling
non-smooth signals with complex spectral structure**.
In *25th International Conference on Artificial Intelligence and Statistics*, 2022.

**Abstract:** The Gaussian Process
Convolution Model (GPCM; Tobar et al., 2015a) is a model for signals with
complex spectral structure. A significant limitation of the GPCM is that it
assumes a rapidly decaying spectrum: it can only model smooth signals.
Moreover, inference in the GPCM currently requires (1) a mean-field
assumption, resulting in poorly calibrated uncertainties, and (2) a tedious
variational optimisation of large covariance matrices. We redesign the GPCM
model to induce a richer distribution over the spectrum with relaxed
assumptions about smoothness: the Causal Gaussian Process Convolution Model
(CGPCM) introduces a causality assumption into the GPCM, and the Rough
Gaussian Process Convolution Model (RGPCM) can be interpreted as a Bayesian
nonparametric generalisation of the fractional Ornstein–Uhlenbeck process.
We also propose a more effective variational inference scheme, going beyond
the mean-field assumption: we design a Gibbs sampler which directly samples
from the optimal variational solution, circumventing any variational
optimisation entirely. The proposed variations of the GPCM are validated in
experiments on synthetic and real-world data, showing promising results.

David R. Burt.
**Scalable
Approximate Inference and Model Selection in Gaussian Process
Regression**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2022.

**Abstract:** Models with Gaussian process priors and
Gaussian likelihoods are one of only a handful of Bayesian models where
inference can be performed without the need for approximation. However, a
frequent criticism of these models from practitioners of Bayesian machine
learning is that they are challenging to scale to large datasets due to the
need to compute a large kernel matrix and perform standard linear-algebraic
operations with this matrix. This limitation has driven decades of research
in both statistics and machine learning seeking to scale Gaussian process
regression models to ever-larger datasets. This thesis builds on this line of
research. We focus on the problem of approximate inference and model
selection with approximate maximum marginal likelihood as applied to Gaussian
process regression. Our discussion is guided by three questions: Does an
approximation work on a range of models and datasets? Can you verify that an
approximation has worked on a given dataset? Is an approximation easy for a
practitioner to use? While we are far from the first to ask these questions,
we offer new insights into each question in the context of Gaussian process
regression. In the first part of this thesis, we focus on sparse variational
Gaussian process regression (Titsias, 2009). We provide new diagnostics for
inference with this method that can be used as practical guides for
practitioners trying to balance computation and accuracy with this
approximation. We then provide an asymptotic analysis that highlights
properties of the model and dataset that are sufficient for this
approximation to perform reliable inference with a small computational cost.
This analysis builds on an approach laid out in Burt (2018), as well as on
similar guarantees in the kernel ridge regression literature. In the second
part of this thesis, we consider iterative methods, especially the method of
conjugate gradients, as applied to Gaussian process regression (Gibbs and
MacKay, 1997). We primarily focus on improving the reliability of approximate
maximum marginal likelihood when using these approximations. We investigate
how the method of conjugate gradients and related approaches can be used to
derive bounds on quantities related to the log marginal likelihood. This idea
can be used to improve the speed and stability of model selection with these
approaches, making them easier to use in practice.
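To make the sparse variational approximation mentioned in this abstract more concrete, the sketch below evaluates the collapsed lower bound of Titsias (2009) on the log marginal likelihood and compares it with the exact value on a toy dataset. The kernel, data, inducing inputs `z`, jitter value, and all function names are illustrative assumptions; this is not the thesis's code or its diagnostics.

```python
import numpy as np

def sq_exp(a, b, lengthscale=1.0):
    """Squared-exponential kernel with unit variance on 1-D inputs."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / lengthscale) ** 2)

def log_gaussian_density(y, cov):
    """log N(y | 0, cov), computed through a Cholesky factor."""
    L = np.linalg.cholesky(cov)
    w = np.linalg.solve(L, y)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (w @ w + log_det + len(y) * np.log(2.0 * np.pi))

def collapsed_bound(x, y, z, noise_var, jitter=1e-8):
    """log N(y | 0, Qff + noise I) - tr(Kff - Qff) / (2 noise)."""
    Kuu = sq_exp(z, z) + jitter * np.eye(len(z))
    Kuf = sq_exp(z, x)
    Qff = Kuf.T @ np.linalg.solve(Kuu, Kuf)        # Nystrom approximation
    trace_term = np.trace(sq_exp(x, x) - Qff) / (2.0 * noise_var)
    return log_gaussian_density(y, Qff + noise_var * np.eye(len(x))) - trace_term

# Toy data and inducing inputs (illustrative assumptions).
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=40)
noise_var = 0.1
y = np.sin(x) + np.sqrt(noise_var) * rng.standard_normal(40)
z = np.linspace(-3.0, 3.0, 8)

bound = collapsed_bound(x, y, z, noise_var)
exact = log_gaussian_density(y, sq_exp(x, x) + noise_var * np.eye(40))
```

The gap `exact - bound` is non-negative. A small gap suggests that the chosen inducing inputs summarise the data well, which is the kind of computation-versus-accuracy trade-off the abstract refers to.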

Talay M Cheema.
**Contrasting
discrete and continuous methods for Bayesian system identification**.
In *Workshop on Continuous Time Machine Learning at the 39th International
Conference on Machine Learning*, 2022.

**Abstract:** In
recent years, there has been considerable interest in embedding continuous
time methods in machine learning algorithms. In system identification, the
task is to learn a dynamical model from incomplete observation data, and when
prior knowledge is in continuous time – for example, mechanistic
differential equation models – it seems natural to use continuous time
models for learning. Yet when learning flexible, nonlinear, probabilistic
dynamics models, most previous work has focused on discrete time models to
avoid computational, numerical, and mathematical difficulties. In this work
we show, with the aid of small-scale examples, that this mismatch between
model and data generating process can be consequential under certain
circumstances, and we discuss possible modifications to discrete time models
which may better suit them to handling data generated by continuous time
processes.

Wenlin Chen, Austin Tripp, and José Miguel Hernández-Lobato.
**Meta-learning adaptive deep
kernel Gaussian processes for molecular property prediction**.
*arXiv*, 2022.

**Abstract:** We propose Adaptive Deep
Kernel Fitting with Implicit Function Theorem (ADKF-IFT), a novel framework
for learning deep kernel Gaussian processes (GPs) by interpolating between
meta-learning and conventional deep kernel learning. Our approach employs a
bilevel optimization objective where we meta-learn generally useful feature
representations across tasks, in the sense that task-specific GP models
estimated on top of such features achieve the lowest possible predictive loss
on average. We solve the resulting nested optimization problem using the
implicit function theorem (IFT). We show that our ADKF-IFT framework contains
previously proposed Deep Kernel Learning (DKL) and Deep Kernel Transfer (DKT)
as special cases. Although ADKF-IFT is a completely general method, we argue
that it is especially well-suited for drug discovery problems and demonstrate
that it significantly outperforms previous state-of-the-art methods on a
variety of real-world few-shot molecular property prediction tasks and
out-of-domain molecular property prediction and optimization tasks.

Alessandro Davide Ialongo.
**Variational
Inference in Dynamical Systems**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2022.

**Abstract:** Dynamical systems are a powerful
formalism to analyse the world around us. Many datasets are sequential in
nature, and can be described by a discrete time evolution law. We are
interested in approaching the analysis of such datasets from a probabilistic
perspective. We would like to maintain justified beliefs about quantities
which, though useful in explaining the behaviour of a system, may not be
observable, as well as about the system's evolution itself, especially in
regimes we have not yet observed in our data. The framework of statistical
inference gives us the tools to do so, yet, for many systems of interest,
performing inference exactly is not computationally or analytically
tractable. The contribution of this thesis, then, is twofold: first, we
uncover two sources of bias in existing variational inference methods applied
to dynamical systems in general, and state space models whose transition
function is drawn from a Gaussian process (GPSSM) in particular. We show bias
can derive from assuming posteriors in non-linear systems to be jointly
Gaussian, and from assuming that we can sever the dependence between latent
states and transition function in state space model posteriors. Second, we
propose methods to address these issues, undoing the resulting biases. We do
this without compromising on computational efficiency or on the ability to
scale to larger datasets and higher dimensions, compared to the methods we
rectify. One method, the Markov Autoregressive Flow (Markov AF) addresses the
Gaussian assumption, by providing a more flexible class of posteriors, based
on normalizing flows, which can be easily evaluated, sampled, and optimised.
The other method, Variationally Coupled Dynamics and Trajectories (VCDT),
tackles the factorisation assumption, leveraging sparse Gaussian processes
and their variational representation to reintroduce dependence between latent
states and the transition function at no extra computational cost. Since the
objective of inference is to maintain calibrated beliefs, if we employed
approximations which are significantly biased in non-linear, noisy systems,
or when there is little data available, we would have failed in our
objective, as those are precisely the regimes in which uncertainty
quantification is all the more important. Hence we think it is essential, if
we wish to act optimally on such beliefs, to uncover, and, if possible, to
correct, all sources of systematic bias in our inference methods.

Vidhi Lalchand, Wessel P. Bruinsma, David R. Burt, and Carl E. Rasmussen.
**Sparse Gaussian
process hyperparameters: Optimize or integrate?**.
In *36th Conference on Neural Information Processing Systems*, 2022.

**Abstract:** The kernel function and
its hyperparameters are the central model selection choice in a Gaussian
process [Rasmussen and Williams, 2006]. Typically, the hyperparameters of the
kernel are chosen by maximising the marginal likelihood, an approach known as
Type-II maximum likelihood (ML-II). However, ML-II does not account for
hyperparameter uncertainty, and it is well-known that this can lead to
severely biased estimates and an underestimation of predictive uncertainty.
While there are several works which employ a fully Bayesian characterisation
of GPs, relatively few propose such approaches for the sparse GPs paradigm.
In this work we propose an algorithm for sparse Gaussian process regression
which leverages MCMC to sample from the hyperparameter posterior within the
variational inducing point framework of [Titsias, 2009]. This work is closely
related to Hensman et al. [2015b], but side-steps the need to sample the
inducing points, thereby significantly improving sampling efficiency in the
Gaussian likelihood case. We compare this scheme against natural baselines in
the literature and against stochastic variational GPs (SVGPs), and provide an
extensive computational analysis.

Vidhi Lalchand, Aditya Ravuri, and Neil D. Lawrence.
**Generalised
GPLVM with Stochastic Variational Inference**.
In *25th International Conference on Artificial Intelligence and
Statistics*, volume 151 of *Proceedings of Machine Learning
Research*, pages 7841-7864. PMLR, 28-30 Mar 2022.

**Abstract:** Gaussian process latent variable models (GPLVM) are a flexible
and non-linear approach to dimensionality reduction, extending classical
Gaussian processes to an unsupervised learning context. The Bayesian
incarnation of the GPLVM uses a variational framework, where the posterior
over latent variables is approximated by a well-behaved variational family, a
factorised Gaussian yielding a tractable lower bound. However, the
non-factorisability of the lower bound prevents truly scalable inference. In
this work, we study the doubly stochastic formulation of the Bayesian GPLVM
model amenable to minibatch training. We show how this framework is
compatible with different latent variable formulations and perform
experiments to compare a suite of models. Further, we demonstrate how we can
train in the presence of massively missing data and obtain high-fidelity
reconstructions. We demonstrate the model’s performance by benchmarking
against the canonical sparse GPLVM for high dimensional data examples.

Vidhi Lalchand, Kenza Tazi, Talay M Cheema, Richard E Turner, and Scott
Hosking.
**Kernel learning for explainable
climate science**.
In *16th Bayesian Modelling Applications Workshop at UAI*, 2022.

**Abstract:** The Upper Indus Basin (UIB), Himalayas, provides water
for 270 million people and countless ecosystems. However, precipitation, a
key component to hydrological modelling, is poorly understood in this area. A
key challenge surrounding this uncertainty comes from the complex
spatial-temporal distribution of precipitation across the basin. In this work
we propose Gaussian processes with structured non-stationary kernels to model
precipitation patterns in the UIB. Previous attempts to quantify or model
precipitation in the Hindu Kush Karakoram Himalayan region have often been
qualitative or include crude assumptions and simplifications which cannot be
resolved at lower resolutions. This body of research also provides little to
no error propagation. We account for the spatial variation in precipitation
with a non-stationary Gibbs kernel parameterised with an input dependent
lengthscale. This allows the posterior function samples to adapt to the
varying precipitation patterns inherent in the distinct underlying topography
of the Indus region. The input dependent lengthscale is governed by a latent
Gaussian process with a stationary squared-exponential kernel to allow the
function level hyperparameters to vary smoothly. In ablation experiments we
motivate each component of the proposed kernel by demonstrating its ability
to model the spatial covariance, temporal structure and joint spatio-temporal
reconstruction. We benchmark our model against a stationary Gaussian process and
a deep Gaussian process.
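The non-stationary Gibbs kernel referred to in this abstract can be sketched in one dimension as follows. In the paper the lengthscale is governed by a latent Gaussian process; here a fixed, hypothetical `lengthscale` function stands in for it, so this is only an illustration of the kernel's functional form and not the authors' model.

```python
import numpy as np

def lengthscale(x):
    """Hypothetical smooth, positive lengthscale function (an assumption)."""
    return 0.5 + 0.4 * np.tanh(x)

def gibbs_kernel(a, b):
    """1-D Gibbs kernel with input-dependent lengthscale.

    k(x, x') = sqrt(2 l(x) l(x') / (l(x)^2 + l(x')^2))
               * exp(-(x - x')^2 / (l(x)^2 + l(x')^2))
    """
    la = lengthscale(a)[:, None]
    lb = lengthscale(b)[None, :]
    sq_sum = la ** 2 + lb ** 2
    prefactor = np.sqrt(2.0 * la * lb / sq_sum)
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return prefactor * np.exp(-sq_dist / sq_sum)

x = np.linspace(-3.0, 3.0, 25)
K = gibbs_kernel(x, x)   # covariance matrix with unit diagonal
```

With this choice of `lengthscale`, functions drawn from the prior vary quickly for negative inputs (short lengthscale) and slowly for positive inputs (long lengthscale), which mirrors the idea of adapting smoothness to the local terrain.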

Henry B. Moss, Sebastian W. Ober, and Victor Picheny.
**Information-theoretic
inducing point placement for high-throughput Bayesian optimisation**.
In *ICML Workshop on Adaptive Experimental Design and Active Learning in the
Real World (RealML)*, 2022.

**Abstract:** Sparse Gaussian
Processes are a key component of high-throughput Bayesian optimisation (BO)
loops — an increasingly common setting where evaluation budgets are large
and highly parallelised. By using representative subsets of the available
data to build approximate posteriors, sparse models dramatically reduce the
computational costs of surrogate modelling by relying on a small set of
pseudo-observations, the so-called inducing points, in lieu of the full data
set. However, current approaches to design inducing points are not
appropriate within BO loops as they seek to reduce global uncertainty in the
objective function. Thus, the high-fidelity modelling of promising and
data-dense regions required for precise optimisation is sacrificed and
computational resources are instead wasted on modelling areas of the space
already known to be sub-optimal. Inspired by entropy-based BO methods, we
propose a novel inducing point design that uses a principled
information-theoretic criterion to select inducing points. By choosing
inducing points to maximally reduce both global uncertainty and uncertainty
in the maximum value of the objective function, we build surrogate models
able to support high-precision high-throughput BO.

Alexander Terenin, David R. Burt, Artem Artemev, Seth Flaxman, Mark van der
Wilk, Carl Edward Rasmussen, and Hong Ge.
**Numerically stable sparse
Gaussian processes via minimum separation using cover trees**.
*arXiv*, 2022.

**Abstract:** As Gaussian processes
mature, they are increasingly being deployed as part of larger machine
learning and decision-making systems, for instance in geospatial modeling,
Bayesian optimization, or in latent Gaussian models. Within a system, the
Gaussian process model needs to perform in a stable and reliable manner to
ensure it interacts correctly with other parts of the system. In this work, we
study the numerical stability of scalable sparse approximations based on
inducing points. We derive sufficient and in certain cases necessary
conditions on the inducing points for the computations performed to be
numerically stable. For low-dimensional tasks such as geospatial modeling, we
propose an automated method for computing inducing points satisfying these
conditions. This is done via a modification of the cover tree data structure,
which is of independent interest. We additionally propose an alternative
sparse approximation for regression with a Gaussian likelihood which trades
off a small amount of performance to further improve stability. We evaluate
the proposed techniques on a number of examples, showing that, in geospatial
settings, sparse approximations with guaranteed numerical stability often
perform comparably to those without.

Will Tebbutt.
**Advances in
Software and Spatio-Temporal Modelling with Gaussian Processes**.
PhD thesis, University of Cambridge, Department of Engineering, 2022.

**Abstract:** This thesis concerns the use of Gaussian
processes (GPs) as distributions over unknown functions in Machine Learning
and probabilistic modeling. GPs have been found to have utility in a wide
range of applications owing to their flexibility, interpretability, and
tractability. I advance their use in three directions. Firstly, the
abstractions upon which software is built for their use in practice. In
modern GP software libraries such as GPML, GPy, GPflow, and GPyTorch, the
kernel is undoubtedly the dominant abstraction. While it remains highly
successful it of course has limitations, and I propose to address some of
these through a complementary abstraction: affine transformations of GPs.
Specifically I show how a collection of GPs, and affine transformations
thereof, can themselves be treated as a single GP. This in turn leads to a
design for software, including exact and approximate inference algorithms. I
demonstrate the utility of this software through a collection of worked
examples, focussing on models which are more cleanly and easily expressed
using this new software. Secondly, I develop a new scalable approximate
inference algorithm for a class of GPs commonly utilised in spatio-temporal
problems. This is a setting in which GPs excel, for example enabling the
incorporation of important inductive biases, and observations made at
arbitrary points in time and space. However, the computation required to
perform exact inference and learning in GPs scales cubically in the number of
observations, necessitating approximation, to which end I combine two
important complementary classes of approximation: pseudo-point and Markovian.
The key contribution is the insight that a simple and useful way to combine
them turns out to be well-justified. This resolves an open question in the
literature, provides new insight into existing work, and a new family of
approximations. The efficacy of an important member of this family is
demonstrated empirically. Finally I develop a GP model and associated
approximate inference techniques for the prediction of sea surface
temperatures (SSTs) on decadal time scales, which are relevant when taking
planning decisions which consider resilience to climate change. There remains
a large degree of uncertainty as to the state of the climate on such time
scales, but it is thought to be possible to reduce this by exploiting the
predictability of natural variability in the climate. The developed GP-based
model incorporates a key assumption used by the existing statistical models
employed for decadal prediction, thus retaining a valuable inductive bias,
while offering several advantages. Amongst these is the lack of need for
spatial aggregation of data, which is especially relevant when data are
sparse, as is the case with historical ocean SST data. In summary, this
thesis contributes to the practical use of GPs through a set of abstractions
that are useful in the design of software, algorithms for approximate
inference in spatial-temporal settings, and their use in decadal climate
prediction.

Vincent Dutordoir, James Hensman, Mark van der Wilk, Carl Henrik Ek, Zoubin
Ghahramani, and Nicolas Durrande.
**Deep
neural networks as point estimates for deep Gaussian processes**.
In *Advances in Neural Information Processing Systems 34*, Online, Dec
2021.

**Abstract:** Neural networks and Gaussian processes
are complementary in their strengths and weaknesses. Having a better
understanding of their relationship comes with the promise to make each
method benefit from the strengths of the other. In this work, we establish an
equivalence between the forward passes of neural networks and (deep) sparse
Gaussian process models. The theory we develop is based on interpreting
activation functions as interdomain inducing features through a rigorous
analysis of the interplay between activation functions and kernels. This
results in models that can either be seen as neural networks with improved
uncertainty prediction or deep Gaussian processes with increased prediction
accuracy. These claims are supported by experimental results on regression
and classification datasets.

Artem Artemev, David R. Burt, and Mark van der Wilk.
**Tighter
bounds on the log marginal likelihood of Gaussian process regression using
conjugate gradients**.
In *38th International Conference on Machine Learning*, 2021.

**Abstract:** We propose a lower bound on the log marginal
likelihood of Gaussian process regression models that can be computed without
matrix factorisation of the full kernel matrix. We show that approximate
maximum likelihood learning of model parameters by maximising our lower bound
retains many benefits of the sparse variational approach while reducing the
bias introduced into hyperparameter learning. The basis of our bound is a
more careful analysis of the log-determinant term appearing in the log
marginal likelihood, as well as using the method of conjugate gradients to
derive tight lower bounds on the term involving a quadratic form. Our
approach is a step forward in unifying methods relying on lower bound
maximisation (e.g. variational methods) and iterative approaches based on
conjugate gradients for training Gaussian processes. In experiments, we show
improved predictive performance with our model for a comparable amount of
training time compared to other conjugate gradient based approaches.
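The sketch below illustrates how a few conjugate gradient iterations can bracket the quadratic form y^T K^{-1} y that appears in the log marginal likelihood. It relies on the standard identity y^T K^{-1} y = 2 v^T y - v^T K v + r^T K^{-1} r with residual r = y - K v, together with the fact that K^{-1} is bounded above by the identity divided by the noise variance when K includes a noise term. This is a generic illustration under those assumptions, not the specific bound derived in the paper, and the data and kernel are made up.

```python
import numpy as np

def conjugate_gradients(K, y, num_iters):
    """Approximately solve K v = y with plain conjugate gradients."""
    v = np.zeros_like(y)
    r = y - K @ v            # residual
    p = r.copy()             # search direction
    for _ in range(num_iters):
        Kp = K @ p
        step = (r @ r) / (p @ Kp)
        v = v + step * p
        r_new = r - step * Kp
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return v

# Toy regression problem (illustrative assumption).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3.0, 3.0, size=50))
noise_var = 0.1
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + noise_var * np.eye(50)
y = np.sin(x) + np.sqrt(noise_var) * rng.standard_normal(50)

v = conjugate_gradients(K, y, num_iters=5)
r = y - K @ v
lower = 2.0 * (v @ y) - v @ K @ v          # lower bound on y^T K^{-1} y
upper = lower + (r @ r) / noise_var        # upper bound on y^T K^{-1} y
exact = y @ np.linalg.solve(K, y)          # reference value via a direct solve
```

Because the quadratic form enters the log marginal likelihood with a negative sign, the upper bound on y^T K^{-1} y is the one that yields a lower bound on that term. Both bounds tighten as the residual shrinks with more iterations.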

Laurence Aitchison, Adam X. Yang, and Sebastian W. Ober.
**Deep
kernel processes**.
In *38th International Conference on Machine Learning*, 2021.

**Abstract:** We define deep kernel processes in which positive
definite Gram matrices are progressively transformed by nonlinear kernel
functions and by sampling from (inverse) Wishart distributions. Remarkably,
we find that deep Gaussian processes (DGPs), Bayesian neural networks (BNNs),
infinite BNNs, and infinite BNNs with bottlenecks can all be written as deep
kernel processes. For DGPs the equivalence arises because the Gram matrix
formed by the inner product of features is Wishart distributed, and as we
show, standard isotropic kernels can be written entirely in terms of this
Gram matrix — we do not need knowledge of the underlying features. We
define a tractable deep kernel process, the deep inverse Wishart process, and
give a doubly-stochastic inducing-point variational inference scheme that
operates on the Gram matrices, not on the features, as in DGPs. We show that
the deep inverse Wishart process gives superior performance to DGPs and
infinite BNNs on fully-connected baselines.
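The claim that isotropic kernels can be written purely in terms of the Gram matrix follows from the identity ||f_i - f_j||^2 = G_ii + G_jj - 2 G_ij. The minimal sketch below checks this numerically for a squared-exponential kernel. The unnormalised Gram matrix G = F F^T, the random features, and the function name are assumptions for illustration; the paper may scale the Gram matrix differently.

```python
import numpy as np

def sq_exp_from_gram(G, lengthscale=1.0):
    """Squared-exponential kernel computed from a Gram matrix only."""
    d = np.diag(G)
    sq_dist = d[:, None] + d[None, :] - 2.0 * G   # pairwise squared distances
    return np.exp(-0.5 * sq_dist / lengthscale ** 2)

# Random features (illustrative assumption): 6 points with 3 features each.
rng = np.random.default_rng(0)
F = rng.standard_normal((6, 3))
G = F @ F.T

K_from_gram = sq_exp_from_gram(G)

# Reference computation that uses the features directly.
diff = F[:, None, :] - F[None, :, :]
K_from_features = np.exp(-0.5 * np.sum(diff ** 2, axis=-1))
```

The two matrices agree up to floating-point error, so a model that only tracks Gram matrices loses nothing when evaluating such a kernel.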

Talay M Cheema.
**Understanding
local linearisation in variational Gaussian process state space
models**.
In *Time Series Workshop at the 38th International Conference on Machine
Learning*, 2021.

**Abstract:** We describe variational
inference approaches in Gaussian process state space models in terms of local
linearisations of the approximate posterior function. Most previous
approaches have either assumed independence between the posterior dynamics
and latent states (the mean-field (MF) approximation), or optimised free
parameters for both, leading to limited scalability. We use our framework to
prove that (i) there is a theoretical imperative to use non-MF approaches, to
avoid excessive bias in the process noise hyperparameter estimate, and (ii)
we can parameterise only the posterior dynamics without any loss of
performance. Our approach suggests further approximations, based on the
existing rich literature on filtering and smoothing for nonlinear systems,
and unifies approaches for discrete and continuous time models.

Metod Jazbec, Matt Ashman, Vincent Fortuin, Michael Pearce, Stephan Mandt, and
Gunnar Rätsch.
**Scalable
Gaussian process variational autoencoders**.
In Arindam Banerjee and Kenji Fukumizu, editors, *Proceedings of The 24th
International Conference on Artificial Intelligence and Statistics*,
volume 130 of *Proceedings of Machine Learning Research*, pages
3511-3519. Proceedings of Machine Learning Research, 13-15 Apr 2021.

**Abstract:** Conventional variational autoencoders fail in
modeling correlations between data points due to their use of factorized
priors. Amortized Gaussian process inference through GP-VAEs has led to
significant improvements in this regard, but is still inhibited by the
intrinsic complexity of exact GP inference. We improve the scalability of
these methods through principled sparse inference approaches. We propose a
new scalable GP-VAE model that outperforms existing approaches in terms of
runtime and memory footprint, is easy to implement, and allows for joint
end-to-end optimization of all components.

Sebastian W. Ober and Laurence Aitchison.
**Global
inducing point variational posteriors for Bayesian neural networks and deep
Gaussian processes**.
In *38th International Conference on Machine Learning*, 2021.

**Abstract:** We consider the optimal approximate posterior
over the top-layer weights in a Bayesian neural network for regression, and
show that it exhibits strong dependencies on the lower-layer weights. We
adapt this result to develop a correlated approximate posterior over the
weights at all layers in a Bayesian neural network. We extend this approach
to deep Gaussian processes, unifying inference in the two model classes. Our
approximate posterior uses learned "global" inducing points, which are
defined only at the input layer and propagated through the network to obtain
inducing inputs at subsequent layers. By contrast, standard "local"
inducing point methods from the deep Gaussian process literature optimise a
separate set of inducing inputs at every layer, and thus do not model
correlations across layers. Our method gives state-of-the-art performance for
a variational Bayesian method, without data augmentation or tempering, on
CIFAR-10 of 86.7%, which is comparable to SGMCMC without tempering but with
data augmentation (88% in Wenzel et al. 2020).

Sebastian W. Ober and Laurence Aitchison.
**A
variational approximate posterior for the deep Wishart process**.
In *Advances in Neural Information Processing Systems 34*, 2021.

**Abstract:** Recent work introduced deep kernel processes as
an entirely kernel-based alternative to NNs (Aitchison et al. 2020). Deep
kernel processes flexibly learn good top-layer representations by alternately
sampling the kernel from a distribution over positive semi-definite matrices
and performing nonlinear transformations. A particular deep kernel process,
the deep Wishart process (DWP), is of particular interest because its prior
can be made equivalent to deep Gaussian process (DGP) priors for kernels that
can be expressed entirely in terms of Gram matrices. However, inference in
DWPs has not yet been possible due to the lack of sufficiently flexible
distributions over positive semi-definite matrices. Here, we give a novel
approach to obtaining flexible distributions over positive semi-definite
matrices by generalising the Bartlett decomposition of the Wishart
probability density. We use this new distribution to develop an approximate
posterior for the DWP that includes dependency across layers. We develop a
doubly-stochastic inducing-point inference scheme for the DWP and show
experimentally that inference in the DWP can improve performance over doing
inference in a DGP with the equivalent prior.
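For readers unfamiliar with the Bartlett decomposition mentioned in this abstract, the sketch below draws a Wishart sample using the standard (not generalised) construction: a lower-triangular matrix with chi-distributed diagonal and standard normal entries below it. The scale matrix, degrees of freedom, and function name are illustrative assumptions, and the paper's generalisation is not reproduced here.

```python
import numpy as np

def sample_wishart_bartlett(scale, dof, rng):
    """Draw W ~ Wishart(scale, dof) via the standard Bartlett decomposition."""
    p = scale.shape[0]
    L = np.linalg.cholesky(scale)                 # scale = L L^T
    # Strictly lower-triangular part: independent standard normals.
    A = np.tril(rng.standard_normal((p, p)), k=-1)
    # Squared i-th diagonal entry (0-indexed) is chi-squared with dof - i
    # degrees of freedom.
    A[np.diag_indices(p)] = np.sqrt(rng.chisquare(dof - np.arange(p)))
    LA = L @ A
    return LA @ LA.T

rng = np.random.default_rng(0)
scale = np.array([[1.0, 0.5], [0.5, 2.0]])
dof = 5
W = sample_wishart_bartlett(scale, dof, rng)      # one positive definite draw
```

Averaging many such draws should approach `dof * scale`, the mean of the Wishart distribution, which gives a simple sanity check on the construction.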

Sebastian W. Ober, Carl Edward Rasmussen, and Mark van der Wilk.
**The promises and pitfalls of
deep kernel learning**.
In *37th Conference on Uncertainty in Artificial Intelligence*, 2021.

**Abstract:** Deep kernel learning (DKL) and related techniques
aim to combine the representational power of neural networks with the
reliable uncertainty estimates of Gaussian processes. One crucial aspect of
these models is an expectation that, because they are treated as Gaussian
process models optimized using the marginal likelihood, they are protected
from overfitting. However, we identify situations where this is not the case.
We explore this behavior, explain its origins and consider how it applies to
real datasets. Through careful experimentation on the UCI, CIFAR-10, and the
UTKFace datasets, we find that the overfitting from overparameterized maximum
marginal likelihood, in which the model is "somewhat Bayesian", can in
certain scenarios be worse than that from not being Bayesian at all. We
explain how and when DKL can still be successful by investigating
optimization dynamics. We also find that failures of DKL can be rectified by
a fully Bayesian treatment, which leads to the desired performance
improvements over standard neural networks and Gaussian processes.

Fergus Simpson, Ian Davies, Vidhi Lalchand, Alessandro Vullo, Nicolas Durrande,
and Carl Edward Rasmussen.
**Kernel
identification through transformers**.
In *Advances in Neural Information Processing Systems 34*,
volume 34, pages 10483-10495, 2021.

** Abstract:**
Kernel selection plays a central role in determining the performance of
Gaussian Process (GP) models, as the chosen kernel determines both the
inductive biases and prior support of functions under the GP prior. This work
addresses the challenge of constructing custom kernel functions for
high-dimensional GP regression models. Drawing inspiration from recent
progress in deep learning, we introduce a novel approach named KITT: Kernel
Identification Through Transformers. KITT exploits a transformer-based
architecture to generate kernel recommendations in under 0.1 seconds, which
is several orders of magnitude faster than conventional kernel search
algorithms. We train our model using synthetic data generated from priors
over a vocabulary of known kernels. By exploiting the nature of the
self-attention mechanism, KITT is able to process datasets with inputs of
arbitrary dimension. We demonstrate that kernels chosen by KITT yield strong
performance over a diverse collection of regression benchmarks.

Fergus Simpson, Vidhi Lalchand, and Carl Edward Rasmussen.
**Marginalised
Gaussian Processes with Nested Sampling**.
In *Advances in Neural Information Processing Systems 34*,
volume 34, pages 13613-13625. Curran Associates, Inc., 2021.

** Abstract:** Gaussian Process models are a rich distribution
over functions with inductive biases controlled by a kernel function.
Learning occurs through optimisation of the kernel hyperparameters using the
marginal likelihood as the objective. This work proposes nested sampling as a
means of marginalising kernel hyperparameters, because it is a technique that
is well-suited to exploring complex, multi-modal distributions. We benchmark
against Hamiltonian Monte Carlo on time-series and two-dimensional regression
tasks, finding that a principled approach to quantifying hyperparameter
uncertainty substantially improves the quality of prediction intervals.

Will Tebbutt, Arno Solin, and Richard E. Turner.
**Combining
pseudo-point and state space approximations for sum-separable Gaussian
processes**.
In Cassio de Campos and Marloes H. Maathuis, editors, *Proceedings of the
Thirty-Seventh Conference on Uncertainty in Artificial Intelligence*,
Proceedings of Machine Learning Research, pages 1607-1617. PMLR, 2021.

** Abstract:** Gaussian processes (GPs) are important
probabilistic tools for inference and learning in spatio-temporal modelling
problems such as those in climate science and epidemiology. However, existing
GP approximations do not simultaneously support large numbers of off-the-grid
spatial data-points and long time-series which is a hallmark of many
applications. Pseudo-point approximations, one of the gold-standard methods
for scaling GPs to large data sets, are well suited for handling off-the-grid
spatial data. However, they cannot handle long temporal observation horizons
effectively, reverting to cubic computational scaling in the time dimension.
State space GP approximations are well suited to handling temporal data, if
the temporal GP prior admits a Markov form, leading to linear complexity in
the number of temporal observations, but have a cubic spatial cost and cannot
handle off-the-grid spatial data. In this work we show that there is a simple
and elegant way to combine pseudo-point methods with the state space GP
approximation framework to get the best of both worlds. The approach hinges
on a surprising conditional independence property which applies to
space–time separable GPs. We demonstrate empirically that the combined
approach is more scalable and applicable to a greater range of
spatio-temporal problems than either method on its own.

Martin Trapp, Robert Peharz, Franz Pernkopf, and Carl Edward Rasmussen.
**Deep
structured mixtures of Gaussian processes**.
In *23rd International Conference on Artificial Intelligence and
Statistics*, Online, August 2020.

** Abstract:** Gaussian
Processes (GPs) are powerful non-parametric Bayesian regression models that
allow exact posterior inference, but exhibit high computational and memory
costs. In order to improve scalability of GPs, approximate posterior
inference is frequently employed, where a prominent class of approximation
techniques is based on local GP experts. However, local-expert techniques
proposed so far are either not well-principled, come with limited
approximation guarantees, or lead to intractable models. In this paper, we
introduce deep structured mixtures of GP experts, a stochastic process model
which i) allows exact posterior inference, ii) has attractive computational
and memory costs, and iii) when used as GP approximation, captures predictive
uncertainties consistently better than previous expert-based approximations.
In a variety of experiments, we show that deep structured mixtures have a low
approximation error and often perform competitively with or outperform prior
work.

Vincent Dutordoir, Nicolas Durrande, and James Hensman.
**Sparse
Gaussian processes with spherical harmonic features**.
In *37th International Conference on Machine Learning*, Online, June
2020.

** Abstract:** We introduce a new class of inter-domain
variational Gaussian processes (GP) where data is mapped onto the unit
hypersphere in order to use spherical harmonic representations. Our inference
scheme is comparable to variational Fourier features, but it does not suffer
from the curse of dimensionality, and leads to diagonal covariance matrices
between inducing variables. This enables a speed-up in inference, because it
bypasses the need to invert large covariance matrices. Our experiments show
that our model is able to fit a regression model for a dataset with 6 million
entries two orders of magnitude faster compared to standard sparse GPs, while
retaining state of the art accuracy. We also demonstrate competitive
performance on classification with non-conjugate likelihoods.

Matthew Ashman, Jonny So, Will Tebbutt, Vincent Fortuin, Michael Pearce, and
Richard E. Turner.
**Sparse Gaussian process
variational autoencoders**.
2020.

** Abstract:** Large, multi-dimensional spatio-temporal
datasets are omnipresent in modern science and engineering. An effective
framework for handling such data are Gaussian process deep generative models
(GP-DGMs), which employ GP priors over the latent variables of DGMs. Existing
approaches for performing inference in GP-DGMs do not support sparse GP
approximations based on inducing points, which are essential for the
computational efficiency of GPs, nor do they handle missing data - a natural
occurrence in many spatio-temporal datasets - in a principled manner. We
address these shortcomings with the development of the sparse Gaussian
process variational autoencoder (SGP-VAE), characterised by the use of
partial inference networks for parameterising sparse GP approximations.
Leveraging the benefits of amortised variational inference, the SGP-VAE
enables inference in multi-output sparse GPs on previously unobserved data
with no additional training. The SGP-VAE is evaluated in a variety of
experiments where it outperforms alternative approaches including
multi-output GPs and structured VAEs.

Wessel Bruinsma, Eric Perim, Will Tebbutt, J. Scott Hosking, Arno Solin, and
Richard E. Turner.
**Scalable
exact inference in multi-output Gaussian processes**.
In *37th International Conference on Machine Learning*. Proceedings of
Machine Learning Research, 2020.

** Abstract:** Multi-output
Gaussian processes (MOGPs) leverage the flexibility and interpretability of
GPs while capturing structure across outputs, which is desirable, for
example, in spatio-temporal modelling. The key problem with MOGPs is their
computational scaling $O(n^3 p^3)$, which is cubic in the number of both
inputs $n$ (e.g., time points or locations) and outputs $p$. For this reason,
a popular class of MOGPs assumes that the data live around a low-dimensional
linear subspace, reducing the complexity to $O(n^3 m^3)$. However, this cost
is still cubic in the dimensionality of the subspace $m$, which is still
prohibitively expensive for many applications. We propose the use of a
sufficient statistic of the data to accelerate inference and learning in
MOGPs with orthogonal bases. The method achieves linear scaling in $m$ in
practice, allowing these models to scale to large $m$ without sacrificing
significant expressivity or requiring approximation. This advance opens up a
wide range of real-world tasks and can be combined with existing GP
approximations in a plug-and-play way. We demonstrate the efficacy of the
method on various synthetic and real-world data sets.

David R. Burt, Carl Edward Rasmussen, and Mark van der Wilk.
**Convergence
of sparse variational inference in Gaussian processes regression**.
*Journal of Machine Learning Research*, 21, 2020.

** Abstract:** Gaussian processes are distributions over functions that are
versatile and mathematically convenient priors in Bayesian modelling.
However, their use is often impeded for data with large numbers of
observations, $N$, due to the cubic (in $N$) cost of matrix operations used in
exact inference. Many solutions have been proposed that rely on $M \ll N$
inducing variables to form an approximation at a cost of $O(NM^2)$.
While the computational cost appears linear in $N$, the true complexity depends
on how $M$ must scale with $N$ to ensure a certain quality of the approximation.
In this work, we investigate upper and lower bounds on how $M$ needs to grow
with $N$ to ensure high quality approximations. We show that we can make the
KL-divergence between the approximate model and the exact posterior
arbitrarily small for a Gaussian-noise regression model with $M \ll N$.
Specifically, for the popular squared exponential kernel and $D$-dimensional
Gaussian distributed covariates, $M = O((\log N)^D)$ suffices and a
method with an overall computational cost of $O(N (\log N)^{2D} (\log \log N)^2)$
can be used to perform inference.

Jan-Peter Calliess, Stephen J. Roberts, Carl Edward Rasmussen, and Jan
Maciejowski.
**Lazily adapted constant kinky inference for non-parametric regression and
model-reference adaptive control**.
*Automatica*, 122, 2020, doi
10.1016/j.automatica.2020.109216.

** Abstract:**
Techniques known as Nonlinear Set Membership prediction or Lipschitz
Interpolation are approaches to supervised machine learning that utilise
presupposed Lipschitz properties to perform inference over unobserved
function values. Provided a bound on the true best Lipschitz constant of the
target function is known a priori, they offer convergence guarantees, as well
as bounds around the predictions. Considering a more general setting that
builds on Lipschitz continuity, we propose an online method for estimating
the Lipschitz constant from function value observations that are
possibly corrupted by bounded noise. Utilising this as a data-dependent
hyper-parameter gives rise to a nonparametric machine learning method, for
which we establish strong universal approximation guarantees. That is, we
show that our prediction rule can learn any continuous function on compact
support in the limit of increasingly dense data, up to a worst-case error
that can be bounded by the level of observational error. We also consider
applications of our nonparametric regression method to learning-based
control. For a class of discrete-time settings, we establish convergence
guarantees on the closed-loop tracking error of our online learning-based
controllers. To provide evidence that our method can be beneficial not only
in theory but also in practice, we apply it in the context of nonparametric
model-reference adaptive control (MRAC). Across a range of simulated aircraft
roll-dynamics and performance metrics our approach outperforms recently
proposed alternatives that were based on Gaussian processes and RBF-neural
networks.

David Janz, David Burt, and Javier Gonzalez.
**Bandit
optimisation of functions in the Matérn kernel RKHS**.
In *23rd International Conference on Artificial Intelligence and
Statistics*, 2020.

** Abstract:** We consider the problem
of optimising functions in the reproducing kernel Hilbert space (RKHS) of a
Matérn kernel with smoothness parameter $\nu$ over the domain $[0,1]^d$ under
noisy bandit feedback. Our contribution, the $\pi$-GP-UCB algorithm, is the
first practical approach with guaranteed sublinear regret for all $\nu > 1$
and $d \geq 1$. Empirical validation suggests better performance and
drastically improved computational scalability compared with its predecessor,
Improved GP-UCB.

Vidhi Lalchand and Carl Edward Rasmussen.
**Approximate
inference for fully Bayesian Gaussian process regression**.
In *2nd Symposium on Advances in Approximate Bayesian Inference*, pages
1-12. PMLR, 2020.

** Abstract:** Learning in Gaussian Process
models occurs through the adaptation of hyperparameters of the mean and the
covariance function. The classical approach entails maximizing the marginal
likelihood yielding fixed point estimates (an approach called Type II maximum
likelihood or ML-II). An alternative learning procedure is to infer the
posterior over hyper-parameters in a hierarchical specification of GPs we call
Fully Bayesian Gaussian Process Regression (GPR). This work considers two
approximation schemes for the intractable hyperparameter posterior: 1)
Hamiltonian Monte Carlo (HMC) yielding a sampling based approximation and 2)
Variational Inference (VI) where the posterior over hyperparameters is
approximated by a factorized Gaussian (mean-field) or a full-rank Gaussian
accounting for correlations between hyperparameters. We analyse the
predictive performance for fully Bayesian GPR on a range of benchmark data
sets.

Timothy Gebhard, Niki Kilbertus, Ian Harry, and Bernhard Schölkopf.
**Convolutional
neural networks: A magic bullet for gravitational-wave detection?**.
*Physical Review D*, 100(6):063015, September 2019, doi
https://doi.org/10.1103/PhysRevD.100.063015.

** Abstract:** In the last few years, machine learning techniques, in
particular convolutional neural networks, have been investigated as a method
to replace or complement traditional matched filtering techniques that are
used to detect the gravitational-wave signature of merging black holes.
However, to date, these methods have not yet been successfully applied to the
analysis of long stretches of data recorded by the Advanced LIGO and Virgo
gravitational-wave observatories. In this work, we critically examine the use
of convolutional neural networks as a tool to search for merging black holes.
We identify the strengths and limitations of this approach, highlight some
common pitfalls in translating between machine learning and
gravitational-wave astronomy, and discuss the interdisciplinary challenges.
In particular, we explain in detail why convolutional neural networks alone
cannot be used to claim a statistically significant gravitational-wave
detection. However, we demonstrate how they can still be used to rapidly flag
the times of potential signals in the data for a more detailed follow-up. Our
convolutional neural network architecture as well as the proposed performance
metrics are better suited for this task than a standard binary
classification scheme. A detailed evaluation of our approach on Advanced
LIGO data demonstrates the potential of such systems as trigger generators.
Finally, we sound a note of caution by constructing adversarial examples,
which showcase interesting "failure modes" of our model, where inputs with no
visible resemblance to real gravitational-wave signals are identified as such
by the network with high confidence.

Alessandro Davide Ialongo, Mark van der Wilk, James Hensman, and Carl Edward
Rasmussen.
**Overcoming mean-field
approximations in recurrent Gaussian process models**.
In *36th International Conference on Machine Learning*, Long Beach, June
2019.

** Abstract:** We identify a new variational inference
scheme for dynamical systems whose transition function is modelled by a
Gaussian process. Inference in this setting has either employed
computationally intensive MCMC methods, or relied on factorisations of the
variational posterior. As we demonstrate in our experiments, the
factorisation between latent system states and transition function can lead
to a miscalibrated posterior and to learning unnecessarily large noise terms.
We eliminate this factorisation by explicitly modelling the dependence
between state trajectories and the Gaussian process posterior. Samples of the
latent states can then be tractably generated by conditioning on this
representation. The method we obtain (VCDT: variationally coupled dynamics
and trajectories) gives better predictive performance and more calibrated
estimates of the transition function, yet maintains the same time and space
complexities as mean-field methods. Code is available at:
https://github.com/ialong/GPt.

David R Burt, Carl Edward Rasmussen, and Mark van der Wilk.
**Rates of convergence for sparse
variational Gaussian process regression**.
*arXiv*, 2019.

** Abstract:** Excellent variational
approximations to Gaussian process posteriors have been developed which avoid
the $O(N^3)$ scaling with dataset size $N$. They reduce the
computational cost to $O(NM^2)$, with $M \ll N$ being the number of
inducing variables, which summarise the process. While the computational cost
seems to be linear in $N$, the true complexity of the algorithm depends on how
$M$ must increase to ensure a certain quality of approximation. We address this
by characterising the behavior of an upper bound on the KL divergence to the
posterior. We show that with high probability the KL divergence can be made
arbitrarily small by growing $M$ more slowly than $N$. A particular case of
interest is that for regression with normally distributed inputs in
$D$-dimensions with the popular Squared Exponential kernel,
$M = O(\log^D N)$ is sufficient. Our results show that as datasets grow,
Gaussian process posteriors can truly be approximated cheaply, and provide a
concrete rule for how to increase $M$ in continual learning scenarios.

Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison.
**Deep convolutional networks as
shallow Gaussian processes**.
In *International Conference on Learning Representations (ICLR)*,
2019.

** Abstract:** We show that the output of a (residual)
convolutional neural network (CNN) with an appropriate prior over the weights
and biases is a Gaussian process (GP) in the limit of infinitely many
convolutional filters, extending similar results for dense networks. For a
CNN, the equivalent kernel can be computed exactly and, unlike "deep
kernels", has very few parameters: only the hyperparameters of the original
CNN. Further, we show that this kernel has two properties that allow it to be
computed efficiently; the cost of evaluating the kernel for a pair of images
is similar to a single forward pass through the original CNN with only one
filter per layer. The kernel equivalent to a 32-layer ResNet obtains 0.84%
classification error on MNIST, a new record for GPs with a comparable number
of parameters.

James Requeima, William Tebbutt, Wessel Bruinsma, and Richard E. Turner.
**The Gaussian
process autoregressive regression model (GPAR)**.
In *22nd International Conference on Artificial Intelligence and
Statistics*. Proceedings of Machine Learning Research, 2019.

** Abstract:** Multi-output regression models must exploit
dependencies between outputs to maximise predictive performance. The
application of Gaussian processes (GPs) to this setting typically yields
models that are computationally demanding and have limited representational
power. We present the Gaussian Process Autoregressive Regression (GPAR)
model, a scalable multi-output GP model that is able to capture nonlinear,
possibly input-varying, dependencies between outputs in a simple and
tractable way: the product rule is used to decompose the joint distribution
over the outputs into a set of conditionals, each of which is modelled by a
standard GP. GPAR’s efficacy is demonstrated on a variety of synthetic and
real-world problems, outperforming existing GP models and achieving
state-of-the-art performance on established benchmarks.

Vincent Dutordoir, Hugh Salimbeni, Marc Deisenroth, and James Hensman.
**Gaussian
process conditional density estimation**.
In *Advances in Neural Information Processing Systems 31*, Montréal,
Canada, Dec 2018.

** Abstract:** Conditional Density
Estimation (CDE) models deal with estimating conditional distributions. The
conditions imposed on the distribution are the inputs of the model. CDE is a
challenging task as there is a fundamental trade-off between model
complexity, representational capacity and overfitting. In this work, we
propose to extend the model's input with latent variables and use Gaussian
processes (GP) to map this augmented input onto samples from the conditional
distribution. Our Bayesian approach allows for the modeling of small
datasets, but we also provide the machinery for it to be applied to big data
using stochastic variational inference. Our approach can be used to model
densities even in sparse data regions, and allows for sharing learned
structure between conditions. We illustrate the effectiveness and
wide-reaching applicability of our model on a variety of real-world
problems, such as spatio-temporal density estimation of taxi drop-offs,
non-Gaussian noise modeling, and few-shot learning on omniglot images.

Alessandro Davide Ialongo, Mark van der Wilk, James Hensman, and Carl Edward
Rasmussen.
**Non-factorised variational
inference in dynamical systems**.
In *First Symposium on Advances in Approximate Bayesian Inference*,
Montreal, December 2018.

** Abstract:** We focus on
variational inference in dynamical systems where the discrete time transition
function (or evolution rule) is modelled by a Gaussian process. The dominant
approach so far has been to use a factorised posterior distribution,
decoupling the transition function from the system states. This is not exact
in general and can lead to an overconfident posterior over the transition
function as well as an overestimation of the intrinsic stochasticity of the
system (process noise). We propose a new method that addresses these issues
and incurs no additional computational costs.

Matej Balog, Ilya Tolstikhin, and Bernhard Schölkopf.
**Differentially
private database release via kernel mean embeddings**.
In *35th International Conference on Machine Learning*, Stockholm,
Sweden, July 2018.

** Abstract:** We lay theoretical
foundations for new database release mechanisms that allow third-parties to
construct consistent estimators of population statistics, while ensuring that
the privacy of each individual contributing to the database is protected. The
proposed framework rests on two main ideas. First, releasing (an estimate of)
the kernel mean embedding of the data generating random variable instead of
the database itself still allows third-parties to construct consistent
estimators of a wide class of population statistics. Second, the algorithm
can satisfy the definition of differential privacy by basing the released
kernel mean embedding on entirely synthetic data points, while controlling
accuracy through the metric available in a Reproducing Kernel Hilbert Space.
We describe two instantiations of the proposed framework, suitable under
different scenarios, and prove theoretical results guaranteeing differential
privacy of the resulting algorithms and the consistency of estimators
constructed from their outputs.

** Comment:** [arXiv]

Manon Kok and Arno Solin.
**Scalable magnetic field SLAM in
3D using Gaussian process maps**.
In *Proceedings of the 21st International Conference on Information Fusion
(accepted for publication)*, Cambridge, UK, July 2018.

** Abstract:** We present a method for scalable and fully 3D magnetic field
simultaneous localisation and mapping (SLAM) using local anomalies in the
magnetic field as a source of position information. These anomalies are due
to the presence of ferromagnetic material in the structure of buildings and
in objects such as furniture. We represent the magnetic field map using a
Gaussian process model and take well-known physical properties of the
magnetic field into account. We build local magnetic field maps using
three-dimensional hexagonal block tiling. To make our approach
computationally tractable we use reduced-rank Gaussian process regression in
combination with a Rao-Blackwellised particle filter. We show that it is
possible to obtain accurate position and orientation estimates using
measurements from a smartphone, and that our approach provides a scalable
magnetic SLAM algorithm in terms of both computational complexity and map
storage.

Maria Lomeli, Mark Rowland, Arthur Gretton, and Zoubin Ghahramani.
**Antithetic and Monte Carlo
kernel estimators for partial rankings**.
*arXiv preprint arXiv:1807.00400*, 2018.

** Abstract:**
In the modern age, rankings data is ubiquitous and it is useful for a variety
of applications such as recommender systems, multi-object tracking and
preference learning. However, most rankings data encountered in the real
world is incomplete, which prevents the direct application of existing
modelling tools for complete rankings. Our contribution is a novel way to
extend kernel methods for complete rankings to partial rankings, via
consistent Monte Carlo estimators for Gram matrices: matrices of kernel
values between pairs of observations. We also present a novel variance
reduction scheme based on an antithetic variate construction between
permutations to obtain an improved estimator for the Mallows kernel. The
corresponding antithetic kernel estimator has lower variance and we
demonstrate empirically that it has a better performance in a variety of
Machine Learning tasks. Both kernel estimators are based on extending kernel
mean embeddings to the embedding of a set of full rankings consistent with an
observed partial ranking. They form a computationally tractable alternative
to previous approaches for partial rankings data. An overview of the existing
kernels and metrics for permutations is also provided.

Thang D. Bui, Cuong V. Nguyen, and Richard E. Turner.
**Streaming
sparse Gaussian process approximations**.
In *Advances in Neural Information Processing Systems 30*,
volume 30, Long Beach, California, USA, December 2017.

** Abstract:** Sparse approximations for Gaussian process models provide a
suite of methods that enable these models to be deployed in the large data regime
and enable analytic intractabilities to be sidestepped. However, the field
lacks a principled method to handle streaming data in which the posterior
distribution over function values and the hyperparameters are updated in an
online fashion. The small number of existing approaches either use suboptimal
hand-crafted heuristics for hyperparameter learning, or suffer from
catastrophic forgetting or slow updating when new data arrive. This paper
develops a new principled framework for deploying Gaussian process
probabilistic models in the streaming setting, providing principled methods
for learning hyperparameters and optimising pseudo-input locations. The
proposed framework is experimentally validated using synthetic and real-world
datasets.

** Comment:** The first two authors contributed equally.

Krzysztof Choromanski, Mark Rowland, and Adrian Weller.
**The
unreasonable effectiveness of structured random orthogonal
embeddings**.
In *Advances in Neural Information Processing Systems 30*, Long Beach,
California, December 2017.

** Abstract:** We examine a class
of embeddings based on structured random matrices with orthogonal rows which
can be applied in many machine learning applications including dimensionality
reduction and kernel approximation. For both the Johnson-Lindenstrauss
transform and the angular kernel, we show that we can select matrices
yielding guaranteed improved performance in accuracy and/or speed compared to
earlier methods. We introduce matrices with complex entries which give
significant further accuracy improvement. We provide geometric and Markov
chain-based perspectives to help understand the benefits, and empirical
results which suggest that the approach is helpful in a wider range of
applications.

Alessandro Davide Ialongo, Mark van der Wilk, and Carl Edward Rasmussen.
**Closed-form inference and
prediction in Gaussian process state-space models**.
In *NIPS Time Series Workshop 2017*, Long Beach, December 2017.

** Abstract:** We examine an analytic variational inference
scheme for the Gaussian Process State Space Model (GPSSM) - a probabilistic
model for system identification and time-series modelling. Our approach
performs variational inference over both the system states and the transition
function. We exploit Markov structure in the true posterior, as well as an
inducing point approximation to achieve linear time complexity in the length
of the time series. Contrary to previous approaches, no Monte Carlo sampling
is required: inference is cast as a deterministic optimisation problem. In a
number of experiments, we demonstrate the ability to model non-linear
dynamics in the presence of both process and observation noise as well as to
impute missing information (e.g. velocities from raw positions through time),
to de-noise, and to estimate the underlying dimensionality of the system.
Finally, we also introduce a closed-form method for multi-step prediction,
and a novel criterion for assessing the quality of our approximate
posterior.

Rowan McAllister and Carl Edward Rasmussen.
**Data-efficient
reinforcement learning in continuous state-action
Gaussian-POMDPs**.
In *Advances in Neural Information Processing Systems 30*, Long Beach,
California, December 2017.

** Abstract:** We present a
data-efficient reinforcement learning method for continuous state-action
systems under significant observation noise. Data-efficient solutions under
small noise exist, such as PILCO which learns the cartpole swing-up task in
30s. PILCO evaluates policies by planning state-trajectories using a dynamics
model. However, PILCO applies policies to the observed state, therefore
planning in observation space. We extend PILCO with filtering to instead plan
in belief space, consistent with partially observable Markov decision
process (POMDP) planning. This enables data-efficient learning under
significant observation noise, outperforming more naive methods such as
post-hoc application of a filter to policies optimised by the original
(unfiltered) PILCO algorithm. We test our method on the cartpole swing-up
task, which involves nonlinear dynamics and requires nonlinear control.

Martin A. Skoglund, Zoran Sjanic, and Manon Kok.
**On
orientation estimation using iterative methods in Euclidean space**.
In *Proceedings of the 20th International Conference on Information
Fusion*, Xi'an, China, July 2017. doi
10.23919/ICIF.2017.8009830.

** Abstract:** This paper
presents three iterative methods for orientation estimation. The first two
are based on iterated Extended Kalman filter (IEKF) formulations with
different state representations. The first is using the well-known unit
quaternion as state (q-IEKF) while the other is using orientation deviation
which we call IMEKF. The third method is based on nonlinear least squares
(NLS) estimation of the angular velocity which is used to parametrise the
orientation. The results are obtained using Monte Carlo simulations and the
comparison is done with the non-iterative EKF and multiplicative EKF (MEKF)
as baseline. The result clearly shows that the IMEKF and the NLS-based method
are superior to q-IEKF and all three outperform the non-iterative
methods.

Alexandre Khae Wu Navarro, Jes Frellsen, and Richard E. Turner.
**The Multivariate Generalised
von Mises distribution: Inference and applications**.
In *31st AAAI Conference on Artificial Intelligence*, San Francisco, CA,
USA, January 2017. AAAI Press.

** Abstract:** Circular
variables arise in a multitude of data-modelling contexts ranging from
robotics to the social sciences, but they have been largely overlooked by the
machine learning community. This paper partially redresses this imbalance by
extending some standard probabilistic modelling tools to the circular domain.
First we introduce a new multivariate distribution over circular variables,
called the multivariate Generalised von Mises (mGvM) distribution. This
distribution can be constructed by restricting and renormalising a general
multivariate Gaussian distribution to the unit hyper-torus. Previously
proposed multivariate circular distributions are shown to be special cases of
this construction. Second, we introduce a new probabilistic model for
circular regression, that is inspired by Gaussian Processes, and a method for
probabilistic principal component analysis with circular hidden variables.
These models can leverage standard modelling tools (e.g. covariance functions
and methods for automatic relevance determination). Third, we show that the
posterior distribution in these models is a mGvM distribution which enables
development of an efficient variational free-energy scheme for performing
approximate inference and approximate maximum-likelihood learning.

Thang D. Bui, Josiah Yan, and Richard E. Turner.
**A unifying framework for
Gaussian process pseudo-point approximations using power expectation
propagation**.
*Journal of Machine Learning Research*, 18(104):1-72, 2017.

** Abstract:** Gaussian processes (GPs) are flexible
distributions over functions that enable high-level assumptions about unknown
functions to be encoded in a parsimonious, flexible and general way. Although
elegant, the application of GPs is limited by computational and analytical
intractabilities that arise when data are sufficiently numerous or when
employing non-Gaussian models. Consequently, a wealth of GP approximation
schemes have been developed over the last 15 years to address these key
limitations. Many of these schemes employ a small set of pseudo data points
to summarise the actual data. In this paper we develop a new pseudo-point
approximation framework using Power Expectation Propagation (Power EP) that
unifies a large number of these pseudo-point approximations. Unlike much of
the previous venerable work in this area, the new framework is built on
standard methods for approximate inference (variational free-energy, EP and
Power EP methods) rather than employing approximations to the probabilistic
generative model itself. In this way all of the approximation is performed at
'inference time' rather than at 'modelling time', resolving awkward
philosophical and empirical questions that trouble previous approaches.
Crucially, we demonstrate that the new framework includes new pseudo-point
approximation methods that outperform current approaches on regression and
classification tasks.

Manon Kok, Jeroen D. Hol, and Thomas B. Schön.
**Using
inertial sensors for position and orientation estimation**.
*Foundations and Trends in Signal Processing*, 11(1-2):1-153, 2017.

** Abstract:** In recent years, MEMS inertial sensors (3D
accelerometers and 3D gyroscopes) have become widely available due to their
small size and low cost. Inertial sensor measurements are obtained at high
sampling rates and can be integrated to obtain position and orientation
information. These estimates are accurate on a short time scale, but suffer
from integration drift over longer time scales. To overcome this issue,
inertial sensors are typically combined with additional sensors and models.
In this tutorial we focus on the signal processing aspects of position and
orientation estimation using inertial sensors. We discuss different modeling
choices and a selected number of important algorithms. The algorithms include
optimization-based smoothing and filtering as well as computationally cheaper
extended Kalman filter and complementary filter implementations. The quality
of their estimates is illustrated using both experimental and simulated
data.

** Comment:** arXiv

Alexander G. D. G. Matthews, Jiri Hron, Richard E. Turner, and Zoubin
Ghahramani.
**Sample-then-optimise
posterior sampling for Bayesian linear models**.
*AABI (NeurIPS workshop)*, 2017.

** Abstract:** In modern
machine learning it is common to train models which have an extremely high
intrinsic capacity. The results obtained are often initialization dependent,
are different for disparate optimizers and in some cases have no explicit
regularization. This raises difficult questions about generalization. A
natural approach to questions of generalization is a Bayesian one. There is
therefore a growing literature attempting to understand how Bayesian
posterior inference could emerge from the complexity of modern practice, even
without having such a procedure as the stated goal. In this work we consider
a simple special case where exact Bayesian posterior sampling emerges from
sampling (cf. initialization) and then gradient descent. Specifically, for a
Bayesian linear model, if we parameterize it in terms of a deterministic
function of an isotropic normal prior, then the action of sampling from the
prior followed by first order optimization of the squared loss will give a
posterior sample. Although the assumptions are stronger than many real
problems, it still exhibits the challenging properties of redundant model
capacity and a lack of explicit regularizers, along with initialization and
optimizer dependence. It is therefore an interesting controlled test case.
Given its simplicity, the method itself may turn out to be of independent
interest from our original goal.
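
The sample-then-optimise idea in the abstract above can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the authors' code: the observation noise is folded into the parameter vector so that the prior is isotropic normal (the augmented design matrix `A = [Phi, sigma * I]`), and the feature matrix `Phi`, the noise level `sigma`, and the function name `sample_then_optimise` are hypothetical choices. Plain gradient descent on the squared loss, started from a prior draw, should then converge to a draw from the exact posterior.

```python
import numpy as np


def sample_then_optimise(A, y, theta0, n_steps=2000):
    """Gradient descent on 0.5 * ||A @ theta - y||^2, started at theta0."""
    # Step size 1 / (largest eigenvalue of A^T A) keeps the iteration stable.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    theta = theta0.copy()
    for _ in range(n_steps):
        theta -= step * (A.T @ (A @ theta - y))
    return theta


rng = np.random.default_rng(0)

# Hypothetical toy model: y = Phi @ w + sigma * eps, with w and eps both N(0, I).
x = np.linspace(-1.0, 1.0, 5)
Phi = np.stack([np.ones_like(x), x], axis=1)   # features [1, x], shape (5, 2)
sigma = 0.5
A = np.hstack([Phi, sigma * np.eye(5)])        # augmented design, shape (5, 7)
y = Phi @ np.array([0.3, -1.2]) + sigma * rng.normal(size=5)

# 1) Sample from the isotropic prior; 2) optimise the squared loss.
theta0 = rng.normal(size=7)
theta = sample_then_optimise(A, y, theta0)
w_sample = theta[:2]                           # weight block of the sample

# Closed-form limit of gradient descent from theta0 (minimum-change solution).
theta_closed = theta0 + np.linalg.pinv(A) @ (y - A @ theta0)

# Covariance of the limit as a function of theta0, and the textbook posterior
# covariance of w for Bayesian linear regression with a unit-variance prior.
proj = np.eye(7) - np.linalg.pinv(A) @ A
post_cov = sigma**2 * np.linalg.inv(Phi.T @ Phi + sigma**2 * np.eye(2))
```

Gradient descent only moves the iterate within the row space of `A`, so the null-space component of the prior draw survives. That is why `theta` matches `theta_closed`, and why the weight block of `proj` agrees with `post_cov`.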

Mark van der Wilk, Carl Edward Rasmussen, and James Hensman.
**Convolutional
Gaussian processes**.
In *Advances in Neural Information Processing Systems 30*, 2017.

** Abstract:** We present a practical way of introducing
convolutional structure into Gaussian processes, making them more suited to
high-dimensional inputs like images. The main contribution of our work is the
construction of an inter-domain inducing point approximation that is
well-tailored to the convolutional kernel. This allows us to gain the
generalisation benefit of a convolutional kernel, together with fast but
accurate posterior inference. We investigate several variations of the
convolutional kernel, and apply it to MNIST and CIFAR-10, which have both
been known to be challenging for Gaussian processes. We also show how the
marginal likelihood can be used to find an optimal weighting between
convolutional and RBF kernels to further improve performance. We hope that
this illustration of the usefulness of a marginal likelihood will help
automate discovering architectures in larger models.

** Comment:** arXiv

Thang D. Bui, Daniel Hernández-Lobato, José Miguel Hernández-Lobato,
Yingzhen Li, and Richard E. Turner.
**Deep Gaussian
processes for regression using approximate expectation propagation**.
In *33rd International Conference on Machine Learning*, New York, USA,
June 2016.

** Abstract:** Deep Gaussian processes (DGPs) are
multi-layer hierarchical generalisations of Gaussian processes (GPs) and are
formally equivalent to neural networks with multiple, infinitely wide hidden
layers. DGPs are nonparametric probabilistic models and as such are arguably
more flexible, have a greater capacity to generalise, and provide better
calibrated uncertainty estimates than alternative deep models. This paper
develops a new approximate Bayesian learning scheme that enables DGPs to be
applied to a range of medium to large scale regression problems for the first
time. The new method uses an approximate Expectation Propagation procedure
and a novel and efficient extension of the probabilistic backpropagation
algorithm for learning. We evaluate the new method for non-linear regression
on eleven real-world datasets, showing that it always outperforms GP
regression and is almost always better than state-of-the-art deterministic
and sampling-based approximate inference methods for Bayesian neural
networks. As a by-product, this work provides a comprehensive analysis of six
approximate Bayesian methods for training neural networks.

Matej Balog, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M. Roy, and
Yee Whye Teh.
**The
Mondrian kernel**.
In *32nd Conference on Uncertainty in Artificial Intelligence*, pages
32-41, Jersey City, New Jersey, USA, June 2016.

** Abstract:** We introduce the Mondrian kernel, a fast random feature
approximation to the Laplace kernel. It is suitable for both batch and online
learning, and admits a fast kernel-width-selection procedure as the random
features can be re-used efficiently for all kernel widths. The features are
constructed by sampling trees via a Mondrian process [Roy and Teh, 2009], and
we highlight the connection to Mondrian forests [Lakshminarayanan et al.,
2014], where trees are also sampled via a Mondrian process, but fit
independently. This link provides a new insight into the relationship between
kernel methods and random forests.

** Comment:** [Supplementary Material] [arXiv] [Poster] [Slides] [Code]
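
To make the link between random partitions and the Laplace kernel concrete, here is a minimal sketch restricted to one input dimension, where a Mondrian process with lifetime `lifetime` reduces to a Poisson process of cut points. It is not the paper's implementation: the function names, the interval `[lo, hi]`, and the test inputs are hypothetical. Each random partition gives a one-hot "which cell" feature, and the fraction of partitions in which two inputs share a cell estimates `exp(-lifetime * |x - x'|)`.

```python
import numpy as np


def mondrian_1d_cells(x, lifetime, n_trees, lo, hi, rng):
    """Cell index of every input in each of n_trees random partitions of [lo, hi]."""
    cells = np.empty((n_trees, len(x)), dtype=int)
    for m in range(n_trees):
        # In 1-D the cuts form a Poisson process with rate `lifetime`.
        n_cuts = rng.poisson(lifetime * (hi - lo))
        cuts = np.sort(rng.uniform(lo, hi, size=n_cuts))
        cells[m] = np.searchsorted(cuts, x)
    return cells


def kernel_estimate(cells):
    """Fraction of partitions in which each pair of inputs falls in the same cell."""
    same_cell = cells[:, :, None] == cells[:, None, :]
    return same_cell.mean(axis=0)


rng = np.random.default_rng(0)
x = np.array([0.0, 0.2, 0.5, 1.0])
lifetime = 2.0

cells = mondrian_1d_cells(x, lifetime, n_trees=4000, lo=0.0, hi=1.0, rng=rng)
K_hat = kernel_estimate(cells)

# Exact Laplace kernel with width parameter equal to the lifetime.
K_laplace = np.exp(-lifetime * np.abs(x[:, None] - x[None, :]))
```

Because the cuts for a smaller lifetime are a thinned subset of the cuts for a larger one, the same sampled partitions can in principle be reused across kernel widths, which is the property the abstract highlights.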

Alexander G. D. G. Matthews, James Hensman, Richard E. Turner, and Zoubin
Ghahramani.
**On Sparse Variational methods
and the Kullback-Leibler divergence between stochastic processes**.
In *19th International Conference on Artificial Intelligence and
Statistics*, Cadiz, Spain, May 2016.

** Abstract:** The
variational framework for learning inducing variables (Titsias, 2009a) has
had a large impact on the Gaussian process literature. The framework may be
interpreted as minimizing a rigorously defined Kullback-Leibler divergence
between the approximating and posterior processes. To our knowledge this
connection has thus far gone unremarked in the literature. In this paper we
give a substantial generalization of the literature on this topic. We give a
new proof of the result for infinite index sets which allows inducing points
that are not data points and likelihoods that depend on all function values.
We then discuss augmented index sets and show that, contrary to previous
works, marginal consistency of augmentation is not enough to guarantee
consistency of variational inference with the original model. We then
characterize an extra condition where such a guarantee is obtainable. Finally
we show how our framework sheds light on interdomain sparse approximations
and sparse approximations for Cox processes.

Matthias Stephan Bauer, Mark van der Wilk, and Carl Edward Rasmussen.
**Understanding
probabilistic sparse Gaussian process approximations**.
In *Advances in Neural Information Processing Systems 29*, 2016.

** Abstract:** Good sparse approximations are essential for
practical inference in Gaussian Processes as the computational cost of exact
methods is prohibitive for large datasets. The Fully Independent Training
Conditional (FITC) and the Variational Free Energy (VFE) approximations are
two recent popular methods. Despite superficial similarities, these
approximations have surprisingly different theoretical properties and behave
differently in practice. We thoroughly investigate the two methods for
regression both analytically and through illustrative examples, and draw
conclusions to guide practical application.

** Comment:** arXiv

Roberto Calandra, Jan Peters, Carl Edward Rasmussen, and Marc Peter Deisenroth.
**Manifold
Gaussian processes for regression**.
In *International Joint Conference on Neural Networks*, 2016.

** Abstract:** Off-the-shelf Gaussian Process (GP) covariance
functions encode smoothness assumptions on the structure of the function to
be modeled. To model complex and nondifferentiable functions, these
smoothness assumptions are often too restrictive. One way to alleviate this
limitation is to find a different representation of the data by introducing a
feature space. This feature space is often learned in an unsupervised way,
which might lead to data representations that are not useful for the overall
regression task. In this paper, we propose Manifold Gaussian Processes, a
novel supervised method that jointly learns a transformation of the data into
a feature space and a GP regression from the feature space to observed space.
The Manifold GP is a full GP and allows to learn data representations, which
are useful for the overall regression task. As a proof-of-concept, we
evaluate our approach on complex non-smooth functions where standard GPs
perform poorly, such as step functions and robotics tasks with contacts.

Carl-Johann Simon-Gabriel, Adam Ścibior, Ilya Tolstikhin, and Bernhard
Schölkopf.
**Consistent
kernel mean estimation for functions of random variables**.
In *Advances in Neural Information Processing Systems 29*, 2016.

** Abstract:** We provide a theoretical foundation for
non-parametric estimation of functions of random variables using kernel mean
embeddings. We show that for any continuous function f, consistent estimators
of the mean embedding of a random variable X lead to consistent estimators of
the mean embedding of f(X). For Matérn kernels and sufficiently smooth
functions we also provide rates of convergence. Our results extend to
functions of multiple random variables. If the variables are dependent, we
require an estimator of the mean embedding of their joint distribution as a
starting point; if they are independent, it is sufficient to have separate
estimators of the mean embeddings of their marginal distributions. In either
case, our results cover both mean embeddings based on i.i.d. samples as well
as "reduced set" expansions in terms of dependent expansion points. The
latter serves as a justification for using such expansions to limit memory
resources when applying the approach as a basis for probabilistic
programming.

James Hensman, Alexander G. D. G. Matthews, Maurizio Filippone, and Zoubin
Ghahramani.
**MCMC
for Variationally Sparse Gaussian Processes**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2015.

** Abstract:** Gaussian
process (GP) models form a core part of probabilistic machine learning.
Considerable research effort has been made into attacking three issues with
GP models: how to compute efficiently when the number of data is large; how
to approximate the posterior when the likelihood is not Gaussian and how to
estimate covariance function parameter posteriors. This paper simultaneously
addresses these, using a variational approximation to the posterior which is
sparse in support of the function but otherwise free-form. The result is a
Hybrid Monte-Carlo sampling scheme which allows for a non-Gaussian
approximation over the function values and covariance parameters
simultaneously, with efficient computations based on inducing-point sparse
GPs. Code to replicate each experiment in this paper will be available
shortly.

James Robert Lloyd and Zoubin Ghahramani.
**Statistical
model criticism using kernel two sample tests**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2015.

** Abstract:** We propose an
exploratory approach to statistical model criticism using maximum mean
discrepancy (MMD) two sample tests. Typical approaches to model criticism
require a practitioner to select a statistic by which to measure
discrepancies between data and a statistical model. MMD two sample tests are
instead constructed as an analytic maximisation over a large space of
possible statistics and therefore automatically select the statistic which
most shows any discrepancy. We demonstrate on synthetic data that the
selected statistic, called the witness function, can be used to identify
where a statistical model most misrepresents the data it was trained on. We
then apply the procedure to real data where the models being assessed are
restricted Boltzmann machines, deep belief networks and Gaussian process
regression and demonstrate the ways in which these models fail to capture the
properties of the data they are trained on.
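
The witness function mentioned in this abstract is easy to compute for small samples. The following is a minimal sketch, assuming a one-dimensional RBF kernel and two tiny hand-made samples (`data` and `model_samples` are hypothetical, as are the function names); it is not the authors' code. The witness is the difference between the two empirical kernel mean embeddings, so it is large and positive where the data has mass that the model samples lack.

```python
import numpy as np


def rbf(a, b, lengthscale=1.0):
    """RBF kernel matrix between 1-D arrays a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)


def mmd2_biased(x, y, lengthscale=1.0):
    """Biased (V-statistic) estimate of the squared maximum mean discrepancy."""
    return (rbf(x, x, lengthscale).mean()
            + rbf(y, y, lengthscale).mean()
            - 2.0 * rbf(x, y, lengthscale).mean())


def witness(x, y, t, lengthscale=1.0):
    """Unnormalised witness function evaluated at the points t."""
    return rbf(t, x, lengthscale).mean(axis=1) - rbf(t, y, lengthscale).mean(axis=1)


# Hypothetical example: the data contains an outlier near 3 that the
# model samples do not reproduce.
data = np.array([-0.5, 0.0, 0.5, 3.0])
model_samples = np.array([-0.6, -0.1, 0.1, 0.4])

grid = np.linspace(-2.0, 5.0, 141)
w = witness(data, model_samples, grid)

# Location where the data is most under-represented by the model.
under_represented = grid[np.argmax(w)]
discrepancy = mmd2_biased(data, model_samples)
```

Inspecting `under_represented` points at the region around the outlier, which is the kind of targeted diagnostic the abstract describes.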

Felipe Tobar, Thang D. Bui, and Richard E. Turner.
**Learning
stationary time series using Gaussian processes with nonparametric
kernels**.
In *Advances in Neural Information Processing Systems 28*, Montreal,
Canada, December 2015.

** Abstract:** We introduce the Gaussian
Process Convolution Model (GPCM), a two-stage nonparametric generative
procedure to model stationary signals as the convolution between a
continuous-time white-noise process and a continuous-time linear filter drawn
from a Gaussian process. The GPCM is a continuous-time nonparametric-window
moving average process and, conditionally, is itself a Gaussian process with
a nonparametric kernel defined in a probabilistic fashion. The generative
model can be equivalently considered in the frequency domain, where the power
spectral density of the signal is specified using a Gaussian process. One of
the main contributions of the paper is to develop a novel variational
free-energy approach based on inter-domain inducing variables that efficiently
learns the continuous-time linear filter and infers the driving white-noise
process. In turn, this scheme provides closed-form probabilistic estimates of
the covariance kernel and the noise-free signal both in denoising and
prediction scenarios. Additionally, the variational inference procedure
provides closed-form expressions for the approximate posterior of the
spectral density given the observed data, leading to new Bayesian
nonparametric approaches to spectrum estimation. The proposed GPCM is
validated using synthetic and real-world signals.

James Hensman, Alexander G. D. G. Matthews, and Zoubin Ghahramani.
**Scalable
Variational Gaussian Process Classification**.
In *18th International Conference on Artificial Intelligence and
Statistics*, pages 1-9, San Diego, California, USA, May 2015.

** Abstract:** Gaussian process classification is a popular
method with a number of appealing properties. We show how to scale the model
within a variational inducing point framework, out-performing the state of
the art on benchmark datasets. Importantly, the variational formulation can be
exploited to allow classification in problems with millions of data points,
as we demonstrate in experiments.

Alex Davies.
**Effective
implementation of Gaussian process regression for machine learning**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2015.

** Abstract:** This thesis presents frameworks for the
effective implementation of Gaussian process regression for machine learning.
It addresses this in three parts: effective iterative methods for calculating
the predictive distribution and derivatives of a Gaussian process with fixed
hyper-parameters, defining three broad classes of kernels of controllable
complexity that allow for an order of magnitude scaling in the previous
framework and an investigation into alternative objective functions and
improved derivatives for the optimization of model hyper-parameters.

Marc Peter Deisenroth, Dieter Fox, and Carl Edward Rasmussen.
**Gaussian processes for data-efficient learning in robotics and control**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*,
37:408-423, 2015, doi
10.1109/TPAMI.2013.218.

** Abstract:** Autonomous learning
has been a promising direction in control and robotics for more than a decade
since data-driven learning allows to reduce the amount of engineering
knowledge, which is otherwise required. However, autonomous reinforcement
learning (RL) approaches typically require many interactions with the system
to learn controllers, which is a practical limitation in real systems, such
as robots, where many interactions can be impractical and time consuming. To
address this problem, current learning approaches typically require
task-specific knowledge in form of expert demonstrations, realistic
simulators, pre-shaped policies, or specific knowledge about the underlying
dynamics. In this article, we follow a different approach and speed up
learning by extracting more information from data. In particular, we learn a
probabilistic, non-parametric Gaussian process transition model of the
system. By explicitly incorporating model uncertainty into long-term planning
and controller learning our approach reduces the effects of model errors, a
key problem in model-based learning. Compared to state-of-the-art RL our
model-based policy search method achieves an unprecedented speed of learning.
We demonstrate its applicability to autonomous learning in real robot and
control tasks.

Roger Frigola.
**Bayesian time series
learning with Gaussian processes**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2015.

** Abstract:** The analysis of time series data is
important in fields as disparate as the social sciences, biology, engineering
or econometrics. In this dissertation, we present a number of algorithms
designed to learn Bayesian nonparametric models of time series. The goal of
these kinds of models is twofold. First, they aim at making predictions which
quantify the uncertainty due to limitations in the quantity and the quality
of the data. Second, they are flexible enough to model highly complex data
whilst preventing overfitting when the data does not warrant complex
models.

We begin with a unifying literature review on time series models
based on Gaussian processes. Then, we centre our attention on the Gaussian
Process State-Space Model (GP-SSM): a Bayesian nonparametric generalisation
of discrete-time nonlinear state-space models. We present a novel formulation
of the GP-SSM that offers new insights into its properties. We then proceed
to exploit those insights by developing new learning algorithms for the
GP-SSM based on particle Markov chain Monte Carlo and variational
inference.

Finally, we present a filtered nonlinear auto-regressive
model with a simple, robust and fast learning algorithm that makes it well
suited to its application by non-experts on large datasets. Its main
advantage is that it avoids the computationally expensive (and potentially
difficult to tune) smoothing step that is a key part of learning nonlinear
state-space models.

Yarin Gal, Yutian Chen, and Zoubin Ghahramani.
**Latent
Gaussian processes for distribution estimation of multivariate categorical
data**.
In *Proceedings of the 32nd International Conference on Machine Learning
(ICML-15)*, pages 645-654, 2015.

** Abstract:**
Multivariate categorical data occur in many applications of machine learning.
One of the main difficulties with these vectors of categorical variables is
sparsity. The number of possible observations grows exponentially with vector
length, but dataset diversity might be poor in comparison. Recent models have
gained significant improvement in supervised tasks with this data. These
models embed observations in a continuous space to capture similarities
between them. Building on these ideas we propose a Bayesian model for the
unsupervised task of distribution estimation of multivariate categorical
data. We model vectors of categorical variables as generated from a
non-linear transformation of a continuous latent space. Non-linearity
captures multi-modality in the distribution. The continuous representation
addresses sparsity. Our model ties together many existing models, linking the
linear categorical latent Gaussian model, the Gaussian process latent
variable model, and Gaussian process classification. We derive inference for
our model based on recent developments in sampling based variational
inference. We show empirically that the model outperforms its linear and
discrete counterparts in imputation tasks of sparse data.

E. Gilboa, Yunus Saatçi, and John P. Cunningham.
**Scaling multidimensional inference for structured Gaussian processes**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*, pages
424-436, 2015, doi
10.1109/TPAMI.2013.192.

** Abstract:** Exact Gaussian
process (GP) regression has O(N^3) runtime for data size N, making
it intractable for large N. Many algorithms for improving GP scaling
approximate the covariance with lower rank matrices. Other work has exploited
structure inherent in particular covariance functions, including GPs with
implied Markov structure, and inputs on a lattice (both enable O(N) or O(N
log N) runtime). However, these GP advances have not been well extended to
the multidimensional input setting, despite the preponderance of
multidimensional applications. This paper introduces and tests three novel
extensions of structured GPs to multidimensional inputs, for models with
additive and multiplicative kernels. First we present a new method for
inference in additive GPs, showing a novel connection between the classic
backfitting method and the Bayesian framework. We extend this model using two
advances: a variant of projection pursuit regression, and a Laplace
approximation for non-Gaussian observations. Lastly, for multiplicative
kernel structure, we present a novel method for GPs with inputs on a
multidimensional grid. We illustrate the power of these three advances on
several data sets, achieving performance equal to or very close to the naive
GP at orders of magnitude less cost.

** Comment:** arXiv
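
For context on the cubic cost mentioned in this abstract, here is a minimal sketch of exact GP regression in NumPy. It shows only the naive baseline, not the structured methods the paper proposes; the RBF kernel, its hyperparameters, the noise variance, and the toy data are all assumptions made for illustration. The Cholesky factorisation of the N x N kernel matrix is the O(N^3) step.

```python
import numpy as np


def rbf(a, b, lengthscale=1.0, variance=1.0):
    """RBF covariance between 1-D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)


def gp_predict(x_train, y_train, x_test, noise_var=1e-2):
    """Posterior mean and latent variance of an exact GP at x_test."""
    n = len(x_train)
    K = rbf(x_train, x_train) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)                      # O(N^3) factorisation
    # np.linalg.solve is a general solver; a triangular solver would be
    # cheaper here but is omitted to keep the sketch NumPy-only.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    K_star = rbf(x_train, x_test)
    mean = K_star.T @ alpha
    v = np.linalg.solve(L, K_star)
    var = np.diag(rbf(x_test, x_test)) - np.sum(v**2, axis=0)
    return mean, var


# Hypothetical toy data.
x_train = np.linspace(0.0, 5.0, 8)
y_train = np.sin(x_train)
x_test = np.array([0.5, 2.5, 7.0])

mean, var = gp_predict(x_train, y_train, x_test)
```

Doubling N multiplies the factorisation cost by roughly eight, which is why the low-rank and structure-exploiting approaches surveyed in the abstract matter for large data sets.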

Yarin Gal and Richard Turner.
**Improving the
Gaussian process sparse spectrum approximation by representing uncertainty
in frequency inputs**.
In *Proceedings of the 32nd International Conference on Machine Learning
(ICML-15)*, pages 655-664, 2015.

** Abstract:** Standard
sparse pseudo-input approximations to the Gaussian process (GP) cannot handle
complex functions well. Sparse spectrum alternatives attempt to answer this
but are known to over-fit. We suggest the use of variational inference for
the sparse spectrum approximation to avoid both issues. We model the
covariance function with a finite Fourier series approximation and treat it
as a random variable. The random covariance function has a posterior, on
which a variational distribution is placed. The variational distribution
transforms the random covariance function to fit the data. We study the
properties of our approximate inference, compare it to alternative ones, and
extend it to the distributed and stochastic domains. Our approximation
captures complex functions better than standard approaches and avoids
over-fitting.

James Robert Lloyd.
**Representation,
learning, description and criticism of probabilistic models with applications
to networks, functions and relational data**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2015.

** Abstract:** This thesis makes contributions to a
variety of aspects of probabilistic inference. When performing probabilistic
inference, one must first represent one’s beliefs with a probability
distribution. Specifying the details of a probability distribution can be a
difficult task in many situations, but when expressing beliefs about complex
data structures it may not even be apparent what form such a distribution
should take. This thesis starts by demonstrating how representation theorems
due to Aldous, Hoover and Kallenberg can be used to specify appropriate
models for data in the form of networks. These theorems are then extended in
order to reveal appropriate probability distributions for arbitrary
relational data or databases. A simpler data structure to specify probability
distributions for is that of functions; many probability distributions for
functions have been used for centuries. We demonstrate that many of these
distributions can be expressed in a common language of Gaussian process
kernels constructed from a few base elements and operators. The structure of
this language allows for the effective automatic construction of
probabilistic models for functions. Furthermore, the formal mathematical
language of kernels can be mapped neatly onto natural language allowing for
automatic descriptions of the automatically constructed models. By further
automating the construction of statistical models, the need to be able to
effectively check or criticise these models becomes greater. This thesis
demonstrates how kernel two sample tests can be used to demonstrate where a
probabilistic model most disagrees with data allowing for targeted
improvements to the model. In proposing a new method of model criticism this
thesis also briefly discusses the philosophy of model criticism within the
context of probabilistic inference.

Felipe Tobar, Petar M. Djurić, and Danilo P. Mandic.
**Unsupervised
state-space modeling using reproducing kernels**.
*IEEE Transactions on Signal Processing*, 63:5210-5221, 2015.

** Abstract:** A novel framework for the design of state-space
models (SSMs) is proposed whereby the state-transition function of the model
is parametrized using reproducing kernels. The nature of SSMs requires
learning a latent function that resides in the state space and for which
input-output sample pairs are not available, thus prohibiting the use of
gradient-based supervised kernel learning. To this end, we then propose to
learn the mixing weights of the kernel estimate by sampling from their
posterior density using Monte Carlo methods. We first introduce an offline
version of the proposed algorithm, followed by an online version which
performs inference on both the parameters and the hidden state through
particle filtering. The accuracy of the estimation of the state-transition
function is first validated on synthetic data. Next, we show that the
proposed algorithm outperforms kernel adaptive filters in the prediction of
real-world time series, while also providing probabilistic estimates, a key
advantage over standard methods.

Felipe Tobar and Danilo P. Mandic.
**Design
of positive-definite quaternion kernels**.
*IEEE Signal Processing Letters*, 22:2117-2121, 2015.

** Abstract:** Quaternion reproducing kernel Hilbert spaces (QRKHS) have been
proposed recently and provide a high-dimensional feature space (alternative
to the real-valued multikernel approach) for general kernel-learning
applications. The current challenge within quaternion-kernel learning is the
lack of general quaternion-valued kernels, which are necessary to exploit the
full advantages of the QRKHS theory in real-world problems. This letter
proposes a novel way to design quaternion-valued kernels; this is achieved by
transforming three complex kernels into quaternion ones and then combining
their real and imaginary parts. Building on this general construction, our
emphasis is on a new quaternion kernel of polynomial features, which is
assessed in the prediction of body sensor network applications.

Felipe Tobar and Danilo P. Mandic.
**High-dimensional kernel regression: A guide for practitioners**.
In Y. C. Lim, H. K. Kwan, and W.-C. Siu, editors, *Trends in Digital Signal
Processing: A Festschrift in Honour of A.G. Constantinides*,
chapter 9, pages 287-310. CRC Press, 2015.

Felipe Tobar and Richard E. Turner.
**Modelling
of complex signals using Gaussian processes**.
In *Proceedings of the IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP)*, pages 2209-2213, 2015.

** Abstract:** In complex-valued signal processing, estimation
algorithms require complete knowledge (or accurate estimation) of the
second-order statistics; this makes Gaussian processes (GP) well suited for
modelling complex signals, as they are designed in terms of covariance
functions. Dealing with bivariate signals using GPs requires four covariance
matrices, or equivalently, two complex matrices. We propose a GP-based
approach for modelling complex signals, whereby the second-order statistics
are learnt through maximum likelihood; in particular, the complex GP approach
allows for circularity coefficient estimation in a robust manner when the
observed signal is corrupted by (circular) white noise. The proposed model is
validated using climate signals, for both circular and noncircular cases. The
results obtained open new possibilities for collaboration between the complex
signal processing and Gaussian processes communities towards an appealing
representation and statistical description of bivariate signals.

James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, and
Zoubin Ghahramani.
**Automatic construction and
natural-language description of nonparametric regression models**.
In *Association for the Advancement of Artificial Intelligence (AAAI)*,
July 2014.

** Abstract:** This paper presents the beginnings
of an automatic statistician, focusing on regression problems. Our system
explores an open-ended space of statistical models to discover a good
explanation of a data set, and then produces a detailed report with figures
and natural-language text. Our approach treats unknown regression functions
nonparametrically using Gaussian processes, which has two important
consequences. First, Gaussian processes can model functions in terms of
high-level properties (e.g. smoothness, trends, periodicity, changepoints).
Taken together with the compositional structure of our language of models
this allows us to automatically describe functions in simple terms. Second,
the use of flexible nonparametric models and a rich language for composing
them in an open-ended manner also results in state-of-the-art extrapolation
performance evaluated over 13 real time series data sets from various
domains.

Michael Schober, David Duvenaud, and Philipp Hennig.
**Probabilistic ODE solvers with
Runge-Kutta means**.
*arXiv preprint arXiv:1406.2582*, June 2014.

** Abstract:** Runge-Kutta methods are the classic family of solvers for
ordinary differential equations (ODEs), and the basis for the
state-of-the-art. Like most numerical methods, they return point estimates.
We construct a family of probabilistic numerical methods that instead return
a Gauss-Markov process defining a probability distribution over the ODE
solution. In contrast to prior work, we construct this family such that
posterior means match the outputs of the Runge-Kutta family exactly, thus
inheriting their proven good properties. Remaining degrees of freedom not
identified by the match to Runge-Kutta are chosen such that the posterior
probability measure fits the observed structure of the ODE. Our results shed
light on the structure of Runge-Kutta solvers from a new direction, provide a
richer, probabilistic output, have low computational cost, and raise new
research questions.

David Duvenaud, Oren Rippel, Ryan P. Adams, and Zoubin Ghahramani.
**Avoiding pathologies in very
deep networks**.
In *17th International Conference on Artificial Intelligence and
Statistics*, Reykjavik, Iceland, April 2014.

** Abstract:** Choosing appropriate architectures and regularization
strategies for deep networks is crucial to good predictive performance. To
shed light on this problem, we analyze the analogous problem of constructing
useful priors on compositions of functions. Specifically, we study the deep
Gaussian process, a type of infinitely-wide, deep neural network. We show
that in standard architectures, the representational capacity of the network
tends to capture fewer degrees of freedom as the number of layers increases,
retaining only a single degree of freedom in the limit. We propose an
alternate network architecture which does not suffer from this pathology. We
also examine deep covariance functions, obtained by composing infinitely many
feature transforms. Lastly, we characterize the class of models obtained by
performing dropout on Gaussian processes.

B. Bischoff, D. Nguyen-Tuong, H. van Hoof, A. McHutchon, Carl Edward Rasmussen,
A. Knoll, and M. P. Deisenroth.
**Policy search
for learning robot control using sparse data**.
In *IEEE International Conference on Robotics and Automation*, pages
3882-3887, Hong Kong, China, 2014. IEEE.
doi: 10.1109/ICRA.2014.6907422.

** Abstract:** In many complex
robot applications, such as grasping and manipulation, it is difficult to
program desired task solutions beforehand, as robots are within an uncertain
and dynamic environment. In such cases, learning tasks from experience can be
a useful alternative. To obtain a sound learning and generalization
performance, machine learning, especially, reinforcement learning, usually
requires sufficient data. However, in cases where only little data is
available for learning, due to system constraints and practical issues,
reinforcement learning can act suboptimally. In this paper, we investigate
how model-based reinforcement learning, in particular the probabilistic
inference for learning control method (PILCO), can be tailored to cope with
the case of sparse data to speed up learning. The basic idea is to include
further prior knowledge into the learning process. As PILCO is built on the
probabilistic Gaussian processes framework, additional system knowledge can
be incorporated by defining appropriate prior distributions, e.g. a linear
mean Gaussian prior. The resulting PILCO formulation remains in closed form
and analytically tractable. The proposed approach is evaluated in simulation
as well as on a physical robot, the Festo Robotino XT. For the robot
evaluation, we employ the approach for learning an object pick-up task. The
results show that by including prior knowledge, policy learning can be sped
up in presence of sparse data.

Sébastien Bratières, Novi Quadrianto, Sebastian Nowozin, and Zoubin
Ghahramani.
**Scalable
Gaussian Process structured prediction for grid factor graph
applications**.
In *31st International Conference on Machine Learning*, 2014.

** Abstract:** Structured prediction is an important and well
studied problem with many applications across machine learning. GPstruct is a
recently proposed structured prediction model that offers appealing
properties such as being kernelised, non-parametric, and supporting Bayesian
inference (Bratières et al. 2013). The model places a Gaussian process prior
over energy functions which describe relationships between input variables
and structured output variables. However, the memory demand of GPstruct is
quadratic in the number of latent variables and training runtime scales
cubically. This prevents GPstruct from being applied to problems involving
grid factor graphs, which are prevalent in computer vision and spatial
statistics applications. Here we explore a scalable approach to learning
GPstruct models based on ensemble learning, with weak learners (predictors)
trained on subsets of the latent variables and bootstrap data, which can
easily be distributed. We show experiments with 4M latent variables on image
segmentation. Our method outperforms widely-used conditional random field
models trained with pseudo-likelihood. Moreover, in image segmentation
problems it improves over recent state-of-the-art marginal optimisation
methods in terms of predictive performance and uncertainty calibration.
Finally, it generalises well on all training set sizes.

Thang D. Bui and Richard E. Turner.
**Tree-structured Gaussian process approximations**.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 27*,
volume 27, pages 2213-2221. Curran Associates, Inc., 2014.

** Abstract:** Gaussian process regression can be accelerated by
constructing a small pseudo-dataset to summarize the observed data. This idea
sits at the heart of many approximation schemes, but such an approach
requires the number of pseudo-datapoints to be scaled with the range of the
input space if the accuracy of the approximation is to be maintained. This
presents problems in time-series settings or in spatial datasets where large
numbers of pseudo-datapoints are required since computation typically scales
quadratically with the pseudo-dataset size. In this paper we devise an
approximation whose complexity grows linearly with the number of
pseudo-datapoints. This is achieved by imposing a tree or chain structure on
the pseudo-datapoints and calibrating the approximation using a
Kullback-Leibler (KL) minimization. Inference and learning can then be
performed efficiently using the Gaussian belief propagation algorithm. We
demonstrate the validity of our approach on a set of challenging regression
tasks including missing data imputation for audio and spatial datasets. We
trace out the speed-accuracy trade-off for the new method and show that the
frontier dominates those obtained from a large number of existing
approximation techniques.

Alex Davies and Zoubin Ghahramani.
**The random forest
kernel and other kernels for big data from random partitions**.
*arXiv*, abs/1402.4293, 2014.

** Abstract:** We present
Random Partition Kernels, a new class of kernels derived by demonstrating a
natural connection between random partitions of objects and kernels between
those objects. We show how the construction can be used to create kernels
from methods that would not normally be viewed as random partitions, such as
Random Forest. To demonstrate the potential of this method, we propose two
new kernels, the Random Forest Kernel and the Fast Cluster Kernel, and show
that these kernels consistently outperform standard kernels on problems
involving real-world datasets. Finally, we show how the form of these kernels
lends itself to a natural approximation that is appropriate for certain
big data problems, allowing O(N) inference in methods such as Gaussian
Processes, Support Vector Machines and Kernel PCA.
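
The connection between random partitions and kernels described above can be illustrated with a small sketch: a kernel value is estimated as the fraction of sampled partitions in which two points land in the same cell. The partition scheme used here (a few uniformly drawn cut points on 1-D inputs) and the function name `random_partition_gram` are assumptions chosen for illustration; they are not the Random Forest Kernel or the Fast Cluster Kernel from the paper.

```python
import numpy as np


def random_partition_gram(x, n_partitions=200, n_cuts=3, seed=0):
    """Estimate a Gram matrix from random partitions of 1-D inputs.

    Each partition is defined by `n_cuts` cut points drawn uniformly over
    the data range (an illustrative choice). Entry (i, j) is the fraction
    of partitions in which x[i] and x[j] share a cell.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    gram = np.zeros((n, n))
    lo, hi = x.min(), x.max()
    for _ in range(n_partitions):
        cuts = np.sort(rng.uniform(lo, hi, size=n_cuts))
        # Cell index of every point under this partition.
        labels = np.searchsorted(cuts, x)
        # Indicator matrix: 1 where two points share a cell, else 0.
        gram += labels[:, None] == labels[None, :]
    return gram / n_partitions
```

Each indicator matrix is block-structured and positive semi-definite, so their average is too; this is why an average over random partitions can serve as a kernel.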

Roger Frigola, Yutian Chen, and Carl Edward Rasmussen.
**Variational
Gaussian process state-space models**.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 27*,
2014.

** Abstract:** State-space models have been successfully
used for more than fifty years in different areas of science and engineering.
We present a procedure for efficient variational Bayesian learning of
nonlinear state-space models based on sparse Gaussian processes. The result
of learning is a tractable posterior over nonlinear dynamical systems. In
comparison to conventional parametric models, we offer the possibility to
straightforwardly trade off model capacity and computational cost whilst
avoiding overfitting. Our main algorithm uses a hybrid inference approach
combining variational Bayes and sequential Monte Carlo. We also present
stochastic variational inference and online learning approaches for fast
learning with long time series.

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl Edward
Rasmussen.
**Identification of Gaussian
process state-space models with particle stochastic approximation
EM**.
In *Proceedings of the 19th World Congress of the International Federation
of Automatic Control (IFAC)*, 2014.

** Abstract:**
Gaussian process state-space models (GP-SSMs) are a very flexible family of
models of nonlinear dynamical systems. They comprise a Bayesian nonparametric
representation of the dynamics of the system and additional
(hyper-)parameters governing the properties of this nonparametric
representation. The Bayesian formalism enables systematic reasoning about the
uncertainty in the system dynamics. We present an approach to maximum
likelihood identification of the parameters in GP-SSMs, while retaining the
full nonparametric description of the dynamics. The method is based on a
stochastic approximation version of the EM algorithm that employs recent
developments in particle Markov chain Monte Carlo for efficient
identification.

Yarin Gal, Mark van der Wilk, and Carl Rasmussen.
**Distributed
variational inference in sparse Gaussian process regression and latent
variable models**.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 27*, pages
3257-3265. Curran Associates, Inc., 2014.

** Abstract:**
Gaussian processes (GPs) are a powerful tool for probabilistic inference over
functions. They have been applied to both regression and non-linear
dimensionality reduction, and offer desirable properties such as uncertainty
estimates, robustness to over-fitting, and principled ways for tuning
hyper-parameters. However, the scalability of these models to big datasets
remains an active topic of research. We introduce a novel re-parametrisation
of variational inference for sparse GP regression and latent variable models
that allows for an efficient distributed algorithm. This is done by
exploiting the decoupling of the data given the inducing points to
re-formulate the evidence lower bound in a Map-Reduce setting. We show that
the inference scales well with data and computational resources, while
preserving a balanced distribution of the load among the nodes. We further
demonstrate the utility in scaling Gaussian processes to big data. We show
that GP performance improves with increasing amounts of data in regression
(on flight data with 2 million records) and latent variable modelling (on
MNIST). The results show that GPs perform better than many common models
often used for big data.

Andrew McHutchon.
**Nonlinear modelling and
control using Gaussian processes**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2014.

** Abstract:** In many scientific disciplines it is
often required to make predictions about how a system will behave or to
deduce the correct control values to elicit a particular desired response.
Efficiently solving both of these tasks relies on the construction of a model
capturing the system's operation. In the most interesting situations, the
model needs to capture strongly nonlinear effects and deal with the presence
of uncertainty and noise. Building models for such systems purely based on a
theoretical understanding of underlying physical principles can be infeasibly
complex and require a large number of simplifying assumptions. An alternative
is to use a data-driven approach, which builds a model directly from
observations. A powerful and principled approach to doing this is to use a
Gaussian Process (GP).

In this thesis we start by discussing how GPs can
be applied to data sets which have noise affecting their inputs. We present
the "Noisy Input GP", which uses a simple local-linearisation to refer the
input noise into heteroscedastic output noise, and compare it to other
methods both theoretically and empirically. We show that this technique leads
to an effective model for nonlinear functions with input and output noise. We
then consider the broad topic of GP state space models for application to
dynamical systems. We discuss a very wide variety of approaches for using GPs
in state space models, including introducing a new method based on
moment-matching, which consistently gave the best performance. We analyse the
methods in some detail including providing a systematic comparison between
approximate-analytic and particle methods. To our knowledge such a comparison
has not been provided before in this area. Finally, we investigate an
automatic control learning framework, which uses Gaussian Processes to model
a system for which we wish to design a controller. Controller design for
complex systems is a difficult task and thus a framework which allows an
automatic design directly from data promises to be extremely useful. We
demonstrate that the previously published framework cannot cope with the
presence of observation noise but that the introduction of a state space
model dramatically improves its performance. This contribution, along with
some other suggested improvements opens the door for this framework to be
used in real-world applications.

Amar Shah, Andrew Gordon Wilson, and Zoubin Ghahramani.
**Student-t
processes as alternatives to Gaussian processes**.
In *AISTATS*, JMLR Proceedings. JMLR.org, 2014.

** Abstract:** We investigate the Student-t process as an alternative to the
Gaussian process as a nonparametric prior over functions. We derive closed
form expressions for the marginal likelihood and predictive distribution of a
Student-t process, by integrating away an inverse Wishart process prior over
the covariance kernel of a Gaussian process model. We show surprising
equivalences between different hierarchical Gaussian process models leading
to Student-t processes, and derive a new sampling scheme for the inverse
Wishart process, which helps elucidate these equivalences. Overall, we show
that a Student-t process can retain the attractive properties of a Gaussian
process - a nonparametric representation, analytic marginal and predictive
distributions, and easy model selection through covariance kernels - but has
enhanced flexibility, and predictive covariances that, unlike a Gaussian
process, explicitly depend on the values of training observations. We verify
empirically that a Student-t process is especially useful in situations where
there are changes in covariance structure, or in applications like Bayesian
optimization, where accurate predictive covariances are critical for good
performance. These advantages come at no additional computational cost over
Gaussian processes.

Andrew Gordon Wilson.
**Covariance
Kernels for Fast Automatic Pattern Discovery and Extrapolation with
Gaussian Processes**.
PhD thesis, University of Cambridge, Cambridge, UK, 2014.

** Abstract:** Truly intelligent systems are capable of pattern discovery and
extrapolation without human intervention. Bayesian nonparametric models,
which can uniquely represent expressive prior information and detailed
inductive biases, provide a distinct opportunity to develop intelligent
systems, with applications in essentially any learning and prediction
task.

Gaussian processes are rich distributions over functions, which
provide a Bayesian nonparametric approach to smoothing and interpolation. A
covariance kernel determines the support and inductive biases of a Gaussian
process. In this thesis, we introduce new covariance kernels to enable fast
automatic pattern discovery and extrapolation with Gaussian processes.

In
the introductory chapter, we discuss the high level principles behind all of
the models in this thesis: 1) we can typically improve the predictive
performance of a model by accounting for additional structure in data; 2) to
automatically discover rich structure in data, a model must have large
support and the appropriate inductive biases; 3) we most need expressive
models for large datasets, which typically provide more information for
learning structure, and 4) we can often exploit the existing inductive biases
(assumptions) or structure of a model for scalable inference, without the
need for simplifying assumptions.

In the context of this introduction, we
then discuss, in chapter 2, Gaussian processes as kernel machines, and my
views on the future of Gaussian process research.

In chapter 3 we
introduce the Gaussian process regression network (GPRN) framework, a
multi-output Gaussian process method which scales to many output variables,
and accounts for input-dependent correlations between the outputs. Underlying
the GPRN is a highly expressive kernel, formed using an adaptive mixture of
latent basis functions in a neural network like architecture. The GPRN is
capable of discovering expressive structure in data. We use the GPRN to model
the time-varying expression levels of 1000 genes, the spatially varying
concentrations of several distinct heavy metals, and multivariate volatility
(input dependent noise covariances) between returns on equity indices and
currency exchanges, which is particularly valuable for portfolio allocation.
We generalise the GPRN to an adaptive network framework, which does not
depend on Gaussian processes or Bayesian nonparametrics; and we outline
applications for the adaptive network in nuclear magnetic resonance (NMR)
spectroscopy, ensemble learning, and change-point modelling.

In chapter 4
we introduce simple closed form kernels for automatic pattern discovery and
extrapolation. These spectral mixture (SM) kernels are derived by modelling
the spectral density of a kernel (its Fourier transform) using a
scale-location Gaussian mixture. SM kernels form a basis for all stationary
covariances, and can be used as a drop-in replacement for standard kernels,
as they retain simple and exact learning and inference procedures. We use the
SM kernel to discover patterns and perform long range extrapolation on
atmospheric CO2 trends and airline passenger data, as well as on synthetic
examples. We also show that the SM kernel can be used to automatically
reconstruct several standard covariances. The SM kernel and the GPRN are
highly complementary; we show that using the SM kernel with adaptive basis
functions in a GPRN induces an expressive prior over non-stationary
kernels.

In chapter 5 we introduce GPatt, a method for fast
multidimensional pattern extrapolation, particularly suited to image and movie
data. Without human intervention - no hand crafting of kernel features, and
no sophisticated initialisation procedures - we show that GPatt can solve
large scale pattern extrapolation, inpainting and kernel discovery problems,
including a problem with 383,400 training points. GPatt exploits the
structure of a spectral mixture product (SMP) kernel, for fast yet exact
inference procedures. We find that GPatt significantly outperforms popular
alternative scalable Gaussian process methods in speed and accuracy.
Moreover, we discover profound differences between each of these methods,
suggesting expressive kernels, nonparametric representations, and scalable
inference which exploits existing model structure are useful in combination
for modelling large scale multidimensional patterns.

The models in this
dissertation have proven to be scalable, with greatly enhanced predictive
performance over the alternatives: the extra structure being modelled is an
important part of a wide variety of real data - including problems in
econometrics, gene expression, geostatistics, nuclear magnetic resonance
spectroscopy, ensemble learning, multi-output regression, change point
modelling, time series, multivariate volatility, image inpainting, texture
extrapolation, video extrapolation, acoustic modelling, and kernel
discovery.

Andrew Gordon Wilson, Yuting Wu, Daniel J. Holland, Sebastian Nowozin,
Mick D. Mantle, Lynn F. Gladden, and Andrew Blake.
**Bayesian inference for NMR
spectroscopy with applications to chemical quantification**.
*arXiv preprint arXiv:1402.3580*, 2014.

** Abstract:**
Nuclear magnetic resonance (NMR) spectroscopy exploits the magnetic
properties of atomic nuclei to discover the structure, reaction state and
chemical environment of molecules. We propose a probabilistic generative
model and inference procedures for NMR spectroscopy. Specifically, we use a
weighted sum of trigonometric functions undergoing exponential decay to model
free induction decay (FID) signals. We discuss the challenges in estimating
the components of this general model - amplitudes, phase shifts,
frequencies, decay rates, and noise variances - and offer practical
solutions. We compare with conventional Fourier transform spectroscopy for
estimating the relative concentrations of chemicals in a mixture, using
synthetic and experimentally acquired FID signals. We find the proposed model
is particularly robust to low signal to noise ratios (SNR), and overlapping
peaks in the Fourier transform of the FID, enabling accurate predictions
(e.g., 1% error at low SNR) which are not possible with conventional
spectroscopy (5% error).

José Miguel Hernández-Lobato, James Robert Lloyd, and Daniel
Hernández-Lobato.
**Gaussian process
conditional copulas with applications to financial time series**.
In *Advances in Neural Information Processing Systems 26*, pages 1-9,
Lake Tahoe, California, USA, December 2013.

** Abstract:** The
estimation of dependencies between multiple variables is a central problem in
the analysis of financial time series. A common approach is to express these
dependencies in terms of a copula function. Typically the copula function is
assumed to be constant but this may be inaccurate when there are covariates
that could have a large influence on the dependence structure of the data. To
account for this, a Bayesian framework for the estimation of conditional
copulas is proposed. In this framework the parameters of a copula are
non-linearly related to some arbitrary conditioning variables. We evaluate
the ability of our method to predict time-varying dependencies on several
equities and currencies and observe consistent performance gains compared to
static copula models and other time-varying copula methods.

David Lopez-Paz, Philipp Hennig, and Bernhard Schölkopf.
**The randomized dependence
coefficient**.
In *Advances in Neural Information Processing Systems 26*, pages 1-9,
Lake Tahoe, California, USA, December 2013.

** Abstract:** We
introduce the Randomized Dependence Coefficient (RDC), a measure of
non-linear dependence between random variables of arbitrary dimension based
on the Hirschfeld-Gebelein-Rényi Maximum Correlation Coefficient. RDC
is defined in terms of correlation of random non-linear copula projections;
it is invariant with respect to marginal distribution transformations, has
low computational cost and is easy to implement: just five lines of R code,
included at the end of the paper.
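
As a rough Python counterpart to the R code the abstract mentions (this is not the authors' code), the sketch below follows the steps named there: a rank-based copula transform of each variable, random non-linear (sine) projections, and the largest canonical correlation between the two sets of projected features. The helper names, the defaults `k=20` and `s=1/6`, the seeding, and the SVD-based canonical correlation with a relative tolerance `rel_tol` are assumptions made here for illustration.

```python
import numpy as np


def _copula_features(x, k, s, rng):
    """Rank-transform each column, then apply k random sine projections."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n = x.shape[0]
    # Empirical copula transform: normalised ranks in (0, 1] per column.
    ranks = (np.argsort(np.argsort(x, axis=0), axis=0) + 1) / n
    # A constant column lets the random projection include a bias term.
    aug = np.hstack([ranks, np.ones((n, 1))])
    weights = rng.standard_normal((aug.shape[1], k))
    return np.sin((s / aug.shape[1]) * aug @ weights)


def _orthonormal_basis(features, rel_tol=1e-6):
    """Orthonormal basis of the centred feature span, dropping tiny directions."""
    centred = features - features.mean(axis=0)
    u, sing, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, sing > rel_tol * sing[0]]


def rdc(x, y, k=20, s=1 / 6, seed=0):
    """Largest canonical correlation between random copula features of x and y."""
    rng = np.random.default_rng(seed)
    bx = _orthonormal_basis(_copula_features(x, k, s, rng))
    by = _orthonormal_basis(_copula_features(y, k, s, rng))
    # Singular values of bx^T by are the canonical correlations.
    return float(np.linalg.svd(bx.T @ by, compute_uv=False)[0])
```

Because only ranks enter the computation, applying a strictly increasing transformation to `x` or `y` leaves the result unchanged, which is the invariance to marginal transformations that the abstract describes.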

Tomoharu Iwata, David Duvenaud, and Zoubin Ghahramani.
**Warped mixtures for nonparametric
cluster shapes**.
In *29th Conference on Uncertainty in Artificial Intelligence*,
Bellevue, Washington, July 2013.

** Abstract:** A mixture of
Gaussians fit to a single curved or heavy-tailed cluster will report that the
data contains many clusters. To produce more appropriate clusterings, we
introduce a model which warps a latent mixture of Gaussians to produce
nonparametric cluster shapes. The possibly low-dimensional latent mixture
model allows us to summarize the properties of the high-dimensional clusters
(or density manifolds) describing the data. The number of manifolds, as well
as the shape and dimension of each manifold is automatically inferred. We
derive a simple inference scheme for this model which analytically integrates
out both the mixture parameters and the warping function. We show that our
model is effective for density estimation, performs better than infinite
Gaussian mixture models at recovering the true number of clusters, and
produces interpretable summaries of high-dimensional datasets.

David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, and
Zoubin Ghahramani.
**Structure
discovery in nonparametric regression through compositional kernel
search**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Despite its
importance, choosing the structural form of the kernel in nonparametric
regression remains a black art. We define a space of kernel structures which
are built compositionally by adding and multiplying a small number of base
kernels. We present a method for searching over this space of structures
which mirrors the scientific discovery process. The learned structures can
often decompose functions into interpretable components and enable long-range
extrapolation on time-series datasets. Our structure search method
outperforms many widely used kernels and kernel combination methods on a
variety of prediction tasks.
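
To make the idea of building kernels compositionally concrete, here is a minimal sketch of a few base kernels on 1-D inputs together with the sum and product operators. The specific base kernels (squared exponential, periodic, linear), their parameterisation, and all function names are assumptions for illustration; the search procedure itself is not shown.

```python
import numpy as np


def se(lengthscale):
    """Squared-exponential base kernel."""
    return lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)


def periodic(period, lengthscale):
    """Periodic base kernel."""
    def k(a, b):
        d = np.abs(a[:, None] - b[None, :])
        return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale**2)
    return k


def linear():
    """Linear base kernel."""
    return lambda a, b: a[:, None] * b[None, :]


def add(k1, k2):
    """Sum of two kernels; the result is again a valid kernel."""
    return lambda a, b: k1(a, b) + k2(a, b)


def mul(k1, k2):
    """Product of two kernels; the result is again a valid kernel."""
    return lambda a, b: k1(a, b) * k2(a, b)


# Example structure: a smooth trend plus a periodic component whose
# amplitude grows linearly with the input.
composite = add(se(1.0), mul(periodic(1.0, 0.7), linear()))
```

Since sums and products of positive semi-definite kernels stay positive semi-definite, any expression built this way can be used directly as a Gaussian process covariance.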

David Lopez-Paz, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Gaussian
process vine copulas for multivariate dependence**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Copulas allow marginal distributions to be learned
separately from the multivariate dependence structure
(copula) that links them together into a density function. Vine
factorizations ease the learning of high-dimensional copulas by constructing
a hierarchy of conditional bivariate copulas. However, to simplify inference,
it is common to assume that each of these conditional bivariate copulas is
independent from its conditioning variables. In this paper, we relax this
assumption by discovering the latent functions that specify the shape of a
conditional copula given its conditioning variables. We learn these functions
by following a Bayesian approach based on sparse Gaussian processes with
expectation propagation for scalable, approximate inference. Experiments on
real-world datasets show that, when modeling all conditional dependencies, we
obtain better estimates of the underlying copula of the data.

Andrew Gordon Wilson and Ryan Prescott Adams.
**Gaussian
process kernels for pattern discovery and extrapolation**.
In *30th International Conference on Machine Learning*, February 18
2013.

** Abstract:** Gaussian processes are rich distributions
over functions, which provide a Bayesian nonparametric approach to smoothing
and interpolation. We introduce simple closed form kernels that can be used
with Gaussian processes to discover patterns and enable extrapolation. These
kernels are derived by modelling a spectral density - the Fourier transform
of a kernel - with a Gaussian mixture. The proposed kernels support a broad
class of stationary covariances, but Gaussian process inference remains
simple and analytic. We demonstrate the proposed kernels by discovering
patterns and performing long range extrapolation on synthetic examples, as
well as atmospheric CO2 trends and airline passenger data. We also show that
we can reconstruct standard covariances within our framework.

** Comment:** arXiv:1302.4245
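
For one-dimensional inputs, modelling the spectral density with a Gaussian mixture leads to a kernel that is a weighted sum of Gaussian envelopes multiplied by cosines. The sketch below evaluates that form; the function and parameter names are assumptions for illustration, with `weights`, `means` and `variances` standing for the mixture weights, the component frequencies and the component variances in the frequency domain.

```python
import numpy as np


def spectral_mixture_kernel(tau, weights, means, variances):
    """Evaluate k(tau) = sum_q w_q exp(-2 pi^2 tau^2 v_q) cos(2 pi tau mu_q).

    `tau` holds input differences x - x' (scalar or array); the remaining
    arguments are length-Q sequences describing the Gaussian mixture.
    """
    tau = np.asarray(tau, dtype=float)[..., None]
    w = np.asarray(weights, dtype=float)
    mu = np.asarray(means, dtype=float)
    v = np.asarray(variances, dtype=float)
    envelope = np.exp(-2.0 * np.pi**2 * tau**2 * v)
    return np.sum(w * envelope * np.cos(2.0 * np.pi * tau * mu), axis=-1)
```

With a single component centred at zero frequency and variance 1/(4 pi^2 l^2), the expression reduces to a squared-exponential kernel with lengthscale l, a simple instance of reconstructing a standard covariance.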

Sébastien Bratières, Novi Quadrianto, and Zoubin Ghahramani.
**Bayesian
structured prediction using Gaussian processes**.
*arXiv*, abs/1307.3846, 2013.

** Abstract:** We introduce
a conceptually novel structured prediction model, GPstruct, which is
kernelized, non-parametric and Bayesian, by design. We motivate the model
with respect to existing approaches, among others, conditional random fields
(CRFs), maximum margin Markov networks (M3N), and structured support vector
machines (SVMstruct), which embody only a subset of its properties. We
present an inference procedure based on Markov Chain Monte Carlo. The
framework can be instantiated for a wide range of structured objects such as
linear chains, trees, grids, and other general graphs. As a proof of concept,
the model is benchmarked on several natural language processing tasks and a
video gesture segmentation task involving a linear chain structure. We show
prediction accuracies for GPstruct which are comparable to or exceeding those
of CRFs and SVMstruct.

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl Edward
Rasmussen.
**Bayesian inference and learning in
Gaussian process state-space models with particle MCMC**.
In L. Bottou, C.J.C. Burges, Z. Ghahramani, M. Welling, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 26*.
Curran Associates, Inc., 2013.

** Abstract:** State-space
models are successfully used in many areas of science, engineering and
economics to model time series and dynamical systems. We present a fully
Bayesian approach to inference and learning in nonlinear nonparametric
state-space models. We place a Gaussian process prior over the transition
dynamics, resulting in a flexible model able to capture complex dynamical
phenomena. However, to enable efficient inference, we marginalize over the
dynamics of the model and instead infer directly the joint smoothing
distribution through the use of specially tailored Particle Markov Chain
Monte Carlo samplers. Once a sample from the smoothing distribution is
computed, the state transition predictive distribution can be formulated
analytically. We make use of sparse Gaussian process models to greatly reduce
the computational complexity of the approach.

Roger Frigola and Carl Edward Rasmussen.
**Integrated pre-processing for
Bayesian nonlinear system identification with Gaussian processes**.
In *Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on*,
2013.

** Abstract:** We introduce GP-FNARX: a new model for
nonlinear system identification based on a nonlinear autoregressive exogenous
model (NARX) with filtered regressors (F) where the nonlinear regression
problem is tackled using sparse Gaussian processes (GP). We integrate data
pre-processing with system identification into a fully automated procedure
that goes from raw data to an identified model. Both pre-processing
parameters and GP hyper-parameters are tuned by maximizing the marginal
likelihood of the probabilistic model. We obtain a Bayesian model of the
system's dynamics which is able to report its uncertainty in regions where
the data is scarce. The automated approach, the modeling of uncertainty and
its relatively low computational cost make GP-FNARX a good candidate for
applications in robotics and adaptive control.

E. Gilboa, Yunus Saatçi, and John P. Cunningham.
**Scaling
multidimensional Gaussian processes using projected additive
approximations**.
In *30th International Conference on Machine Learning*, 2013.

** Abstract:** Exact Gaussian Process (GP) regression has
O(N^{3}) runtime for data size N, making it intractable for large N.
Many algorithms for improving GP scaling approximate the covariance with
lower rank matrices. Other work has exploited structure inherent in
particular covariance functions, including GPs with implied Markov structure,
and equispaced inputs (both enable O(N) runtime). However, these GP advances
have not been extended to the multidimensional input setting, despite the
preponderance of multidimensional applications. This paper introduces and
tests novel extensions of structured GPs to multidimensional inputs. We
present new methods for additive GPs, showing a novel connection between the
classic backfitting method and the Bayesian framework. To achieve optimal
accuracy-complexity tradeoff, we extend this model with a novel variant of
projection pursuit regression. Our primary result – projection pursuit
Gaussian Process Regression – shows orders of magnitude speedup while
preserving high accuracy. The natural second and third steps include
non-Gaussian observations and higher dimensional equispaced grid methods. We
introduce novel techniques to address both of these necessary directions. We
thoroughly illustrate the power of these three advances on several datasets,
achieving close performance to the naive Full GP at orders of magnitude less
cost.

James Robert Lloyd.
**GEFCom2012
hierarchical load forecasting: Gradient boosting machines and Gaussian
processes**.
*International Journal of Forecasting*, 2013.

**
Abstract:** This report discusses methods for forecasting hourly loads of a
US utility as part of the load forecasting track of the Global Energy
Forecasting Competition 2012 hosted on Kaggle. The methods described
(gradient boosting machines and Gaussian processes) are generic machine
learning / regression algorithms and few domain specific adjustments were
made. Despite this, the algorithms were able to produce highly competitive
predictions and hopefully they can inspire more refined techniques to
compete with state-of-the-art load forecasting methodologies.

Andrew Gordon Wilson, Elad Gilboa, Arye Nehorai, and John P Cunningham.
**GPatt: Fast multidimensional
pattern extrapolation with Gaussian processes**.
*arXiv preprint arXiv:1310.5288*, 2013.

** Abstract:**
Gaussian processes are typically used for smoothing and interpolation on
small datasets. We introduce a new Bayesian nonparametric framework - GPatt
- enabling automatic pattern extrapolation with Gaussian processes on large
multidimensional datasets. GPatt unifies and extends highly expressive
kernels and fast exact inference techniques. Without human intervention - no
hand crafting of kernel features, and no sophisticated initialisation
procedures - we show that GPatt can solve large scale pattern extrapolation,
inpainting, and kernel discovery problems, including a problem with 383,400
training points. We find that GPatt significantly outperforms popular
alternative scalable Gaussian process methods in speed and accuracy.
Moreover, we discover profound differences between each of these methods,
suggesting expressive kernels, nonparametric representations, and scalable
inference which exploits model structure are useful in combination for
modelling large scale multidimensional patterns.

James Robert Lloyd, Peter Orbanz, Zoubin Ghahramani, and Daniel M. Roy.
**Random
function priors for exchangeable arrays with applications to graphs and
relational data**.
In *Advances in Neural Information Processing Systems 25*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:** A
fundamental problem in the analysis of structured relational data like
graphs, networks, databases, and matrices is to extract a summary of the
common structure underlying relations between individual entities. Relational
data are typically encoded in the form of arrays; invariance to the ordering
of rows and columns corresponds to exchangeable arrays. Results in
probability theory due to Aldous, Hoover and Kallenberg show that
exchangeable arrays can be represented in terms of a random measurable
function which constitutes the natural model parameter in a Bayesian model.
We obtain a flexible yet simple Bayesian nonparametric model by placing a
Gaussian process prior on the parameter function. Efficient inference
utilises elliptical slice sampling combined with a random sparse
approximation to the Gaussian process. We demonstrate applications of the
model to network data and clarify its relation to models in the literature,
several of which emerge as special cases.

Michael A. Osborne, David Duvenaud, Roman Garnett, Carl Edward Rasmussen,
Stephen J. Roberts, and Zoubin Ghahramani.
**Active
learning of model evidence using Bayesian quadrature**.
In *Advances in Neural Information Processing Systems 25*, pages 46-54,
Lake Tahoe, California, USA, December 2012.

** Abstract:**
Numerical integration is a key component of many problems in scientific
computing, statistical modelling, and machine learning. Bayesian Quadrature
is a model-based method for numerical integration which, relative to standard
Monte Carlo methods, offers increased sample efficiency and a more robust
estimate of the uncertainty in the estimated integral. We propose a novel
Bayesian Quadrature approach for numerical integration when the integrand is
non-negative, such as the case of computing the marginal likelihood,
predictive distribution, or normalising constant of a probabilistic model.
Our approach approximately marginalises the quadrature model’s
hyperparameters in closed form, and introduces an active learning scheme to
optimally select function evaluations, as opposed to using Monte Carlo
samples. We demonstrate our method on both a number of synthetic benchmarks
and a real scientific problem from astronomy.

Ferenc Huszár and David Duvenaud.
**Optimally-weighted herding is
Bayesian quadrature**.
In *28th Conference on Uncertainty in Artificial Intelligence*, pages
377-385, Catalina Island, California, July 2012.

**
Abstract:** Herding and kernel herding are deterministic methods of
choosing samples which summarise a probability distribution. A related task
is choosing samples for estimating integrals using Bayesian quadrature. We
show that the criterion minimised when selecting samples in kernel herding is
equivalent to the posterior variance in Bayesian quadrature. We then show
that sequential Bayesian quadrature can be viewed as a weighted version of
kernel herding which achieves performance superior to any other weighted
herding method. We demonstrate empirically a rate of convergence faster than
O(1/N). Our results also imply an upper bound on the empirical error of the
Bayesian quadrature estimate.
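
The link between sample selection and posterior variance described in this abstract can be illustrated with a small sketch. It is not the authors' code: it assumes a one-dimensional squared-exponential kernel and a zero-mean Gaussian integration measure, chosen only because the kernel mean then has a closed form. The names `bq_weights`, `lengthscale`, `prior_std` and `jitter` are hypothetical.

```python
import numpy as np

def bq_weights(nodes, lengthscale=1.0, prior_std=1.0, jitter=1e-10):
    """Bayesian quadrature weights and posterior variance of the integral.

    Assumes k(x, x') = exp(-(x - x')^2 / (2 l^2)) and p(x) = N(0, s^2).
    """
    l2, s2 = lengthscale ** 2, prior_std ** 2
    diff = nodes[:, None] - nodes[None, :]
    K = np.exp(-0.5 * diff ** 2 / l2)
    # Kernel mean z(x) = int k(x, x') p(x') dx', closed form for this pair.
    z = np.sqrt(l2 / (l2 + s2)) * np.exp(-0.5 * nodes ** 2 / (l2 + s2))
    # Weights w = K^{-1} z; the integral estimate is w @ f(nodes).
    w = np.linalg.solve(K + jitter * np.eye(len(nodes)), z)
    # Prior variance of the integral: double integral of k against p.
    prior_var = np.sqrt(l2 / (l2 + 2.0 * s2))
    post_var = prior_var - z @ w
    return w, post_var

nodes = np.linspace(-3.0, 3.0, 9)
weights, variance = bq_weights(nodes)
estimate = weights @ nodes ** 2  # estimate of E[x^2] under N(0, 1)
```

The quantity `post_var` depends only on the node locations, not on the function values, which is why it can serve as a selection criterion for the next sample.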

Dino Sejdinovic, Heiko Strathmann, Maria Lomeli, Christophe Andrieu, and Arthur
Gretton.
**Kernel adaptive
Metropolis-Hastings**.
In *31st International Conference on Machine Learning*, pages 1-9,
Beijing, China, June 2012.

** Abstract:** A Kernel Adaptive
Metropolis-Hastings algorithm is introduced, for the purpose of sampling
from a target distribution with strongly nonlinear support. The algorithm
embeds the trajectory of the Markov chain into a reproducing kernel
Hilbert space (RKHS), such that the feature space covariance of the samples
informs the choice of proposal. The procedure is computationally efficient
and straightforward to implement, since the RKHS moves can be integrated
out analytically: our proposal distribution in the original space is a
normal distribution whose mean and covariance depend on where the current
sample lies in the support of the target distribution, and adapts to its
local covariance structure. Furthermore, the procedure requires neither
gradients nor any other higher order information about the target, making
it particularly attractive for contexts such as Pseudo-Marginal MCMC.
Kernel Adaptive Metropolis-Hastings outperforms competing fixed and adaptive
samplers on multivariate, highly nonlinear target distributions, arising
in both real-world and synthetic examples.

Andrew Gordon Wilson, David A. Knowles, and Zoubin Ghahramani.
**Gaussian process regression
networks**.
In *29th International Conference on Machine Learning*, Edinburgh,
Scotland, June 2012.

** Abstract:** We introduce a new
regression framework, Gaussian process regression networks (GPRN), which
combines the structural properties of Bayesian neural networks with the
nonparametric flexibility of Gaussian processes. GPRN accommodates input
(predictor) dependent signal and noise correlations between multiple output
(response) variables, input dependent length-scales and amplitudes, and
heavy-tailed predictive distributions. We derive both elliptical slice
sampling and variational Bayes inference procedures for GPRN. We apply GPRN
as a multiple output regression and multivariate volatility model,
demonstrating substantially improved performance over eight popular multiple
output (multi-task) Gaussian process models and three multivariate volatility
models on real datasets, including a 1000 dimensional gene expression
dataset.

Richard E. Turner and Maneesh Sahani.
**Decomposing
signals into a sum of amplitude and frequency modulated sinusoids using
probabilistic inference**.
In *Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE
International Conference on*, pages 2173-2176, March 2012, doi
10.1109/ICASSP.2012.6288343.

** Abstract:** There are many
methods for decomposing signals into a sum of amplitude and frequency
modulated sinusoids. In this paper we take a new estimation based approach.
Identifying the problem as ill-posed, we show how to regularize the solution
by imposing soft constraints on the amplitude and phase variables of the
sinusoids. Estimation proceeds using a version of Kalman smoothing. We
evaluate the method on synthetic and natural, clean and noisy signals,
showing that it outperforms previous decompositions, but at a higher
computational cost.

John P. Cunningham, Zoubin Ghahramani, and Carl Edward Rasmussen.
**Gaussian
Processes for time-marked time-series data**.
In *15th International Conference on Artificial Intelligence and
Statistics*, pages 255-263, 2012.

** Abstract:** In many
settings, data is collected as multiple time series, where each recorded time
series is an observation of some underlying dynamical process of interest.
These observations are often time-marked with known event times, and one
desires to do a range of standard analyses. When there is only one time
marker, one simply aligns the observations temporally on that marker. When
multiple time-markers are present and are at different times on different
time series observations, these analyses are more difficult. We describe a
Gaussian Process model for analyzing multiple time series with multiple time
markings, and we test it on a variety of data.

Marc Peter Deisenroth, Ryan D. Turner, Marco F. Huber, Uwe D. Hanebeck, and
Carl Edward Rasmussen.
**Robust
filtering and smoothing with Gaussian processes**.
*IEEE Transactions on Automatic Control*, 57(7):1865-1871, 2012, doi
10.1109/TAC.2011.2179426.

** Abstract:** We propose a
principled algorithm for robust Bayesian filtering and smoothing in nonlinear
stochastic dynamic systems when both the transition function and the
measurement function are described by nonparametric Gaussian process (GP)
models. GPs are gaining increasing importance in signal processing, machine
learning, robotics, and control for representing unknown system functions by
posterior probability distributions. This modern way of "system
identification" is more robust than finding point estimates of a parametric
function representation. Our principled filtering/smoothing approach for GP
dynamic systems is based on analytic moment matching in the context of the
forward-backward algorithm. Our numerical evaluations demonstrate the
robustness of the proposed approach in situations where other
state-of-the-art Gaussian filters and smoothers can fail.

Neil Houlsby, Jose Miguel Hernández-Lobato, Ferenc Huszár, and Zoubin
Ghahramani.
**Collaborative
Gaussian processes for preference learning**.
In *Advances in Neural Information Processing Systems 25*, pages
2096-2104. Curran Associates, Inc., 2012.

** Abstract:** We
present a new model based on Gaussian processes (GPs) for learning pairwise
preferences expressed by multiple users. Inference is simplified by using a
*preference kernel* for GPs which allows us to combine supervised GP
learning of user preferences with unsupervised dimensionality reduction for
multi-user systems. The model not only exploits collaborative information
from the shared structure in user behavior, but may also incorporate user
features if they are available. Approximate inference is implemented using a
combination of expectation propagation and variational Bayes. Finally, we
present an efficient active learning strategy for querying preferences. The
proposed technique performs favorably on real-world data against
state-of-the-art multi-user preference learning algorithms.

Joseph Hall, Carl Edward Rasmussen, and Jan Maciejowski.
**Modelling and
control of nonlinear systems using Gaussian processes with partial model
information**.
In *51st IEEE Conference on Decision and Control*, 2012.

**
Abstract:** Gaussian processes are gaining increasing popularity among the
control community, in particular for the modelling of discrete time state
space systems. However, it has not been clear how to incorporate model
information, in the form of known state relationships, when using a Gaussian
process as a predictive model. An obvious example of known prior information
is position and velocity related states. Incorporation of such information
would be beneficial both computationally and for faster dynamics learning.
This paper introduces a method of achieving this, yielding faster dynamics
learning and a reduction in computational effort from O(Dn^{2}) to
O((D-F)n^{2}) in the prediction stage for a system with D states, F
known state relationships and n observations. The effectiveness of the method
is demonstrated through its inclusion in the PILCO learning algorithm with
application to the swing-up and balance of a torque-limited pendulum and the
balancing of a robotic unicycle in simulation.

Ryan D. Turner and Carl Edward Rasmussen.
**Model based learning
of sigma points in unscented Kalman filtering**.
*Neurocomputing*, 80:47-53, 2012, doi
10.1016/j.neucom.2011.07.029.

** Abstract:** The unscented
Kalman filter (UKF) is a widely used method in control and time series
applications. The UKF suffers from arbitrary parameters necessary for sigma
point placement, potentially causing it to perform poorly in nonlinear
problems. We show how to treat sigma point placement in a UKF as a learning
problem in a model based view. We demonstrate that learning to place the
sigma points correctly from data can make sigma point collapse much less
likely. Learning can result in a significant increase in predictive
performance over default settings of the parameters in the UKF and other
filters designed to avoid the problems of the UKF, such as the GP-ADF. At the
same time, we maintain a lower computational complexity than the other
methods. We call our method UKF-L.
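
For readers unfamiliar with the parameters this abstract refers to, the following sketch shows the standard scaled unscented transform, in which `alpha`, `beta` and `kappa` control sigma point placement and weighting. It illustrates the conventional construction only and is not the UKF-L learning procedure. The function names are hypothetical.

```python
import numpy as np

def sigma_points(mean, cov, alpha=1.0, beta=0.0, kappa=1.0):
    """Scaled unscented transform: 2n+1 sigma points with mean/cov weights."""
    n = mean.shape[0]
    lam = alpha ** 2 * (n + kappa) - n
    # Columns of L are the scaled square-root directions of the covariance.
    L = np.linalg.cholesky((n + lam) * cov)
    points = np.vstack([mean, mean + L.T, mean - L.T])
    wm = np.full(2 * n + 1, 0.5 / (n + lam))
    wc = wm.copy()
    wm[0] = lam / (n + lam)
    wc[0] = lam / (n + lam) + (1.0 - alpha ** 2 + beta)
    return points, wm, wc

def unscented_transform(f, mean, cov, **params):
    """Propagate a Gaussian through f by moment matching on sigma points."""
    points, wm, wc = sigma_points(mean, cov, **params)
    Y = np.array([f(p) for p in points])
    y_mean = wm @ Y
    diff = Y - y_mean
    y_cov = (wc[:, None] * diff).T @ diff
    return y_mean, y_cov

m = np.array([1.0, -2.0])
P = np.array([[2.0, 0.3], [0.3, 0.5]])
y_mean, y_cov = unscented_transform(lambda x: np.sin(x), m, P)
```

Different choices of `alpha`, `beta` and `kappa` move the points closer to or further from the mean, which changes the moment estimates for nonlinear `f`; these are the settings the abstract proposes to learn from data.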

Andrew Gordon Wilson and Zoubin Ghahramani.
**Modelling input
varying correlations between multiple responses**.
In Peter A. Flach, Tijl De Bie, and Nello Cristianini, editors,
*ECML/PKDD*, volume 7524 of *Lecture Notes in Computer
Science*, pages 858-861. Springer, 2012.

** Abstract:**
We introduced a generalised Wishart process (GWP) for modelling input
dependent covariance matrices Σ(x), allowing one to model input varying
correlations and uncertainties between multiple response variables. The GWP
can naturally scale to thousands of response variables, as opposed to
competing multivariate volatility models which are typically intractable for
greater than 5 response variables. The GWP can also naturally capture a rich
class of covariance dynamics, such as periodicity, Brownian motion, and
smoothness, through a covariance kernel.

Andrew Gordon Wilson, David A Knowles, and Zoubin Ghahramani.
**Gaussian process
regression networks**.
Technical Report arXiv:1110.4411 [stat.ML], Department of Engineering,
University of Cambridge, Cambridge, UK, October 19 2011.

**
Abstract:** We introduce a new regression framework, Gaussian process
regression networks (GPRN), which combines the structural properties of
Bayesian neural networks with the non-parametric flexibility of Gaussian
processes. This model accommodates input dependent signal and noise
correlations between multiple response variables, input dependent
length-scales and amplitudes, and heavy-tailed predictive distributions. We
derive both efficient Markov chain Monte Carlo and variational Bayes
inference procedures for this model. We apply GPRN as a multiple output
regression and multivariate volatility model, demonstrating substantially
improved performance over eight popular multiple output (multi-task) Gaussian
process models and three multivariate volatility models on benchmark
datasets, including a 1000 dimensional gene expression dataset.

** Comment:** arXiv:1110.4411

Simon Lacoste-Julien, Ferenc Huszár, and Zoubin Ghahramani.
**Approximate
inference for the loss-calibrated Bayesian**.
In Geoff Gordon and David Dunson, editors, *14th International Conference on
Artificial Intelligence and Statistics*, volume 15, pages 416-424,
Fort Lauderdale, FL, USA, April 2011. Journal of Machine Learning Research.

** Abstract:** We consider the problem of approximate inference
in the context of Bayesian decision theory. Traditional approaches focus on
approximating general properties of the posterior, ignoring the decision task
- and associated losses - for which the posterior could be used. We argue
that this can be suboptimal and propose instead to *loss-calibrate* the
approximate inference methods with respect to the decision task at hand. We
present a general framework rooted in Bayesian decision theory to analyze
approximate inference from the perspective of losses, opening up several
research directions. As a first loss-calibrated approximate inference
attempt, we propose an EM-like algorithm on the Bayesian posterior risk and
show how it can improve a standard approach to Gaussian process
classification when losses are asymmetric.

David Duvenaud, Hannes Nickisch, and Carl Edward Rasmussen.
**Additive
Gaussian processes**.
In *Advances in Neural Information Processing Systems 24*, pages
226-234, Granada, Spain, 2011.

** Abstract:** We introduce a
Gaussian process model of functions which are additive. An additive function
is one which decomposes into a sum of low-dimensional functions, each
depending on only a subset of the input variables. Additive GPs generalize
both Generalized Additive Models, and the standard GP models which use
squared-exponential kernels. Hyperparameter learning in this model can be
seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an
expressive but tractable parameterization of the kernel function, which
allows efficient evaluation of all input interaction terms, whose number is
exponential in the input dimension. The additional structure discoverable by
this model results in increased interpretability, as well as state-of-the-art
predictive power in regression tasks.
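
One way to see how a sum over exponentially many interaction terms can be evaluated cheaply is through elementary symmetric polynomials. The sketch below assumes, as an illustration rather than a statement of the paper's exact parameterization, that the order-d additive kernel is the d-th elementary symmetric polynomial of per-dimension squared-exponential kernel values, weighted by a per-order variance. The names `elementary_symmetric`, `additive_kernel`, `lengthscales` and `order_variances` are hypothetical.

```python
import numpy as np

def elementary_symmetric(z):
    """Return [e_0, ..., e_D] of the values z_1..z_D in O(D^2) time."""
    D = len(z)
    e = np.zeros(D + 1)
    e[0] = 1.0
    for zi in z:
        # Update high orders first so each z_i is used at most once per term.
        for d in range(D, 0, -1):
            e[d] += zi * e[d - 1]
    return e

def additive_kernel(x, y, lengthscales, order_variances):
    """k(x, y) = sum_d var_d * e_d(k_1(x_1, y_1), ..., k_D(x_D, y_D))."""
    z = np.exp(-0.5 * ((x - y) / lengthscales) ** 2)
    e = elementary_symmetric(z)
    # order_variances[d - 1] weights all d-way interaction terms.
    return float(np.dot(order_variances, e[1:]))

x = np.array([0.1, 0.4, -1.0])
y = np.array([0.3, 0.0, -0.5])
k_xy = additive_kernel(x, y, np.ones(3), np.array([1.0, 0.5, 0.25]))
```

Setting all but the first order variance to zero gives a first-order additive model, while keeping only the last gives a product kernel over all dimensions.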

Marc Peter Deisenroth and Carl Edward Rasmussen.
**PILCO: A
model-based and data-efficient approach to policy search**.
In *28th International Conference on Machine Learning*, 2011.

** Abstract:** In this paper, we introduce PILCO, a practical,
data-efficient model-based policy search method. PILCO reduces model bias,
one of the key problems of model-based reinforcement learning, in a
principled way. By learning a probabilistic dynamics model and explicitly
incorporating model uncertainty into long-term planning, PILCO can cope with
very little data and facilitates learning from scratch in only a few trials.
Policy evaluation is performed in closed form using state-of-the-art
approximate inference. Furthermore, policy gradients are computed
analytically for policy improvement. We report unprecedented learning
efficiency on challenging and high-dimensional control tasks.

** Comment:** web
site

Andrew McHutchon and Carl Edward Rasmussen.
**Gaussian process
training with input noise**.
In J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 24*, pages
1341-1349, Granada, Spain, 2011. Curran Associates, Inc.

**
Abstract:** In standard Gaussian Process regression input locations are
assumed to be noise free. We present a simple yet effective GP model for
training on input points corrupted by i.i.d. Gaussian noise. To make
computations tractable we use a local linear expansion about each input
point. This allows the input noise to be recast as output noise proportional
to the squared gradient of the GP posterior mean. The input noise
hyperparameters are trained alongside other hyperparameters by the usual
method of maximisation of the marginal likelihood, and allow estimation of
the noise levels on each input dimension. Training uses an iterative scheme,
which alternates between optimising the hyperparameters and calculating the
posterior gradient. Analytic predictive moments can then be found for
Gaussian distributed test points. We compare our model to others over a range
of different regression problems and show that it improves over current
methods.
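
The central step of this abstract, recasting input noise as output noise proportional to the squared gradient of the posterior mean, can be sketched as follows for scalar inputs. This is a single pass with fixed hyperparameters, not the iterative training scheme the abstract describes, and the squared-exponential kernel and all function names are assumptions made for illustration.

```python
import numpy as np

def rbf(a, b, lengthscale):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def posterior_mean_slope(x_star, X, y, lengthscale, noise_var):
    """Derivative of the GP posterior mean with respect to the test input."""
    K = rbf(X, X, lengthscale) + noise_var * np.eye(len(X))
    alpha = np.linalg.solve(K, y)
    Ks = rbf(x_star, X, lengthscale)
    # d/dx* of the squared-exponential kernel k(x*, X).
    dKs = -(x_star[:, None] - X[None, :]) / lengthscale ** 2 * Ks
    return dKs @ alpha

def effective_noise_var(slope, output_noise_var, input_noise_var):
    """Local linearisation: input noise scaled by the squared slope."""
    return output_noise_var + slope ** 2 * input_noise_var

X = np.linspace(-3.0, 3.0, 25)
y = np.sin(X)
slope = posterior_mean_slope(X, X, y, lengthscale=1.0, noise_var=1e-4)
per_point_noise = effective_noise_var(slope, 0.01, 0.04)
```

Points where the fitted function is steep receive a larger effective noise variance, since a small error in the input there produces a large error in the output.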

Yunus Saatçi.
**Scalable inference for
structured Gaussian process models**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2011.

** Abstract:** The generic inference and learning
algorithm for Gaussian Process (GP) regression has O(N^{3}) runtime
and O(N^{2}) memory complexity, where N is the number of observations
in the dataset. Given the computational resources available to a present-day
workstation, this implies that GP regression simply *cannot be run* on
large datasets. The need to use non-Gaussian likelihood functions for tasks
such as classification adds even more to the computational burden
involved.

The majority of algorithms designed to improve the scaling of
GPs are founded on the idea of approximating the true covariance matrix,
which is usually of rank N, with a matrix of rank P, where P<<N.
Typically, the true training set is replaced with a smaller, representative
(pseudo-) training set such that a specific measure of information loss is
minimized. These algorithms typically attain O(P^{2}N) runtime and
O(PN) space complexity. They are also general in the sense that they are
designed to work with *any* covariance function. In essence, they
trade off accuracy with computational complexity. The central contribution of
this thesis is to improve scaling instead by exploiting any structure that is
present in the covariance matrices generated by *particular*
covariance functions. Instead of settling for a kernel-independent
accuracy/complexity trade off, as is done in much of the literature, we often
obtain accuracies close to, or exactly equal to the full GP model at a
fraction of the computational cost.

We define a *structured* GP as
any GP model that is endowed with a kernel which produces structured
covariance matrices. A trivial example of a structured GP is one with the
linear regression kernel. In this case, given inputs living in R^{D},
the covariance matrices generated have rank D - this results in significant
computational gains in the usual case where D<<N. Another case arises
when a stationary kernel is evaluated on equispaced, scalar inputs. This
results in *Toeplitz* covariance matrices and all necessary
computations can be carried out exactly in O(N log N).

This thesis
studies four more types of structured GP. First, we comprehensively review
the case of kernels corresponding to *Gauss-Markov* processes
evaluated on scalar inputs. Using state-space models we show how
(generalised) regression (including hyperparameter learning) can be performed
in O(N log N) runtime and O(N) space. Secondly, we study the case where we
introduce block structure into the covariance matrix of a GP time-series
model by assuming a particular form of nonstationarity a priori. Third, we
extend the efficiency of scalar Gauss-Markov processes to higher-dimensional
input spaces by assuming *additivity*. We illustrate the connections
between the classical backfitting algorithm and approximate Bayesian
inference techniques including Gibbs sampling and variational Bayes. We also
show that it is possible to relax the rather strong assumption of additivity
without sacrificing O(N log N) complexity, by means of a projection-pursuit
style GP regression model. Finally, we study the properties of a GP model
with a tensor product kernel evaluated on a multivariate grid of input
locations. We show that for an *arbitrary* (regular or irregular) grid
the resulting covariance matrices are Kronecker and full GP regression can be
implemented in O(N) time and memory usage.

We illustrate the power of
these methods on several real-world regression datasets which satisfy the
assumptions inherent in the structured GP employed. In many cases we obtain
performance comparable to the generic GP algorithm. We also analyse the
performance degradation when these assumptions are not met, and in several
cases show that it is comparable to that observed for sparse GP methods. We
provide similar results for regression tasks with non-Gaussian likelihoods,
an extension rarely addressed by sparse GP techniques.
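
The Kronecker structure mentioned in this abstract for tensor product kernels on grids can be illustrated with a small two-dimensional sketch. It shows only the underlying identity, solving with the full covariance matrix using per-dimension eigendecompositions and never forming the Kronecker product; it is not the thesis's algorithm and makes no claim about matching its stated complexity. The function names and the choice of a squared-exponential kernel are assumptions.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def kron_gp_solve(K1, K2, Y, noise_var):
    """Solve (kron(K1, K2) + noise_var * I) vec(A) = vec(Y).

    Y has shape (n1, n2) and vec is row-major, matching np.kron with ravel().
    """
    w1, Q1 = np.linalg.eigh(K1)
    w2, Q2 = np.linalg.eigh(K2)
    # Rotate into the joint eigenbasis, scale by eigenvalues, rotate back.
    T = Q1.T @ Y @ Q2
    T = T / (np.outer(w1, w2) + noise_var)
    return Q1 @ T @ Q2.T

# The grid may be irregular along each axis.
x1 = np.linspace(0.0, 1.0, 6)
x2 = np.array([0.0, 0.1, 0.5, 0.55, 1.3])
K1, K2 = rbf(x1, x1), rbf(x2, x2)
Y = np.random.default_rng(0).normal(size=(6, 5))
A = kron_gp_solve(K1, K2, Y, noise_var=0.1)
```

Only the small per-dimension matrices are decomposed, so memory stays proportional to the grid size rather than its square.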

Ryan Darby Turner.
**Gaussian processes for
state space models and change point detection**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2011.

** Abstract:** This thesis details several applications
of Gaussian processes (GPs) for enhanced time series modeling. We first cover
different approaches for using Gaussian processes in time series problems.
These are extended to the state space approach to time series in two
different problems. We also combine Gaussian processes and Bayesian online
change point detection (BOCPD) to increase the generality of the Gaussian
process time series methods. These methodologies are evaluated on predictive
performance on six real world data sets, which include three environmental
data sets, one financial, one biological, and one from industrial well
drilling.

Gaussian processes are capable of generalizing standard linear
time series models. We cover two approaches: the Gaussian process time series
model (GPTS) and the autoregressive Gaussian process (ARGP). We cover a
variety of methods that greatly reduce the computational and memory
complexity of Gaussian process approaches, which are generally cubic in
computational complexity.

Two different improvements to state space based
approaches are covered. First, Gaussian process inference and learning (GPIL)
generalizes linear dynamical systems (LDS), on which the Kalman filter is
based, to general nonlinear systems for nonparametric system identification.
Second, we address pathologies in the unscented Kalman filter (UKF). We use
Gaussian process optimization (GPO) to learn UKF settings that minimize the
potential for sigma point collapse.

We show how to embed the mentioned
Gaussian process approaches to time series into a change point framework. Old
data, from an old regime, that hinders predictive performance is
automatically and elegantly phased out. The computational improvements for
Gaussian process time series approaches are of even greater use in the change
point framework. We also present a supervised framework learning a change
point model when change point labels are available in training.

These
mentioned methodologies significantly improve predictive performance on the
diverse set of data sets selected.

Richard E. Turner and Maneesh Sahani.
**Demodulation
as probabilistic inference**.
*Transactions on Audio, Speech and Language Processing*, 19:2398-2411,
2011.

** Abstract:** Demodulation is an ill-posed problem
whenever both carrier and envelope signals are broadband and unknown. Here,
we approach this problem using the methods of probabilistic inference. The
new approach, called Probabilistic Amplitude Demodulation (PAD), is
computationally challenging but improves on existing methods in a number of
ways. By contrast to previous approaches to demodulation, it satisfies five
key desiderata: PAD has soft constraints because it is probabilistic; PAD is
able to automatically adjust to the signal because it learns parameters; PAD
is user-steerable because the solution can be shaped by user-specific prior
information; PAD is robust to broad-band noise because this is modelled
explicitly; and PAD’s solution is self-consistent, empirically satisfying a
Carrier Identity property. Furthermore, the probabilistic view naturally
encompasses noise and uncertainty, allowing PAD to cope with missing data and
return error bars on carrier and envelope estimates. Finally, we show that
when PAD is applied to a bandpass-filtered signal, the stop-band energy of
the inferred carrier is minimal, making PAD well-suited to sub-band
demodulation.

Richard E. Turner and Maneesh Sahani.
**Probabilistic
amplitude and frequency demodulation**.
In *Advances in Neural Information Processing Systems 24*, pages
981-989. The MIT Press, 2011.

** Abstract:** A number of
recent scientific and engineering problems require signals to be decomposed
into a product of a slowly varying positive envelope and a quickly varying
carrier whose instantaneous frequency also varies slowly over time. Although
signal processing provides algorithms for so-called amplitude- and
frequency-demodulation (AFD), there are well known problems with all of the
existing methods. Motivated by the fact that AFD is ill-posed, we approach
the problem using probabilistic inference. The new approach, called
probabilistic amplitude and frequency demodulation (PAFD), models
instantaneous frequency using an auto-regressive generalization of the von
Mises distribution, and the envelopes using Gaussian auto-regressive dynamics
with a positivity constraint. A novel form of expectation propagation is used
for inference. We demonstrate that although PAFD is computationally
demanding, it outperforms previous approaches on synthetic and real signals
in clean, noisy and missing data settings.

Andrew Gordon Wilson and Zoubin Ghahramani.
**Generalised
Wishart processes**.
In *27th Conference on Uncertainty in Artificial Intelligence*, 2011.

** Abstract:** We introduce a new stochastic process called the
generalised Wishart process (GWP). It is a collection of positive
semi-definite random matrices indexed by any arbitrary input variable. We use
this process as a prior over dynamic (e.g. time varying) covariance matrices.
The GWP captures a diverse class of covariance dynamics, naturally handles
missing data, scales nicely with dimension, has easily interpretable
parameters, and can use input variables that include covariates other than
time. We describe how to construct the GWP, introduce general procedures for
inference and prediction, and show that it outperforms its main competitor,
multivariate GARCH, even on financial data that especially suits GARCH.

** Comment:** Supplementary
Material, Best Student Paper Award
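
The abstract does not spell out how the GWP is constructed. As a rough illustration of how a collection of positive semi-definite matrices indexed by an input can arise, the sketch below assumes a sum of outer products of Gaussian-process-distributed vectors, scaled by a fixed matrix. The function name `draw_gwp_like`, the squared-exponential kernel, the identity scale matrix, and all parameter values are assumptions made for illustration and should not be read as the paper's exact specification.

```python
import numpy as np

def draw_gwp_like(x, dim=2, dof=3, lengthscale=1.0, seed=0):
    """Draw one matrix-valued function Sigma(x) that is PSD at every input.

    Assumed construction: Sigma(x) = sum_i (L u_i(x)) (L u_i(x))^T, where each
    component of u_i is an independent GP draw with unit prior variance.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    # Squared-exponential kernel over the inputs, with jitter for stability.
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / lengthscale**2)
    chol_K = np.linalg.cholesky(K + 1e-6 * np.eye(n))
    # u[i, d, :] is one GP draw evaluated at all n inputs.
    u = np.einsum("nm,idm->idn", chol_K, rng.standard_normal((dof, dim, n)))
    scale = np.eye(dim)  # hypothetical scale matrix L
    lu = np.einsum("de,ien->idn", scale, u)
    # Sum the outer products over the dof index for every input location.
    return np.einsum("idn,ien->nde", lu, lu)

x = np.linspace(0.0, 5.0, 20)
sigma = draw_gwp_like(x)  # shape (20, 2, 2): one covariance matrix per input
```

Because each matrix is a sum of outer products, it is positive semi-definite by construction, and nearby inputs give similar matrices because the underlying GP draws vary smoothly.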

Carl Edward Rasmussen and Hannes Nickisch.
**Gaussian
Processes for Machine Learning (GPML) Toolbox**.
*Journal of Machine Learning Research*, 11:3011-3015, December 2010.

** Abstract:** The GPML toolbox provides a wide range of
functionality for Gaussian process (GP) inference and prediction. GPs are
specified by mean and covariance functions; we offer a library of simple mean
and covariance functions and mechanisms to compose more complex ones. Several
likelihood functions are supported including Gaussian and heavy-tailed for
regression as well as others suitable for classification. Finally, a range of
inference methods is provided, including exact and variational inference,
Expectation Propagation, and Laplace's method dealing with non-Gaussian
likelihoods and FITC for dealing with large regression tasks.

** Comment:** Toolbox available from here. Implements algorithms
from Rasmussen and Williams, 2006.
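
To make concrete what exact inference with a Gaussian likelihood computes, here is a minimal NumPy sketch. It is not the toolbox's own interface: the zero mean function, the squared-exponential covariance, the fixed hyperparameters, and the names `sq_exp_kernel` and `gp_predict` are all assumptions chosen for illustration.

```python
import numpy as np

def sq_exp_kernel(a, b, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential covariance between 1-D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_predict(x_train, y_train, x_test, noise_var=1e-2):
    """Posterior mean and variance of the latent function (zero prior mean)."""
    K = sq_exp_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = sq_exp_kernel(x_train, x_test)
    K_ss = sq_exp_kernel(x_test, x_test)
    # Cholesky factorisation avoids forming an explicit matrix inverse.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha
    var = np.diag(K_ss) - np.sum(v**2, axis=0)
    return mean, var

x_train = np.linspace(0.0, 5.0, 20)
y_train = np.sin(x_train)
x_test = np.array([1.0, 2.5, 4.0])
mean, var = gp_predict(x_train, y_train, x_test)
```

Swapping in a different mean or covariance function only changes how `K`, `K_s` and `K_ss` are built; non-Gaussian likelihoods need one of the approximate inference methods listed in the abstract instead.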

Hannes Nickisch and Carl Edward Rasmussen.
**Gaussian mixture
modeling with Gaussian process latent variable models**.
In *Proceedings of the 32nd DAGM Symposium on Pattern Recognition*,
Lecture Notes in Computer Science (LNCS), Darmstadt, Germany, September 2010.
Springer, doi
10.1007/978-3-642-15986-2_28.

** Abstract:** Density
modeling is notoriously difficult for high dimensional data. One approach to
the problem is to search for a lower dimensional manifold which captures the
main characteristics of the data. Recently, the Gaussian Process Latent
Variable Model (GPLVM) has successfully been used to find low dimensional
manifolds in a variety of complex data. The GPLVM consists of a set of points
in a low dimensional latent space, and a stochastic map to the observed
space. We show how it can be interpreted as a density model in the observed
space. However, the GPLVM is not trained as a density model and therefore
yields bad density estimates. We propose a new training strategy and obtain
improved generalisation performance and better density estimates in
comparative evaluations on several benchmark data sets.

Ryan Turner and Carl Edward Rasmussen.
**Model based learning
of sigma points in unscented Kalman filtering**.
In Samuel Kaski, David J. Miller, Erkki Oja, and Antti Honkela, editors,
*Machine Learning for Signal Processing (MLSP 2010)*, pages 178-183,
Kittilä, Finland, August 2010.

** Abstract:** The
unscented Kalman filter (UKF) is a widely used method in control and time
series applications. The UKF suffers from arbitrary parameters necessary for
a step known as sigma point placement, causing it to perform poorly in
nonlinear problems. We show how to treat sigma point placement in a UKF as a
learning problem in a model based view. We demonstrate that learning to place
the sigma points correctly from data can make sigma point collapse much less
likely. Learning can result in a significant increase in predictive
performance over default settings of the parameters in the UKF and other
filters designed to avoid the problems of the UKF, such as the GP-ADF. At the
same time, we maintain a lower computational complexity than the other
methods. We call our method UKF-L.
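
For context, the parameters in question are commonly written alpha, beta and kappa in the scaled unscented transform. The sketch below shows only this standard sigma point placement with fixed, hand-picked values; it does not implement the learning procedure of UKF-L. The function names and default values are assumptions for illustration.

```python
import numpy as np

def sigma_points(mean, cov, alpha=1.0, beta=2.0, kappa=1.0):
    """Standard scaled sigma points and their mean/covariance weights."""
    n = mean.shape[0]
    lam = alpha**2 * (n + kappa) - n
    # Columns of the Cholesky factor give the directions of the sigma points.
    sqrt_cov = np.linalg.cholesky((n + lam) * cov)
    points = np.vstack([mean, mean + sqrt_cov.T, mean - sqrt_cov.T])
    w_mean = np.full(2 * n + 1, 1.0 / (2.0 * (n + lam)))
    w_cov = w_mean.copy()
    w_mean[0] = lam / (n + lam)
    w_cov[0] = w_mean[0] + (1.0 - alpha**2 + beta)
    return points, w_mean, w_cov

def unscented_transform(f, mean, cov, **params):
    """Propagate a Gaussian through f using the sigma points."""
    points, w_mean, w_cov = sigma_points(mean, cov, **params)
    y = np.array([f(p) for p in points])
    y_mean = w_mean @ y
    diff = y - y_mean
    y_cov = (w_cov[:, None] * diff).T @ diff
    return y_mean, y_cov

A = np.array([[1.0, 2.0], [0.0, 1.0]])
mean = np.array([1.0, -1.0])
cov = np.array([[2.0, 0.5], [0.5, 1.0]])
y_mean, y_cov = unscented_transform(lambda p: A @ p, mean, cov)
```

For a linear map the transform recovers the exact mean and covariance for any valid parameter setting; the choice of alpha, beta and kappa only matters once the function is nonlinear, which is where the placement becomes a learning problem.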

Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen,
and Aníbal Figueiras-Vidal.
**Sparse
spectrum Gaussian process regression**.
*Journal of Machine Learning Research*, 11:1865-1881, June 2010.

** Abstract:** We present a new sparse Gaussian Process (GP)
model for regression. The key novel idea is to sparsify the *spectral
representation* of the GP. This leads to a simple, practical algorithm for
regression tasks. We compare the achievable trade-offs between predictive
accuracy and computational requirements, and show that these are typically
superior to existing state-of-the-art sparse approximations. We discuss both
the weight space and function space representations, and note that the new
construction implies priors over functions which are always stationary, and
can approximate any covariance function in this class.
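
The idea of a sparse spectral representation can be illustrated with a finite trigonometric basis: a stationary covariance is the Fourier transform of its spectral density, so a finite set of spectral points yields an approximate kernel. The sketch below assumes a squared-exponential covariance and spectral points drawn at random from its spectral density and then held fixed; how the paper selects or optimises the spectral points is not reproduced here. The name `trig_features` and all numerical values are hypothetical.

```python
import numpy as np

def trig_features(x, freqs, signal_var=1.0):
    """Map 1-D inputs to 2m cosine and sine basis functions."""
    phase = x[:, None] * freqs[None, :]
    scale = np.sqrt(signal_var / len(freqs))
    return scale * np.hstack([np.cos(phase), np.sin(phase)])

rng = np.random.default_rng(0)
lengthscale = 1.0
m = 2000  # number of spectral points
# The spectral density of the squared-exponential kernel is Gaussian.
freqs = rng.normal(0.0, 1.0 / lengthscale, size=m)

x = np.linspace(-2.0, 2.0, 5)
phi = trig_features(x, freqs)
K_approx = phi @ phi.T
K_exact = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / lengthscale**2)
```

Since `K_approx` depends on the inputs only through their differences, the implied prior is stationary, and regression reduces to Bayesian linear regression on the 2m features.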

Yunus Saatçi, Ryan Turner, and Carl Edward Rasmussen.
**Gaussian process
change point models**.
In *27th International Conference on Machine Learning*, pages 927-934,
Haifa, Israel, June 2010.

** Abstract:** We combine Bayesian
online change point detection with Gaussian processes to create a
nonparametric time series model which can handle change points. The model can
be used to locate change points in an online manner; and, unlike other
Bayesian online change point detection algorithms, is applicable when
temporal correlations in a regime are expected. We show three variations on
how to apply Gaussian processes in the change point context, each with their
own advantages. We present methods to reduce the computational burden of
these models and demonstrate it on several real world data sets.

Ryan Turner, Marc Peter Deisenroth, and Carl Edward Rasmussen.
**State-space
inference and learning with Gaussian processes**.
In Yee Whye Teh and Mike Titterington, editors, *13th International
Conference on Artificial Intelligence and Statistics*, volume 9 of
*W & CP*, pages 868-875, Chia Laguna, Sardinia, Italy, May 13-15
2010. Journal of Machine Learning Research.

** Abstract:**
State-space inference and learning with Gaussian processes (GPs) is an
unsolved problem. We propose a new, general methodology for inference and
learning in nonlinear state-space models that are described probabilistically
by non-parametric GP models. We apply the expectation maximization algorithm
to iterate between inference in the latent state-space and learning the
parameters of the underlying GP dynamics model.

** Comment:** poster.

M. M. Churchland, B. M. Yu, J. P. Cunningham, L. P. Sugrue, M. R. Cohen, G. S.
Corrado, W. T. Newsome, A. M. Clark, P. Hosseini, B. B. Scott, D. C. Bradley,
M. A. Smith, A. Kohn, J. A. Movshon, K. M. Armstrong, T. Moore, S. W. Chang,
L. H. Snyder, S. G. Lisberger, N. J. Priebe, I. M. Finn, D. Ferster, S. I.
Ryu, G. Santhanam, M. Sahani, and K. V. Shenoy.
**Stimulus
onset quashes neural variability: a widespread cortical phenomenon**.
*Nature Neuroscience*, 13:369-378, 2010.

** Abstract:**
Neural responses are typically characterized by computing the mean firing
rate, but response variability can exist across trials. Many studies have
examined the effect of a stimulus on the mean response, but few have examined
the effect on response variability. We measured neural variability in 13
extracellularly recorded datasets and one intracellularly recorded dataset
from seven areas spanning the four cortical lobes in monkeys and cats. In
every case, stimulus onset caused a decline in neural variability. This
occurred even when the stimulus produced little change in mean firing rate.
The variability decline was observed in membrane potential recordings, in the
spiking of individual neurons and in correlated spiking variability measured
with implanted 96-electrode arrays. The variability decline was observed for
all stimuli tested, regardless of whether the animal was awake, behaving or
anaesthetized. This widespread variability decline suggests a rather general
property of cortex, that its state is stabilized by an input.

Marc Peter Deisenroth.
**Efficient reinforcement
learning using Gaussian processes**.
PhD thesis, Karlsruhe Institute of Technology, Karlsruhe, Germany, 2010.

** Abstract:** In many research areas, including control and
medical applications, we face decision-making problems where data are limited
and/or the underlying generative process is complicated and partially
unknown. In these scenarios, we can profit from algorithms that learn from
data and aid decision making.

Reinforcement learning (RL) is a general
computational approach to experience-based goal-directed learning for
sequential decision making under uncertainty. However, RL often lacks
efficiency in terms of the number of required trials when no task-specific
knowledge is available. This lack of efficiency makes RL often inapplicable
to (optimal) control problems. Thus, a central issue in RL is to speed up
learning by extracting more information from available experience.

The
contributions of this dissertation are threefold:

1. We propose PILCO, a
fully Bayesian approach for efficient RL in continuous-valued state and
action spaces when no expert knowledge is available. PILCO is based on
well-established ideas from statistics and machine learning. PILCO's key
ingredient is a probabilistic dynamics model learned from data, which is
implemented by a Gaussian process (GP). The GP carefully quantifies knowledge
by a probability distribution over plausible dynamics models. By averaging
over all these models during long-term planning and decision making, PILCO
takes uncertainties into account in a principled way and, therefore, reduces
model bias, a central problem in model-based RL.

2. Due to its generality
and efficiency, PILCO can be considered a conceptual and practical approach
to jointly learning models and controllers when expert knowledge is difficult
to obtain or simply not available. For this scenario, we investigate PILCO's
properties and its applicability to challenging real and simulated nonlinear
control problems. For example, we consider the tasks of learning to swing up
a double pendulum attached to a cart or to balance a unicycle with five
degrees of freedom. Across all tasks we report unprecedented automation and
an unprecedented learning efficiency for solving these tasks.

3. As a
step toward PILCO's extension to partially observable Markov decision
processes, we propose a principled algorithm for robust filtering and
smoothing in GP dynamic systems. Unlike commonly used Gaussian filters for
nonlinear systems, it relies neither on function linearization nor on
finite-sample representations of densities. Our algorithm profits from exact
moment matching for predictions while keeping all computations analytically
tractable. We present experimental evidence that demonstrates the robustness
and the advantages of our method over unscented Kalman filters, the cubature
Kalman filter, and the extended Kalman filter.

Richard E. Turner and Maneesh Sahani.
**Statistical
inference for single- and multi-band probabilistic amplitude
demodulation**.
In *Proceedings of the IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP)*, pages 5466-5469, 2010.

** Abstract:** Amplitude demodulation is an ill-posed problem and so it is
natural to treat it from a Bayesian viewpoint, inferring the most likely
carrier and envelope under probabilistic constraints. One such treatment is
Probabilistic Amplitude Demodulation (PAD), which, whilst computationally
more intensive than traditional approaches, offers several advantages. Here
we provide methods for estimating the uncertainty in the PAD-derived
envelopes and carriers, and for learning free-parameters like the time-scale
of the envelope. We show how the probabilistic approach can naturally handle
noisy and missing data. Finally, we indicate how to extend the model to
signals which contain multiple modulators and carriers.
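
The ill-posedness mentioned at the start of the abstract can be seen in a few lines: any strictly positive envelope, paired with the carrier obtained by dividing it out, reproduces the signal exactly, so the data alone cannot identify the decomposition. The signals and variable names below are synthetic and chosen purely for illustration.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 1000)

# A "true" decomposition: slow positive envelope times fast carrier.
envelope = 1.5 + np.sin(2 * np.pi * 2 * t)
carrier = np.sin(2 * np.pi * 100 * t)
signal = envelope * carrier

# A different positive envelope with its implied carrier fits equally well.
alt_envelope = 2.0 + np.cos(2 * np.pi * 3 * t)
alt_carrier = signal / alt_envelope
reconstruction = alt_envelope * alt_carrier
```

Prior constraints on the envelope and carrier, such as slowness of the envelope, are what single out one decomposition among these equally exact alternatives.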

Richard E. Turner.
**Statistical
Models for Natural Sounds**.
PhD thesis, Gatsby Computational Neuroscience Unit, UCL, 2010.

** Abstract:** It is important to understand the rich structure of natural
sounds in order to solve important tasks, like automatic speech recognition,
and to understand auditory processing in the brain. This thesis takes a step
in this direction by characterising the statistics of simple natural sounds.
We focus on the statistics because perception often appears to depend on
them, rather than on the raw waveform. For example the perception of auditory
textures, like running water, wind, fire and rain, depends on
summary-statistics, like the rate of falling rain droplets, rather than on
the exact details of the physical source. In order to analyse the statistics
of sounds accurately it is necessary to improve a number of traditional
signal processing methods, including those for amplitude demodulation,
time-frequency analysis, and sub-band demodulation. These estimation tasks
are ill-posed and therefore it is natural to treat them as Bayesian inference
problems. The new probabilistic versions of these methods have several
advantages. For example, they perform more accurately on natural signals and
are more robust to noise, they can also fill-in missing sections of data, and
provide error-bars. Furthermore, free-parameters can be learned from the
signal. Using these new algorithms we demonstrate that the energy, sparsity,
modulation depth and modulation time-scale in each sub-band of a signal are
critical statistics, together with the dependencies between the sub-band
modulators. In order to validate this claim, a model containing co-modulated
coloured noise carriers is shown to be capable of generating a range of
realistic sounding auditory textures. Finally, we explored the connection
between the statistics of natural sounds and perception. We demonstrate that
inference in the model for auditory textures qualitatively replicates the
primitive grouping rules that listeners use to understand simple acoustic
scenes. This suggests that the auditory system is optimised for the
statistics of natural sounds.

Andrew Gordon Wilson and Zoubin Ghahramani.
**Copula
processes**.
In *Advances in Neural Information Processing Systems 23*, 2010.
Spotlight.

** Abstract:** We define a copula process which
describes the dependencies between arbitrarily many random variables
independently of their marginal distributions. As an example, we develop a
stochastic volatility model, Gaussian Copula Process Volatility (GCPV), to
predict the latent standard deviations of a sequence of random variables. To
make predictions we use Bayesian inference, with the Laplace approximation,
and with Markov chain Monte Carlo as an alternative. We find our model can
outperform GARCH on simulated and financial data. And unlike GARCH, GCPV can
easily handle missing data, incorporate covariates other than time, and model
a rich class of covariance structures.

** Comment:** Supplementary
Material, slides.

Ryan Turner, Marc Peter Deisenroth, and Carl Edward Rasmussen.
**System
identification in Gaussian process dynamical systems**.
In Dilan Görür, editor, *NIPS Workshop on Nonparametric Bayes*,
Whistler, BC, Canada, December 2009.

** Comment:** poster.

B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, and
M. Sahani.
**Gaussian-process
factor analysis for low-dimensional single-trial analysis of neural
population activity**.
In *Advances in Neural Information Processing Systems 21*, pages 1-8,
Vancouver, BC, December 2009.

** Abstract:** We consider the
problem of extracting smooth, low-dimensional neural trajectories that
summarize the activity recorded simultaneously from many neurons on
individual experimental trials. Beyond the benefit of visualizing the
high-dimensional, noisy spiking activity in a compact form, such trajectories
can offer insight into the dynamics of the neural circuitry underlying the
recorded activity. Current methods for extracting neural trajectories involve
a two-stage process: the spike trains are first smoothed over time, then a
static dimensionality-reduction technique is applied. We first describe
extensions of the two-stage methods that allow the degree of smoothing to be
chosen in a principled way and that account for spiking variability, which
may vary both across neurons and across time. We then present a novel method
for extracting neural trajectories, Gaussian-process factor analysis (GPFA),
which unifies the smoothing and dimensionality-reduction operations in a
common probabilistic framework. We applied these methods to the activity of
61 neurons recorded simultaneously in macaque premotor and motor cortices
during reach planning and execution. By adopting a goodness-of-fit metric
that measures how well the activity of each neuron can be predicted by all
other recorded neurons, we found that the proposed extensions improved the
predictive ability of the two-stage methods. The predictive ability was
further improved by going to GPFA. From the extracted trajectories, we
directly observed a convergence in neural state during motor planning, an
effect that was shown indirectly by previous studies. We then show how such
methods can be a powerful tool for relating the spiking activity across a
neural population to the subject's behavior on a single-trial basis. Finally,
to assess how well the proposed methods characterize neural population
activity when the underlying time course is known, we performed simulations
that revealed that GPFA performed tens of percent better than the best
two-stage method.

R. Adams and Zoubin Ghahramani.
**Archipelago:
nonparametric Bayesian semi-supervised learning**.
In Léon Bottou and Michael Littman, editors, *26th International
Conference on Machine Learning*, pages 1-8, Montréal, QC, Canada,
June 2009. Omnipress.

** Abstract:** Semi-supervised learning
(SSL) is classification where additional unlabeled data can be used to
improve accuracy. Generative approaches are appealing in this situation, as a
model of the data's probability density can assist in identifying clusters.
Nonparametric Bayesian methods, while ideal in theory due to their principled
motivations, have been difficult to apply to SSL in practice. We present a
nonparametric Bayesian method that uses Gaussian processes for the generative
model, avoiding many of the problems associated with Dirichlet process
mixture models. Our model is fully generative and we take advantage of recent
advances in Markov chain Monte Carlo algorithms to provide a practical
inference method. Our method compares favorably to competing approaches on
synthetic and real-world multi-class data.

** Comment:** This paper was awarded Honourable Mention for
Best Paper at ICML 2009.

Marc Peter Deisenroth, Marco F. Huber, and Uwe D. Hanebeck.
**Analytic
moment-based Gaussian process filtering**.
In Léon Bottou and Michael Littman, editors, *26th International
Conference on Machine Learning*, pages 225-232, Montréal, QC,
Canada, June 2009. Omnipress.

** Abstract:** We propose an
analytic moment-based filter for nonlinear stochastic dynamic systems modeled
by Gaussian processes. Exact expressions for the expected value and the
covariance matrix are provided for both the prediction step and the filter
step, where an additional Gaussian assumption is exploited in the latter
case. Our filter does not require further approximations. In particular, it
avoids finite-sample approximations. We compare the filter to a variety of
Gaussian filters, that is, the EKF, the UKF, and the recent GP-UKF proposed
by Ko et al. (2007).

** Comment:** With corrections. code.

Marc Peter Deisenroth, Carl Edward Rasmussen, and Jan Peters.
**Gaussian process
dynamic programming**.
*Neurocomputing*, 72(7-9):1508-1524, March 2009, doi
10.1016/j.neucom.2008.12.019.

** Abstract:** Reinforcement
learning (RL) and optimal control of systems with continuous states and
actions require approximation techniques in most interesting cases. In this
article, we introduce Gaussian process dynamic programming (GPDP), an
approximate value function-based RL algorithm. We consider both a classic
optimal control problem, where problem-specific prior knowledge is available,
and a classic RL problem, where only very general priors can be used. For the
classic optimal control problem, GPDP models the unknown value functions with
Gaussian processes and generalizes dynamic programming to continuous-valued
states and actions. For the RL problem, GPDP starts from a given initial
state and explores the state space using Bayesian active learning. To design
a fast learner, available data have to be used efficiently. Hence, we propose
to learn probabilistic models of the a priori unknown transition dynamics and
the value functions on the fly. In both cases, we successfully apply the
resulting continuous-valued controllers to the under-actuated pendulum swing
up and analyze the performances of the suggested algorithms. It turns out
that GPDP uses data very efficiently and can be applied to problems, where
classic dynamic programming would be cumbersome.

** Comment:** code.

C. Chang, J. P. Cunningham, and G. Glover.
**Influence
of heart rate on the BOLD signal: the cardiac response function**.
*NeuroImage*, 44:857-869, 2009.

** Abstract:** It has
previously been shown that low-frequency fluctuations in both respiratory
volume and cardiac rate can induce changes in the blood-oxygen level
dependent (BOLD) signal. Such physiological noise can obscure the detection
of neural activation using fMRI, and it is therefore important to model and
remove the effects of this noise. While a hemodynamic response function
relating respiratory variation (RV) and the BOLD signal has been described,
no such mapping for heart rate (HR) has been proposed. In the current study,
the effects of RV and HR are simultaneously deconvolved from resting state
fMRI. It is demonstrated that a convolution model including RV and HR can
explain significantly more variance in gray matter BOLD signal than a model
that includes RV alone, and an average HR response function is proposed that
well characterizes our subject population. It is observed that the voxel-wise
morphology of the deconvolved RV responses is preserved when HR is included
in the model, and that its form is adequately modeled by Birn et al.'s
previously described respiration response function. Furthermore, it is shown
that modeling out RV and HR can significantly alter functional connectivity
maps of the default-mode network.

B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, and
M. Sahani.
**Gaussian-process
factor analysis for low-dimensional single-trial analysis of neural
population activity**.
*Journal of Neurophysiology*, 102:614-635, 2009.

** Abstract:** We consider the problem of extracting smooth, low-dimensional
neural trajectories that summarize the activity recorded simultaneously from
many neurons on individual experimental trials. Beyond the benefit of
visualizing the high-dimensional, noisy spiking activity in a compact form,
such trajectories can offer insight into the dynamics of the neural circuitry
underlying the recorded activity. Current methods for extracting neural
trajectories involve a two-stage process: the spike trains are first smoothed
over time, then a static dimensionality-reduction technique is applied. We
first describe extensions of the two-stage methods that allow the degree of
smoothing to be chosen in a principled way and that account for spiking
variability, which may vary both across neurons and across time. We then
present a novel method for extracting neural trajectories, Gaussian-process
factor analysis (GPFA), which unifies the smoothing and
dimensionality-reduction operations in a common probabilistic framework. We applied these
methods to the activity of 61 neurons recorded simultaneously in macaque
premotor and motor cortices during reach planning and execution. By adopting
a goodness-of-fit metric that measures how well the activity of each neuron
can be predicted by all other recorded neurons, we found that the proposed
extensions improved the predictive ability of the two-stage methods. The
predictive ability was further improved by going to GPFA. From the extracted
trajectories, we directly observed a convergence in neural state during motor
planning, an effect that was shown indirectly by previous studies. We then
show how such methods can be a powerful tool for relating the spiking
activity across a neural population to the subject's behavior on a
single-trial basis. Finally, to assess how well the proposed methods
characterize neural population activity when the underlying time course is
known, we performed simulations that revealed that GPFA performed tens of
percent better than the best two-stage method.

J. P. Cunningham, B. M. Yu, K. V. Shenoy, and M. Sahani.
**Inferring
neural firing rates from spike trains using Gaussian processes**.
In *Advances in Neural Information Processing Systems 20*, pages 1-8,
Vancouver, BC, December 2008.

** Abstract:** Neural spike
trains present challenges to analytical efforts due to their noisy, spiking
nature. Many studies of neuroscientific and neural prosthetic importance rely
on a smoothed, denoised estimate of the spike train's underlying firing rate.
Current techniques to find time-varying firing rates require ad hoc choices
of parameters, offer no confidence intervals on their estimates, and can
obscure potentially important single trial variability. We present a new
method, based on a Gaussian Process prior, for inferring probabilistically
optimal estimates of firing rate functions underlying single or multiple
neural spike trains. We test the performance of the method on simulated data
and experimentally gathered neural spike trains, and we demonstrate
improvements over conventional estimators.

** Comment:** Spotlight Presentation

H. Kim and Zoubin Ghahramani.
**Outlier robust
Gaussian process classification**.
In Niels da Vitoria Lobo et al., editors, *Structural, Syntactic and Statistical
Pattern Recognition*, volume 5342 of *Lecture Notes in Computer
Science (LNCS)*, pages 896-905, Berlin, Germany, December 2008. Springer
Berlin / Heidelberg.

** Abstract:** Gaussian process
classifiers (GPCs) are a fully statistical model for kernel classification.
We present a form of GPC which is robust to labeling errors in the data set.
This model allows label noise not only near the class boundaries, but also
far from the class boundaries which can result from mistakes in labelling or
gross errors in measuring the input features. We derive an outlier robust
algorithm for training this model which alternates iterations based on the EP
approximation and hyperparameter updates until convergence. We show the
usefulness of the proposed algorithm with model selection method through
simulation results.

Hannes Nickisch and Carl Edward Rasmussen.
**Approximations
for binary Gaussian process classification**.
*Journal of Machine Learning Research*, 9:2035-2078, October 2008.

** Abstract:** We provide a comprehensive overview of many
recent algorithms for approximate inference in Gaussian process models for
probabilistic binary classification. The relationships between several
approaches are elucidated theoretically, and the properties of the different
algorithms are corroborated by experimental results. We examine both 1) the
quality of the predictive distributions and 2) the suitability of the
different marginal likelihood approximations for model selection (selecting
hyperparameters) and compare to a gold standard based on MCMC. Interestingly,
some methods produce good predictive distributions although their marginal
likelihood approximations are poor. Strong conclusions are drawn about the
methods: The Expectation Propagation algorithm is almost always the method of
choice unless the computational budget is very tight. We also extend existing
methods in various ways, and provide unifying code implementing all
approaches.

J. P. Cunningham, K. V. Shenoy, and M. Sahani.
**Fast
Gaussian process methods for point process intensity estimation**.
In *25th International Conference on Machine Learning*, pages 1-8,
Helsinki, Finland, June 2008.

** Abstract:** Point processes
are difficult to analyze because they provide only a sparse and noisy
observation of the intensity function driving the process. Gaussian Processes
offer an attractive framework within which to infer underlying intensity
functions. The result of this inference is a continuous function defined
across time that is typically more amenable to analytical efforts. However, a
naive implementation will become computationally infeasible in any problem of
reasonable size, both in memory and run time requirements. We demonstrate
problem specific methods for a class of renewal processes that eliminate the
memory burden and reduce the solve time by orders of magnitude.

Marc Peter Deisenroth, Jan Peters, and Carl Edward Rasmussen.
**Approximate
dynamic programming with Gaussian processes**.
In *2008 American Control Conference (ACC 2008)*, pages 4480-4485,
Seattle, WA, USA, June 2008.

** Abstract:** In general, it is
difficult to determine an optimal closed-loop policy in nonlinear control
problems with continuous-valued state and control domains. Hence,
approximations are often inevitable. The standard method of discretizing
states and controls suffers from the curse of dimensionality and strongly
depends on the chosen temporal sampling rate. The paper introduces Gaussian
Process Dynamic Programming (GPDP). In GPDP, value functions in the Bellman
recursion of the dynamic programming algorithm are modeled using Gaussian
processes. GPDP returns an optimal state-feedback for a finite set of states.
Based on these outcomes, we learn a possibly discontinuous closed-loop policy
on the entire state space by switching between two independently trained
Gaussian processes.

** Comment:** code.

Marc Peter Deisenroth, Carl Edward Rasmussen, and Jan Peters.
**Model-based
reinforcement learning with continuous states and actions**.
In *Proceedings of the 16th European Symposium on Artificial Neural Networks
(ESANN 2008)*, pages 19-24, Bruges, Belgium, April 2008.

** Abstract:** Finding an optimal policy in a reinforcement learning (RL)
framework with continuous state and action spaces is challenging. Approximate
solutions are often inevitable. GPDP is an approximate dynamic programming
algorithm based on Gaussian process (GP) models for the value functions. In
this paper, we extend GPDP to the case of unknown transition dynamics. After
building a GP model for the transition dynamics, we apply GPDP to this model
and determine a continuous-valued policy in the entire state space. We apply
the resulting controller to the underpowered pendulum swing up. Moreover, we
compare our results on this RL task to a nearly optimal discrete DP solution
in a fully known environment.

J. P. Cunningham.
**Derivation
of Expectation Propagation for "fast Gaussian process methods for point
process intensity estimation"**.
Technical report, Stanford University, 2008.

** Abstract:** We
derive the Expectation Propagation algorithm updates for approximating the
posterior distribution on intensity in a conditionally inhomogeneous gamma
interval process with a Gaussian Process prior (GP IGIP), a model which
appeared in Cunningham, Shenoy, Sahani (2008) ICML.

Hyun-Chul Kim and Zoubin Ghahramani.
**Outlier robust
Gaussian process classification**.
In Niels da Vitoria Lobo, Takis Kasparis, Fabio Roli, James Tin-Yau Kwok,
Michael Georgiopoulos, Georgios C. Anagnostopoulos, and Marco Loog, editors,
*SSPR/SPR*, volume 5342 of *Lecture Notes in Computer Science*,
pages 896-905. Springer, 2008.

** Abstract:** Gaussian
process classifiers (GPCs) are a fully statistical model for kernel
classification. We present a form of GPC which is robust to labeling errors
in the data set. This model allows label noise not only near the class
boundaries, but also far from the class boundaries which can result from
mistakes in labelling or gross errors in measuring the input features. We
derive an outlier robust algorithm for training this model which alternates
iterations based on the EP approximation and hyperparameter updates until
convergence. We show the usefulness of the proposed algorithm with a model
selection method through simulation results.

Richard E. Turner and Maneesh Sahani.
**Modeling
natural sounds with modulation cascade processes**.
In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, *Advances in Neural
Information Processing Systems 20*, volume 20. The MIT Press, 2008.

** Abstract:** Natural sounds are
structured on many time-scales. A typical segment of speech, for example,
contains features that span four orders of magnitude: Sentences ($\sim 1$ s);
phonemes ($\sim 10^{-1}$ s); glottal pulses ($\sim 10^{-2}$ s); and formants
($\sim 10^{-3}$ s). The auditory system uses information from each of these
time-scales to solve complicated tasks such as auditory scene analysis [1].
One route toward understanding how auditory processing accomplishes this
analysis is to build neuroscience-inspired algorithms which solve similar
tasks and to compare the properties of these algorithms with properties of
auditory processing. There is however a discord: Current machine-audition
algorithms largely concentrate on the shorter time-scale structures in
sounds, and the longer structures are ignored. The reason for this is
two-fold. Firstly, it is a difficult technical problem to construct an
algorithm that utilises both sorts of information. Secondly, it is
computationally demanding to simultaneously process data both at high
resolution (to extract short temporal information) and for long duration (to
extract long temporal information). The contribution of this work is to
develop a new statistical model for natural sounds that captures structure
across a wide range of time-scales, and to provide efficient learning and
inference algorithms. We demonstrate the success of this approach on a
missing data task.

W. Chu, V. Sindhwani, Z. Ghahramani, and S. Keerthi.
**Relational
learning with Gaussian processes**.
In B. Schölkopf, J. Platt, and T. Hofmann, editors, *Advances in Neural
Information Processing Systems 19*, volume 19 of *Bradford
Books*, pages 289-296, Cambridge, MA, USA, September 2007. The MIT
Press.
Online contents gives pages 314-321, and 289-296 on pdf of contents.

** Abstract:** Correlation between instances is often modelled
via a kernel function using input attributes of the instances. Relational
knowledge can further reveal additional pairwise correlations between
variables of interest. In this paper, we develop a class of models which
incorporates both reciprocal relational information and input attributes
using Gaussian process techniques. This approach provides a novel
non-parametric Bayesian framework with a data-dependent prior for supervised
learning tasks. We also apply this framework to semi-supervised learning.
Experimental results on several real world data sets verify the usefulness of
this algorithm.

Joaquin Quiñonero-Candela, Carl Edward Rasmussen, and Christopher K. I.
Williams.
**Approximation
methods for Gaussian process regression**.
In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, *Large-Scale
Kernel Machines*, Neural Information Processing, pages 203-223. The
MIT Press, Cambridge, MA, USA, September 2007.

** Abstract:** A wealth of computationally efficient approximation methods for
Gaussian process regression have been recently proposed. We give a unifying
overview of sparse approximations, following Quiñonero-Candela and Rasmussen (2005), and a
brief review of approximate matrix-vector multiplication methods.

** Comment:** book

Edward Snelson and Zoubin Ghahramani.
**Local and global
sparse Gaussian process approximations**.
In M. Meila and X. Shen, editors, *11th International Conference on
Artificial Intelligence and Statistics*. Omnipress, 2007.

** Abstract:** Gaussian process (GP) models are flexible probabilistic
nonparametric models for regression, classification and other tasks.
Unfortunately they suffer from computational intractability for large data
sets. Over the past decade there have been many different approximations
developed to reduce this cost. Most of these can be termed global
approximations, in that they try to summarize all the training data via a
small set of support points. A different approach is that of local
regression, where many local experts account for their own part of space. In
this paper we start by investigating the regimes in which these different
approaches work well or fail. We then proceed to develop a new sparse GP
approximation which is a combination of both the global and local approaches.
Theoretically we show that it is derived as a natural extension of the
framework developed by Quiñonero-Candela and
Rasmussen for sparse GP approximations. We demonstrate the benefits of
the combined approximation on some 1D examples for illustration, and on some
large real-world data sets.

Richard E. Turner and Maneesh Sahani.
**Probabilistic
amplitude demodulation**.
In *7th International Conference on Independent Component Analysis and
Signal Separation*, pages 544-551, 2007.

** Abstract:**
Auditory scene analysis is extremely challenging. One approach, perhaps that
adopted by the brain, is to shape useful representations of sounds on prior
knowledge about their statistical structure. For example, sounds with
harmonic sections are common and so time-frequency representations are
efficient. Most current representations concentrate on the shorter
components. Here, we propose representations for structures on longer
time-scales, like the phonemes and sentences of speech. We decompose a sound
into a product of processes, each with its own characteristic time-scale.
This demodulation cascade relates to classical amplitude demodulation, but
traditional algorithms fail to realise the representation fully. A new
approach, probabilistic amplitude demodulation, is shown to out-perform the
established methods, and to easily extend to representation of a full
demodulation cascade.

Malte Kuß and Carl Edward Rasmussen.
**Assessing
approximations for Gaussian process classification**.
In Y. Weiss, B. Schölkopf, and J. Platt, editors, *Advances in Neural
Information Processing Systems 18*, pages 699-706, Cambridge, MA, USA,
April 2006. The MIT Press.

** Abstract:** Gaussian processes
are attractive models for probabilistic classification but unfortunately
exact inference is analytically intractable. We compare Laplace's method and
Expectation Propagation (EP) focusing on marginal likelihood estimates and
predictive performance. We explain theoretically and corroborate empirically
that EP is superior to Laplace. We also compare to a sophisticated MCMC
scheme and show that EP is surprisingly accurate.

Hyun-Chul Kim and Zoubin Ghahramani.
**Bayesian Gaussian
process classification with the EM-EP algorithm**.
*IEEE Trans. Pattern Anal. Mach. Intell.*, 28(12):1948-1959, 2006.

** Abstract:** Gaussian process classifiers (GPCs) are Bayesian
probabilistic kernel classifiers. In GPCs, the probability of belonging to a
certain class at an input location is monotonically related to the value of
some latent function at that location. Starting from a Gaussian process prior
over this latent function, data are used to infer both the posterior over the
latent function and the values of hyperparameters to determine various
aspects of the function. Recently, the expectation propagation (EP) approach
has been proposed to infer the posterior over the latent function. Based on
this work, we present an approximate EM algorithm, the EM-EP algorithm, to
learn both the latent function and the hyperparameters. This algorithm is
found to converge in practice and provides an efficient Bayesian framework
for learning hyperparameters of the kernel. A multiclass extension of the
EM-EP algorithm for GPCs is also derived. In the experimental results, the
EM-EP algorithms are as good as or better than other methods for GPCs or Support
Vector Machines (SVMs) with cross-validation.

Hyun-Chul Kim, Daijin Kim, Zoubin Ghahramani, and Sung Yang Bang.
**Appearance-based
gender classification with Gaussian processes**.
*Pattern Recognition Letters*, 27(6):618-626, 2006.

** Abstract:** This paper concerns the gender classification task of
discriminating between men and women from face images. In
appearance-based approaches, the initial images are preprocessed (e.g.
normalized) and input into classifiers. Recently, support vector machines
(SVMs) which are popular kernel classifiers have been applied to gender
classification and have shown excellent performance. SVMs have difficulty in
determining the hyperparameters in kernels (using cross-validation). We
propose to use Gaussian process classifiers (GPCs) which are Bayesian kernel
classifiers. The main advantage of GPCs over SVMs is that they determine the
hyperparameters of the kernel based on a Bayesian model selection criterion.
The experimental results show that our methods outperformed SVMs with
cross-validation on most of the data sets. Moreover, the kernel hyperparameters
found by GPCs using Bayesian methods can be used to improve SVM
performance.

Carl Edward Rasmussen and Christopher K. I. Williams.
**Gaussian processes
for machine learning**.
The MIT Press, 2006.

** Abstract:** Gaussian processes (GPs)
provide a principled, practical, probabilistic approach to learning in kernel
machines. GPs have received increased attention in the machine-learning
community over the past decade, and this book provides a long-needed
systematic and unified treatment of theoretical and practical aspects of GPs
in machine learning. The treatment is comprehensive and self-contained,
targeted at researchers and students in machine learning and applied
statistics.

** Comment:** Winner of the 2009 DeGroot Prize. Book web page, chapters and entire book
pdf. GPML Toolbox.

Edward Snelson and Zoubin Ghahramani.
**Sparse Gaussian
processes using pseudo-inputs**.
In Y. Weiss, B. Schölkopf, and J. Platt, editors, *Advances in Neural
Information Processing Systems 18*, pages 1257-1264. The MIT Press,
Cambridge, MA, 2006.

** Abstract:** We present a new Gaussian
process (GP) regression model whose covariance is parameterized by the
locations of M pseudo-input points, which we learn by a gradient based
optimization. We take $M \ll N$, where N is the number of real data points,
and hence obtain a sparse regression method which has $O(NM^2)$
training cost and $O(M^2)$ prediction cost per test case. We also
find hyperparameters of the covariance function in the same joint
optimization. The method can be viewed as a Bayesian regression model with
particular input dependent noise. The method turns out to be closely related
to several other sparse GP approaches, and we discuss the relation in detail.
We finally demonstrate its performance on some large data sets, and make a
direct comparison to other sparse GP methods. We show that our method can
match full GP performance with small M, i.e. very sparse solutions, and it
significantly outperforms other approaches in this regime.
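
The cost figures quoted in this abstract can be illustrated with a minimal NumPy sketch of pseudo-input prediction. It is not the authors' code: it assumes a unit-variance squared-exponential kernel on 1-D inputs, fixed (not learned) pseudo-inputs, and the helper names `rbf` and `spgp_predict` are hypothetical. Only M x M systems are solved, so the work is dominated by O(NM^2) products; for simplicity nothing is cached between test points.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Unit-variance squared-exponential kernel between 1-D arrays (an assumption)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def spgp_predict(x, y, x_pseudo, x_test, noise_var=0.1, jitter=1e-8):
    """Predictive mean and variance using M fixed pseudo-inputs (illustrative sketch)."""
    M = len(x_pseudo)
    K_m = rbf(x_pseudo, x_pseudo) + jitter * np.eye(M)   # M x M
    K_mn = rbf(x_pseudo, x)                              # M x N
    K_ms = rbf(x_pseudo, x_test)                         # M x T
    # Diagonal of K_n - K_nm K_m^{-1} K_mn, computed in O(N M^2); k(x, x) = 1 here.
    q_diag = np.sum(K_mn * np.linalg.solve(K_m, K_mn), axis=0)
    lam = np.maximum(1.0 - q_diag, 0.0) + noise_var      # input-dependent noise
    Q_m = K_m + (K_mn / lam) @ K_mn.T                    # M x M
    mean = K_ms.T @ np.linalg.solve(Q_m, K_mn @ (y / lam))
    var = (1.0
           - np.sum(K_ms * np.linalg.solve(K_m, K_ms), axis=0)
           + np.sum(K_ms * np.linalg.solve(Q_m, K_ms), axis=0)
           + noise_var)
    return mean, var

# Toy data: N = 6 training points summarised by M = 3 pseudo-inputs.
x = np.arange(6.0)
y = np.sin(x)
x_pseudo = np.array([0.5, 2.5, 4.5])
x_test = np.array([1.0, 3.0])
mean, var = spgp_predict(x, y, x_pseudo, x_test)
```

When the pseudo-inputs coincide with the training inputs, the diagonal correction vanishes and the prediction reduces to that of the full GP, which gives a convenient sanity check for a sketch like this.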

Edward Snelson and Zoubin Ghahramani.
**Variable
noise and dimensionality reduction for sparse Gaussian processes**.
In R. Dechter and T. S. Richardson, editors, *22nd Conference on Uncertainty
in Artificial Intelligence*. AUAI Press, 2006.

** Abstract:** The sparse pseudo-input Gaussian process (SPGP) is a new
approximation method for speeding up GP regression in the case of a large
number of data points N. The approximation is controlled by the gradient
optimization of a small set of M pseudo-inputs, thereby reducing complexity
from $O(N^3)$ to $O(NM^2)$. One limitation of the SPGP is
that this optimization space becomes impractically big for high dimensional
data sets. This paper addresses this limitation by performing automatic
dimensionality reduction. A projection of the input space to a low
dimensional space is learned in a supervised manner, alongside the
pseudo-inputs, which now live in this reduced space. The paper also
investigates the suitability of the SPGP for modeling data with
input-dependent noise. A further extension of the model is made to make it
even more powerful in this regard - we learn an uncertainty parameter for
each pseudo-input. The combination of sparsity, reduced dimension, and
input-dependent noise makes it possible to apply GPs to much larger and more
complex data sets than was previously practical. We demonstrate the benefits
of these methods on several synthetic and real world problems.

Wei Chu and Zoubin Ghahramani.
**Gaussian processes
for ordinal regression**.
*Journal of Machine Learning Research*, 6:1019-1041, 2005.

** Abstract:** We present a probabilistic kernel approach to
ordinal regression based on Gaussian processes. A threshold model that
generalizes the probit function is used as the likelihood function for
ordinal variables. Two inference techniques, based on the Laplace
approximation and the expectation propagation algorithm respectively, are
derived for hyperparameter learning and model selection. We compare these two
Gaussian process approaches with a previous ordinal regression method based
on support vector machines on some benchmark and real-world data sets,
including applications of ordinal regression to collaborative filtering and
gene expression analysis. Experimental results on these data sets verify the
usefulness of our approach.

Wei Chu and Zoubin Ghahramani.
**Preference learning
with Gaussian processes**.
In Luc De Raedt and Stefan Wrobel, editors, *ICML*, volume 119 of
*ACM International Conference Proceeding Series*, pages 137-144. ACM,
2005.

** Abstract:** In this paper, we propose a probabilistic
kernel approach to preference learning based on Gaussian processes. A new
likelihood function is proposed to capture the preference relations in the
Bayesian framework. The generalized formulation is also applicable to tackle
many multiclass problems. The overall approach has the advantages of Bayesian
methods for model selection and probabilistic prediction. Experimental
results compared against the constraint classification approach on several
benchmark datasets verify the usefulness of this algorithm.

Malte Kuß, Tobias Pfingsten, Lehel Csató, and Carl Edward Rasmussen.
**Approximate
inference for robust Gaussian process regression**.
Technical Report 136, Max Planck Institute for Biological Cybernetics,
Tübingen, Germany, 2005.

** Abstract:** Gaussian process
(GP) priors have been successfully used in non-parametric Bayesian regression
and classification models. Inference can be performed analytically only for
the regression model with Gaussian noise. For all other likelihood models
inference is intractable and various approximation techniques have been
proposed. In recent years expectation-propagation (EP) has been developed as
a general method for approximate inference. This article provides a general
summary of how expectation-propagation can be used for approximate inference
in Gaussian process models. Furthermore we present a case study describing
its implementation for a new robust variant of Gaussian process regression.
To gain further insights into the quality of the EP approximation we present
experiments in which we compare to results obtained by Markov chain Monte
Carlo (MCMC) sampling.

Malte Kuß and Carl Edward Rasmussen.
**Assessing
approximate inference for binary Gaussian process classification**.
*Journal of Machine Learning Research*, 6:1679-1704, 2005.

** Abstract:** Gaussian process priors can be used to define
flexible, probabilistic classification models. Unfortunately exact Bayesian
inference is analytically intractable and various approximation techniques
have been proposed. In this work we review and compare Laplace's method and
Expectation Propagation for approximate Bayesian inference in the binary
Gaussian process classification model. We present a comprehensive comparison
of the approximations, their predictive performance and marginal likelihood
estimates to results obtained by MCMC sampling. We explain theoretically and
corroborate empirically the advantages of Expectation Propagation compared to
Laplace's method.

Joaquin Quiñonero-Candela and Carl Edward Rasmussen.
**Analysis of some
methods for reduced rank Gaussian process regression**.
In R. Murray-Smith and R. Shorten, editors, *Switching and Learning in
Feedback Systems*, pages 98-127. Springer, Berlin, Heidelberg, 2005.

** Abstract:** While there is strong motivation for using
Gaussian Processes (GPs) due to their excellent performance in regression and
classification problems, their computational complexity makes them
impractical when the size of the training set exceeds a few thousand cases.
This has motivated the recent proliferation of a number of cost-effective
approximations to GPs, both for classification and for regression. In this
paper we analyze one popular approximation to GPs for regression: the reduced
rank approximation. While generally GPs are equivalent to infinite linear
models, we show that Reduced Rank Gaussian Processes (RRGPs) are equivalent
to finite sparse linear models. We also introduce the concept of degenerate
GPs and show that they correspond to inappropriate priors. We show how to
modify the RRGP to prevent it from being degenerate at test time. Training
RRGPs consists both in learning the covariance function hyperparameters and
the support set. We propose a method for learning hyperparameters for a given
support set. We also review the Sparse Greedy GP (SGGP) approximation (Smola
and Bartlett, 2001), which is a way of learning the support set for given
hyperparameters based on approximating the posterior. We propose an
alternative method to the SGGP that has better generalization capabilities.
Finally we run experiments to compare the different ways of training an RRGP.
We provide some Matlab code for learning RRGPs.

Joaquin Quiñonero-Candela and Carl Edward Rasmussen.
**A
unifying view of sparse approximate Gaussian process regression**.
*Journal of Machine Learning Research*, 6:1939-1959, 2005.

** Abstract:** We provide a new unifying view, including all
existing proper probabilistic sparse approximations for Gaussian process
regression. Our approach relies on expressing the effective prior which the
methods are using. This allows new insights to be gained, and highlights the
relationship between existing methods. It also allows for a clear
theoretically justified ranking of the closeness of the known approximations
to the corresponding full GPs. Finally we point directly to designs of new
better sparse approximations, combining the best of the existing strategies,
within attractive computational constraints.

Carl Edward Rasmussen and Joaquin Quiñonero-Candela.
**Healing the
Relevance Vector Machine through augmentation**.
In L. De Raedt and S. Wrobel, editors, *22nd International Conference on
Machine Learning*, pages 689-696, 2005.

** Abstract:**
The Relevance Vector Machine (RVM) is a sparse approximate Bayesian kernel
method. It provides full predictive distributions for test cases. However,
the predictive uncertainties have the unintuitive property, that *they
get smaller the further you move away from the training cases*. We give a
thorough analysis. Inspired by the analogy to non-degenerate Gaussian
Processes, we suggest augmentation to solve the problem. The purpose of the
resulting model, RVM*, is primarily to corroborate the theoretical and
experimental analysis. Although RVM* could be used in practical applications,
it is no longer a truly sparse model. Experiments show that sparsity comes at
the expense of worse predictive distributions.

Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani.
**Warped Gaussian
processes**.
In S. Thrun, L. Saul, and B. Schölkopf, editors, *Advances in Neural
Information Processing Systems 16*, pages 337-344, Cambridge, MA, USA,
December 2004. The MIT Press.

** Abstract:** We generalise
the Gaussian process (GP) framework for regression by learning a nonlinear
transformation of the GP outputs. This allows for non-Gaussian processes and
non-Gaussian noise. The learning algorithm chooses a nonlinear transformation
such that transformed data is well-modelled by a GP. This can be seen as
including a preprocessing transformation as an integral part of the
probabilistic modelling problem, rather than as an ad-hoc step. We
demonstrate on several real regression problems that learning the
transformation can lead to significantly better performance than using a
regular GP, or a GP with a fixed transformation.

Juš Kocijan, Roderick Murray-Smith, Carl Edward Rasmussen, and Agathe
Girard.
**Gaussian
process model based predictive control**.
In *American Control Conference*, pages 2214-2219, 2004.

** Abstract:** Gaussian process models provide a probabilistic
non-parametric modelling approach for black-box identification of non-linear
dynamic systems. The Gaussian processes can highlight areas of the input
space where prediction quality is poor, due to the lack of data or its
complexity, by indicating the higher variance around the predicted mean.
Gaussian process models contain noticeably fewer coefficients to be optimised.
This paper illustrates possible application of Gaussian process models within
model-based predictive control. The extra information provided by the
Gaussian process model is used in predictive control, where optimisation of
the control signal takes the variance information into account. The predictive
control principle is demonstrated on control of a pH process benchmark.

Carl Edward Rasmussen.
**Gaussian processes in
machine learning**.
In Olivier Bousquet, Ulrike von Luxburg, and Gunnar Rätsch, editors,
*Advanced Lectures on Machine Learning: ML Summer Schools 2003, Canberra,
Australia, February 2 - 14, 2003, Tübingen, Germany, August 4 - 16, 2003,
Revised Lectures*, volume 3176 of *Lecture Notes in Computer Science
(LNCS)*, pages 63-71. Springer-Verlag, Heidelberg, 2004.

** Abstract:** We give a basic introduction to Gaussian Process regression
models. We focus on understanding the role of the stochastic process and how
it is used to define a distribution over functions. We present the simple
equations for incorporating training data and examine how to learn the
hyperparameters using the marginal likelihood. We explain the practical
advantages of Gaussian Process and end with conclusions and a look at the
current trends in GP work.
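
As a rough illustration of the "simple equations" this abstract refers to, the sketch below computes the GP predictive mean, predictive variance, and log marginal likelihood with a Cholesky factorisation. It is a minimal example under stated assumptions, not code from the chapter: the squared-exponential kernel, the fixed hyperparameter values, and the function names `rbf_kernel` and `gp_regression` are all hypothetical choices.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential covariance between 1-D input arrays (assumed kernel)."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_regression(x_train, y_train, x_test, noise_var=0.1):
    """Return predictive mean, latent predictive variance, and log marginal likelihood."""
    n = len(x_train)
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)                              # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    K_s = rbf_kernel(x_train, x_test)                      # n x n_test
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(rbf_kernel(x_test, x_test)) - np.sum(v**2, axis=0)
    # log p(y | X) = -1/2 y^T K^{-1} y - 1/2 log|K| - n/2 log(2 pi)
    log_ml = (-0.5 * y_train @ alpha
              - np.sum(np.log(np.diag(L)))
              - 0.5 * n * np.log(2.0 * np.pi))
    return mean, var, log_ml

# Toy example: with almost no noise, the posterior mean interpolates the data.
x_train = np.arange(6.0)
y_train = np.sin(x_train)
mean, var, log_ml = gp_regression(x_train, y_train, x_train, noise_var=1e-6)
```

In practice the hyperparameters would be chosen by maximising `log_ml` rather than fixed by hand; that optimisation is left out here to keep the sketch short.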

** Comment:** Copyright by Springer, springerlink

Fabian Sinz, Joaquin Quiñonero-Candela, Gökhan H. Bakir, Carl Edward
Rasmussen, and Matthias O. Franz.
**Learning
depth from stereo**.
In C. E. Rasmussen, H. H. Bülthoff, B. Schölkopf, and M. A. Giese,
editors, *26th DAGM Symposium*, volume 3175 of *Lecture Notes in
Computer Science (LNCS)*, pages 245-252, Berlin, Germany, September 2004.
Springer.

** Abstract:** We compare two approaches to the
problem of estimating the depth of a point in space from observing its image
position in two different cameras: 1. The classical photogrammetric approach
explicitly models the two cameras and estimates their intrinsic and extrinsic
parameters using a tedious calibration procedure; 2. A generic machine
learning approach where the mapping from image to spatial coordinates is
directly approximated by a Gaussian Process regression. Our results show that
the generic learning approach, in addition to simplifying the procedure of
calibration, can lead to higher depth accuracies than classical calibration
although no specific domain knowledge is used.

Agathe Girard, Carl Edward Rasmussen, Joaquin Quiñonero-Candela, and
Roderick Murray-Smith.
**Gaussian
process priors with uncertain inputs - application to multiple-step ahead
time series forecasting**.
In S. Becker, S. Thrun, and K. Obermayer, editors, *Advances in Neural
Information Processing Systems 15*, pages 529-536, Cambridge, MA, USA,
December 2003. The MIT Press.

** Abstract:** We consider the
problem of multi-step ahead prediction in time series analysis using the
non-parametric Gaussian process model. k-step ahead forecasting of a
discrete-time non-linear dynamic system can be performed by doing repeated
one-step ahead predictions. For a state-space model of the form
$y_t = f(y_{t-1}, \ldots, y_{t-L})$, the prediction of $y$ at time $t + k$
is based on the point estimates of the previous outputs. In this paper, we
show how, using an analytical Gaussian approximation, we can formally
incorporate the uncertainty about intermediate regressor values, thus
updating the uncertainty on the current prediction.

Carl Edward Rasmussen and Zoubin Ghahramani.
**Bayesian Monte
Carlo**.
In S. Becker, S. Thrun, and K. Obermayer, editors, *Advances in Neural
Information Processing Systems 15*, pages 489-496, Cambridge, MA, USA,
December 2003. The MIT Press.

** Abstract:** We investigate
Bayesian alternatives to classical Monte Carlo methods for evaluating
integrals. Bayesian Monte Carlo (BMC) allows the incorporation of prior
knowledge, such as smoothness of the integrand, into the estimation. In a
simple problem we show that this outperforms any classical importance
sampling method. We also attempt more challenging multidimensional integrals
involved in computing marginal likelihoods of statistical models (a.k.a.
partition functions and model evidences). We find that Bayesian Monte Carlo
outperformed Annealed Importance Sampling, although for very high dimensional
problems or problems with massive multimodality BMC may be less adequate. One
advantage of the Bayesian approach to Monte Carlo is that samples can be
drawn from any distribution. This allows for the possibility of active design
of sample points so as to maximise information gain.
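
The idea of placing a GP prior on the integrand can be sketched in one dimension. The example below is an illustration under assumptions, not the paper's implementation: it assumes a zero-mean GP with a unit-variance squared-exponential kernel and a Gaussian input density, in which case the kernel integrals have a closed form and the posterior mean of the integral is `z @ K^{-1} f`. The function name `bmc_integral` and all numerical settings are hypothetical.

```python
import numpy as np

def bmc_integral(x, f_vals, prior_mean, prior_var, lengthscale=1.0, jitter=1e-10):
    """Posterior mean of  int f(x) N(x; prior_mean, prior_var) dx  under a GP on f."""
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / lengthscale**2)
    K += jitter * np.eye(len(x))                  # small jitter for numerical stability
    # z_i = int k(x, x_i) N(x; prior_mean, prior_var) dx, available in closed form.
    s = lengthscale**2 + prior_var
    z = np.sqrt(lengthscale**2 / s) * np.exp(-0.5 * (x - prior_mean) ** 2 / s)
    return z @ np.linalg.solve(K, f_vals)

# Toy integrand with a known answer: int exp(-x^2 / 2) N(x; 0, 1) dx = sqrt(1/2).
x = np.linspace(-4.0, 4.0, 9)                     # evaluation points (need not be random)
f_vals = np.exp(-0.5 * x**2)
estimate = bmc_integral(x, f_vals, prior_mean=0.0, prior_var=1.0)
exact = np.sqrt(0.5)
```

The evaluation points here lie on a fixed grid, which reflects the abstract's remark that samples can be drawn from any distribution; this toy integrand happens to lie in the span of the kernel functions, so it is an easy case.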

Ercan Solak, Roderick Murray-Smith, William E. Leithead, Douglas Leith, and
Carl Edward Rasmussen.
**Derivative
observations in Gaussian process models of dynamic systems**.
In S. Becker, S. Thrun, and K. Obermayer, editors, *Advances in Neural
Information Processing Systems 15*, pages 1033-1040, Cambridge, MA, USA,
December 2003. The MIT Press.

** Abstract:** Gaussian
processes provide an approach to nonparametric modelling which allows a
straightforward combination of function and derivative observations in an
empirical model. This is of particular importance in identification of
nonlinear dynamic systems from experimental data. 1) It allows us to combine
derivative information, and associated uncertainty with normal function
observations into the learning and inference process. This derivative
information can be in the form of priors specified by an expert or identified
from perturbation data close to equilibrium. 2) It allows a seamless fusion
of multiple local linear models in a consistent manner, inferring consistent
models and ensuring that integrability constraints are met. 3) It improves
dramatically the computational efficiency of Gaussian process models for
dynamic system identification, by summarising large quantities of
near-equilibrium data by a handful of linearisations, reducing the training
set size - traditionally a problem for Gaussian process models.

Roderick Murray-Smith, Daniel Sbarbaro, Carl Edward Rasmussen, and Agathe
Girard.
**Adaptive,
cautious, predictive control with Gaussian process priors**.
In P. Van den Hof, B. Wahlberg, and S. Weiland, editors, *IFAC SYSID
2003*, pages 1195-1200, Oxford, UK, August 2003. Elsevier Science Ltd.

** Abstract:** Nonparametric Gaussian Process models, a Bayesian
statistics approach, are used to implement a nonlinear adaptive control law.
Predictions, including propagation of the state uncertainty are made over a
k-step horizon. The expected value of a quadratic cost function is minimised,
over this prediction horizon, without ignoring the variance of the model
predictions. The general method and its main features are illustrated on a
simulation example.

Joaquin Quiñonero-Candela, Agathe Girard, Jan Larsen, and Carl Edward
Rasmussen.
**Propagation of
uncertainty in Bayesian kernel models - application to multiple-step ahead
forecasting**.
In *ICASSP 2003*, volume 2, pages 701-704, April 2003.

** Abstract:** The object of Bayesian modelling is the
predictive distribution, which in a forecasting scenario enables improved
estimates of forecasted values and their uncertainties. In this paper we
focus on reliably estimating the predictive mean and variance of forecasted
values using Bayesian kernel based models such as the Gaussian Process and
the Relevance Vector Machine. We derive novel analytic expressions for the
predictive mean and variance for Gaussian kernel shapes under the assumption
of a Gaussian input distribution in the static case, and of a recursive
Gaussian predictive density in iterative forecasting. The capability of the
method is demonstrated for forecasting of time-series and compared to
approximate methods.

Juš Kocijan, Blaž Banko, Bojan Likar, Agathe Girard, Roderick
Murray-Smith, and Carl Edward Rasmussen.
**A case based
comparison of identification with neural network and Gaussian process
models**.
In *IFAC International Conference on Intelligent Control Systems and Signal
Processing*, volume 1, pages 137-142, 2003.

** Abstract:** In this paper an alternative approach to black-box
identification of non-linear dynamic systems is compared with the more
established approach of using artificial neural networks. The Gaussian
process prior approach is a representative of non-parametric modelling
approaches. It was compared on a pH process modelling case study. The purpose
of modelling was to use the model for control design. The comparison revealed
that even though Gaussian process models can be effectively used for
modelling dynamic systems, caution has to be exercised when signals are
selected.

Juš Kocijan, Roderick Murray-Smith, Carl Edward Rasmussen, and Bojan
Likar.
**Predictive
control with Gaussian process models**.
In B. Zajc and M. Tkal, editors, *IEEE Region 8 Eurocon 2003: Computer as a
Tool*, pages 352-356, 2003.

** Abstract:** This paper
describes model-based predictive control based on Gaussian processes.
Gaussian process models provide a probabilistic non-parametric modelling
approach for black-box identification of non-linear dynamic systems. It
offers more insight in variance of obtained model response, as well as fewer
parameters to determine than other models. The Gaussian processes can
highlight areas of the input space where prediction quality is poor, due to
the lack of data or its complexity, by indicating the higher variance around
the predicted mean. This property is used in predictive control, where
optimisation of control signal takes the variance information into account.
The predictive control principle is demonstrated on a simulated example of a
nonlinear system.

Joaquin Quiñonero-Candela, Agathe Girard, and Carl Edward Rasmussen.
**Prediction at an
uncertain input for Gaussian processes and Relevance Vector Machines -
application to multiple-step ahead time-series prediction**.
Technical Report IMM-2003-18, Institute for Mathematical Modelling, DTU,
2003.

** Comment:** techreport

Carl Edward Rasmussen.
**Gaussian processes to
speed up Hybrid Monte Carlo for expensive Bayesian integrals**.
In *Bayesian Statistics 7*, pages 651-659. Oxford University Press,
2003.

** Abstract:** Hybrid Monte Carlo (HMC) is often the
method of choice for computing Bayesian integrals that are not analytically
tractable. However the success of this method may require a very large number
of evaluations of the (un-normalized) posterior and its partial derivatives.
In situations where the posterior is computationally costly to evaluate, this
may lead to an unacceptable computational load for HMC. I propose to use a
Gaussian Process model of the (log of the) posterior for most of the
computations required by HMC. Within this scheme only occasional evaluation
of the actual posterior is required to guarantee that the samples generated
have exactly the desired distribution, even if the GP model is somewhat
inaccurate. The method is demonstrated on a 10 dimensional problem, where 200
evaluations suffice for the generation of 100 roughly independent points from
the posterior. Thus, the proposed scheme allows Bayesian treatment of models
with posteriors that are computationally demanding, such as models involving
computer simulation.

Carl Edward Rasmussen and Zoubin Ghahramani.
**Infinite mixtures of
Gaussian process experts**.
In *Advances in Neural Information Processing Systems 14*, pages
881-888, Cambridge, MA, USA, December 2002. The MIT Press.

**
Abstract:** We present an extension to the Mixture of Experts (ME) model,
where the individual experts are Gaussian Process (GP) regression models.
Using an input-dependent adaptation of the Dirichlet Process, we implement a
gating network for an infinite number of Experts. Inference in this model may
be done efficiently using a Markov Chain relying on Gibbs sampling. The model
allows the effective covariance function to vary with the inputs, and may
handle large datasets - thus potentially overcoming two of the biggest
hurdles with GP models. Simulations show the viability of this approach.

Irene K. Andersen, Anna Szymkowiak, Carl Edward Rasmussen, L. G. Hanson, J. R.
Marstrand, H. B. W. Larsson, and Lars Kai Hansen.
**Perfusion
quantification using Gaussian process deconvolution**.
*Magnetic Resonance in Medicine*, 48(2):351-361, 2002, doi 10.1002/mrm.10213.

** Abstract:** The quantification of perfusion using dynamic
susceptibility contrast MR imaging requires deconvolution to obtain the
residual impulse-response function (IRF). Here, a method using a Gaussian
process for deconvolution, GPD, is proposed. The fact that the IRF is smooth
is incorporated as a constraint in the method. The GPD method, which
automatically estimates the noise level in each voxel, has the advantage that
model parameters are optimized automatically. The GPD is compared to singular
value decomposition (SVD) using a common threshold for the singular values
and to SVD using a threshold optimized according to the noise level in each
voxel. The comparison is carried out using artificial data as well as using
data from healthy volunteers. It is shown that GPD is comparable to SVD with
a variable optimized threshold when determining the maximum of the IRF, which
is directly related to the perfusion. GPD provides a better estimate of the
entire IRF. As the signal to noise ratio increases or the time resolution of
the measurements increases, GPD is shown to be superior to SVD. This is also
found for large distribution volumes.

Christopher K. I. Williams, Carl Edward Rasmussen, Anton Schwaighofer, and
Volker Tresp.
**Observations
on the Nyström method for Gaussian process prediction**.
Technical report, University of Edinburgh, 2002.

** Abstract:**
A number of methods for speeding up Gaussian Process (GP) prediction have
been proposed, including the Nyström method of Williams and Seeger
(2001). In this paper we focus on two issues: (1) the relationship of the
Nyström method to the Subset of Regressors method (Poggio and Girosi,
1990; Luo and Wahba, 1997) and (2) understanding in what circumstances the
Nyström approximation would be expected to provide a good approximation
to exact GP regression.

Carl Edward Rasmussen.
**Evaluation of
Gaussian processes and other methods for non-linear regression**.
PhD thesis, University of Toronto, Department of Computer Science, Toronto,
Canada, 1996.

** Abstract:** This thesis develops two Bayesian
learning methods relying on Gaussian processes and a rigorous statistical
approach for evaluating such methods. In these experimental designs the
sources of uncertainty in the estimated generalisation performances due to
both variation in training and test sets are accounted for. The framework
allows for estimation of generalisation performance as well as statistical
tests of significance for pairwise comparisons. Two experimental designs are
recommended and supported by the DELVE software environment.

Two new
non-parametric Bayesian learning methods relying on Gaussian process priors
over functions are developed. These priors are controlled by hyperparameters
which set the characteristic length scale for each input dimension. In the
simplest method, these parameters are fit from the data using optimization.
In the second, fully Bayesian method, a Markov chain Monte Carlo technique is
used to integrate over the hyperparameters. One advantage of these Gaussian
process methods is that the priors and hyperparameters of the trained models
are easy to interpret.

The Gaussian process methods are benchmarked
against several other methods, on regression tasks using both real data and
data generated from realistic simulations. The experiments show that small
datasets are unsuitable for benchmarking purposes because the uncertainties
in performance measurements are large. A second set of experiments provides
strong evidence that the bagging procedure is advantageous for the
Multivariate Adaptive Regression Splines (MARS) method.

The simulated
datasets have controlled characteristics which make them useful for
understanding the relationship between properties of the dataset and the
performance of different methods. The dependency of the performance on
available computation time is also investigated. It is shown that a Bayesian
approach to learning in multi-layer perceptron neural networks achieves
better performance than the commonly used early stopping procedure, even for
reasonably short amounts of computation time. The Gaussian process methods
are shown to consistently outperform the more conventional methods.

Chris K. I. Williams and Carl Edward Rasmussen.
**Gaussian processes
for regression**.
In *Advances in Neural Information Processing Systems 8*, pages
514-520, Cambridge, MA, USA, 1996. The MIT Press.

**
Abstract:** The Bayesian analysis of neural networks is difficult because a
simple prior over weights implies a complex prior over functions. We
investigate the use of a Gaussian process prior over functions, which permits
the predictive Bayesian analysis for fixed values of hyperparameters to be
carried out exactly using matrix operations. Two methods, using optimization
and averaging (via Hybrid Monte Carlo) over hyperparameters have been tested
on a number of challenging problems and have produced excellent results.

## Clustering

Clustering algorithms are unsupervised methods for finding groups of similar points in data. They are closely related to statistical mixture models.

Vidhi Lalchand, Aditya Ravuri, and Neil D. Lawrence.
**Generalised
GPLVM with Stochastic Variational Inference**.
In *25th International Conference on Artificial Intelligence and
Statistics*, volume 151 of *Proceedings of Machine Learning
Research*, pages 7841-7864. PMLR, 28-30 Mar 2022.

**
Abstract:** Gaussian process latent variable models (GPLVM) are a flexible
and non-linear approach to dimensionality reduction, extending classical
Gaussian processes to an unsupervised learning context. The Bayesian
incarnation of the GPLVM uses a variational framework, where the posterior
over latent variables is approximated by a well-behaved variational family, a
factorised Gaussian yielding a tractable lower bound. However, the
non-factorisability of the lower bound prevents truly scalable inference. In
this work, we study the doubly stochastic formulation of the Bayesian GPLVM
model amenable to minibatch training. We show how this framework is
compatible with different latent variable formulations and perform
experiments to compare a suite of models. Further, we demonstrate how we can
train in the presence of massively missing data and obtain high-fidelity
reconstructions. We demonstrate the model’s performance by benchmarking
against the canonical sparse GPLVM for high dimensional data examples.

Antonio Vergari, Robert Peharz, Nicola Di Mauro, Alejandro Molina, Kristian
Kersting, and Floriana Esposito.
**Sum-product
autoencoding: Encoding and decoding representations using sum-product
networks**.
In *32nd AAAI Conference on Artificial Intelligence*, New Orleans, USA,
February 2018.

** Abstract:** Sum-Product Networks
(SPNs) are a deep probabilistic architecture that up to now has been
successfully employed for tractable inference. Here, we extend their scope
towards unsupervised representation learning: we encode samples into
continuous and categorical embeddings and show that they can also be decoded
back into the original input space by leveraging MPE inference. We
characterize when this Sum-Product Autoencoding (SPAE) leads to equivalent
reconstructions and extend it towards dealing with missing embedding
information. Our experimental results on several multilabel classification
problems demonstrate that SPAE is competitive with state-of-the-art
autoencoder architectures, even if the SPNs were never trained to reconstruct
their inputs.

Maria Lomeli, Stefano Favaro, and Yee Whye Teh.
**A
marginal sampler for sigma-Stable Poisson-Kingman mixture models**.
*Journal of Computational and Graphical Statistics*, 26:44-53, 2017.

** Abstract:** We investigate the class of sigma-stable
Poisson-Kingman random probability measures (RPMs) in the context of Bayesian
nonparametric mixture modeling. This is a large class of discrete RPMs, which
encompasses most of the popular discrete RPMs used in Bayesian
nonparametrics, such as the Dirichlet process, Pitman-Yor process, the
normalized inverse Gaussian process, and the normalized generalized Gamma
process. We show how certain sampling properties and marginal
characterizations of sigma-stable Poisson-Kingman RPMs can be usefully
exploited for devising a Markov chain Monte Carlo (MCMC) algorithm for
performing posterior inference with a Bayesian nonparametric mixture model.
Specifically, we introduce a novel and efficient MCMC sampling scheme in an
augmented space that has a small number of auxiliary variables per iteration.
We apply our sampling scheme to density estimation and clustering tasks
with unidimensional and multidimensional datasets, and compare it against
competing MCMC sampling schemes. Supplementary materials for this article are
available online.

Maria Lomeli.
**General Bayesian inference
schemes in infinite mixture models**.
PhD thesis, University College London, Gatsby Unit, London, UK, 2017.

** Abstract:** Bayesian statistical models allow us to formalise
our knowledge about the world and reason about our uncertainty, but there is
a need for better procedures to accurately encode its complexity. One way to
do so is through compositional models, which are formed by combining blocks
consisting of simpler models. One can increase the complexity of the
compositional model by either stacking more blocks or by using a
not-so-simple model as a building block. This thesis is an example of the
latter. One first aim is to expand the choice of Bayesian nonparametric (BNP)
blocks for constructing tractable compositional models. So far, most of the
models that have a Bayesian nonparametric component use a Dirichlet Process
or a Pitman-Yor process because of the availability of tractable and compact
representations. This thesis shows how to overcome certain intractabilities
in order to obtain analogous compact representations for the class of
Poisson-Kingman priors which includes the Dirichlet and Pitman-Yor processes.
A major impediment to the widespread use of Bayesian nonparametric building
blocks is that inference is often costly, intractable or difficult to carry
out. This is an active research area since dealing with the model's infinite
dimensional component forbids the direct use of standard simulation-based
methods. The main contribution of this thesis is a variety of inference
schemes that tackle this problem: Markov chain Monte Carlo and Sequential
Monte Carlo methods, which are exact inference schemes since they target the
true posterior. The contributions of this thesis, in a larger context,
provide general purpose exact inference schemes in the flavour of
probabilistic programming: the user is able to choose from a variety of
models, focusing only on the modelling part. Indeed, if the wide enough class
of Poisson-Kingman priors is used as one of our blocks, this objective is
achieved.

Maria Lomeli, Stefano Favaro, and Yee Whye Teh.
**A
hybrid sampler for Poisson-Kingman mixture models**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2015.

** Abstract:** This paper
concerns the introduction of a new Markov Chain Monte Carlo scheme for
posterior sampling in Bayesian nonparametric mixture models with priors that
belong to the general Poisson-Kingman class. We present a novel and compact
way of representing the infinite dimensional component of the model such that
while explicitly representing this infinite component it has less memory and
storage requirements than previous MCMC schemes. We describe comparative
simulation results demonstrating the efficacy of the proposed MCMC algorithm
against existing marginal and conditional MCMC samplers.

Stefano Favaro, Maria Lomeli, and Yee Whye Teh.
**On a
class of sigma-Stable Poisson-Kingman models and an effective marginalised
sampler**.
*Statistics and Computing*, 25:67-78, 2015.

**
Abstract:** We investigate the use of a large class of discrete random
probability measures, which is referred to as the class Q, in the context
of Bayesian nonparametric mixture modeling. The class Q encompasses both
the two-parameter Poisson-Dirichlet process and the normalized generalized
Gamma process, thus allowing us to comparatively study the inferential
advantages of these two well-known nonparametric priors. Apart from a highly
flexible parameterization, the distinguishing feature of the class Q is the
availability of a tractable posterior distribution. This feature, in turn,
leads to derive an efficient marginal MCMC algorithm for posterior sampling
within the framework of mixture models. We demonstrate the efficacy of our
modeling framework on both one-dimensional and multi-dimensional
datasets.

Hong Ge, Yutian Chen, Moquan Wan, and Zoubin Ghahramani.
**Distributed
Inference for Dirichlet Process Mixture Models**.
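In *32nd International Conference on Machine Learning*, volume 37 of
*Proceedings of Machine Learning Research*, pages 2276-2284, 07-09 Jul 2015.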

** Abstract:** Bayesian
nonparametric mixture models based on the Dirichlet process (DP) have been
widely used for solving problems like clustering, density estimation and
topic modelling. These models make weak assumptions about the underlying
process that generated the observed data. Thus, when more data are collected,
the complexity of these models can change accordingly. These theoretical
properties often lead to superior predictive performance when compared to
traditional finite mixture models. However, despite the increasing amount of
data available, the application of Bayesian nonparametric mixture models is
so far limited to relatively small data sets. In this paper, we propose an
efficient distributed inference algorithm for the DP and the HDP mixture
model. The proposed method is based on a variant of the slice sampler for
DPs. Since this sampler does not involve a pre-determined truncation, the
stationary distribution of the sampling algorithm is unbiased. We provide
both local thread-level and distributed machine-level parallel
implementations and study the performance of this sampler through an
extensive set of experiments on image and text data. When compared to
existing inference algorithms, the proposed method exhibits state-of-the-art
accuracy and strong scalability with up to 512 cores.

Tomoharu Iwata, James Robert Lloyd, and Zoubin Ghahramani.
**Unsupervised
many-to-many object matching for relational data**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*,
2015.

** Abstract:** We propose a method for unsupervised
many-to-many object matching from multiple networks, which is the task of
finding correspondences between groups of nodes in different networks. For
example, the proposed method can discover shared word groups from
multi-lingual document-word networks without cross-language alignment
information. We assume that multiple networks share groups, and each group
has its own interaction pattern with other groups. Using infinite relational
models with this assumption, objects in different networks are clustered into
common groups depending on their interaction patterns, discovering a
matching. The effectiveness of the proposed method is experimentally
demonstrated by using synthetic and real relational data sets, which include
applications to cross-domain recommendation without shared user/item
identifiers and multi-lingual word clustering.

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani.
**Beta diffusion trees and
hierarchical feature allocations**.
Technical report, Dept. of Engineering, University of Cambridge, August 2014.

** Abstract:** We define the beta diffusion tree, a random tree
structure with a set of leaves that defines a collection of overlapping
subsets of objects, known as a feature allocation. A generative process for
the tree structure is defined in terms of particles (representing the
objects) diffusing in some continuous space, analogously to the Dirichlet
diffusion tree (Neal, 2003b), which defines a tree structure over partitions
(i.e., non-overlapping subsets) of the objects. Unlike in the Dirichlet
diffusion tree, multiple copies of a particle may exist and diffuse along
multiple branches in the beta diffusion tree, and an object may therefore
belong to multiple subsets of particles. We demonstrate how to build a
hierarchically-clustered factor analysis model with the beta diffusion tree
and how to perform inference over the random tree structures with a Markov
chain Monte Carlo algorithm. We conclude with several numerical experiments
on missing data problems with data sets of gene expression microarrays,
international development statistics, and intranational socioeconomic
measurements.

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani.
**Beta
diffusion trees**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We define the beta diffusion tree, a
random tree structure with a set of leaves that defines a collection of
overlapping subsets of objects, known as a feature allocation. The generative
process for the tree is defined in terms of particles (representing the
objects) diffusing in some continuous space, analogously to the Dirichlet and
Pitman–Yor diffusion trees (Neal, 2003b; Knowles & Ghahramani, 2011), both
of which define tree structures over clusters of the particles. With the beta
diffusion tree, however, multiple copies of a particle may exist and diffuse
to multiple locations in the continuous space, resulting in (a random number
of) possibly overlapping clusters of the objects. We demonstrate how to build
a hierarchically-clustered factor analysis model with the beta diffusion tree
and how to perform inference over the random tree structures with a Markov
chain Monte Carlo algorithm. We conclude with several numerical experiments
on missing data problems with data sets of gene expression arrays,
international development statistics, and intranational socioeconomic
measurements.

Creighton Heaukulani and Daniel M. Roy.
**The combinatorial structure of beta
negative binomial processes**.
Technical report, Dept. of Engineering, University of Cambridge, March 2014.

** Abstract:** We characterize the combinatorial structure of
conditionally-i.i.d. sequences of negative binomial processes with a common
beta process base measure. In Bayesian nonparametric applications, such
processes have served as models for unknown multisets of a measurable space.
Previous work has characterized random subsets arising from
conditionally-i.i.d. sequences of Bernoulli processes with a common beta
process base measure. In this case, the combinatorial structure is described
by the Indian buffet process. Our results give a count analogue of the Indian
buffet process, which we call a negative binomial Indian buffet process. As
an intermediate step toward this goal, we provide constructions for the beta
negative binomial process that avoid a representation of the underlying beta
process base measure.

Tomoharu Iwata, David Duvenaud, and Zoubin Ghahramani.
**Warped mixtures for nonparametric
cluster shapes**.
In *29th Conference on Uncertainty in Artificial Intelligence*,
Bellevue, Washington, July 2013.

** Abstract:** A mixture of
Gaussians fit to a single curved or heavy-tailed cluster will report that the
data contains many clusters. To produce more appropriate clusterings, we
introduce a model which warps a latent mixture of Gaussians to produce
nonparametric cluster shapes. The possibly low-dimensional latent mixture
model allows us to summarize the properties of the high-dimensional clusters
(or density manifolds) describing the data. The number of manifolds, as well
as the shape and dimension of each manifold is automatically inferred. We
derive a simple inference scheme for this model which analytically integrates
out both the mixture parameters and the warping function. We show that our
model is effective for density estimation, performs better than infinite
Gaussian mixture models at recovering the true number of clusters, and
produces interpretable summaries of high-dimensional datasets.

Amar Shah and Zoubin Ghahramani.
**Determinantal
clustering processes - A nonparametric Bayesian approach to kernel based
semi-supervised clustering**.
*UAI*, 2013.

** Abstract:** Semi-supervised clustering is
the task of clustering data points into clusters where only a fraction of the
points are labelled. The true number of clusters in the data is often unknown
and most models require this parameter as an input. Dirichlet process mixture
models are appealing as they can infer the number of clusters from the data.
However, these models do not deal with high dimensional data well and can
encounter difficulties in inference. We present a novel nonparametric
Bayesian kernel based method to cluster data points without the need to
prespecify the number of clusters or to model complicated densities from
which data points are assumed to be generated. The key insight is to use
determinants of submatrices of a kernel matrix as a measure of how close
together a set of points are. We explore some theoretical properties of the
model and derive a natural Gibbs based algorithm with MCMC hyperparameter
learning. The model is implemented on a variety of synthetic and real world
data sets.

P. Kirk, J. E. Griffin, R. S. Savage, Z. Ghahramani, and D. L. Wild.
**Bayesian
correlated clustering to integrate multiple datasets**.
*Bioinformatics*, 2012.

** Abstract:** Motivation: The
integration of multiple datasets remains a key challenge in systems biology
and genomic medicine. Modern high-throughput technologies generate a broad
array of different data types, providing distinct – but often complementary
– information. We present a Bayesian method for the unsupervised
integrative modelling of multiple datasets, which we refer to as MDI
(Multiple Dataset Integration). MDI can integrate information from a wide
range of different datasets and data types simultaneously (including the
ability to model time series data explicitly using Gaussian processes). Each
dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture
model, with dependencies between these models captured via parameters that
describe the agreement among the datasets.

Results: Using a set of 6
artificially constructed time series datasets, we show that MDI is able to
integrate a significant number of datasets simultaneously, and that it
successfully captures the underlying structural similarity between the
datasets. We also analyse a variety of real S. cerevisiae datasets. In the
2-dataset case, we show that MDI’s performance is comparable to the present
state of the art. We then move beyond the capabilities of current approaches
and integrate gene expression, ChIP-chip and protein-protein interaction
data, to identify a set of protein complexes for which genes are co-regulated
during the cell cycle. Comparisons to other unsupervised data integration
techniques – as well as to non-integrative approaches – demonstrate that
MDI is very competitive, while also providing information that would be
difficult or impossible to extract using other methods.

** Comment:** This paper is available from the Bioinformatics
site and a Matlab implementation of MDI is available from this site.

Donglin Niu, Jennifer G. Dy, and Z. Ghahramani.
**A nonparametric
Bayesian model for multiple clustering with overlapping feature
views**.
In *15th International Conference on Artificial Intelligence and
Statistics*, 2012.

** Abstract:** Most clustering
algorithms produce a single clustering solution. This is inadequate for many
data sets that are multi-faceted and can be grouped and interpreted in many
different ways. Moreover, for high-dimensional data, different features may
be relevant or irrelevant to each clustering solution, suggesting the need
for feature selection in clustering. Features relevant to one clustering
interpretation may be different from the ones relevant for an alternative
interpretation or view of the data. In this paper, we introduce a
probabilistic nonparametric Bayesian model that can discover multiple
clustering solutions from data and the feature subsets that are relevant for
the clusters in each view. In our model, the features in different views may
be shared and therefore the sets of relevant features are allowed to overlap.
We model feature relevance to each view using an Indian Buffet Process and
the cluster membership in each view using a Chinese Restaurant Process. We
provide an inference approach to learn the latent parameters corresponding to
this multiple partitioning problem. Our model not only learns the features
and clusters in each view but also automatically learns the number of
clusters, number of views and number of features in each view.

Joshua Abbott, Katherine A. Heller, Zoubin Ghahramani, and Thomas L. Griffiths.
**Testing a
Bayesian measure of representativeness using a large image
database**.
In *Advances in Neural Information Processing Systems 24*, Cambridge,
MA, USA, 2011. The MIT Press.

** Abstract:** How do people
determine which elements of a set are most representative of that set? We
extend an existing Bayesian measure of representativeness, which indicates
the representativeness of a sample from a distribution, to define a measure
of the representativeness of an item to a set. We show that this measure is
formally related to a machine learning method known as Bayesian Sets.
Building on this connection, we derive an analytic expression for the
representativeness of objects described by a sparse vector of binary
features. We then apply this measure to a large database of images, using it
to determine which images are the most representative members of different
sets. Comparing the resulting predictions to human judgments of
representativeness provides a test of this measure with naturalistic stimuli,
and illustrates how databases that are more commonly used in computer vision
and machine learning can be used to evaluate psychological theories.

Daniel Hernández-Lobato, José Miguel Hernández-Lobato, and Pierre Dupont.
**Robust
multi-class Gaussian process classification**.
In *Advances in Neural Information Processing Systems 24*, 2011.

** Abstract:** Multi-class Gaussian Process Classifiers (MGPCs)
are often affected by overfitting problems when labeling errors occur far
from the decision boundaries. To prevent this, we investigate a robust MGPC
(RMGPC) which considers labeling errors independently of their distance to
the decision boundaries. Expectation propagation is used for approximate
inference. Experiments with several datasets in which noise is injected in
the labels illustrate the benefits of RMGPC. This method performs better than
other Gaussian process alternatives based on considering latent Gaussian
noise or heavy-tailed processes. When no noise is injected in the labels,
RMGPC still performs as well as or better than the other methods. Finally, we show
how RMGPC can be used for successfully identifying data instances which are
difficult to classify correctly in practice.

David A. Knowles and Zoubin Ghahramani.
**Pitman-Yor
diffusion trees**.
In *27th Conference on Uncertainty in Artificial Intelligence*, 2011.

** Abstract:** We introduce the Pitman-Yor Diffusion Tree (PYDT)
for hierarchical clustering, a generalization of the Dirichlet Diffusion Tree
(Neal, 2001) which removes the restriction to binary branching structure. The
generative process is described and shown to result in an exchangeable
distribution over data points. We prove some theoretical properties of the
model and then present two inference methods: a collapsed MCMC sampler which
allows us to model uncertainty over tree structures, and a computationally
efficient greedy Bayesian EM search algorithm. Both algorithms use message
passing on the tree structure. The utility of the model and algorithms is
demonstrated on synthetic and real world data, both continuous and
binary.

David A. Knowles and Thomas P. Minka.
**Non-conjugate
variational message passing for multinomial and binary regression**.
In *Advances in Neural Information Processing Systems 24*, 2011.

** Abstract:** Variational Message Passing (VMP) is an
algorithmic implementation of the Variational Bayes (VB) method which applies
only in the special case of conjugate exponential family models. We propose
an extension to VMP, which we refer to as Non-conjugate Variational Message
Passing (NCVMP) which aims to alleviate this restriction while maintaining
modularity, allowing choice in how expectations are calculated, and
integrating into an existing message-passing framework: Infer.NET. We
demonstrate NCVMP on logistic binary and multinomial regression. In the
multinomial case we introduce a novel variational bound for the softmax
factor which is tighter than other commonly used bounds whilst maintaining
computational tractability.

Y. Guan, J. G. Dy, D. Niu, and Z. Ghahramani.
**Variational
inference for nonparametric multiple clustering**.
In *KDD10 Workshop on Discovering, Summarizing, and Using Multiple
Clusterings*, Washington, DC, USA, July 2010.

**
Abstract:** Most clustering algorithms produce a single clustering
solution. Similarly, feature selection for clustering tries to find one
feature subset where one interesting clustering solution resides. However, a
single data set may be multi-faceted and can be grouped and interpreted in
many different ways, especially for high dimensional data, where feature
selection is typically needed. Moreover, different clustering solutions are
interesting for different purposes. Instead of committing to one clustering
solution, in this paper we introduce a probabilistic nonparametric Bayesian
model that can discover several possible clustering solutions and the feature
subset views that generated each cluster partitioning simultaneously. We
provide a variational inference approach to learn the features and clustering
partitions in each view. Our model allows us not only to learn the multiple
clusterings and views but also allows us to automatically learn the number of
views and the number of clusters in each view.

R. P. Adams, Zoubin Ghahramani, and Michael I. Jordan.
**Tree-structured
stick breaking for hierarchical data**.
In *Advances in Neural Information Processing Systems 23*. The MIT
Press, 2010.

** Abstract:** Many data are naturally modeled by
an unobserved hierarchical structure. In this paper we propose a flexible
nonparametric prior over unknown data hierarchies. The approach uses nested
stick-breaking processes to allow for trees of unbounded width and depth,
where data can live at any node and are infinitely exchangeable. One can view
our model as providing infinite mixtures where the components have a
dependency structure corresponding to an evolutionary diffusion down a tree.
By using a stick-breaking approach, we can apply Markov chain Monte Carlo
methods based on slice sampling to perform Bayesian inference and simulate
from the posterior distribution on trees. We apply our method to hierarchical
clustering of images and topic modeling of text data.

Andreas Vlachos, Zoubin Ghahramani, and Ted Briscoe.
**Active learning
for constrained Dirichlet process mixture models**.
In *Proceedings of the 2010 Workshop on Geometrical Models of Natural
Language Semantics*, pages 57-61, Uppsala, Sweden, 2010.

** Abstract:** Recent work applied Dirichlet Process Mixture Models to the
task of verb clustering, incorporating supervision in the form of must-links
and cannot-links constraints between instances. In this work, we introduce an
active learning approach for constraint selection employing uncertainty-based
sampling. We achieve substantial improvements over random selection on two
datasets.

R. Savage, K. A. Heller, Y. Xu, Zoubin Ghahramani, W. Truman, M. Grant,
K. Denby, and D. L. Wild.
**R/BHC: fast
Bayesian hierarchical clustering for microarray data**.
*BMC Bioinformatics 2009*, 10(242):1-9, August 2009, doi
10.1186/1471-2105-10-242.

** Abstract:** Background:
Although the use of clustering methods has rapidly become one of the standard
computational approaches in the literature of microarray gene expression data
analysis, little attention has been paid to uncertainty in the results
obtained.

Results: We present an R/Bioconductor port of a fast novel
algorithm for Bayesian agglomerative hierarchical clustering and demonstrate
its use in clustering gene expression microarray data. The method performs
bottom-up hierarchical clustering, using a Dirichlet Process (infinite
mixture) to model uncertainty in the data and Bayesian model selection to
decide at each step which clusters to merge.

Conclusion: Biologically
plausible results are presented from a well studied data set: expression
profiles of *A. thaliana* subjected to a variety of biotic and abiotic
stresses. Our method avoids several limitations of traditional methods, for
example how many clusters there should be and how to choose a principled
distance metric.

A. Vlachos, A. Korhonen, and Z. Ghahramani.
**Unsupervised and
constrained Dirichlet process mixture models for verb clustering**.
In *4th Workshop on Statistical Machine Translation, EACL '09*, Athens,
Greece, March 2009.

** Abstract:** In this work, we apply
Dirichlet Process Mixture Models (DPMMs) to a learning task in natural
language processing (NLP): lexical-semantic verb clustering. We thoroughly
evaluate a method of guiding DPMMs towards a particular clustering solution
using pairwise constraints. The quantitative and qualitative evaluation
performed highlights the benefits of both standard and constrained DPMMs
compared to previously used approaches. In addition, it sheds light on the
use of evaluation measures and their practical application.

Carl Edward Rasmussen, Bernhard J. de la Cruz, Zoubin Ghahramani, and David L.
Wild.
**Modeling and visualizing
uncertainty in gene expression clusters using Dirichlet process
mixtures**.
*IEEE/ACM Transactions on Computational Biology and Bioinformatics*,
6(4):615-628, 2009, doi
10.1109/TCBB.2007.70269.

** Abstract:** Although the use
of clustering methods has rapidly become one of the standard computational
approaches in the literature of microarray gene expression data, little
attention has been paid to uncertainty in the results obtained. Dirichlet
process mixture (DPM) models provide a nonparametric Bayesian alternative to
the bootstrap approach to modeling uncertainty in gene expression clustering.
Most previously published applications of Bayesian model-based clustering
methods have been to short time series data. In this paper, we present a case
study of the application of nonparametric Bayesian clustering methods to the
clustering of high-dimensional nontime series gene expression data using full
Gaussian covariances. We use the probability that two genes belong to the
same cluster in a DPM model as a measure of the similarity of these gene
expression profiles. Conversely, this probability can be used to define a
dissimilarity measure, which, for the purposes of visualization, can be input
to one of the standard linkage algorithms used for hierarchical clustering.
Biologically plausible results are obtained from the Rosetta compendium of
expression profiles which extend previously published cluster analyses of
this data.
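
As a rough illustration of the similarity measure described in this abstract, the sketch below estimates the probability that two items share a cluster by averaging over sampled cluster assignments, and then converts it into a dissimilarity. The function names and the toy `samples` are hypothetical and not taken from the paper; in practice the assignment vectors would come from an MCMC sampler for the DPM model.

```python
def coclustering_probability(samples):
    """Fraction of samples in which each pair of items shares a cluster label.

    `samples` is a list of assignment vectors, one per posterior sample;
    samples[s][i] is the cluster label of item i in sample s.
    """
    n_items = len(samples[0])
    prob = [[0.0] * n_items for _ in range(n_items)]
    for labels in samples:
        for i in range(n_items):
            for j in range(n_items):
                if labels[i] == labels[j]:
                    prob[i][j] += 1.0 / len(samples)
    return prob


def dissimilarity(prob):
    """Dissimilarity 1 - p, usable as input to a standard linkage algorithm."""
    return [[1.0 - p for p in row] for row in prob]


# Toy, made-up assignment samples for four items.
# Label values are arbitrary and only comparable within one sample.
samples = [
    [0, 0, 1, 1],
    [2, 2, 2, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
]
prob = coclustering_probability(samples)
dist = dissimilarity(prob)
```

Because only label equality within a sample is used, the measure is unaffected by the arbitrary relabelling of clusters between samples.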

Katherine A. Heller, Sinead Williamson, and Zoubin Ghahramani.
**Statistical
models for partial membership**.
In Andrew McCallum and Sam Roweis, editors, *25th International Conference
on Machine Learning*, pages 392-399, Helsinki, Finland, July 2008.
Omnipress.

** Abstract:** We present a principled Bayesian
framework for modeling partial memberships of data points to clusters. Unlike
a standard mixture model which assumes that each data point belongs to one
and only one mixture component, or cluster, a partial membership model allows
data points to have fractional membership in multiple clusters. Algorithms
which assign data points partial memberships to clusters can be useful for
tasks such as clustering genes based on microarray data (Gasch & Eisen,
2002). Our Bayesian Partial Membership Model (BPM) uses exponential family
distributions to model each cluster, and a product of these distributions,
with weighted parameters, to model each datapoint. Here the weights
correspond to the degree to which the datapoint belongs to each cluster. All
parameters in the BPM are continuous, so we can use Hybrid Monte Carlo to
perform inference and learning. We discuss relationships between the BPM and
Latent Dirichlet Allocation, Mixed Membership models, Exponential Family PCA,
and fuzzy clustering. Lastly, we show some experimental results and discuss
nonparametric extensions to our model.

A. Vlachos, Z. Ghahramani, and A. Korhonen.
**Dirichlet
process mixture models for verb clustering**.
In Guillaume Bouchard, Hal Daumé III, Marc Dymetman, and Yee Whye Teh,
editors, *ICML Workshop on Prior Knowledge for Text and Language
Processing*, pages 43-48, Helsinki, Finland, July 2008.

** Abstract:** In this work we apply Dirichlet Process Mixture Models to a
learning task in natural language processing (NLP): lexical-semantic verb
clustering. We assess the performance on a dataset based on Levin's (1993)
verb classes using the recently introduced V-measure metric. In addition, we present a
method to add human supervision to the model in order to influence the
solution with respect to some prior knowledge. The quantitative evaluation
performed highlights the benefits of the chosen method compared to previously
used clustering approaches.

Katherine A. Heller and Zoubin Ghahramani.
**A nonparametric
Bayesian approach to modeling overlapping clusters**.
In Marina Meila and Xiaotong Shen, editors, *AISTATS*, volume 2 of
*JMLR Proceedings*, pages 187-194. JMLR.org, 2007.

** Abstract:** Although clustering data into mutually exclusive partitions has
been an extremely successful approach to unsupervised learning, there are
many situations in which a richer model is needed to fully represent the
data. This is the case in problems where data points actually simultaneously
belong to multiple, overlapping clusters. For example a particular gene may
have several functions, therefore belonging to several distinct clusters of
genes, and a biologist may want to discover these through unsupervised
modeling of gene expression data. We present a new nonparametric Bayesian
method, the Infinite Overlapping Mixture Model (IOMM), for modeling
overlapping clusters. The IOMM uses exponential family distributions to model
each cluster and forms an overlapping mixture by taking products of such
distributions, much like products of experts (Hinton, 2002). The IOMM allows
an unbounded number of clusters, and assignments of points to (multiple)
clusters are modeled using an Indian Buffet Process (IBP) (Griffiths and
Ghahramani, 2006). The IOMM has the desirable properties of being able to
focus in on overlapping regions while maintaining the ability to model a
potentially infinite number of clusters which may overlap. We derive MCMC
inference algorithms for the IOMM and show that these can be used to cluster
movies into multiple genres.

Zoubin Ghahramani and Katherine A. Heller.
**Bayesian
sets**.
In Y. Weiss, B. Schölkopf, and J. Platt, editors, *Advances in Neural
Information Processing Systems 18*, pages 435-442, Cambridge, MA, USA,
December 2006. The MIT Press.

** Abstract:** Inspired by
"Google™ Sets", we consider the problem of retrieving items from a
concept or cluster, given a query consisting of a few items from that
cluster. We formulate this as a Bayesian inference problem and describe a
very simple algorithm for solving it. Our algorithm uses a model-based
concept of a cluster and ranks items using a score which evaluates the
marginal probability that each item belongs to a cluster containing the query
items. For exponential family models with conjugate priors this marginal
probability is a simple function of sufficient statistics. We focus on sparse
binary data and show that our score can be evaluated exactly using a single
sparse matrix multiplication, making it possible to apply our algorithm to
very large datasets. We evaluate our algorithm on three datasets: retrieving
movies from EachMovie, finding completions of author sets from the NIPS
dataset, and finding completions of sets of words appearing in the Grolier
encyclopedia. We compare to Google™ Sets and show that Bayesian Sets
gives very reasonable set completions.
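
To make the "single matrix multiplication" remark concrete, here is a minimal sketch of the score for binary data, assuming independent Bernoulli features with Beta priors. Under that assumption the log score of an item is linear in its features, so every item can be scored with one product `X @ q`. The function name, the toy matrix, and the prior setting `alpha = kappa * mean`, `beta = kappa * (1 - mean)` are illustrative assumptions, and a dense NumPy array stands in for the sparse matrix one would use on large data.

```python
import numpy as np


def bayesian_sets_scores(X, query_idx, kappa=2.0):
    """Log score of every row of the binary matrix X for a given query set.

    Assumes independent Bernoulli features with Beta(alpha_j, beta_j) priors.
    The prior setting below is an illustrative assumption.
    """
    X = np.asarray(X, dtype=float)
    mean = np.clip(X.mean(axis=0), 1e-6, 1 - 1e-6)  # avoid log(0)
    alpha, beta = kappa * mean, kappa * (1.0 - mean)

    n_query = len(query_idx)
    counts = X[query_idx].sum(axis=0)  # sufficient statistics of the query
    alpha_post = alpha + counts
    beta_post = beta + n_query - counts

    # log score(x) = c + q . x, so one product X @ q scores all items.
    c = np.sum(np.log(alpha + beta) - np.log(alpha + beta + n_query)
               + np.log(beta_post) - np.log(beta))
    q = np.log(alpha_post) - np.log(alpha) - np.log(beta_post) + np.log(beta)
    return c + X @ q


# Toy binary item-by-feature matrix (made up for illustration).
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
])
scores = bayesian_sets_scores(X, query_idx=[0, 1])
ranking = np.argsort(-scores)  # best-matching items first
```

Items that share the query's active features receive the highest scores, and items that only have features absent from the query are ranked last.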

Katherine A. Heller and Zoubin Ghahramani.
**Bayesian
hierarchical clustering**.
In Luc De Raedt and Stefan Wrobel, editors, *ICML*, volume 119 of
*ACM International Conference Proceeding Series*, pages 297-304.
Association for Computing Machinery, 2005.

** Abstract:** We
present a novel algorithm for agglomerative hierarchical clustering based on
evaluating marginal likelihoods of a probabilistic model. This algorithm has
several advantages over traditional distance-based agglomerative clustering
algorithms. (1) It defines a probabilistic model of the data which can be
used to compute the predictive distribution of a test point and the
probability of it belonging to any of the existing clusters in the tree. (2)
It uses a model-based criterion to decide on merging clusters rather than an
ad-hoc distance metric. (3) Bayesian hypothesis testing is used to decide
which merges are advantageous and to output the recommended depth of the
tree. (4) The algorithm can be interpreted as a novel fast bottom-up
approximate inference method for a Dirichlet process (i.e. countably
infinite) mixture model (DPM). It provides a new lower bound on the marginal
likelihood of a DPM by summing over exponentially many clusterings of the
data in polynomial time. We describe procedures for learning the model
hyperparameters, computing the predictive distribution, and extensions to
the algorithm. Experimental results on synthetic and real-world data sets
demonstrate useful properties of the algorithm.

A. Dubey, S. Hwang, C. Rangel, Carl Edward Rasmussen, Zoubin Ghahramani, and
David L. Wild.
**Clustering
protein sequence and structure space with infinite Gaussian mixture
models**.
In *Pacific Symposium on Biocomputing 2004*, pages 399-410, Singapore,
2004. World Scientific Publishing.

** Abstract:** We describe
a novel approach to the problem of automatically clustering protein sequences
and discovering protein families, subfamilies etc., based on the theory of
infinite Gaussian mixture models. This method allows the data itself to
dictate how many mixture components are required to model it, and provides a
measure of the probability that two proteins belong to the same cluster. We
illustrate our methods with application to three data sets: globin sequences,
globin sequences with known three-dimensional structures and G-protein coupled
receptor sequences. The consistency of the clusters indicates that our
method is producing biologically meaningful results, which provide a very
good indication of the underlying families and subfamilies. With the
inclusion of secondary structure and residue solvent accessibility
information, we obtain a classification of sequences of known structure which
reflects and extends their SCOP classifications.

Naonori Ueda, Ryohei Nakano, Zoubin Ghahramani, and Geoffrey E. Hinton.
**SMEM algorithm
for mixture models**.
*Neural Computation*, 12(9):2109-2128, 2000.

** Abstract:** We present a split-and-merge expectation-maximization (SMEM)
algorithm to overcome the local maxima problem in parameter estimation of
finite mixture models. In the case of mixture models, local maxima often
involve having too many components of a mixture model in one part of the
space and too few in another, widely separated part of the space. To escape
from such configurations, we repeatedly perform simultaneous split-and-merge
operations using a new criterion for efficiently selecting the
split-and-merge candidates. We apply the proposed algorithm to the training
of Gaussian mixtures and mixtures of factor analyzers using synthetic and
real data and show the effectiveness of using the split-and-merge operations
to improve the likelihood of both the training data and of held-out test
data. We also show the practical usefulness of the proposed algorithm by
applying it to image compression and pattern recognition problems.

Naonori Ueda, Ryohei Nakano, Zoubin Ghahramani, and Geoffrey E. Hinton.
**Split and merge
EM algorithm for improving Gaussian mixture density estimates**.
*VLSI Signal Processing*, 26(1-2):133-140, 2000.

** Abstract:** We present a split and merge EM algorithm to overcome the local
maximum problem in Gaussian mixture density estimation. Nonglobal maxima
often involve having too many Gaussians in one part of the space and too few
in another, widely separated part of the space. To escape from such
configurations we repeatedly perform split and merge operations using a new
criterion for efficiently selecting the split and merge candidates.
Experimental results on synthetic and real data show the effectiveness of
using the split and merge operations to improve the likelihood of both the
training data and of held-out test data.

Naonori Ueda, Ryohei Nakano, Zoubin Ghahramani, and Geoffrey E. Hinton.
**SMEM algorithm
for mixture models**.
In Michael J. Kearns, Sara A. Solla, and David A. Cohn, editors, *NIPS*,
pages 599-605. The MIT Press, 1998.

** Abstract:** We present
a split-and-merge expectation-maximization (SMEM) algorithm to overcome the
local maxima problem in parameter estimation of finite mixture models. In the
case of mixture models, local maxima often involve having too many components
of a mixture model in one part of the space and too few in another, widely
separated part of the space. To escape from such configurations, we
repeatedly perform simultaneous split-and-merge operations using a new
criterion for efficiently selecting the split-and-merge candidates. We apply
the proposed algorithm to the training of Gaussian mixtures and mixtures of
factor analyzers using synthetic and real data and show the effectiveness of
using the split-and-merge operations to improve the likelihood of both the
training data and of held-out test data. We also show the practical
usefulness of the proposed algorithm by applying it to image compression and
pattern recognition problems.

Zoubin Ghahramani.
**Factorial learning and
the EM algorithm**.
In Gerald Tesauro, David S. Touretzky, and Todd K. Leen, editors, *Advances
in Neural Information Processing Systems 7*, pages 617-624. MIT Press,
1994.

** Abstract:** Many real world learning problems are
best characterized by an interaction of multiple independent causes or
factors. Discovering such causal structure from the data is the focus of this
paper. Based on Zemel and Hinton's cooperative vector quantizer (CVQ)
architecture, an unsupervised learning algorithm is derived from the
Expectation-Maximization (EM) framework. Due to the combinatorial nature of
the data generation process, the exact E-step is computationally intractable.
Two alternative methods for computing the E-step are proposed: Gibbs sampling
and mean-field approximation, and some promising empirical results are
presented.

Zoubin Ghahramani and Michael I. Jordan.
**Supervised learning
from incomplete data via an EM approach**.
In Jack D. Cowan, Gerald Tesauro, and Joshua Alspector, editors, *NIPS*,
pages 120-127. Morgan Kaufmann, 1993.

** Abstract:**
Real-world learning tasks may involve high-dimensional data sets with
arbitrary patterns of missing data. In this paper we present a framework
based on maximum likelihood density estimation for learning from such data
sets. We use mixture models for the density estimates and make two distinct
appeals to the Expectation-Maximization (EM) principle (Dempster et al., 1977)
in deriving a learning algorithm: EM is used both for the estimation of
mixture components and for coping with missing data. The resulting algorithm
is applicable to a wide range of supervised as well as unsupervised learning
problems. Results from a classification benchmark, the iris data set, are
presented.

## Graphical Models

Graphical models are a graphical representation of the conditional independence relations among a set of variables. The graph is useful both as an intuitive representation of how the variables are related, and as a tool for defining efficient message passing algorithms for probabilistic inference.

Martin Trapp, Robert Peharz, Franz Pernkopf, and Carl Edward Rasmussen.
**Deep
structured mixtures of Gaussian processes**.
In *23rd International Conference on Artificial Intelligence and
Statistics*, Online, August 2020.

** Abstract:** Gaussian
Processes (GPs) are powerful non-parametric Bayesian regression models that
allow exact posterior inference, but exhibit high computational and memory
costs. In order to improve scalability of GPs, approximate posterior
inference is frequently employed, where a prominent class of approximation
techniques is based on local GP experts. However, local-expert techniques
proposed so far are either not well-principled, come with limited
approximation guarantees, or lead to intractable models. In this paper, we
introduce deep structured mixtures of GP experts, a stochastic process model
which i) allows exact posterior inference, ii) has attractive computational
and memory costs, and iii) when used as GP approximation, captures predictive
uncertainties consistently better than previous expert-based approximations.
In a variety of experiments, we show that deep structured mixtures have a low
approximation error and often perform competitively with or outperform prior
work.

Robert Peharz, Steven Lang, Antonio Vergari, Karl Stelzner, Alejandro Molina,
Martin Trapp, Guy Van den Broeck, Kristian Kersting, and Zoubin Ghahramani.
**Einsum networks: Fast and
scalable learning of tractable probabilistic circuits**.
In *37th International Conference on Machine Learning*, Online, July
2020.

** Abstract:** Probabilistic circuits (PCs) are a
promising avenue for probabilistic modeling, as they permit a wide range of
exact and efficient inference routines. Recent "deep-learning-style"
implementations of PCs strive for a better scalability, but are still
difficult to train on real-world data, due to their sparsely connected
computational graphs. In this paper, we propose Einsum Networks (EiNets), a
novel implementation design for PCs, improving prior art in several regards.
At their core, EiNets combine a large number of arithmetic operations in a
single monolithic einsum-operation, leading to speedups and memory savings of
up to two orders of magnitude, in comparison to previous implementations. As
an algorithmic contribution, we show that the implementation of
Expectation-Maximization (EM) can be simplified for PCs, by leveraging
automatic differentiation. Furthermore, we demonstrate that EiNets scale well
to datasets which were previously out of reach, such as SVHN and CelebA, and
that they can be used as faithful generative image models.
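
The core idea in this abstract, fusing many product and weighted-sum operations into one einsum call, can be sketched roughly as follows. This is an assumption-laden illustration, not the authors' implementation: the function name, tensor shapes, and the simple max-subtraction used for numerical stability are choices made here for the example.

```python
import numpy as np


def einsum_layer(log_left, log_right, weights):
    """Evaluate K_out sum nodes over all pairwise products of two children.

    log_left, log_right: (batch, K_in) log-densities of the two children.
    weights: (K_out, K_in, K_in), non-negative, each weights[k] sums to 1.
    Returns (batch, K_out): log sum_ij weights[k, i, j] * L[i] * R[j].
    """
    # Subtract per-row maxima so the exponentials stay in a safe range.
    max_left = log_left.max(axis=1, keepdims=True)
    max_right = log_right.max(axis=1, keepdims=True)
    left = np.exp(log_left - max_left)
    right = np.exp(log_right - max_right)

    # One einsum performs every product and weighted sum at once.
    out = np.einsum("bi,bj,kij->bk", left, right, weights)
    return np.log(out) + max_left + max_right


# Toy inputs with made-up sizes.
rng = np.random.default_rng(0)
batch, k_in, k_out = 5, 3, 4
log_left = rng.normal(size=(batch, k_in))
log_right = rng.normal(size=(batch, k_in))
weights = rng.random(size=(k_out, k_in, k_in))
weights /= weights.sum(axis=(1, 2), keepdims=True)

log_out = einsum_layer(log_left, log_right, weights)
```

The same result could be computed with nested loops over `k`, `i`, and `j`; expressing it as a single einsum is what lets a numerical library execute it as one batched operation.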

Martin Trapp, Robert Peharz, Hong Ge, Franz Pernkopf, and Zoubin Ghahramani.
**Bayesian
learning of sum-product networks**.
In *Advances in Neural Information Processing Systems 32*, Vancouver,
December 2019.

** Abstract:** Sum-product networks (SPNs) are
flexible density estimators and have received significant attention due to
their attractive inference properties. While parameter learning in SPNs is
well developed, structure learning leaves something to be desired: Even
though there is a plethora of SPN structure learners, most of them are
somewhat ad-hoc and based on intuition rather than a clear learning
principle. In this paper, we introduce a well-principled Bayesian framework
for SPN structure learning. First, we decompose the problem into i) laying
out a computational graph, and ii) learning the so-called scope function over
the graph. The first is rather unproblematic and akin to neural network
architecture validation. The second represents the effective structure of the
SPN and needs to respect the usual structural constraints in SPN, i.e.
completeness and decomposability. While representing and learning the scope
function is somewhat involved in general, in this paper, we propose a natural
parametrisation for an important and widely used special case of SPNs. These
structural parameters are incorporated into a Bayesian model, such that
simultaneous structure and parameter learning is cast into monolithic
Bayesian posterior inference. In various experiments, our Bayesian SPNs often
improve test likelihoods over greedy SPN learners. Further, since the
Bayesian framework protects against overfitting, we can evaluate
hyper-parameters directly on the Bayesian model score, waiving the need for a
separate validation set, which is especially beneficial in low data regimes.
Bayesian SPNs can be applied to heterogeneous domains and can easily be
extended to nonparametric formulations. Moreover, our Bayesian approach is
the first, which consistently and robustly learns SPN structures under
missing data.

Tameem Adel and Adrian Weller.
**TibGM: A
transferable and information-based graphical model approach for reinforcement
learning**.
In *36th International Conference on Machine Learning*, Long Beach, June
2019.

** Abstract:** One of the challenges to reinforcement
learning (RL) is scalable transferability among complex tasks. Incorporating
a graphical model (GM), along with the rich family of related methods, as a
basis for RL frameworks provides potential to address issues such as
transferability, generalisation and exploration. Here we propose a flexible
GM-based RL framework which leverages efficient inference procedures to
enhance generalisation and transfer power. In our proposed transferable and
information-based graphical model framework ‘TibGM’, we show the
equivalence between our mutual information-based objective in the GM, and an
RL consolidated objective consisting of a standard reward maximisation target
and a generalisation/transfer objective. In settings where there is a sparse
or deceptive reward signal, our TibGM framework is flexible enough to
incorporate exploration bonuses depicting intrinsic rewards. We empirically
verify improved performance and exploration power.

Ofer Meshi, Ben London, Adrian Weller, and David Sontag.
**Train and test tightness of
LP relaxations in structured prediction**.
*Journal of Machine Learning Research*, 20(13):1-34, 2019.

** Abstract:** Structured prediction is used in areas including
computer vision and natural language processing to predict structured outputs
such as segmentations or parse trees. In these settings, prediction is
performed by MAP inference or, equivalently, by solving an integer linear
program. Because of the complex scoring functions required to obtain accurate
predictions, both learning and inference typically require the use of
approximate solvers. We propose a theoretical explanation for the striking
observation that approximations based on linear programming (LP) relaxations
are often tight (exact) on real-world instances. In particular, we show that
learning with LP relaxed inference encourages integrality of training
instances, and that this training tightness generalizes to test data.

Yingzhen Li and Stephan Mandt.
**Disentangled Sequential
Autoencoder**.
In *35th International Conference on Machine Learning*, Stockholm,
Sweden, July 2018.

** Abstract:** We present a VAE
architecture for encoding and generating high dimensional sequential data,
such as video or audio. Our deep generative model learns a latent
representation of the data which is split into a static and dynamic part,
allowing us to approximately disentangle latent time-dependent features
(dynamics) from features which are preserved over time (content). This
architecture gives us partial control over generating content and dynamics by
conditioning on either one of these sets of features. In our experiments on
artificially generated cartoon video clips and voice recordings, we show that
we can convert the content of a given sequence into another one by such
content swapping. For audio, this allows us to convert a male speaker into a
female speaker and vice versa, while for video we can separately manipulate
shapes and dynamics. Furthermore, we give empirical evidence for the
hypothesis that stochastic RNNs as latent state models are more efficient at
compressing and generating long sequences than deterministic ones, which may
be relevant for applications in video compression.

Sungsoo Ahn, Michael Chertkov, Jinwoo Shin, and Adrian Weller.
**Gauged
mini-bucket elimination for approximate inference**.
In *21st International Conference on Artificial Intelligence and
Statistics*, Playa Blanca, Lanzarote, Canary Islands, April 2018.

** Abstract:** Computing the partition function Z of a discrete
graphical model is a fundamental inference challenge. Since this is
computationally intractable, variational approximations are often used in
practice. Recently, so-called gauge transformations were used to improve
variational lower bounds on Z. In this paper, we propose a new
gauge-variational approach, termed WMBE-G, which combines gauge
transformations with the weighted mini-bucket elimination (WMBE) method.
WMBE-G can provide both upper and lower bounds on Z, and is easier to
optimize than the prior gauge-variational algorithm. We show that WMBE-G
strictly improves the earlier WMBE approximation for symmetric models
including Ising models with no magnetic field. Our experimental results
demonstrate the effectiveness of WMBE-G even for generic, nonsymmetric
models.

Antonio Vergari, Robert Peharz, Nicola Di Mauro, Alejandro Molina, Kristian
Kersting, and Floriana Esposito.
**Sum-product
autoencoding: Encoding and decoding representations using sum-product
networks**.
In *32nd AAAI Conference on Artificial Intelligence*, New Orleans, USA,
February 2018.

** Abstract:** Sum-Product Networks
(SPNs) are a deep probabilistic architecture that up to now has been
successfully employed for tractable inference. Here, we extend their scope
towards unsupervised representation learning: we encode samples into
continuous and categorical embeddings and show that they can also be decoded
back into the original input space by leveraging MPE inference. We
characterize when this Sum-Product Autoencoding (SPAE) leads to equivalent
reconstructions and extend it towards dealing with missing embedding
information. Our experimental results on several multilabel classification
problems demonstrate that SPAE is competitive with state-of-the-art
autoencoder architectures, even if the SPNs were never trained to reconstruct
their inputs.

Sungsoo Ahn, Michael Chertkov, Adrian Weller, and Jinwoo Shin.
**Bucket
renormalization for approximate inference**.
In *35th International Conference on Machine Learning*, 2018.

** Abstract:** Probabilistic graphical models are a key tool in
machine learning applications. Computing the partition function, i.e.,
normalizing constant, is a fundamental task of statistical inference but it
is generally computationally intractable, leading to extensive study of
approximation methods. Iterative variational methods are a popular and
successful family of approaches. However, even state of the art variational
methods can return poor results or fail to converge on difficult instances.
In this paper, we instead consider computing the partition function via
sequential summation over variables. We develop robust approximate algorithms
by combining ideas from mini-bucket elimination with tensor network and
renormalization group methods from statistical physics. The resulting
“convergence-free” methods show good empirical performance on both
synthetic and real-world benchmark models, even for difficult instances.

Niki Kilbertus, Mateo Rojas-Carulla, Giambattista Parascandolo, Moritz Hardt,
Dominik Janzing, and Bernhard Schölkopf.
**Avoiding discrimination
through causal reasoning**.
In *Advances in Neural Information Processing Systems 30*, Long Beach,
California, December 2017.

** Abstract:** Recent work on
fairness in machine learning has focused on various statistical
discrimination criteria and how they trade off. Most of these criteria are
observational: They depend only on the joint distribution of predictor,
protected attribute, features, and outcome. While convenient to work with,
observational criteria have severe inherent limitations that prevent them
from resolving matters of fairness conclusively. Going beyond observational
criteria, we frame the problem of discrimination based on protected
attributes in the language of causal reasoning. This viewpoint shifts
attention from "What is the right fairness criterion?" to "What do we want
to assume about our model of the causal data generating process?" Through
the lens of causality, we make several contributions. First, we crisply
articulate why and when observational criteria fail, thus formalizing what
was before a matter of opinion. Second, our approach exposes previously
ignored subtleties and why they are fundamental to the problem. Finally, we
put forward natural causal non-discrimination criteria and develop algorithms
that satisfy them.

Mark Rowland and Adrian Weller.
**Uprooting
and rerooting higher-order graphical models**.
In *Advances in Neural Information Processing Systems 30*, Long Beach,
California, December 2017.

** Abstract:** The idea of
uprooting and rerooting graphical models was introduced specifically for
binary pairwise models by Weller [18] as a way to transform a model to any of
a whole equivalence class of related models, such that inference on any one
model yields inference results for all others. This is very helpful since
inference, or relevant bounds, may be much easier to obtain or more accurate
for some model in the class. Here we introduce methods to extend the approach
to models with higher-order potentials and develop theoretical insights. For
example, we demonstrate that the triplet-consistent polytope TRI is unique in
being 'universally rooted'. We demonstrate empirically that rerooting can
significantly improve accuracy of methods of inference for higher-order
models at negligible computational cost.

Matej Balog, Nilesh Tripuraneni, Zoubin Ghahramani, and Adrian Weller.
**Lost
relatives of the Gumbel trick**.
In *34th International Conference on Machine Learning*, Sydney,
Australia, August 2017.

** Abstract:** The Gumbel trick is a
method to sample from a discrete probability distribution, or to estimate its
normalizing partition function. The method relies on repeatedly applying a
random perturbation to the distribution in a particular way, each time
solving for the most likely configuration. We derive an entire family of
related methods, of which the Gumbel trick is one member, and show that the
new methods have superior properties in several settings with minimal
additional computational cost. In particular, for the Gumbel trick to yield
computational benefits for discrete graphical models, Gumbel perturbations on
all configurations are typically replaced with so-called low-rank
perturbations. We show how a subfamily of our new methods adapts to this
setting, proving new upper and lower bounds on the log partition function and
deriving a family of sequential samplers for the Gibbs distribution. Finally,
we balance the discussion by showing how the simpler analytical form of the
Gumbel trick enables additional theoretical results.

** Comment:** [arXiv] [Poster] [Slides] [Code]
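
As a rough illustration of the basic Gumbel trick described in the abstract above (not the new family of methods the paper derives), the sketch below perturbs every log-potential with independent Gumbel noise and takes the maximum. The argmax is a sample from the discrete distribution, and the average of the perturbed maxima, minus the Euler-Mascheroni constant, estimates the log partition function. All function names are hypothetical, and the full-rank perturbation over every configuration is an assumption that is only feasible for tiny models.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # mean of a standard Gumbel(0, 1) variable


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) variate by inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # guard against log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max(potentials, rng):
    """Perturb every log-potential; return (argmax index, maximum value)."""
    perturbed = [phi + sample_gumbel(rng) for phi in potentials]
    best = max(range(len(perturbed)), key=perturbed.__getitem__)
    return best, perturbed[best]


def estimate_log_partition(potentials, rng, num_samples=20000):
    """Average the perturbed maxima and subtract the Gumbel mean."""
    total = sum(gumbel_max(potentials, rng)[1] for _ in range(num_samples))
    return total / num_samples - EULER_GAMMA


def exact_log_partition(potentials):
    """Numerically stable log-sum-exp, used only as a reference value."""
    m = max(potentials)
    return m + math.log(sum(math.exp(phi - m) for phi in potentials))


if __name__ == "__main__":
    rng = random.Random(0)
    log_potentials = [0.5, -1.0, 2.0, 0.0]  # illustrative values
    print("estimate:", estimate_log_partition(log_potentials, rng))
    print("exact:   ", exact_log_partition(log_potentials))
```

The estimate is a Monte Carlo average, so it only approaches the exact value as the number of samples grows.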

Martin Trapp, Tamas Madl, Robert Peharz, Franz Pernkopf, and Robert Trappl.
**Safe
semi-supervised learning of sum-product networks**.
In *33rd Conference on Uncertainty in Artificial Intelligence*, Sydney,
Australia, August 2017.

** Abstract:** In several domains
obtaining class annotations is expensive while at the same time unlabelled
data are abundant. While most semi-supervised approaches enforce restrictive
assumptions on the data distribution, recent work has managed to learn
semi-supervised models in a non-restrictive regime. However, so far such
approaches have only been proposed for linear models. In this work, we
introduce semi-supervised parameter learning for Sum-Product Networks (SPNs).
SPNs are deep probabilistic models admitting inference in linear time in
number of network edges. Our approach has several advantages, as it (1)
allows generative and discriminative semi-supervised learning, (2) guarantees
that adding unlabelled data can increase, but not degrade, the performance
(safe), and (3) is computationally efficient and does not enforce restrictive
assumptions on the data distribution. We show on a variety of data sets that
safe semi-supervised learning with SPNs is competitive compared to
state-of-the-art and can lead to a better generative and discriminative
objective value than a purely supervised approach.

Eric Jang, Shixiang Gu, and Ben Poole.
**Categorical reparameterization
with Gumbel-softmax**.
In *5th International Conference on Learning Representations*, Toulon,
France, April 2017.

** Abstract:** Categorical variables are a
natural choice for representing discrete structure in the world. However,
stochastic neural networks rarely use categorical latent variables due to the
inability to backpropagate through samples. In this work, we present an
efficient gradient estimator that replaces the non-differentiable sample from
a categorical distribution with a differentiable sample from a novel
Gumbel-Softmax distribution. This distribution has the essential property
that it can be smoothly annealed into a categorical distribution. We show
that our Gumbel-Softmax estimator outperforms state-of-the-art gradient
estimators on structured output prediction and unsupervised generative
modeling tasks with categorical latent variables, and enables large speedups
on semi-supervised classification.
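
The sampling step behind the Gumbel-Softmax distribution mentioned above can be sketched as follows: add Gumbel noise to the logits, divide by a temperature, and apply a softmax. This is a minimal forward-pass illustration in plain Python, with a hypothetical function name. It does not compute gradients, for which an automatic differentiation framework would normally be used.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        u = rng.random()
        while u == 0.0:  # guard against log(0)
            u = rng.random()
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    m = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [1.0, 0.0, -1.0]  # illustrative values
    for tau in (5.0, 1.0, 0.1):
        print(tau, gumbel_softmax_sample(logits, tau, rng))
```

Lower temperatures push samples towards one-hot vectors, which is the annealing behaviour the abstract refers to. Higher temperatures give smoother, more uniform vectors.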

Mark Rowland, Aldo Pacchiano, and Adrian Weller.
**Conditions beyond
treewidth for tightness of higher-order LP relaxations**.
In *20th International Conference on Artificial Intelligence and
Statistics*, Fort Lauderdale, Florida, April 2017.

** Abstract:** Linear programming (LP) relaxations are a popular method to
attempt to find a most likely configuration of a discrete graphical model. If
a solution to the relaxed problem is obtained at an integral vertex then the
solution is guaranteed to be exact and we say that the relaxation is tight.
We consider binary pairwise models and introduce new methods which allow us
to demonstrate refined conditions for tightness of LP relaxations in the
Sherali-Adams hierarchy. Our results include showing that for higher order LP
relaxations, treewidth is not precisely the right way to characterize
tightness. This work is primarily theoretical, with insights that can improve
efficiency in practice.

Ofer Meshi, Mehrdad Mahdavi, Adrian Weller, and David Sontag.
**Train and test
tightness of LP relaxations in structured prediction**.
In *33rd International Conference on Machine Learning*, New York, NY,
June 2016.

** Abstract:** Structured prediction is used in
areas such as computer vision and natural language processing to predict
structured outputs such as segmentations or parse trees. In these settings,
prediction is performed by MAP inference or, equivalently, by solving an
integer linear program. Because of the complex scoring functions required to
obtain accurate predictions, both learning and inference typically require
the use of approximate solvers. We propose a theoretical explanation to the
striking observation that approximations based on linear programming (LP)
relaxations are often tight on real-world instances. In particular, we show
that learning with LP relaxed inference encourages integrality of training
instances, and that tightness generalizes from train to test data.

Adrian Weller.
**Characterizing tightness
of LP relaxations by forbidding signed minors**.
In *32nd Conference on Uncertainty in Artificial Intelligence*, Jersey
City, NJ, June 2016.

** Abstract:** We consider binary
pairwise graphical models and provide an exact characterization (necessary
and sufficient conditions observing signs of potentials) of tightness for the
LP relaxation on the triplet-consistent polytope of the MAP inference
problem, by forbidding an odd-K5 (complete graph on 5 variables with all
edges repulsive) as a signed minor in the signed suspension graph. This
captures signs of both singleton and edge potentials in a compact and
efficiently testable condition, and improves significantly on earlier
results. We provide other results on tightness of LP relaxations by
forbidding minors, draw connections and suggest paths for future
research.

Adrian Weller.
**Uprooting and
rerooting graphical models**.
In *33rd International Conference on Machine Learning*, New York, NY,
June 2016.

** Abstract:** We show how any binary pairwise
model may be ‘uprooted’ to a fully symmetric model, wherein original
singleton potentials are transformed to potentials on edges to an added
variable, and then ‘rerooted’ to a new model on the original number of
variables. The new model is essentially equivalent to the original model,
with the same partition function and allowing recovery of the original
marginals or a MAP configuration, yet may have very different computational
properties that allow much more efficient inference. This meta-approach
deepens our understanding, may be applied to any existing algorithm to yield
improved methods in practice, generalizes earlier theoretical results, and
reveals a remarkable interpretation of the triplet-consistent polytope.
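
A small brute-force check can make the uprooting idea above concrete. The sketch below assumes a ±1 spin parameterization (an assumption made here for simplicity, and possibly different from the paper's own) in which singleton terms h_i s_i are moved onto edges to an added variable s_0. The uprooted model is then symmetric under flipping all spins, so its partition function is twice the original, and clamping any single variable to +1 ("rerooting") recovers the original partition function. Function names are hypothetical.

```python
import itertools
import math


def original_partition_function(fields, couplings):
    """Brute-force Z for score = sum_i h_i s_i + sum_{i<j} J_ij s_i s_j."""
    total = 0.0
    for spins in itertools.product((-1, 1), repeat=len(fields)):
        score = sum(h * s for h, s in zip(fields, spins))
        score += sum(w * spins[i] * spins[j] for (i, j), w in couplings.items())
        total += math.exp(score)
    return total


def uproot(fields, couplings):
    """Move each singleton term h_i onto an edge (0, i + 1) to a new variable 0."""
    weights = {(0, i + 1): h for i, h in enumerate(fields)}
    for (i, j), w in couplings.items():
        weights[(i + 1, j + 1)] = w
    return weights


def partition_function(weights, num_vars, clamp=None):
    """Brute-force Z of an edge-only model; clamp fixes one spin to +1."""
    total = 0.0
    for spins in itertools.product((-1, 1), repeat=num_vars):
        if clamp is not None and spins[clamp] != 1:
            continue
        score = sum(w * spins[a] * spins[b] for (a, b), w in weights.items())
        total += math.exp(score)
    return total


if __name__ == "__main__":
    fields = [0.3, -0.7, 1.1]                              # illustrative values
    couplings = {(0, 1): 0.5, (1, 2): -0.8, (0, 2): 0.2}   # illustrative values
    z = original_partition_function(fields, couplings)
    uprooted = uproot(fields, couplings)
    print(z, partition_function(uprooted, 4) / 2)
    print([partition_function(uprooted, 4, clamp=k) for k in range(4)])
```

Exhaustive enumeration is only practical for a handful of variables. The point is the identity, not the inference method.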

Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih.
**MuProp: Unbiased backpropagation
for stochastic neural networks**.
In *4th International Conference on Learning Representations*, San Juan,
Puerto Rico, May 2016.

** Abstract:** Deep neural networks are
powerful parametric models that can be trained efficiently using the
backpropagation algorithm. Stochastic neural networks combine the power of
large parametric functions with that of graphical models, which makes it
possible to learn very complex distributions. However, as backpropagation is
not directly applicable to stochastic networks that include discrete sampling
operations within their computational graph, training such networks remains
difficult. We present MuProp, an unbiased gradient estimator for stochastic
networks, designed to make this task easier. MuProp improves on the
likelihood-ratio estimator by reducing its variance using a control variate
based on the first-order Taylor expansion of a mean-field network. Crucially,
unlike prior attempts at using backpropagation for training stochastic
networks, the resulting estimator is unbiased and well behaved. Our
experiments on structured output prediction and discrete latent variable
modeling demonstrate that MuProp yields consistently good performance across
a range of difficult tasks.

Adrian Weller and Justin Domke.
**Clamping
improves TRW and mean field approximations**.
In *19th International Conference on Artificial Intelligence and
Statistics*, Cadiz, Spain, May 2016.

** Abstract:** We
examine the effect of clamping variables for approximate inference in
undirected graphical models with pairwise relationships and discrete
variables. For any number of variable labels, we demonstrate that clamping
and summing approximate sub-partition functions can lead only to a decrease
in the partition function estimate for TRW, and an increase for the naive
mean field method, in each case guaranteeing an improvement in the
approximation and bound. We next focus on binary variables, add the Bethe
approximation to consideration and examine ways to choose good variables to
clamp, introducing new methods. We show the importance of identifying highly
frustrated cycles, and of checking the singleton entropy of a variable. We
explore the value of our methods by empirical analysis and draw lessons to
guide practitioners.
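
The exact identity that clamping relies on can be checked by brute force on a tiny model: the partition function equals the sum of the sub-partition functions obtained by fixing one variable to each of its labels. The sketch below verifies only this exact identity. The paper's results concern what happens when each sub-partition function is replaced by an approximation such as TRW or naive mean field, which is not implemented here. The data layout and function name are assumptions for illustration.

```python
import itertools
import math


def partition_function(num_labels, unary, pairwise, clamped=None):
    """Brute-force partition function of a small pairwise model.

    unary[i][a] is the log-potential of variable i taking label a;
    pairwise[(i, j)][a][b] is the log-potential of edge (i, j);
    clamped optionally maps variable indices to fixed labels.
    """
    clamped = clamped or {}
    n = len(unary)
    total = 0.0
    for labels in itertools.product(range(num_labels), repeat=n):
        if any(labels[i] != v for i, v in clamped.items()):
            continue
        score = sum(unary[i][labels[i]] for i in range(n))
        score += sum(t[labels[i]][labels[j]] for (i, j), t in pairwise.items())
        total += math.exp(score)
    return total


if __name__ == "__main__":
    # Illustrative three-variable, three-label model.
    unary = [[0.1, -0.4, 0.3], [0.0, 0.5, -0.2], [-0.3, 0.2, 0.1]]
    edge = [[0.6, -0.1, 0.0], [-0.1, 0.4, 0.2], [0.0, 0.2, -0.5]]
    pairwise = {(0, 1): edge, (1, 2): edge, (0, 2): edge}
    z = partition_function(3, unary, pairwise)
    z_clamped = sum(partition_function(3, unary, pairwise, {0: a}) for a in range(3))
    print(z, z_clamped)
```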

Adrian Weller, Mark Rowland, and David Sontag.
**Tightness of LP
relaxations for almost balanced models**.
In *19th International Conference on Artificial Intelligence and
Statistics*, Cadiz, Spain, May 2016.

** Abstract:** Linear
programming (LP) relaxations are widely used to attempt to identify a most
likely configuration of a discrete graphical model. In some cases, the LP
relaxation attains an optimum vertex at an integral location and thus
guarantees an exact solution to the original optimization problem. When this
occurs, we say that the LP relaxation is tight. Here we consider binary
pairwise models and derive sufficient conditions for guaranteed tightness of
(i) the standard LP relaxation on the local polytope LP+LOC, and (ii) the LP
relaxation on the triplet-consistent polytope LP+TRI (the next level in the
Sherali-Adams hierarchy). We provide simple new proofs of earlier results and
derive significant novel results including that LP+TRI is tight for any
model where each block is balanced or almost balanced, and a decomposition
theorem that may be used to break apart complex models into smaller pieces.
An almost balanced (sub-)model is one that contains no frustrated cycles
except through one privileged variable.

Shixiang Gu, Zoubin Ghahramani, and Richard E. Turner.
**Neural adaptive sequential Monte
Carlo**.
In *Advances in Neural Information Processing Systems 29*, Montréal,
Canada, December 2015.

** Abstract:** Sequential Monte Carlo (SMC),
or particle filtering, is a popular class of methods for sampling from an
intractable target distribution using a sequence of simpler intermediate
distributions. Like other importance sampling-based methods, performance is
critically dependent on the proposal distribution: a bad proposal can lead to
arbitrarily inaccurate estimates of the target distribution. This paper
presents a new method for automatically adapting the proposal using an
approximation of the Kullback-Leibler divergence between the true posterior
and the proposal distribution. The method is very flexible, applicable to any
parameterised proposal distribution and it supports online and batch
variants. We use the new framework to adapt powerful proposal distributions
with rich parameterisations based upon neural networks leading to Neural
Adaptive Sequential Monte Carlo (NASMC). Experiments indicate that NASMC
significantly improves inference in a non-linear state space model
outperforming adaptive proposal methods including the Extended Kalman and
Unscented Particle Filters. Experiments also indicate that improved inference
translates into improved parameter learning when NASMC is used as a
subroutine of Particle Marginal Metropolis Hastings. Finally we show that
NASMC is able to train a neural network-based deep recurrent generative model
achieving results that compete with the state-of-the-art for polymorphic
music modelling. NASMC can be seen as bridging the gap between adaptive SMC
methods and the recent work in scalable, black-box variational inference.

Adrian Weller.
**Bethe and
related pairwise entropy approximations**.
In *31st Conference on Uncertainty in Artificial Intelligence*, pages
942-951, Amsterdam, July 2015.

** Abstract:** For undirected
graphical models, belief propagation often performs remarkably well for
approximate marginal inference, and may be viewed as a heuristic to minimize
the Bethe free energy. Focusing on binary pairwise models, we demonstrate
that several recent results on the Bethe approximation may be generalized to
a broad family of related pairwise free energy approximations with arbitrary
counting numbers. We explore approximation error and shed light on the
empirical success of the Bethe approximation.

** Comment:** Supplementary
Material

Nilesh Tripuraneni, Shixiang Gu, Hong Ge, and Zoubin Ghahramani.
**Particle
Gibbs for infinite hidden Markov models**.
In *Advances in Neural Information Processing Systems 29*, Montreal,
Canada, May 2015.

** Abstract:** Infinite Hidden Markov Models
(iHMMs) are an attractive, nonparametric generalization of the classical
Hidden Markov Model which can automatically infer the number of hidden states
in the system. However, due to the infinite-dimensional nature of the
transition dynamics, performing inference in the iHMM is difficult. In this
paper, we present an infinite-state Particle Gibbs (PG) algorithm to
resample state trajectories for the iHMM. The proposed algorithm uses an
efficient proposal optimized for iHMMs and leverages ancestor sampling to
improve the mixing of the standard PG algorithm. Our algorithm demonstrates
significant convergence improvements on synthetic and real-world data
sets.

Adrian Weller.
**Revisiting the
limits of MAP inference by MWSS on perfect graphs**.
In *18th International Conference on Artificial Intelligence and
Statistics*, pages 1061-1069, San Diego, California, May 2015.

** Abstract:** A recent, promising approach to identifying a
configuration of a discrete graphical model with highest probability (termed
MAP inference) is to reduce the problem to finding a maximum weight stable
set (MWSS) in a derived weighted graph, which, if perfect, allows a solution
to be found in polynomial time. Weller and Jebara (2013) investigated the
class of binary pairwise models where this method may be applied. However,
their analysis made a seemingly innocuous assumption which simplifies
analysis but led to only a subset of possible reparameterizations being
considered. Here we introduce novel techniques and consider all cases,
demonstrating that this greatly expands the set of tractable models. We
provide a simple, exact characterization of the new, enlarged set and show
how such models may be efficiently identified, thus settling the power of the
approach on this class.

** Comment:** Also accepted for presentation at the 21st
International Conference on Principles and Practice of Constraint Programming
(CP 2015)

José Miguel Hernández-Lobato, Matthew W. Hoffman, and Zoubin Ghahramani.
**Predictive
entropy search for efficient global optimization of black-box
functions**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2014.

** Abstract:** We propose a
novel information-theoretic approach for Bayesian optimization called
Predictive Entropy Search (PES). At each iteration, PES selects the next
evaluation point that maximizes the expected information gained with respect
to the global maximum. PES codifies this intractable acquisition function in
terms of the expected reduction in the differential entropy of the predictive
distribution. This reformulation allows PES to obtain approximations that are
both more accurate and efficient than other alternatives such as Entropy
Search (ES). Furthermore, PES can easily perform a fully Bayesian treatment
of the model hyperparameters while ES cannot. We evaluate PES in both
synthetic and real-world applications, including optimization problems in
machine learning, finance, biotechnology, and robotics. We show that the
increased accuracy of PES leads to significant gains in optimization
performance.

Yue Wu, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Gaussian
process volatility model**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2014.

** Abstract:** The prediction
of time-changing variances is an important task in the modeling of financial
data. Standard econometric models are often limited as they assume rigid
functional relationships for the evolution of the variance. Moreover,
functional parameters are usually learned by maximum likelihood, which can
lead to overfitting. To address these problems we introduce GP-Vol, a novel
non-parametric model for time-changing variances based on Gaussian Processes.
This new model can capture highly flexible functional relationships for the
variances. Furthermore, we introduce a new online algorithm for fast
inference in GP-Vol. This method is much faster than current offline
inference procedures and it avoids overfitting problems by following a fully
Bayesian approach. Experiments with financial data show that GP-Vol performs
significantly better than current standard alternatives.

Neil Houlsby, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Cold-start
active learning with robust ordinal matrix factorization**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We present a new matrix
factorization model for rating data and a corresponding active learning
strategy to address the cold-start problem. Cold-start is one of the most
challenging tasks for recommender systems: what to recommend with new users
or items for which one has little or no data. An approach is to use active
learning to collect the most useful initial ratings. However, the performance
of active learning depends strongly upon having accurate estimates of i) the
uncertainty in model parameters and ii) the intrinsic noisiness of the data.
To achieve these estimates we propose a heteroskedastic Bayesian model for
ordinal matrix factorization. We also present a computationally efficient
framework for Bayesian active learning with this type of complex
probabilistic model. This algorithm successfully distinguishes between
informative and noisy data points. Our model yields state-of-the-art
predictive performance and, coupled with our active learning strategy,
enables us to gain useful information in the cold-start setting from the very
first active sample.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Probabilistic
matrix factorization with non-random missing data**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We propose a probabilistic matrix
factorization model for collaborative filtering that learns from data that is
missing not at random (MNAR). Matrix factorization models exhibit
state-of-the-art predictive performance in collaborative filtering. However,
these models usually assume that the data is missing at random (MAR), and
this is rarely the case. For example, the data is not MAR if users rate items
they like more than ones they dislike. When the MAR assumption is incorrect,
inferences are biased and predictive performance can suffer. Therefore, we
model both the generative process for the data and the missing data
mechanism. By learning these two models jointly we obtain improved
performance over state-of-the-art methods when predicting the ratings and
when modeling the data observation process. We present the first viable MF
model for MNAR data. Our results are promising and we expect that further
research on MNAR models will yield large gains in collaborative
filtering.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Stochastic
inference for scalable probabilistic modeling of binary matrices**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** Fully observed large binary matrices
appear in a wide variety of contexts. To model them, probabilistic matrix
factorization (PMF) methods are an attractive solution. However, current
batch algorithms for PMF can be inefficient because they need to analyze the
entire data matrix before producing any parameter updates. We derive an
efficient stochastic inference algorithm for PMF models of fully observed
binary matrices. Our method exhibits faster convergence rates than more
expensive batch approaches and has better predictive performance than
scalable alternatives. The proposed method includes new data subsampling
strategies which produce large gains over standard uniform subsampling. We
also address the task of automatically selecting the size of the minibatches
of data used by our method. For this, we derive an algorithm that adjusts
this hyper-parameter online.

Wouter Boomsma, Pengfei Tian, Jes Frellsen, Jesper Ferkinghoff-Borg, Thomas
Hamelryck, Kresten Lindorff-Larsen, and Michele Vendruscolo.
**Equilibrium simulations of proteins using molecular fragment replacement and
NMR chemical shifts**.
*Proceedings of the National Academy of Sciences*, 111(38):13852-13857,
2014. doi:10.1073/pnas.1404948111.

** Abstract:** Methods of protein
structure determination based on NMR chemical shifts are becoming
increasingly common. The most widely used approaches adopt the molecular
fragment replacement strategy, in which structural fragments are repeatedly
reassembled into different complete conformations in molecular simulations.
Although these approaches are effective in generating individual structures
consistent with the chemical shift data, they do not enable the sampling of
the conformational space of proteins with correct statistical weights. Here,
we present a method of molecular fragment replacement that makes it possible
to perform equilibrium simulations of proteins, and hence to determine their
free energy landscapes. This strategy is based on the encoding of the
chemical shift information in a probabilistic model in Markov chain Monte
Carlo simulations. First, we demonstrate that with this approach it is
possible to fold proteins to their native states starting from extended
structures. Second, we show that the method satisfies the detailed balance
condition and hence it can be used to carry out an equilibrium sampling from
the Boltzmann distribution corresponding to the force field used in the
simulations. Third, by comparing the results of simulations carried out with
and without chemical shift restraints we describe quantitatively the effects
that these restraints have on the free energy landscapes of proteins. Taken
together, these results demonstrate that the molecular fragment replacement
strategy can be used in combination with chemical shift information to
characterize not only the native structures of proteins but also their
conformational fluctuations.

Jes Frellsen, Thomas Hamelryck, and Jesper Ferkinghoff-Borg.
**Combining the
multicanonical ensemble with generative probabilistic models of local
biomolecular structure**.
In *Proceedings of the 59th World Statistics Congress of the
International Statistical Institute*, pages 139-144, Hong Kong,
2014.

** Abstract:** Markov chain Monte Carlo is a powerful
tool for sampling complex systems such as large biomolecular structures.
However, the standard Metropolis-Hastings algorithm suffers from a number of
deficiencies when applied to systems with rugged free-energy landscapes. Some
of these deficiencies can be addressed with the multicanonical ensemble. In
this paper we will present two strategies for applying the multicanonical
ensemble to distributions constructed from generative probabilistic models of
local biomolecular structure. In particular, we will describe how to use the
multicanonical ensemble efficiently in conjunction with the reference ratio
method.

Adrian Weller and Tony Jebara.
**Clamping
variables and approximate inference**.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 28*, pages
909-917. Curran Associates, Inc., 2014.

** Abstract:** It was
recently proved using graph covers (Ruozzi, 2012) that the Bethe partition
function is upper bounded by the true partition function for a binary
pairwise model that is attractive. Here we provide a new, arguably simpler
proof from first principles. We make use of the idea of clamping a variable
to a particular value. For an attractive model, we show that summing over the
Bethe partition functions for each sub-model obtained after clamping any
variable can only raise (and hence improve) the approximation. In fact, we
derive a stronger result that may have other useful implications. Repeatedly
clamping until we obtain a model with no cycles, where the Bethe
approximation is exact, yields the result. We also provide a related lower
bound on a broad class of approximate partition functions of general pairwise
multi-label models that depends only on the topology. We demonstrate that
clamping a few wisely chosen variables can be of practical value by
dramatically reducing approximation error.

** Comment:** Supplementary
Material

Daniel Hernández-Lobato and José Miguel Hernández-Lobato.
**Learning
feature selection dependencies in multi-task learning**.
In *Advances in Neural Information Processing Systems 27*, pages 1-9,
Lake Tahoe, California, USA, December 2013.

** Abstract:** A
probabilistic model based on the horseshoe prior is proposed for learning
dependencies in the process of identifying relevant features for prediction.
Exact inference is intractable in this model. However, expectation
propagation offers an approximate alternative. Because the process of
estimating feature selection dependencies may suffer from over-fitting in the
model proposed, additional data from a multi-task learning scenario are
considered for induction. The same model can be used in this setting with few
modifications. Furthermore, the assumptions made are less restrictive than in
other multi-task methods: The different tasks must share feature selection
dependencies, but can have different relevant features and model
coefficients. Experiments with real and synthetic data show that this model
performs better than other multi-task alternatives from the literature. The
experiments also show that the model is able to induce suitable feature
selection dependencies for the problems considered, only from the training
data.

Daniel Hernández-Lobato, José Miguel Hernández-Lobato, and Pierre Dupont.
**Generalized
spike-and-slab priors for Bayesian group feature selection using
expectation propagation**.
*Journal of Machine Learning Research*, 14:1891-1945, July 2013.

** Abstract:** We describe a Bayesian method for group feature
selection in linear regression problems. The method is based on a generalized
version of the standard spike-and-slab prior distribution which is often used
for individual feature selection. Exact Bayesian inference under the prior
considered is infeasible for typical regression problems. However,
approximate inference can be carried out efficiently using Expectation
Propagation (EP). A detailed analysis of the generalized spike-and-slab
prior shows that it is well suited for regression problems that are sparse at
the group level. Furthermore, this prior can be used to introduce prior
knowledge about specific groups of features that are a priori believed to be
more relevant. An experimental evaluation compares the performance of the
proposed method with those of group LASSO, Bayesian group LASSO,
automatic relevance determination and additional variants used for group
feature selection. The results of these experiments show that a model based
on the generalized spike-and-slab prior and the EP algorithm has
state-of-the-art prediction performance in the problems analyzed.
Furthermore, this model is also very useful to carry out sequential
experimental design (also known as active learning), where the data instances
that are most informative are iteratively included in the training set,
reducing the number of instances needed to obtain a particular level of
prediction accuracy.

Sébastien Bratières, Novi Quadrianto, and Zoubin Ghahramani.
**Bayesian
structured prediction using Gaussian processes**.
*arXiv*, abs/1307.3846, 2013.

** Abstract:** We introduce
a conceptually novel structured prediction model, GPstruct, which is
kernelized, non-parametric and Bayesian, by design. We motivate the model
with respect to existing approaches, among others, conditional random fields
(CRFs), maximum margin Markov networks (M3N), and structured support vector
machines (SVMstruct), which embody only a subset of its properties. We
present an inference procedure based on Markov Chain Monte Carlo. The
framework can be instantiated for a wide range of structured objects such as
linear chains, trees, grids, and other general graphs. As a proof of concept,
the model is benchmarked on several natural language processing tasks and a
video gesture segmentation task involving a linear chain structure. We show
prediction accuracies for GPstruct which are comparable to or exceeding those
of CRFs and SVMstruct.

Frederik Eaton and Zoubin Ghahramani.
**Model reductions
for inference: Generality of pairwise, binary, and planar factor
graphs**.
*Neural Computation*, 25(5):1213-1260, 2013.

** Abstract:** We offer a solution to the problem of efficiently translating
algorithms between different types of discrete statistical model. We
investigate the expressive power of three classes of model (those with binary
variables, with pairwise factors, and with planar topology) as well as their
four intersections. We formalize a notion of "simple reduction" for the
problem of inferring marginal probabilities and consider whether it is
possible to "simply reduce" marginal inference from general discrete factor
graphs to factor graphs in each of these seven subclasses. We characterize
the reducibility of each class, showing in particular that the class of
binary pairwise factor graphs is able to simply reduce only positive models.
We also exhibit a continuous "spectral reduction" based on polynomial
interpolation, which overcomes this limitation. Experiments assess the
performance of standard approximate inference algorithms on the outputs of
our reductions.

Yarin Gal and Phil Blunsom.
**A systematic
Bayesian treatment of the IBM alignment models**.
In *Proceedings of the 2013 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies*.
Association for Computational Linguistics, 2013.

** Abstract:** The dominant yet ageing IBM and HMM word alignment models
underpin most popular Statistical Machine Translation implementations in use
today. Though beset by the limitations of implausible independence
assumptions, intractable optimisation problems, and an excess of tunable
parameters, these models provide a scalable and reliable starting point for
inducing translation systems. In this paper we build upon this venerable base
by recasting these models in the non-parametric Bayesian framework. By
replacing the categorical distributions at their core with hierarchical
Pitman-Yor processes, and through the use of collapsed Gibbs sampling, we
provide a more flexible formulation and sidestep the original heuristic
optimisation techniques. The resulting models are highly extendible,
naturally permitting the introduction of phrasal dependencies. We present
extensive experimental results showing improvements in both AER and BLEU when
benchmarked against Giza++, including significant improvements over IBM model
4.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Stochastic
inference for scalable probabilistic modeling of binary matrices**.
In *NIPS Workshop on Randomized Methods for Machine Learning*, 2013.

** Abstract:** Fully observed large binary matrices appear in a
wide variety of contexts. To model them, probabilistic matrix factorization
(PMF) methods are an attractive solution. However, current batch algorithms
for PMF can be inefficient since they need to analyze the entire data matrix
before producing any parameter updates. We derive an efficient stochastic
inference algorithm for PMF models of fully observed binary matrices. Our
method exhibits faster convergence rates than more expensive batch approaches
and has better predictive performance than scalable alternatives. The
proposed method includes new data subsampling strategies which produce large
gains over standard uniform subsampling. We also address the task of
automatically selecting the size of the minibatches of data and we propose an
algorithm that adjusts this hyper-parameter in an online manner.

Daniel M. Roy.
**On the computability
and complexity of Bayesian reasoning**.
In *NIPS Workshop on Philosophy and Machine Learning*, 2011.

** Abstract:** If we consider the claim made by some cognitive
scientists that the mind performs Bayesian reasoning, and if we
simultaneously accept the Physical Church-Turing thesis and thus believe that
the computational power of the mind is no more than that of a Turing machine,
then what limitations are there to the reasoning abilities of the mind? I
give an overview of joint work with Nathanael Ackerman (Harvard, Mathematics)
and Cameron Freer (MIT, CSAIL) that bears on the computability and complexity
of Bayesian reasoning. In particular, we prove that conditional probability
is in general not computable in the presence of continuous random variables.
However, in light of additional structure in the prior distribution, such as
the presence of certain types of noise, or of exchangeability, conditioning
is possible. These results cover most of statistical practice. At the
workshop on Logic and Computational Complexity, we presented results on the
computational complexity of conditioning, embedding #P-complete problems
in the task of computing conditional probabilities for diffuse continuous
random variables. This work complements older work. For example, under
cryptographic assumptions, the computational complexity of producing samples
and computing probabilities was separated by Ben-David, Chor, Goldreich and
Luby. In recent work, we also make use of cryptographic assumptions to show
that different representations of exchangeable sequences may have vastly
different complexity. However, when faced with an adversary that is
computationally bounded, these different representations have the same
complexity, highlighting the fact that knowledge representation and
approximation play a fundamental role in the possibility and plausibility of
Bayesian reasoning.

R. P. Adams, H. Wallach, and Zoubin Ghahramani.
**Learning the
structure of deep sparse graphical models**.
In Yee Whye Teh and Mike Titterington, editors, *13th International
Conference on Artificial Intelligence and Statistics*, pages 1-8, Chia
Laguna, Sardinia, Italy, May 2010.

** Abstract:** Deep belief
networks are a powerful way to model complex probability distributions.
However, it is difficult to learn the structure of a belief network,
particularly one with hidden units. The Indian buffet process has been used
as a nonparametric Bayesian prior on the structure of a directed belief
network with a single infinitely wide hidden layer. Here, we introduce the
cascading Indian buffet process (CIBP), which provides a prior on the
structure of a layered, directed belief network that is unbounded in both
depth and width, yet allows tractable inference. We use the CIBP prior with
the nonlinear Gaussian belief network framework to allow each unit to vary
its behavior between discrete and continuous representations. We use Markov
chain Monte Carlo for inference in this model and explore the structures
learned on image data.

** Comment:** Winner of the Best Paper Award

Charalampos Rotsos, Jurgen Van Gael, Andrew W. Moore, and Zoubin Ghahramani.
**Probabilistic
graphical models for semi-supervised traffic classification**.
In *The 6th International Wireless Communications and Mobile Computing
Conference*, pages 752-757, Caen, France, 2010.

**
Abstract:** Traffic classification using machine learning continues to be
an active research area. The majority of work in this area uses off-the-shelf
machine learning tools and treats them as black-box classifiers. This
approach turns all the modelling complexity into a feature selection problem.
In this paper, we build a problem-specific solution to the traffic
classification problem by designing a custom probabilistic graphical model.
Graphical models are a modular framework to design classifiers which
incorporate domain-specific knowledge. More specifically, our solution
introduces semi-supervised learning which means we learn from both labelled
and unlabelled traffic flows. We show that our solution performs
competitively compared to previous approaches while using less data and
simpler features.

R. Silva and Z. Ghahramani.
**The hidden life of
latent variables: Bayesian learning with mixed graph models**.
*Journal of Machine Learning Research*, 10:1187-1238, June 2009.

** Abstract:** Directed acyclic graphs (DAGs) have been widely
used as a representation of conditional independence in machine learning and
statistics. Moreover, hidden or latent variables are often an important
component of graphical models. However, DAG models suffer from an important
limitation: the family of DAGs is not closed under marginalization of hidden
variables. This means that in general we cannot use a DAG to represent the
independencies over a subset of variables in a larger DAG. Directed mixed
graphs (DMGs) are a representation that includes DAGs as a special case, and
overcomes this limitation. This paper introduces algorithms for performing
Bayesian inference in Gaussian and probit DMG models. An important
requirement for inference is the specification of the distribution over
parameters of the models. We introduce a new distribution for covariance
matrices of Gaussian DMGs. We discuss and illustrate how several Bayesian
machine learning tasks can benefit from the principle presented here: the
power to model dependencies that are generated from hidden variables, but
without necessarily modeling such variables explicitly.

Frederik Eaton and Zoubin Ghahramani.
**Choosing a variable
to clamp: Approximate inference using conditioned belief
propagation**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 145-152,
Clearwater Beach, FL, USA, April 2009. Journal of Machine Learning
Research.

** Abstract:** In this paper we propose an algorithm
for approximate inference on graphical models based on belief propagation
(BP). Our algorithm is an approximate version of Cutset Conditioning, in
which a subset of variables is instantiated to make the rest of the graph
singly connected. We relax the constraint of single-connectedness, and select
variables one at a time for conditioning, running belief propagation after
each selection. We consider the problem of determining the best variable to
clamp at each level of recursion, and propose a fast heuristic which applies
back-propagation to the BP updates. We demonstrate that the heuristic
performs better than selecting variables at random, and give experimental
results which show that it performs competitively with existing approximate
inference algorithms.

** Comment:** Code (in C++
based on libDAI).

R. Silva and Z. Ghahramani.
**Factorial mixture
of Gaussians and the marginal independence model**.
In *12th International Conference on Artificial Intelligence and
Statistics*, volume 5, pages 520-527, Clearwater Beach, FL, USA,
April 2009. Journal of Machine Learning Research.
ISSN: 1938-7228.

** Abstract:** Marginal independence
constraints play an important role in learning with graphical models. One way
of parameterizing a model of marginal independencies is by building a latent
variable model where two independent observed variables have no common latent
source. In sparse domains, however, it might be advantageous to model the
marginal observed distribution directly, without explicitly including latent
variables in the model. There have been recent advances in Gaussian and
binary models of marginal independence, but no models with non-linear
dependencies between continuous variables have been proposed so far. In this
paper, we describe how to generalize the Gaussian model of marginal
independencies based on mixtures, and how to learn parameters. This requires
a non-standard parameterization and raises difficult non-linear optimization
issues.

** Comment:** Code at http://www.homepages.ucl.ac.uk/~ucgtrbd/code/fmog-version0.zip

R. Silva, W. Chu, and Zoubin Ghahramani.
**Hidden common
cause relations in relational learning**.
In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, *Advances in
Neural Information Processing Systems 20*, pages 1345-1352, Cambridge,
MA, USA, December 2008. The MIT Press.

** Abstract:** When
predicting class labels for objects within a relational database, it is often
helpful to consider a model for relationships: this allows for information
between class labels to be shared and to improve prediction performance.
However, there are different ways by which objects can be related within a
relational database. One traditional way corresponds to a Markov network
structure: each existing relation is represented by an undirected edge. This
encodes that, conditioned on input features, each object label is independent
of other object labels given its neighbors in the graph. However, there is no
reason why Markov networks should be the only representation of choice for
symmetric dependence structures. Here we discuss the case when relationships
are postulated to exist due to *hidden common causes*. We discuss how
the resulting graphical model differs from Markov networks, and how it
describes different types of real-world relational processes. A Bayesian
nonparametric classification model is built upon this graphical
representation and evaluated with several empirical studies.

** Comment:** Code at http://www.homepages.ucl.ac.uk/~ucgtrbd/code/xgp

J. Zhang, Z. Ghahramani, and Y. Yang.
**Flexible latent
variable models for multi-task learning**.
*Machine Learning*, 73(3):221-242, December 2008.

**
Abstract:** Given multiple prediction problems such as regression and
classification, we are interested in a joint inference framework which can
effectively borrow information among tasks to improve the prediction
accuracy, especially when the number of training examples per problem is
small. In this paper we propose a probabilistic framework which can support a
set of latent variable models for different multi-task learning scenarios. We
show that the framework is a generalization of standard learning methods for
single prediction problems and it can effectively model the shared structure
among different prediction tasks. Furthermore, we present efficient
algorithms for the empirical Bayes method as well as point estimation. Our
experiments on both simulated datasets and real world classification datasets
show the effectiveness of the proposed models in two evaluation settings:
standard multi-task learning setting and transfer learning setting.

F. Pérez-Cruz, Zoubin Ghahramani, and M. Pontil.
**Conditional
graphical models**.
In G. H. Bakir, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, and
S. V. N. Vishwanathan, editors, *Predicting Structured Data*, pages
265-282. The MIT Press, Cambridge, MA, USA, September 2007.
Chapter 12.

** Abstract:** In this chapter we propose a
modification of CRF-like algorithms that allows for solving large-scale
structured classification problems. Our approach consists in upper bounding
the CRF functional in order to decompose its training into independent
optimisation problems per clique. Furthermore we show that each sub-problem
corresponds to solving a multiclass learning task in each clique, which
enlarges the applicability of these tools for large-scale structural learning
problems. Before presenting the Conditional Graphical Model (CGM), as we
refer to this procedure, we review the family of CRF algorithms. We
concentrate on the best known procedures and standard generalisations of
CRFs. The objective of this introduction is to analyse, from the same
viewpoint, the solutions proposed in the literature to tackle this problem,
which allows comparing their different features. We complete the chapter with
a case study, in which we show the possibility to work with large-scale
problems using CGM and that the obtained performance is comparable to the
result with CRF-like algorithms.

Ricardo Silva and Zoubin Ghahramani.
**Bayesian inference
for Gaussian mixed graph models**.
In *UAI*. AUAI Press, 2006.

** Abstract:** We introduce
priors and algorithms to perform Bayesian inference in Gaussian models
defined by acyclic directed mixed graphs. Such a class of graphs, composed of
directed and bi-directed edges, is a representation of conditional
independencies that is closed under marginalization and arises naturally from
causal models which allow for unmeasured confounding. Monte Carlo methods and
a variational approximation for such models are presented. Our algorithms for
Bayesian inference allow the evaluation of posterior distributions for
several quantities of interest, including causal effects that are not
identifiable from data alone but could otherwise be inferred where
informative prior knowledge about confounding is available.

Iain Murray and Zoubin Ghahramani.
**Bayesian learning
in undirected graphical models: Approximate MCMC algorithms**.
In David Maxwell Chickering and Joseph Y. Halpern, editors, *UAI*, pages
392-399. AUAI Press, 2004.

** Abstract:** Bayesian learning
in undirected graphical models (computing posterior distributions over
parameters and predictive quantities) is exceptionally difficult. We
conjecture that for general undirected models, there are no tractable MCMC
(Markov Chain Monte Carlo) schemes giving the correct equilibrium
distribution over parameters. While this intractability, due to the partition
function, is familiar to those performing parameter optimisation, Bayesian
learning of posterior distributions over undirected model parameters has been
unexplored and poses novel challenges. We propose several approximate MCMC
schemes and test on fully observed binary models (Boltzmann machines) for a
small coronary heart disease data set and larger artificial systems. While
approximations must perform well on the model, their interaction with the
sampling scheme is also important. Samplers based on variational mean-field
approximations generally performed poorly; more advanced methods using loopy
propagation, brief sampling and stochastic dynamics lead to acceptable
parameter posteriors. Finally, we demonstrate these techniques on a Markov
random field with hidden variables.

Zoubin Ghahramani.
**An introduction to
hidden Markov models and Bayesian networks**.
*IJPRAI*, 15(1):9-42, 2001.

** Abstract:** We provide a
tutorial on learning and inference in hidden Markov models in the context of
the recent literature on Bayesian networks. This perspective makes it
possible to consider novel generalizations to hidden Markov models with
multiple hidden state variables, multiscale representations, and mixed
discrete and continuous variables. Although exact inference in these
generalizations is usually intractable, one can use approximate
inference algorithms such as Markov chain sampling and variational methods.
We describe how such methods are applied to these generalized hidden Markov
models. We conclude this review with a discussion of Bayesian methods for
model selection in generalized HMMs.
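
As a rough illustration of exact inference in a plain hidden Markov model (the tractable base case that the tutorial generalizes), the forward recursion can be sketched as below. The two-state model, its probability tables, and the function name `forward_likelihood` are illustrative assumptions, not taken from the paper.

```python
import itertools


def forward_likelihood(initial, transition, emission, observations):
    """Return p(observations) for a discrete HMM using the forward recursion.

    initial[s]        : probability of starting in state s
    transition[r][s]  : probability of moving from state r to state s
    emission[s][o]    : probability of emitting symbol o in state s
    """
    n_states = len(initial)
    # alpha[s] = p(observations so far, current state = s)
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emission[s][obs]
            * sum(alpha[r] * transition[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state, two-symbol model used only for illustration.
initial = [0.6, 0.4]
transition = [[0.7, 0.3], [0.4, 0.6]]
emission = [[0.9, 0.1], [0.2, 0.8]]

likelihood = forward_likelihood(initial, transition, emission, [0, 1, 0])

# Sanity check: the likelihoods of all length-3 sequences should sum to one.
total = sum(
    forward_likelihood(initial, transition, emission, list(seq))
    for seq in itertools.product([0, 1], repeat=3)
)
```

The cost is linear in the sequence length and quadratic in the number of states. With multiple interacting hidden state variables the joint state space grows exponentially, which is one reason approximate methods become attractive.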

Nicholas J. Adams, Amos J. Storkey, Christopher K. I. Williams, and Zoubin
Ghahramani.
**MFDTs: Mean
field dynamic trees**.
In *International Conference on Pattern Recognition*, volume 3,
pages 151-154, 2000.

** Abstract:** Tree structured belief
networks are attractive for image segmentation tasks. However, networks with
fixed architectures are not very suitable as they lead to blocky artefacts,
which led to the introduction of dynamic trees (DTs). The dynamic trees
architecture provides a prior distribution over tree structures, and simulated
annealing (SA) was used to search for structures with high posterior
probability. In this paper we introduce a mean field approach to inference in
DTs. We find that the mean field method captures the posterior better than
just using the maximum a posteriori solution found by SA.

Zoubin Ghahramani.
**Learning dynamic
Bayesian networks**.
In C. Lee Giles and Marco Gori, editors, *Adaptive Processing of Sequences
and Data Structures*, volume 1387 of *Lecture Notes in Computer
Science*, pages 168-197. Springer, 1997.

** Abstract:**
Bayesian networks are a concise graphical formalism for describing
probabilistic models. We have provided a brief tutorial of methods for
learning and inference in dynamic Bayesian networks. In many of the
interesting models, beyond the simple linear dynamical system or hidden
Markov model, the calculations required for inference are intractable. Two
different approaches for handling this intractability are Monte Carlo methods
such as Gibbs sampling, and variational methods. An especially promising
variational approach is based on exploiting tractable substructures in the
Bayesian network.

## Monte Carlo Methods

Markov chain Monte Carlo (MCMC) methods use sampling to approximate high dimensional integrals and intractable sums. MCMC methods are widely used in many areas of science, applied mathematics and engineering. They are an indispensable approximate inference tool for Bayesian statistics and machine learning.
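
To make the idea of approximating integrals by sampling concrete, here is a minimal random-walk Metropolis sketch. It is a generic textbook sampler, not the method of any particular paper listed in this section; the target (a standard normal), the step size, and the function names are assumptions chosen for illustration.

```python
import math
import random


def metropolis(log_density, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a 1-D unnormalised log density."""
    rng = random.Random(seed)
    x, logp = x0, log_density(x0)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        logp_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        accept_prob = math.exp(min(0.0, logp_new - logp))
        if rng.random() < accept_prob:
            x, logp = proposal, logp_new
        samples.append(x)
    return samples


def log_std_normal(x):
    """Log density of a standard normal, up to an additive constant."""
    return -0.5 * x * x


samples = metropolis(log_std_normal, x0=0.0, n_samples=20000)

# Monte Carlo estimates of E[x] and E[x^2] under the target.
mean_estimate = sum(samples) / len(samples)
second_moment = sum(s * s for s in samples) / len(samples)
```

For this target the two estimates should be close to 0 and 1 respectively. Only the unnormalised density is needed, which is why such samplers remain usable when the normalising constant is itself an intractable integral.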

George Nicholson, Marta Blangiardo, Mark Briers, Peter J Diggle, Tor Erlend
Fjelde, Hong Ge, Robert J B Goudie, Radka Jersakova, Ruairidh E King, Brieuc
C L Lehmann, Ann-Marie Mallon, Tullia Padellini, Yee Whye Teh, Chris Holmes,
and Sylvia Richardson.
**Interoperability of
statistical models in pandemic preparedness: principles and reality**.
*Stat. Sci.*, 37(2):183-206, May 2022.

** Abstract:** We
present interoperability as a guiding framework for statistical modelling to
assist policy makers asking multiple questions using diverse datasets in the
face of an evolving pandemic response. Interoperability provides an important
set of principles for future pandemic preparedness, through the joint design
and deployment of adaptable systems of statistical models for disease
surveillance using probabilistic reasoning. We illustrate this through case
studies for inferring and characterising spatial-temporal prevalence and
reproduction numbers of SARS-CoV-2 infections in England.

Gergely Flamich, Stratis Markou, and José Miguel Hernández-Lobato.
**Fast relative entropy coding with
A* coding**.
In *39th International Conference on Machine Learning*, 2022.

** Abstract:** Relative entropy coding (REC) algorithms encode a
sample from a target distribution $Q$ using a proposal distribution $P$, such
that the expected codelength is $\mathcal{O}(D_{KL}[Q||P])$. REC can be
seamlessly integrated with existing learned compression models since, unlike
entropy coding, it does not assume discrete $Q$ or $P$, and does not require
quantisation. However, general REC algorithms require an intractable
$\Omega(e^{D_{KL}[Q||P]})$ runtime. We introduce AS* and AD* coding, two REC
algorithms based on A* sampling. We prove that, for continuous distributions
over $\mathbb{R}$, if the density ratio is unimodal, AS* has
$\mathcal{O}(D_{\infty}[Q||P])$ expected runtime, where
$D_{\infty}[Q||P]$ is the Rényi $\infty$-divergence. We provide
experimental evidence that AD* also has $\mathcal{O}(D_{\infty}[Q||P])$
expected runtime. We prove that AS* and AD* achieve an expected codelength of
$\mathcal{O}(D_{KL}[Q||P])$. Further, we introduce DAD*, an approximate
algorithm based on AD* which retains its favourable runtime and has bias
similar to that of alternative methods. Focusing on VAEs, we propose the
IsoKL VAE (IKVAE), which can be used with DAD* to further improve compression
efficiency. We evaluate A* coding with (IK)VAEs on MNIST, showing that it can
losslessly compress images near the theoretically optimal limit.

Vidhi Lalchand, Wessel P. Bruinsma, David R. Burt, and Carl E. Rasmussen.
**Sparse Gaussian
process hyperparameters: Optimize or integrate?**.
In *36th Conference on Neural Information Processing Systems*, 2022.

** Abstract:** The kernel function and
its hyperparameters are the central model selection choice in a Gaussian
process [Rasmussen and Williams, 2006]. Typically, the hyperparameters of the
kernel are chosen by maximising the marginal likelihood, an approach known as
Type-II maximum likelihood (ML-II). However, ML-II does not account for
hyperparameter uncertainty, and it is well-known that this can lead to
severely biased estimates and an underestimation of predictive uncertainty.
While there are several works which employ a fully Bayesian characterisation
of GPs, relatively few propose such approaches for the sparse GPs paradigm.
In this work we propose an algorithm for sparse Gaussian process regression
which leverages MCMC to sample from the hyperparameter posterior within the
variational inducing point framework of [Titsias, 2009]. This work is closely
related to Hensman et al. [2015b], but side-steps the need to sample the
inducing points, thereby significantly improving sampling efficiency in the
Gaussian likelihood case. We compare this scheme against natural baselines in
the literature and against stochastic variational GPs (SVGPs), and provide an
extensive computational analysis.

Valerii Likhosherstov, Xingyou Song, Krzysztof Choromanski, Jared Q Davis, and
Adrian Weller.
**Debiasing
a first-order heuristic for approximate bi-level optimization**.
In Marina Meila and Tong Zhang, editors, *Proceedings of the 38th
International Conference on Machine Learning*, volume 139 of
*Proceedings of Machine Learning Research*, pages 6621-6630. PMLR,
18-24 Jul 2021.

** Abstract:** Approximate bi-level
optimization (ABLO) consists of (outer-level) optimization problems,
involving numerical (inner-level) optimization loops. While ABLO has many
applications across deep learning, it suffers from time and memory complexity
proportional to the length $r$ of its inner optimization loop. To address
this complexity, an earlier first-order method (FOM) was proposed as a
heuristic which omits second derivative terms, yielding significant speed
gains and requiring only constant memory. Despite FOM’s popularity, there
is a lack of theoretical understanding of its convergence properties. We
contribute by theoretically characterizing FOM’s gradient bias under mild
assumptions. We further demonstrate a rich family of examples where FOM-based
SGD does not converge to a stationary point of the ABLO objective. We address
this concern by proposing an unbiased FOM (UFOM) enjoying constant memory
complexity as a function of $r$. We characterize the introduced time-variance
tradeoff, demonstrate convergence bounds, and find an optimal UFOM for a
given ABLO problem. Finally, we propose an efficient adaptive UFOM
scheme.

Fergus Simpson, Vidhi Lalchand, and Carl Edward Rasmussen.
**Marginalised
Gaussian Processes with Nested Sampling**.
In *Advances in Neural Information Processing Systems 34*,
volume 34, pages 13613-13625. Curran Associates, Inc., 2021.

** Abstract:** Gaussian Process models are a rich distribution
over functions with inductive biases controlled by a kernel function.
Learning occurs through optimisation of the kernel hyperparameters using the
marginal likelihood as the objective. This work proposes nested sampling as a
means of marginalising kernel hyperparameters, because it is a technique that
is well-suited to exploring complex, multi-modal distributions. We benchmark
against Hamiltonian Monte Carlo on time-series and two-dimensional regression
tasks, finding that a principled approach to quantifying hyperparameter
uncertainty substantially improves the quality of prediction intervals.

Kai Xu, Tor Erlend Fjelde, Charles Sutton, and Hong Ge.
**Couplings for
multinomial Hamiltonian Monte Carlo**.
130:3646-3654, 13-15 Apr 2021.

** Abstract:** Hamiltonian
Monte Carlo (HMC) is a popular sampling method in Bayesian inference.
Recently, Heng & Jacob (2019) studied Metropolis HMC with couplings for
unbiased Monte Carlo estimation, establishing a generic parallelizable scheme
for HMC. However, in practice a different HMC method, multinomial HMC, is
considered as the go-to method, e.g. as part of the no-U-turn sampler. In
multinomial HMC, proposed states are not limited to end-points as in
Metropolis HMC; instead points along the entire trajectory can be proposed.
In this paper, we establish couplings for multinomial HMC, based on optimal
transport for multinomial sampling in its transition. We prove an upper bound
for the meeting time – the time it takes for the coupled chains to meet –
based on the notion of local contractivity. We evaluate our methods using
three targets: 1,000 dimensional Gaussians, logistic regression and
log-Gaussian Cox point processes. Compared to Heng & Jacob (2019),
coupled multinomial HMC generally attains a smaller meeting time, and is more
robust to choices of step sizes and trajectory lengths, which allows re-use
of existing adaptation methods for HMC. These improvements together pave the
way for a wider and more practical use of coupled HMC methods.

Krzysztof Choromanski, Mark Rowland, Wenyu Chen, and Adrian Weller.
**Unifying
orthogonal Monte Carlo methods**.
In *36th International Conference on Machine Learning*, Long Beach, June
2019.

** Abstract:** Many machine learning methods making use
of Monte Carlo sampling in vector spaces have been shown to be improved by
conditioning samples to be mutually orthogonal. Exact orthogonal coupling of
samples is computationally intensive, hence approximate methods have been of
great interest. In this paper, we present a unifying perspective of many
approximate methods by considering Givens transformations, propose new
approximate methods based on this framework, and demonstrate the first
statistical guarantees for families of approximate methods in kernel
approximation. We provide extensive empirical evaluations with guidance for
practitioners.
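
The role of Givens transformations can be illustrated with a small sketch: a product of random Givens rotations is always exactly orthogonal, so applying a modest number of them gives a cheap orthogonal matrix. The construction below (uniformly random index pairs and angles) and the names `givens_product` and `max_orthogonality_error` are assumptions for illustration and are not claimed to match the specific constructions analysed in the paper.

```python
import math
import random


def givens_product(dim, n_rotations, seed=0):
    """Return a dim x dim orthogonal matrix built from random Givens rotations."""
    rng = random.Random(seed)
    matrix = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for _ in range(n_rotations):
        i, j = rng.sample(range(dim), 2)  # two distinct coordinates
        theta = rng.uniform(0.0, 2.0 * math.pi)
        c, s = math.cos(theta), math.sin(theta)
        row_i, row_j = matrix[i], matrix[j]
        # Left-multiplying by a Givens rotation mixes only rows i and j.
        matrix[i] = [c * a - s * b for a, b in zip(row_i, row_j)]
        matrix[j] = [s * a + c * b for a, b in zip(row_i, row_j)]
    return matrix


def max_orthogonality_error(matrix):
    """Largest absolute deviation of (matrix @ matrix^T) from the identity."""
    dim = len(matrix)
    worst = 0.0
    for i in range(dim):
        for j in range(dim):
            dot = sum(a * b for a, b in zip(matrix[i], matrix[j]))
            target = 1.0 if i == j else 0.0
            worst = max(worst, abs(dot - target))
    return worst


Q = givens_product(dim=8, n_rotations=40)
error = max_orthogonality_error(Q)
```

Each rotation costs time linear in the dimension, which is the appeal over an exact orthogonalisation. How many rotations are needed for the result to behave like a uniformly random orthogonal matrix is a separate question that this sketch does not address.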

Krzysztof Choromanski, Mark Rowland, Tamas Sarlos, Vikas Sindhwani, Richard E.
Turner, and Adrian Weller.
**The geometry of
random features**.
In *21st International Conference on Artificial Intelligence and
Statistics*, Playa Blanca, Lanzarote, Canary Islands, April 2018.

** Abstract:** We present an in-depth examination of the
effectiveness of radial basis function kernel (beyond Gaussian) estimators
based on orthogonal random feature maps. We show that orthogonal estimators
outperform state-of-the-art mechanisms that use iid sampling under weak
conditions for tails of the associated Fourier distributions. We prove that
for the case of many dimensions, the superiority of the orthogonal transform
can be accurately measured by a property we define called the charm of the
kernel, and that orthogonal random features provide optimal (in terms of mean
squared error) kernel estimators. We provide the first theoretical results
which explain why orthogonal random features outperform unstructured on
downstream tasks such as kernel ridge regression by showing that orthogonal
random features provide kernel algorithms with better spectral properties
than the previous state-of-the-art. Our results enable practitioners more
generally to estimate the benefits from applying orthogonal transforms.

Hong Ge, Kai Xu, and Zoubin Ghahramani.
**Turing: A language
for flexible probabilistic inference**.
84:1682-1690, 09-11 Apr 2018.

** Abstract:** Probabilistic
programming promises to simplify and democratize probabilistic machine
learning, but successful probabilistic programming systems require flexible,
generic and efficient inference engines. In this work, we present a system
called Turing for building MCMC algorithms for probabilistic programming
inference. Turing has a very simple syntax and makes full use of the
numerical capabilities in the Julia programming language, including all
implemented probability distributions, and automatic differentiation. Turing
supports a wide range of popular Monte Carlo algorithms, including
Hamiltonian Monte Carlo (HMC), HMC with No-U-Turns (NUTS), Gibbs sampling,
sequential Monte Carlo (SMC), and several particle MCMC (PMCMC) samplers.
Most importantly, Turing inference is composable: it combines MCMC operations
on subsets of variables, for example using a combination of an HMC engine and
a particle Gibbs (PG) engine. We explore several combinations of inference
methods with the aim of finding approaches that are both efficient and
universal, i.e. applicable to arbitrary probabilistic models. NUTS, a
popular variant of HMC that adapts Hamiltonian simulation path length
automatically, is quite powerful for exploring differentiable target
distributions but is not universal. We identify some failure modes for
the NUTS engine, and demonstrate that composition of PG (for discrete
variables) and NUTS (for continuous variables) can be useful when the NUTS
engine is either not applicable, or simply does not work well. Our aim is to
present Turing and its composable inference engines to the world and
encourage other researchers to build on this system to help advance the field
of probabilistic machine learning.

Paavo Parmas, Carl Edward Rasmussen, Jan Peters, and Kenji Doya.
**PIPPS:
Flexible model-based policy search robust to the curse of chaos**.
In *35th International Conference on Machine Learning*, 2018.

** Abstract:** Previously, the exploding gradient problem has
been explained to be central in deep learning and model-based reinforcement
learning, because it causes numerical issues and instability in optimization.
Our experiments in model-based reinforcement learning imply that the problem
is not just a numerical issue, but it may be caused by a fundamental
chaos-like nature of long chains of nonlinear computations. Not only do the
magnitudes of the gradients become large, the direction of the gradients
becomes essentially random. We show that reparameterization gradients suffer
from the problem, while likelihood ratio gradients are robust. Using our
insights, we develop a model-based policy search framework, Probabilistic
Inference for Particle-Based Policy Search (PIPPS), which is easily
extensible, and allows for almost arbitrary models and policies, while
simultaneously matching the performance of previous data-efficient learning
algorithms. Finally, we invent the total propagation algorithm, which
efficiently computes a union over all pathwise derivative depths during a
single backwards pass, automatically giving greater weight to estimators with
lower variance, sometimes improving over reparameterization gradients by
10^{6} times.
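
As a small illustration of the two estimator families this abstract contrasts, the sketch below estimates d/dμ E[f(x)] for x ~ N(μ, σ²) with a reparameterization (pathwise) estimator and with a likelihood ratio (score function) estimator. The test function f(x) = x², the function names, and the sample count are assumptions chosen for illustration; this is not PIPPS or the total propagation algorithm.

```python
import random


def f(x):
    # Illustrative test function (an assumption, not taken from the paper).
    return x * x


def df(x):
    # Derivative of f, needed only by the pathwise estimator.
    return 2.0 * x


def gradient_estimates(mu, sigma, n_samples, seed=0):
    """Estimate d/dmu E[f(x)] for x ~ N(mu, sigma^2) in two ways."""
    rng = random.Random(seed)
    rp_sum = 0.0
    lr_sum = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        x = mu + sigma * eps
        # Reparameterization: differentiate through x = mu + sigma * eps,
        # so dx/dmu = 1 and the per-sample gradient is f'(x).
        rp_sum += df(x)
        # Likelihood ratio: f(x) times d/dmu log N(x; mu, sigma^2).
        lr_sum += f(x) * (x - mu) / sigma ** 2
    return rp_sum / n_samples, lr_sum / n_samples


rp, lr = gradient_estimates(mu=1.0, sigma=1.0, n_samples=200_000)
print(f"reparameterization: {rp:.3f}, likelihood ratio: {lr:.3f}, exact: 2.000")
```

Both estimators are unbiased for this smooth toy problem. The abstract's point concerns how their variances behave along long chains of nonlinear computations, which a one-step example like this does not show.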

Krzysztof Choromanski, Mark Rowland, and Adrian Weller.
**The
unreasonable effectiveness of structured random orthogonal
embeddings**.
In *Advances in Neural Information Processing Systems 31*, Long Beach,
California, December 2017.

** Abstract:** We examine a class
of embeddings based on structured random matrices with orthogonal rows which
can be applied in many machine learning applications including dimensionality
reduction and kernel approximation. For both the Johnson-Lindenstrauss
transform and the angular kernel, we show that we can select matrices
yielding guaranteed improved performance in accuracy and/or speed compared to
earlier methods. We introduce matrices with complex entries which give
significant further accuracy improvement. We provide geometric and Markov
chain-based perspectives to help understand the benefits, and empirical
results which suggest that the approach is helpful in a wider range of
applications.

Nilesh Tripuraneni, Mark Rowland, Zoubin Ghahramani, and Richard E. Turner.
**Magnetic
Hamiltonian Monte Carlo**.
In *34th International Conference on Machine Learning*, 2017.

** Abstract:** Hamiltonian Monte Carlo (HMC) exploits
Hamiltonian dynamics to construct efficient proposals for Markov chain Monte
Carlo (MCMC). In this paper, we present a generalization of HMC which
exploits non-canonical Hamiltonian dynamics. We refer to this algorithm as
magnetic HMC, since in 3 dimensions a subset of the dynamics map onto the
mechanics of a charged particle coupled to a magnetic field. We establish a
theoretical basis for the use of non-canonical Hamiltonian dynamics in MCMC,
and construct a symplectic, leapfrog-like integrator allowing for the
implementation of magnetic HMC. Finally, we exhibit several examples where
these non-canonical dynamics can lead to improved mixing of magnetic HMC
relative to ordinary HMC.

Jes Frellsen, Ole Winther, Zoubin Ghahramani, and Jesper Ferkinghoff-Borg.
**Bayesian generalised ensemble Markov chain Monte Carlo**.
In *19th International Conference on Artificial Intelligence and
Statistics*, Cadiz, Spain, May 2016.

** Abstract:**
Bayesian generalised ensemble (BayesGE) is a new method that addresses two
major drawbacks of standard Markov chain Monte Carlo algorithms for inference
in high-dimensional probability models: inapplicability to estimate the
partition function, and poor mixing properties. BayesGE uses a Bayesian
approach to iteratively update the belief about the density of states
(distribution of the log likelihood under the prior) for the model, with the
dual purpose of enhancing the sampling efficiency and making the estimation of
the partition function tractable. We benchmark BayesGE on Ising and Potts
systems and show that it compares favourably to existing state-of-the-art
methods.

Hong Ge, Yutian Chen, Moquan Wan, and Zoubin Ghahramani.
**Distributed
Inference for Dirichlet Process Mixture Models**.
Volume 37, pages 2276-2284, 07-09 Jul 2015.

** Abstract:** Bayesian
nonparametric mixture models based on the Dirichlet process (DP) have been
widely used for solving problems like clustering, density estimation and
topic modelling. These models make weak assumptions about the underlying
process that generated the observed data. Thus, when more data are collected,
the complexity of these models can change accordingly. These theoretical
properties often lead to superior predictive performance when compared to
traditional finite mixture models. However, despite the increasing amount of
data available, the application of Bayesian nonparametric mixture models is
so far limited to relatively small data sets. In this paper, we propose an
efficient distributed inference algorithm for the DP and the HDP mixture
model. The proposed method is based on a variant of the slice sampler for
DPs. Since this sampler does not involve a pre-determined truncation, the
stationary distribution of the sampling algorithm is unbiased. We provide
both local thread-level and distributed machine-level parallel
implementations and study the performance of this sampler through an
extensive set of experiments on image and text data. When compared to
existing inference algorithms, the proposed method exhibits state-of-the-art
accuracy and strong scalability with up to 512 cores.

Anoop Korattikara, Yutian Chen, and Max Welling.
**Austerity
in MCMC land: Cutting the Metropolis-Hastings budget**.
In *31st International Conference on Machine Learning*, pages 181-189,
Beijing, China, June 2014.

** Abstract:** Can we make Bayesian
posterior MCMC sampling more efficient when faced with very large datasets?
We argue that computing the likelihood for N datapoints in the
Metropolis-Hastings (MH) test to reach a single binary decision is
computationally inefficient. We introduce an approximate MH rule based on a
sequential hypothesis test that allows us to accept or reject samples with
high confidence using only a fraction of the data required for the exact MH
rule. While this method introduces an asymptotic bias, we show that this bias
can be controlled and is more than offset by a decrease in variance due to
our ability to draw more samples per unit of time.

** Comment:** supplementary
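
The sequential-test idea in this abstract can be sketched as follows. Writing l_i for the log-likelihood difference of datapoint i between the proposed and current parameters, the exact MH rule accepts when the mean of all N values l_i exceeds a threshold mu0 that collects the uniform draw, prior, and proposal terms. The sketch grows a mini-batch until a test on the batch mean is confident. The function names, the default `epsilon` and `batch_size`, and the use of a normal approximation in place of a Student-t tail are assumptions made to keep the example stdlib-only.

```python
import math
import random


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def approx_mh_accept(loglik_diffs, mu0, epsilon=0.05, batch_size=50, rng=None):
    """Decide whether mean(loglik_diffs) > mu0 using as few terms as possible.

    loglik_diffs[i] = log p(x_i | proposed) - log p(x_i | current).
    Returns (accept, number_of_datapoints_used).
    """
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    rng = rng or random.Random(0)
    n_total = len(loglik_diffs)
    order = list(range(n_total))
    rng.shuffle(order)  # visit datapoints in random order, without replacement
    n = 0
    total = 0.0
    total_sq = 0.0
    while True:
        for i in order[n:min(n + batch_size, n_total)]:
            total += loglik_diffs[i]
            total_sq += loglik_diffs[i] ** 2
        n = min(n + batch_size, n_total)
        mean = total / n
        if n == n_total:
            return mean > mu0, n  # all data used: this is the exact test
        var = max((total_sq - n * mean * mean) / (n - 1), 0.0)
        # Standard error of the mean with a finite-population correction.
        std_err = math.sqrt(var / n) * math.sqrt(1.0 - (n - 1) / (n_total - 1))
        if std_err == 0.0:
            return mean > mu0, n
        # Normal approximation to the tail probability (a simplification).
        delta = 1.0 - normal_cdf(abs(mean - mu0) / std_err)
        if delta < epsilon:
            return mean > mu0, n


# Toy usage with synthetic log-likelihood differences (hypothetical data).
data_rng = random.Random(1)
diffs = [data_rng.gauss(0.5, 1.0) for _ in range(100_000)]
accepted, used = approx_mh_accept(diffs, mu0=0.0)
print(accepted, used)
```

A smaller `epsilon` makes wrong decisions less likely at the cost of using more data per test, which is the bias and variance trade-off the abstract describes.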

Jes Frellsen, Thomas Hamelryck, and Jesper Ferkinghoff-Borg.
**Combining the
multicanonical ensemble with generative probabilistic models of local
biomolecular structure**.
In *Proceedings of the 59th World Statistics Congress of the
International Statistical Institute*, pages 139-144, Hong Kong,
2014.

** Abstract:** Markov chain Monte Carlo is a powerful
tool for sampling complex systems such as large biomolecular structures.
However, the standard Metropolis-Hastings algorithm suffers from a number of
deficiencies when applied to systems with rugged free-energy landscapes. Some
of these deficiencies can be addressed with the multicanonical ensemble. In
this paper we will present two strategies for applying the multicanonical
ensemble to distributions constructed from generative probabilistic models of
local biomolecular structure. In particular, we will describe how to use the
multicanonical ensemble efficiently in conjunction with the reference ratio
method.

Yue Wu, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Dynamic covariance
models for multivariate financial time series**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** The accurate
prediction of time-changing covariances is an important problem in the
modeling of multivariate financial data. However, some of the most popular
models suffer from a) overfitting problems and multiple local optima, b)
failure to capture shifts in market conditions and c) large computational
costs. To address these problems we introduce a novel dynamic model for
time-changing covariances. Over-fitting and local optima are avoided by
following a Bayesian approach instead of computing point estimates. Changes
in market conditions are captured by assuming a diffusion process in
parameter values, and finally computationally efficient and scalable
inference is performed using particle filters. Experiments with financial
data show excellent performance of the proposed method with respect to
current standard models.

Ferenc Huszár and David Duvenaud.
**Optimally-weighted herding is
Bayesian quadrature**.
In *28th Conference on Uncertainty in Artificial Intelligence*, pages
377-385, Catalina Island, California, July 2012.

** Abstract:** Herding and kernel herding are deterministic methods of
choosing samples which summarise a probability distribution. A related task
is choosing samples for estimating integrals using Bayesian quadrature. We
show that the criterion minimised when selecting samples in kernel herding is
equivalent to the posterior variance in Bayesian quadrature. We then show
that sequential Bayesian quadrature can be viewed as a weighted version of
kernel herding which achieves performance superior to any other weighted
herding method. We demonstrate empirically a rate of convergence faster than
O(1/N). Our results also imply an upper bound on the empirical error of the
Bayesian quadrature estimate.
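
For readers unfamiliar with kernel herding, here is a minimal sketch of the plain, uniformly weighted variant: each new point greedily maximises the kernel mean embedding of the target minus the average kernel similarity to the points already chosen. The RBF kernel, the candidate grid, and the representation of the target by samples are assumptions for illustration; the optimally weighted version analysed in the paper is not implemented here.

```python
import math
import random


def rbf(a, b, lengthscale=1.0):
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def kernel_herding(target_samples, candidates, n_points):
    """Greedily pick n_points from candidates to summarise target_samples."""
    # Empirical kernel mean embedding evaluated at each candidate.
    mean_embed = [
        sum(rbf(c, s) for s in target_samples) / len(target_samples)
        for c in candidates
    ]
    chosen = []
    # Running sum over chosen points of k(candidate, chosen point).
    repulsion = [0.0] * len(candidates)
    for t in range(n_points):
        scores = [
            mean_embed[j] - repulsion[j] / (t + 1)
            for j in range(len(candidates))
        ]
        best = max(range(len(candidates)), key=scores.__getitem__)
        chosen.append(candidates[best])
        for j, c in enumerate(candidates):
            repulsion[j] += rbf(c, candidates[best])
    return chosen


# Toy usage: summarise a standard normal with ten herded points.
rng = random.Random(0)
target = [rng.gauss(0.0, 1.0) for _ in range(2000)]
grid = [-4.0 + 0.05 * i for i in range(161)]
points = kernel_herding(target, grid, n_points=10)
print(sorted(points))
```

The first point lands near the mode, and later points spread out because the repulsion term penalises candidates close to earlier choices.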

Yichuan Zhang, Charles A. Sutton, Amos J. Storkey, and Zoubin Ghahramani.
**Continuous
relaxations for discrete Hamiltonian Monte Carlo**.
In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges,
Léon Bottou, and Kilian Q. Weinberger, editors, *NIPS*, pages
3203-3211, 2012.

** Abstract:** Continuous relaxations play
an important role in discrete optimization, but have not seen much use in
approximate probabilistic inference. Here we show that a general form of the
Gaussian Integral Trick makes it possible to transform a wide class of
discrete variable undirected models into fully continuous systems. The
continuous representation allows the use of gradient-based Hamiltonian Monte
Carlo for inference, results in new ways of estimating normalization
constants (partition functions), and in general opens up a number of new
avenues for inference in difficult discrete systems. We demonstrate some of
these continuous relaxation inference algorithms on a number of illustrative
problems.

Sébastien Bratières, Jurgen Van Gael, Andreas Vlachos, and Zoubin
Ghahramani.
**Scaling the
iHMM: Parallelization versus Hadoop**.
In *Proceedings of the 2010 10th IEEE International Conference on Computer
and Information Technology*, pages 1235-1240, Bradford, UK, 2010. IEEE
Computer Society, doi: 10.1109/CIT.2010.223.

** Abstract:** This paper compares
parallel and distributed implementations of an iterative, Gibbs sampling,
machine learning algorithm. Distributed implementations run under Hadoop on
facility computing clouds. The probabilistic model under study is the
infinite HMM (Beal, Ghahramani and Rasmussen,
2002), in which parameters are learnt using an instance of blocked Gibbs
sampling, with a step consisting of a dynamic program. We apply this model to
learn part-of-speech tags from newswire text in an unsupervised fashion.
However, our focus here is on runtime performance, as opposed to NLP-relevant
scores, embodied by iteration duration, ease of development, deployment and
debugging.

Finale Doshi-Velez and Zoubin Ghahramani.
**Accelerated Gibbs
sampling for the Indian buffet process**.
In Léon Bottou and Michael Littman, editors, *26th International
Conference on Machine Learning*, pages 273-280, Montréal, QC,
Canada, June 2009. Omnipress.

** Abstract:** We often seek to
identify co-occurring hidden features in a set of observations. The Indian
Buffet Process (IBP) provides a non-parametric prior on the features present
in each observation, but current inference techniques for the IBP often scale
poorly. The collapsed Gibbs sampler for the IBP has a running time cubic in
the number of observations, and the uncollapsed Gibbs sampler, while linear,
is often slow to mix. We present a new linear-time collapsed Gibbs sampler
for conjugate likelihood models and demonstrate its efficacy on large
real-world datasets.

Finale Doshi-Velez and Zoubin Ghahramani.
**Accelerated
sampling for the Indian buffet process**.
In Andrea Pohoreckyj Danyluk, Léon Bottou, and Michael L. Littman, editors,
*ICML*, volume 382 of *ACM International Conference Proceeding
Series*, page 35. ACM, 2009.

** Abstract:** We often
seek to identify co-occurring hidden features in a set of observations. The
Indian Buffet Process (IBP) provides a nonparametric prior on the features
present in each observation, but current inference techniques for the IBP
often scale poorly. The collapsed Gibbs sampler for the IBP has a running
time cubic in the number of observations, and the uncollapsed Gibbs sampler,
while linear, is often slow to mix. We present a new linear-time collapsed
Gibbs sampler for conjugate likelihood models and demonstrate its efficacy on
large real-world datasets.

C. Hübler, K. Borgwardt, H.-P. Kriegel, and Z. Ghahramani.
**Metropolis
algorithms for representative subgraph sampling**.
In *Proceedings of 8th IEEE International Conference on Data Mining (ICDM
2008)*, pages 283-292, Pisa, Italy, December 2008. IEEE.
ISSN: 1550-4786.

** Abstract:** While data mining in
chemoinformatics studied graph data with dozens of nodes, systems biology and
the Internet are now generating graph data with thousands and millions of
nodes. Hence data mining faces the algorithmic challenge of coping with this
significant increase in graph size: Classic algorithms for data analysis are
often too expensive and too slow on large graphs.

While one strategy to
overcome this problem is to design novel efficient algorithms, the other is
to 'reduce' the size of the large graph by sampling. This is the scope of
this paper: We will present novel Metropolis algorithms for sampling a
'representative' small subgraph from the original large graph, with
'representative' describing the requirement that the sample shall preserve
crucial graph properties of the original graph. In our experiments, we
improve over the pioneering work of Leskovec and Faloutsos (KDD 2006), by
producing representative subgraph samples that are both smaller and of higher
quality than those produced by other methods from the literature.

Jurgen Van Gael, Yunus Saatçi, Yee-Whye Teh, and Zoubin Ghahramani.
**Beam sampling
for the infinite hidden Markov model**.
In *25th International Conference on Machine Learning*, volume 25,
pages 1088-1095, Helsinki, Finland, 2008. Association for Computing
Machinery.

** Abstract:** The infinite hidden Markov model is
a non-parametric extension of the widely used hidden Markov model. Our paper
introduces a new inference algorithm for the infinite Hidden Markov model
called beam sampling. Beam sampling combines slice sampling, which limits the
number of states considered at each time step to a finite number, with
dynamic programming, which samples whole state trajectories efficiently. Our
algorithm typically outperforms the Gibbs sampler and is more robust. We
present applications of iHMM inference using the beam sampler on changepoint
detection and text prediction problems.

Iain Murray, Zoubin Ghahramani, and David J. C. MacKay.
**MCMC for
doubly-intractable distributions**.
In *UAI*. AUAI Press, 2006.

** Abstract:** Markov Chain
Monte Carlo (MCMC) algorithms are routinely used to draw samples from
distributions with intractable normalization constants. However, standard
MCMC algorithms do not apply to doubly-intractable distributions in which
there are additional parameter-dependent normalization terms; for example,
the posterior over parameters of an undirected graphical model. An ingenious
auxiliary-variable scheme (Møller et al., 2004) offers a solution: exact
sampling (Propp and Wilson, 1996) is used to sample from a
Metropolis-Hastings proposal for which the acceptance probability is
tractable. Unfortunately the acceptance probability of these expensive
updates can be low. This paper provides a generalization of Møller et al.
(2004) and a new MCMC algorithm, which obtains better acceptance
probabilities for the same amount of exact sampling, and removes the need to
estimate model parameters before sampling begins.

Iain Murray, David J. C. MacKay, Zoubin Ghahramani, and John Skilling.
**Nested sampling
for Potts models**.
In *NIPS*, 2005.

** Abstract:** Nested sampling is a new
Monte Carlo method by Skilling intended for general Bayesian computation.
Nested sampling provides a robust alternative to annealing-based methods for
computing normalizing constants. It can also generate estimates of other
quantities such as posterior expectations. The key technical requirement is
an ability to draw samples uniformly from the prior subject to a constraint
on the likelihood. We provide a demonstration with the Potts model, an
undirected graphical model.

Iain Murray and Zoubin Ghahramani.
**Bayesian learning
in undirected graphical models: Approximate MCMC algorithms**.
In David Maxwell Chickering and Joseph Y. Halpern, editors, *UAI*, pages
392-399. AUAI Press, 2004.

** Abstract:** Bayesian learning
in undirected graphical models (computing posterior distributions over
parameters and predictive quantities) is exceptionally difficult. We
conjecture that for general undirected models, there are no tractable MCMC
(Markov Chain Monte Carlo) schemes giving the correct equilibrium
distribution over parameters. While this intractability, due to the partition
function, is familiar to those performing parameter optimisation, Bayesian
learning of posterior distributions over undirected model parameters has been
unexplored and poses novel challenges. We propose several approximate MCMC
schemes and test on fully observed binary models (Boltzmann machines) for a
small coronary heart disease data set and larger artificial systems. While
approximations must perform well on the model, their interaction with the
sampling scheme is also important. Samplers based on variational mean-field
approximations generally performed poorly; more advanced methods using loopy
propagation, brief sampling and stochastic dynamics lead to acceptable
parameter posteriors. Finally, we demonstrate these techniques on a Markov
random field with hidden variables.

Carl Edward Rasmussen and Zoubin Ghahramani.
**Bayesian Monte
Carlo**.
In S. Becker, S. Thrun, and K. Obermayer, editors, *Advances in Neural
Information Processing Systems 15*, pages 489-496, Cambridge, MA, USA,
December 2003. The MIT Press.

** Abstract:** We investigate
Bayesian alternatives to classical Monte Carlo methods for evaluating
integrals. Bayesian Monte Carlo (BMC) allows the incorporation of prior
knowledge, such as smoothness of the integrand, into the estimation. In a
simple problem we show that this outperforms any classical importance
sampling method. We also attempt more challenging multidimensional integrals
involved in computing marginal likelihoods of statistical models (a.k.a.
partition functions and model evidences). We find that Bayesian Monte Carlo
outperformed Annealed Importance Sampling, although for very high dimensional
problems or problems with massive multimodality BMC may be less adequate. One
advantage of the Bayesian approach to Monte Carlo is that samples can be
drawn from any distribution. This allows for the possibility of active design
of sample points so as to maximise information gain.

Carl Edward Rasmussen.
**Gaussian processes to
speed up Hybrid Monte Carlo for expensive Bayesian integrals**.
In *Bayesian Statistics 7*, pages 651-659. Oxford University Press,
2003.

** Abstract:** Hybrid Monte Carlo (HMC) is often the
method of choice for computing Bayesian integrals that are not analytically
tractable. However the success of this method may require a very large number
of evaluations of the (un-normalized) posterior and its partial derivatives.
In situations where the posterior is computationally costly to evaluate, this
may lead to an unacceptable computational load for HMC. I propose to use a
Gaussian Process model of the (log of the) posterior for most of the
computations required by HMC. Within this scheme only occasional evaluation
of the actual posterior is required to guarantee that the samples generated
have exactly the desired distribution, even if the GP model is somewhat
inaccurate. The method is demonstrated on a 10 dimensional problem, where 200
evaluations suffice for the generation of 100 roughly independent points from
the posterior. Thus, the proposed scheme allows Bayesian treatment of models
with posteriors that are computationally demanding, such as models involving
computer simulation.

Carl Edward Rasmussen.
**A practical Monte
Carlo implementation of Bayesian learning**.
In *Advances in Neural Information Processing Systems 8*, pages
598-604, Cambridge, MA, USA, 1996. The MIT Press.

** Abstract:** A practical method for Bayesian training of feed-forward neural
networks using sophisticated Monte Carlo methods is presented and evaluated.
In reasonably small amounts of computer time this approach outperforms other
state-of-the-art methods on 5 data-limited tasks from real world domains.

## Semi-Supervised Learning

Often, it is easy and cheap to obtain large amounts of unlabelled data (e.g. images, text documents), while it is hard or expensive to obtain labelled data. Semi-supervised learning methods attempt to use the unlabelled data to improve the performance on supervised learning tasks, such as classification.

J. von Kügelgen, A. Mey, M. Loog, and B. Schölkopf.
**Semi-supervised
learning, causality, and the conditional cluster assumption**.
In *Proceedings of the 36th International Conference on Uncertainty in
Artificial Intelligence (UAI)*, volume 124 of *Proceedings of Machine
Learning Research*, pages 1-10. PMLR, 2020.
*Also at NeurIPS 2019 Workshop Do the right thing: machine learning and causal
inference for improved decision making.*

** Abstract:** While
the success of semi-supervised learning (SSL) is still not fully understood,
Schölkopf et al. (2012) have established a link to the principle of
independent causal mechanisms. They conclude that SSL should be impossible
when predicting a target variable from its causes, but possible when
predicting it from its effects. Since both these cases are restrictive, we
extend their work by considering classification using cause and effect
features at the same time, such as predicting a disease from both risk
factors and symptoms. While standard SSL exploits information contained in
the marginal distribution of all inputs (to improve the estimate of the
conditional distribution of the target given inputs), we argue that in our
more general setting we should use information in the conditional
distribution of effect features given causal features. We explore how this
insight generalises the previous understanding, and how it relates to and can
be exploited algorithmically for SSL.

Martin Trapp, Tamas Madl, Robert Peharz, Franz Pernkopf, and Robert Trappl.
**Safe
semi-supervised learning of sum-product networks**.
In *33rd Conference on Uncertainty in Artificial Intelligence*, Sydney,
Australia, August 2017.

** Abstract:** In several domains
obtaining class annotations is expensive while at the same time unlabelled
data are abundant. While most semi-supervised approaches enforce restrictive
assumptions on the data distribution, recent work has managed to learn
semi-supervised models in a non-restrictive regime. However, so far such
approaches have only been proposed for linear models. In this work, we
introduce semi-supervised parameter learning for Sum-Product Networks (SPNs).
SPNs are deep probabilistic models admitting inference in linear time in
number of network edges. Our approach has several advantages, as it (1)
allows generative and discriminative semi-supervised learning, (2) guarantees
that adding unlabelled data can increase, but not degrade, the performance
(safe), and (3) is computationally efficient and does not enforce restrictive
assumptions on the data distribution. We show on a variety of data sets that
safe semi-supervised learning with SPNs is competitive compared to
state-of-the-art and can lead to a better generative and discriminative
objective value than a purely supervised approach.

Tomoharu Iwata, James Robert Lloyd, and Zoubin Ghahramani.
**Unsupervised
many-to-many object matching for relational data**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*,
2015.

** Abstract:** We propose a method for unsupervised
many-to-many object matching from multiple networks, which is the task of
finding correspondences between groups of nodes in different networks. For
example, the proposed method can discover shared word groups from
multi-lingual document-word networks without cross-language alignment
information. We assume that multiple networks share groups, and each group
has its own interaction pattern with other groups. Using infinite relational
models with this assumption, objects in different networks are clustered into
common groups depending on their interaction patterns, discovering a
matching. The effectiveness of the proposed method is experimentally
demonstrated by using synthetic and real relational data sets, which include
applications to cross-domain recommendation without shared user/item
identifiers and multi-lingual word clustering.

Alexander G. D. G. Matthews and Zoubin Ghahramani.
**Classification using log
Gaussian Cox processes**.
*arXiv preprint arXiv:1405.4141*, 2014.

** Abstract:**
McCullagh and Yang (2006) suggest a family of classification algorithms
based on Cox processes. We further investigate the log Gaussian variant which
has a number of appealing properties. Conditioned on the covariates, the
distribution over labels is given by a type of conditional Markov random
field. In the supervised case, computation of the predictive probability of a
single test point scales linearly with the number of training points and the
multiclass generalization is straightforward. We show new links between the
supervised method and classical nonparametric methods. We give a detailed
analysis of the pairwise graph representable Markov random field, which we
use to extend the model to semi-supervised learning problems, and propose an
inference method based on graph min-cuts. We give the first experimental
analysis on supervised and semi-supervised datasets and show good empirical
performance.

David Lopez-Paz, José Miguel Hernández-Lobato, and Bernhard Schölkopf.
**Semi-supervised
domain adaptation with non-parametric copulas**.
In *Advances in Neural Information Processing Systems 26*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:** A
new framework based on the theory of copulas is proposed to address
semi-supervised domain adaptation problems. The presented method factorizes
any multivariate density into a product of marginal distributions and
bivariate copula functions. Therefore, changes in each of these factors can
be detected and corrected to adapt a density model across different learning
domains. Importantly, we introduce a novel vine copula model, which allows
for this factorization in a non-parametric manner. Experimental results on
regression problems with real-world data illustrate the efficacy of the
proposed approach when compared to state-of-the-art techniques.

C. Rotsos, J. Van Gael, A.W. Moore, and Z. Ghahramani.
**Traffic
classification in information poor environments**.
In *1st International Workshop on Traffic Analysis and Classification (IWCMC
'10)*, Caen, France, July 2010.

** Abstract:** Traffic
classification using machine learning continues to be an active research
area. The majority of work in this area uses *off-the-shelf* machine
learning tools and treats them as *black-box* classifiers. This approach
turns all the modelling complexity into a feature selection problem. In this
paper, we build a problem-specific solution to the traffic classification
problem by designing a custom probabilistic graphical model. Graphical models
are a modular framework to design classifiers which incorporate
domain-specific knowledge. More specifically, our solution introduces
semi-supervised learning which means we learn from both labelled and
unlabelled traffic flows. We show that our solution performs competitively
compared to previous approaches while using less data and simpler
features.

R. Adams and Zoubin Ghahramani.
**Archipelago:
nonparametric Bayesian semi-supervised learning**.
In Léon Bottou and Michael Littman, editors, *26th International
Conference on Machine Learning*, pages 1-8, Montréal, QC, Canada,
June 2009. Omnipress.

** Abstract:** Semi-supervised learning
(SSL) is classification where additional unlabeled data can be used to
improve accuracy. Generative approaches are appealing in this situation, as a
model of the data's probability density can assist in identifying clusters.
Nonparametric Bayesian methods, while ideal in theory due to their principled
motivations, have been difficult to apply to SSL in practice. We present a
nonparametric Bayesian method that uses Gaussian processes for the generative
model, avoiding many of the problems associated with Dirichlet process
mixture models. Our model is fully generative and we take advantage of recent
advances in Markov chain Monte Carlo algorithms to provide a practical
inference method. Our method compares favorably to competing approaches on
synthetic and real-world multi-class data.

** Comment:** This paper was awarded Honourable Mention for
Best Paper at ICML 2009.

Arik Azran.
**The Rendezvous
algorithm: multiclass semi-supervised learning with Markov random
walks**.
In Zoubin Ghahramani, editor, *24th International Conference on Machine
Learning*, pages 49-56, Corvallis, OR, USA, June 2007. Omnipress.

** Abstract:** We consider the problem of multiclass
classification where both labeled and unlabeled data points are given. We
introduce and demonstrate a new approach for estimating a distribution over
the missing labels where data points are viewed as nodes of a graph, and
pairwise similarities are used to derive a transition probability matrix P
for a Markov random walk between them. The algorithm associates each point
with a particle which moves between points according to P. Labeled points are
set to be absorbing states of the Markov random walk, and the probability of
each particle to be absorbed by the different labeled points, as the number
of steps increases, is then used to derive a distribution over the associated
missing label. A computationally efficient algorithm to implement this is
derived and demonstrated on both real and artificial data sets, including a
numerical comparison with other methods.
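
The absorbing random walk described in this abstract can be illustrated by propagating a particle's distribution under a transition matrix P and accumulating the mass that reaches each labelled (absorbing) point. The toy matrix below is a simple random walk on a five-node path and is an assumption for illustration; how the paper constructs P from pairwise similarities, and its efficient algorithm, are not reproduced.

```python
def absorption_probabilities(P, absorbing, start, n_steps=200):
    """Propagate a particle from `start` under the row-stochastic matrix P.

    `start` is assumed to be an unlabelled (non-absorbing) node.
    Returns a dict: absorbing node -> probability of absorption there
    within n_steps steps.
    """
    n = len(P)
    dist = [0.0] * n
    dist[start] = 1.0
    absorbed = {a: 0.0 for a in absorbing}
    for _ in range(n_steps):
        nxt = [0.0] * n
        for i, mass in enumerate(dist):
            if mass == 0.0:
                continue
            for j in range(n):
                nxt[j] += mass * P[i][j]
        # Mass that reaches a labelled point stays there for good.
        for a in absorbing:
            absorbed[a] += nxt[a]
            nxt[a] = 0.0
        dist = nxt
    return absorbed


# Toy usage: nodes 0 and 4 are labelled, the particle starts at node 1.
P = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
probs = absorption_probabilities(P, absorbing=[0, 4], start=1)
print(probs)
```

Normalising the absorption probabilities per class then gives a distribution over the missing label of the starting point.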

Matthias O. Franz, Younghee Kwon, Carl Edward Rasmussen, and Bernhard
Schölkopf.
**Semi-supervised
kernel regression using whitened function classes**.
In C. E. Rasmussen, H. H. Bülthoff, M. A. Giese, and B. Schölkopf,
editors, *Lecture Notes in Computer Science (LNCS)*, volume 3175,
pages 18-26, Berlin, Germany, 2004. Springer.

** Abstract:**
The use of non-orthonormal basis functions in ridge regression leads to an
often undesired non-isotropic prior in function space. In this study, we
investigate an alternative regularization technique that results in an
implicit whitening of the basis functions by penalizing directions in
function space with a large prior variance. The regularization term is
computed from unlabelled input data that characterizes the input
distribution. Tests on two datasets using polynomial basis functions showed
an improved average performance compared to standard ridge regression.

Xiaojin Zhu, Jaz S. Kandola, Zoubin Ghahramani, and John D. Lafferty.
**Nonparametric
transforms of graph kernels for semi-supervised learning**.
In Sebastian Thrun, Lawrence K. Saul, and Bernhard Schölkopf, editors,
*NIPS*. MIT Press, 2004.

** Abstract:** We present an
algorithm based on convex optimization for constructing kernels for
semi-supervised learning. The kernel matrices are derived from the spectral
decomposition of graph Laplacians, and combine labeled and unlabeled data in
a systematic fashion. Unlike previous work using diffusion kernels and
Gaussian random field kernels, a nonparametric kernel approach is presented
that incorporates order constraints during optimization. This results in
flexible kernels and avoids the need to choose among different parametric
forms. Our approach relies on a quadratically constrained quadratic program
(QCQP), and is computationally feasible for large datasets. We evaluate the
kernels on real datasets using support vector machines, with encouraging
results.

Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty.
**Semi-supervised
learning using Gaussian fields and harmonic functions**.
In Tom Fawcett and Nina Mishra, editors, *ICML*, pages 912-919. AAAI
Press, 2003.

** Abstract:** An approach to semi-supervised
learning is proposed that is based on a Gaussian random field model. Labeled
and unlabeled data are represented as vertices in a weighted graph, with edge
weights encoding the similarity between instances. The learning problem is
then formulated in terms of a Gaussian random field on this graph, where the
mean of the field is characterized in terms of harmonic functions, and is
efficiently obtained using matrix methods or belief propagation. The
resulting learning algorithms have intimate connections with random walks,
electric networks, and spectral graph theory. We discuss methods to
incorporate class priors and the predictions of classifiers obtained by
supervised learning. We also propose a method of parameter learning by
entropy minimization, and show the algorithm's ability to perform feature
selection. Promising experimental results are presented for synthetic data,
digit classification, and text classification tasks.
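
The harmonic property mentioned in the abstract, that each unlabeled vertex takes the weighted average of its neighbours' values, can be illustrated with a small sketch. This is not the authors' code: the function name `harmonic_solution`, the dense weight matrix, and the simple sweep-until-converged iteration are assumptions made here for illustration (the abstract itself mentions matrix methods or belief propagation).

```python
def harmonic_solution(W, labeled, n_sweeps=10_000, tol=1e-12):
    """Approximate the harmonic function on a weighted graph.

    W       : symmetric list-of-lists of non-negative edge weights (no self-loops)
    labeled : dict {vertex index: label value}, held fixed throughout
    Assumes every unlabeled vertex has positive degree and every connected
    component contains at least one labeled vertex.
    """
    n = len(W)
    f = [labeled.get(i, 0.5) for i in range(n)]  # arbitrary start for unlabeled vertices
    unlabeled = [i for i in range(n) if i not in labeled]
    for _ in range(n_sweeps):
        max_change = 0.0
        for i in unlabeled:
            degree = sum(W[i])
            # Harmonic condition: value equals the weighted mean of the neighbours.
            new_value = sum(W[i][j] * f[j] for j in range(n)) / degree
            max_change = max(max_change, abs(new_value - f[i]))
            f[i] = new_value
        if max_change < tol:
            break
    return f


# Chain graph 0 - 1 - 2 - 3 with unit weights; the end points are labeled 0 and 1.
chain_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
f = harmonic_solution(chain_weights, {0: 0.0, 3: 1.0})
print(f)  # interior values approach 1/3 and 2/3
```

On the chain the solution interpolates linearly between the two labels, which matches the electric-network reading of the model: the labeled vertices act as fixed voltages.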

## Non-parametric Bayesian Learning

Non-parametric models are very flexible statistical models in which the complexity of the model grows with the amount of observed data. While traditional parametric models make strong assumptions about how the data was generated, non-parametric models try to make weaker assumptions and let the data "speak for itself". Many non-parametric models can be seen as infinite limits of finite parametric models, and an important family of non-parametric models is derived from Dirichlet processes. See also Gaussian Processes.

Talay M Cheema.
**Contrasting
discrete and continuous methods for Bayesian system identification**.
In *Workshop on Continuous Time Machine Learning at the 39th International
Conference on Machine Learning*, 2022.

** Abstract:** In
recent years, there has been considerable interest in embedding continuous
time methods in machine learning algorithms. In system identification, the
task is to learn a dynamical model from incomplete observation data, and when
prior knowledge is in continuous time – for example, mechanistic
differential equation models – it seems natural to use continuous time
models for learning. Yet when learning flexible, nonlinear, probabilistic
dynamics models, most previous work has focused on discrete time models to
avoid computational, numerical, and mathematical difficulties. In this work
we show, with the aid of small-scale examples, that this mismatch between
model and data generating process can be consequential under certain
circumstances, and we discuss possible modifications to discrete time models
which may better suit them to handling data generated by continuous time
processes.

Vidhi Lalchand, Kenza Tazi, Talay M Cheema, Richard E Turner, and Scott
Hosking.
**Kernel learning for explainable
climate science**.
In *16th Bayesian Modelling Applications Workshop at UAI, 2022*, 2022.

** Abstract:** The Upper Indus Basin, Himalayas provides water
for 270 million people and countless ecosystems. However, precipitation, a
key component to hydrological modelling, is poorly understood in this area. A
key challenge surrounding this uncertainty comes from the complex
spatial-temporal distribution of precipitation across the basin. In this work
we propose Gaussian processes with structured non-stationary kernels to model
precipitation patterns in the UIB. Previous attempts to quantify or model
precipitation in the Hindu Kush Karakoram Himalayan region have often been
qualitative or include crude assumptions and simplifications which cannot be
resolved at lower resolutions. This body of research also provides little to
no error propagation. We account for the spatial variation in precipitation
with a non-stationary Gibbs kernel parameterised with an input dependent
lengthscale. This allows the posterior function samples to adapt to the
varying precipitation patterns inherent in the distinct underlying topography
of the Indus region. The input dependent lengthscale is governed by a latent
Gaussian process with a stationary squared-exponential kernel to allow the
function level hyperparameters to vary smoothly. In ablation experiments we
motivate each component of the proposed kernel by demonstrating its ability
to model the spatial covariance, temporal structure and joint spatio-temporal
reconstruction. We benchmark our model with a stationary Gaussian process and
a deep Gaussian process.
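
As a rough illustration of the non-stationary Gibbs kernel named in the abstract, the sketch below evaluates its standard one-dimensional form with an input-dependent lengthscale. The lengthscale function `example_lengthscale` is purely hypothetical; in the paper the lengthscale is governed by a latent Gaussian process, which is not modelled here.

```python
import math


def gibbs_kernel(x1, x2, lengthscale):
    """One-dimensional Gibbs kernel with input-dependent lengthscale.

    lengthscale: callable returning a strictly positive lengthscale at an input.
    """
    l1, l2 = lengthscale(x1), lengthscale(x2)
    denom = l1 * l1 + l2 * l2
    prefactor = math.sqrt(2.0 * l1 * l2 / denom)
    return prefactor * math.exp(-((x1 - x2) ** 2) / denom)


def example_lengthscale(x):
    # Hypothetical smooth, positive lengthscale: short for x << 0, long for x >> 0.
    return 0.5 + 0.4 * math.tanh(x)


def constant_lengthscale(x):
    # With a constant lengthscale the Gibbs kernel reduces to the
    # stationary squared-exponential kernel.
    return 0.7


k_same_point = gibbs_kernel(0.3, 0.3, example_lengthscale)
k_forward = gibbs_kernel(-1.0, 0.5, example_lengthscale)
k_backward = gibbs_kernel(0.5, -1.0, example_lengthscale)
k_stationary = gibbs_kernel(0.2, 1.0, constant_lengthscale)
```

The prefactor keeps the variance at every input equal to one, so only the correlation length changes across the input space.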

Angus Phillips, Thomas Seror, Michael Hutchinson, Valentin De Bortoli, Arnaud
Doucet, and Emile Mathieu.
**Spectral diffusion
processes**.
In *NeurIPS workshop on Score-Based Methods*, 2022.

** Abstract:** Score-based generative modelling (SGM) has proven
to be a very effective method for modelling densities on finite-dimensional
spaces. In this work we propose to extend this methodology to learn
generative models over functional spaces. To do so, we represent functional
data in spectral space to dissociate the stochastic part of the processes
from their space-time part. Using dimensionality reduction techniques we then
sample from their stochastic component using finite dimensional SGM. We
demonstrate our method's effectiveness for modelling various multimodal
datasets.

Talay M Cheema.
**Understanding
local linearisation in variational Gaussian process state space
models**.
In *Time Series Workshop at the 38th International Conference on Machine
Learning*, 2021.

** Abstract:** We describe variational
inference approaches in Gaussian process state space models in terms of local
linearisations of the approximate posterior function. Most previous
approaches have either assumed independence between the posterior dynamics
and latent states (the mean-field (MF) approximation), or optimised free
parameters for both, leading to limited scalability. We use our framework to
prove that (i) there is a theoretical imperative to use non-MF approaches, to
avoid excessive bias in the process noise hyperparameter estimate, and (ii)
we can parameterise only the posterior dynamics without any loss of
performance. Our approach suggests further approximations, based on the
existing rich literature on filtering and smoothing for nonlinear systems,
and unifies approaches for discrete and continuous time models.

Fergus Simpson, Vidhi Lalchand, and Carl Edward Rasmussen.
**Marginalised
Gaussian Processes with Nested Sampling**.
In *Advances in Neural Information Processing Systems 34*,
volume 34, pages 13613-13625. Curran Associates, Inc., 2021.

** Abstract:** Gaussian Process models are a rich distribution
over functions with inductive biases controlled by a kernel function.
Learning occurs through optimisation of the kernel hyperparameters using the
marginal likelihood as the objective. This work proposes nested sampling as a
means of marginalising kernel hyperparameters, because it is a technique that
is well-suited to exploring complex, multi-modal distributions. We benchmark
against Hamiltonian Monte Carlo on time-series and two-dimensional regression
tasks, finding that a principled approach to quantifying hyperparameter
uncertainty substantially improves the quality of prediction intervals.

Jan-Peter Calliess, Stephen J. Roberts, Carl Edward Rasmussen, and Jan
Maciejowski.
**Lazily adapted constant kinky inference for non-parametric regression and
model-reference adaptive control**.
*Automatica*, 122, 2020. doi:
10.1016/j.automatica.2020.109216.

** Abstract:**
Techniques known as Nonlinear Set Membership prediction or Lipschitz
Interpolation are approaches to supervised machine learning that utilise
presupposed Lipschitz properties to perform inference over unobserved
function values. Provided a bound on the true best Lipschitz constant of the
target function is known a priori, they offer convergence guarantees, as well
as bounds around the predictions. Considering a more general setting that
builds on Lipschitz continuity, we propose an online method for estimating
the Lipschitz constant online from function value observations that are
possibly corrupted by bounded noise. Utilising this as a data-dependent
hyper-parameter gives rise to a nonparametric machine learning method, for
which we establish strong universal approximation guarantees. That is, we
show that our prediction rule can learn any continuous function on compact
support in the limit of increasingly dense data, up to a worst-case error
that can be bounded by the level of observational error. We also consider
applications of our nonparametric regression method to learning-based
control. For a class of discrete-time settings, we establish convergence
guarantees on the closed-loop tracking error of our online learning-based
controllers. To provide evidence that our method can be beneficial not only
in theory but also in practice, we apply it in the context of nonparametric
model-reference adaptive control (MRAC). Across a range of simulated aircraft
roll-dynamics and performance metrics our approach outperforms recently
proposed alternatives that were based on Gaussian processes and RBF-neural
networks.
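
The basic Lipschitz interpolation rule that the abstract builds on can be sketched as follows. This is a simplified, noise-free, one-dimensional illustration and not the paper's method: the function names are hypothetical, the metric is the absolute difference, and the constant is estimated as the largest observed slope without the bounded-noise correction the paper considers.

```python
def estimate_lipschitz_constant(xs, ys):
    """Largest slope between any two observations (noise-free assumption)."""
    best = 0.0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            gap = abs(xs[i] - xs[j])
            if gap > 0:
                best = max(best, abs(ys[i] - ys[j]) / gap)
    return best


def lipschitz_predict(x, xs, ys, lipschitz_constant):
    """Return (prediction, lower bound, upper bound) at the query point x."""
    # Tightest ceiling and floor consistent with the data and the constant.
    upper = min(y + lipschitz_constant * abs(x - xi) for xi, y in zip(xs, ys))
    lower = max(y - lipschitz_constant * abs(x - xi) for xi, y in zip(xs, ys))
    return 0.5 * (upper + lower), lower, upper


# Observations of f(x) = |x| at three inputs.
xs = [-1.0, 0.0, 1.0]
ys = [1.0, 0.0, 1.0]
L_hat = estimate_lipschitz_constant(xs, ys)
inside = lipschitz_predict(0.5, xs, ys, L_hat)
outside = lipschitz_predict(2.0, xs, ys, L_hat)
```

Between observations the floor and ceiling can coincide, while away from the data they spread apart, which is where the bounds around the predictions come from.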

Vidhi Lalchand and Carl Edward Rasmussen.
**Approximate
inference for fully Bayesian Gaussian process regression**.
In *2nd Symposium on Advances in Approximate Bayesian Inference*, pages
1-12. PMLR, 2020.

** Abstract:** Learning in Gaussian Process
models occurs through the adaptation of hyperparameters of the mean and the
covariance function. The classical approach entails maximizing the marginal
likelihood yielding fixed point estimates (an approach called Type II maximum
likelihood or ML-II). An alternative learning procedure is to infer the
posterior over hyper-parameters in a hierarchical specification of GPs we call
Fully Bayesian Gaussian Process Regression (GPR). This work considers two
approximation schemes for the intractable hyperparameter posterior: 1)
Hamiltonian Monte Carlo (HMC) yielding a sampling based approximation and 2)
Variational Inference (VI) where the posterior over hyperparameters is
approximated by a factorized Gaussian (mean-field) or a full-rank Gaussian
accounting for correlations between hyperparameters. We analyse the
predictive performance for fully Bayesian GPR on a range of benchmark data
sets.

Martin Trapp, Robert Peharz, Hong Ge, Franz Pernkopf, and Zoubin Ghahramani.
**Bayesian
learning of sum-product networks**.
In *Advances in Neural Information Processing Systems 33*, Vancouver,
December 2019.

** Abstract:** Sum-product networks (SPNs) are
flexible density estimators and have received significant attention due to
their attractive inference properties. While parameter learning in SPNs is
well developed, structure learning leaves something to be desired: Even
though there is a plethora of SPN structure learners, most of them are
somewhat ad-hoc and based on intuition rather than a clear learning
principle. In this paper, we introduce a well-principled Bayesian framework
for SPN structure learning. First, we decompose the problem into i) laying
out a computational graph, and ii) learning the so-called scope function over
the graph. The first is rather unproblematic and akin to neural network
architecture validation. The second represents the effective structure of the
SPN and needs to respect the usual structural constraints in SPN, i.e.
completeness and decomposability. While representing and learning the scope
function is somewhat involved in general, in this paper, we propose a natural
parametrisation for an important and widely used special case of SPNs. These
structural parameters are incorporated into a Bayesian model, such that
simultaneous structure and parameter learning is cast into monolithic
Bayesian posterior inference. In various experiments, our Bayesian SPNs often
improve test likelihoods over greedy SPN learners. Further, since the
Bayesian framework protects against overfitting, we can evaluate
hyper-parameters directly on the Bayesian model score, waiving the need for a
separate validation set, which is especially beneficial in low data regimes.
Bayesian SPNs can be applied to heterogeneous domains and can easily be
extended to nonparametric formulations. Moreover, our Bayesian approach is
the first, which consistently and robustly learns SPN structures under
missing data.

Maria Lomeli, Mark Rowland, Arthur Gretton, and Zoubin Ghahramani.
**Antithetic and Monte Carlo
kernel estimators for partial rankings**.
*arXiv preprint arXiv:1807.00400*, 2018.

** Abstract:**
In the modern age, rankings data is ubiquitous and it is useful for a variety
of applications such as recommender systems, multi-object tracking and
preference learning. However, most rankings data encountered in the real
world is incomplete, which prevents the direct application of existing
modelling tools for complete rankings. Our contribution is a novel way to
extend kernel methods for complete rankings to partial rankings, via
consistent Monte Carlo estimators for Gram matrices: matrices of kernel
values between pairs of observations. We also present a novel variance
reduction scheme based on an antithetic variate construction between
permutations to obtain an improved estimator for the Mallows kernel. The
corresponding antithetic kernel estimator has lower variance and we
demonstrate empirically that it has a better performance in a variety of
Machine Learning tasks. Both kernel estimators are based on extending kernel
mean embeddings to the embedding of a set of full rankings consistent with an
observed partial ranking. They form a computationally tractable alternative
to previous approaches for partial rankings data. An overview of the existing
kernels and metrics for permutations is also provided.

Ben Bloem-Reddy, Emile Mathieu, Adam Foster, Tom Rainforth, Hong Ge, Maria
Lomeli, and Zoubin Ghahramani.
**Sampling and inference
for discrete random probability measures in probabilistic programs**.
In *NIPS workshop on Advances in Approximate Inference*,
California, United States, December 2017.

** Abstract:** We
consider the problem of sampling a sequence from a discrete random
probability measure (RPM) with countable support, under (probabilistic)
constraints of finite memory and computation. A canonical example is sampling
from the Dirichlet Process, which can be accomplished using its well-known
stick-breaking representation and lazy initialization of its atoms. We show
that efficient lazy initialization is possible if and only if a size-biased
representation of the discrete RPM is known. For models constructed from such
discrete RPMs, we consider the implications for generic particle-based
inference methods in probabilistic programming systems. To demonstrate, we
implement posterior inference for Normalized Inverse Gaussian Process mixture
models in Turing.
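
The canonical example in the abstract, sampling from a Dirichlet process through stick-breaking with lazily instantiated atoms, might look roughly like the sketch below. It is written in plain Python rather than Turing, and the class name, the `base_sampler` argument and the numerical guard are assumptions made for this illustration.

```python
import random


class LazyStickBreakingDP:
    """Draws from a Dirichlet process, creating sticks and atoms only on demand."""

    def __init__(self, alpha, base_sampler, seed=0):
        self.alpha = alpha                # concentration parameter
        self.base_sampler = base_sampler  # callable rng -> atom from the base measure
        self.rng = random.Random(seed)
        self.weights = []                 # stick lengths instantiated so far
        self.atoms = []                   # atom attached to each stick
        self.remaining = 1.0              # mass of the still-unbroken stick

    def _break_stick(self):
        fraction = self.rng.betavariate(1.0, self.alpha)
        self.weights.append(fraction * self.remaining)
        self.atoms.append(self.base_sampler(self.rng))
        self.remaining *= 1.0 - fraction

    def sample(self):
        u = self.rng.random()
        cumulative = 0.0
        k = 0
        while True:
            if k == len(self.weights):
                self._break_stick()       # lazy initialization of a new atom
            cumulative += self.weights[k]
            # The second condition is a numerical guard against a vanishing stick.
            if u < cumulative or self.remaining < 1e-15:
                return self.atoms[k]
            k += 1


dp = LazyStickBreakingDP(alpha=2.0, base_sampler=lambda rng: rng.gauss(0.0, 1.0), seed=1)
draws = [dp.sample() for _ in range(200)]
```

Only the sticks that a uniform draw actually reaches are ever created, so memory grows with the number of distinct atoms seen rather than with the infinite support.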

Daniel Limon, Jan-Peter Calliess, and Jan Maciejowski.
**Learning-based nonlinear model predictive control**.
In *IFAC 2017 World Congress*, Toulouse, France, July 2017. doi:
10.1016/j.ifacol.2017.08.1050.

** Abstract:** This paper
presents stabilizing Model Predictive Controllers (MPC) in which prediction
models are inferred from experimental data of the inputs and outputs of the
plant. Using a nonparametric machine learning technique called LACKI, the
estimated (possibly nonlinear) model function together with an estimate of
the Hölder constant is provided. Based on these, a number of predictive
controllers with stability guaranteed by design are proposed. Firstly, the
case when the prediction model is estimated off-line is considered, and
robust stability and recursive feasibility are ensured by using tightened
constraints in the optimisation problem. This controller has been extended to
the more interesting and complex case: the online learning of the model,
where the new data collected from feedback is added to enhance the prediction
model. An on-line learning MPC based on a double sequence of predictions is
proposed. Stability of the online learning MPC is proved. These controllers
are illustrated by simulation.

Jan-Peter Calliess.
**Lipschitz
optimisation for Lipschitz interpolation**.
In *2017 American Control Conference (ACC 2017)*, Seattle, WA, USA, May
2017.

** Abstract:** Techniques known as Nonlinear Set
Membership prediction, Kinky Inference or Lipschitz Interpolation are fast
and numerically robust approaches to nonparametric machine learning that have
been proposed to be utilised in the context of system identification and
learning-based control. They utilise presupposed Lipschitz properties in
order to compute inferences over unobserved function values. Unfortunately,
most of these approaches rely on exact knowledge about the input space metric
as well as about the Lipschitz constant. Furthermore, existing techniques to
estimate the Lipschitz constants from the data are not robust to noise or
seem to be ad-hoc and typically are decoupled from the ultimate learning and
prediction task. To overcome these limitations, we propose an approach for
optimising parameters of the presupposed metrics by minimising validation set
prediction errors. To avoid poor performance due to local minima, we propose
to utilise Lipschitz properties of the optimisation objective to ensure
global optimisation success. The resulting approach is a new flexible method
for nonparametric black-box learning. We illustrate its competitiveness on a
set of benchmark problems.

Maria Lomeli, Stefano Favaro, and Yee Whye Teh.
**A
marginal sampler for sigma-Stable Poisson-Kingman mixture models**.
*Journal of Computational and Graphical Statistics*, 26:44-53, 2017.

** Abstract:** We investigate the class of sigma-stable
Poisson-Kingman random probability measures (RPMs) in the context of Bayesian
nonparametric mixture modeling. This is a large class of discrete RPMs, which
encompasses most of the popular discrete RPMs used in Bayesian
nonparametrics, such as the Dirichlet process, Pitman-Yor process, the
normalized inverse Gaussian process, and the normalized generalized Gamma
process. We show how certain sampling properties and marginal
characterizations of sigma-stable Poisson-Kingman RPMs can be usefully
exploited for devising a Markov chain Monte Carlo (MCMC) algorithm for
performing posterior inference with a Bayesian nonparametric mixture model.
Specifically, we introduce a novel and efficient MCMC sampling scheme in an
augmented space that has a small number of auxiliary variables per iteration.
We apply our sampling scheme to density estimation and clustering tasks
with unidimensional and multidimensional datasets, and compare it against
competing MCMC sampling schemes. Supplementary materials for this article are
available online.

Maria Lomeli.
**General Bayesian inference
schemes in infinite mixture models**.
PhD thesis, University College London, Gatsby Unit, London, UK, 2017.

** Abstract:** Bayesian statistical models allow us to formalise
our knowledge about the world and reason about our uncertainty, but there is
a need for better procedures to accurately encode its complexity. One way to
do so is through compositional models, which are formed by combining blocks
consisting of simpler models. One can increase the complexity of the
compositional model by either stacking more blocks or by using a
not-so-simple model as a building block. This thesis is an example of the
latter. One first aim is to expand the choice of Bayesian nonparametric (BNP)
blocks for constructing tractable compositional models. So far, most of the
models that have a Bayesian nonparametric component use a Dirichlet Process
or a Pitman-Yor process because of the availability of tractable and compact
representations. This thesis shows how to overcome certain intractabilities
in order to obtain analogous compact representations for the class of
Poisson-Kingman priors which includes the Dirichlet and Pitman-Yor processes.
A major impediment to the widespread use of Bayesian nonparametric building
blocks is that inference is often costly, intractable or difficult to carry
out. This is an active research area since dealing with the model's infinite
dimensional component forbids the direct use of standard simulation-based
methods. The main contribution of this thesis is a variety of inference
schemes that tackle this problem: Markov chain Monte Carlo and Sequential
Monte Carlo methods, which are exact inference schemes since they target the
true posterior. The contributions of this thesis, in a larger context,
provide general purpose exact inference schemes in the flavour of
probabilistic programming: the user is able to choose from a variety of
models, focusing only on the modelling part. Indeed, if the wide enough class
of Poisson-Kingman priors is used as one of our blocks, this objective is
achieved.

Matej Balog, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M. Roy, and
Yee Whye Teh.
**The
Mondrian kernel**.
In *32nd Conference on Uncertainty in Artificial Intelligence*, pages
32-41, Jersey City, New Jersey, USA, June 2016.

** Abstract:** We introduce the Mondrian kernel, a fast random feature
approximation to the Laplace kernel. It is suitable for both batch and online
learning, and admits a fast kernel-width-selection procedure as the random
features can be re-used efficiently for all kernel widths. The features are
constructed by sampling trees via a Mondrian process [Roy and Teh, 2009], and
we highlight the connection to Mondrian forests [Lakshminarayanan et al.,
2014], where trees are also sampled via a Mondrian process, but fit
independently. This link provides a new insight into the relationship between
kernel methods and random forests.


Alexander G D G Matthews, James Hensman, Richard E. Turner, and Zoubin
Ghahramani.
**On Sparse Variational methods
and the Kullback-Leibler divergence between stochastic processes**.
In *19th International Conference on Artificial Intelligence and
Statistics*, Cadiz, Spain, May 2016.

** Abstract:** The
variational framework for learning inducing variables (Titsias, 2009a) has
had a large impact on the Gaussian process literature. The framework may be
interpreted as minimizing a rigorously defined Kullback-Leibler divergence
between the approximating and posterior processes. To our knowledge this
connection has thus far gone unremarked in the literature. In this paper we
give a substantial generalization of the literature on this topic. We give a
new proof of the result for infinite index sets which allows inducing points
that are not data points and likelihoods that depend on all function values.
We then discuss augmented index sets and show that, contrary to previous
works, marginal consistency of augmentation is not enough to guarantee
consistency of variational inference with the original model. We then
characterize an extra condition where such a guarantee is obtainable. Finally
we show how our framework sheds light on interdomain sparse approximations
and sparse approximations for Cox processes.

Jan-Peter Calliess.
**Lazily adapted constant kinky
inference for nonparametric regression and model-reference adaptive
control**.
*arXiv*, arXiv:1701.00178, 2016.

** Abstract:**
Techniques known as Nonlinear Set Membership prediction, Lipschitz
Interpolation or Kinky Inference are approaches to machine learning that
utilise presupposed Lipschitz properties to compute inferences over
unobserved function values. Provided a bound on the true best Lipschitz
constant of the target function is known a priori they offer convergence
guarantees as well as bounds around the predictions. Considering a more
general setting that builds on Hölder continuity relative to
pseudo-metrics, we propose an online method for estimating the Hölder
constant online from function value observations that possibly are corrupted
by bounded observational errors. Utilising this to compute adaptive
parameters within a kinky inference rule gives rise to a nonparametric
machine learning method, for which we establish strong universal
approximation guarantees. That is, we show that our prediction rule can learn
any continuous function in the limit of increasingly dense data to within a
worst-case error bound that depends on the level of observational
uncertainty. We apply our method in the context of nonparametric
model-reference adaptive control (MRAC). Across a range of simulated aircraft
roll-dynamics and performance metrics our approach outperforms recently
proposed alternatives that were based on Gaussian processes and RBF-neural
networks. For discrete-time systems, we provide stability guarantees for our
learning-based controllers both for the batch and the online learning
setting.

Jan-Peter Calliess.
**Bayesian
Lipschitz constant estimation and quadrature**.
In *Workshop on Probabilistic Integration, NIPS*, Montreal, Canada,
December 2015.

** Abstract:** Lipschitz quadrature methods
provide an approach to one-dimensional numerical integration on bounded
domains. On the basis of the assumption that the integrand is Lipschitz
continuous with a known Lipschitz constant, these quadrature rules can
provide a tight error bound around their integral estimates and utilise the
Lipschitz constant to guide exploration in the context of adaptive
quadrature. In this paper, we outline our ongoing work on extending this
approach to settings where the Lipschitz constant is probabilistically
uncertain. As the key component, we introduce a Bayesian approach for
updating a subjectively probabilistic belief of the Lipschitz constant.
Combined with any Lipschitz quadrature rule, we obtain an approach for
translating a sample into an integral estimate with probabilistic uncertainty
intervals. The paper concludes with an illustration of the approach followed
by a discussion of open issues and future work.

James Hensman, Alexander G D G Matthews, Maurizio Filippone, and Zoubin
Ghahramani.
**MCMC
for Variationally Sparse Gaussian Processes**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2015.

** Abstract:** Gaussian
process (GP) models form a core part of probabilistic machine learning.
Considerable research effort has been made into attacking three issues with
GP models: how to compute efficiently when the number of data is large; how
to approximate the posterior when the likelihood is not Gaussian and how to
estimate covariance function parameter posteriors. This paper simultaneously
addresses these, using a variational approximation to the posterior which is
sparse in support of the function but otherwise free-form. The result is a
Hybrid Monte-Carlo sampling scheme which allows for a non-Gaussian
approximation over the function values and covariance parameters
simultaneously, with efficient computations based on inducing-point sparse
GPs. Code to replicate each experiment in this paper will be available
shortly.

Maria Lomeli, Stefano Favaro, and Yee Whye Teh.
**A
hybrid sampler for Poisson-Kingman mixture models**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2015.

** Abstract:** This paper
concerns the introduction of a new Markov Chain Monte Carlo scheme for
posterior sampling in Bayesian nonparametric mixture models with priors that
belong to the general Poisson-Kingman class. We present a novel and compact
way of representing the infinite dimensional component of the model such that
while explicitly representing this infinite component it has less memory and
storage requirements than previous MCMC schemes. We describe comparative
simulation results demonstrating the efficacy of the proposed MCMC algorithm
against existing marginal and conditional MCMC samplers.

James Robert Lloyd and Zoubin Ghahramani.
**Statistical
model criticism using kernel two sample tests**.
In *Advances in Neural Information Processing Systems 29*, pages 1-9,
Montreal, Canada, December 2015.

** Abstract:** We propose an
exploratory approach to statistical model criticism using maximum mean
discrepancy (MMD) two sample tests. Typical approaches to model criticism
require a practitioner to select a statistic by which to measure
discrepancies between data and a statistical model. MMD two sample tests are
instead constructed as an analytic maximisation over a large space of
possible statistics and therefore automatically select the statistic which
most shows any discrepancy. We demonstrate on synthetic data that the
selected statistic, called the witness function, can be used to identify
where a statistical model most misrepresents the data it was trained on. We
then apply the procedure to real data where the models being assessed are
restricted Boltzmann machines, deep belief networks and Gaussian process
regression and demonstrate the ways in which these models fail to capture the
properties of the data they are trained on.
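
The witness function described in the abstract is, up to normalisation, the difference between the kernel mean embeddings of the two samples. The sketch below is a minimal one-dimensional illustration with a Gaussian kernel; the function names, the bandwidth, and the toy samples are assumptions and do not reproduce the paper's experiments.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def witness(t, data, model_samples, bandwidth=1.0):
    """Unnormalised witness: positive where data is denser than model samples."""
    mean_data = sum(gaussian_kernel(t, x, bandwidth) for x in data) / len(data)
    mean_model = sum(gaussian_kernel(t, y, bandwidth) for y in model_samples) / len(model_samples)
    return mean_data - mean_model


def mmd_squared_biased(data, model_samples, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared maximum mean discrepancy."""
    def mean_kernel(first, second):
        total = sum(gaussian_kernel(a, b, bandwidth) for a in first for b in second)
        return total / (len(first) * len(second))

    return (mean_kernel(data, data)
            - 2.0 * mean_kernel(data, model_samples)
            + mean_kernel(model_samples, model_samples))


# Toy example: the "model" places its samples far from the observed data.
data = [0.0, 0.1, -0.1]
model_samples = [3.0, 3.1, 2.9]
w_at_data = witness(0.0, data, model_samples)
w_at_model = witness(3.0, data, model_samples)
mmd2 = mmd_squared_biased(data, model_samples)
mmd2_self = mmd_squared_biased(data, data)
```

Evaluating the witness over a grid of inputs shows where the two samples disagree most, which is how it can point to the regions a model misrepresents.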

James Hensman, Alexander G D G Matthews, and Zoubin Ghahramani.
**Scalable
Variational Gaussian Process Classification**.
In *18th International Conference on Artificial Intelligence and
Statistics*, pages 1-9, San Diego, California, USA, May 2015.

** Abstract:** Gaussian process classification is a popular
method with a number of appealing properties. We show how to scale the model
within a variational inducing point framework, out-performing the state of
the art on benchmark datasets. Importantly, the variational formulation can be
exploited to allow classification in problems with millions of data points,
as we demonstrate in experiments.

Nilesh Tripuraneni, Shixiang Gu, Hong Ge, and Zoubin Ghahramani.
**Particle
Gibbs for infinite hidden Markov models**.
In *Advances in Neural Information Processing Systems 29*, Montreal,
Canada, May 2015.

** Abstract:** Infinite Hidden Markov Models
(iHMMs) are an attractive, nonparametric generalization of the classical
Hidden Markov Model which can automatically infer the number of hidden states
in the system. However, due to the infinite-dimensional nature of the
transition dynamics, performing inference in the iHMM is difficult. In this
paper, we present an infinite-state Particle Gibbs (PG) algorithm to
resample state trajectories for the iHMM. The proposed algorithm uses an
efficient proposal optimized for iHMMs and leverages ancestor sampling to
improve the mixing of the standard PG algorithm. Our algorithm demonstrates
significant convergence improvements on synthetic and real world data
sets.

Stefano Favaro, Maria Lomeli, and Yee Whye Teh.
**On a
class of sigma-Stable Poisson-Kingman models and an effective marginalised
sampler**.
*Statistics and Computing*, 25:67-78, 2015.

** Abstract:** We investigate the use of a large class of discrete random
probability measures, which is referred to as the class Q, in the context
of Bayesian nonparametric mixture modeling. The class Q encompasses both
the two-parameter Poisson-Dirichlet process and the normalized generalized
Gamma process, thus allowing us to comparatively study the inferential
advantages of these two well-known nonparametric priors. Apart from a highly
flexible parameterization, the distinguishing feature of the class Q is the
availability of a tractable posterior distribution. This feature, in turn,
leads to an efficient marginal MCMC algorithm for posterior sampling
within the framework of mixture models. We demonstrate the efficacy of our
modeling framework on both one-dimensional and multi-dimensional
datasets.

Yarin Gal, Yutian Chen, and Zoubin Ghahramani.
**Latent
Gaussian processes for distribution estimation of multivariate categorical
data**.
In *Proceedings of the 32nd International Conference on Machine Learning
(ICML-15)*, pages 645-654, 2015.

** Abstract:**
Multivariate categorical data occur in many applications of machine learning.
One of the main difficulties with these vectors of categorical variables is
sparsity. The number of possible observations grows exponentially with vector
length, but dataset diversity might be poor in comparison. Recent models have
gained significant improvement in supervised tasks with this data. These
models embed observations in a continuous space to capture similarities
between them. Building on these ideas we propose a Bayesian model for the
unsupervised task of distribution estimation of multivariate categorical
data. We model vectors of categorical variables as generated from a
non-linear transformation of a continuous latent space. Non-linearity
captures multi-modality in the distribution. The continuous representation
addresses sparsity. Our model ties together many existing models, linking the
linear categorical latent Gaussian model, the Gaussian process latent
variable model, and Gaussian process classification. We derive inference for
our model based on recent developments in sampling based variational
inference. We show empirically that the model outperforms its linear and
discrete counterparts in imputation tasks of sparse data.

Zoubin Ghahramani.
**Probabilistic
machine learning and artificial intelligence**.
*Nature*, 521:452–459, 2015.
doi:10.1038/nature14541.

** Abstract:** How can a machine
learn from experience? Probabilistic modelling provides a framework for
understanding what learning is, and has therefore emerged as one of the
principal theoretical and practical approaches for designing machines that
learn from data acquired through experience. The probabilistic framework,
which describes how to represent and manipulate uncertainty about models and
predictions, has a central role in scientific data analysis, machine
learning, robotics, cognitive science and artificial intelligence. This
Review provides an introduction to this framework, and discusses some of the
state-of-the-art advances in the field, namely, probabilistic programming,
Bayesian optimization, data compression and automatic model discovery.

Yarin Gal and Richard Turner.
**Improving the
Gaussian process sparse spectrum approximation by representing uncertainty
in frequency inputs**.
In *Proceedings of the 32nd International Conference on Machine Learning
(ICML-15)*, pages 655-664, 2015.

** Abstract:** Standard
sparse pseudo-input approximations to the Gaussian process (GP) cannot handle
complex functions well. Sparse spectrum alternatives attempt to answer this
but are known to over-fit. We suggest the use of variational inference for
the sparse spectrum approximation to avoid both issues. We model the
covariance function with a finite Fourier series approximation and treat it
as a random variable. The random covariance function has a posterior, on
which a variational distribution is placed. The variational distribution
transforms the random covariance function to fit the data. We study the
properties of our approximate inference, compare it to alternative ones, and
extend it to the distributed and stochastic domains. Our approximation
captures complex functions better than standard approaches and avoids
over-fitting.
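
As a rough illustration of the finite Fourier approximation mentioned in the abstract, the sketch below builds plain sparse spectrum (random trigonometric) features and checks that they approximate a squared-exponential kernel. It does not implement the variational treatment of the frequencies proposed in the paper; the function names, the choice of kernel, and all parameter values are assumptions made for this example.

```python
import numpy as np

def sparse_spectrum_features(x, freqs, signal_var=1.0):
    """Map inputs x (N, D) to 2M trigonometric features for frequencies freqs (M, D)."""
    proj = 2.0 * np.pi * x @ freqs.T                  # (N, M) phase for each frequency
    scale = np.sqrt(signal_var / freqs.shape[0])      # so that phi(x) . phi(x) = signal_var
    return scale * np.hstack([np.cos(proj), np.sin(proj)])

def rbf_kernel(x1, x2, lengthscale=1.0, signal_var=1.0):
    """Exact squared-exponential kernel, used here only as a reference."""
    sq = np.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(0)
lengthscale = 1.0
num_freqs = 5000  # hypothetical value; more frequencies give a closer approximation

# The spectral density of the squared-exponential kernel is Gaussian with
# standard deviation 1 / (2 * pi * lengthscale), so frequencies are drawn from it.
freqs = rng.normal(scale=1.0 / (2.0 * np.pi * lengthscale), size=(num_freqs, 1))

x = np.linspace(-3.0, 3.0, 20)[:, None]
phi = sparse_spectrum_features(x, freqs)
K_approx = phi @ phi.T                                # finite-feature kernel estimate
K_exact = rbf_kernel(x, x, lengthscale)
max_err = np.max(np.abs(K_approx - K_exact))
```

Fixing the frequencies, as done here, is the point-estimate setting; the abstract's proposal is instead to treat them as uncertain and place a variational distribution over them.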

Tomoharu Iwata, James Robert Lloyd, and Zoubin Ghahramani.
**Unsupervised
many-to-many object matching for relational data**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*,
2015.

** Abstract:** We propose a method for unsupervised
many-to-many object matching from multiple networks, which is the task of
finding correspondences between groups of nodes in different networks. For
example, the proposed method can discover shared word groups from
multi-lingual document-word networks without cross-language alignment
information. We assume that multiple networks share groups, and each group
has its own interaction pattern with other groups. Using infinite relational
models with this assumption, objects in different networks are clustered into
common groups depending on their interaction patterns, discovering a
matching. The effectiveness of the proposed method is experimentally
demonstrated by using synthetic and real relational data sets, which include
applications to cross-domain recommendation without shared user/item
identifiers and multi-lingual word clustering.

James Robert Lloyd.
**Representation,
learning, description and criticism of probabilistic models with applications
to networks, functions and relational data**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2015.

** Abstract:** This thesis makes contributions to a
variety of aspects of probabilistic inference. When performing probabilistic
inference, one must first represent one’s beliefs with a probability
distribution. Specifying the details of a probability distribution can be a
difficult task in many situations, but when expressing beliefs about complex
data structures it may not even be apparent what form such a distribution
should take. This thesis starts by demonstrating how representation theorems
due to Aldous, Hoover and Kallenberg can be used to specify appropriate
models for data in the form of networks. These theorems are then extended in
order to reveal appropriate probability distributions for arbitrary
relational data or databases. A simpler data structure to specify probability
distributions for is that of functions; many probability distributions for
functions have been used for centuries. We demonstrate that many of these
distributions can be expressed in a common language of Gaussian process
kernels constructed from a few base elements and operators. The structure of
this language allows for the effective automatic construction of
probabilistic models for functions. Furthermore, the formal mathematical
language of kernels can be mapped neatly onto natural language allowing for
automatic descriptions of the automatically constructed models. By further
automating the construction of statistical models, the need to be able to
effectively check or criticise these models becomes greater. This thesis
demonstrates how kernel two sample tests can be used to demonstrate where a
probabilistic model most disagrees with data allowing for targeted
improvements to the model. In proposing a new method of model criticism this
thesis also briefly discusses the philosophy of model criticism within the
context of probabilistic inference.

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani.
**Beta diffusion trees and
hierarchical feature allocations**.
Technical report, Dept. of Engineering, University of Cambridge, August 2014.

** Abstract:** We define the beta diffusion tree, a random tree
structure with a set of leaves that defines a collection of overlapping
subsets of objects, known as a feature allocation. A generative process for
the tree structure is defined in terms of particles (representing the
objects) diffusing in some continuous space, analogously to the Dirichlet
diffusion tree (Neal, 2003b), which defines a tree structure over partitions
(i.e., non-overlapping subsets) of the objects. Unlike in the Dirichlet
diffusion tree, multiple copies of a particle may exist and diffuse along
multiple branches in the beta diffusion tree, and an object may therefore
belong to multiple subsets of particles. We demonstrate how to build a
hierarchically-clustered factor analysis model with the beta diffusion tree
and how to perform inference over the random tree structures with a Markov
chain Monte Carlo algorithm. We conclude with several numerical experiments
on missing data problems with data sets of gene expression microarrays,
international development statistics, and intranational socioeconomic
measurements.

James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, and
Zoubin Ghahramani.
**Automatic construction and
natural-language description of nonparametric regression models**.
In *Association for the Advancement of Artificial Intelligence (AAAI)*,
July 2014.

** Abstract:** This paper presents the beginnings
of an automatic statistician, focusing on regression problems. Our system
explores an open-ended space of statistical models to discover a good
explanation of a data set, and then produces a detailed report with figures
and natural-language text. Our approach treats unknown regression functions
nonparametrically using Gaussian processes, which has two important
consequences. First, Gaussian processes can model functions in terms of
high-level properties (e.g. smoothness, trends, periodicity, changepoints).
Taken together with the compositional structure of our language of models
this allows us to automatically describe functions in simple terms. Second,
the use of flexible nonparametric models and a rich language for composing
them in an open-ended manner also results in state-of-the-art extrapolation
performance evaluated over 13 real time series data sets from various
domains.

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani.
**Beta
diffusion trees**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We define the beta diffusion tree, a
random tree structure with a set of leaves that defines a collection of
overlapping subsets of objects, known as a feature allocation. The generative
process for the tree is defined in terms of particles (representing the
objects) diffusing in some continuous space, analogously to the Dirichlet and
Pitman–Yor diffusion trees (Neal, 2003b; Knowles & Ghahramani, 2011), both
of which define tree structures over clusters of the particles. With the beta
diffusion tree, however, multiple copies of a particle may exist and diffuse
to multiple locations in the continuous space, resulting in (a random number
of) possibly overlapping clusters of the objects. We demonstrate how to build
a hierarchically-clustered factor analysis model with the beta diffusion tree
and how to perform inference over the random tree structures with a Markov
chain Monte Carlo algorithm. We conclude with several numerical experiments
on missing data problems with data sets of gene expression arrays,
international development statistics, and intranational socioeconomic
measurements.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani.
**A reversible
infinite HMM using normalised random measures**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We present a nonparametric prior
over reversible Markov chains. We use completely random measures,
specifically gamma processes, to construct a countably infinite graph with
weighted edges. By enforcing symmetry to make the edges undirected we define
a prior over random walks on graphs that results in a reversible Markov
chain. The resulting prior over infinite transition matrices is closely
related to the hierarchical Dirichlet process but enforces reversibility. A
reinforcement scheme has recently been proposed with similar properties, but
the de Finetti measure is not well characterised. We take the alternative
approach of explicitly constructing the mixing measure, which allows more
straightforward and efficient inference at the cost of no longer having a
closed form predictive distribution. We use our process to construct a
reversible infinite HMM which we apply to two real datasets, one from
epigenomics and one ion channel recording.

David Duvenaud, Oren Rippel, Ryan P. Adams, and Zoubin Ghahramani.
**Avoiding pathologies in very
deep networks**.
In *17th International Conference on Artificial Intelligence and
Statistics*, Reykjavik, Iceland, April 2014.

**
Abstract:** Choosing appropriate architectures and regularization
strategies for deep networks is crucial to good predictive performance. To
shed light on this problem, we analyze the analogous problem of constructing
useful priors on compositions of functions. Specifically, we study the deep
Gaussian process, a type of infinitely-wide, deep neural network. We show
that in standard architectures, the representational capacity of the network
tends to capture fewer degrees of freedom as the number of layers increases,
retaining only a single degree of freedom in the limit. We propose an
alternate network architecture which does not suffer from this pathology. We
also examine deep covariance functions, obtained by composing infinitely many
feature transforms. Lastly, we characterize the class of models obtained by
performing dropout on Gaussian processes.

Creighton Heaukulani and Daniel M. Roy.
**The combinatorial structure of beta
negative binomial processes**.
Technical report, Dept. of Engineering, University of Cambridge, March 2014.

** Abstract:** We characterize the combinatorial structure of
conditionally-i.i.d. sequences of negative binomial processes with a common
beta process base measure. In Bayesian nonparametric applications, such
processes have served as models for unknown multisets of a measurable space.
Previous work has characterized random subsets arising from
conditionally-i.i.d. sequences of Bernoulli processes with a common beta
process base measure. In this case, the combinatorial structure is described
by the Indian buffet process. Our results give a count analogue of the Indian
buffet process, which we call a negative binomial Indian buffet process. As
an intermediate step toward this goal, we provide constructions for the beta
negative binomial process that avoid a representation of the underlying beta
process base measure.

Sébastien Bratières, Novi Quadrianto, Sebastian Nowozin, and Zoubin
Ghahramani.
**Scalable
Gaussian Process structured prediction for grid factor graph
applications**.
In *31st International Conference on Machine Learning*, 2014.

** Abstract:** Structured prediction is an important and well
studied problem with many applications across machine learning. GPstruct is a
recently proposed structured prediction model that offers appealing
properties such as being kernelised, non-parametric, and supporting Bayesian
inference (Bratières et al. 2013). The model places a Gaussian process prior
over energy functions which describe relationships between input variables
and structured output variables. However, the memory demand of GPstruct is
quadratic in the number of latent variables and training runtime scales
cubically. This prevents GPstruct from being applied to problems involving
grid factor graphs, which are prevalent in computer vision and spatial
statistics applications. Here we explore a scalable approach to learning
GPstruct models based on ensemble learning, with weak learners (predictors)
trained on subsets of the latent variables and bootstrap data, which can
easily be distributed. We show experiments with 4M latent variables on image
segmentation. Our method outperforms widely-used conditional random field
models trained with pseudo-likelihood. Moreover, in image segmentation
problems it improves over recent state-of-the-art marginal optimisation
methods in terms of predictive performance and uncertainty calibration.
Finally, it generalises well on all training set sizes.

Roger Frigola, Yutian Chen, and Carl Edward Rasmussen.
**Variational
Gaussian process state-space models**.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 27*,
2014.

** Abstract:** State-space models have been successfully
used for more than fifty years in different areas of science and engineering.
We present a procedure for efficient variational Bayesian learning of
nonlinear state-space models based on sparse Gaussian processes. The result
of learning is a tractable posterior over nonlinear dynamical systems. In
comparison to conventional parametric models, we offer the possibility to
straightforwardly trade off model capacity and computational cost whilst
avoiding overfitting. Our main algorithm uses a hybrid inference approach
combining variational Bayes and sequential Monte Carlo. We also present
stochastic variational inference and online learning approaches for fast
learning with long time series.

Stefano Favaro, Maria Lomeli, Bernardo Nipoti, and Yee Whye Teh.
**Stick-breaking
representations of sigma-Stable Poisson-Kingman models**.
*Electronic Journal of Statistics*, 8:1063-1085, 2014.

**
Abstract:** In this paper we investigate the stick-breaking representation
for the class of sigma-Stable Poisson-Kingman models, also known as
Gibbs-type random probability measures. This class includes as special cases
most of the discrete priors commonly used in Bayesian nonparametrics, such as
the two parameter Poisson-Dirichlet process and the normalized generalized
Gamma process. Under the assumption sigma=u/v, for any coprime integers
1<=u

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl Edward
Rasmussen.
**Identification of Gaussian
process state-space models with particle stochastic approximation
EM**.
In *Proceedings of the 19th World Congress of the International Federation
of Automatic Control (IFAC)*, 2014.

** Abstract:**
Gaussian process state-space models (GP-SSMs) are a very flexible family of
models of nonlinear dynamical systems. They comprise a Bayesian nonparametric
representation of the dynamics of the system and additional
(hyper-)parameters governing the properties of this nonparametric
representation. The Bayesian formalism enables systematic reasoning about the
uncertainty in the system dynamics. We present an approach to maximum
likelihood identification of the parameters in GP-SSMs, while retaining the
full nonparametric description of the dynamics. The method is based on a
stochastic approximation version of the EM algorithm that employs recent
developments in particle Markov chain Monte Carlo for efficient
identification.

Yarin Gal and Zoubin Ghahramani.
**Pitfalls in the
use of parallel inference for the Dirichlet process**.
In *Proceedings of the 31st International Conference on Machine Learning
(ICML-14)*, 2014.

** Abstract:** Recent work done by
Lovell, Adams, and Mansinghka (2012) and Williamson, Dubey, and Xing (2013)
has suggested an alternative parametrisation for the Dirichlet process in
order to derive non-approximate parallel MCMC inference for it – work which
has been picked up and implemented in several different fields. In this paper
we show that the approach suggested is impractical due to an extremely
unbalanced distribution of the data. We characterise the requirements of
efficient parallel inference for the Dirichlet process and show that the
proposed inference fails most of these requirements (while approximate
approaches often satisfy most of them). We present both theoretical and
experimental evidence, analysing the load balance for the inference and
showing that it is independent of the size of the dataset and the number of
nodes available in the parallel implementation. We end with suggestions of
alternative paths of research for efficient non-approximate parallel
inference for the Dirichlet process.

Yarin Gal, Mark van der Wilk, and Carl Rasmussen.
**Distributed
variational inference in sparse Gaussian process regression and latent
variable models**.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 27*, pages
3257-3265. Curran Associates, Inc., 2014.

** Abstract:**
Gaussian processes (GPs) are a powerful tool for probabilistic inference over
functions. They have been applied to both regression and non-linear
dimensionality reduction, and offer desirable properties such as uncertainty
estimates, robustness to over-fitting, and principled ways for tuning
hyper-parameters. However, the scalability of these models to big datasets
remains an active topic of research. We introduce a novel re-parametrisation
of variational inference for sparse GP regression and latent variable models
that allows for an efficient distributed algorithm. This is done by
exploiting the decoupling of the data given the inducing points to
re-formulate the evidence lower bound in a Map-Reduce setting. We show that
the inference scales well with data and computational resources, while
preserving a balanced distribution of the load among the nodes. We further
demonstrate the utility in scaling Gaussian processes to big data. We show
that GP performance improves with increasing amounts of data in regression
(on flight data with 2 million records) and latent variable modelling (on
MNIST). The results show that GPs perform better than many common models
often used for big data.

Alexander G. D. G. Matthews and Zoubin Ghahramani.
**Classification using log
Gaussian Cox processes**.
*arXiv preprint arXiv:1405.4141*, 2014.

** Abstract:**
McCullagh and Yang (2006) suggest a family of classification algorithms
based on Cox processes. We further investigate the log Gaussian variant which
has a number of appealing properties. Conditioned on the covariates, the
distribution over labels is given by a type of conditional Markov random
field. In the supervised case, computation of the predictive probability of a
single test point scales linearly with the number of training points and the
multiclass generalization is straightforward. We show new links between the
supervised method and classical nonparametric methods. We give a detailed
analysis of the pairwise graph representable Markov random field, which we
use to extend the model to semi-supervised learning problems, and propose an
inference method based on graph min-cuts. We give the first experimental
analysis on supervised and semi-supervised datasets and show good empirical
performance.

Andrew Gordon Wilson.
**Covariance
Kernels for Fast Automatic Pattern Discovery and Extrapolation with
Gaussian Processes**.
PhD thesis, University of Cambridge, Cambridge, UK, 2014.

**
Abstract:** Truly intelligent systems are capable of pattern discovery and
extrapolation without human intervention. Bayesian nonparametric models,
which can uniquely represent expressive prior information and detailed
inductive biases, provide a distinct opportunity to develop intelligent
systems, with applications in essentially any learning and prediction
task.

Gaussian processes are rich distributions over functions, which
provide a Bayesian nonparametric approach to smoothing and interpolation. A
covariance kernel determines the support and inductive biases of a Gaussian
process. In this thesis, we introduce new covariance kernels to enable fast
automatic pattern discovery and extrapolation with Gaussian processes.

In
the introductory chapter, we discuss the high level principles behind all of
the models in this thesis: 1) we can typically improve the predictive
performance of a model by accounting for additional structure in data; 2) to
automatically discover rich structure in data, a model must have large
support and the appropriate inductive biases; 3) we most need expressive
models for large datasets, which typically provide more information for
learning structure, and 4) we can often exploit the existing inductive biases
(assumptions) or structure of a model for scalable inference, without the
need for simplifying assumptions.

In the context of this introduction, we
then discuss, in chapter 2, Gaussian processes as kernel machines, and my
views on the future of Gaussian process research.

In chapter 3 we
introduce the Gaussian process regression network (GPRN) framework, a
multi-output Gaussian process method which scales to many output variables,
and accounts for input-dependent correlations between the outputs. Underlying
the GPRN is a highly expressive kernel, formed using an adaptive mixture of
latent basis functions in a neural network like architecture. The GPRN is
capable of discovering expressive structure in data. We use the GPRN to model
the time-varying expression levels of 1000 genes, the spatially varying
concentrations of several distinct heavy metals, and multivariate volatility
(input dependent noise covariances) between returns on equity indices and
currency exchanges, which is particularly valuable for portfolio allocation.
We generalise the GPRN to an adaptive network framework, which does not
depend on Gaussian processes or Bayesian nonparametrics; and we outline
applications for the adaptive network in nuclear magnetic resonance (NMR)
spectroscopy, ensemble learning, and change-point modelling.

In chapter 4
we introduce simple closed form kernels for automatic pattern discovery and
extrapolation. These spectral mixture (SM) kernels are derived by modelling
the spectral density of a kernel (its Fourier transform) using a
scale-location Gaussian mixture. SM kernels form a basis for all stationary
covariances, and can be used as a drop-in replacement for standard kernels,
as they retain simple and exact learning and inference procedures. We use the
SM kernel to discover patterns and perform long range extrapolation on
atmospheric CO2 trends and airline passenger data, as well as on synthetic
examples. We also show that the SM kernel can be used to automatically
reconstruct several standard covariances. The SM kernel and the GPRN are
highly complementary; we show that using the SM kernel with adaptive basis
functions in a GPRN induces an expressive prior over non-stationary
kernels.

In chapter 5 we introduce GPatt, a method for fast
multidimensional pattern extrapolation, particularly suited to image and movie
data. Without human intervention - no hand crafting of kernel features, and
no sophisticated initialisation procedures - we show that GPatt can solve
large scale pattern extrapolation, inpainting and kernel discovery problems,
including a problem with 383,400 training points. GPatt exploits the
structure of a spectral mixture product (SMP) kernel, for fast yet exact
inference procedures. We find that GPatt significantly outperforms popular
alternative scalable Gaussian process methods in speed and accuracy.
Moreover, we discover profound differences between each of these methods,
suggesting expressive kernels, nonparametric representations, and scalable
inference which exploits existing model structure are useful in combination
for modelling large scale multidimensional patterns.

The models in this
dissertation have proven to be scalable and with greatly enhanced predictive
performance over the alternatives: the extra structure being modelled is an
important part of a wide variety of real data - including problems in
econometrics, gene expression, geostatistics, nuclear magnetic resonance
spectroscopy, ensemble learning, multi-output regression, change point
modelling, time series, multivariate volatility, image inpainting, texture
extrapolation, video extrapolation, acoustic modelling, and kernel
discovery.

José Miguel Hernández-Lobato, James Robert Lloyd, and Daniel
Hernández-Lobato.
**Gaussian process
conditional copulas with applications to financial time series**.
In *Advances in Neural Information Processing Systems 26*, pages 1-9,
Lake Tahoe, California, USA, December 2013.

** Abstract:** The
estimation of dependencies between multiple variables is a central problem in
the analysis of financial time series. A common approach is to express these
dependencies in terms of a copula function. Typically the copula function is
assumed to be constant but this may be inaccurate when there are covariates
that could have a large influence on the dependence structure of the data. To
account for this, a Bayesian framework for the estimation of conditional
copulas is proposed. In this framework the parameters of a copula are
non-linearly related to some arbitrary conditioning variables. We evaluate
the ability of our method to predict time-varying dependencies on several
equities and currencies and observe consistent performance gains compared to
static copula models and other time-varying copula methods.

Tomoharu Iwata, David Duvenaud, and Zoubin Ghahramani.
**Warped mixtures for nonparametric
cluster shapes**.
In *29th Conference on Uncertainty in Artificial Intelligence*,
Bellevue, Washington, July 2013.

** Abstract:** A mixture of
Gaussians fit to a single curved or heavy-tailed cluster will report that the
data contains many clusters. To produce more appropriate clusterings, we
introduce a model which warps a latent mixture of Gaussians to produce
nonparametric cluster shapes. The possibly low-dimensional latent mixture
model allows us to summarize the properties of the high-dimensional clusters
(or density manifolds) describing the data. The number of manifolds, as well
as the shape and dimension of each manifold is automatically inferred. We
derive a simple inference scheme for this model which analytically integrates
out both the mixture parameters and the warping function. We show that our
model is effective for density estimation, performs better than infinite
Gaussian mixture models at recovering the true number of clusters, and
produces interpretable summaries of high-dimensional datasets.

Novi Quadrianto, Viktoriia Sharmanska, David A. Knowles, and Zoubin Ghahramani.
**The supervised
IBP: Neighbourhood preserving infinite latent feature models**.
In *29th Conference on Uncertainty in Artificial Intelligence*,
Bellevue, USA, July 2013.

** Abstract:** We propose a
probabilistic model to infer supervised latent variables in the Hamming space
from observed data. Our model allows simultaneous inference of the number of
binary latent variables, and their values. The latent variables preserve
neighbourhood structure of the data in a sense that objects in the same
semantic concept have similar latent values, and objects in different
concepts have dissimilar latent values. We formulate the supervised infinite
latent variable problem based on an intuitive principle of pulling objects
together if they are of the same type, and pushing them apart if they are
not. We then combine this principle with a flexible Indian Buffet Process
prior on the latent variables. We show that the inferred supervised latent
variables can be directly used to perform a nearest neighbour search for the
purpose of retrieval. We introduce a new application of dynamically extending
hash codes, and show how to effectively couple the structure of the hash
codes with continuously growing structure of the neighbourhood preserving
infinite latent feature space.

David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, and
Zoubin Ghahramani.
**Structure
discovery in nonparametric regression through compositional kernel
search**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Despite its
importance, choosing the structural form of the kernel in nonparametric
regression remains a black art. We define a space of kernel structures which
are built compositionally by adding and multiplying a small number of base
kernels. We present a method for searching over this space of structures
which mirrors the scientific discovery process. The learned structures can
often decompose functions into interpretable components and enable long-range
extrapolation on time-series datasets. Our structure search method
outperforms many widely used kernels and kernel combination methods on a
variety of prediction tasks.
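
The idea of building kernel structures by adding and multiplying base kernels can be sketched as below. This is only an illustration of the compositional grammar described in the abstract, not the authors' search procedure; the base kernel set, the helper names `add` and `mul`, and the hyperparameter values are assumptions chosen for the example.

```python
import numpy as np

def se(x1, x2, lengthscale=1.0):
    """Squared-exponential base kernel on 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def periodic(x1, x2, period=1.0, lengthscale=1.0):
    """Periodic base kernel on 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale**2)

def linear(x1, x2):
    """Linear base kernel on 1-D inputs."""
    return x1[:, None] * x2[None, :]

def add(k1, k2):
    """Sum of two kernels, which is again a valid kernel."""
    return lambda a, b: k1(a, b) + k2(a, b)

def mul(k1, k2):
    """Product of two kernels, which is again a valid kernel."""
    return lambda a, b: k1(a, b) * k2(a, b)

# One candidate structure: a linear trend plus a locally periodic component.
composite = add(linear, mul(se, periodic))

x = np.linspace(0.0, 4.0, 30)
K = composite(x, x)
min_eig = np.linalg.eigvalsh(K).min()  # non-negative up to round-off for a valid kernel
```

A structure search in this spirit would score many such compositions against data and expand the best one; the scoring criterion and search strategy are left out of this sketch.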

David Lopez-Paz, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Gaussian
process vine copulas for multivariate dependence**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Copulas allow one to learn
marginal distributions separately from the multivariate dependence structure
(copula) that links them together into a density function. Vine
factorizations ease the learning of high-dimensional copulas by constructing
a hierarchy of conditional bivariate copulas. However, to simplify inference,
it is common to assume that each of these conditional bivariate copulas is
independent from its conditioning variables. In this paper, we relax this
assumption by discovering the latent functions that specify the shape of a
conditional copula given its conditioning variables. We learn these functions
by following a Bayesian approach based on sparse Gaussian processes with
expectation propagation for scalable, approximate inference. Experiments on
real-world datasets show that, when modeling all conditional dependencies, we
obtain better estimates of the underlying copula of the data.

Andrew Gordon Wilson and Ryan Prescott Adams.
**Gaussian
process kernels for pattern discovery and extrapolation**.
In *30th International Conference on Machine Learning*, February 18
2013.

** Abstract:** Gaussian processes are rich distributions
over functions, which provide a Bayesian nonparametric approach to smoothing
and interpolation. We introduce simple closed form kernels that can be used
with Gaussian processes to discover patterns and enable extrapolation. These
kernels are derived by modelling a spectral density - the Fourier transform
of a kernel - with a Gaussian mixture. The proposed kernels support a broad
class of stationary covariances, but Gaussian process inference remains
simple and analytic. We demonstrate the proposed kernels by discovering
patterns and performing long range extrapolation on synthetic examples, as
well as atmospheric CO2 trends and airline passenger data. We also show that
we can reconstruct standard covariances within our framework.
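
As a rough illustration of the construction described here, modelling the spectral density with a Gaussian mixture gives, for one-dimensional inputs, a kernel of the form k(τ) = Σ_q w_q exp(−2π² τ² v_q) cos(2π τ μ_q). The sketch below evaluates that expression; the function name `spectral_mixture` and the example parameter values are assumptions for illustration only.

```python
import math
from typing import Sequence


def spectral_mixture(
    tau: float,
    weights: Sequence[float],
    means: Sequence[float],
    variances: Sequence[float],
) -> float:
    """Evaluate a 1-D spectral mixture kernel at lag tau.

    Each mixture component q has weight w_q, spectral mean mu_q (a frequency)
    and spectral variance v_q (an inverse squared length-scale).
    """
    return sum(
        w * math.exp(-2.0 * math.pi ** 2 * tau ** 2 * v)
        * math.cos(2.0 * math.pi * tau * m)
        for w, m, v in zip(weights, means, variances)
    )


# A single zero-mean component recovers a squared-exponential kernel with
# length-scale ell when the spectral variance is 1 / (4 * pi^2 * ell^2).
ell = 1.5
se_like = spectral_mixture(0.7, [1.0], [0.0], [1.0 / (4.0 * math.pi ** 2 * ell ** 2)])
```

The last two lines show one sense in which standard covariances can be reconstructed: a single component centred at zero frequency is a squared-exponential kernel.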

** Comment:** arXiv:1302.4245

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl Edward
Rasmussen.
**Bayesian inference and learning in
Gaussian process state-space models with particle MCMC**.
In L. Bottou, C.J.C. Burges, Z. Ghahramani, M. Welling, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 26*.
Curran Associates, Inc., 2013.

** Abstract:** State-space
models are successfully used in many areas of science, engineering and
economics to model time series and dynamical systems. We present a fully
Bayesian approach to inference and learning in nonlinear nonparametric
state-space models. We place a Gaussian process prior over the transition
dynamics, resulting in a flexible model able to capture complex dynamical
phenomena. However, to enable efficient inference, we marginalize over the
dynamics of the model and instead infer directly the joint smoothing
distribution through the use of specially tailored Particle Markov Chain
Monte Carlo samplers. Once a sample from the smoothing distribution is
computed, the state transition predictive distribution can be formulated
analytically. We make use of sparse Gaussian process models to greatly reduce
the computational complexity of the approach.

Roger Frigola and Carl Edward Rasmussen.
**Integrated pre-processing for
Bayesian nonlinear system identification with Gaussian processes**.
In *Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on*,
2013.

** Abstract:** We introduce GP-FNARX: a new model for
nonlinear system identification based on a nonlinear autoregressive exogenous
model (NARX) with filtered regressors (F) where the nonlinear regression
problem is tackled using sparse Gaussian processes (GP). We integrate data
pre-processing with system identification into a fully automated procedure
that goes from raw data to an identified model. Both pre-processing
parameters and GP hyper-parameters are tuned by maximizing the marginal
likelihood of the probabilistic model. We obtain a Bayesian model of the
system's dynamics which is able to report its uncertainty in regions where
the data is scarce. The automated approach, the modeling of uncertainty and
its relatively low computational cost make GP-FNARX a good candidate for
applications in robotics and adaptive control.

Yarin Gal and Phil Blunsom.
**A systematic
Bayesian treatment of the IBM alignment models**.
In *Proceedings of the 2013 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies*.
Association for Computational Linguistics, 2013.

** Abstract:** The dominant yet ageing IBM and HMM word alignment models
underpin most popular Statistical Machine Translation implementations in use
today. Though beset by the limitations of implausible independence
assumptions, intractable optimisation problems, and an excess of tunable
parameters, these models provide a scalable and reliable starting point for
inducing translation systems. In this paper we build upon this venerable base
by recasting these models in the non-parametric Bayesian framework. By
replacing the categorical distributions at their core with hierarchical
Pitman-Yor processes, and through the use of collapsed Gibbs sampling, we
provide a more flexible formulation and sidestep the original heuristic
optimisation techniques. The resulting models are highly extendible,
naturally permitting the introduction of phrasal dependencies. We present
extensive experimental results showing improvements in both AER and BLEU when
benchmarked against Giza++, including significant improvements over IBM model
4.

Zoubin Ghahramani.
**Bayesian
nonparametrics and the probabilistic approach to modelling**.
*Philosophical Transactions of the Royal Society A*, 2013.

** Abstract:** Modelling is fundamental to many fields of
science and engineering. A model can be thought of as a representation of
possible data one could predict from a system. The probabilistic approach to
modelling uses probability theory to express all aspects of uncertainty in
the model. The probabilistic approach is synonymous with Bayesian modelling,
which simply uses the rules of probability theory in order to make
predictions, compare alternative models, and learn model parameters and
structure from data. This simple and elegant framework is most powerful when
coupled with flexible probabilistic models. Flexibility is achieved through
the use of Bayesian nonparametrics. This article provides an overview of
probabilistic modelling and an accessible survey of some of the main tools in
Bayesian nonparametrics. The survey covers the use of Bayesian nonparametrics
for modelling unknown functions, density estimation, clustering, time series
modelling, and representing sparsity, hierarchies, and covariance structure.
More specifically it gives brief non-technical overviews of Gaussian
processes, Dirichlet processes, infinite hidden Markov models, Indian buffet
processes, Kingman's coalescent, Dirichlet diffusion trees, and Wishart
processes.

Colorado Reed and Zoubin Ghahramani.
**Scaling the
Indian buffet process via submodular maximization**.
In *ICML*, volume 28 of *JMLR Proceedings*, pages
1013-1021. JMLR.org, 2013.

** Abstract:** Inference for
latent feature models is inherently difficult as the inference space grows
exponentially with the size of the input data and number of latent features.
In this work, we use Kurihara & Welling (2008)'s maximization-expectation
framework to perform approximate MAP inference for linear-Gaussian latent
feature models with an Indian Buffet Process (IBP) prior. This formulation
yields a submodular function of the features that corresponds to a lower
bound on the model evidence. By adding a constant to this function, we obtain
a nonnegative submodular function that can be maximized via a greedy
algorithm that obtains at least a one-third approximation to the optimal
solution. Our inference method scales linearly with the size of the input
data, and we show the efficacy of our method on the largest datasets
currently analyzed using an IBP model.
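
The abstract does not say which greedy algorithm is used. One well-known scheme with a one-third guarantee for unconstrained maximization of a nonnegative submodular function is the deterministic double greedy of Buchbinder et al.; the sketch below assumes that scheme and applies it to a toy graph-cut objective, not to the IBP objective of the paper. The names `double_greedy`, `cut_value`, `brute_force_max` and the edge list are hypothetical.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List, Sequence, Tuple

SetFunction = Callable[[FrozenSet[int]], float]


def double_greedy(ground: Sequence[int], f: SetFunction) -> FrozenSet[int]:
    """Deterministic double greedy for a nonnegative submodular f.

    Grows a set x from empty and shrinks a set y from the full ground set,
    deciding each element once; x and y coincide at the end.
    """
    x = set()
    y = set(ground)
    for u in ground:
        gain_add = f(frozenset(x | {u})) - f(frozenset(x))
        gain_remove = f(frozenset(y - {u})) - f(frozenset(y))
        if gain_add >= gain_remove:
            x.add(u)
        else:
            y.discard(u)
    return frozenset(x)


# Toy objective: the cut function of a small undirected graph, which is
# nonnegative and submodular (an assumption-free stand-in objective).
EDGES: List[Tuple[int, int]] = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]


def cut_value(subset: FrozenSet[int]) -> float:
    """Number of edges with exactly one endpoint in subset."""
    return float(sum((a in subset) != (b in subset) for a, b in EDGES))


def brute_force_max(ground: Sequence[int], f: SetFunction) -> float:
    """Exhaustive optimum, feasible only for tiny ground sets."""
    return max(
        f(frozenset(c))
        for r in range(len(ground) + 1)
        for c in combinations(ground, r)
    )


nodes = [0, 1, 2, 3]
chosen = double_greedy(nodes, cut_value)
```

Each element is examined once and requires a constant number of function evaluations, so the number of evaluations grows linearly with the size of the ground set.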

Amar Shah and Zoubin Ghahramani.
**Determinantal
clustering processes - A nonparametric Bayesian approach to kernel based
semi-supervised clustering**.
*UAI*, 2013.

** Abstract:** Semi-supervised clustering is
the task of clustering data points into clusters where only a fraction of the
points are labelled. The true number of clusters in the data is often unknown
and most models require this parameter as an input. Dirichlet process mixture
models are appealing as they can infer the number of clusters from the data.
However, these models do not deal with high dimensional data well and can
encounter difficulties in inference. We present a novel nonparametric
Bayesian kernel based method to cluster data points without the need to
prespecify the number of clusters or to model complicated densities from
which data points are assumed to be generated. The key insight is to use
determinants of submatrices of a kernel matrix as a measure of how close
together a set of points are. We explore some theoretical properties of the
model and derive a natural Gibbs based algorithm with MCMC hyperparameter
learning. The model is implemented on a variety of synthetic and real world
data sets.
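
The key insight quoted above can be checked numerically: the determinant of a kernel submatrix shrinks towards zero as the corresponding points move closer together. The sketch below is only an illustration of that property, assuming a squared-exponential kernel; the helper names `se_kernel`, `gram` and `determinant` are hypothetical and the clustering model itself is not implemented.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def se_kernel(x: Point, y: Point, lengthscale: float = 1.0) -> float:
    """Squared-exponential kernel between two points (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * sq_dist / lengthscale ** 2)


def gram(points: Sequence[Point]) -> List[List[float]]:
    """Kernel submatrix for a set of points."""
    return [[se_kernel(p, q) for q in points] for p in points]


def determinant(matrix: Sequence[Sequence[float]]) -> float:
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(row) for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


tight = [(0.0,), (0.1,), (0.2,)]    # points close together
spread = [(0.0,), (3.0,), (6.0,)]   # points far apart
det_tight = determinant(gram(tight))
det_spread = determinant(gram(spread))
```

For two points the determinant is 1 − k(x, y)², so it is near zero when the points are nearly identical and near one when they are far apart.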

Andrew Gordon Wilson, Elad Gilboa, Arye Nehorai, and John P Cunningham.
**Gpatt: Fast multidimensional
pattern extrapolation with Gaussian processes**.
*arXiv preprint arXiv:1310.5288*, 2013.

** Abstract:**
Gaussian processes are typically used for smoothing and interpolation on
small datasets. We introduce a new Bayesian nonparametric framework - GPatt -
enabling automatic pattern extrapolation with Gaussian processes on large
multidimensional datasets. GPatt unifies and extends highly expressive
kernels and fast exact inference techniques. Without human intervention - no
hand crafting of kernel features, and no sophisticated initialisation
procedures - we show that GPatt can solve large scale pattern extrapolation,
inpainting, and kernel discovery problems, including a problem with 383,400
training points. We find that GPatt significantly outperforms popular
alternative scalable Gaussian process methods in speed and accuracy.
Moreover, we discover profound differences between each of these methods,
suggesting expressive kernels, nonparametric representations, and scalable
inference which exploits model structure are useful in combination for
modelling large scale multidimensional patterns.

James Robert Lloyd, Peter Orbanz, Zoubin Ghahramani, and Daniel M. Roy.
**Random
function priors for exchangeable arrays with applications to graphs and
relational data**.
In *Advances in Neural Information Processing Systems 26*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:** A
fundamental problem in the analysis of structured relational data like
graphs, networks, databases, and matrices is to extract a summary of the
common structure underlying relations between individual entities. Relational
data are typically encoded in the form of arrays; invariance to the ordering
of rows and columns corresponds to exchangeable arrays. Results in
probability theory due to Aldous, Hoover and Kallenberg show that
exchangeable arrays can be represented in terms of a random measurable
function which constitutes the natural model parameter in a Bayesian model.
We obtain a flexible yet simple Bayesian nonparametric model by placing a
Gaussian process prior on the parameter function. Efficient inference
utilises elliptical slice sampling combined with a random sparse
approximation to the Gaussian process. We demonstrate applications of the
model to network data and clarify its relation to models in the literature,
several of which emerge as special cases.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani.
**A
nonparametric variable clustering model**.
In *Advances in Neural Information Processing Systems 26*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:**
Factor analysis models effectively summarise the covariance structure of high
dimensional data, but the solutions are typically hard to interpret. This
motivates attempting to find a disjoint partition, i.e. a simple clustering,
of observed variables into highly correlated subsets. We introduce a Bayesian
non-parametric approach to this problem, and demonstrate advantages over
heuristic methods proposed to date. Our Dirichlet process variable clustering
(DPVC) model can discover block-diagonal covariance structures in data. We
evaluate our method on both synthetic and gene expression analysis
problems.

Ferenc Huszár and David Duvenaud.
**Optimally-weighted herding is
Bayesian quadrature**.
In *28th Conference on Uncertainty in Artificial Intelligence*, pages
377-385, Catalina Island, California, July 2012.

** Abstract:** Herding and kernel herding are deterministic methods of
choosing samples which summarise a probability distribution. A related task
is choosing samples for estimating integrals using Bayesian quadrature. We
show that the criterion minimised when selecting samples in kernel herding is
equivalent to the posterior variance in Bayesian quadrature. We then show
that sequential Bayesian quadrature can be viewed as a weighted version of
kernel herding which achieves performance superior to any other weighted
herding method. We demonstrate empirically a rate of convergence faster than
O(1/N). Our results also imply an upper bound on the empirical error of the
Bayesian quadrature estimate.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani.
**An infinite latent attribute
model for network data**.
In *29th International Conference on Machine Learning*, Edinburgh,
Scotland, June 2012.

** Abstract:** Latent variable models for
network data extract a summary of the relational structure underlying an
observed network. The simplest possible models subdivide nodes of the network
into clusters; the probability of a link between any two nodes then depends
only on their cluster assignment. Currently available models can be
classified by whether clusters are disjoint or are allowed to overlap. These
models can explain a "flat" clustering structure. Hierarchical Bayesian
models provide a natural approach to capture more complex dependencies. We
propose a model in which objects are characterised by a latent feature
vector. Each feature is itself partitioned into disjoint groups
(subclusters), corresponding to a second layer of hierarchy. In experimental
comparisons, the model achieves significantly improved predictive performance
on social and biological link prediction tasks. The results indicate that
models with a single layer hierarchy over-simplify real networks.

P. Kirk, J. E. Griffin, R. S. Savage, Z. Ghahramani, and D. L. Wild.
**Bayesian
correlated clustering to integrate multiple datasets**.
*Bioinformatics*, 2012.

** Abstract:** Motivation: The
integration of multiple datasets remains a key challenge in systems biology
and genomic medicine. Modern high-throughput technologies generate a broad
array of different data types, providing distinct – but often complementary
– information. We present a Bayesian method for the unsupervised
integrative modelling of multiple datasets, which we refer to as MDI
(Multiple Dataset Integration). MDI can integrate information from a wide
range of different datasets and data types simultaneously (including the
ability to model time series data explicitly using Gaussian processes). Each
dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture
model, with dependencies between these models captured via parameters that
describe the agreement among the datasets.

Results: Using a set of 6
artificially constructed time series datasets, we show that MDI is able to
integrate a significant number of datasets simultaneously, and that it
successfully captures the underlying structural similarity between the
datasets. We also analyse a variety of real S. cerevisiae datasets. In the
2-dataset case, we show that MDI’s performance is comparable to the present
state of the art. We then move beyond the capabilities of current approaches
and integrate gene expression, ChIP-chip and protein-protein interaction
data, to identify a set of protein complexes for which genes are co-regulated
during the cell cycle. Comparisons to other unsupervised data integration
techniques – as well as to non-integrative approaches – demonstrate that
MDI is very competitive, while also providing information that would be
difficult or impossible to extract using other methods.

** Comment:** This paper is available from the Bioinformatics
site and a Matlab implementation of MDI is available from this site.

Paul D. W. Kirk, Jim E. Griffin, Richard S. Savage, Zoubin Ghahramani, and
David L. Wild.
**Bayesian
correlated clustering to integrate multiple datasets**.
*Bioinformatics*, 28(24):3290-3297, 2012.

** Abstract:**
MOTIVATION: The integration of multiple datasets remains a key challenge in
systems biology and genomic medicine. Modern high-throughput technologies
generate a broad array of different data types, providing distinct – but often
complementary – information. We present a Bayesian method for the unsupervised
integrative modelling of multiple datasets, which we refer to as MDI
(Multiple Dataset Integration). MDI can integrate information from a wide
range of different datasets and data types simultaneously (including the
ability to model time series data explicitly using Gaussian processes). Each
dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture
model, with dependencies between these models captured through parameters
that describe the agreement among the datasets. RESULTS: Using a set of six
artificially constructed time series datasets, we show that MDI is able to
integrate a significant number of datasets simultaneously, and that it
successfully captures the underlying structural similarity between the
datasets. We also analyse a variety of real Saccharomyces cerevisiae
datasets. In the two-dataset case, we show that MDI's performance is
comparable with the present state-of-the-art. We then move beyond the
capabilities of current approaches and integrate gene expression, chromatin
immunoprecipitation-chip and protein-protein interaction data, to identify a
set of protein complexes for which genes are co-regulated during the cell
cycle. Comparisons to other unsupervised data integration techniques – as well
as to non-integrative approaches – demonstrate that MDI is competitive, while
also providing information that would be difficult or impossible to extract
using other methods.

Donglin Niu, Jennifer G. Dy, and Z. Ghahramani.
**A nonparametric
Bayesian model for multiple clustering with overlapping feature
views**.
In *15th International Conference on Artificial Intelligence and
Statistics*, 2012.

** Abstract:** Most clustering
algorithms produce a single clustering solution. This is inadequate for many
data sets that are multi-faceted and can be grouped and interpreted in many
different ways. Moreover, for high-dimensional data, different features may
be relevant or irrelevant to each clustering solution, suggesting the need
for feature selection in clustering. Features relevant to one clustering
interpretation may be different from the ones relevant for an alternative
interpretation or view of the data. In this paper, we introduce a
probabilistic nonparametric Bayesian model that can discover multiple
clustering solutions from data and the feature subsets that are relevant for
the clusters in each view. In our model, the features in different views may
be shared and therefore the sets of relevant features are allowed to overlap.
We model feature relevance to each view using an Indian Buffet Process and
the cluster membership in each view using a Chinese Restaurant Process. We
provide an inference approach to learn the latent parameters corresponding to
this multiple partitioning problem. Our model not only learns the features
and clusters in each view but also automatically learns the number of
clusters, number of views and number of features in each view.
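
For readers unfamiliar with the Chinese Restaurant Process mentioned above, its sequential seating rule can be sketched as follows: the customer arriving after n others joins an existing table with probability proportional to its size and opens a new table with probability proportional to a concentration parameter. The function name `sample_crp` and the parameter name `alpha` are assumptions; this is a generic CRP sampler, not the inference procedure of the paper.

```python
import random
from typing import List


def sample_crp(num_customers: int, alpha: float, rng: random.Random) -> List[int]:
    """Draw table assignments for num_customers from a CRP(alpha).

    With n customers already seated, the next one joins table k with
    probability size_k / (n + alpha) and a new table with probability
    alpha / (n + alpha).
    """
    table_sizes: List[int] = []
    assignments: List[int] = []
    for n in range(num_customers):
        r = rng.random() * (n + alpha)
        cumulative = 0.0
        choice = len(table_sizes)  # default: open a new table
        for k, size in enumerate(table_sizes):
            cumulative += size
            if r < cumulative:
                choice = k
                break
        if choice == len(table_sizes):
            table_sizes.append(0)
        table_sizes[choice] += 1
        assignments.append(choice)
    return assignments


labels = sample_crp(20, alpha=1.0, rng=random.Random(0))
num_clusters = len(set(labels))
```

The number of occupied tables is not fixed in advance, which is how a prior of this kind lets the number of clusters be learned from data.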

Jacob Steinhardt and Zoubin Ghahramani.
**Flexible martingale
priors for deep hierarchies**.
In *15th International Conference on Artificial Intelligence and
Statistics*, 2012.

** Abstract:** When building priors
over trees for Bayesian hierarchical models, there is a tension between
maintaining desirable theoretical properties such as infinite exchangeability
and important practical properties such as the ability to increase the depth
of the tree to accommodate new data. We resolve this tension by presenting a
family of infinitely exchangeable priors over discrete tree structures that
allows the depth of the tree to grow with the data, and then showing that our
family contains all hierarchical models with certain mild symmetry
properties. We also show that deep hierarchical models are in general
intimately tied to a process called a martingale, and use Doob’s martingale
convergence theorem to demonstrate some unexpected properties of deep
hierarchies.

Kyung-Ah Sohn, Zoubin Ghahramani, and Eric P. Xing.
**Robust estimation
of local genetic ancestry in admixed populations using a non-parametric
Bayesian approach**.
*Genetics*, 191(4), 2012.

** Abstract:** We present a new
haplotype-based approach for inferring local genetic ancestry of individuals
in an admixed population. Most existing approaches for local ancestry
estimation ignore the latent genetic relatedness between ancestral
populations and treat them as independent. In this paper, we exploit such
information by building an inheritance model that describes both the
ancestral populations and the admixed population jointly in a unified
framework. Based on an assumption that the common hypothetical founder
haplotypes give rise to both the ancestral and admixed population haplotypes,
we employ an infinite hidden Markov model to characterize each ancestral
population and further extend it to generate the admixed population. Through
an effective utilization of the population structural information under a
principled nonparametric Bayesian framework, the resulting model is
significantly less sensitive to the choice and the amount of training data
for ancestral populations than state-of-the-art algorithms. We also improve
the robustness under deviation from common modeling assumptions by
incorporating population-specific scale parameters that allow variable
recombination rates in different populations. Our method is applicable to an
admixed population from an arbitrary number of ancestral populations and also
performs competitively in terms of spurious ancestry proportions under
general multi-way admixture assumption. We validate the proposed method by
simulation under various admixing scenarios and present empirical analysis
results on a worldwide distributed dataset from Human Genome Diversity
Project.

** Comment:** doi: 10.1534/genetics.112.140228

Andrew Gordon Wilson, David A Knowles, and Zoubin Ghahramani.
**Gaussian process
regression networks**.
Technical Report arXiv:1110.4411 [stat.ML], Department of Engineering,
University of Cambridge, Cambridge, UK, October 19 2011.

** Abstract:** We introduce a new regression framework, Gaussian process
regression networks (GPRN), which combines the structural properties of
Bayesian neural networks with the non-parametric flexibility of Gaussian
processes. This model accommodates input dependent signal and noise
correlations between multiple response variables, input dependent
length-scales and amplitudes, and heavy-tailed predictive distributions. We
derive both efficient Markov chain Monte Carlo and variational Bayes
inference procedures for this model. We apply GPRN as a multiple output
regression and multivariate volatility model, demonstrating substantially
improved performance over eight popular multiple output (multi-task) Gaussian
process models and three multivariate volatility models on benchmark
datasets, including a 1000 dimensional gene expression dataset.

** Comment:** arXiv:1110.4411

Thomas L. Griffiths and Zoubin Ghahramani.
**The Indian buffet
process: An introduction and review**.
*Journal of Machine Learning Research*, 12:1185-1224, April 2011.

** Abstract:** The Indian buffet process is a stochastic process
defining a probability distribution over equivalence classes of sparse binary
matrices with a finite number of rows and an unbounded number of columns.
This distribution is suitable for use as a prior in probabilistic models that
represent objects using a potentially infinite array of features, or that
involve bipartite graphs in which the size of at least one class of nodes is
unknown. We give a detailed derivation of this distribution, and illustrate
its use as a prior in an infinite latent feature model. We then review recent
applications of the Indian buffet process in machine learning, discuss its
extensions, and summarize its connections to other stochastic processes.
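
The standard sequential ("culinary") description of the Indian buffet process can be sketched in a few lines: customer i takes each previously sampled dish k with probability m_k / i, where m_k is the number of earlier customers who took it, and then tries Poisson(alpha / i) new dishes. The names `sample_ibp` and `sample_poisson` are hypothetical, and the sketch returns one particular matrix rather than an equivalence class.

```python
import math
import random
from typing import List


def sample_poisson(rate: float, rng: random.Random) -> int:
    """Poisson sample via Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_ibp(num_rows: int, alpha: float, rng: random.Random) -> List[List[int]]:
    """Draw a binary feature matrix with num_rows rows from an IBP(alpha)."""
    dish_counts: List[int] = []   # m_k: how many customers have taken dish k
    rows: List[List[int]] = []
    for i in range(1, num_rows + 1):
        # Revisit existing dishes in proportion to their popularity.
        row = [1 if rng.random() < m / i else 0 for m in dish_counts]
        for k, taken in enumerate(row):
            dish_counts[k] += taken
        # Then sample a Poisson number of brand-new dishes.
        new_dishes = sample_poisson(alpha / i, rng)
        row.extend([1] * new_dishes)
        dish_counts.extend([1] * new_dishes)
        rows.append(row)
    width = len(dish_counts)
    # Pad earlier rows with zeros so the matrix is rectangular.
    return [row + [0] * (width - len(row)) for row in rows]


features = sample_ibp(10, alpha=2.0, rng=random.Random(0))
```

The number of rows is fixed while the number of columns is random and grows slowly with the number of rows, matching the "finite number of rows and an unbounded number of columns" in the abstract.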

Daniel Hernández-Lobato, José Miguel Hernández-Lobato, and Pierre Dupont.
**Robust
multi-class Gaussian process classification**.
In *Advances in Neural Information Processing Systems 25*, 2011.

** Abstract:** Multi-class Gaussian Process Classifiers (MGPCs)
are often affected by overfitting problems when labeling errors occur far
from the decision boundaries. To prevent this, we investigate a robust MGPC
(RMGPC) which considers labeling errors independently of their distance to
the decision boundaries. Expectation propagation is used for approximate
inference. Experiments with several datasets in which noise is injected in
the labels illustrate the benefits of RMGPC. This method performs better than
other Gaussian process alternatives based on considering latent Gaussian
noise or heavy-tailed processes. When no noise is injected in the labels,
RMGPC still performs as well as or better than the other methods. Finally, we show
how RMGPC can be used for successfully identifying data instances which are
difficult to classify correctly in practice.

David A. Knowles and Zoubin Ghahramani.
**Nonparametric
Bayesian sparse factor models with application to gene expression
modelling**.
*Annals of Applied Statistics*, 5(2B):1534-1552, 2011.

** Abstract:** A nonparametric Bayesian extension of Factor Analysis (FA) is
proposed where observed data Y is modeled as a linear superposition, G, of a
potentially infinite number of hidden factors, X. The Indian Buffet Process
(IBP) is used as a prior on G to incorporate sparsity and to allow the number
of latent features to be inferred. The model's utility for modeling gene
expression data is investigated using randomly generated data sets based on a
known sparse connectivity matrix for E. Coli, and on three biological data
sets of increasing complexity.

David A. Knowles and Zoubin Ghahramani.
**Pitman-Yor
diffusion trees**.
In *27th Conference on Uncertainty in Artificial Intelligence*, 2011.

** Abstract:** We introduce the Pitman-Yor Diffusion Tree (PYDT)
for hierarchical clustering, a generalization of the Dirichlet Diffusion Tree
(Neal, 2001) which removes the restriction to binary branching structure. The
generative process is described and shown to result in an exchangeable
distribution over data points. We prove some theoretical properties of the
model and then present two inference methods: a collapsed MCMC sampler which
allows us to model uncertainty over tree structures, and a computationally
efficient greedy Bayesian EM search algorithm. Both algorithms use message
passing on the tree structure. The utility of the model and algorithms is
demonstrated on synthetic and real world data, both continuous and
binary.

** Comment:** web site

David A. Knowles, Jurgen Van Gael, and Zoubin Ghahramani.
**Message passing
algorithms for the Dirichlet diffusion tree**.
In *28th International Conference on Machine Learning*, 2011.

** Abstract:** We demonstrate efficient approximate inference
for the Dirichlet Diffusion Tree (Neal, 2003), a Bayesian nonparametric prior
over tree structures. Although DDTs provide a powerful and elegant approach
for modeling hierarchies they haven't seen much use to date. One problem is
the computational cost of MCMC inference. We provide the first deterministic
approximate inference methods for DDT models and show excellent performance
compared to the MCMC alternative. We present message passing algorithms to
approximate the Bayesian model evidence for a specific tree. This is used to
drive sequential tree building and greedy search to find optimal tree
structures, corresponding to hierarchical clusterings of the data. We
demonstrate appropriate observation models for continuous and binary data.
The empirical performance of our method is very close to the computationally
expensive MCMC alternative on a density estimation problem, and significantly
outperforms kernel density estimators.

** Comment:** web site

David A. Knowles and Thomas P. Minka.
**Non-conjugate
variational message passing for multinomial and binary regression**.
In *Advances in Neural Information Processing Systems 25*, 2011.

** Abstract:** Variational Message Passing (VMP) is an
algorithmic implementation of the Variational Bayes (VB) method which applies
only in the special case of conjugate exponential family models. We propose
an extension to VMP, which we refer to as Non-conjugate Variational Message
Passing (NCVMP) which aims to alleviate this restriction while maintaining
modularity, allowing choice in how expectations are calculated, and
integrating into an existing message-passing framework: Infer.NET. We
demonstrate NCVMP on logistic binary and multinomial regression. In the
multinomial case we introduce a novel variational bound for the softmax
factor which is tighter than other commonly used bounds whilst maintaining
computational tractability.

** Comment:** web site supplementary

Daniel M. Roy.
**On the computability
and complexity of Bayesian reasoning**.
In *NIPS Workshop on Philosophy and Machine Learning*, 2011.

** Abstract:** If we consider the claim made by some cognitive
scientists that the mind performs Bayesian reasoning, and if we
simultaneously accept the Physical Church-Turing thesis and thus believe that
the computational power of the mind is no more than that of a Turing machine,
then what limitations are there to the reasoning abilities of the mind? I
give an overview of joint work with Nathanael Ackerman (Harvard, Mathematics)
and Cameron Freer (MIT, CSAIL) that bears on the computability and complexity
of Bayesian reasoning. In particular, we prove that conditional probability
is in general not computable in the presence of continuous random variables.
However, in light of additional structure in the prior distribution, such as
the presence of certain types of noise, or of exchangeability, conditioning
is possible. These results cover most of statistical practice. At the
workshop on Logic and Computational Complexity, we presented results on the
computational complexity of conditioning, embedding sharp-P-complete problems
in the task of computing conditional probabilities for diffuse continuous
random variables. This work complements older work. For example, under
cryptographic assumptions, the computational complexity of producing samples
and computing probabilities was separated by Ben-David, Chor, Goldreich and
Luby. In recent work, we also make use of cryptographic assumptions to show
that different representations of exchangeable sequences may have vastly
different complexity. However, when faced with an adversary that is
computationally bounded, these different representations have the same
complexity, highlighting the fact that knowledge representation and
approximation play a fundamental role in the possibility and plausibility of
Bayesian reasoning.

Andrew Gordon Wilson and Zoubin Ghahramani.
**Generalised
Wishart processes**.
In *27th Conference on Uncertainty in Artificial Intelligence*, 2011.

** Abstract:** We introduce a new stochastic process called the
generalised Wishart process (GWP). It is a collection of positive
semi-definite random matrices indexed by any arbitrary input variable. We use
this process as a prior over dynamic (e.g. time varying) covariance matrices.
The GWP captures a diverse class of covariance dynamics, naturally handles
missing data, scales nicely with dimension, has easily interpretable
parameters, and can use input variables that include covariates other than
time. We describe how to construct the GWP, introduce general procedures for
inference and prediction, and show that it outperforms its main competitor,
multivariate GARCH, even on financial data that especially suits GARCH.

** Comment:** Supplementary
Material, Best Student Paper Award

Y. Guan, J. G. Dy, D. Niu, and Z. Ghahramani.
**Variational
inference for nonparametric multiple clustering**.
In *KDD10 Workshop on Discovering, Summarizing, and Using Multiple
Clusterings*, Washington, DC, USA, July 2010.

** Abstract:** Most clustering algorithms produce a single clustering
solution. Similarly, feature selection for clustering tries to find one
feature subset where one interesting clustering solution resides. However, a
single data set may be multi-faceted and can be grouped and interpreted in
many different ways, especially for high dimensional data, where feature
selection is typically needed. Moreover, different clustering solutions are
interesting for different purposes. Instead of committing to one clustering
solution, in this paper we introduce a probabilistic nonparametric Bayesian
model that can discover several possible clustering solutions and the feature
subset views that generated each cluster partitioning simultaneously. We
provide a variational inference approach to learn the features and clustering
partitions in each view. Our model allows us not only to learn the multiple
clusterings and views but also to learn automatically the number of
views and the number of clusters in each view.

Dilan Görür and Carl Edward Rasmussen.
**Dirichlet process
Gaussian mixture models: Choice of the base distribution**.
*Journal of Computer Science and Technology*, 25(4):615-625, July 2010,
doi 10.1007/s11390-010-9355-8.

** Abstract:** In the Bayesian
mixture modeling framework it is possible to infer the necessary number of
components to model the data and therefore it is unnecessary to explicitly
restrict the number of components. Nonparametric mixture models sidestep the
problem of finding the "correct" number of mixture components by assuming
infinitely many components. In this paper Dirichlet process mixture (DPM)
models are cast as infinite mixture models and inference using Markov chain
Monte Carlo is described. The specification of the priors on the model
parameters is often guided by mathematical and practical convenience. The
primary goal of this paper is to compare the choice of conjugate and
non-conjugate base distributions on a particular class of DPM models which is
widely used in applications, the Dirichlet process Gaussian mixture model
(DPGMM). We compare computational efficiency and modeling performance of
DPGMM defined using a conjugate and a conditionally conjugate base
distribution. We show that better density models can result from using a
wider class of priors with no or only a modest increase in computational
effort.
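
The "infinitely many components" view above can be made concrete with the Chinese restaurant process, the prior over partitions that a Dirichlet process mixture induces. The following is a minimal sketch of that prior only, not of the paper's MCMC samplers or its base distributions; the function name `sample_crp` and its parameters are hypothetical and chosen for illustration.

```python
import random


def sample_crp(num_points, concentration, seed=0):
    """Draw cluster labels from a Chinese restaurant process (illustrative sketch).

    Point n (0-based) joins existing cluster k with probability
    sizes[k] / (n + concentration) and opens a new cluster with
    probability concentration / (n + concentration).
    """
    rng = random.Random(seed)
    assignments = []  # assignments[n] = cluster label of point n
    sizes = []        # sizes[k] = number of points currently in cluster k
    for n in range(num_points):
        r = rng.random() * (n + concentration)
        cumulative = 0.0
        choice = len(sizes)  # default: open a new cluster
        for k, size in enumerate(sizes):
            cumulative += size
            if r < cumulative:
                choice = k
                break
        if choice == len(sizes):
            sizes.append(0)
        sizes[choice] += 1
        assignments.append(choice)
    return assignments, sizes


if __name__ == "__main__":
    labels, cluster_sizes = sample_crp(20, concentration=1.0, seed=1)
    print(labels)
    print(cluster_sizes)
```

The number of occupied clusters is not fixed in advance; it grows with the data at a rate governed by the concentration parameter.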

Sinead Williamson, Katherine A. Heller, C. Wang, and D. M. Blei.
**The IBP
compound Dirichlet process and its application to focused topic
modeling**.
In *27th International Conference on Machine Learning*, pages
1151-1158, Haifa, Israel, June 2010.

** Abstract:** The
hierarchical Dirichlet process (HDP) is a Bayesian nonparametric mixed
membership model: each data point is modeled with a collection of
components of different proportions. Though powerful, the HDP makes an
assumption that the probability of a component being exhibited by a data
point is positively correlated with its proportion within that data point.
This might be an undesirable assumption. For example, in topic modeling, a
topic (component) might be rare throughout the corpus but dominant within
those documents (data points) where it occurs. We develop the IBP compound
Dirichlet process (ICD), a Bayesian nonparametric prior that decouples
across-data prevalence and within-data proportion in a mixed membership
model. The ICD combines properties from the HDP and the Indian buffet process
(IBP), a Bayesian nonparametric prior on binary matrices. The ICD assigns a
subset of the shared mixture components to each data point. This subset, the
data point's "focus", is determined independently from the amount that each
of its components contributes. We develop an ICD mixture model for text, the
focused topic model (FTM), and show superior performance over the HDP-based
topic model.

R. P. Adams, H. Wallach, and Zoubin Ghahramani.
**Learning the
structure of deep sparse graphical models**.
In Yee Whye Teh and Mike Titterington, editors, *13th International
Conference on Artificial Intelligence and Statistics*, pages 1-8, Chia
Laguna, Sardinia, Italy, May 2010.

** Abstract:** Deep belief
networks are a powerful way to model complex probability distributions.
However, it is difficult to learn the structure of a belief network,
particularly one with hidden units. The Indian buffet process has been used
as a nonparametric Bayesian prior on the structure of a directed belief
network with a single infinitely wide hidden layer. Here, we introduce the
cascading Indian buffet process (CIBP), which provides a prior on the
structure of a layered, directed belief network that is unbounded in both
depth and width, yet allows tractable inference. We use the CIBP prior with
the nonlinear Gaussian belief network framework to allow each unit to vary
its behavior between discrete and continuous representations. We use Markov
chain Monte Carlo for inference in this model and explore the structures
learned on image data.

** Comment:** Winner of the Best Paper Award

Sinead Williamson, Peter Orbanz, and Zoubin Ghahramani.
**Dependent
Indian buffet processes**.
In *13th International Conference on Artificial Intelligence and
Statistics*, volume 9 of *W & CP*, pages 924-931, Chia
Laguna, Sardinia, Italy, May 2010.

** Abstract:** Latent
variable models represent hidden structure in observational data. To account
for the distribution of the observational data changing over time, space or
some other covariate, we need generalizations of latent variable models that
explicitly capture this dependency on the covariate. A variety of such
generalizations has been proposed for latent variable models based on the
Dirichlet process. We address dependency on covariates in binary latent
feature models, by introducing a dependent Indian Buffet Process. The model
generates a binary random matrix with an unbounded number of columns for each
value of the covariate. Evolution of the binary matrices over the covariate
set is controlled by a hierarchical Gaussian process model. The choice of
covariance functions controls the dependence structure and exchangeability
properties of the model. We derive a Markov Chain Monte Carlo sampling
algorithm for Bayesian inference, and provide experiments on both synthetic
and real-world data. The experimental results show that explicit modeling of
dependencies significantly improves accuracy of predictions.

R. P. Adams, Zoubin Ghahramani, and Michael I. Jordan.
**Tree-structured
stick breaking for hierarchical data**.
In *Advances in Neural Information Processing Systems 23*. The MIT
Press, 2010.

** Abstract:** Many data are naturally modeled by
an unobserved hierarchical structure. In this paper we propose a flexible
nonparametric prior over unknown data hierarchies. The approach uses nested
stick-breaking processes to allow for trees of unbounded width and depth,
where data can live at any node and are infinitely exchangeable. One can view
our model as providing infinite mixtures where the components have a
dependency structure corresponding to an evolutionary diffusion down a tree.
By using a stick-breaking approach, we can apply Markov chain Monte Carlo
methods based on slice sampling to perform Bayesian inference and simulate
from the posterior distribution on trees. We apply our method to hierarchical
clustering of images and topic modeling of text data.

Sébastien Bratières, Jurgen Van Gael, Andreas Vlachos, and Zoubin
Ghahramani.
**Scaling the
iHMM: Parallelization versus Hadoop**.
In *Proceedings of the 2010 10th IEEE International Conference on Computer
and Information Technology*, pages 1235-1240, Bradford, UK, 2010. IEEE
Computer Society, doi 10.1109/CIT.2010.223.

** Abstract:** This paper compares
parallel and distributed implementations of an iterative, Gibbs sampling,
machine learning algorithm. Distributed implementations run under Hadoop on
facility computing clouds. The probabilistic model under study is the
infinite HMM (Beal, Ghahramani and Rasmussen, 2002), in which parameters are
learnt using an instance of blocked Gibbs sampling, with a step consisting of
a dynamic program. We apply this model to learn part-of-speech tags from
newswire text in an unsupervised fashion.
However, our focus here is on runtime performance, as opposed to NLP-relevant
scores, embodied by iteration duration, ease of development, deployment and
debugging.

Andrew Gordon Wilson and Zoubin Ghahramani.
**Copula
processes**.
In *Advances in Neural Information Processing Systems 23*, 2010.
Spotlight.

** Abstract:** We define a copula process which
describes the dependencies between arbitrarily many random variables
independently of their marginal distributions. As an example, we develop a
stochastic volatility model, Gaussian Copula Process Volatility (GCPV), to
predict the latent standard deviations of a sequence of random variables. To
make predictions we use Bayesian inference, with the Laplace approximation,
and with Markov chain Monte Carlo as an alternative. We find our model can
outperform GARCH on simulated and financial data. And unlike GARCH, GCPV can
easily handle missing data, incorporate covariates other than time, and model
a rich class of covariance structures.

** Comment:** Supplementary
Material, slides.

Finale Doshi-Velez, David Knowles, Shakir Mohamed, and Zoubin Ghahramani.
**Large scale
non-parametric inference: Data parallelisation in the Indian buffet
process**.
In *Advances in Neural Information Processing Systems 23*, pages
1294-1302, Cambridge, MA, USA, December 2009. The MIT Press.

** Abstract:** Nonparametric Bayesian models provide a framework
for flexible probabilistic modelling of complex datasets. Unfortunately, the
high-dimensional averages required for Bayesian methods can be slow,
especially with the unbounded representations used by nonparametric models.
We address the challenge of scaling Bayesian inference to the increasingly
large datasets found in real-world applications. We focus on parallelisation
of inference in the Indian Buffet Process (IBP), which allows data points to
have an unbounded number of sparse latent features. Our novel MCMC sampler
divides a large data set between multiple processors and uses message passing
to compute the global likelihoods and posteriors. This algorithm, the first
parallel inference scheme for IBP-based models, scales to datasets orders of
magnitude larger than have previously been possible.

Finale Doshi-Velez.
**The Indian buffet
process: Scalable inference and extensions**.
Master's thesis, University of Cambridge, Cambridge, UK, August 2009.

** Abstract:** Many unsupervised learning problems seek to
identify hidden features from observations. In many real-world situations,
the number of hidden features is unknown. To avoid specifying the number of
hidden features a priori, one can use the Indian Buffet Process (IBP): a
nonparametric latent feature model that does not bound the number of active
features in a dataset. While elegant, the lack of efficient inference
procedures for the IBP has prevented its application in large-scale problems.
The core contributions of this thesis are three new inference procedures that
allow inference in the IBP to be scaled from a few hundred to 100,000
observations. This thesis contains three parts: (1) An introduction to the
IBP and a review of inference techniques and extensions. The first chapters
summarise three constructions for the IBP and review all currently published
inference techniques. Appendix C reviews extensions of the IBP to date. (2)
Novel techniques for scalable Bayesian inference. This thesis presents three
new inference procedures: (a) an accelerated Gibbs sampler for efficient
Bayesian inference in a broad class of conjugate models, (b) a parallel,
asynchronous Gibbs sampler that allows the accelerated Gibbs sampler to be
distributed across multiple processors, and (c) a variational inference
procedure for the IBP. (3) A framework for structured nonparametric latent
feature models. We also present extensions to the IBP to model more
sophisticated relationships between the co-occurring hidden features,
providing a general framework for correlated non-parametric feature
models.

J. Van Gael, A. Vlachos, and Z. Ghahramani.
**The infinite
HMM for unsupervised PoS tagging**.
In *Proceedings of the 2009 Conference on Empirical Methods in Natural
Language Processing (EMNLP)*, pages 678-687, Singapore, August 2009.
Association for Computational Linguistics.

** Abstract:** We
extend previous work on fully unsupervised part-of-speech tagging. Using a
non-parametric version of the HMM, called the infinite HMM (iHMM), we address
the problem of choosing the number of hidden states in unsupervised Markov
models for PoS tagging. We experiment with two non-parametric priors, the
Dirichlet and Pitman-Yor processes, on the Wall Street Journal dataset using
a parallelized implementation of an iHMM inference algorithm. We evaluate the
results with a variety of clustering evaluation metrics and achieve
equivalent or better performances than previously reported. Building on this
promising result we evaluate the output of the unsupervised PoS tagger as a
direct replacement for the output of a fully supervised PoS tagger for the
task of shallow parsing and compare the two evaluations.

R. Adams and Zoubin Ghahramani.
**Archipelago:
nonparametric Bayesian semi-supervised learning**.
In Léon Bottou and Michael Littman, editors, *26th International
Conference on Machine Learning*, pages 1-8, Montréal, QC, Canada,
June 2009. Omnipress.

** Abstract:** Semi-supervised learning
(SSL) is classification where additional unlabeled data can be used to
improve accuracy. Generative approaches are appealing in this situation, as a
model of the data's probability density can assist in identifying clusters.
Nonparametric Bayesian methods, while ideal in theory due to their principled
motivations, have been difficult to apply to SSL in practice. We present a
nonparametric Bayesian method that uses Gaussian processes for the generative
model, avoiding many of the problems associated with Dirichlet process
mixture models. Our model is fully generative and we take advantage of recent
advances in Markov chain Monte Carlo algorithms to provide a practical
inference method. Our method compares favorably to competing approaches on
synthetic and real-world multi-class data.

** Comment:** This paper was awarded Honourable Mention for
Best Paper at ICML 2009.

F. Doshi-Velez and Z. Ghahramani.
**Correlated
non-parametric latent feature models**.
In *Conference on Uncertainty in Artificial Intelligence (UAI 2009)*,
pages 143-150, Montréal, QC, Canada, June 2009. AUAI Press.

** Abstract:** We are often interested in explaining data
through a set of hidden factors or features. To allow for an unknown number
of such hidden features, one can use the IBP: a non-parametric latent feature
model that does not bound the number of active features in a dataset.
However, the IBP assumes that all latent features are uncorrelated, making it
inadequate for many real-world problems. We introduce a framework for
correlated non-parametric feature models, generalising the IBP. We use this
framework to generate several specific models and demonstrate applications on
real-world datasets.

Finale Doshi-Velez and Zoubin Ghahramani.
**Accelerated Gibbs
sampling for the Indian buffet process**.
In Léon Bottou and Michael Littman, editors, *26th International
Conference on Machine Learning*, pages 273-280, Montréal, QC,
Canada, June 2009. Omnipress.

** Abstract:** We often seek to
identify co-occurring hidden features in a set of observations. The Indian
Buffet Process (IBP) provides a non-parametric prior on the features present
in each observation, but current inference techniques for the IBP often scale
poorly. The collapsed Gibbs sampler for the IBP has a running time cubic in
the number of observations, and the uncollapsed Gibbs sampler, while linear,
is often slow to mix. We present a new linear-time collapsed Gibbs sampler
for conjugate likelihood models and demonstrate its efficacy on large
real-world datasets.

F. Doshi-Velez, K.T. Miller, J. Van Gael, and Y.W. Teh.
**Variational
inference for the Indian buffet process**.
In *12th International Conference on Artificial Intelligence and
Statistics*, volume 12, pages 137-144, Clearwater Beach, FL, USA,
April 2009. Journal of Machine Learning Research.

** Abstract:** The Indian Buffet Process (IBP) is a nonparametric prior for
latent feature models in which observations are influenced by a combination
of hidden features. For example, images may be composed of several objects
and sounds may consist of several notes. Latent feature models seek to infer
these unobserved features from a set of observations; the IBP provides a
principled prior in situations where the number of hidden features is
unknown. Current inference methods for the IBP have all relied on sampling.
While these methods are guaranteed to be accurate in the limit, samplers for
the IBP tend to mix slowly in practice. We develop a deterministic
variational method for inference in the IBP based on a truncated
stick-breaking approximation, provide theoretical bounds on the truncation
error, and evaluate our method in several data regimes.

Finale Doshi-Velez, Kurt T. Miller, Jurgen Van Gael, and Yee Whye Teh.
**Variational
inference for the Indian buffet process**.
Technical Report CBL-2009-001, University of Cambridge, Computational and
Biological Learning Laboratory, Department of Engineering, April 2009.

** Abstract:** The Indian Buffet Process (IBP) is a
nonparametric prior for latent feature models in which observations are
influenced by a combination of hidden features. For example, images may be
composed of several objects and sounds may consist of several notes. Latent
feature models seek to infer these unobserved features from a set of
observations; the IBP provides a principled prior in situations where the
number of hidden features is unknown. Current inference methods for the IBP
have all relied on sampling. While these methods are guaranteed to be
accurate in the limit, samplers for the IBP tend to mix slowly in practice.
We develop a deterministic variational method for inference in the IBP based
on truncating to infinite models, provide theoretical bounds on the
truncation error, and evaluate our method in several data regimes. This
technical report is a longer version of Doshi-Velez et al. (2009).

T. Stepleton, Z. Ghahramani, G. Gordon, and T.-S. Lee.
**The block
diagonal infinite hidden Markov model**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 552-559,
Clearwater Beach, FL, USA, April 2009. Microtome Publishing (paper), Journal
of Machine Learning Research (online).
ISSN 1938-7228.

** Abstract:** The Infinite Hidden Markov Model
(IHMM) extends hidden Markov models to have a countably infinite number of
hidden states (Beal et al., 2002; Teh et al.,
2006). We present a generalization of this framework that introduces nearly
block-diagonal structure in the transitions between the hidden states, where
blocks correspond to "sub-behaviors" exhibited by data sequences. In
identifying such structure, the model classifies, or partitions, sequence
data according to these sub-behaviors in an unsupervised way. We present an
application of this model to artificial data, a video gesture classification
task, and a musical theme labeling task, and show that components of the
model can also be applied to graph segmentation.

Yang Xu, Katherine A. Heller, and Zoubin Ghahramani.
**Tree-based
inference for Dirichlet process mixtures**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 623-630,
Clearwater Beach, FL, USA, April 2009. Microtome Publishing (paper), Journal
of Machine Learning Research (online).
ISSN 1938-7228.

** Abstract:** The Dirichlet process mixture
(DPM) is a widely used model for clustering and for general nonparametric
Bayesian density estimation. Unfortunately, as in many statistical models,
exact inference in a DPM is intractable, and approximate methods are needed
to perform efficient inference. While most attention in the literature has
been placed on Markov chain Monte Carlo (MCMC) [1, 2, 3], variational
Bayesian (VB) [4] and collapsed variational methods [5], [6] recently
introduced a novel class of approximation for DPMs based on Bayesian
hierarchical clustering (BHC). These tree-based combinatorial approximations
efficiently sum over exponentially many ways of partitioning the data and
offer a novel lower bound on the marginal likelihood of the DPM [6]. In this
paper we make the following contributions: (1) We show empirically that the
BHC lower bounds are substantially tighter than the bounds given by VB [4]
and by collapsed variational methods [5] on synthetic and real datasets. (2)
We also show that BHC offers a more accurate predictive performance on these
datasets. (3) We further improve the tree-based lower bounds with an
algorithm that efficiently sums contributions from alternative trees. (4) We
present a fast approximate method for BHC. Our results suggest that our
combinatorial approximate inference methods and lower bounds may be useful
not only in DPMs but in other models as well.

A. Vlachos, A. Korhonen, and Z. Ghahramani.
**Unsupervised and
constrained Dirichlet process mixture models for verb clustering**.
In *4th Workshop on Statistical Machine Translation, EACL '09*, Athens,
Greece, March 2009.

** Abstract:** In this work, we apply
Dirichlet Process Mixture Models (DPMMs) to a learning task in natural
language processing (NLP): lexical-semantic verb clustering. We thoroughly
evaluate a method of guiding DPMMs towards a particular clustering solution
using pairwise constraints. The quantitative and qualitative evaluation
performed highlights the benefits of both standard and constrained DPMMs
compared to previously used approaches. In addition, it sheds light on the
use of evaluation measures and their practical application.

Karsten M. Borgwardt and Zoubin Ghahramani.
**Bayesian two-sample
tests**.
*arXiv*, abs/0906.4032, 2009.

** Abstract:** In this
paper, we present two classes of Bayesian approaches to the two-sample
problem. Our first class of methods extends the Bayesian t-test to include
all parametric models in the exponential family and their conjugate priors.
Our second class of methods uses Dirichlet process mixtures (DPM) of such
conjugate-exponential distributions as flexible nonparametric priors over the
unknown distributions.

Finale Doshi-Velez and Zoubin Ghahramani.
**Accelerated
sampling for the Indian buffet process**.
In Andrea Pohoreckyj Danyluk, Léon Bottou, and Michael L. Littman, editors,
*ICML*, volume 382 of *ACM International Conference Proceeding
Series*, page 35. ACM, 2009.

** Abstract:** We often
seek to identify co-occurring hidden features in a set of observations. The
Indian Buffet Process (IBP) provides a nonparametric prior on the features
present in each observation, but current inference techniques for the IBP
often scale poorly. The collapsed Gibbs sampler for the IBP has a running
time cubic in the number of observations, and the uncollapsed Gibbs sampler,
while linear, is often slow to mix. We present a new linear-time collapsed
Gibbs sampler for conjugate likelihood models and demonstrate its efficacy on
large real-world datasets.

Peter Orbanz.
**Construction of
nonparametric Bayesian models from parametric Bayes equations**.
In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta,
editors, *Advances in Neural Information Processing Systems 22*, pages
1392-1400. The MIT Press, 2009.

** Abstract:** We consider
the general problem of constructing nonparametric Bayesian models on
infinite-dimensional random objects, such as functions, infinite graphs or
infinite permutations. The problem has generated much interest in machine
learning, where it is treated heuristically, but has not been studied in full
generality in nonparametric Bayesian statistics, which tends to focus on
models over probability distributions. Our approach applies a standard tool
of stochastic process theory, the construction of stochastic processes from
their finite-dimensional marginal distributions. The main contribution of the
paper is a generalization of the classic Kolmogorov extension theorem to
conditional probabilities. This extension allows a rigorous construction of
nonparametric Bayesian models from systems of finite-dimensional, parametric
Bayes equations. Using this approach, we show (i) how existence of a
conjugate posterior for the nonparametric model can be guaranteed by choosing
conjugate finite-dimensional models in the construction, (ii) how the mapping
to the posterior parameters of the nonparametric model can be explicitly
determined, and (iii) that the construction of conjugate models in essence
requires the finite-dimensional models to be in the exponential family. As an
application of our constructive framework, we derive a model on infinite
permutations, the nonparametric Bayesian analogue of a model recently
proposed for the analysis of rank data.

** Comment:** Supplements (proofs) and techreport version

J. Van Gael, Y.W. Teh, and Z. Ghahramani.
**The infinite
factorial hidden Markov model**.
In D. Koller, D. Schuurmans, L. Bottou, and Y. Bengio, editors, *Advances in
Neural Information Processing Systems 21*, volume 21, pages
1697-1704, Cambridge, MA, USA, December 2008. The MIT Press.

** Abstract:** The infinite factorial hidden Markov model is a
non-parametric extension of the factorial hidden Markov model. Our model
defines a probability distribution over an infinite number of independent
binary hidden Markov chains which together produce an observable sequence of
random variables. Central to our model is a new type of non-parametric prior
distribution inspired by the Indian Buffet Process which we call the
*Indian Buffet Markov Process*.

Jurgen Van Gael, Yunus Saatçi, Yee-Whye Teh, and Zoubin Ghahramani.
**Beam sampling
for the infinite hidden Markov model**.
In *25th International Conference on Machine Learning*, volume 25,
pages 1088-1095, Helsinki, Finland, 2008. Association for Computing
Machinery.

** Abstract:** The infinite hidden Markov model is
a non-parametric extension of the widely used hidden Markov model. Our paper
introduces a new inference algorithm for the infinite Hidden Markov model
called beam sampling. Beam sampling combines slice sampling, which limits the
number of states considered at each time step to a finite number, with
dynamic programming, which samples whole state trajectories efficiently. Our
algorithm typically outperforms the Gibbs sampler and is more robust. We
present applications of iHMM inference using the beam sampler on changepoint
detection and text prediction problems.

David Knowles and Zoubin Ghahramani.
**Infinite sparse
factor analysis and infinite independent components analysis**.
In *7th International Conference on Independent Component Analysis and
Signal Separation*, pages 381-388, London, UK, September 2007. Springer,
doi 10.1007/978-3-540-74494-8_48.

** Abstract:** A
nonparametric Bayesian extension of Independent Components Analysis (ICA) is
proposed where observed data Y is modelled as a linear superposition, G, of a
potentially infinite number of hidden sources, X. Whether a given source is
active for a specific data point is specified by an infinite binary matrix,
Z. The resulting sparse representation allows increased data reduction
compared to standard ICA. We define a prior on Z using the Indian Buffet
Process (IBP). We describe four variants of the model, with Gaussian or
Laplacian priors on X and the one or two-parameter IBPs. We demonstrate
Bayesian inference under these models using a Markov chain Monte Carlo (MCMC)
algorithm on synthetic and gene expression data and compare to standard ICA
algorithms.

E. Meeds, Z. Ghahramani, R. Neal, and S.T. Roweis.
**Modelling
dyadic data with binary latent factors**.
In B. Schölkopf, J. Platt, and T. Hofmann, editors, *Advances in Neural
Information Processing Systems 19*, Bradford Books, pages 977-984,
Cambridge, MA, USA, September 2007. The MIT Press.
The online contents list gives pages 1002-1009; the PDF contents give pages 977-984.

** Abstract:** We introduce binary matrix factorization, a novel
model for unsupervised matrix decomposition. The decomposition is learned by
fitting a non-parametric Bayesian probabilistic model with binary latent
variables to a matrix of dyadic data. Unlike bi-clustering models, which
assign each row or column to a single cluster based on a categorical hidden
feature, our binary feature model reflects the prior belief that items and
attributes can be associated with more than one latent cluster at a time. We
provide simple learning and inference rules for this new model and show how
to extend it to an infinite model in which the number of features is not a
priori fixed but is allowed to grow with the size of the data.

Z. Ghahramani, T.L. Griffiths, and P. Sollich.
**Bayesian
nonparametric latent feature models (with discussion)**.
In J.M. Bernardo, M.J. Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A.F.M.
Smith, and M. West, editors, *Bayesian Statistics 8*, pages 201-226,
Oxford, UK, July 2007. Oxford University Press.

** Abstract:**
We describe a flexible nonparametric approach to latent variable modelling in
which the number of latent variables is unbounded. This approach is based on
a probability distribution over equivalence classes of binary matrices with a
finite number of rows, corresponding to the data points, and an unbounded
number of columns, corresponding to the latent variables. Each data point can
be associated with a subset of the possible latent variables, which we refer
to as the latent features of that data point. The binary variables in the
matrix indicate which latent feature is possessed by which data point, and
there is a potentially infinite array of features. We derive the distribution
over unbounded binary matrices by taking the limit of a distribution over
N×K binary matrices as K→∞. We define a simple generative
process for this distribution which we call the Indian buffet process (IBP;
Griffiths and Ghahramani, 2005, 2006) by analogy
to the Chinese restaurant process (Aldous, 1985; Pitman, 2002). The IBP has a
single hyperparameter which controls both the number of features per object
and the total number of features. We describe a two-parameter generalization
of the IBP which has additional flexibility, independently controlling the
number of features per object and the total number of features in the matrix.
The use of this distribution as a prior in an infinite latent feature model
is illustrated, and Markov chain Monte Carlo algorithms for inference are
described.

** Comment:** Includes discussion by David Dunson, and
rejoinder.
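
The generative process named in the abstract above can be sketched in a few lines. This is a minimal illustration of the standard one-parameter IBP, assuming the usual "customers and dishes" description: object i takes each existing feature k with probability m_k / i, where m_k counts earlier objects with that feature, and then adds Poisson(alpha / i) new features. The names `sample_ibp` and `sample_poisson` are hypothetical, and the two-parameter generalization and the MCMC algorithms from the paper are not covered.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from Poisson(rate) by Knuth's multiplication method (suits small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_ibp(num_objects, alpha, seed=0):
    """Return a binary matrix (list of rows) drawn from the one-parameter IBP."""
    rng = random.Random(seed)
    rows = []    # rows[i] = set of feature indices owned by object i
    counts = []  # counts[k] = number of objects owning feature k so far
    for i in range(1, num_objects + 1):
        # Take each existing feature k with probability counts[k] / i.
        owned = {k for k, m_k in enumerate(counts) if rng.random() < m_k / i}
        # Add Poisson(alpha / i) features that no earlier object has.
        num_new = sample_poisson(alpha / i, rng)
        owned.update(range(len(counts), len(counts) + num_new))
        counts.extend([0] * num_new)
        for k in owned:
            counts[k] += 1
        rows.append(owned)
    num_features = len(counts)
    return [[1 if k in row else 0 for k in range(num_features)] for row in rows]


if __name__ == "__main__":
    for row in sample_ibp(num_objects=6, alpha=2.0, seed=1):
        print(row)
```

The number of columns is random and grows with the number of objects, which is the sense in which the number of latent features is unbounded.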

Katherine A. Heller and Zoubin Ghahramani.
**A nonparametric
Bayesian approach to modeling overlapping clusters**.
In Marina Meila and Xiaotong Shen, editors, *AISTATS*, volume 2 of
*JMLR Proceedings*, pages 187-194. JMLR.org, 2007.

** Abstract:** Although clustering data into mutually exclusive partitions has
been an extremely successful approach to unsupervised learning, there are
many situations in which a richer model is needed to fully represent the
data. This is the case in problems where data points actually simultaneously
belong to multiple, overlapping clusters. For example a particular gene may
have several functions, therefore belonging to several distinct clusters of
genes, and a biologist may want to discover these through unsupervised
modeling of gene expression data. We present a new nonparametric Bayesian
method, the Infinite Overlapping Mixture Model (IOMM), for modeling
overlapping clusters. The IOMM uses exponential family distributions to model
each cluster and forms an overlapping mixture by taking products of such
distributions, much like products of experts (Hinton, 2002). The IOMM allows
an unbounded number of clusters, and assignments of points to (multiple)
clusters are modeled using an Indian Buffet Process (IBP; Griffiths and
Ghahramani, 2006). The IOMM has the desirable properties of being able to
focus in on overlapping regions while maintaining the ability to model a
potentially infinite number of clusters which may overlap. We derive MCMC
inference algorithms for the IOMM and show that these can be used to cluster
movies into multiple genres.

Yee Whye Teh, Dilan Görür, and Zoubin Ghahramani.
**Stick-breaking
construction for the Indian buffet process**.
In Marina Meila and Xiaotong Shen, editors, *AISTATS*, volume 2 of
*JMLR Proceedings*, pages 556-563. JMLR.org, 2007.

**
Abstract:** The Indian buffet process (IBP) is a Bayesian nonparametric
distribution whereby objects are modelled using an unbounded number of latent
features. In this paper we derive a stick-breaking representation for the
IBP. Based on this new representation, we develop slice samplers for the IBP
that are efficient, easy to implement and are more generally applicable than
the currently available Gibbs sampler. This representation, along with the
work of Thibaux and Jordan, also illuminates interesting theoretical
connections between the IBP, Chinese restaurant processes, Beta processes and
Dirichlet processes.
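
As a rough illustration of the stick-breaking representation mentioned in this abstract, the sketch below draws the decreasing feature probabilities as running products of Beta(alpha, 1) variables. That form is the construction as it is commonly stated for the IBP and is an assumption here, since the abstract does not give the formula; `ibp_stick_lengths` is a hypothetical name, and the truncation at `num_sticks` is purely for illustration.

```python
import random


def ibp_stick_lengths(alpha, num_sticks, seed=0):
    """Return the first `num_sticks` feature probabilities in decreasing order.

    Assumed construction: mu_k = nu_1 * nu_2 * ... * nu_k with nu_l ~ Beta(alpha, 1).
    """
    rng = random.Random(seed)
    lengths = []
    mu = 1.0
    for _ in range(num_sticks):
        mu *= rng.betavariate(alpha, 1.0)  # break off a fraction of the remaining stick
        lengths.append(mu)
    return lengths


if __name__ == "__main__":
    for k, mu in enumerate(ibp_stick_lengths(alpha=3.0, num_sticks=8, seed=1), start=1):
        print(f"feature {k}: probability {mu:.4f}")
```

Because each factor lies in [0, 1], the probabilities can only shrink as k grows, which is the kind of ordering that makes it possible for a sampler to work with finitely many features at a time.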

T. L. Griffiths and Z. Ghahramani.
**Infinite latent
feature models and the Indian Buffet Process**.
In Y. Weiss, B. Schölkopf, and J. Platt, editors, *Advances in Neural
Information Processing Systems 18*, pages 475-482, Cambridge, MA, USA,
December 2006. The MIT Press.

** Abstract:** We define a
probability distribution over equivalence classes of binary matrices with a
finite number of rows and an unbounded number of columns. This distribution
is suitable for use as a prior in probabilistic models that represent objects
using a potentially infinite array of features. We identify a simple
generative process that results in the same distribution over equivalence
classes, which we call the Indian buffet process. We illustrate the use of
this distribution as a prior in an infinite latent feature model, deriving a
Markov chain Monte Carlo algorithm for inference in this model and applying
the algorithm to an image dataset.

Dilan Görür, Frank Jäkel, and Carl Edward Rasmussen.
**A choice model
with infinitely many latent features**.
In W. W. Cohen and Andrew Moore, editors, *23rd International Conference on
Machine Learning*, pages 361-368, New York, NY, USA, June 2006. ACM
Press, doi
10.1145/1143844.1143890.

** Abstract:** Elimination by
aspects (EBA) is a probabilistic choice model describing how humans decide
between several options. The options from which the choice is made are
characterized by binary features and associated weights. For instance, when
choosing which mobile phone to buy the features to consider may be: long
lasting battery, color screen, etc. Existing methods for inferring the
parameters of the model assume pre-specified features. However, the features
that lead to the observed choices are not always known. Here, we present a
non-parametric Bayesian model to infer the features of the options and the
corresponding weights from choice data. We use the Indian buffet process
(IBP) as a prior over the features. Inference using Markov chain Monte Carlo
(MCMC) in conjugate IBP models has been previously described. The main
contribution of this paper is an MCMC algorithm for the EBA model that can
also be used in inference for other non-conjugate IBP models; this may
broaden the use of IBP priors considerably.

Frank Wood, Thomas L. Griffiths, and Zoubin Ghahramani.
**A non-parametric
Bayesian method for inferring hidden causes**.
In *UAI*. AUAI Press, 2006.

** Abstract:** We present a
non-parametric Bayesian approach to structure learning with hidden causes.
Previous Bayesian treatments of this problem define a prior over the number
of hidden causes and use algorithms such as reversible jump Markov chain
Monte Carlo to move between solutions. In contrast, we assume that the number
of hidden causes is unbounded, but only a finite number influence observable
variables. This makes it possible to use a Gibbs sampler to approximate the
distribution over causal structures. We evaluate the performance of both
approaches in discovering hidden causes in simulated data, and use our
non-parametric approach to discover hidden causes in a real medical
dataset.

A. Dubey, S. Hwang, C. Rangel, Carl Edward Rasmussen, Zoubin Ghahramani, and
David L. Wild.
**Clustering
protein sequence and structure space with infinite Gaussian mixture
models**.
In *Pacific Symposium on Biocomputing 2004*, pages 399-410, Singapore,
2004. World Scientific Publishing.

** Abstract:** We describe
a novel approach to the problem of automatically clustering protein sequences
and discovering protein families, subfamilies etc., based on the theory of
infinite Gaussian mixture models. This method allows the data itself to
dictate how many mixture components are required to model it, and provides a
measure of the probability that two proteins belong to the same cluster. We
illustrate our methods with application to three data sets: globin sequences,
globin sequences with known three-dimensional structures and G-protein coupled
receptor sequences. The consistency of the clusters indicates that our
method is producing biologically meaningful results, which provide a very
good indication of the underlying families and subfamilies. With the
inclusion of secondary structure and residue solvent accessibility
information, we obtain a classification of sequences of known structure which
reflects and extends their SCOP classifications.

Matthew J. Beal, Zoubin Ghahramani, and Carl Edward Rasmussen.
**The infinite
hidden Markov model**.
In *Advances in Neural Information Processing Systems 14*, pages
577-584, Cambridge, MA, USA, December 2002. The MIT Press.

**
Abstract:** We show that it is possible to extend hidden Markov models to
have a countably infinite number of hidden states. By using the theory of
Dirichlet processes we can implicitly integrate out the infinitely many
transition parameters, leaving only three hyperparameters which can be
learned from data. These three hyperparameters define a hierarchical
Dirichlet process capable of capturing a rich set of transition dynamics. The
three hyperparameters control the time scale of the dynamics, the sparsity of
the underlying state-transition matrix, and the expected number of distinct
hidden states in a finite sequence. In this framework it is also natural to
allow the alphabet of emitted symbols to be infinite - consider, for
example, symbols being possible words appearing in English text.

Carl Edward Rasmussen and Zoubin Ghahramani.
**Infinite mixtures of
Gaussian process experts**.
In *Advances in Neural Information Processing Systems 14*, pages
881-888, Cambridge, MA, USA, December 2002. The MIT Press.

**
Abstract:** We present an extension to the Mixture of Experts (ME) model,
where the individual experts are Gaussian Process (GP) regression models.
Using an input-dependent adaptation of the Dirichlet Process, we implement a
gating network for an infinite number of Experts. Inference in this model may
be done efficiently using a Markov Chain relying on Gibbs sampling. The model
allows the effective covariance function to vary with the inputs, and may
handle large datasets - thus potentially overcoming two of the biggest
hurdles with GP models. Simulations show the viability of this approach.

Carl Edward Rasmussen and Zoubin Ghahramani.
**Occam's
razor**.
In *Advances in Neural Information Processing Systems 13*, pages
294-300, Cambridge, MA, USA, December 2001. The MIT Press.

**
Abstract:** The Bayesian paradigm apparently only sometimes gives rise to
Occam's Razor; at other times very large models perform well. We give simple
examples of both kinds of behaviour. The two views are reconciled when
measuring complexity of functions, rather than of the machinery used to
implement them. We analyze the complexity of functions for some linear in the
parameter models that are equivalent to Gaussian Processes, and always find
Occam's Razor at work.

Carl Edward Rasmussen.
**The infinite Gaussian
mixture model**.
In *Advances in Neural Information Processing Systems 12*, pages
554-560. The MIT Press, 2000.

** Abstract:** In a Bayesian
mixture model it is not necessary a priori to limit the number of components
to be finite. In this paper an infinite Gaussian mixture model is presented
which neatly sidesteps the difficult problem of finding the "right" number of
mixture components. Inference in the model is done using an efficient
parameter-free Markov Chain that relies entirely on Gibbs sampling.

## Approximate Inference

For all but the simplest statistical models, exact learning and inference are computationally intractable. Approximate inference methods make it possible to learn realistic models from large data sets. Generally, approximate inference methods trade off computation time for accuracy. Some of the major classes of approximate inference methods include Markov chain Monte Carlo methods, variational methods and related algorithms such as Expectation Propagation.

Wenlin Chen, Samuel Horváth, and Peter Richtárik.
**Optimal client sampling
for federated learning**.
*Transactions on Machine Learning Research*, August 2022.

** Abstract:** It is well understood that client-master
communication can be a primary bottleneck in federated learning (FL). In this
work, we address this issue with a novel client subsampling scheme, where we
restrict the number of clients allowed to communicate their updates back to
the master node. In each communication round, all participating clients
compute their updates, but only the ones with important updates communicate
back to the master. We show that importance can be measured using only the
norm of the update and give a formula for optimal client participation. This
formula minimizes the distance between the full update, where all clients
participate, and our limited update, where the number of participating
clients is restricted. In addition, we provide a simple algorithm that
approximates the optimal formula for client participation, which allows for
secure aggregation and stateless clients, and thus does not compromise client
privacy. We show both theoretically and empirically that for Distributed SGD
(DSGD) and Federated Averaging (FedAvg), the performance of our approach can
be close to full participation and superior to the baseline where
participating clients are sampled uniformly. Moreover, our approach is
orthogonal to and compatible with existing methods for reducing communication
overhead, such as local methods and communication compression methods.

** Comment:** arXiv

Matthew Ashman, Thang D. Bui, Cuong V. Nguyen, Efstratios Markou, Adrian
Weller, Siddharth Swaroop, and Richard E. Turner.
**Partitioned variational inference:
A framework for probabilistic federated learning**.
2022.

** Abstract:** The proliferation of computing devices has
brought about an opportunity to deploy machine learning models on new problem
domains using previously inaccessible data. Traditional algorithms for
training such models often require data to be stored on a single machine with
compute performed by a single node, making them unsuitable for decentralised
training on multiple devices. This deficiency has motivated the development
of federated learning algorithms, which allow multiple data owners to train
collaboratively and use a shared model whilst keeping local data private.
However, many of these algorithms focus on obtaining point estimates of model
parameters, rather than probabilistic estimates capable of capturing model
uncertainty, which is essential in many applications. Variational inference
(VI) has become the method of choice for fitting many modern probabilistic
models. In this paper we introduce partitioned variational inference (PVI), a
general framework for performing VI in the federated setting. We develop new
supporting theory for PVI, demonstrating a number of properties that make it
an attractive choice for practitioners; use PVI to unify a wealth of
fragmented, yet related literature; and provide empirical results that
showcase the effectiveness of PVI in a variety of federated settings.

Javier Antorán, David Janz, James Urquhart Allingham, Erik A. Daxberger,
Riccardo Barbano, Eric T. Nalisnick, and José Miguel Hernández-Lobato.
**Adapting the
linearised Laplace model evidence for modern deep learning**.
In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang
Niu, and Sivan Sabato, editors, *39th International Conference on Machine
Learning*, volume 162 of *Proceedings of Machine Learning
Research*, pages 796-821. PMLR, 2022.

** Abstract:**
The linearised Laplace method for estimating model uncertainty has received
renewed attention in the Bayesian deep learning community. The method
provides reliable error bars and admits a closed-form expression for the
model evidence, allowing for scalable selection of model hyperparameters. In
this work, we examine the assumptions behind this method, particularly in
conjunction with model selection. We show that these interact poorly with
some now-standard tools of deep learning (stochastic approximation methods
and normalisation layers) and make recommendations for how to better adapt
this classic method to the modern setting. We provide theoretical support for
our recommendations and validate them empirically on MLPs, classic CNNs,
residual networks with and without normalisation layers, generative
autoencoders and transformers.

Wessel P. Bruinsma, Martin Tegnér, and Richard E. Turner.
**Modelling
non-smooth signals with complex spectral structure**.
In *25th International Conference on Artificial Intelligence and
Statistics*, 2022.

** Abstract:** The Gaussian Process
Convolution Model (GPCM; Tobar et al., 2015a) is a model for signals with
complex spectral structure. A significant limitation of the GPCM is that it
assumes a rapidly decaying spectrum: it can only model smooth signals.
Moreover, inference in the GPCM currently requires (1) a mean-field
assumption, resulting in poorly calibrated uncertainties, and (2) a tedious
variational optimisation of large covariance matrices. We redesign the GPCM
model to induce a richer distribution over the spectrum with relaxed
assumptions about smoothness: the Causal Gaussian Process Convolution Model
(CGPCM) introduces a causality assumption into the GPCM, and the Rough
Gaussian Process Convolution Model (RGPCM) can be interpreted as a Bayesian
nonparametric generalisation of the fractional Ornstein–Uhlenbeck process.
We also propose a more effective variational inference scheme, going beyond
the mean-field assumption: we design a Gibbs sampler which directly samples
from the optimal variational solution, circumventing any variational
optimisation entirely. The proposed variations of the GPCM are validated in
experiments on synthetic and real-world data, showing promising results.

Beau Coker, Wessel P. Bruinsma, David R. Burt, Weiwei Pan, and Finale
Doshi-Velez.
**Wide
mean-field Bayesian neural networks ignore the data**.
In *25th International Conference on Artificial Intelligence and
Statistics*, 2022.

** Abstract:** Bayesian neural networks
(BNNs) combine the expressive power of deep learning with the advantages of
Bayesian formalism. In recent years, the analysis of wide, deep BNNs has
provided theoretical insight into their priors and posteriors. However, we
have no analogous insight into their posteriors under approximate inference.
In this work, we show that mean-field variational inference *entirely
fails to model the data* when the network width is large and the
activation function is odd. Specifically, for fully-connected BNNs with odd
activation functions and a homoscedastic Gaussian likelihood, we show that
the *optimal* mean-field variational posterior predictive (i.e.,
function space) distribution converges to the prior predictive distribution
as the width tends to infinity. We generalize aspects of this result to other
likelihoods. Our theoretical results are suggestive of underfitting behavior
previously observed in BNNs. While our convergence bounds are
non-asymptotic and constants in our analysis can be computed, they are
currently too loose to be applicable in standard training regimes. Finally,
we show that the optimal approximate posterior need not tend to the prior if
the activation function is not odd, showing that our statements cannot be
generalized arbitrarily.

Yanzhi Chen, Weihao Sun, Yingzhen Li, and Adrian Weller.
**Scalable infomin
learning**.
In *Advances in Neural Information Processing Systems*, 2022.

** Abstract:** The task of infomin learning aims to learn a
representation with high utility while being uninformative about a specified
target, with the latter achieved by minimising the mutual information between
the representation and the target. It has broad applications, ranging from
training fair prediction models against protected attributes, to unsupervised
learning with disentangled representations. Recent works on infomin learning
mainly use adversarial training, which involves training a neural network to
estimate mutual information or its proxy and thus is slow and difficult to
optimise. Drawing on recent advances in slicing techniques, we propose a new
infomin learning approach, which uses a novel proxy metric to mutual
information. We further derive an accurate and analytically computable
approximation to this proxy metric, thereby removing the need to construct
neural network-based mutual information estimators. Compared to baselines,
experiments on algorithmic fairness, disentangled representation learning and
domain adaptation verify that our method can more effectively remove unwanted
information within a limited time budget.

Vincent Fortuin, Mark Collier, Florian Wenzel, James Urquhart Allingham,
Jeremiah Zhe Liu, Dustin Tran, Balaji Lakshminarayanan, Jesse Berent,
Rodolphe Jenatton, and Effrosyni Kokiopoulou.
**Deep classifiers with
label noise modeling and distance awareness**.
*Transactions on Machine Learning Research*, 2022.

**
Abstract:** Uncertainty estimation in deep learning has recently emerged as
a crucial area of interest to advance reliability and robustness in
safety-critical applications. While there have been many proposed methods
that either focus on distance-aware model uncertainties for
out-of-distribution detection or on input-dependent label uncertainties for
in-distribution calibration, both of these types of uncertainty are often
necessary. In this work, we propose the HetSNGP method for jointly modeling
the model and data uncertainty. We show that our proposed model affords a
favorable combination between these two types of uncertainty and thus
outperforms the baseline methods on some challenging out-of-distribution
datasets, including CIFAR-100C, ImageNet-C, and ImageNet-A. Moreover, we
propose HetSNGP Ensemble, an ensembled version of our method which
additionally models uncertainty over the network parameters and outperforms
other ensemble baselines.

** Comment:** Code

Vincent Fortuin, Adrià Garriga-Alonso, Sebastian W. Ober, Florian Wenzel,
Gunnar Rätsch, Richard E. Turner, Mark van der Wilk, and Laurence
Aitchison.
**Bayesian neural network
priors revisited**.
In *10th International Conference on Learning Representations*, 2022.

** Abstract:** Isotropic Gaussian priors are the de facto
standard for modern Bayesian neural network inference. However, it is unclear
whether these priors accurately reflect our true beliefs about the weight
distributions or give optimal performance. To find better priors, we study
summary statistics of neural network weights in networks trained using
stochastic gradient descent (SGD). We find that convolutional neural network
(CNN) and ResNet weights display strong spatial correlations, while fully
connected networks (FCNNs) display heavy-tailed weight distributions. We show
that building these observations into priors can lead to improved performance
on a variety of image classification datasets. Surprisingly, these priors
mitigate the cold posterior effect in FCNNs, but slightly increase the cold
posterior effect in ResNets.

Alessandro Davide Ialongo.
**Variational
Inference in Dynamical Systems**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2022.

** Abstract:** Dynamical systems are a powerful
formalism to analyse the world around us. Many datasets are sequential in
nature, and can be described by a discrete time evolution law. We are
interested in approaching the analysis of such datasets from a probabilistic
perspective. We would like to maintain justified beliefs about quantities
which, though useful in explaining the behaviour of a system, may not be
observable, as well as about the system's evolution itself, especially in
regimes we have not yet observed in our data. The framework of statistical
inference gives us the tools to do so, yet, for many systems of interest,
performing inference exactly is not computationally or analytically
tractable. The contribution of this thesis, then, is twofold: first, we
uncover two sources of bias in existing variational inference methods applied
to dynamical systems in general, and state space models whose transition
function is drawn from a Gaussian process (GPSSM) in particular. We show bias
can derive from assuming posteriors in non-linear systems to be jointly
Gaussian, and from assuming that we can sever the dependence between latent
states and transition function in state space model posteriors. Second, we
propose methods to address these issues, undoing the resulting biases. We do
this without compromising on computational efficiency or on the ability to
scale to larger datasets and higher dimensions, compared to the methods we
rectify. One method, the Markov Autoregressive Flow (Markov AF) addresses the
Gaussian assumption, by providing a more flexible class of posteriors, based
on normalizing flows, which can be easily evaluated, sampled, and optimised.
The other method, Variationally Coupled Dynamics and Trajectories (VCDT),
tackles the factorisation assumption, leveraging sparse Gaussian processes
and their variational representation to reintroduce dependence between latent
states and the transition function at no extra computational cost. Since the
objective of inference is to maintain calibrated beliefs, if we employed
approximations which are significantly biased in non-linear, noisy systems,
or when there is little data available, we would have failed in our
objective, as those are precisely the regimes in which uncertainty
quantification is all the more important. Hence we think it is essential, if
we wish to act optimally on such beliefs, to uncover, and, if possible, to
correct, all sources of systematic bias in our inference methods.

Vidhi Lalchand, Wessel P. Bruinsma, David R. Burt, and Carl E. Rasmussen.
**Sparse Gaussian
process hyperparameters: Optimize or integrate?**.
In *Advances in Neural Information Processing Systems*, 2022.

** Abstract:** The kernel function and
its hyperparameters are the central model selection choice in a Gaussian
process [Rasmussen and Williams, 2006]. Typically, the hyperparameters of the
kernel are chosen by maximising the marginal likelihood, an approach known as
Type-II maximum likelihood (ML-II). However, ML-II does not account for
hyperparameter uncertainty, and it is well-known that this can lead to
severely biased estimates and an underestimation of predictive uncertainty.
While there are several works which employ a fully Bayesian characterisation
of GPs, relatively few propose such approaches for the sparse GPs paradigm.
In this work we propose an algorithm for sparse Gaussian process regression
which leverages MCMC to sample from the hyperparameter posterior within the
variational inducing point framework of [Titsias, 2009]. This work is closely
related to Hensman et al. [2015b], but side-steps the need to sample the
inducing points, thereby significantly improving sampling efficiency in the
Gaussian likelihood case. We compare this scheme against natural baselines in
the literature and against stochastic variational GPs (SVGPs), and provide an
extensive computational analysis.

Ambrish Rawat, James Requeima, Wessel Bruinsma, and Richard Turner.
**Challenges and pitfalls of
Bayesian unlearning**.
In *ICML 2022 Workshop on Updatable Machine Learning (UpML)*, 2022.

** Abstract:** Machine unlearning refers to the task of removing
a subset of training data, thereby removing its contributions to a trained
model. Approximate unlearning is one class of methods for this task which
avoids the need to retrain the model from scratch on the retained data.
Bayes’ rule can be used to cast approximate unlearning as an inference
problem where the objective is to obtain the updated posterior by dividing
out the likelihood of deleted data. However, this has its own set of
challenges as one often doesn’t have access to the exact posterior of the
model parameters. In this work we examine the use of the Laplace
approximation and Variational Inference to obtain the updated posterior. With
a neural network trained for a regression task as the guiding example, we
draw insights on the applicability of Bayesian unlearning in practical
scenarios.
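
The "dividing out" step described in this abstract can be written down directly. Assuming the data points are conditionally independent given the parameters θ (an assumption made here, not stated in the abstract), and writing D_del for the deleted subset (notation introduced only for illustration), Bayes' rule gives:

```latex
p(\theta \mid D \setminus D_{\mathrm{del}})
  \;\propto\; \frac{p(\theta \mid D)}{p(D_{\mathrm{del}} \mid \theta)}
```

In practice p(θ | D) is only available approximately, which is the difficulty the abstract goes on to describe.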

Will Tebbutt.
**Advances in
Software and Spatio-Temporal Modelling with Gaussian Processes**.
PhD thesis, University of Cambridge, Department of Engineering, 2022.

** Abstract:** This thesis concerns the use of Gaussian
processes (GPs) as distributions over unknown functions in Machine Learning
and probabilistic modeling. GPs have been found to have utility in a wide
range of applications owing to their flexibility, interpretability, and
tractability. I advance their use in three directions. Firstly, the
abstractions upon which software is built for their use in practice. In
modern GP software libraries such as GPML, GPy, GPflow, and GPyTorch, the
kernel is undoubtedly the dominant abstraction. While it remains highly
successful it of course has limitations, and I propose to address some of
these through a complementary abstraction: affine transformations of GPs.
Specifically I show how a collection of GPs, and affine transformations
thereof, can themselves be treated as a single GP. This in turn leads to a
design for software, including exact and approximate inference algorithms. I
demonstrate the utility of this software through a collection of worked
examples, focussing on models which are more cleanly and easily expressed
using this new software. Secondly, I develop a new scalable approximate
inference algorithm for a class of GPs commonly utilised in spatio-temporal
problems. This is a setting in which GPs excel, for example enabling the
incorporation of important inductive biases, and observations made at
arbitrary points in time and space. However, the computation required to
perform exact inference and learning in GPs scales cubically in the number of
observations, necessitating approximation, to which end I combine two
important complementary classes of approximation: pseudo-point and Markovian.
The key contribution is the insight that a simple and useful way to combine
them turns out to be well-justified. This resolves an open question in the
literature, provides new insight into existing work, and a new family of
approximations. The efficacy of an important member of this family is
demonstrated empirically. Finally I develop a GP model and associated
approximate inference techniques for the prediction of sea surface
temperatures (SSTs) on decadal time scales, which are relevant when taking
planning decisions which consider resilience to climate change. There remains
a large degree of uncertainty as to the state of the climate on such time
scales, but it is thought to be possible to reduce this by exploiting the
predictability of natural variability in the climate. The developed GP-based
model incorporates a key assumption used by the existing statistical models
employed for decadal prediction, thus retaining a valuable inductive bias,
while offering several advantages. Amongst these is the lack of need for
spatial aggregation of data, which is especially relevant when data are
sparse, as is the case with historical ocean SST data. In summary, this
thesis contributes to the practical use of GPs through a set of abstractions
that are useful in the design of software, algorithms for approximate
inference in spatial-temporal settings, and their use in decadal climate
prediction.

Laurence Aitchison, Adam X. Yang, and Sebastian W. Ober.
**Deep
kernel processes**.
In *38th International Conference on Machine Learning*, 2021.

** Abstract:** We define deep kernel processes in which positive
definite Gram matrices are progressively transformed by nonlinear kernel
functions and by sampling from (inverse) Wishart distributions. Remarkably,
we find that deep Gaussian processes (DGPs), Bayesian neural networks (BNNs),
infinite BNNs, and infinite BNNs with bottlenecks can all be written as deep
kernel processes. For DGPs the equivalence arises because the Gram matrix
formed by the inner product of features is Wishart distributed, and as we
show, standard isotropic kernels can be written entirely in terms of this
Gram matrix — we do not need knowledge of the underlying features. We
define a tractable deep kernel process, the deep inverse Wishart process, and
give a doubly-stochastic inducing-point variational inference scheme that
operates on the Gram matrices, not on the features, as in DGPs. We show that
the deep inverse Wishart process gives superior performance to DGPs and
infinite BNNs on fully-connected baselines.

David R. Burt, Sebastian W. Ober, Adrià Garriga-Alonso, and Mark van der
Wilk.
**Understanding
variational inference in function-space**.
In *3rd Symposium on Advances in Approximate Bayesian Inference*,
2021.

** Abstract:** Recent work has attempted to directly
approximate the ‘function-space’ or predictive posterior distribution of
Bayesian models, without approximating the posterior distribution over the
parameters. This is appealing in e.g. Bayesian neural networks, where we only
need the former, and the latter is hard to represent. In this work, we
highlight some advantages and limitations of employing the Kullback-Leibler
divergence in this setting. For example, we show that minimizing the KL
divergence between a wide class of parametric distributions and the posterior
induced by a (non-degenerate) Gaussian process prior leads to an ill-defined
objective function. Then, we propose (featurized) Bayesian linear regression
as a benchmark for ‘function-space’ inference methods that directly
measures approximation quality. We apply this methodology to assess aspects
of the objective function and inference scheme considered in Sun et al.
(2018), emphasizing the quality of approximation to Bayesian inference as
opposed to predictive performance.

Erik A. Daxberger, Eric T. Nalisnick, James Urquhart Allingham, Javier
Antorán, and José Miguel Hernández-Lobato.
**Bayesian deep
learning via subnetwork inference**.
In Marina Meila and Tong Zhang, editors, *38th International Conference on
Machine Learning*, volume 139 of *Proceedings of Machine Learning
Research*, pages 2510-2521. PMLR, 2021.

** Abstract:**
The Bayesian paradigm has the potential to solve core issues of deep neural
networks such as poor calibration and data inefficiency. Alas, scaling
Bayesian inference to large weight spaces often requires restrictive
approximations. In this work, we show that it suffices to perform inference
over a small subset of model weights in order to obtain accurate predictive
posteriors. The other weights are kept as point estimates. This subnetwork
inference framework enables us to use expressive, otherwise intractable,
posterior approximations over such subsets. In particular, we implement
subnetwork linearized Laplace: We first obtain a MAP estimate of all weights
and then infer a full-covariance Gaussian posterior over a subnetwork. We
propose a subnetwork selection strategy that aims to maximally preserve the
model's predictive uncertainty. Empirically, our approach is effective
compared to ensembles and less expressive posterior approximations over full
networks.

Metod Jazbec, Matt Ashman, Vincent Fortuin, Michael Pearce, Stephan Mandt, and
Gunnar Rätsch.
**Scalable
Gaussian process variational autoencoders**.
In Arindam Banerjee and Kenji Fukumizu, editors, *Proceedings of The 24th
International Conference on Artificial Intelligence and Statistics*,
volume 130 of *Proceedings of Machine Learning Research*, pages
3511-3519. Proceedings of Machine Learning Research, 13-15 Apr 2021.

** Abstract:** Conventional variational autoencoders fail in
modeling correlations between data points due to their use of factorized
priors. Amortized Gaussian process inference through GP-VAEs has led to
significant improvements in this regard, but is still inhibited by the
intrinsic complexity of exact GP inference. We improve the scalability of
these methods through principled sparse inference approaches. We propose a
new scalable GP-VAE model that outperforms existing approaches in terms of
runtime and memory footprint, is easy to implement, and allows for joint
end-to-end optimization of all components.

Chelsea Murray, James Urquhart Allingham, Javier Antorán, and José Miguel
Hernández-Lobato.
**Addressing bias
in active learning with depth uncertainty networks... or not**.
In Melanie F. Pradier, Aaron Schein, Stephanie L. Hyland, Francisco J. R. Ruiz,
and Jessica Zosa Forde, editors, *I (Still) Can't Believe It's Not Better!
Workshop at NeurIPS 2021, Virtual Workshop, December 13, 2021*, volume
163 of *Proceedings of Machine Learning Research*, pages 59-63.
PMLR, 2021.

** Abstract:** Farquhar et al. [2021] show that
correcting for active learning bias with underparameterised models leads to
improved downstream performance. For overparameterised models such as NNs,
however, correction leads either to decreased or unchanged performance. They
suggest that this is due to an "overfitting bias" which offsets the active
learning bias. We show that depth uncertainty networks operate in a low
overfitting regime, much like underparameterised models. They should
therefore see an increase in performance with bias correction. Surprisingly,
they do not. We propose that this negative result, as well as the results
of Farquhar et al. [2021], can be explained via the lens of the bias-variance
decomposition of generalisation error.

Joseph Marino, Alexandre Piche, Alessandro Davide Ialongo, and Yisong Yue.
**Iterative
amortized policy optimization**.
In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan,
editors, *Advances in Neural Information Processing Systems 34*,
volume 34, pages 15667-15681. Curran Associates, Inc., 2021.

** Abstract:** Policy networks are a central feature of deep
reinforcement learning (RL) algorithms for continuous control, enabling the
estimation and sampling of high-value actions. From the variational inference
perspective on RL, policy networks, when used with entropy or KL
regularization, are a form of amortized optimization, optimizing network
parameters rather than the policy distributions directly. However, direct
amortized mappings can yield suboptimal policy estimates and restricted
distributions, limiting performance and exploration. Given this perspective,
we consider the more flexible class of iterative amortized optimizers. We
demonstrate that the resulting technique, iterative amortized policy
optimization, yields performance improvements over direct amortization on
benchmark continuous control tasks.

Sebastian W. Ober and Laurence Aitchison.
**Global
inducing point variational posteriors for Bayesian neural networks and deep
Gaussian processes**.
In *38th International Conference on Machine Learning*, 2021.

** Abstract:** We consider the optimal approximate posterior
over the top-layer weights in a Bayesian neural network for regression, and
show that it exhibits strong dependencies on the lower-layer weights. We
adapt this result to develop a correlated approximate posterior over the
weights at all layers in a Bayesian neural network. We extend this approach
to deep Gaussian processes, unifying inference in the two model classes. Our
approximate posterior uses learned “global” inducing points, which are
defined only at the input layer and propagated through the network to obtain
inducing inputs at subsequent layers. By contrast, standard “local”
inducing point methods from the deep Gaussian process literature optimise a
separate set of inducing inputs at every layer, and thus do not model
correlations across layers. Our method gives state-of-the-art performance for
a variational Bayesian method, without data augmentation or tempering, on
CIFAR-10 of 86.7%, which is comparable to SGMCMC without tempering but with
data augmentation (88% in Wenzel et al. 2020).

Sebastian W. Ober and Laurence Aitchison.
**A
variational approximate posterior for the deep Wishart process**.
In *Advances in Neural Information Processing Systems 34*, 2021.

** Abstract:** Recent work introduced deep kernel processes as
an entirely kernel-based alternative to NNs (Aitchison et al. 2020). Deep
kernel processes flexibly learn good top-layer representations by alternately
sampling the kernel from a distribution over positive semi-definite matrices
and performing nonlinear transformations. A particular deep kernel process,
the deep Wishart process (DWP), is of particular interest because its prior
can be made equivalent to deep Gaussian process (DGP) priors for kernels that
can be expressed entirely in terms of Gram matrices. However, inference in
DWPs has not yet been possible due to the lack of sufficiently flexible
distributions over positive semi-definite matrices. Here, we give a novel
approach to obtaining flexible distributions over positive semi-definite
matrices by generalising the Bartlett decomposition of the Wishart
probability density. We use this new distribution to develop an approximate
posterior for the DWP that includes dependency across layers. We develop a
doubly-stochastic inducing-point inference scheme for the DWP and show
experimentally that inference in the DWP can improve performance over doing
inference in a DGP with the equivalent prior.

Will Tebbutt, Arno Solin, and Richard E. Turner.
**Combining
pseudo-point and state space approximations for sum-separable Gaussian
processes**.
In Cassio de Campos and Marloes H. Maathuis, editors, *Proceedings of the
Thirty-Seventh Conference on Uncertainty in Artificial Intelligence*,
Proceedings of Machine Learning Research, pages 1607-1617. PMLR, 2021.

** Abstract:** Gaussian processes (GPs) are important
probabilistic tools for inference and learning in spatio-temporal modelling
problems such as those in climate science and epidemiology. However, existing
GP approximations do not simultaneously support large numbers of off-the-grid
spatial data-points and long time-series which is a hallmark of many
applications. Pseudo-point approximations, one of the gold-standard methods
for scaling GPs to large data sets, are well suited for handling off-the-grid
spatial data. However, they cannot handle long temporal observation horizons
effectively, reverting to cubic computational scaling in the time dimension.
State space GP approximations are well suited to handling temporal data, if
the temporal GP prior admits a Markov form, leading to linear complexity in
the number of temporal observations, but have a cubic spatial cost and cannot
handle off-the-grid spatial data. In this work we show that there is a simple
and elegant way to combine pseudo-point methods with the state space GP
approximation framework to get the best of both worlds. The approach hinges
on a surprising conditional independence property which applies to
space–time separable GPs. We demonstrate empirically that the combined
approach is more scalable and applicable to a greater range of
spatio-temporal problems than either method on its own.

Kai Xu, Tor Erlend Fjelde, Charles Sutton, and Hong Ge.
**Couplings for
multinomial Hamiltonian Monte Carlo**.
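In *Proceedings of The 24th International Conference on Artificial
Intelligence and Statistics*, volume 130 of *Proceedings of Machine Learning
Research*, pages 3646-3654, 13-15 Apr 2021.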

** Abstract:** Hamiltonian
Monte Carlo (HMC) is a popular sampling method in Bayesian inference.
Recently, Heng & Jacob (2019) studied Metropolis HMC with couplings for
unbiased Monte Carlo estimation, establishing a generic parallelizable scheme
for HMC. However, in practice a different HMC method, multinomial HMC, is
considered as the go-to method, e.g. as part of the no-U-turn sampler. In
multinomial HMC, proposed states are not limited to end-points as in
Metropolis HMC; instead points along the entire trajectory can be proposed.
In this paper, we establish couplings for multinomial HMC, based on optimal
transport for multinomial sampling in its transition. We prove an upper bound
for the meeting time – the time it takes for the coupled chains to meet –
based on the notion of local contractivity. We evaluate our methods using
three targets: 1,000 dimensional Gaussians, logistic regression and
log-Gaussian Cox point processes. Compared to Heng & Jacob (2019),
coupled multinomial HMC generally attains a smaller meeting time, and is more
robust to choices of step sizes and trajectory lengths, which allows re-use
of existing adaptation methods for HMC. These improvements together pave the
way for a wider and more practical use of coupled HMC methods.

Jonathan Gordon, Wessel Bruinsma, Andrew Y. K. Foong, James Requeima, Yann
Dubois, and Richard Turner.
**Convolutional
conditional neural processes**.
In *8th International Conference on Learning Representations*, Addis
Ababa, April 2020.

** Abstract:** We introduce the
Convolutional Conditional Neural Process (ConvCNP), a new member of the
Neural Process family that models translation equivariance in the data.
Translation equivariance is an important inductive bias for many learning
problems including time series modelling, spatial data, and images. The model
embeds data sets into an infinite-dimensional function space, as opposed to
finite-dimensional vector spaces. To formalize this notion, we extend the
theory of neural representations of sets to include functional
representations, and demonstrate that any translation-equivariant embedding
can be represented using a convolutional deep-set. We evaluate ConvCNPs in
several settings, demonstrating that they achieve state-of-the-art
performance compared to existing NPs. We demonstrate that building in
translation equivariance enables zero-shot generalization to challenging,
out-of-domain tasks.

Javier Antorán, James Urquhart Allingham, and José Miguel
Hernández-Lobato.
**Depth
uncertainty in neural networks**.
In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan,
and Hsuan-Tien Lin, editors, *Advances in Neural Information Processing
Systems 33*, 2020.

** Abstract:** Existing methods for
estimating uncertainty in deep learning tend to require multiple forward
passes, making them unsuitable for applications where computational resources
are limited. To solve this, we perform probabilistic reasoning over the depth
of neural networks. Different depths correspond to subnetworks which share
weights and whose predictions are combined via marginalisation, yielding
model uncertainty. By exploiting the sequential structure of feed-forward
networks, we are able to both evaluate our training objective and make
predictions with a single forward pass. We validate our approach on
real-world regression and image classification tasks. Our approach provides
uncertainty calibration, robustness to dataset shift, and accuracies
competitive with more computationally expensive baselines.

** Comment:** Code

Matthew Ashman, Jonny So, Will Tebbutt, Vincent Fortuin, Michael Pearce, and
Richard E. Turner.
**Sparse Gaussian process
variational autoencoders**.
2020.

** Abstract:** Large, multi-dimensional spatio-temporal
datasets are omnipresent in modern science and engineering. An effective
framework for handling such data are Gaussian process deep generative models
(GP-DGMs), which employ GP priors over the latent variables of DGMs. Existing
approaches for performing inference in GP-DGMs do not support sparse GP
approximations based on inducing points, which are essential for the
computational efficiency of GPs, nor do they handle missing data - a natural
occurrence in many spatio-temporal datasets - in a principled manner. We
address these shortcomings with the development of the sparse Gaussian
process variational autoencoder (SGP-VAE), characterised by the use of
partial inference networks for parameterising sparse GP approximations.
Leveraging the benefits of amortised variational inference, the SGP-VAE
enables inference in multi-output sparse GPs on previously unobserved data
with no additional training. The SGP-VAE is evaluated in a variety of
experiments where it outperforms alternative approaches including
multi-output GPs and structured VAEs.

David R. Burt, Carl Edward Rasmussen, and Mark van der Wilk.
**Convergence
of sparse variational inference in Gaussian processes regression**.
*Journal of Machine Learning Research*, 21, 2020.

** Abstract:** Gaussian processes are distributions over functions that are
versatile and mathematically convenient priors in Bayesian modelling.
However, their use is often impeded for data with large numbers of
observations, $N$, due to the cubic (in $N$) cost of matrix operations used in
exact inference. Many solutions have been proposed that rely on $M \ll N$
inducing variables to form an approximation at a cost of $O(NM^2)$.
While the computational cost appears linear in $N$, the true complexity depends
on how $M$ must scale with $N$ to ensure a certain quality of the approximation.
In this work, we investigate upper and lower bounds on how $M$ needs to grow
with $N$ to ensure high quality approximations. We show that we can make the
KL-divergence between the approximate model and the exact posterior
arbitrarily small for a Gaussian-noise regression model with $M \ll N$.
Specifically, for the popular squared exponential kernel and $D$-dimensional
Gaussian distributed covariates, $M = O((\log N)^D)$ suffice and a
method with an overall computational cost of $O(N (\log N)^{2D} (\log \log N)^2)$
can be used to perform inference.
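
The growth rates quoted in this abstract can be made concrete with a small numerical sketch. It compares the cubic cost of exact inference with a cost that is linear in N and quadratic in M when M grows like (log N)^D. The constant `c` and all function names are illustrative assumptions, not quantities from the paper; an asymptotic rate says nothing about constants, so the numbers only show the trend.

```python
import math


def inducing_points(n, d, c=1.0):
    """Hypothetical rule M = ceil(c * (log N)^D); c is an assumed constant."""
    return max(1, math.ceil(c * math.log(n) ** d))


def exact_cost(n):
    """Order-of-magnitude cost of exact GP regression: N^3."""
    return n ** 3


def sparse_cost(n, d, c=1.0):
    """Order-of-magnitude cost of sparse inference: N * M^2."""
    m = inducing_points(n, d, c)
    return n * m ** 2


if __name__ == "__main__":
    # Print N, the assumed M, and the ratio of sparse to exact cost for D = 2.
    for n in (10**3, 10**4, 10**5, 10**6):
        m = inducing_points(n, d=2)
        print(n, m, sparse_cost(n, d=2) / exact_cost(n))
```

Under these assumptions the ratio of sparse to exact cost shrinks as N grows, which is the qualitative point of the abstract.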

Krzysztof Choromanski, David Cheikhi, Jared Davis, Valerii Likhosherstov,
Achille Nazaret, Achraf Bahamou, Xingyou Song, Mrugank Akarte, Jack
Parker-Holder, Jacob Bergquist, Yuan Gao, Aldo Pacchiano, Tamas Sarlos,
Adrian Weller, and Vikas Sindhwani.
**Stochastic flows and geometric
optimization on the orthogonal group**.
In *37th International Conference on Machine Learning*, 2020.

** Abstract:** We present a new class of stochastic,
geometrically-driven optimization algorithms on the orthogonal group O(d) and
naturally reductive homogeneous manifolds obtained from the action of the
rotation group SO(d). We theoretically and experimentally demonstrate that
our methods can be applied in various fields of machine learning including
deep, convolutional and recurrent neural networks, reinforcement learning,
normalizing flows and metric learning. We show an intriguing connection
between efficient stochastic optimization on the orthogonal group and graph
theory (e.g. matching problem, partition functions over graphs,
graph-coloring). We leverage the theory of Lie groups and provide theoretical
results for the designed class of algorithms. We demonstrate broad
applicability of our methods by showing strong performance on the seemingly
unrelated tasks of learning world models to obtain stable policies for the
most difficult Humanoid agent from OpenAI Gym and improving convolutional
neural networks.

Jan-Peter Calliess, Stephen J. Roberts, Carl Edward Rasmussen, and Jan
Maciejowski.
**Lazily adapted constant kinky inference for non-parametric regression and
model-reference adaptive control**.
*Automatica*, 122, 2020. doi: 10.1016/j.automatica.2020.109216.

** Abstract:**
Techniques known as Nonlinear Set Membership prediction or Lipschitz
Interpolation are approaches to supervised machine learning that utilise
presupposed Lipschitz properties to perform inference over unobserved
function values. Provided a bound on the true best Lipschitz constant of the
target function is known a priori, they offer convergence guarantees, as well
as bounds around the predictions. Considering a more general setting that
builds on Lipschitz continuity, we propose an online method for estimating
the Lipschitz constant from function value observations that are
possibly corrupted by bounded noise. Utilising this as a data-dependent
hyper-parameter gives rise to a nonparametric machine learning method, for
which we establish strong universal approximation guarantees. That is, we
show that our prediction rule can learn any continuous function on compact
support in the limit of increasingly dense data, up to a worst-case error
that can be bounded by the level of observational error. We also consider
applications of our nonparametric regression method to learning-based
control. For a class of discrete-time settings, we establish convergence
guarantees on the closed-loop tracking error of our online learning-based
controllers. To provide evidence that our method can be beneficial not only
in theory but also in practice, we apply it in the context of nonparametric
model-reference adaptive control (MRAC). Across a range of simulated aircraft
roll-dynamics and performance metrics our approach outperforms recently
proposed alternatives that were based on Gaussian processes and RBF-neural
networks.
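
As a rough illustration of the Lipschitz interpolation idea this abstract builds on, the sketch below predicts with the midpoint between the tightest upper and lower envelopes that a Lipschitz constant allows, and estimates that constant from the largest slope seen between data pairs. It is a one-dimensional toy under stated assumptions: the function names, the `noise_bound` parameter, and the exact form of the estimate are illustrative and are not claimed to match the paper's rule.

```python
def estimate_lipschitz(xs, ys, noise_bound=0.0):
    """Largest pairwise slope, discounted by an assumed noise bound."""
    best = 0.0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            gap = abs(xs[i] - xs[j])
            if gap > 0.0:
                slope = (abs(ys[i] - ys[j]) - 2.0 * noise_bound) / gap
                best = max(best, slope)
    return best


def predict(x, xs, ys, lipschitz, noise_bound=0.0):
    """Return (prediction, lower bound, upper bound) at query point x."""
    # Each observation constrains f(x) to a cone around (x_i, y_i).
    upper = min(y + noise_bound + lipschitz * abs(x - xi) for xi, y in zip(xs, ys))
    lower = max(y - noise_bound - lipschitz * abs(x - xi) for xi, y in zip(xs, ys))
    return 0.5 * (upper + lower), lower, upper


if __name__ == "__main__":
    xs, ys = [0.0, 1.0, 2.0], [0.0, 1.0, 0.0]
    L = estimate_lipschitz(xs, ys)
    print(L, predict(0.5, xs, ys, L), predict(3.0, xs, ys, L))
```

The gap between the two envelopes widens away from the data, which is what provides the bounds around the predictions mentioned above.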

Andrew Foong, David Burt, Yingzhen Li, and Richard Turner.
**On the expressiveness of approximate inference in Bayesian neural
networks**.
In *Advances in Neural Information Processing Systems 33*, 2020.

Moein Khajehnejad, Ahmad Asgharian Rezaei, Mahmoudreza Babaei, Jessica
Hoffmann, Mahdi Jalili, and Adrian Weller.
**Adversarial graph embeddings for
fair influence maximization over social networks**.
In *International Joint Conference on Artificial Intelligence*, 2020.

** Abstract:** Influence maximization is a widely studied topic
in network science, where the aim is to reach the maximum possible number of
nodes, while only targeting a small initial set of individuals. It has
critical applications in many fields, including viral marketing, information
propagation, news dissemination, and vaccinations. However, the objective
does not usually take into account whether the final set of influenced nodes
is fair with respect to sensitive attributes, such as race or gender. Here we
address fair influence maximization, aiming to reach minorities more
equitably. We introduce Adversarial Graph Embeddings: we co-train an
auto-encoder for graph embedding and a discriminator to discern sensitive
attributes. This leads to embeddings which are similarly distributed across
sensitive attributes. We then find a good initial set by clustering the
embeddings. We believe we are the first to use embeddings for the task of
fair influence maximization. While there are typically trade-offs between
fairness and influence maximization objectives, our experiments on synthetic
and real-world datasets show that our approach dramatically reduces disparity
while remaining competitive with state-of-the-art influence maximization
methods.

Yunfei Teng, Wenbo Gao, Francois Chalus, Anna Choromanska, Donald Goldfarb, and
Adrian Weller.
**Leader stochastic gradient
descent (LSGD) for distributed training of deep learning models**.
In *Advances in Neural Information Processing Systems 32*, Vancouver,
December 2019.

** Abstract:** We consider distributed
optimization under communication constraints for training deep learning
models. We propose a new algorithm, whose parameter updates rely on two
forces: a regular gradient step, and a corrective direction dictated by the
currently best-performing worker (leader). Our method differs from the
parameter-averaging scheme EASGD in a number of ways: (i) our objective
formulation does not change the location of stationary points compared to the
original optimization problem; (ii) we avoid convergence decelerations caused
by pulling local workers descending to different local minima to each other
(i.e. to the average of their parameters); (iii) our update by design breaks
the curse of symmetry (the phenomenon of being trapped in poorly generalizing
sub-optimal solutions in symmetric non-convex landscapes); and (iv) our
approach is more communication efficient since it broadcasts only parameters
of the leader rather than all workers. We provide theoretical analysis of the
batch version of the proposed algorithm, which we call Leader Gradient
Descent (LGD), and its stochastic variant (LSGD). Finally, we implement an
asynchronous version of our algorithm and extend it to the multi-leader
setting, where we form groups of workers, each represented by its own local
leader (the best performer in a group), and update each worker with a
corrective direction comprised of two attractive forces: one to the local,
and one to the global leader (the best performer among all workers). The
multi-leader setting is well-aligned with current hardware architecture,
where local workers forming a group lie within a single computational node
and different groups correspond to different nodes. For training
convolutional neural networks, we empirically demonstrate that our approach
compares favorably to state-of-the-art baselines.
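
The two forces described in this abstract can be illustrated with a toy, single-process sketch. The quadratic loss, the step size `lr`, the pull strength `pull`, and the function names are all assumptions chosen for illustration. The update (a gradient step plus a term pulling each worker toward the current best worker) is one plain reading of the abstract, not the authors' implementation, and it ignores the distributed and stochastic aspects entirely.

```python
def loss(x):
    """Toy objective with its minimum at x = 3 (an assumption for the demo)."""
    return (x - 3.0) ** 2


def grad(x):
    """Gradient of the toy objective."""
    return 2.0 * (x - 3.0)


def leader_round(workers, lr=0.1, pull=0.5):
    """One synchronous round: gradient step plus pull toward the leader."""
    leader = min(workers, key=loss)  # currently best-performing worker
    return [x - lr * (grad(x) + pull * (x - leader)) for x in workers]


if __name__ == "__main__":
    workers = [-5.0, 0.0, 10.0]
    for _ in range(200):
        workers = leader_round(workers)
    print(workers)
```

Because the pull term vanishes for the leader itself, a minimiser of the original loss stays a stationary point of the update, in line with point (i) of the abstract.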

Tameem Adel and Adrian Weller.
**TibGM: A
transferable and information-based graphical model approach for reinforcement
learning**.
In *36th International Conference on Machine Learning*, Long Beach, June
2019.

** Abstract:** One of the challenges to reinforcement
learning (RL) is scalable transferability among complex tasks. Incorporating
a graphical model (GM), along with the rich family of related methods, as a
basis for RL frameworks provides potential to address issues such as
transferability, generalisation and exploration. Here we propose a flexible
GM-based RL framework which leverages efficient inference procedures to
enhance generalisation and transfer power. In our proposed transferable and
information-based graphical model framework ‘TibGM’, we show the
equivalence between our mutual information-based objective in the GM, and an
RL consolidated objective consisting of a standard reward maximisation target
and a generalisation/transfer objective. In settings where there is a sparse
or deceptive reward signal, our TibGM framework is flexible enough to
incorporate exploration bonuses depicting intrinsic rewards. We empirically
verify improved performance and exploration power.

Alessandro Davide Ialongo, Mark van der Wilk, James Hensman, and Carl Edward
Rasmussen.
**Overcoming mean-field
approximations in recurrent Gaussian process models**.
In *36th International Conference on Machine Learning*, Long Beach, June
2019.

** Abstract:** We identify a new variational inference
scheme for dynamical systems whose transition function is modelled by a
Gaussian process. Inference in this setting has either employed
computationally intensive MCMC methods, or relied on factorisations of the
variational posterior. As we demonstrate in our experiments, the
factorisation between latent system states and transition function can lead
to a miscalibrated posterior and to learning unnecessarily large noise terms.
We eliminate this factorisation by explicitly modelling the dependence
between state trajectories and the Gaussian process posterior. Samples of the
latent states can then be tractably generated by conditioning on this
representation. The method we obtain (VCDT: variationally coupled dynamics
and trajectories) gives better predictive performance and more calibrated
estimates of the transition function, yet maintains the same time and space
complexities as mean-field methods. Code is available at:
https://github.com/ialong/GPt.

Jonathan Gordon, John Bronskill, Matthias Bauer, Sebastian Nowozin, and
Richard Turner.
**Meta-learning
probabilistic inference for prediction**.
In *7th International Conference on Learning Representations*, New
Orleans, April 2019.

** Abstract:** This paper introduces a
new framework for data efficient and versatile learning. Specifically: 1) We
develop ML-PIP, a general framework for Meta-Learning approximate
Probabilistic Inference for Prediction. ML-PIP extends existing probabilistic
interpretations of meta-learning to cover a broad class of methods. 2) We
introduce Versa, an instance of the framework employing a flexible and
versatile amortization network that takes few-shot learning datasets as
inputs, with arbitrary numbers of shots, and outputs a distribution over
task-specific parameters in a single forward pass. Versa substitutes
optimization at test time with forward passes through inference networks,
amortizing the cost of inference and relieving the need for second
derivatives during training. 3) We evaluate Versa on benchmark datasets
where the method sets new state-of-the-art results, and can handle arbitrary
numbers of shots, and for classification, arbitrary numbers of classes at
train and test time. The power of the approach is then demonstrated through a
challenging few-shot ShapeNet view reconstruction task.

David R. Burt, Carl Edward Rasmussen, and Mark van der Wilk.
**Rates of convergence for sparse
variational Gaussian process regression**.
*arXiv*, 2019.

** Abstract:** Excellent variational
approximations to Gaussian process posteriors have been developed which avoid
the $O(N^3)$ scaling with dataset size $N$. They reduce the
computational cost to $O(NM^2)$, with $M \ll N$ being the number of
inducing variables, which summarise the process. While the computational cost
seems to be linear in $N$, the true complexity of the algorithm depends on how
$M$ must increase to ensure a certain quality of approximation. We address this
by characterising the behavior of an upper bound on the KL divergence to the
posterior. We show that with high probability the KL divergence can be made
arbitrarily small by growing $M$ more slowly than $N$. A particular case of
interest is that for regression with normally distributed inputs in
$D$ dimensions with the popular Squared Exponential kernel,
$M = O(\log^D N)$ is sufficient. Our results show that as datasets grow,
Gaussian process posteriors can truly be approximated cheaply, and provide a
concrete rule for how to increase $M$ in continual learning scenarios.

Kazuki Osawa, Siddharth Swaroop, Anirudh Jain, Runa Eschenhagen, Richard E.
Turner, Rio Yokota, and Mohammad Emtiyaz Khan.
**Practical
deep learning with Bayesian principles**.
In *Advances in Neural Information Processing Systems 32*, 2019.

** Abstract:** Bayesian methods promise to fix many shortcomings
of deep learning, but they are impractical and rarely match the performance
of standard methods, let alone improve them. In this paper, we demonstrate
practical training of deep networks with natural-gradient variational
inference. By applying techniques such as batch normalisation, data
augmentation, and distributed training, we achieve similar performance in
about the same number of epochs as the Adam optimiser, even on large datasets
such as ImageNet. Importantly, the benefits of Bayesian principles are
preserved: predictive probabilities are well-calibrated, uncertainties on
out-of-distribution data are improved, and continual-learning performance is
boosted. This work enables practical deep learning while preserving benefits
of Bayesian principles. A PyTorch implementation is available as a
plug-and-play optimiser.

Alessandro Davide Ialongo, Mark van der Wilk, James Hensman, and Carl Edward
Rasmussen.
**Non-factorised variational
inference in dynamical systems**.
In *First Symposium on Advances in Approximate Bayesian Inference*,
Montreal, December 2018.

** Abstract:** We focus on
variational inference in dynamical systems where the discrete time transition
function (or evolution rule) is modelled by a Gaussian process. The dominant
approach so far has been to use a factorised posterior distribution,
decoupling the transition function from the system states. This is not exact
in general and can lead to an overconfident posterior over the transition
function as well as an overestimation of the intrinsic stochasticity of the
system (process noise). We propose a new method that addresses these issues
and incurs no additional computational costs.

Yingzhen Li and Richard E. Turner.
**Gradient Estimators
for Implicit Models**.
In *Sixth International Conference on Learning Representations*,
Vancouver, Canada, May 2018.

** Abstract:** Implicit models,
which allow for the generation of samples but not for point-wise evaluation
of probabilities, are omnipresent in real-world problems tackled by machine
learning and a hot topic of current research. Some examples include data
simulators that are widely used in engineering and scientific research,
generative adversarial networks (GANs) for image synthesis, and
hot-off-the-press approximate inference techniques relying on implicit
distributions. The majority of existing approaches to learning implicit
models rely on approximating the intractable distribution or optimisation
objective for gradient-based optimisation, which is liable to produce
inaccurate updates and thus poor models. This paper alleviates the need for
such approximations by proposing the Stein gradient estimator, which directly
estimates the score function of the implicitly defined distribution. The
efficacy of the proposed estimator is empirically demonstrated by examples
that include meta-learning for approximate inference and entropy regularised
GANs that provide improved sample diversities.

Cuong V. Nguyen, Yingzhen Li, Thang D. Bui, and Richard E. Turner.
**Variational
Continual Learning**.
In *Sixth International Conference on Learning Representations*,
Vancouver, Canada, May 2018.

** Abstract:** This paper develops
variational continual learning (VCL), a simple but general framework for
continual learning that fuses online variational inference (VI) and recent
advances in Monte Carlo VI for neural networks. The framework can
successfully train both deep discriminative models and deep generative models
in complex continual learning settings where existing tasks evolve over time
and entirely new tasks emerge. Experimental results show that variational
continual learning outperforms state-of-the-art continual learning methods on
a variety of tasks, avoiding catastrophic forgetting in a fully automatic
way.

Sungsoo Ahn, Michael Chertkov, Jinwoo Shin, and Adrian Weller.
**Gauged
mini-bucket elimination for approximate inference**.
In *21st International Conference on Artificial Intelligence and
Statistics*, Playa Blanca, Lanzarote, Canary Islands, April 2018.

** Abstract:** Computing the partition function Z of a discrete
graphical model is a fundamental inference challenge. Since this is
computationally intractable, variational approximations are often used in
practice. Recently, so-called gauge transformations were used to improve
variational lower bounds on Z. In this paper, we propose a new
gauge-variational approach, termed WMBE-G, which combines gauge
transformations with the weighted mini-bucket elimination (WMBE) method.
WMBE-G can provide both upper and lower bounds on Z, and is easier to
optimize than the prior gauge-variational algorithm. We show that WMBE-G
strictly improves the earlier WMBE approximation for symmetric models
including Ising models with no magnetic field. Our experimental results
demonstrate the effectiveness of WMBE-G even for generic, nonsymmetric
models.

Krzysztof Choromanski, Mark Rowland, Tamas Sarlos, Vikas Sindhwani, Richard E.
Turner, and Adrian Weller.
**The geometry of
random features**.
In *21st International Conference on Artificial Intelligence and
Statistics*, Playa Blanca, Lanzarote, Canary Islands, April 2018.

** Abstract:** We present an in-depth examination of the
effectiveness of radial basis function kernel (beyond Gaussian) estimators
based on orthogonal random feature maps. We show that orthogonal estimators
outperform state-of-the-art mechanisms that use iid sampling under weak
conditions for tails of the associated Fourier distributions. We prove that
for the case of many dimensions, the superiority of the orthogonal transform
can be accurately measured by a property we define called the charm of the
kernel, and that orthogonal random features provide optimal (in terms of mean
squared error) kernel estimators. We provide the first theoretical results
which explain why orthogonal random features outperform unstructured on
downstream tasks such as kernel ridge regression by showing that orthogonal
random features provide kernel algorithms with better spectral properties
than the previous state-of-the-art. Our results enable practitioners more
generally to estimate the benefits from applying orthogonal transforms.
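
To illustrate the orthogonal random feature maps this abstract refers to, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes the Gaussian kernel exp(-||x - y||^2 / 2) and a single d x d orthogonal block; the function names `orthogonal_gaussian_matrix` and `random_features` are hypothetical.

```python
import numpy as np

def orthogonal_gaussian_matrix(d, rng):
    """Return a d x d matrix with mutually orthogonal rows.

    Row directions come from a random orthogonal matrix; row norms are
    redrawn from a chi distribution so each row has the same norm
    distribution as an i.i.d. Gaussian vector (an assumed construction).
    """
    g = rng.standard_normal((d, d))
    q, r = np.linalg.qr(g)
    q = q * np.sign(np.diag(r))  # sign fix so q is uniformly distributed
    norms = np.sqrt(rng.chisquare(df=d, size=d))
    return norms[:, None] * q

def random_features(x, w):
    """Map inputs x (n x d) to cos/sin random Fourier features for frequencies w (m x d)."""
    proj = x @ w.T
    m = w.shape[0]
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

rng = np.random.default_rng(0)
d = 8
w = orthogonal_gaussian_matrix(d, rng)
x = 0.3 * rng.standard_normal((5, d))

phi = random_features(x, w)
k_approx = phi @ phi.T  # estimate of the Gaussian kernel matrix

sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
k_exact = np.exp(-0.5 * sq_dists)
print(np.abs(k_approx - k_exact).max())
```

Replacing `w` with `rng.standard_normal((d, d))` gives the unstructured i.i.d. baseline that the abstract compares against.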

Sungsoo Ahn, Michael Chertkov, Adrian Weller, and Jinwoo Shin.
**Bucket
renormalization for approximate inference**.
In *35th International Conference on Machine Learning*, 2018.

** Abstract:** Probabilistic graphical models are a key tool in
machine learning applications. Computing the partition function, i.e.,
normalizing constant, is a fundamental task of statistical inference but it
is generally computationally intractable, leading to extensive study of
approximation methods. Iterative variational methods are a popular and
successful family of approaches. However, even state of the art variational
methods can return poor results or fail to converge on difficult instances.
In this paper, we instead consider computing the partition function via
sequential summation over variables. We develop robust approximate algorithms
by combining ideas from mini-bucket elimination with tensor network and
renormalization group methods from statistical physics. The resulting
“convergence-free” methods show good empirical performance on both
synthetic and real-world benchmark models, even for difficult instances.

Hong Ge, Kai Xu, and Zoubin Ghahramani.
**Turing: A language
for flexible probabilistic inference**.
84:1682-1690, 09-11 Apr 2018.

** Abstract:** Probabilistic
programming promises to simplify and democratize probabilistic machine
learning, but successful probabilistic programming systems require flexible,
generic and efficient inference engines. In this work, we present a system
called Turing for building MCMC algorithms for probabilistic programming
inference. Turing has a very simple syntax and makes full use of the
numerical capabilities in the Julia programming language, including all
implemented probability distributions, and automatic differentiation. Turing
supports a wide range of popular Monte Carlo algorithms, including
Hamiltonian Monte Carlo (HMC), HMC with No-U-Turns (NUTS), Gibbs sampling,
sequential Monte Carlo (SMC), and several particle MCMC (PMCMC) samplers.
Most importantly, Turing inference is composable: it combines MCMC operations
on subsets of variables, for example using a combination of an HMC engine and
a particle Gibbs (PG) engine. We explore several combinations of inference
methods with the aim of finding approaches that are both efficient and
universal, i.e. applicable to arbitrary probabilistic models. NUTS, a
popular variant of HMC that adapts the Hamiltonian simulation path length
automatically, is quite powerful for exploring differentiable target
distributions but is not universal. We identify some failure modes for
the NUTS engine, and demonstrate that composition of PG (for discrete
variables) and NUTS (for continuous variables) can be useful when the NUTS
engine is either not applicable, or simply does not work well. Our aim is to
present Turing and its composable inference engines to the world and
encourage other researchers to build on this system to help advance the field
of probabilistic machine learning.

Thang D. Bui, Cuong V. Nguyen, and Richard E. Turner.
**Streaming
sparse Gaussian process approximations**.
In *Advances in Neural Information Processing Systems 30*,
volume 30, Long Beach, California, USA, December 2017.

** Abstract:** Sparse approximations for Gaussian process models provide a
suite of methods that enable these models to be deployed in large data regimes
and enable analytic intractabilities to be sidestepped. However, the field
lacks a principled method to handle streaming data in which the posterior
distribution over function values and the hyperparameters are updated in an
online fashion. The small number of existing approaches either use suboptimal
hand-crafted heuristics for hyperparameter learning, or suffer from
catastrophic forgetting or slow updating when new data arrive. This paper
develops a new principled framework for deploying Gaussian process
probabilistic models in the streaming setting, providing principled methods
for learning hyperparameters and optimising pseudo-input locations. The
proposed framework is experimentally validated using synthetic and real-world
datasets.

** Comment:** The first two authors contributed equally.

Krzysztof Choromanski, Mark Rowland, and Adrian Weller.
**The
unreasonable effectiveness of structured random orthogonal
embeddings**.
In *Advances in Neural Information Processing Systems 30*, Long Beach,
California, December 2017.

** Abstract:** We examine a class
of embeddings based on structured random matrices with orthogonal rows which
can be applied in many machine learning applications including dimensionality
reduction and kernel approximation. For both the Johnson-Lindenstrauss
transform and the angular kernel, we show that we can select matrices
yielding guaranteed improved performance in accuracy and/or speed compared to
earlier methods. We introduce matrices with complex entries which give
significant further accuracy improvement. We provide geometric and Markov
chain-based perspectives to help understand the benefits, and empirical
results which suggest that the approach is helpful in a wider range of
applications.

Alessandro Davide Ialongo, Mark van der Wilk, and Carl Edward Rasmussen.
**Closed-form inference and
prediction in Gaussian process state-space models**.
In *NIPS Time Series Workshop 2017*, Long Beach, December 2017.

** Abstract:** We examine an analytic variational inference
scheme for the Gaussian Process State Space Model (GPSSM) - a probabilistic
model for system identification and time-series modelling. Our approach
performs variational inference over both the system states and the transition
function. We exploit Markov structure in the true posterior, as well as an
inducing point approximation to achieve linear time complexity in the length
of the time series. Contrary to previous approaches, no Monte Carlo sampling
is required: inference is cast as a deterministic optimisation problem. In a
number of experiments, we demonstrate the ability to model non-linear
dynamics in the presence of both process and observation noise as well as to
impute missing information (e.g. velocities from raw positions through time),
to de-noise, and to estimate the underlying dimensionality of the system.
Finally, we also introduce a closed-form method for multi-step prediction,
and a novel criterion for assessing the quality of our approximate
posterior.

Matej Balog, Nilesh Tripuraneni, Zoubin Ghahramani, and Adrian Weller.
**Lost
relatives of the Gumbel trick**.
In *34th International Conference on Machine Learning*, Sydney,
Australia, August 2017.

** Abstract:** The Gumbel trick is a
method to sample from a discrete probability distribution, or to estimate its
normalizing partition function. The method relies on repeatedly applying a
random perturbation to the distribution in a particular way, each time
solving for the most likely configuration. We derive an entire family of
related methods, of which the Gumbel trick is one member, and show that the
new methods have superior properties in several settings with minimal
additional computational cost. In particular, for the Gumbel trick to yield
computational benefits for discrete graphical models, Gumbel perturbations on
all configurations are typically replaced with so-called low-rank
perturbations. We show how a subfamily of our new methods adapts to this
setting, proving new upper and lower bounds on the log partition function and
deriving a family of sequential samplers for the Gibbs distribution. Finally,
we balance the discussion by showing how the simpler analytical form of the
Gumbel trick enables additional theoretical results.

** Comment:** [arXiv] [Poster] [Slides] [Code]
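
The basic Gumbel trick described in this abstract can be sketched as follows. This is a minimal illustration on a toy four-state distribution with full-rank perturbations, not the paper's code or its new family of estimators. The logits and the helper name `perturb_and_max` are assumptions made for the example.

```python
import numpy as np

def perturb_and_max(logits, n_samples, rng):
    """Add independent Gumbel(0, 1) noise to every configuration and maximise."""
    gumbel = rng.gumbel(loc=0.0, scale=1.0, size=(n_samples, logits.size))
    perturbed = logits + gumbel
    return perturbed.argmax(axis=1), perturbed.max(axis=1)

rng = np.random.default_rng(0)
logits = np.array([1.0, 0.5, -1.0, 2.0])  # toy unnormalised log-potentials
log_z_exact = np.log(np.exp(logits).sum())
probs = np.exp(logits - log_z_exact)

samples, maxima = perturb_and_max(logits, 200_000, rng)

# The argmax is an exact sample from the normalised distribution.
empirical = np.bincount(samples, minlength=logits.size) / samples.size

# The maximum is Gumbel-distributed with location log Z, so its mean
# equals log Z plus the Euler-Mascheroni constant.
log_z_estimate = maxima.mean() - np.euler_gamma

print(probs, empirical)
print(log_z_exact, log_z_estimate)
```

Enumerating every configuration, as done here, is only feasible for tiny models; that cost is what motivates the low-rank perturbations the abstract mentions.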

Yingzhen Li and Yarin Gal.
**Dropout Inference
in Bayesian Neural Networks with Alpha-divergences**.
In *34th International Conference on Machine Learning*, Sydney
AUSTRALIA, Aug 2017.

** Abstract:** To obtain uncertainty
estimates with real-world Bayesian deep learning models, practical inference
approximations are needed. Dropout variational inference (VI) for example has
been used for machine vision and medical applications, but VI can severely
underestimate model uncertainty. Alpha-divergences are alternative
divergences to VI’s KL objective, which are able to avoid VI’s
uncertainty underestimation. But these are hard to use in practice: existing
techniques can only use Gaussian approximating distributions, and require
existing models to be changed radically, thus are of limited use for
practitioners. We propose a re-parametrisation of the alpha-divergence
objectives, deriving a simple inference technique which, together with
dropout, can be easily implemented with existing models by simply changing
the loss of the model. We demonstrate improved uncertainty estimates and
accuracy compared to VI in dropout networks. We study our model’s epistemic
uncertainty far away from the data using adversarial images, showing that
these can be distinguished from non-adversarial images by examining our
model’s uncertainty.

Eric Jang, Shixiang Gu, and Ben Poole.
**Categorical reparametrization
with Gumbel-softmax**.
In *5th International Conference on Learning Representations*, Toulon
FRANCE, April 2017.

** Abstract:** Categorical variables are a
natural choice for representing discrete structure in the world. However,
stochastic neural networks rarely use categorical latent variables due to the
inability to backpropagate through samples. In this work, we present an
efficient gradient estimator that replaces the non-differentiable sample from
a categorical distribution with a differentiable sample from a novel
Gumbel-Softmax distribution. This distribution has the essential property
that it can be smoothly annealed into a categorical distribution. We show
that our Gumbel-Softmax estimator outperforms state-of-the-art gradient
estimators on structured output prediction and unsupervised generative
modeling tasks with categorical latent variables, and enables large speedups
on semi-supervised classification.
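
As a rough illustration of the relaxation described in this abstract, the sketch below draws Gumbel-Softmax samples with NumPy. It shows the sampling step only. A real gradient estimator would need an automatic differentiation framework, which is not used here. The function name `gumbel_softmax_sample` and the class probabilities are assumptions made for the example.

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature, rng):
    """Draw relaxed one-hot samples: softmax((logits + Gumbel noise) / temperature)."""
    gumbel = rng.gumbel(size=logits.shape)
    y = (logits + gumbel) / temperature
    y = y - y.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
class_probs = np.array([0.2, 0.5, 0.3])  # toy categorical distribution
logits = np.tile(np.log(class_probs), (100_000, 1))

# A high temperature gives smooth samples; a low one gives nearly one-hot samples.
soft = gumbel_softmax_sample(logits, temperature=1.0, rng=rng)
sharp = gumbel_softmax_sample(logits, temperature=0.1, rng=rng)

# The largest entry of each sample follows the target categorical
# distribution whatever the temperature.
freq = np.bincount(sharp.argmax(axis=1), minlength=3) / sharp.shape[0]
print(freq)
```

Lowering `temperature` towards zero corresponds to the annealing into a categorical distribution that the abstract describes.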

Alexandre Khae Wu Navarro, Jes Frellsen, and Richard E. Turner.
**The Multivariate Generalised
von Mises distribution: Inference and applications**.
In *31st AAAI Conference on Artificial Intelligence*, San Francisco, CA,
USA, January 2017. AAAI Press.

** Abstract:** Circular
variables arise in a multitude of data-modelling contexts ranging from
robotics to the social sciences, but they have been largely overlooked by the
machine learning community. This paper partially redresses this imbalance by
extending some standard probabilistic modelling tools to the circular domain.
First we introduce a new multivariate distribution over circular variables,
called the multivariate Generalised von Mises (mGvM) distribution. This
distribution can be constructed by restricting and renormalising a general
multivariate Gaussian distribution to the unit hyper-torus. Previously
proposed multivariate circular distributions are shown to be special cases of
this construction. Second, we introduce a new probabilistic model for
circular regression, that is inspired by Gaussian Processes, and a method for
probabilistic principal component analysis with circular hidden variables.
These models can leverage standard modelling tools (e.g. covariance functions
and methods for automatic relevance determination). Third, we show that the
posterior distribution in these models is a mGvM distribution which enables
development of an efficient variational free-energy scheme for performing
approximate inference and approximate maximum-likelihood learning.

Thang D. Bui, Josiah Yan, and Richard E. Turner.
**A unifying framework for
Gaussian process pseudo-point approximations using power expectation
propagation**.
*Journal of Machine Learning Research*, 18(104):1-72, 2017.

** Abstract:** Gaussian processes (GPs) are flexible
distributions over functions that enable high-level assumptions about unknown
functions to be encoded in a parsimonious, flexible and general way. Although
elegant, the application of GPs is limited by computational and analytical
intractabilities that arise when data are sufficiently numerous or when
employing non-Gaussian models. Consequently, a wealth of GP approximation
schemes have been developed over the last 15 years to address these key
limitations. Many of these schemes employ a small set of pseudo data points
to summarise the actual data. In this paper we develop a new pseudo-point
approximation framework using Power Expectation Propagation (Power EP) that
unifies a large number of these pseudo-point approximations. Unlike much of
the previous venerable work in this area, the new framework is built on
standard methods for approximate inference (variational free-energy, EP and
Power EP methods) rather than employing approximations to the probabilistic
generative model itself. In this way all of the approximation is performed at
'inference time' rather than at 'modelling time', resolving awkward
philosophical and empirical questions that trouble previous approaches.
Crucially, we demonstrate that the new framework includes new pseudo-point
approximation methods that outperform current approaches on regression and
classification tasks.

Yingzhen Li and Richard E. Turner.
**Rényi
Divergence Variational Inference**.
In *Advances in Neural Information Processing Systems 29*, Barcelona
SPAIN, Dec 2016.

** Abstract:** This paper introduces the
variational Rényi bound (VR) that extends traditional variational
inference to Rényi's alpha-divergences. This new family of variational
methods unifies a number of existing approaches, and enables a smooth
interpolation from the evidence lower-bound to the log (marginal) likelihood
that is controlled by the value of alpha that parametrises the divergence.
The reparameterization trick, Monte Carlo approximation and stochastic
optimisation methods are deployed to obtain a tractable and unified framework
for optimisation. We further consider negative alpha values and propose a
novel variational inference method as a new special case in the proposed
framework. Experiments on Bayesian neural networks and variational
auto-encoders demonstrate the wide applicability of the VR bound.
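
For reference, the variational Rényi bound is usually written as below. The notation is an assumption made for this note and does not appear in the abstract: $x$ is the observed data, $\theta$ the latent variables or parameters, and $q(\theta)$ the approximating distribution.

$$
\mathcal{L}_{\alpha}(q; x) = \frac{1}{1-\alpha} \log \mathbb{E}_{q(\theta)}\left[\left(\frac{p(\theta, x)}{q(\theta)}\right)^{1-\alpha}\right]
$$

Taking $\alpha \rightarrow 1$ recovers the evidence lower bound $\mathbb{E}_{q(\theta)}\left[\log \frac{p(\theta, x)}{q(\theta)}\right]$. Setting $\alpha = 0$ gives $\log p(x)$, provided $q$ covers the support of the posterior. These are the two ends of the interpolation the abstract describes.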

Thang D. Bui, Daniel Hernández-Lobato, José Miguel Hernández-Lobato,
Yingzhen Li, and Richard E. Turner.
**Deep Gaussian
processes for regression using approximate expectation propagation**.
In *33rd International Conference on Machine Learning*, New York, USA,
June 2016.

** Abstract:** Deep Gaussian processes (DGPs) are
multi-layer hierarchical generalisations of Gaussian processes (GPs) and are
formally equivalent to neural networks with multiple, infinitely wide hidden
layers. DGPs are nonparametric probabilistic models and as such are arguably
more flexible, have a greater capacity to generalise, and provide better
calibrated uncertainty estimates than alternative deep models. This paper
develops a new approximate Bayesian learning scheme that enables DGPs to be
applied to a range of medium to large scale regression problems for the first
time. The new method uses an approximate Expectation Propagation procedure
and a novel and efficient extension of the probabilistic backpropagation
algorithm for learning. We evaluate the new method for non-linear regression
on eleven real-world datasets, showing that it always outperforms GP
regression and is almost always better than state-of-the-art deterministic
and sampling-based approximate inference methods for Bayesian neural
networks. As a by-product, this work provides a comprehensive analysis of six
approximate Bayesian methods for training neural networks.

José Miguel Hernández-Lobato, Yingzhen Li, Mark Rowland, Thang D. Bui,
Daniel Hernández-Lobato, and Richard E. Turner.
**Black-box
Alpha Divergence Minimization**.
In *33rd International Conference on Machine Learning*, New York USA,
June 2016.

** Abstract:** Black-box alpha (BB-$α$) is a
new approximate inference method based on the minimization of
$α$-divergences. BB-$α$ scales to large datasets because it can be
implemented using stochastic gradient descent. BB-$α$ can be applied to
complex probabilistic models with little effort since it only requires as
input the likelihood function and its gradients. These gradients can be
easily obtained using automatic differentiation. By changing the divergence
parameter $α$, the method is able to interpolate between variational Bayes
(VB) ($α \rightarrow 0$) and an algorithm similar to expectation
propagation (EP) ($α = 1$). Experiments on probit regression and neural
network regression and classification problems show that BB-$α$ with
non-standard settings of $α$, such as $α = 0.5$, usually produces
better predictions than with $α \rightarrow 0$ (VB) or $α = 1$
(EP).

Jes Frellsen, Ole Winther, Zoubin Ghahramani, and Jesper Ferkinghoff-Borg.
**Bayesian generalised ensemble Markov chain Monte Carlo**.
In *19th International Conference on Artificial Intelligence and
Statistics*, Cadiz, Spain, May 2016.

** Abstract:**
Bayesian generalised ensemble (BayesGE) is a new method that addresses two
major drawbacks of standard Markov chain Monte Carlo algorithms for inference
in high-dimensional probability models: inapplicability to estimate the
partition function, and poor mixing properties. BayesGE uses a Bayesian
approach to iteratively update the belief about the density of states
(distribution of the log likelihood under the prior) for the model, with the
dual purpose of enhancing the sampling efficiency and making the estimation of
the partition function tractable. We benchmark BayesGE on Ising and Potts
systems and show that it compares favourably to existing state-of-the-art
methods.

Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih.
**MuProp: Unbiased backpropagation
for stochastic neural networks**.
In *4th International Conference on Learning Representations*, San Juan
PUERTO RICO, May 2016.

** Abstract:** Deep neural networks are
powerful parametric models that can be trained efficiently using the
backpropagation algorithm. Stochastic neural networks combine the power of
large parametric functions with that of graphical models, which makes it
possible to learn very complex distributions. However, as backpropagation is
not directly applicable to stochastic networks that include discrete sampling
operations within their computational graph, training such networks remains
difficult. We present MuProp, an unbiased gradient estimator for stochastic
networks, designed to make this task easier. MuProp improves on the
likelihood-ratio estimator by reducing its variance using a control variate
based on the first-order Taylor expansion of a mean-field network. Crucially,
unlike prior attempts at using backpropagation for training stochastic
networks, the resulting estimator is unbiased and well behaved. Our
experiments on structured output prediction and discrete latent variable
modeling demonstrate that MuProp yields consistently good performance across
a range of difficult tasks.

Alexander G D G Matthews, James Hensman, Richard E. Turner, and Zoubin
Ghahramani.
**On Sparse Variational methods
and the Kullback-Leibler divergence between stochastic processes**.
In *19th International Conference on Artificial Intelligence and
Statistics*, Cadiz, Spain, May 2016.

** Abstract:** The
variational framework for learning inducing variables (Titsias, 2009a) has
had a large impact on the Gaussian process literature. The framework may be
interpreted as minimizing a rigorously defined Kullback-Leibler divergence
between the approximating and posterior processes. To our knowledge this
connection has thus far gone unremarked in the literature. In this paper we
give a substantial generalization of the literature on this topic. We give a
new proof of the result for infinite index sets which allows inducing points
that are not data points and likelihoods that depend on all function values.
We then discuss augmented index sets and show that, contrary to previous
works, marginal consistency of augmentation is not enough to guarantee
consistency of variational inference with the original model. We then
characterize an extra condition where such a guarantee is obtainable. Finally
we show how our framework sheds light on interdomain sparse approximations
and sparse approximations for Cox processes.

Matthias Stephan Bauer, Mark van der Wilk, and Carl Edward Rasmussen.
**Understanding
probabilistic sparse Gaussian process approximations**.
In *Advances in Neural Information Processing Systems 29*, 2016.

** Abstract:** Good sparse approximations are essential for
practical inference in Gaussian Processes as the computational cost of exact
methods is prohibitive for large datasets. The Fully Independent Training
Conditional (FITC) and the Variational Free Energy (VFE) approximations are
two recent popular methods. Despite superficial similarities, these
approximations have surprisingly different theoretical properties and behave
differently in practice. We thoroughly investigate the two methods for
regression both analytically and through illustrative examples, and draw
conclusions to guide practical application.

** Comment:** arXiv

Shixiang Gu, Zoubin Ghahramani, and Richard E. Turner.
**Neural adaptive sequential Monte
Carlo**.
In *Advances in Neural Information Processing Systems 28*, Montréal
CANADA, Dec 2015.

** Abstract:** Sequential Monte Carlo (SMC),
or particle filtering, is a popular class of methods for sampling from an
intractable target distribution using a sequence of simpler intermediate
distributions. Like other importance sampling-based methods, performance is
critically dependent on the proposal distribution: a bad proposal can lead to
arbitrarily inaccurate estimates of the target distribution. This paper
presents a new method for automatically adapting the proposal using an
approximation of the Kullback-Leibler divergence between the true posterior
and the proposal distribution. The method is very flexible, applicable to any
parameterised proposal distribution and it supports online and batch
variants. We use the new framework to adapt powerful proposal distributions
with rich parameterisations based upon neural networks leading to Neural
Adaptive Sequential Monte Carlo (NASMC). Experiments indicate that NASMC
significantly improves inference in a non-linear state space model
outperforming adaptive proposal methods including the Extended Kalman and
Unscented Particle Filters. Experiments also indicate that improved inference
translates into improved parameter learning when NASMC is used as a
subroutine of Particle Marginal Metropolis Hastings. Finally we show that
NASMC is able to train a neural network-based deep recurrent generative model
achieving results that compete with the state-of-the-art for polyphonic
music modelling. NASMC can be seen as bridging the gap between adaptive SMC
methods and the recent work in scalable, black-box variational inference.

James Hensman, Alexander G D G Matthews, Maurizio Filippone, and Zoubin
Ghahramani.
**MCMC
for Variationally Sparse Gaussian Processes**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2015.

** Abstract:** Gaussian
process (GP) models form a core part of probabilistic machine learning.
Considerable research effort has been made into attacking three issues with
GP models: how to compute efficiently when the number of data is large; how
to approximate the posterior when the likelihood is not Gaussian and how to
estimate covariance function parameter posteriors. This paper simultaneously
addresses these, using a variational approximation to the posterior which is
sparse in support of the function but otherwise free-form. The result is a
Hybrid Monte-Carlo sampling scheme which allows for a non-Gaussian
approximation over the function values and covariance parameters
simultaneously, with efficient computations based on inducing-point sparse
GPs. Code to replicate each experiment in this paper will be available
shortly.

Yingzhen Li, José Miguel Hernández-Lobato, and Richard E. Turner.
**Stochastic Expectation
Propagation**.
In *Advances in Neural Information Processing Systems 28*, Montréal
CANADA, Dec 2015.

** Abstract:** Expectation propagation (EP)
is a deterministic approximation algorithm that is often used to perform
approximate Bayesian parameter learning. EP approximates the full intractable
posterior distribution through a set of local-approximations that are
iteratively refined for each datapoint. EP can offer analytic and
computational advantages over other approximations, such as Variational
Inference (VI), and is the method of choice for a number of models. The local
nature of EP appears to make it an ideal candidate for performing Bayesian
learning on large models in large-scale dataset settings. However, EP has a
crucial limitation in this context: the number of approximating factors needs to
increase with the number of data-points, N, which often entails a
prohibitively large memory overhead. This paper presents an extension to EP,
called stochastic expectation propagation (SEP), that maintains a global
posterior approximation (like VI) but updates it in a local way (like EP).
Experiments on a number of canonical learning problems using synthetic and
real-world datasets indicate that SEP performs almost as well as full EP, but
reduces the memory consumption by a factor of N. SEP is therefore ideally
suited to performing approximate Bayesian learning in the large model, large
dataset setting.

Felipe Tobar, Thang D. Bui, and Richard E. Turner.
**Learning
stationary time series using Gaussian processes with nonparametric
kernels**.
In *Advances in Neural Information Processing Systems 28*, Montréal
CANADA, Dec 2015.

** Abstract:** We introduce the Gaussian
Process Convolution Model (GPCM), a two-stage nonparametric generative
procedure to model stationary signals as the convolution between a
continuous-time white-noise process and a continuous-time linear filter drawn
from a Gaussian process. The GPCM is a continuous-time nonparametric-window
moving average process and, conditionally, is itself a Gaussian process with
a nonparametric kernel defined in a probabilistic fashion. The generative
model can be equivalently considered in the frequency domain, where the power
spectral density of the signal is specified using a Gaussian process. One of
the main contributions of the paper is to develop a novel variational
free-energy approach based on inter-domain inducing variables that efficiently
learns the continuous-time linear filter and infers the driving white-noise
process. In turn, this scheme provides closed-form probabilistic estimates of
the covariance kernel and the noise-free signal both in denoising and
prediction scenarios. Additionally, the variational inference procedure
provides closed-form expressions for the approximate posterior of the
spectral density given the observed data, leading to new Bayesian
nonparametric approaches to spectrum estimation. The proposed GPCM is
validated using synthetic and real-world signals.

Adrian Weller.
**Bethe and
related pairwise entropy approximations**.
In *31st Conference on Uncertainty in Artificial Intelligence*, pages
942-951, Amsterdam, July 2015.

** Abstract:** For undirected
graphical models, belief propagation often performs remarkably well for
approximate marginal inference, and may be viewed as a heuristic to minimize
the Bethe free energy. Focusing on binary pairwise models, we demonstrate
that several recent results on the Bethe approximation may be generalized to
a broad family of related pairwise free energy approximations with arbitrary
counting numbers. We explore approximation error and shed light on the
empirical success of the Bethe approximation.

** Comment:** Supplementary Material

James Hensman, Alexander G D G Matthews, and Zoubin Ghahramani.
**Scalable
Variational Gaussian Process Classification**.
In *18th International Conference on Artificial Intelligence and
Statistics*, pages 1-9, San Diego, California, USA, May 2015.

** Abstract:** Gaussian process classification is a popular
method with a number of appealing properties. We show how to scale the model
within a variational inducing point framework, out-performing the state of
the art on benchmark datasets. Importantly, the variational formulation can be
exploited to allow classification in problems with millions of data points,
as we demonstrate in experiments.

Nilesh Tripuraneni, Shixiang Gu, Hong Ge, and Zoubin Ghahramani.
**Particle
Gibbs for infinite hidden Markov models**.
In *Advances in Neural Information Processing Systems 28*, Montreal
CANADA, Dec 2015.

** Abstract:** Infinite Hidden Markov Models
(iHMMs) are an attractive, nonparametric generalization of the classical
Hidden Markov Model which can automatically infer the number of hidden states
in the system. However, due to the infinite-dimensional nature of the
transition dynamics, performing inference in the iHMM is difficult. In this
paper, we present an infinite-state Particle Gibbs (PG) algorithm to
resample state trajectories for the iHMM. The proposed algorithm uses an
efficient proposal optimized for iHMMs and leverages ancestor sampling to
improve the mixing of the standard PG algorithm. Our algorithm demonstrates
significant convergence improvements on synthetic and real world data
sets.

Roger Frigola.
**Bayesian time series
learning with Gaussian processes**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2015.

** Abstract:** The analysis of time series data is
important in fields as disparate as the social sciences, biology, engineering
or econometrics. In this dissertation, we present a number of algorithms
designed to learn Bayesian nonparametric models of time series. The goal of
these kinds of models is twofold. First, they aim at making predictions which
quantify the uncertainty due to limitations in the quantity and the quality
of the data. Second, they are flexible enough to model highly complex data
whilst preventing overfitting when the data does not warrant complex
models.

We begin with a unifying literature review on time series models
based on Gaussian processes. Then, we centre our attention on the Gaussian
Process State-Space Model (GP-SSM): a Bayesian nonparametric generalisation
of discrete-time nonlinear state-space models. We present a novel formulation
of the GP-SSM that offers new insights into its properties. We then proceed
to exploit those insights by developing new learning algorithms for the
GP-SSM based on particle Markov chain Monte Carlo and variational
inference.

Finally, we present a filtered nonlinear auto-regressive
model with a simple, robust and fast learning algorithm that makes it well
suited to its application by non-experts on large datasets. Its main
advantage is that it avoids the computationally expensive (and potentially
difficult to tune) smoothing step that is a key part of learning nonlinear
state-space models.

Yarin Gal, Yutian Chen, and Zoubin Ghahramani.
**Latent
Gaussian processes for distribution estimation of multivariate categorical
data**.
In *Proceedings of the 32nd International Conference on Machine Learning
(ICML-15)*, pages 645-654, 2015.

** Abstract:**
Multivariate categorical data occur in many applications of machine learning.
One of the main difficulties with these vectors of categorical variables is
sparsity. The number of possible observations grows exponentially with vector
length, but dataset diversity might be poor in comparison. Recent models have
gained significant improvement in supervised tasks with this data. These
models embed observations in a continuous space to capture similarities
between them. Building on these ideas we propose a Bayesian model for the
unsupervised task of distribution estimation of multivariate categorical
data. We model vectors of categorical variables as generated from a
non-linear transformation of a continuous latent space. Non-linearity
captures multi-modality in the distribution. The continuous representation
addresses sparsity. Our model ties together many existing models, linking the
linear categorical latent Gaussian model, the Gaussian process latent
variable model, and Gaussian process classification. We derive inference for
our model based on recent developments in sampling based variational
inference. We show empirically that the model outperforms its linear and
discrete counterparts in imputation tasks of sparse data.

Hong Ge, Yutian Chen, Moquan Wan, and Zoubin Ghahramani.
**Distributed
Inference for Dirichlet Process Mixture Models**.
37:2276-2284, 07-09 Jul 2015.

** Abstract:** Bayesian
nonparametric mixture models based on the Dirichlet process (DP) have been
widely used for solving problems like clustering, density estimation and
topic modelling. These models make weak assumptions about the underlying
process that generated the observed data. Thus, when more data are collected,
the complexity of these models can change accordingly. These theoretical
properties often lead to superior predictive performance when compared to
traditional finite mixture models. However, despite the increasing amount of
data available, the application of Bayesian nonparametric mixture models is
so far limited to relatively small data sets. In this paper, we propose an
efficient distributed inference algorithm for the DP and the HDP mixture
model. The proposed method is based on a variant of the slice sampler for
DPs. Since this sampler does not involve a pre-determined truncation, the
stationary distribution of the sampling algorithm is unbiased. We provide
both local thread-level and distributed machine-level parallel
implementations and study the performance of this sampler through an
extensive set of experiments on image and text data. When compared to
existing inference algorithms, the proposed method exhibits state-of-the-art
accuracy and strong scalability with up to 512 cores.

Yarin Gal and Richard Turner.
**Improving the
Gaussian process sparse spectrum approximation by representing uncertainty
in frequency inputs**.
In *Proceedings of the 32nd International Conference on Machine Learning
(ICML-15)*, pages 655-664, 2015.

** Abstract:** Standard
sparse pseudo-input approximations to the Gaussian process (GP) cannot handle
complex functions well. Sparse spectrum alternatives attempt to answer this
but are known to over-fit. We suggest the use of variational inference for
the sparse spectrum approximation to avoid both issues. We model the
covariance function with a finite Fourier series approximation and treat it
as a random variable. The random covariance function has a posterior, on
which a variational distribution is placed. The variational distribution
transforms the random covariance function to fit the data. We study the
properties of our approximate inference, compare it to alternative ones, and
extend it to the distributed and stochastic domains. Our approximation
captures complex functions better than standard approaches and avoids
over-fitting.
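
As a rough illustration of the "finite Fourier series approximation" of a covariance function mentioned in the abstract above, the sketch below approximates a squared-exponential kernel with randomly sampled frequencies. It is a minimal sketch under stated assumptions, not the authors' method: the frequencies are fixed samples rather than random variables with a variational posterior, inputs are scalars, and all function names (`rbf_kernel`, `sample_spectrum`, `fourier_features`, `approx_kernel`) are hypothetical.

```python
import math
import random


def rbf_kernel(x, y, lengthscale=1.0):
    """Exact squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((x - y) / lengthscale) ** 2)


def sample_spectrum(num_freqs, lengthscale=1.0, seed=0):
    """Draw frequencies from the kernel's spectral density, N(0, 1/lengthscale^2)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0 / lengthscale) for _ in range(num_freqs)]


def fourier_features(x, freqs):
    """Map a scalar input to 2 * len(freqs) cosine and sine features."""
    scale = math.sqrt(1.0 / len(freqs))
    feats = []
    for w in freqs:
        feats.append(scale * math.cos(w * x))
        feats.append(scale * math.sin(w * x))
    return feats


def approx_kernel(x, y, freqs):
    """Inner product of features; equals the average of cos(w * (x - y))."""
    fx = fourier_features(x, freqs)
    fy = fourier_features(y, freqs)
    return sum(a * b for a, b in zip(fx, fy))
```

Calling `approx_kernel(0.0, 1.0, sample_spectrum(2000))` should land close to `rbf_kernel(0.0, 1.0)`, and the gap shrinks as the number of frequencies grows. With few frequencies optimised directly against the data, this kind of approximation can over-fit, which is the issue the abstract says the variational treatment is meant to address.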

José Miguel Hernández-Lobato, Matthew W. Hoffman, and Zoubin Ghahramani.
**Predictive
entropy search for efficient global optimization of black-box
functions**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2014.

** Abstract:** We propose a
novel information-theoretic approach for Bayesian optimization called
Predictive Entropy Search (PES). At each iteration, PES selects the next
evaluation point that maximizes the expected information gained with respect
to the global maximum. PES codifies this intractable acquisition function in
terms of the expected reduction in the differential entropy of the predictive
distribution. This reformulation allows PES to obtain approximations that are
both more accurate and efficient than other alternatives such as Entropy
Search (ES). Furthermore, PES can easily perform a fully Bayesian treatment
of the model hyperparameters while ES cannot. We evaluate PES in both
synthetic and real-world applications, including optimization problems in
machine learning, finance, biotechnology, and robotics. We show that the
increased accuracy of PES leads to significant gains in optimization
performance.

Alexander G. de G. Matthews, James Hensman, and Zoubin Ghahramani.
**Comparing
lower bounds on the entropy of mixture distributions for use in variational
inference**.
In *NIPS workshop on Advances in Variational Inference*,
Montreal, Canada, December 2014.

Yue Wu, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Gaussian
process volatility model**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2014.

** Abstract:** The prediction
of time-changing variances is an important task in the modeling of financial
data. Standard econometric models are often limited as they assume rigid
functional relationships for the evolution of the variance. Moreover,
functional parameters are usually learned by maximum likelihood, which can
lead to overfitting. To address these problems we introduce GP-Vol, a novel
non-parametric model for time-changing variances based on Gaussian Processes.
This new model can capture highly flexible functional relationships for the
variances. Furthermore, we introduce a new online algorithm for fast
inference in GP-Vol. This method is much faster than current offline
inference procedures and it avoids overfitting problems by following a fully
Bayesian approach. Experiments with financial data show that GP-Vol performs
significantly better than current standard alternatives.

Anoop Korattikara, Yutian Chen, and Max Welling.
**Austerity
in MCMC land: Cutting the Metropolis-Hastings budget**.
In *31st International Conference on Machine Learning*, pages 181-189,
Beijing, China, June 2014.

** Abstract:** Can we make Bayesian
posterior MCMC sampling more efficient when faced with very large datasets?
We argue that computing the likelihood for N datapoints in the
Metropolis-Hastings (MH) test to reach a single binary decision is
computationally inefficient. We introduce an approximate MH rule based on a
sequential hypothesis test that allows us to accept or reject samples with
high confidence using only a fraction of the data required for the exact MH
rule. While this method introduces an asymptotic bias, we show that this bias
can be controlled and is more than offset by a decrease in variance due to
our ability to draw more samples per unit of time.

** Comment:** supplementary
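
To make concrete the cost that the abstract above refers to, here is a minimal sketch of the standard (exact) Metropolis-Hastings step, in which every datapoint contributes to a single accept/reject decision. It does not implement the approximate sequential test proposed in the paper. The unit-variance Gaussian likelihood, flat prior, symmetric random-walk proposal, and the names `log_lik` and `mh_step` are all assumptions made for illustration.

```python
import math
import random


def log_lik(x, theta):
    """Log density of x under a unit-variance Gaussian with mean theta (assumed model)."""
    return -0.5 * (x - theta) ** 2 - 0.5 * math.log(2.0 * math.pi)


def mh_step(theta, data, rng, step=0.3):
    """One exact MH step with a flat prior and a symmetric random-walk proposal."""
    proposal = theta + rng.gauss(0.0, step)
    # The O(N) part: every datapoint is visited to reach one binary decision.
    log_ratio = sum(log_lik(x, proposal) - log_lik(x, theta) for x in data)
    # 1.0 - rng.random() lies in (0, 1], so the logarithm is always defined.
    if math.log(1.0 - rng.random()) < log_ratio:
        return proposal, True
    return theta, False
```

Repeatedly calling `mh_step` with a shared `random.Random` instance yields a chain whose samples concentrate around the mean of `data`. The sum over `data` is the part the paper proposes to replace with a test on a subset of the datapoints.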

Thang D. Bui and Richard E. Turner.
**Tree-structured Gaussian process approximations**.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 28*,
volume 28, pages 2213-2221. Curran Associates, Inc., 2014.

** Abstract:** Gaussian process regression can be accelerated by
constructing a small pseudo-dataset to summarize the observed data. This idea
sits at the heart of many approximation schemes, but such an approach
requires the number of pseudo-datapoints to be scaled with the range of the
input space if the accuracy of the approximation is to be maintained. This
presents problems in time-series settings or in spatial datasets where large
numbers of pseudo-datapoints are required since computation typically scales
quadratically with the pseudo-dataset size. In this paper we devise an
approximation whose complexity grows linearly with the number of
pseudo-datapoints. This is achieved by imposing a tree or chain structure on
the pseudo-datapoints and calibrating the approximation using a
Kullback-Leibler (KL) minimization. Inference and learning can then be
performed efficiently using the Gaussian belief propagation algorithm. We
demonstrate the validity of our approach on a set of challenging regression
tasks including missing data imputation for audio and spatial datasets. We
trace out the speed-accuracy trade-off for the new method and show that the
frontier dominates those obtained from a large number of existing
approximation techniques.

Neil Houlsby and Massimiliano Ciaramita.
**A
scalable Gibbs sampler for probabilistic entity linking**.
In *36th European Conference on Information Retrieval*, pages 335-346.
Springer, 2014.

** Abstract:** Entity linking involves
labeling phrases in text with their referent entities, such as Wikipedia or
Freebase entries. This task is challenging due to the large number of
possible entities, in the millions, and heavy-tailed mention ambiguity. We
formulate the problem in terms of probabilistic inference within a topic
model, where each topic is associated with a Wikipedia article. To deal with
the large number of topics we propose a novel efficient Gibbs sampling scheme
which can also incorporate side information, such as the Wikipedia graph.
This conceptually simple probabilistic approach achieves state-of-the-art
performance in entity-linking on the Aida-CoNLL dataset.

Adrian Weller and Tony Jebara.
**Clamping
variables and approximate inference**.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 28*, pages
909-917. Curran Associates, Inc., 2014.

** Abstract:** It was
recently proved using graph covers (Ruozzi, 2012) that the Bethe partition
function is upper bounded by the true partition function for a binary
pairwise model that is attractive. Here we provide a new, arguably simpler
proof from first principles. We make use of the idea of clamping a variable
to a particular value. For an attractive model, we show that summing over the
Bethe partition functions for each sub-model obtained after clamping any
variable can only raise (and hence improve) the approximation. In fact, we
derive a stronger result that may have other useful implications. Repeatedly
clamping until we obtain a model with no cycles, where the Bethe
approximation is exact, yields the result. We also provide a related lower
bound on a broad class of approximate partition functions of general pairwise
multi-label models that depends only on the topology. We demonstrate that
clamping a few wisely chosen variables can be of practical value by
dramatically reducing approximation error.

** Comment:** Supplementary
Material

Daniel Hernández-Lobato and José Miguel Hernández-Lobato.
**Learning
feature selection dependencies in multi-task learning**.
In *Advances in Neural Information Processing Systems 27*, pages 1-9,
Lake Tahoe, California, USA, December 2013.

** Abstract:** A
probabilistic model based on the horseshoe prior is proposed for learning
dependencies in the process of identifying relevant features for prediction.
Exact inference is intractable in this model. However, expectation
propagation offers an approximate alternative. Because the process of
estimating feature selection dependencies may suffer from over-fitting in the
model proposed, additional data from a multi-task learning scenario are
considered for induction. The same model can be used in this setting with few
modifications. Furthermore, the assumptions made are less restrictive than in
other multi-task methods: The different tasks must share feature selection
dependencies, but can have different relevant features and model
coefficients. Experiments with real and synthetic data show that this model
performs better than other multi-task alternatives from the literature. The
experiments also show that the model is able to induce suitable feature
selection dependencies for the problems considered, only from the training
data.

José Miguel Hernández-Lobato, James Robert Lloyd, and Daniel
Hernández-Lobato.
**Gaussian process
conditional copulas with applications to financial time series**.
In *Advances in Neural Information Processing Systems 27*, pages 1-9,
Lake Tahoe, California, USA, December 2013.

** Abstract:** The
estimation of dependencies between multiple variables is a central problem in
the analysis of financial time series. A common approach is to express these
dependencies in terms of a copula function. Typically the copula function is
assumed to be constant but this may be inaccurate when there are covariates
that could have a large influence on the dependence structure of the data. To
account for this, a Bayesian framework for the estimation of conditional
copulas is proposed. In this framework the parameters of a copula are
non-linearly related to some arbitrary conditioning variables. We evaluate
the ability of our method to predict time-varying dependencies on several
equities and currencies and observe consistent performance gains compared to
static copula models and other time-varying copula methods.

Daniel Hernández-Lobato, José Miguel Hernández-Lobato, and Pierre Dupont.
**Generalized
spike-and-slab priors for Bayesian group feature selection using
expectation propagation**.
*Journal of Machine Learning Research*, 14:1891-1945, July 2013.

** Abstract:** We describe a Bayesian method for group feature
selection in linear regression problems. The method is based on a generalized
version of the standard spike-and-slab prior distribution which is often used
for individual feature selection. Exact Bayesian inference under the prior
considered is infeasible for typical regression problems. However,
approximate inference can be carried out efficiently using Expectation
Propagation (EP). A detailed analysis of the generalized spike-and-slab
prior shows that it is well suited for regression problems that are sparse at
the group level. Furthermore, this prior can be used to introduce prior
knowledge about specific groups of features that are a priori believed to be
more relevant. An experimental evaluation compares the performance of the
proposed method with those of group LASSO, Bayesian group LASSO,
automatic relevance determination and additional variants used for group
feature selection. The results of these experiments show that a model based
on the generalized spike-and-slab prior and the EP algorithm has
state-of-the-art prediction performance in the problems analyzed.
Furthermore, this model is also very useful to carry out sequential
experimental design (also known as active learning), where the data instances
that are most informative are iteratively included in the training set,
reducing the number of instances needed to obtain a particular level of
prediction accuracy.

David Lopez-Paz, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Gaussian
process vine copulas for multivariate dependence**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Copulas make it possible to learn
marginal distributions separately from the multivariate dependence structure
(copula) that links them together into a density function. Vine
factorizations ease the learning of high-dimensional copulas by constructing
a hierarchy of conditional bivariate copulas. However, to simplify inference,
it is common to assume that each of these conditional bivariate copulas is
independent from its conditioning variables. In this paper, we relax this
assumption by discovering the latent functions that specify the shape of a
conditional copula given its conditioning variables. We learn these functions
by following a Bayesian approach based on sparse Gaussian processes with
expectation propagation for scalable, approximate inference. Experiments on
real-world datasets show that, when modeling all conditional dependencies, we
obtain better estimates of the underlying copula of the data.

E. Gilboa, Yunus Saatçi, and John P. Cunningham.
**Scaling
multidimensional Gaussian processes using projected additive
approximations**.
In *30th International Conference on Machine Learning*, 2013.

** Abstract:** Exact Gaussian Process (GP) regression has
O(N^{3}) runtime for data size N, making it intractable for large N.
Many algorithms for improving GP scaling approximate the covariance with
lower rank matrices. Other work has exploited structure inherent in
particular covariance functions, including GPs with implied Markov structure,
and equispaced inputs (both enable O(N) runtime). However, these GP advances
have not been extended to the multidimensional input setting, despite the
preponderance of multidimensional applications. This paper introduces and
tests novel extensions of structured GPs to multidimensional inputs. We
present new methods for additive GPs, showing a novel connection between the
classic backfitting method and the Bayesian framework. To achieve optimal
accuracy-complexity tradeoff, we extend this model with a novel variant of
projection pursuit regression. Our primary result – projection pursuit
Gaussian Process Regression – shows orders of magnitude speedup while
preserving high accuracy. The natural second and third steps include
non-Gaussian observations and higher dimensional equispaced grid methods. We
introduce novel techniques to address both of these necessary directions. We
thoroughly illustrate the power of these three advances on several datasets,
achieving close performance to the naive Full GP at orders of magnitude less
cost.
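
For context on the O(N^3) cost of exact GP regression mentioned in the abstract above, the following is a minimal sketch of the "naive Full GP" baseline with a Cholesky factorisation, written with the standard library only. It is not the projected additive method of the paper. The squared-exponential kernel, scalar inputs, fixed hyperparameters, and all function names are assumptions for illustration.

```python
import math


def rbf(x, y, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return variance * math.exp(-0.5 * ((x - y) / lengthscale) ** 2)


def cholesky(a):
    """Lower-triangular factor with low @ low^T = a; this is the O(N^3) step."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def solve_lower(low, b):
    """Forward substitution: solve low @ z = b."""
    z = []
    for i, row in enumerate(low):
        z.append((b[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    return z


def solve_lower_transposed(low, b):
    """Back substitution: solve low^T @ z = b."""
    n = len(low)
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(low[k][i] * z[k] for k in range(i + 1, n))
        z[i] = (b[i] - s) / low[i][i]
    return z


def gp_predict(xs, ys, x_star, noise=1e-2):
    """Posterior mean and variance of f(x_star) given noisy observations (xs, ys)."""
    n = len(xs)
    k = [[rbf(xs[i], xs[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    low = cholesky(k)
    alpha = solve_lower_transposed(low, solve_lower(low, ys))
    k_star = [rbf(x, x_star) for x in xs]
    mean = sum(a * b for a, b in zip(k_star, alpha))
    v = solve_lower(low, k_star)
    var = rbf(x_star, x_star) - sum(t * t for t in v)
    return mean, var
```

For example, `gp_predict([0.0, 1.0, 2.0], [0.0, 1.0, 0.0], 1.0, noise=1e-6)` returns a mean close to the observed value at that input and a small variance, while a far-away test input returns a variance close to the prior variance. The triple loop in `cholesky` is what makes the exact approach intractable for large N.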

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Stochastic
inference for scalable probabilistic modeling of binary matrices**.
In *NIPS Workshop on Randomized Methods for Machine Learning*, 2013.

** Abstract:** Fully observed large binary matrices appear in a
wide variety of contexts. To model them, probabilistic matrix factorization
(PMF) methods are an attractive solution. However, current batch algorithms
for PMF can be inefficient since they need to analyze the entire data matrix
before producing any parameter updates. We derive an efficient stochastic
inference algorithm for PMF models of fully observed binary matrices. Our
method exhibits faster convergence rates than more expensive batch approaches
and has better predictive performance than scalable alternatives. The
proposed method includes new data subsampling strategies which produce large
gains over standard uniform subsampling. We also address the task of
automatically selecting the size of the minibatches of data and we propose an
algorithm that adjusts this hyper-parameter in an online manner.

Colorado Reed and Zoubin Ghahramani.
**Scaling the
Indian buffet process via submodular maximization**.
In *ICML*, volume 28 of *JMLR Proceedings*, pages
1013-1021. JMLR.org, 2013.

** Abstract:** Inference for
latent feature models is inherently difficult as the inference space grows
exponentially with the size of the input data and number of latent features.
In this work, we use Kurihara & Welling (2008)'s maximization-expectation
framework to perform approximate MAP inference for linear-Gaussian latent
feature models with an Indian Buffet Process (IBP) prior. This formulation
yields a submodular function of the features that corresponds to a lower
bound on the model evidence. By adding a constant to this function, we obtain
a nonnegative submodular function that can be maximized via a greedy
algorithm that obtains at least a one-third approximation to the optimal
solution. Our inference method scales linearly with the size of the input
data, and we show the efficacy of our method on the largest datasets
currently analyzed using an IBP model.

**Optimally-weighted herding is
Bayesian quadrature**.
In *28th Conference on Uncertainty in Artificial Intelligence*, pages
377-385, Catalina Island, California, July 2012.

** Abstract:** Herding and kernel herding are deterministic methods of
choosing samples which summarise a probability distribution. A related task
is choosing samples for estimating integrals using Bayesian quadrature. We
show that the criterion minimised when selecting samples in kernel herding is
equivalent to the posterior variance in Bayesian quadrature. We then show
that sequential Bayesian quadrature can be viewed as a weighted version of
kernel herding which achieves performance superior to any other weighted
herding method. We demonstrate empirically a rate of convergence faster than
O(1/N). Our results also imply an upper bound on the empirical error of the
Bayesian quadrature estimate.

Yichuan Zhang, Charles A. Sutton, Amos J. Storkey, and Zoubin Ghahramani.
**Continuous
relaxations for discrete Hamiltonian Monte Carlo**.
In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges,
Léon Bottou, and Kilian Q. Weinberger, editors, *NIPS*, pages
3203-3211, 2012.

** Abstract:** Continuous relaxations play
an important role in discrete optimization, but have not seen much use in
approximate probabilistic inference. Here we show that a general form of the
Gaussian Integral Trick makes it possible to transform a wide class of
discrete variable undirected models into fully continuous systems. The
continuous representation allows the use of gradient-based Hamiltonian Monte
Carlo for inference, results in new ways of estimating normalization
constants (partition functions), and in general opens up a number of new
avenues for inference in difficult discrete systems. We demonstrate some of
these continuous relaxation inference algorithms on a number of illustrative
problems.

Simon Lacoste-Julien, Ferenc Huszár, and Zoubin Ghahramani.
**Approximate
inference for the loss-calibrated Bayesian**.
In Geoff Gordon and David Dunson, editors, *14th International Conference on
Artificial Intelligence and Statistics*, volume 15, pages 416-424,
Fort Lauderdale, FL, USA, April 2011. Journal of Machine Learning Research.

** Abstract:** We consider the problem of approximate inference
in the context of Bayesian decision theory. Traditional approaches focus on
approximating general properties of the posterior, ignoring the decision task
- and associated losses - for which the posterior could be used. We argue
that this can be suboptimal and propose instead to *loss-calibrate* the
approximate inference methods with respect to the decision task at hand. We
present a general framework rooted in Bayesian decision theory to analyze
approximate inference from the perspective of losses, opening up several
research directions. As a first loss-calibrated approximate inference
attempt, we propose an EM-like algorithm on the Bayesian posterior risk and
show how it can improve a standard approach to Gaussian process
classification when losses are asymmetric.

David A. Knowles, Jurgen Van Gael, and Zoubin Ghahramani.
**Message passing
algorithms for the Dirichlet diffusion tree**.
In *28th International Conference on Machine Learning*, 2011.

** Abstract:** We demonstrate efficient approximate inference
for the Dirichlet Diffusion Tree (Neal, 2003), a Bayesian nonparametric prior
over tree structures. Although DDTs provide a powerful and elegant approach
for modeling hierarchies they haven't seen much use to date. One problem is
the computational cost of MCMC inference. We provide the first deterministic
approximate inference methods for DDT models and show excellent performance
compared to the MCMC alternative. We present message passing algorithms to
approximate the Bayesian model evidence for a specific tree. This is used to
drive sequential tree building and greedy search to find optimal tree
structures, corresponding to hierarchical clusterings of the data. We
demonstrate appropriate observation models for continuous and binary data.
The empirical performance of our method is very close to the computationally
expensive MCMC alternative on a density estimation problem, and significantly
outperforms kernel density estimators.

** Comment:** web site

David Sontag and Daniel M. Roy.
**The Complexity of
Inference in Latent Dirichlet Allocation**.
In *Advances in Neural Information Processing Systems 24*, Cambridge,
MA, USA, 2011. The MIT Press.

** Abstract:** We consider the
computational complexity of probabilistic inference in Latent Dirichlet
Allocation (LDA). First, we study the problem of finding the maximum a
posteriori (MAP) assignment of topics to words, where the document's topic
distribution is integrated out. We show that, when the effective number of
topics per document is small, exact inference takes polynomial time. In
contrast, we show that, when a document has a large number of topics, finding
the MAP assignment of topics to words in LDA is NP-hard. Next, we consider
the problem of finding the MAP topic distribution for a document, where the
topic-word assignments are integrated out. We show that this problem is also
NP-hard. Finally, we briefly discuss the problem of sampling from the
posterior, showing that this is NP-hard in one restricted setting, but
leaving open the general question.

Richard E. Turner and Maneesh Sahani.
**Two
problems with variational expectation maximisation for time-series
models**.
In D. Barber, T. Cemgil, and S. Chiappa, editors, *Bayesian Time series
models*, chapter 5, pages 109-130. Cambridge University Press,
2011.

** Abstract:** Variational methods are a key component
of the approximate inference and learning toolbox. These methods fill an
important middle ground, retaining distributional information about
uncertainty in latent variables, unlike maximum a posteriori methods (MAP),
and yet generally requiring less computational time than Markov chain
Monte Carlo methods. In particular the variational Expectation Maximisation (vEM)
and variational Bayes algorithms, both involving variational optimisation of
a free-energy, are widely used in time-series modelling. Here, we investigate
the success of vEM in simple probabilistic time-series models. First we
consider the inference step of vEM, and show that a consequence of the
well-known compactness property of variational inference is a failure to
propagate uncertainty in time, thus limiting the usefulness of the retained
distributional information. In particular, the uncertainty may appear to be
smallest precisely when the approximation is poorest. Second, we consider
parameter learning and analytically reveal systematic biases in the
parameters found by vEM. Surprisingly, simpler variational approximations
(such as mean-field) can lead to less bias than more complicated structured
approximations.

Richard E. Turner and Maneesh Sahani.
**Statistical
inference for single- and multi-band probabilistic amplitude
demodulation**.
In *Proceedings of the IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP)*, pages 5466-5469, 2010.

** Abstract:** Amplitude demodulation is an ill-posed problem and so it is
natural to treat it from a Bayesian viewpoint, inferring the most likely
carrier and envelope under probabilistic constraints. One such treatment is
Probabilistic Amplitude Demodulation (PAD), which, whilst computationally
more intensive than traditional approaches, offers several advantages. Here
we provide methods for estimating the uncertainty in the PAD-derived
envelopes and carriers, and for learning free-parameters like the time-scale
of the envelope. We show how the probabilistic approach can naturally handle
noisy and missing data. Finally, we indicate how to extend the model to
signals which contain multiple modulators and carriers.

Richard E. Turner.
**Statistical
Models for Natural Sounds**.
PhD thesis, Gatsby Computational Neuroscience Unit, UCL, 2010.

** Abstract:** It is important to understand the rich structure of natural
sounds in order to solve important tasks, like automatic speech recognition,
and to understand auditory processing in the brain. This thesis takes a step
in this direction by characterising the statistics of simple natural sounds.
We focus on the statistics because perception often appears to depend on
them, rather than on the raw waveform. For example the perception of auditory
textures, like running water, wind, fire and rain, depends on
summary-statistics, like the rate of falling rain droplets, rather than on
the exact details of the physical source. In order to analyse the statistics
of sounds accurately it is necessary to improve a number of traditional
signal processing methods, including those for amplitude demodulation,
time-frequency analysis, and sub-band demodulation. These estimation tasks
are ill-posed and therefore it is natural to treat them as Bayesian inference
problems. The new probabilistic versions of these methods have several
advantages. For example, they perform more accurately on natural signals and
are more robust to noise, they can also fill-in missing sections of data, and
provide error-bars. Furthermore, free-parameters can be learned from the
signal. Using these new algorithms we demonstrate that the energy, sparsity,
modulation depth and modulation time-scale in each sub-band of a signal are
critical statistics, together with the dependencies between the sub-band
modulators. In order to validate this claim, a model containing co-modulated
coloured noise carriers is shown to be capable of generating a range of
realistic sounding auditory textures. Finally, we explored the connection
between the statistics of natural sounds and perception. We demonstrate that
inference in the model for auditory textures qualitatively replicates the
primitive grouping rules that listeners use to understand simple acoustic
scenes. This suggests that the auditory system is optimised for the
statistics of natural sounds.

Finale Doshi-Velez, David Knowles, Shakir Mohamed, and Zoubin Ghahramani.
**Large scale
non-parametric inference: Data parallelisation in the Indian buffet
process**.
In *Advances in Neural Information Processing Systems 23*, pages
1294-1302, Cambridge, MA, USA, December 2009. The MIT Press.

** Abstract:** Nonparametric Bayesian models provide a framework
for flexible probabilistic modelling of complex datasets. Unfortunately, the
high-dimensional averages required for Bayesian methods can be slow,
especially with the unbounded representations used by nonparametric models.
We address the challenge of scaling Bayesian inference to the increasingly
large datasets found in real-world applications. We focus on parallelisation
of inference in the Indian Buffet Process (IBP), which allows data points to
have an unbounded number of sparse latent features. Our novel MCMC sampler
divides a large data set between multiple processors and uses message passing
to compute the global likelihoods and posteriors. This algorithm, the first
parallel inference scheme for IBP-based models, scales to datasets orders of
magnitude larger than have previously been possible.

Yang Xu, Katherine A. Heller, and Zoubin Ghahramani.
**Tree-based
inference for Dirichlet process mixtures**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 623-630,
Clearwater Beach, FL, USA, April 2009. Microtome Publishing (paper), Journal
of Machine Learning Research (online).
ISSN 1938-7228.

** Abstract:** The Dirichlet process mixture
(DPM) is a widely used model for clustering and for general nonparametric
Bayesian density estimation. Unfortunately, like in many statistical models,
exact inference in a DPM is intractable, and approximate methods are needed
to perform efficient inference. While most attention in the literature has
been placed on Markov chain Monte Carlo (MCMC) [1, 2, 3], variational
Bayesian (VB) [4] and collapsed variational methods [5], [6] recently
introduced a novel class of approximation for DPMs based on Bayesian
hierarchical clustering (BHC). These tree-based combinatorial approximations
efficiently sum over exponentially many ways of partitioning the data and
offer a novel lower bound on the marginal likelihood of the DPM [6]. In this
paper we make the following contributions: (1) We show empirically that the
BHC lower bounds are substantially tighter than the bounds given by VB [4]
and by collapsed variational methods [5] on synthetic and real datasets. (2)
We also show that BHC offers a more accurate predictive performance on these
datasets. (3) We further improve the tree-based lower bounds with an
algorithm that efficiently sums contributions from alternative trees. (4) We
present a fast approximate method for BHC. Our results suggest that our
combinatorial approximate inference methods and lower bounds may be useful
not only in DPMs but in other models as well.

Pietro Berkes, Richard E. Turner, and Maneesh Sahani.
**A
structured model of video reproduces primary visual cortical
organisation**.
*PLoS Computational Biology*, 5(9), September 2009, doi
10.1371/journal.pcbi.1000495.

** Abstract:** The visual
system must learn to infer the presence of objects and features in the world
from the images it encounters, and as such it must, either implicitly or
explicitly, model the way these elements interact to create the image. Do the
response properties of cells in the mammalian visual system reflect this
constraint? To address this question, we constructed a probabilistic model in
which the identity and attributes of simple visual elements were represented
explicitly and learnt the parameters of this model from unparsed, natural
video sequences. After learning, the behaviour and grouping of variables in
the probabilistic model corresponded closely to functional and anatomical
properties of simple and complex cells in the primary visual cortex (V1). In
particular, feature identity variables were activated in a way that resembled
the activity of complex cells, while feature attribute variables responded
much like simple cells. Furthermore, the grouping of the attributes within
the model closely parallelled the reported anatomical grouping of simple
cells in cat V1. Thus, this generative model makes explicit an interpretation
of complex and simple cells as elements in the segmentation of a visual scene
into basic independent features, along with a parametrisation of their
moment-by-moment appearances. We speculate that such a segmentation may form
the initial stage of a hierarchical system that progressively separates the
identity and appearance of more articulated visual elements, culminating in
view-invariant object recognition.

Jörg Lücke, Richard E. Turner, Maneesh Sahani, and Marc Henniges.
**Occlusive
components analysis**.
In Y Bengio, D Schuurmans, J Lafferty, C K I Williams, and A Culotta, editors,
*Advances in Neural Information Processing Systems 22*, pages 1069-1077. The MIT Press, 2009.

** Abstract:**
We study unsupervised learning in a probabilistic generative model for
occlusion. The model uses two types of latent variables: one indicates which
objects are present in the image, and the other how they are ordered in
depth. This depth order then determines how the positions and appearances of
the objects present, specified in the model parameters, combine to form the
image. We show that the object parameters can be learnt from an unlabelled
set of images in which objects occlude one another. Exact maximum-likelihood
learning is intractable. However, we show that tractable approximations to
Expectation Maximization (EM) can be found if the training images each
contain only a small number of objects on average. In numerical experiments
it is shown that these approximations recover the correct set of object
parameters. Experiments on a novel version of the bars test using colored
bars, and experiments on more realistic data, show that the algorithm
performs well in extracting the generating causes. Experiments based on the
standard bars benchmark test for object learning show that the algorithm
performs well in comparison to other recent component extraction approaches.
The model and the learning algorithm thus connect research on occlusion with
the research field of multiple-causes component extraction methods.

J.M. Sung, Z. Ghahramani, and S.Y. Bang.
**Second-order
latent space variational Bayes for approximate Bayesian
inference**.
*IEEE Signal Processing Letters*, 15:918-921, December 2008.

** Abstract:** In this letter, we consider a variational
approximate Bayesian inference framework, latent-space variational Bayes
(LSVB), in the general context of conjugate-exponential family models with
latent variables. In the LSVB approach, we integrate out model parameters in
an exact way and then perform the variational inference over only the latent
variables. It can be shown that LSVB can achieve better estimates of the
model evidence as well as the distribution over the latent variables than the
popular variational Bayesian expectation-maximization (VBEM). However, the
distribution over the latent variables in LSVB has to be approximated in
practice. As an approximate implementation of LSVB, we propose a second-order
LSVB (SoLSVB) method. In particular, VBEM can be derived as a special case of
a first-order approximation in LSVB. SoLSVB can capture higher order
statistics neglected in VBEM and can therefore achieve a better
approximation. Examples of Gaussian mixture models are used to illustrate the
comparison between our method and VBEM, demonstrating the improvement.

J.M. Sung, Z. Ghahramani, and S.Y. Bang.
**Latent space
variational Bayes**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*,
30(12):2236-2242, November 2008.

** Abstract:** Variational
Bayesian Expectation-Maximization (VBEM), an approximate inference method for
probabilistic models based on factorizing over latent variables and model
parameters, has been a standard technique for practical Bayesian inference.
In this paper, we introduce a more general approximate inference framework
for conjugate-exponential family models, which we call Latent-Space
Variational Bayes (LSVB). In this approach, we integrate out the model
parameters in an exact way, leaving only the latent variables. It can be
shown that the LSVB approach gives better estimates of the model evidence as
well as the distribution over the latent variables than the VBEM approach,
but, in practice, the distribution over the latent variables has to be
approximated. As a practical implementation, we present a First-order LSVB
(FoLSVB) algorithm to approximate the distribution over the latent variables.
From this approximate distribution, one can also estimate the model evidence
and the posterior over the model parameters. The FoLSVB algorithm is directly
comparable to the VBEM algorithm and has the same computational complexity.
We discuss how LSVB generalizes the recently proposed collapsed variational
methods to general conjugate-exponential families. Examples based on mixtures
of Gaussians and mixtures of Bernoullis with synthetic and real-world data
sets are used to illustrate some advantages of our method over VBEM.

Hannes Nickisch and Carl Edward Rasmussen.
**Approximations
for binary Gaussian process classification**.
*Journal of Machine Learning Research*, 9:2035-2078, October 2008.

** Abstract:** We provide a comprehensive overview of many
recent algorithms for approximate inference in Gaussian process models for
probabilistic binary classification. The relationships between several
approaches are elucidated theoretically, and the properties of the different
algorithms are corroborated by experimental results. We examine both 1) the
quality of the predictive distributions and 2) the suitability of the
different marginal likelihood approximations for model selection (selecting
hyperparameters) and compare to a gold standard based on MCMC. Interestingly,
some methods produce good predictive distributions although their marginal
likelihood approximations are poor. Strong conclusions are drawn about the
methods: The Expectation Propagation algorithm is almost always the method of
choice unless the computational budget is very tight. We also extend existing
methods in various ways, and provide unifying code implementing all
approaches.

Marc Peter Deisenroth, Jan Peters, and Carl Edward Rasmussen.
**Approximate
dynamic programming with Gaussian processes**.
In *2008 American Control Conference (ACC 2008)*, pages 4480-4485,
Seattle, WA, USA, June 2008.

** Abstract:** In general, it is
difficult to determine an optimal closed-loop policy in nonlinear control
problems with continuous-valued state and control domains. Hence,
approximations are often inevitable. The standard method of discretizing
states and controls suffers from the curse of dimensionality and strongly
depends on the chosen temporal sampling rate. The paper introduces Gaussian
Process Dynamic Programming (GPDP). In GPDP, value functions in the Bellman
recursion of the dynamic programming algorithm are modeled using Gaussian
processes. GPDP returns an optimal state-feedback for a finite set of states.
Based on these outcomes, we learn a possibly discontinuous closed-loop policy
on the entire state space by switching between two independently trained
Gaussian processes.

** Comment:** code.

Pietro Berkes, Richard E. Turner, and Maneesh Sahani.
**On
sparsity and overcompleteness in image models**.
In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, *Advances in
Neural Information Processing Systems 20*, volume 20. The MIT Press, 2008.

** Abstract:** Computational models
of visual cortex, and in particular those based on sparse coding, have
enjoyed much recent attention. Despite this currency, the question of how
sparse or how over-complete a sparse representation should be, has gone
without principled answer. Here, we use Bayesian model-selection methods to
address these questions for a sparse-coding model based on a Student-t prior.
Having validated our methods on toy data, we find that natural images are
indeed best modelled by extremely sparse distributions; although for the
Student-t prior, the associated optimal basis size is only modestly
over-complete.

J. P. Cunningham.
**Derivation
of Expectation Propagation for "fast Gaussian process methods for point
process intensity estimation"**.
Technical report, Stanford University, 2008.

** Abstract:** We
derive the Expectation Propagation algorithm updates for approximating the
posterior distribution on intensity in a conditionally inhomogeneous gamma
interval process with a Gaussian Process prior (GP IGIP), a model which
appeared in Cunningham, Shenoy, Sahani (2008) ICML.

Richard E. Turner and Maneesh Sahani.
**Modeling
natural sounds with modulation cascade processes**.
In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, *Advances in
Neural Information Processing Systems 20*, volume 20. The MIT Press, 2008.

** Abstract:** Natural sounds are
structured on many time-scales. A typical segment of speech, for example,
contains features that span four orders of magnitude: Sentences ($\sim 1$ s);
phonemes ($\sim 10^{-1}$ s); glottal pulses ($\sim 10^{-2}$ s); and formants
($\sim 10^{-3}$ s). The auditory system uses information from each of these
time-scales to solve complicated tasks such as auditory scene analysis [1].
One route toward understanding how auditory processing accomplishes this
analysis is to build neuroscience-inspired algorithms which solve similar
tasks and to compare the properties of these algorithms with properties of
auditory processing. There is however a discord: Current machine-audition
algorithms largely concentrate on the shorter time-scale structures in
sounds, and the longer structures are ignored. The reason for this is
two-fold. Firstly, it is a difficult technical problem to construct an
algorithm that utilises both sorts of information. Secondly, it is
computationally demanding to simultaneously process data both at high
resolution (to extract short temporal information) and for long duration (to
extract long temporal information). The contribution of this work is to
develop a new statistical model for natural sounds that captures structure
across a wide range of time-scales, and to provide efficient learning and
inference algorithms. We demonstrate the success of this approach on a
missing data task.

Richard E. Turner and Maneesh Sahani.
**Probabilistic
amplitude demodulation**.
In *7th International Conference on Independent Component Analysis and
Signal Separation*, pages 544-551, 2007.

** Abstract:**
Auditory scene analysis is extremely challenging. One approach, perhaps that
adopted by the brain, is to shape useful representations of sounds on prior
knowledge about their statistical structure. For example, sounds with
harmonic sections are common and so time-frequency representations are
efficient. Most current representations concentrate on the shorter
components. Here, we propose representations for structures on longer
time-scales, like the phonemes and sentences of speech. We decompose a sound
into a product of processes, each with its own characteristic time-scale.
This demodulation cascade relates to classical amplitude demodulation, but
traditional algorithms fail to realise the representation fully. A new
approach, probabilistic amplitude demodulation, is shown to out-perform the
established methods, and to easily extend to representation of a full
demodulation cascade.
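
As a point of reference for the classical baseline mentioned in this abstract, the following is a minimal sketch of Hilbert-envelope amplitude demodulation on a synthetic signal. It is not the probabilistic method of the paper. The helper name `hilbert_envelope`, the sampling rate, and the 4 Hz and 440 Hz test frequencies are all illustrative assumptions.

```python
import numpy as np


def hilbert_envelope(signal):
    """Classical envelope estimate: magnitude of the analytic signal.

    Hypothetical helper for illustration only; it is not taken from the paper.
    """
    n = len(signal)
    spectrum = np.fft.fft(signal)
    # Keep DC (and Nyquist for even n), double the positive frequencies,
    # and zero the negative ones to obtain the analytic signal.
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(spectrum * weights))


fs = 8000                                           # assumed sampling rate in Hz
t = np.arange(fs) / fs                              # one second of samples
modulator = 1.0 + 0.5 * np.sin(2 * np.pi * 4 * t)   # slow, strictly positive
carrier = np.sin(2 * np.pi * 440 * t)               # fast sinusoidal carrier
sound = modulator * carrier                         # product of two time-scales

envelope = hilbert_envelope(sound)
max_error = np.max(np.abs(envelope - modulator))
print(f"largest envelope error: {max_error:.2e}")
```

In this idealised case (a narrow-band carrier and a slow modulator, both periodic in the analysis window) the envelope matches the modulator closely. Natural sounds do not satisfy these assumptions, which is the regime the abstract says the classical algorithms handle poorly.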

Edward Snelson and Zoubin Ghahramani.
**Compact
approximations to Bayesian predictive distributions**.
In *22nd International Conference on Machine Learning*, Bonn, Germany,
August 2005. Omnipress.

** Abstract:** We provide a general
framework for learning precise, compact, and fast representations of the
Bayesian predictive distribution for a model. This framework is based on
minimizing the KL divergence between the true predictive density and a
suitable compact approximation. We consider various methods for doing this,
both sampling based approximations, and deterministic approximations such as
expectation propagation. These methods are tested on a mixture of Gaussians
model for density estimation and on binary linear classification, with both
synthetic data sets for visualization and several real data sets. Our results
show significant reductions in prediction time and memory footprint.

Malte Kuß and Carl Edward Rasmussen.
**Assessing
approximate inference for binary Gaussian process classification**.
*Journal of Machine Learning Research*, 6:1679-1704, 2005.

** Abstract:** Gaussian process priors can be used to define
flexible, probabilistic classification models. Unfortunately exact Bayesian
inference is analytically intractable and various approximation techniques
have been proposed. In this work we review and compare Laplace's method and
Expectation Propagation for approximate Bayesian inference in the binary
Gaussian process classification model. We present a comprehensive comparison
of the approximations, their predictive performance and marginal likelihood
estimates to results obtained by MCMC sampling. We explain theoretically and
corroborate empirically the advantages of Expectation Propagation compared to
Laplace's method.

Joaquin Quiñonero-Candela and Carl Edward Rasmussen.
**A
unifying view of sparse approximate Gaussian process regression**.
*Journal of Machine Learning Research*, 6:1939-1959, 2005.

** Abstract:** We provide a new unifying view, including all
existing proper probabilistic sparse approximations for Gaussian process
regression. Our approach relies on expressing the effective prior which the
methods are using. This allows new insights to be gained, and highlights the
relationship between existing methods. It also allows for a clear
theoretically justified ranking of the closeness of the known approximations
to the corresponding full GPs. Finally we point directly to designs of new
better sparse approximations, combining the best of the existing strategies,
within attractive computational constraints.

Yuan (Alan) Qi, Thomas P. Minka, Rosalind W. Picard, and Zoubin Ghahramani.
**Predictive
automatic relevance determination by expectation propagation**.
In Carla E. Brodley, editor, *ICML*, volume 69 of *ACM
International Conference Proceeding Series*. Association for Computing
Machinery, 2004.

** Abstract:** In many real-world
classification problems the input contains a large number of potentially
irrelevant features. This paper proposes a new Bayesian framework for
determining the relevance of input features. This approach extends one of the
most successful Bayesian methods for feature selection and sparse learning,
known as Automatic Relevance Determination (ARD). ARD finds the relevance of
features by optimizing the model marginal likelihood, also known as the
evidence. We show that this can lead to overfitting. To address this problem,
we propose Predictive ARD based on estimating the predictive performance of
the classifier. While the actual leave-one-out predictive performance is
generally very costly to compute, the expectation propagation (EP) algorithm
proposed by Minka provides an estimate of this predictive performance as a
side-effect of its iterations. We exploit this in our algorithm to do feature
selection, and to select data points in a sparse Bayesian kernel classifier.
Moreover, we provide two other improvements to previous algorithms, by
replacing Laplace's approximation with the generally more accurate EP, and by
incorporating the fast optimization algorithm proposed by Faul and Tipping.
Our experiments show that our method based on the EP estimate of predictive
performance is more accurate on test data than relevance determination by
optimizing the evidence.

Ruslan Salakhutdinov, Sam T. Roweis, and Zoubin Ghahramani.
**On the
convergence of bound optimization algorithms**.
In Christopher Meek and Uffe Kjærulff, editors, *UAI*, pages
509-516. Morgan Kaufmann, 2003.

** Abstract:** Many
practitioners who use EM and related algorithms complain that they are
sometimes slow. When does this happen, and what can be done about it? In this
paper, we study the general class of bound optimization algorithms -
including EM, Iterative Scaling, Non-negative Matrix Factorization, CCCP -
and their relationship to direct optimization algorithms such as
gradient-based methods for parameter learning. We derive a general
relationship between the updates performed by bound optimization methods and
those of gradient and second-order methods and identify analytic conditions
under which bound optimization algorithms exhibit quasi-Newton behavior, and
under which they possess poor, first-order convergence. Based on this
analysis, we consider several specific algorithms, interpret and analyze
their convergence properties and provide some recipes for preprocessing input
to these algorithms to yield faster convergence behavior. We report empirical
results supporting our analysis and showing that simple data preprocessing
can result in dramatically improved performance of bound optimizers in
practice.

Ruslan Salakhutdinov, Sam T. Roweis, and Zoubin Ghahramani.
**Optimization
with EM and expectation-conjugate-gradient**.
In Tom Fawcett and Nina Mishra, editors, *ICML*, pages 672-679. AAAI
Press, 2003.

** Abstract:** We show a close relationship
between the Expectation-Maximization (EM) algorithm and direct optimization
algorithms such as gradient-based methods for parameter learning. We identify
analytic conditions under which EM exhibits Newton-like behavior, and
conditions under which it possesses poor, first-order convergence. Based on
this analysis, we propose two novel algorithms for maximum likelihood
estimation of latent variable models, and report empirical results showing
that, as predicted by theory, the proposed new algorithms can substantially
outperform standard EM in terms of speed of convergence in certain cases.

Zoubin Ghahramani and Matthew J. Beal.
**Propagation
algorithms for variational Bayesian learning**.
In Michael J. Kearns, Sara A. Solla, and David A. Cohn, editors, *NIPS*,
pages 507-513. The MIT Press, 2000.

** Abstract:**
Variational approximations are becoming a widespread tool for Bayesian
learning of graphical models. We provide some theoretical results for the
variational updates in a very general family of conjugate-exponential
graphical models. We show how the belief propagation and the junction tree
algorithms can be used in the inference step of variational Bayesian
learning. Applying these results to the Bayesian analysis of linear-Gaussian
state-space models we obtain a learning procedure that exploits the Kalman
smoothing propagation, while integrating over all model parameters. We
demonstrate how this can be used to infer the hidden state dimensionality of
the state-space model in a variety of synthetic problems and one real
high-dimensional data set.

Zoubin Ghahramani and Geoffrey E. Hinton.
**Variational
learning for switching state-space models**.
*Neural Computation*, 12(4):831-864, 2000.

** Abstract:** We introduce a new statistical model for time series which
iteratively segments data into regimes with approximately linear dynamics and
learns the parameters of each of these linear regimes. This model combines
and generalizes two of the most widely used stochastic time series
models (hidden Markov models and linear dynamical systems) and is closely
related to models that are widely used in the control and econometrics
literatures. It can also be derived by extending the mixture of experts
neural network (Jacobs et al., 1991) to its fully dynamical version, in which
both expert and gating networks are recurrent. Inferring the posterior
probabilities of the hidden states of this model is computationally
intractable, and therefore the exact Expectation Maximization (EM) algorithm
cannot be applied. However, we present a variational approximation that
maximizes a lower bound on the log likelihood and makes use of both the
forward-backward recursions for hidden Markov models and the Kalman filter
recursions for linear dynamical systems. We tested the algorithm both on
artificial data sets and on a natural data set of respiration force from a
patient with sleep apnea. The results suggest that variational approximations
are a viable method for inference and learning in switching state-space
models.

Michael I. Jordan, Zoubin Ghahramani, Tommi Jaakkola, and Lawrence K. Saul.
**An introduction
to variational methods for graphical models**.
*Machine Learning*, 37(2):183-233, 1999.

** Abstract:**
This paper presents a tutorial introduction to the use of variational methods
for inference and learning in graphical models (Bayesian networks and Markov
random fields). We present a number of examples of graphical models,
including the QMR-DT database, the sigmoid belief network, the Boltzmann
machine, and several variants of hidden Markov models, in which it is
infeasible to run exact inference algorithms. We then introduce variational
methods, which exploit laws of large numbers to transform the original
graphical model into a simplified graphical model in which inference is
efficient. Inference in the simplified model provides bounds on probabilities
of interest in the original model. We describe a general framework for
generating variational transformations based on convex duality. Finally we
return to the examples and demonstrate how variational algorithms can be
formulated in each case.

Lars Kai Hansen and Carl Edward Rasmussen.
**Pruning from
adaptive regularization**.
*Neural Computation*, 6(6):1222-1231, 1994.

** Abstract:** Inspired by the recent upsurge of interest in Bayesian methods
we consider adaptive regularization. A generalization based scheme for
adaptation of regularization parameters is introduced and compared to
Bayesian regularization. We show that pruning arises naturally within both
adaptive regularization schemes. As a model example we have chosen the simplest
possible: estimating the mean of a random variable with known variance.
Marked similarities are found between the two methods in that they both
involve a "noise limit", below which they regularize with infinite weight
decay, i.e., they prune. However, pruning is not always beneficial. We show
explicitly that both methods in some cases may increase the generalization
error. This corresponds to situations where the underlying assumptions of the
regularizer are poorly matched to the environment.

## Bioinformatics

Recent advances in biology have allowed us to collect vast amounts of genetic, proteomic and biomedical data. While this data offers the potential to help us understand the building blocks of life, and to revolutionise medicine, analysing and understanding it poses immense computational and statistical challenges. Our work in Bioinformatics includes modelling protein secondary and tertiary structure, analysis of gene microarray data, protein-protein interactions, and biomarker discovery.

George Nicholson, Marta Blangiardo, Mark Briers, Peter J Diggle, Tor Erlend
Fjelde, Hong Ge, Robert J B Goudie, Radka Jersakova, Ruairidh E King, Brieuc
C L Lehmann, Ann-Marie Mallon, Tullia Padellini, Yee Whye Teh, Chris Holmes,
and Sylvia Richardson.
**Interoperability of
statistical models in pandemic preparedness: principles and reality**.
*Stat. Sci.*, 37(2):183-206, May 2022.

** Abstract:** We
present interoperability as a guiding framework for statistical modelling to
assist policy makers asking multiple questions using diverse datasets in the
face of an evolving pandemic response. Interoperability provides an important
set of principles for future pandemic preparedness, through the joint design
and deployment of adaptable systems of statistical models for disease
surveillance using probabilistic reasoning. We illustrate this through case
studies for inferring and characterising spatial-temporal prevalence and
reproduction numbers of SARS-CoV-2 infections in England.

Wenlin Chen, Austin Tripp, and José Miguel Hernández-Lobato.
**Meta-learning adaptive deep
kernel Gaussian processes for molecular property prediction**.
*arXiv*, 2022.

** Abstract:** We propose Adaptive Deep
Kernel Fitting with Implicit Function Theorem (ADKF-IFT), a novel framework
for learning deep kernel Gaussian processes (GPs) by interpolating between
meta-learning and conventional deep kernel learning. Our approach employs a
bilevel optimization objective where we meta-learn generally useful feature
representations across tasks, in the sense that task-specific GP models
estimated on top of such features achieve the lowest possible predictive loss
on average. We solve the resulting nested optimization problem using the
implicit function theorem (IFT). We show that our ADKF-IFT framework contains
previously proposed Deep Kernel Learning (DKL) and Deep Kernel Transfer (DKT)
as special cases. Although ADKF-IFT is a completely general method, we argue
that it is especially well-suited for drug discovery problems and demonstrate
that it significantly outperforms previous state-of-the-art methods on a
variety of real-world few-shot molecular property prediction tasks and
out-of-domain molecular property prediction and optimization tasks.

Austin Tripp, Wenlin Chen, and José Miguel Hernández-Lobato.
**An evaluation framework
for the objective functions of de novo drug design benchmarks**.
In *ICLR 2022 Workshop on Machine Learning for Drug Discovery*, 2022.

** Abstract:** De novo drug design has recently received
increasing attention from the machine learning community. It is important
that the field is aware of the actual goals and challenges of drug design and
the roles that de novo molecule design algorithms could play in accelerating
the process, so that algorithms can be evaluated in a way that reflects how
they would be applied in real drug design scenarios. In this paper, we
propose a framework for critically assessing the merits of benchmarks, and
argue that most of the existing de novo drug design benchmark functions are
either highly unrealistic or depend upon a surrogate model whose performance
is not well characterized. In order for the field to achieve its long-term
goals, we recommend that poor benchmarks (especially logP and QED) be
deprecated in favour of better benchmarks. We hope that our proposed
framework can play a part in developing new de novo drug design benchmarks
that are more realistic and ideally incorporate the intrinsic goals of drug
design.

Jan M Brauner, Sören Mindermann, Mrinank Sharma, David Johnston, John
Salvatier, Tomáš Gavenčiak, Anna B Stephenson, Gavin Leech,
George Altman, Vladimir Mikulik, Alexander John Norman, Joshua Teperowski
Monrad, Tamay Besiroglu, Hong Ge, Meghan A Hartwick, Yee Whye Teh, Leonid
Chindelevitch, Yarin Gal, and Jan Kulveit.
**Inferring the
effectiveness of government interventions against COVID-19**.
*Science*, December 2020.

**Latent
Gaussian processes for distribution estimation of multivariate categorical
data**.
In *Proceedings of the 32nd International Conference on Machine Learning
(ICML-15)*, pages 645-654, 2015.

** Abstract:**
Multivariate categorical data occur in many applications of machine learning.
One of the main difficulties with these vectors of categorical variables is
sparsity. The number of possible observations grows exponentially with vector
length, but dataset diversity might be poor in comparison. Recent models have
gained significant improvement in supervised tasks with this data. These
models embed observations in a continuous space to capture similarities
between them. Building on these ideas we propose a Bayesian model for the
unsupervised task of distribution estimation of multivariate categorical
data. We model vectors of categorical variables as generated from a
non-linear transformation of a continuous latent space. Non-linearity
captures multi-modality in the distribution. The continuous representation
addresses sparsity. Our model ties together many existing models, linking the
linear categorical latent Gaussian model, the Gaussian process latent
variable model, and Gaussian process classification. We derive inference for
our model based on recent developments in sampling based variational
inference. We show empirically that the model outperforms its linear and
discrete counterparts in imputation tasks of sparse data.

Wouter Boomsma, Pengfei Tian, Jes Frellsen, Jesper Ferkinghoff-Borg, Thomas
Hamelryck, Kresten Lindorff-Larsen, and Michele Vendruscolo.
**Equilibrium simulations of proteins using molecular fragment replacement and
NMR chemical shifts**.
*Proceedings of the National Academy of Sciences*, 111(38):13852-13857,
2014, doi
10.1073/pnas.1404948111.

** Abstract:** Methods of protein
structure determination based on NMR chemical shifts are becoming
increasingly common. The most widely used approaches adopt the molecular
fragment replacement strategy, in which structural fragments are repeatedly
reassembled into different complete conformations in molecular simulations.
Although these approaches are effective in generating individual structures
consistent with the chemical shift data, they do not enable the sampling of
the conformational space of proteins with correct statistical weights. Here,
we present a method of molecular fragment replacement that makes it possible
to perform equilibrium simulations of proteins, and hence to determine their
free energy landscapes. This strategy is based on the encoding of the
chemical shift information in a probabilistic model in Markov chain Monte
Carlo simulations. First, we demonstrate that with this approach it is
possible to fold proteins to their native states starting from extended
structures. Second, we show that the method satisfies the detailed balance
condition and hence it can be used to carry out an equilibrium sampling from
the Boltzmann distribution corresponding to the force field used in the
simulations. Third, by comparing the results of simulations carried out with
and without chemical shift restraints we describe quantitatively the effects
that these restraints have on the free energy landscapes of proteins. Taken
together, these results demonstrate that the molecular fragment replacement
strategy can be used in combination with chemical shift information to
characterize not only the native structures of proteins but also their
conformational fluctuations.

Jes Frellsen, Thomas Hamelryck, and Jesper Ferkinghoff-Borg.
**Combining the
multicanonical ensemble with generative probabilistic models of local
biomolecular structure**.
In *Proceedings of the 59th World Statistics Congress of the
International Statistical Institute*, pages 139-144, Hong Kong,
2014.

** Abstract:** Markov chain Monte Carlo is a powerful
tool for sampling complex systems such as large biomolecular structures.
However, the standard Metropolis-Hastings algorithm suffers from a number of
deficiencies when applied to systems with rugged free-energy landscapes. Some
of these deficiencies can be addressed with the multicanonical ensemble. In
this paper we will present two strategies for applying the multicanonical
ensemble to distributions constructed from generative probabilistic models of
local biomolecular structure. In particular, we will describe how to use the
multicanonical ensemble efficiently in conjunction with the reference ratio
method.

P. Kirk, J. E. Griffin, R. S. Savage, Z. Ghahramani, and D. L. Wild.
**Bayesian
correlated clustering to integrate multiple datasets**.
*Bioinformatics*, 2012.

** Abstract:** Motivation: The
integration of multiple datasets remains a key challenge in systems biology
and genomic medicine. Modern high-throughput technologies generate a broad
array of different data types, providing distinct – but often complementary
– information. We present a Bayesian method for the unsupervised
integrative modelling of multiple datasets, which we refer to as MDI
(Multiple Dataset Integration). MDI can integrate information from a wide
range of different datasets and data types simultaneously (including the
ability to model time series data explicitly using Gaussian processes). Each
dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture
model, with dependencies between these models captured via parameters that
describe the agreement among the datasets.

Results: Using a set of 6
artificially constructed time series datasets, we show that MDI is able to
integrate a significant number of datasets simultaneously, and that it
successfully captures the underlying structural similarity between the
datasets. We also analyse a variety of real S. cerevisiae datasets. In the
2-dataset case, we show that MDI’s performance is comparable to the present
state of the art. We then move beyond the capabilities of current approaches
and integrate gene expression, ChIP-chip and protein-protein interaction
data, to identify a set of protein complexes for which genes are co-regulated
during the cell cycle. Comparisons to other unsupervised data integration
techniques – as well as to non-integrative approaches – demonstrate that
MDI is very competitive, while also providing information that would be
difficult or impossible to extract using other methods.

** Comment:** This paper is available from the Bioinformatics
site and a Matlab implementation of MDI is available from this site.

Paul D. W. Kirk, Jim E. Griffin, Richard S. Savage, Zoubin Ghahramani, and
David L. Wild.
**Bayesian
correlated clustering to integrate multiple datasets**.
*Bioinformatics*, 28(24):3290-3297, 2012.

** Abstract:**
MOTIVATION: The integration of multiple datasets remains a key challenge in
systems biology and genomic medicine. Modern high-throughput technologies
generate a broad array of different data types, providing distinct – but often
complementary – information. We present a Bayesian method for the unsupervised
integrative modelling of multiple datasets, which we refer to as MDI
(Multiple Dataset Integration). MDI can integrate information from a wide
range of different datasets and data types simultaneously (including the
ability to model time series data explicitly using Gaussian processes). Each
dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture
model, with dependencies between these models captured through parameters
that describe the agreement among the datasets. RESULTS: Using a set of six
artificially constructed time series datasets, we show that MDI is able to
integrate a significant number of datasets simultaneously, and that it
successfully captures the underlying structural similarity between the
datasets. We also analyse a variety of real Saccharomyces cerevisiae
datasets. In the two-dataset case, we show that MDI's performance is
comparable with the present state-of-the-art. We then move beyond the
capabilities of current approaches and integrate gene expression, chromatin
immunoprecipitation-chip and protein-protein interaction data, to identify a
set of protein complexes for which genes are co-regulated during the cell
cycle. Comparisons to other unsupervised data integration techniques – as well
as to non-integrative approaches – demonstrate that MDI is competitive, while
also providing information that would be difficult or impossible to extract
using other methods.

Kyung-Ah Sohn, Zoubin Ghahramani, and Eric P. Xing.
**Robust estimation
of local genetic ancestry in admixed populations using a non-parametric
Bayesian approach**.
*Genetics*, 191(4), 2012.

** Abstract:** We present a new
haplotype-based approach for inferring local genetic ancestry of individuals
in an admixed population. Most existing approaches for local ancestry
estimation ignore the latent genetic relatedness between ancestral
populations and treat them as independent. In this paper, we exploit such
information by building an inheritance model that describes both the
ancestral populations and the admixed population jointly in a unified
framework. Based on an assumption that the common hypothetical founder
haplotypes give rise to both the ancestral and admixed population haplotypes,
we employ an infinite hidden Markov model to characterize each ancestral
population and further extend it to generate the admixed population. Through
an effective utilization of the population structural information under a
principled nonparametric Bayesian framework, the resulting model is
significantly less sensitive to the choice and the amount of training data
for ancestral populations than state-of-the-art algorithms. We also improve
the robustness under deviation from common modeling assumptions by
incorporating population-specific scale parameters that allow variable
recombination rates in different populations. Our method is applicable to an
admixed population from an arbitrary number of ancestral populations and also
performs competitively in terms of spurious ancestry proportions under
general multi-way admixture assumption. We validate the proposed method by
simulation under various admixing scenarios and present empirical analysis
results on a worldwide distributed dataset from the Human Genome Diversity
Project.

** Comment:** doi: 10.1534/genetics.112.140228

David A. Knowles and Zoubin Ghahramani.
**Nonparametric
Bayesian sparse factor models with application to gene expression
modelling**.
*Annals of Applied Statistics*, 5(2B):1534-1552, 2011.

**
Abstract:** A nonparametric Bayesian extension of Factor Analysis (FA) is
proposed where observed data Y is modeled as a linear superposition, G, of a
potentially infinite number of hidden factors, X. The Indian Buffet Process
(IBP) is used as a prior on G to incorporate sparsity and to allow the number
of latent features to be inferred. The model's utility for modeling gene
expression data is investigated using randomly generated data sets based on a
known sparse connectivity matrix for E. coli, and on three biological data
sets of increasing complexity.

David A. Knowles, Leopold Parts, Daniel Glass, and John M. Winn.
**Inferring a
measure of physiological age from multiple ageing related phenotypes**.
In *NIPS Workshop: From Statistical Genetics to Predictive Models in
Personalized Medicine*, 2011.

** Abstract:** What is
ageing? One definition is simultaneous degradation of multiple organ systems.
Can an individual be said to be "old" or "young" for their (chronological)
age in a scientifically meaningful way? We investigate these questions using
ageing related phenotypes measured on the 12,000 female twins in the Twins UK
study. We propose a simple linear model of ageing, which allows a latent
adjustment to be made to an individual's chronological age to give her
"physiological age", shared across the observed phenotypes. We note problems
with the analysis resulting from the linearity assumption and show how to
alleviate these issues using a non-linear extension. We find more gene
expression probes are significantly associated with our measurement of
physiological age than to chronological age.

Mehregan Movassagh, Mun-Kit Choy, David A. Knowles, Lina Cordeddu, Syed Haider,
Thomas Down, Lee Siggens, Ana Vujic, Ilenia Simeoni, Chris Penkett, Martin
Goddard, Pietro Lio, Martin Bennett, and Roger Foo.
**Distinct
epigenomic features in human cardiomyopathy**.
*Circulation, American Heart Association*, 2011.

**
Abstract:** Background. The epigenome refers to marks on the genome
including DNA methylation and histone modifications that regulate the
expression of underlying genes. A consistent profile of gene expression
changes in end-stage cardiomyopathy led us to hypothesise that distinct
global patterns of the epigenome may also exist. Methods and Results. We
constructed genome-wide maps of DNA methylation and Histone-3 Lysine-36
tri-methylation (H3K36me3)-enrichment for cardiomyopathic and normal human
hearts. 506Mb of sequence per library was generated by high-throughput
sequencing, covering 24 million out of the 28 million CG di-nucleotides in
the human genome. DNA methylation was significantly different in promoter
CpG-islands (CGI), intra-genic CGI, gene bodies and H3K36me3-enriched regions
of the genome. Moreover DNA methylation differences were present in promoters
of upregulated genes but not down-regulated genes. The profile of
H3K36me3-enrichment itself was also significantly different in protein-coding
regions of the genome. Conclusions. Distinct epigenomic patterns exist in
important DNA elements of the human cardiac genome in end-stage
cardiomyopathy. If epigenomic patterns track with disease progression, assays
for the epigenome may be more useful than quantification of mRNA for
assessing prognosis in heart failure. These results open up an important new
horizon of research and further studies will be needed to determine how
epigenomics contribute to altered gene expression in cardiomyopathy.

Cornelia Schone, Anne Venner, David A. Knowles, Mahesh M Karnani, and Denis
Burdakov.
**Dichotomous
cellular properties of mouse orexin/hypocretin neurons**.
*The Journal of Physiology*, 2011.

** Abstract:**
Hypothalamic hypocretin/orexin (hcrt/orx) neurons recently emerged as
critical regulators of sleep-wake cycles, reward-seeking, and body energy
balance. However, at the level of cellular and network properties, it remains
unclear whether hcrt/orx neurons are one homogenous population, or whether
there are several distinct types of hcrt/orx cells. Here, we collated diverse
structural and functional information about individual hcrt/orx neurons in
mouse brain slices, by combining patch-clamp analysis of spike firing,
membrane currents, and synaptic inputs with confocal imaging of cell shape
and subsequent 3-dimensional Sholl analysis of dendritic architecture.
Statistical cluster analysis of intrinsic firing properties revealed that
hcrt/orx neurons fall into two distinct types. These two cell types also
differ in the complexity of their dendritic arbour, the strength of AMPA and
GABAA receptor-mediated synaptic drive that they receive, and the density of
low-threshold, 4-aminopyridine-sensitive, transient K+ current. Our results
provide quantitative evidence that, at the cellular level, the mouse hcrt/orx
system is composed of two classes of neurons with different firing
properties, morphologies, and synaptic input organization.

Daniel Glass, Leopold Parts, David A. Knowles, Abraham Aviv, and Tim D.
Spector.
**No correlation between childhood maltreatment and telomere length**.
*Biological Psychiatry*, 68(6):21-22, 2010.

**
Abstract:** Telomeres are lengths of repetitive DNA that cap the ends of
chromosomes. They protect the ends of the chromosome and shorten with each
cell division. Short leukocyte telomere length has been related to a number
of age-related diseases. In addition, shorter telomere length has been
associated with environmental factors such as smoking and lack of exercise.
In a recent issue of Biological Psychiatry, Tyrka et al. (4) published a
report suggesting a link between maltreatment in childhood and telomere
shortening in 31 subjects. Individuals who had suffered maltreatment had
telomere length .70 +/- .24 compared with 1.02 +/- .52 in individuals who had
not been abused.

David A. Knowles, Leopold Parts, Daniel Glass, and John M. Winn.
**Modeling skin
and ageing phenotypes using latent variable models in infer.net**.
In *NIPS Workshop: Predictive Models in Personalized Medicine Workshop*,
2010.

** Abstract:** We demonstrate and compare three
unsupervised Bayesian latent variable models implemented in Infer.NET for
biomedical data modeling of 42 skin and ageing phenotypes measured on the
12,000 female twins in the Twins UK study. We address various data modeling
problems including high missingness, heterogeneous data, and repeat
observations. We compare the proposed models in terms of their performance at
predicting disease labels and symptoms from available explanatory variables,
concluding that factor analysis type models have the strongest statistical
performance in this setting. We show that such models can be combined with
regression components for improved interpretability.

C. Lippert, Z. Ghahramani, and K. Borgwardt.
**Gene function
prediction from synthetic lethality networks via ranking on demand**.
*Bioinformatics*, 26:912-918, 2010.

** Abstract:**
Motivation: Synthetic lethal interactions represent pairs of genes whose
individual mutations are not lethal, while the double mutation of both genes
does incur lethality. Several studies have shown a correlation between
functional similarity of genes and their distances in networks based on
synthetic lethal interactions. However, there is a lack of algorithms for
predicting gene function from synthetic lethality interaction networks.

Results: In this article, we present a novel technique called kernelROD for
gene function prediction from synthetic lethal interaction networks based on
kernel machines. We apply our novel algorithm to Gene Ontology functional
annotation prediction in yeast. Our experiments show that our method leads to
improved gene function prediction compared with state-of-the-art competitors
and that combining genetic and congruence networks leads to a further
improvement in prediction accuracy.

O. Stegle, K. J. Denby, E. J. Cooke, D. L. Wild, Z. Ghahramani, and K. M.
Borgwardt.
**A
robust Bayesian two-sample test for detecting intervals of differential
gene expression in microarray time series**.
*Journal of Computational Biology*, 17(3):1-13, 2010, doi
10.1089/cmb.2009.0175.

** Abstract:** Understanding the
regulatory mechanisms that are responsible for an organism's response to
environmental change is an important issue in molecular biology. A first and
important step towards this goal is to detect genes whose expression levels
are affected by altered external conditions. A range of methods to test for
differential gene expression, both in static as well as in time-course
experiments, have been proposed. While these tests answer the question
*whether* a gene is differentially expressed, they do not explicitly
address the question *when* a gene is differentially expressed, although
this information may provide insights into the course and causal structure of
regulatory programs. In this article, we propose a two-sample test for
identifying intervals of differential gene expression in microarray time
series. Our approach is based on Gaussian process regression, can deal with
arbitrary numbers of replicates, and is robust with respect to outliers. We
apply our algorithm to study the response of *Arabidopsis thaliana*
genes to an infection by a fungal pathogen using a microarray time series
dataset covering 30,336 gene probes at 24 observed time points. In
classification experiments, our test compares favorably with existing methods
and provides additional insights into time-dependent differential
expression.

R. S. Savage, Z. Ghahramani, J. E. Griffin, B. de la Cruz, and D. L. Wild.
**Discovering
transcriptional modules by Bayesian data integration**.
*Bioinformatics*, 26:i158-i167, 2010.

** Abstract:**
Motivation: We present a method for directly inferring transcriptional
modules (TMs) by integrating gene expression and transcription factor binding
(ChIP-chip) data. Our model extends a hierarchical Dirichlet process mixture
model to allow data fusion on a gene-by-gene basis. This encodes the
intuition that co-expression and co-regulation are not necessarily equivalent
and hence we do not expect all genes to group similarly in both datasets. In
particular, it allows us to identify the subset of genes that share the same
structure of transcriptional modules in both datasets.

Results: We find
that by working on a gene-by-gene basis, our model is able to extract
clusters with greater functional coherence than existing methods. By
combining gene expression and transcription factor binding (ChIP-chip) data
in this way, we are better able to determine the groups of genes that are
most likely to represent underlying TMs.

Availability: If interested in
the code for the work presented in this article, please contact the
authors.

R. Silva, K. A. Heller, Z. Ghahramani, and E. M. Airoldi.
**Ranking
relations using analogies in biological and information networks**.
*Annals of Applied Statistics*, 4(2):615-644, 2010.

**
Abstract:** Analogical reasoning depends fundamentally on the ability to
learn and generalize about relations between objects. We develop an approach
to relational learning which, given a set of pairs of objects S = A(1):B(1),
A(2):B(2), ..., A(N):B(N), measures how well other pairs A:B fit in with the
set S. Our work addresses the question: is the relation between objects A and
B analogous to those relations found in S? Such questions are particularly
relevant in information retrieval, where an investigator might want to search
for analogous pairs of objects that match the query set of interest. There
are many ways in which objects can be related, making the task of measuring
analogies very challenging. Our approach combines a similarity measure on
function spaces with Bayesian analysis to produce a ranking. It requires data
containing features of the objects of interest and a link matrix specifying
which relationships exist; no further attributes of such relationships are
necessary. We illustrate the potential of our method on text analysis and
information networks. An application on discovering functional interactions
between pairs of proteins is discussed in detail, where we show that our
approach can work in practice even if a small set of protein pairs is
provided.

O. Stegle, K. Denby, S. McHattie, A. Meade, D. Wild, Z. Ghahramani, and
K. Borgwardt.
**Discovering
temporal patterns of differential gene expression in microarray time
series**.
In *German Conference on Bioinformatics*, pages 133-142, Halle,
Germany, September 2009.

** Abstract:** A wealth of time
series of microarray measurements have become available over recent years.
Several two-sample tests for detecting differential gene expression in these
time series have been defined, but they can only answer the question
*whether* a gene is differentially expressed across the whole time
series, not *in which intervals* it is differentially expressed. In this
article, we propose a Gaussian process based approach for studying these
dynamics of differential gene expression. In experiments on *Arabidopsis
thaliana* gene expression levels, our novel technique helps us to uncover
that the family of WRKY transcription factors appears to be involved in the
early response to infection by a fungal pathogen.

R. Savage, K. A. Heller, Y. Xu, Zoubin Ghahramani, W. Truman, M. Grant,
K. Denby, and D. L. Wild.
**R/BHC: fast
Bayesian hierarchical clustering for microarray data**.
*BMC Bioinformatics*, 10(242):1-9, August 2009, doi
10.1186/1471-2105-10-242.

** Abstract:** Background:
Although the use of clustering methods has rapidly become one of the standard
computational approaches in the literature of microarray gene expression data
analysis, little attention has been paid to uncertainty in the results
obtained.

Results: We present an R/Bioconductor port of a fast novel
algorithm for Bayesian agglomerative hierarchical clustering and demonstrate
its use in clustering gene expression microarray data. The method performs
bottom-up hierarchical clustering, using a Dirichlet Process (infinite
mixture) to model uncertainty in the data and Bayesian model selection to
decide at each step which clusters to merge.

Conclusion: Biologically
plausible results are presented from a well studied data set: expression
profiles of *A. thaliana* subjected to a variety of biotic and abiotic
stresses. Our method avoids several limitations of traditional methods, for
example how many clusters there should be and how to choose a principled
distance metric.

C. Lippert, O. Stegle, Z. Ghahramani, and K. Borgwardt.
**A kernel
method for unsupervised structured network inference**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 368-375,
Clearwater Beach, FL, USA, April 2009. Journal of Machine Learning Research.
ISSN: 1938-7228.

** Abstract:** Network inference is the problem
of inferring edges between a set of real-world objects, for instance,
interactions between pairs of proteins in bioinformatics. Current
kernel-based approaches to this problem share a set of common features: (i)
they are supervised and hence require labeled training data; (ii) edges in
the network are treated as mutually independent and hence topological
properties are largely ignored; (iii) they lack a statistical interpretation.
We argue that these common assumptions are often undesirable for network
inference, and propose (i) an unsupervised kernel method (ii) that takes the
global structure of the network into account and (iii) is statistically
motivated. We show that our approach can explain commonly used heuristics in
statistical terms. In experiments on social networks, different variants of
our method demonstrate appealing predictive performance.

David A. Knowles and Susan Holmes.
**Statistical tools
for ultra-deep pyrosequencing of fast evolving viruses**.
In *NIPS Workshop: Computational Biology*, 2009.

**
Abstract:** We aim to detect minor variant Hepatitis B viruses (HBV) in 38
pyrosequencing samples from infected individuals. Errors involved in the
amplification and ultra deep pyrosequencing (UDPS) of these samples are
characterised using HBV plasmid controls. Homopolymeric regions and quality
scores are found to be significant covariates in determining insertion and
deletion (indel) error rates, but not mismatch rates which depend on the
nucleotide transition matrix. This knowledge is used to derive two methods
for classifying genuine mutations: a hypothesis testing framework and a
mixture model. Using an approximate "ground truth" from a limiting dilution
Sanger sequencing run, these methods are shown to outperform the naive
percentage threshold approach. The possibility of early stage PCR errors
becoming significant is investigated by simulation, which underlines the
importance of the initial copy number.

Carl Edward Rasmussen, Bernhard J. de la Cruz, Zoubin Ghahramani, and David L.
Wild.
**Modeling and visualizing
uncertainty in gene expression clusters using Dirichlet process
mixtures**.
*IEEE/ACM Transactions on Computational Biology and Bioinformatics*,
6(4):615-628, 2009, doi
10.1109/TCBB.2007.70269.

** Abstract:** Although the use
of clustering methods has rapidly become one of the standard computational
approaches in the literature of microarray gene expression data, little
attention has been paid to uncertainty in the results obtained. Dirichlet
process mixture (DPM) models provide a nonparametric Bayesian alternative to
the bootstrap approach to modeling uncertainty in gene expression clustering.
Most previously published applications of Bayesian model-based clustering
methods have been to short time series data. In this paper, we present a case
study of the application of nonparametric Bayesian clustering methods to the
clustering of high-dimensional non-time series gene expression data using full
Gaussian covariances. We use the probability that two genes belong to the
same cluster in a DPM model as a measure of the similarity of these gene
expression profiles. Conversely, this probability can be used to define a
dissimilarity measure, which, for the purposes of visualization, can be input
to one of the standard linkage algorithms used for hierarchical clustering.
Biologically plausible results are obtained from the Rosetta compendium of
expression profiles which extend previously published cluster analyses of
this data.

O. Stegle, K. Denby, David L. Wild, Zoubin Ghahramani, and Karsten Borgwardt.
**A robust
Bayesian two-sample test for detecting intervals of differential gene
expression in microarray time series**.
In *13th Annual International Conference on Research in Computational
Molecular Biology (RECOMB 2009)*, volume 5541 of *Lecture Notes in
Bioinformatics*, pages 201-216, Tucson, AZ, USA, 2009. Springer-Verlag,
doi
10.1007/978-3-642-02008-7_14.

** Abstract:** Understanding
the regulatory mechanisms that are responsible for an organism's response to
environmental changes is an important question in molecular biology. A first
and important step towards this goal is to detect genes whose expression
levels are affected by altered external conditions. A range of methods to
test for differential gene expression, both in static as well as in
time-course experiments, have been proposed. While these tests answer the
question *whether* a gene is differentially expressed, they do not
explicitly address the question *when* a gene is differentially
expressed, although this information may provide insights into the course and
causal structure of regulatory programs. In this article, we propose a
two-sample test for identifying *intervals* of differential gene
expression in microarray time series. Our approach is based on Gaussian
process regression, can deal with arbitrary numbers of replicates and is
robust with respect to outliers. We apply our algorithm to study the response
of *Arabidopsis thaliana* genes to an infection by a fungal pathogen
using a microarray time series dataset covering 30,336 gene probes at 24 time
points. In classification experiments our test compares favorably with
existing methods and provides additional insights into time-dependent
differential expression.

David Knowles and Zoubin Ghahramani.
**Infinite sparse
factor analysis and infinite independent components analysis**.
In *7th International Conference on Independent Component Analysis and
Signal Separation*, pages 381-388, London, UK, September 2007. Springer,
doi
10.1007/978-3-540-74494-8_48.

** Abstract:** A
nonparametric Bayesian extension of Independent Components Analysis (ICA) is
proposed where observed data Y is modelled as a linear superposition, G, of a
potentially infinite number of hidden sources, X. Whether a given source is
active for a specific data point is specified by an infinite binary matrix,
Z. The resulting sparse representation allows increased data reduction
compared to standard ICA. We define a prior on Z using the Indian Buffet
Process (IBP). We describe four variants of the model, with Gaussian or
Laplacian priors on X and the one or two-parameter IBPs. We demonstrate
Bayesian inference under these models using a Markov chain Monte Carlo (MCMC)
algorithm on synthetic and gene expression data and compare to standard ICA
algorithms.

Wei Chu, Zoubin Ghahramani, Roland Krause, and David L. Wild.
**Identifying
protein complexes in high-throughput protein interaction screens using an
infinite latent feature model**.
In Russ B. Altman, Tiffany Murray, Teri E. Klein, A. Keith Dunker, and Lawrence
Hunter, editors, *Pacific Symposium on Biocomputing*, pages 231-242.
World Scientific, 2006.

** Abstract:** We propose a Bayesian
approach to identify protein complexes and their constituents from
high-throughput protein-protein interaction screens. An infinite latent
feature model that allows for multi-complex membership by individual proteins
is coupled with a graph diffusion kernel that evaluates the likelihood of two
proteins belonging to the same complex. Gibbs sampling is then used to infer
a catalog of protein complexes from the interaction screen data. An advantage
of this model is that it places no prior constraints on the number of
complexes and automatically infers the number of significant complexes from
the data. Validation results using affinity purification/mass spectrometry
experimental data from yeast RNA-processing complexes indicate that our
method is capable of partitioning the data in a biologically meaningful way.
A supplementary web site containing larger versions of the figures is
available at http://public.kgi.edu/wild/PSBO6/index.html.

Wei Chu, Zoubin Ghahramani, Alexei A. Podtelezhnikov, and David L. Wild.
**Bayesian
segmental models with multiple sequence alignment profiles for protein
secondary structure and contact map prediction**.
*IEEE/ACM Trans. Comput. Biology Bioinform.*, 3(2):98-113, 2006.

** Abstract:** In this paper, we develop a segmental semi-Markov
model (SSMM) for protein secondary structure prediction which incorporates
multiple sequence alignment profiles with the purpose of improving the
predictive performance. The segmental model is a generalization of the hidden
Markov model where a hidden state generates segments of various length and
secondary structure type. A novel parameterized model is proposed for the
likelihood function that explicitly represents multiple sequence alignment
profiles to capture the segmental conformation. Numerical results on
benchmark data sets show that incorporating the profiles results in
substantial improvements and the generalization performance is promising. By
incorporating the information from long range interactions in beta-sheets,
this model is also capable of carrying out inference on contact maps. This is
an important advantage of probabilistic generative models over the
traditional discriminative approach to protein secondary structure
prediction. The Web server of our algorithm and supplementary materials are
available at http://public.kgi.edu/~wild/bsm.html.

Matthew J. Beal, Francesco Falciani, Zoubin Ghahramani, Claudia Rangel, and
David L. Wild.
**A Bayesian
approach to reconstructing genetic regulatory networks with hidden
factors**.
*Bioinformatics*, 21(3):349-356, 2005.

** Abstract:**
Motivation: We have used state-space models (SSMs) to reverse engineer
transcriptional networks from highly replicated gene expression profiling
time series data obtained from a well-established model of T cell activation.
SSMs are a class of dynamic Bayesian networks in which the observed
measurements depend on some hidden state variables that evolve according to
Markovian dynamics. These hidden variables can capture effects that cannot be
directly measured in a gene expression profiling experiment, for example:
genes that have not been included in the microarray, levels of regulatory
proteins, the effects of mRNA and protein degradation, etc. Results: We have
approached the problem of inferring the model structure of these state-space
models using both classical and Bayesian methods. In our previous work, a
bootstrap procedure was used to derive classical confidence intervals for
parameters representing 'gene-gene' interactions over time. In this article,
variational approximations are used to perform the analogous model selection
task in the Bayesian context. Certain interactions are present in both the
classical and the Bayesian analyses of these regulatory networks. The
resulting models place JunB and JunD at the centre of the mechanisms that
control apoptosis and proliferation. These mechanisms are key for clonal
expansion and for controlling the long term behavior (e.g. programmed cell
death) of these cells.

Wei Chu, Zoubin Ghahramani, Francesco Falciani, and David L. Wild.
**Biomarker
discovery in microarray gene expression data with Gaussian
processes**.
*Bioinformatics*, 21(16):3385-3393, 2005.

** Abstract:**
MOTIVATION: In clinical practice, pathological phenotypes are often labelled
with ordinal scales rather than binary, e.g. the Gleason grading system for
tumour cell differentiation. However, in the literature of microarray
analysis, these ordinal labels have been rarely treated in a principled way.
This paper describes a gene selection algorithm based on Gaussian processes
to discover consistent gene expression patterns associated with ordinal
clinical phenotypes. The technique of automatic relevance determination is
applied to represent the significance level of the genes in a Bayesian
inference framework. RESULTS: The usefulness of the proposed algorithm for
ordinal labels is demonstrated by the gene expression signature associated
with the Gleason score for prostate cancer data. Our results demonstrate how
multi-gene markers that may be initially developed with a diagnostic or
prognostic application in mind are also useful as an investigative tool to
reveal associations between specific molecular and cellular events and
features of tumour physiology. Our algorithm can also be applied to
microarray data with binary labels with results comparable to other methods
in the literature.

Philip E. Bourne, C. K. J. Allerston, Werner G. Krebs, Wilfred W. Li, Ilya N.
Shindyalov, Adam Godzik, Iddo Friedberg, Tong Liu, David L. Wild, Seungwoo
Hwang, Zoubin Ghahramani, Li Chen, and John D. Westbrook.
**The status of
structural genomics defined through the analysis of current targets and
structures**.
In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany A. Jung, and
Teri E. Klein, editors, *Pacific Symposium on Biocomputing*, pages
375-386. World Scientific, 2004.

** Abstract:** Structural
genomics (large-scale macromolecular 3-dimensional structure determination) is
unique in that major participants report scientific progress on a weekly
basis. The target database (TargetDB) maintained by the Protein Data Bank
(http://targetdb.pdb.org) reports this progress through the status of each
protein sequence (target) under consideration by the major structural
genomics centers worldwide. Hence, TargetDB provides a unique opportunity to
analyze the potential impact that this major initiative provides to
scientists interested in the sequence-structure-function-disease paradigm.
Here we report such an analysis with a focus on: (i) temporal
characteristics: how is the project doing and what can we expect in the
future? (ii) target characteristics: what are the predicted functions of the
proteins targeted by structural genomics and how biased is the target set
when compared to the PDB and to predictions across complete genomes? (iii)
structures solved: what are the characteristics of structures solved thus far
and what do they contribute? The analysis required a more extensive database
of structure predictions using different methods integrated with data from
other sources. This database, associated tools and related data sources are
available from http://spam.sdsc.edu.

Wei Chu, Zoubin Ghahramani, and David L. Wild.
**A graphical
model for protein secondary structure prediction**.
In Carla E. Brodley, editor, *ICML*, volume 69 of *ACM
International Conference Proceeding Series*. ACM, 2004.

** Abstract:** In this paper, we present a graphical model for protein
secondary structure prediction. This model extends segmental semi-Markov
models (SSMM) to exploit multiple sequence alignment profiles which contain
information from evolutionarily related sequences. A novel parameterized
model is proposed as the likelihood function for the SSMM to capture the
segmental conformation. By incorporating the information from long range
interactions in β-sheets, this model is capable of carrying out inference on
contact maps. The numerical results on benchmark data sets show that
incorporating the profiles results in substantial improvements and the
generalization performance is promising.

Wei Chu, Zoubin Ghahramani, and David L. Wild.
**Protein
secondary structure prediction using sigmoid belief networks to parameterize
segmental semi-Markov models**.
In *ESANN*, pages 81-86, 2004.

** Abstract:** In this
paper, we merge the parametric structure of neural networks into a segmental
semi-Markov model to set up a Bayesian framework for protein structure
prediction. The parametric model, which can also be regarded as an extension
of a sigmoid belief network, captures the underlying dependency in residue
sequences. The results of numerical experiments indicate the usefulness of
this approach.

A. Dubey, S. Hwang, C. Rangel, Carl Edward Rasmussen, Zoubin Ghahramani, and
David L. Wild.
**Clustering
protein sequence and structure space with infinite Gaussian mixture
models**.
In *Pacific Symposium on Biocomputing 2004*, pages 399-410, Singapore,
2004. World Scientific Publishing.

** Abstract:** We describe
a novel approach to the problem of automatically clustering protein sequences
and discovering protein families, subfamilies etc., based on the theory of
infinite Gaussian mixture models. This method allows the data itself to
dictate how many mixture components are required to model it, and provides a
measure of the probability that two proteins belong to the same cluster. We
illustrate our methods with application to three data sets: globin sequences,
globin sequences with known three-dimensional structures and G-protein coupled
receptor sequences. The consistency of the clusters indicates that our
method is producing biologically meaningful results, which provide a very
good indication of the underlying families and subfamilies. With the
inclusion of secondary structure and residue solvent accessibility
information, we obtain a classification of sequences of known structure which
reflects and extends their SCOP classifications.

Ananya Dubey, Seungwoo Hwang, Claudia Rangel, Carl Edward Rasmussen, Zoubin
Ghahramani, and David L. Wild.
**Clustering
protein sequence and structure space with infinite Gaussian mixture
models**.
In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany A. Jung, and
Teri E. Klein, editors, *Pacific Symposium on Biocomputing*, pages
399-410. World Scientific, 2004.

** Abstract:** We describe a
novel approach to the problem of automatically clustering protein sequences
and discovering protein families, subfamilies etc., based on the theory of
infinite Gaussian mixture models. This method allows the data itself to
dictate how many mixture components are required to model it, and provides a
measure of the probability that two proteins belong to the same cluster. We
illustrate our methods with application to three data sets: globin sequences,
globin sequences with known three-dimensional structures and G-protein
coupled receptor sequences. The consistency of the clusters indicates that our
method is producing biologically meaningful results, which provide a very
good indication of the underlying families and subfamilies. With the
inclusion of secondary structure and residue solvent accessibility
information, we obtain a classification of sequences of known structure which
both reflects and extends their SCOP classifications. A supplementary web
site containing larger versions of the figures is available at
http://public.kgi.edu/~wild/PSB04/index.html

Claudia Rangel, John Angus, Zoubin Ghahramani, Maria Lioumi, Elizabeth
Sotheran, Alessia Gaiba, David L. Wild, and Francesco Falciani.
**Modeling T-cell
activation using gene expression profiling and state-space models**.
*Bioinformatics*, 20(9):1361-1372, 2004.

** Abstract:**
Motivation: We have used state-space models to reverse engineer
transcriptional networks from highly replicated gene expression profiling
time series data obtained from a well-established model of T-cell activation.
State space models are a class of dynamic Bayesian networks that assume that
the observed measurements depend on some hidden state variables that evolve
according to Markovian dynamics. These hidden variables can capture effects
that cannot be measured in a gene expression profiling experiment, e.g. genes
that have not been included in the microarray, levels of regulatory proteins,
the effects of messenger RNA and protein degradation, etc. Results: Bootstrap
confidence intervals are developed for parameters representing 'gene-gene'
interactions over time. Our models represent the dynamics of T-cell
activation and provide a methodology for the development of rational and
experimentally testable hypotheses. Availability: Supplementary data and
Matlab computer source code will be made available on the web at the URL
given below. Supplementary information: .

A. Raval, Zoubin Ghahramani, and David L. Wild.
**A Bayesian
network model for protein fold and remote homologue recognition**.
*Bioinformatics*, 18(6):788-801, 2002.

** Abstract:**
Motivation: The Bayesian network approach is a framework which combines
graphical representation and probability theory, which includes, as a special
case, hidden Markov models. Hidden Markov models trained on amino acid
sequence or secondary structure data alone have been shown to have potential
for addressing the problem of protein fold and superfamily classification.
Results: This paper describes a novel implementation of a Bayesian network
which simultaneously learns amino acid sequence, secondary structure and
residue accessibility for proteins of known three-dimensional structure. An
awareness of the errors inherent in predicted secondary structure may be
incorporated into the model by means of a confusion matrix. Training and
validation data have been derived for a number of protein superfamilies from
the Structural Classification of Proteins (SCOP) database. Cross validation
results using posterior probability classification demonstrate that the
Bayesian network performs better in classifying proteins of known structural
superfamily than a hidden Markov model trained on amino acid sequences
alone.

## Information Retrieval

Information retrieval concerns developing systems that find material within a large unstructured collection (e.g. the internet) that satisfies the user's need. The best-known examples of such systems are web search engines, such as Google, but there are many other specialized applications of information retrieval (such as collaborative filtering and recommender systems). Information retrieval can be thought of as an inference problem: given the user's query, what are the relevant items in the data collection?

Bradley Butcher, Chris Robinson, Miri Zilka, Riccardo Fogliato, Carolyn
Ashurst, and Adrian Weller.
**Racial disparities in the
enforcement of marijuana violations in the US**.
*Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and
Society*, 2022.

** Abstract:** Racial disparities in US
drug arrest rates have been observed for decades, but their causes and policy
implications are still contested. Some have argued that the disparities
largely reflect differences in drug use between racial groups, while others
have hypothesized that discriminatory enforcement policies and police
practices play a significant role. In this work, we analyze racial
disparities in the enforcement of marijuana violations in the US. Using data
from the National Incident-Based Reporting System (NIBRS) and the National
Survey on Drug Use and Health (NSDUH) programs, we investigate whether
marijuana usage and purchasing behaviors can explain the racial composition
of offenders in police records. We examine potential driving mechanisms
behind these disparities and the extent to which county-level socioeconomic
factors are associated with corresponding disparities. Our results indicate
that the significant racial disparities in reported incidents and arrests
cannot be explained by differences in marijuana days-of-use alone. Variations
in the location where marijuana is purchased and in the frequency of these
purchases partially explain the observed disparities. We observe an increase
in racial disparities across most counties over the last decade, with the
greatest increases in states that legalized the use of marijuana within this
timeframe. Income, high school graduation rate, and rate of employment
positively correlate with larger racial disparities, while the rate of
incarceration is negatively correlated. We conclude with a discussion of the
implications of the observed racial disparities in the context of algorithmic
fairness.

Yanzhi Chen, Weihao Sun, Yingzhen Li, and Adrian Weller.
**Scalable infomin
learning**.
In *Advances in Neural Information Processing Systems*, 2022.

** Abstract:** The task of infomin learning aims to learn a
representation with high utility while being uninformative about a specified
target, with the latter achieved by minimising the mutual information between
the representation and the target. It has broad applications, ranging from
training fair prediction models against protected attributes, to unsupervised
learning with disentangled representations. Recent works on infomin learning
mainly use adversarial training, which involves training a neural network to
estimate mutual information or its proxy and thus is slow and difficult to
optimise. Drawing on recent advances in slicing techniques, we propose a new
infomin learning approach, which uses a novel proxy metric to mutual
information. We further derive an accurate and analytically computable
approximation to this proxy metric, thereby removing the need of constructing
neural network-based mutual information estimators. Compared to baselines,
experiments on algorithmic fairness, disentangled representation learning and
domain adaptation verify that our method can more effectively remove unwanted
information with limited time budget.

Jiri Hron, Karl Krauth, Michael I. Jordan, Niki Kilbertus, and Sarah Dean.
**Modelling content creator
incentives on algorithm-curated platforms**.
*arXiv*, 2022.

** Abstract:** Content creators compete
for user attention. Their reach crucially depends on algorithmic choices made
by developers on online platforms. To maximize exposure, many creators adapt
strategically, as evidenced by examples like the sprawling search engine
optimization industry. This begets competition for the finite user attention
pool. We formalize these dynamics in what we call an exposure game, a model
of incentives induced by algorithms including modern factorization and (deep)
two-tower architectures. We prove that seemingly innocuous algorithmic
choices—e.g., non-negative vs. unconstrained factorization—significantly
affect the existence and character of (Nash) equilibria in exposure games. We
proffer use of creator behavior models like ours for an (ex-ante)
pre-deployment audit. Such an audit can identify misalignment between
desirable and incentivized content, and thus complement post-hoc measures
like content filtering and moderation. To this end, we propose tools for
numerically finding equilibria in exposure games, and illustrate results of
an audit on the MovieLens and LastFM datasets. Among other things, we find that the
strategically produced content exhibits strong dependence between algorithmic
exploration and content diversity, and between model expressivity and bias
towards gender-based user and creator groups.

Marion Oswald, Luke Chambers, Ellen P Goodman, Pam Ugwudike, and Miri Zilka.
**The UK
algorithmic transparency standard: A qualitative analysis of police
perspectives**.
*Available at SSRN*, 2022.

** Abstract:**

1. The UK Government’s draft ‘Algorithmic Transparency Standard’ is intended
   to provide a standardised way for public bodies and government departments
   to provide information about how algorithmic tools are being used to
   support decisions. The research discussed in this report was conducted in
   parallel to the piloting of the Standard by the Cabinet Office and the
   Centre for Data Ethics and Innovation.
2. We conducted semi-structured interviews with respondents from across UK
   policing and commercial bodies involved in policing technologies. Our aim
   was to explore the implications for police forces of participation in the
   Standard, to identify rewards, risks, challenges for the police, and areas
   where the Standard could be improved, and therefore to contribute to the
   exploration of policy options for expansion of participation in the
   Standard.
3. Algorithmic transparency is both achievable for policing and could bring
   significant rewards. A key reward of police participation in the Standard
   is that it provides the opportunity to demonstrate proficient
   implementation of technology-driven policing, thus enhancing earned trust.
   Research participants highlighted the public good that could result from
   the considered use of algorithms.
4. Participants noted, however, a risk of misperception of the dangers of
   policing technology, especially if use of algorithmic tools was not
   appropriately compared to the status quo and current methods.
5. Participation in the Standard provides an opportunity to develop increased
   sharing among police forces of best practices (and things to avoid), and
   increased thoughtfulness among police force personnel in building and
   implementing new tools. Research participants were keen for compliance
   with the Standard to become an integral part of a holistic system to drive
   reflective practice across policing around the development and deployment
   of algorithmic technology. This could enable police to learn from each
   other, facilitate good policy choices and decrease wasted costs.
   Otherwise, the Standard may come to be regarded as an administrative
   burden rather than a benefit for policing.
6. Several key areas for amendment and improvement from the perspective of
   policing were identified in the research. These could improve the Standard
   for the benefit of all participants. These include a need for
   clarification of the scope of the Standard, and the stage of project
   development at which the Standard should apply. It is recommended that
   consideration be given to a ‘Standard-Lite’ for projects at the pilot or
   early stages of the development process in order to gain public
   understanding of new tools and applications. Furthermore, the Standard
   would benefit from a more substantial glossary (to include relevant
   policing terms) and additional guidance on the level of detail required in
   each section and how accuracy rates should be described, justified and
   explained in order to ensure consistency.
7. The research does not suggest any overriding reason why the Standard
   should not be applied in policing. Suitable exemptions for sensitive
   contexts and tradecraft would be required, however, and consideration
   given to ensuring that forces have the resources to comply with the
   Standard and to respond to the increased public interest that could ensue.
   Limiting the scope initially to tools on a defined list (to include the
   most high-risk tools, such as those that produce individualised
   risk/predictive scores) could assist in mitigating concerns over sensitive
   policing capabilities and resourcing. A non-public version of the Standard
   for sensitive applications and tools could also be considered, which would
   be available to bodies with an independent oversight function.
8. To support police compliance with the Standard, supplier responsibilities
   – including appropriate disclosure of algorithmic functionality, data
   inputs and performance – should be covered in procurement contracts and
   addressed up front as a mandatory requirement of doing business with the
   police.
9. As well as contributing to the piloting of the Standard, it is recommended
   that the findings of this report are considered at NPCC level, by the
   College of Policing and by the office of the Chief Scientific Advisor for
   Policing, as new sector-led guidance, best practice and policy are
   developed.

J. von Kügelgen, A.-H. Karimi, U. Bhatt, I. Valera, A. Weller, and
B. Schölkopf.
**On the fairness of causal
algorithmic recourse**.
In *Proceedings of the 36th AAAI Conference on Artificial Intelligence
(AAAI)*, 2022.

** Abstract:** Algorithmic fairness is
typically studied from the perspective of predictions. Instead, here we
investigate fairness from the perspective of recourse actions suggested to
individuals to remedy an unfavourable classification. We propose two new
fairness criteria at the group and individual level, which, unlike prior
work on equalising the average group-wise distance from the decision boundary,
explicitly account for causal relationships between features, thereby
capturing downstream effects of recourse actions performed in the physical
world. We explore how our criteria relate to others, such as counterfactual
fairness, and show that fairness of recourse is complementary to fairness of
prediction. We study theoretically and empirically how to enforce fair causal
recourse by altering the classifier and perform a case study on the Adult
dataset. Finally, we discuss whether fairness violations in the data
generating process revealed by our criteria may be better addressed by
societal interventions as opposed to constraints on the classifier.

Miri Zilka, Bradley Butcher, and Adrian Weller.
**A survey and datasheet
repository of publicly available US criminal justice datasets**.
*Thirty-sixth Conference on Neural Information Processing Systems Datasets
and Benchmarks Track*, 2022.

** Abstract:** Criminal
justice is an increasingly important application domain for machine learning
and algorithmic fairness, as predictive tools are becoming widely used in
police, courts, and prison systems worldwide. A few relevant benchmarks have
received significant attention, e.g., the COMPAS dataset, often without
proper consideration of the domain context. To raise awareness of publicly
available criminal justice datasets and encourage their responsible use, we
conduct a survey, consider contexts, highlight potential uses, and identify
gaps and limitations. We provide datasheets for 15 datasets and upload them
to a public repository. We compare the datasets across several dimensions,
including size, coverage of the population, and potential use, highlighting
concerns. We hope that this work can provide a useful starting point for
researchers looking for appropriate datasets related to criminal justice, and
that the repository will continue to grow as a community effort.

Miri Zilka, Holli Sargeant, and Adrian Weller.
**Transparency, governance
and regulation of algorithmic tools deployed in the criminal justice system:
A UK case study**.
*Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and
Society*, 2022.

** Abstract:** We present a survey of
tools used in the criminal justice system in the UK in three categories: data
infrastructure, data analysis, and risk prediction. Many tools are currently
in deployment, offering potential benefits, including improved efficiency and
consistency. However, there are also important concerns. Transparent
information about these tools, their purpose, how they are used, and by whom
is difficult to obtain. Even when information is available, it is often
insufficient to enable a satisfactory evaluation. More work is needed to
establish governance mechanisms to ensure that tools are deployed in a
transparent, safe and ethical way. We call for more engagement with
stakeholders and greater documentation of the intended goal of a tool, how it
will achieve this goal compared to other options, and how it will be
monitored in deployment. We highlight additional points to consider when
evaluating the trustworthiness of deployed tools and make concrete proposals
for policy.

Jiri Hron, Karl Krauth, Michael I. Jordan, and Niki Kilbertus.
**On component interactions in
two-stage recommender systems**.
*NeurIPS*, 2021.

** Abstract:** Thanks to their
scalability, two-stage recommenders are used by many of today's largest
online platforms, including YouTube, LinkedIn, and Pinterest. These systems
produce recommendations in two steps: (i) multiple nominators, tuned for low
prediction latency, preselect a small subset of candidates from the whole
item pool; (ii) a slower but more accurate ranker further narrows down the
nominated items, and serves to the user. Despite their popularity, the
literature on two-stage recommenders is relatively scarce, and the algorithms
are often treated as mere sums of their parts. Such treatment presupposes
that the two-stage performance is explained by the behavior of the individual
components in isolation. This is not the case: using synthetic and real-world
data, we demonstrate that interactions between the ranker and the nominators
substantially affect the overall performance. Motivated by these findings, we
derive a generalization lower bound which shows that independent nominator
training can lead to performance on par with uniformly random
recommendations. We find that careful design of item pools, each assigned to
a different nominator, alleviates these issues. As manual search for a good
pool allocation is difficult, we propose to learn one instead using a
Mixture-of-Experts based approach. This significantly improves both precision
and recall at K.

Oliver Thomas, Miri Zilka, Adrian Weller, and Novi Quadrianto.
**An algorithmic framework
for positive action**.
*Equity and Access in Algorithms, Mechanisms, and Optimization*, 2021.

** Abstract:** Positive action is defined within
anti-discrimination legislation as voluntary, legal action taken to address
an imbalance of opportunity affecting individuals belonging to
under-represented groups. Within this theme, we propose a novel algorithmic
fairness framework to advance equal representation while respecting
anti-discrimination legislation and equal-treatment rights. We use a
counterfactual fairness approach to assign one of three outcomes to each
candidate: accept; reject; or flagged as a positive action candidate.

Botty Dimanov, Umang Bhatt, Mateja Jamnik, and Adrian Weller.
**You
shouldn't trust me: Learning models which conceal unfairness from multiple
explanation methods**.
In *European Conference on Artificial Intelligence (ECAI)*, 2020.

** Abstract:** Transparency of algorithmic systems has been
discussed as a way for end-users and regulators to develop appropriate trust
in machine learning models. One popular approach, LIME [26], even suggests
that model explanations can answer the question "Why should I trust you?"
Here we show a straightforward method for modifying a pre-trained model to
manipulate the output of many popular feature importance explanation methods
with little change in accuracy, thus demonstrating the danger of trusting
such explanation methods. We show how this explanation attack can mask a
model’s discriminatory use of a sensitive feature, raising strong concerns
about using such explanation methods to check model fairness.

Jiri Hron, Karl Krauth, Michael I. Jordan, and Niki Kilbertus.
**Exploration in two-stage
recommender systems**.
*REVEAL (ACM RecSys workshop)*, 2020.

** Abstract:**
Two-stage recommender systems are widely adopted in industry due to their
scalability and maintainability. These systems produce recommendations in two
steps: (i) multiple nominators preselect a small number of items from a large
pool using cheap-to-compute item embeddings; (ii) with a richer set of
features, a ranker rearranges the nominated items and serves them to the
user. A key challenge of this setup is that optimal performance of each stage
in isolation does not imply optimal global performance. In response to this
issue, Ma et al. (2020) proposed a nominator training objective importance
weighted by the ranker's probability of recommending each item. In this work,
we focus on the complementary issue of exploration. Modeled as a contextual
bandit problem, we find LinUCB (a near optimal exploration strategy for
single-stage systems) may lead to linear regret when deployed in two-stage
recommenders. We therefore propose a method of synchronising the exploration
strategies between the ranker and the nominators. Our algorithm only relies
on quantities already computed by standard LinUCB at each stage and can be
implemented in three lines of additional code. We end by demonstrating the
effectiveness of our algorithm experimentally.

Moein Khajehnejad, Ahmad Asgharian Rezaei, Mahmoudreza Babaei, Jessica
Hoffmann, Mahdi Jalili, and Adrian Weller.
**Adversarial graph embeddings for
fair influence maximization over social networks**.
In *International Joint Conference on Artificial Intelligence*, 2020.

** Abstract:** Influence maximization is a widely studied topic
in network science, where the aim is to reach the maximum possible number of
nodes, while only targeting a small initial set of individuals. It has
critical applications in many fields, including viral marketing, information
propagation, news dissemination, and vaccinations. However, the objective
does not usually take into account whether the final set of influenced nodes
is fair with respect to sensitive attributes, such as race or gender. Here we
address fair influence maximization, aiming to reach minorities more
equitably. We introduce Adversarial Graph Embeddings: we co-train an
auto-encoder for graph embedding and a discriminator to discern sensitive
attributes. This leads to embeddings which are similarly distributed across
sensitive attributes. We then find a good initial set by clustering the
embeddings. We believe we are the first to use embeddings for the task of
fair influence maximization. While there are typically trade-offs between
fairness and influence maximization objectives, our experiments on synthetic
and real-world datasets show that our approach dramatically reduces disparity
while remaining competitive with state-of-the-art influence maximization
methods.

Niki Kilbertus, Manuel Gomez Rodriguez, Bernhard Schölkopf, Krikamol Muandet,
and Isabel Valera.
**Fair decisions
despite imperfect predictions**.
In Silvia Chiappa and Roberto Calandra, editors, *23rd International
Conference on Artificial Intelligence and Statistics*, volume 108 of
*Proceedings of Machine Learning Research*, pages 277-287. PMLR,
26-28 Aug 2020.

** Abstract:** Consequential decisions are
increasingly informed by sophisticated data-driven predictive models.
However, consistently learning accurate predictive models requires access to
ground truth labels. Unfortunately, in practice, labels may only exist
conditional on certain decisions—if a loan is denied, there is not even an
option for the individual to pay back the loan. In this paper, we show that,
in this selective labels setting, learning to predict is suboptimal in terms
of both fairness and utility. To avoid this undesirable behavior, we propose
to directly learn stochastic decision policies that maximize utility under
fairness constraints. In the context of fair machine learning, our results
suggest the need for a paradigm shift from "learning to predict" to "learning
to decide". Experiments on synthetic and real-world data illustrate the
favorable properties of learning to decide, in terms of both utility and
fairness.

Niki Kilbertus, Phil Ball, Matt Kusner, Adrian Weller, and Ricardo Silva.
**The sensitivity of counterfactual
fairness to unmeasured confounding**.
In *35th Conference on Uncertainty in Artificial Intelligence*, Tel
Aviv, July 2019.

** Abstract:** Causal approaches to fairness
have seen substantial recent interest, both from the machine learning
community and from wider parties interested in ethical prediction algorithms.
In no small part, this has been due to the fact that causal models allow one
to simultaneously leverage data and expert knowledge to remove discriminatory
effects from predictions. However, one of the primary assumptions in causal
modeling is that you know the causal graph. This introduces a new opportunity
for bias, caused by misspecifying the causal model. One common way for
misspecification to occur is via unmeasured confounding: the true causal
effect between variables is partially described by unobserved quantities. In
this work we design tools to assess the sensitivity of fairness measures to
this confounding for the popular class of non-linear additive noise models
(ANMs). Specifically, we give a procedure for computing the maximum
difference between two counterfactually fair predictors, where one has become
biased due to confounding. For the case of bivariate confounding our
technique can be swiftly computed via a sequence of closed-form updates. For
multivariate confounding we give an algorithm that can be efficiently solved
via automatic differentiation. We demonstrate our new sensitivity analysis
tools in real-world fairness scenarios to assess the bias arising from
confounding.

Tameem Adel, Isabel Valera, Zoubin Ghahramani, and Adrian Weller.
**One-network
adversarial fairness**.
In *33rd AAAI Conference on Artificial Intelligence*, Hawaii, January
2019.

** Abstract:** There is currently a great expansion of
the impact of machine learning algorithms on our lives, prompting the need
for objectives other than pure performance, including fairness. Fairness here
means that the outcome of an automated decision-making system should not
discriminate between subgroups characterized by sensitive attributes such as
gender or race. Given any existing differentiable classifier, we make only
slight adjustments to the architecture including adding a new hidden layer,
in order to enable the concurrent adversarial optimization for fairness and
accuracy. Our framework provides one way to quantify the tradeoff between
fairness and accuracy, while also leading to strong empirical
performance.

Stephen Cave, Rune Nyrup, Karina Vold, and Adrian Weller.
**Motivations and risks
of machine ethics**.
*Proceedings of the IEEE*, 107(3):562-574, 2019.

** Abstract:** This paper surveys reasons for and against pursuing the field
of machine ethics, understood as research aiming to build "ethical
machines." We clarify the nature of this goal, why it is worth pursuing, and
the risks involved in its pursuit. First, we survey and clarify some of the
philosophical issues surrounding the concept of an "ethical machine" and
the aims of machine ethics. Second, we argue that while there are good prima
facie reasons for pursuing machine ethics, including the potential to improve
the ethical alignment of both humans and machines, there are also potential
risks that must be considered. Third, we survey these potential risks and
point to where research should be devoted to clarifying and managing
potential risks. We conclude by making some recommendations about the
questions that future work could address.

Niki Kilbertus, Adria Gascon, Matt Kusner, Michael Veale, Krishna P. Gummadi,
and Adrian Weller.
**Blind
justice: Fairness with encrypted sensitive attributes**.
In *35th International Conference on Machine Learning*, Stockholm,
Sweden, July 2018.

** Abstract:** Recent work has explored how
to train machine learning models which do not discriminate against any
subgroup of the population as determined by sensitive attributes such as
gender or race. To avoid disparate treatment, sensitive attributes should not
be considered. On the other hand, in order to avoid disparate impact,
sensitive attributes must be examined — e.g., in order to learn a fair
model, or to check if a given model is fair. We introduce methods from secure
multi-party computation which allow us to avoid both. By encrypting sensitive
attributes, we show how an outcome-based fair model may be learned, checked,
or have its outputs verified and held to account, without users revealing
their sensitive attributes.

Nina Grgić-Hlača, Elissa Redmiles, Krishna P. Gummadi, and Adrian
Weller.
**Human
perceptions of fairness in algorithmic decision making: A case study of
criminal risk prediction**.
In *The Web Conference (WWW)*, Lyon, April 2018.

** Abstract:** As algorithms are increasingly used to make important decisions
that affect human lives, ranging from social benefit assignment to predicting
risk of criminal recidivism, concerns have been raised about the fairness of
algorithmic decision making. Most prior works on algorithmic fairness
normatively prescribe how fair decisions ought to be made. In contrast, here,
we descriptively survey users for how they perceive and reason about fairness
in algorithmic decision making. A key contribution of this work is the
framework we propose to understand why people perceive certain features as
fair or unfair to be used in algorithms. Our framework identifies eight
properties of features, such as relevance, volitionality and reliability, as
latent considerations that inform people’s moral judgments about the
fairness of feature use in decision-making algorithms. We validate our
framework through a series of scenario-based surveys with 576 people. We find
that, based on a person’s assessment of the eight latent properties of a
feature in our exemplar scenario, we can accurately (> 85%) predict if the
person will judge the use of the feature as fair. Our findings have important
implications. At a high level, we show that people’s unfairness concerns
are multi-dimensional and argue that future studies need to address
unfairness concerns beyond discrimination. At a low level, we find
considerable disagreements in people’s fairness judgments. We identify root
causes of the disagreements, and note possible pathways to resolve them.

Mahmoudreza Babaei, Juhi Kulshrestha, Abhijnan Chakraborty, Fabricio
Benevenuto, Krishna P. Gummadi, and Adrian Weller.
**Purple
feed: Identifying high consensus news posts on social media**.
In *1st AAAI/ACM Conference on Artificial Intelligence, Ethics and
Society*, New Orleans, February 2018.

** Abstract:**
Although diverse news stories are actively posted on social media, readers
often focus on news which reinforces their pre-existing views, leading to
‘filter bubble’ effects. To combat this, some recent systems expose and
nudge readers toward stories with different points of view. One example is
the Wall Street Journal’s ‘Blue Feed, Red Feed’ system, which presents
posts from biased publishers on each side of a topic. However, these systems
have had limited success. In this work, we present a complementary approach
which identifies high consensus ‘purple’ posts that generate similar
reactions from both ‘blue’ and ‘red’ readers. We define and
operationalize consensus for news posts on Twitter in the context of US
politics. We identify several high consensus posts and discuss their
empirical properties. We present a highly scalable method for automatically
identifying high and low consensus news posts on Twitter, by utilizing a
novel category of audience leaning based features, which we show are well
suited to this task. Finally, we build and publicly deploy our ‘Purple
Feed’ system (twitter-app.mpi-sws.org/purple-feed), which highlights high
consensus posts from publishers on both sides of the political spectrum.

N. Grgić-Hlača, M. B. Zafar, K. P. Gummadi, and A. Weller.
**Beyond
distributive fairness in algorithmic decision making: Feature selection for
procedurally fair learning**.
In *32nd AAAI Conference on Artificial Intelligence*, New Orleans,
February 2018.

** Abstract:** With widespread usage of
machine learning methods in numerous domains involving human subjects,
several studies have raised questions about the potential for unfairness
towards certain individuals or groups. A number of recent works have proposed
methods to measure and eliminate unfairness from machine learning methods.
However, most of this work on fair learning has focused on only one dimension
of fair decision making: distributive fairness, i.e., the fairness of the
decision outcomes. In this work, we leverage the rich literature on
organizational justice and focus on another dimension of fair decision
making: procedural fairness, i.e., the fairness of the decision making
process. We propose measures for procedural fairness that consider the input
features used in the decision process, and evaluate the moral judgments of
humans regarding the use of these features. We operationalize these measures
on two real world datasets using human surveys on the Amazon Mechanical Turk
(AMT) platform, demonstrating that we capture important properties of
procedurally fair decision making. We provide fast submodular mechanisms to
optimize the tradeoff between procedural fairness and prediction accuracy. On
our datasets, we observe empirically that procedural fairness may be achieved
with little cost to outcome fairness, but that some loss of accuracy is
unavoidable.

Till Speicher, Hoda Heidari, Nina Grgić-Hlača, Krishna P. Gummadi, Adish
Singla, Adrian Weller, and Muhammad Bilal Zafar.
**A
unified approach to quantifying algorithmic unfairness: Measuring individual
and group unfairness via inequality indices**.
In *KDD*, 2018.

** Abstract:** Discrimination via
algorithmic decision making has received considerable attention. Prior work
largely focuses on defining conditions for fairness, but does not define
satisfactory measures of algorithmic unfairness. In this paper, we focus on
the following question: Given two unfair algorithms, how should we determine
which of the two is more unfair? Our core idea is to use existing inequality
indices from economics to measure how unequally the outcomes of an algorithm
benefit different individuals or groups in a population. Our work offers a
justified and general framework to compare and contrast the (un)fairness of
algorithmic predictors. This unifying approach enables us to quantify
unfairness both at the individual and the group level. Further, our work
reveals overlooked tradeoffs between different fairness notions: using our
proposed measures, the overall individual-level unfairness of an algorithm
can be decomposed into a between-group and a within-group component. Earlier
methods are typically designed to tackle only between-group unfairness, which
may be justified for legal or other reasons. However, we demonstrate that
minimizing exclusively the between-group component may, in fact, increase the
within-group, and hence the overall unfairness. We characterize and
illustrate the tradeoffs between our measures of (un)fairness and the
prediction accuracy.
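
The between-group and within-group decomposition mentioned in this abstract can be illustrated with a generalized entropy index. The following is a minimal sketch, not the authors' code: the function names are hypothetical, the choice `alpha=2` is an assumption, and the per-individual benefit `b = y_pred - y_true + 1` is an assumed definition that should be checked against the paper.

```python
import numpy as np

def generalized_entropy(b, alpha=2.0):
    """Generalized entropy index of a benefit vector b (alpha not in {0, 1})."""
    b = np.asarray(b, dtype=float)
    mu = b.mean()
    return np.mean((b / mu) ** alpha - 1.0) / (alpha * (alpha - 1.0))

def decompose(b, groups, alpha=2.0):
    """Split the index into (within_group, between_group) components."""
    b = np.asarray(b, dtype=float)
    groups = np.asarray(groups)
    n, mu = b.size, b.mean()
    within = 0.0
    group_means = np.empty_like(b)
    for g in np.unique(groups):
        mask = groups == g
        mu_g = b[mask].mean()
        # Each group's own inequality, weighted by its size and relative mean.
        within += (mask.sum() / n) * (mu_g / mu) ** alpha * generalized_entropy(b[mask], alpha)
        # For the between-group term, everyone receives their group's mean benefit.
        group_means[mask] = mu_g
    between = generalized_entropy(group_means, alpha)
    return within, between

# Toy data (made up for illustration only).
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 0, 1, 1, 1])
benefit = y_pred - y_true + 1  # assumed benefit definition

total = generalized_entropy(benefit)
within, between = decompose(benefit, groups)
print(total, within, between)
```

In this sketch the two components sum to the overall index, which is why lowering only the between-group term does not by itself guarantee a lower total.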

Niki Kilbertus, Mateo Rojas-Carulla, Giambattista Parascandolo, Moritz Hardt,
Dominik Janzing, and Bernhard Schölkopf.
**Avoiding discrimination
through causal reasoning**.
In *Advances in Neural Information Processing Systems 30*, Long Beach,
California, December 2017.

** Abstract:** Recent work on
fairness in machine learning has focused on various statistical
discrimination criteria and how they trade off. Most of these criteria are
observational: They depend only on the joint distribution of predictor,
protected attribute, features, and outcome. While convenient to work with,
observational criteria have severe inherent limitations that prevent them
from resolving matters of fairness conclusively. Going beyond observational
criteria, we frame the problem of discrimination based on protected
attributes in the language of causal reasoning. This viewpoint shifts
attention from ``What is the right fairness criterion?'' to ``What do we want
to assume about our model of the causal data generating process?'' Through
the lens of causality, we make several contributions. First, we crisply
articulate why and when observational criteria fail, thus formalizing what
was before a matter of opinion. Second, our approach exposes previously
ignored subtleties and why they are fundamental to the problem. Finally, we
put forward natural causal non-discrimination criteria and develop algorithms
that satisfy them.

M. B. Zafar, Isabel Valera, Manuel Rodriguez, Krishna P. Gummadi, and Adrian
Weller.
**From
parity to preference: Learning with cost-effective notions of
fairness**.
In *Advances in Neural Information Processing Systems 30*, Long Beach,
California, December 2017.

** Abstract:** The adoption of
automated, data-driven decision making in an ever expanding range of
applications has raised concerns about its potential unfairness towards
certain social groups. In this context, a number of recent studies have
focused on defining, detecting, and removing unfairness from data-driven
decision systems. However, the existing notions of fairness, based on parity
(equality) in treatment or outcomes for different social groups, tend to be
needlessly stringent, limiting the overall decision making accuracy. In this
paper, we draw inspiration from the fair-division and envy-freeness
literature in economics and game theory and propose preference-based notions
of fairness: given the choice between various sets of decision treatments
or outcomes, any group of users would collectively prefer its treatment or
outcomes, regardless of the (dis)parity as compared to the other groups.
Then, we introduce tractable proxies to design convex margin-based
classifiers that satisfy these preference-based notions of fairness. Finally,
we experiment with a variety of synthetic and real-world datasets and show
that preference-based fairness allows for greater decision accuracy than
parity-based fairness.

Neil Houlsby and Massimiliano Ciaramita.
**A
scalable Gibbs sampler for probabilistic entity linking**.
In *36th European Conference on Information Retrieval*, pages 335-346.
Springer, 2014.

** Abstract:** Entity linking involves
labeling phrases in text with their referent entities, such as Wikipedia or
Freebase entries. This task is challenging due to the large number of
possible entities, in the millions, and heavy-tailed mention ambiguity. We
formulate the problem in terms of probabilistic inference within a topic
model, where each topic is associated with a Wikipedia article. To deal with
the large number of topics we propose a novel efficient Gibbs sampling scheme
which can also incorporate side information, such as the Wikipedia graph.
This conceptually simple probabilistic approach achieves state-of-the-art
performance in entity-linking on the Aida-CoNLL dataset.

Simon Lacoste-Julien, Konstantina Palla, Alex Davies, Gjergji Kasneci, Thore
Graepel, and Zoubin Ghahramani.
**Sigma: simple
greedy matching for aligning large knowledge bases**.
In *KDD*, pages 572-580. Association for Computing Machinery, 2013.

** Abstract:** The Internet has enabled the creation of a
growing number of large-scale knowledge bases in a variety of domains
containing complementary information. Tools for automatically aligning these
knowledge bases would make it possible to unify many sources of structured
knowledge and answer complex queries. However, the efficient alignment of
large-scale knowledge bases still poses a considerable challenge. Here, we
present Simple Greedy Matching (SiGMa), a simple algorithm for aligning
knowledge bases with millions of entities and facts. SiGMa is an iterative
propagation algorithm which leverages both the structural information from
the relationship graph as well as flexible similarity measures between entity
properties in a greedy local search, thus making it scalable. Despite its
greedy nature, our experiments indicate that SiGMa can efficiently match some
of the world's largest knowledge bases with high precision. We provide
additional experiments on benchmark datasets which demonstrate that SiGMa can
outperform state-of-the-art approaches both in accuracy and efficiency.

R. Silva, K. A. Heller, Z. Ghahramani, and E. M. Airoldi.
**Ranking
relations using analogies in biological and information networks**.
*Annals of Applied Statistics*, 4(2):615-644, 2010.

** Abstract:** Analogical reasoning depends fundamentally on the ability to
learn and generalize about relations between objects. We develop an approach
to relational learning which, given a set of pairs of objects S = A(1):B(1),
A(2):B(2), ..., A(N):B(N), measures how well other pairs A:B fit in with the
set S. Our work addresses the question: is the relation between objects A and
B analogous to those relations found in S? Such questions are particularly
relevant in information retrieval, where an investigator might want to search
for analogous pairs of objects that match the query set of interest. There
are many ways in which objects can be related, making the task of measuring
analogies very challenging. Our approach combines a similarity measure on
function spaces with Bayesian analysis to produce a ranking. It requires data
containing features of the objects of interest and a link matrix specifying
which relationships exist; no further attributes of such relationships are
necessary. We illustrate the potential of our method on text analysis and
information networks. An application on discovering functional interactions
between pairs of proteins is discussed in detail, where we show that our
approach can work in practice even if a small set of protein pairs is
provided.

Ricardo Silva, Katherine A. Heller, and Zoubin Ghahramani.
**Analogical
reasoning with relational Bayesian sets**.
In Marina Meila and Xiaotong Shen, editors, *AISTATS*, volume 2 of
*JMLR Proceedings*, pages 500-507. JMLR.org, 2007.

** Abstract:** Analogical reasoning depends fundamentally on the ability to
learn and generalize about relations between objects. There are many ways in
which objects can be related, making automated analogical reasoning very
challenging. Here we develop an approach which, given a set of pairs of
related objects S = A1:B1, A2:B2, ..., AN:BN, measures how well other pairs
A:B fit in with the set S. This addresses the question: is the relation
between objects A and B analogous to those relations found in S? We recast
this classical problem as a problem of Bayesian analysis of relational
data. This problem is non-trivial because direct similarity between
objects is not a good way of measuring analogies. For instance, the analogy
between an electron around the nucleus of an atom and a planet around the Sun
is hardly justified by isolated, non-relational, comparisons of an electron
to a planet, and a nucleus to the Sun. We develop a generative model for
predicting the existence of relationships and extend the framework of
Ghahramani and Heller (2005) to provide a Bayesian measure for how analogous
a relation is to other relations. This sheds new light on an old problem,
which we motivate and illustrate through practical applications in
exploratory data analysis.

Zoubin Ghahramani and Katherine A. Heller.
**Bayesian
sets**.
In Y. Weiss, B. Schölkopf, and J. Platt, editors, *Advances in Neural
Information Processing Systems 18*, pages 435-442, Cambridge, MA, USA,
December 2006. The MIT Press.

** Abstract:** Inspired by
"Google™ Sets", we consider the problem of retrieving items from a
concept or cluster, given a query consisting of a few items from that
cluster. We formulate this as a Bayesian inference problem and describe a
very simple algorithm for solving it. Our algorithm uses a model-based
concept of a cluster and ranks items using a score which evaluates the
marginal probability that each item belongs to a cluster containing the query
items. For exponential family models with conjugate priors this marginal
probability is a simple function of sufficient statistics. We focus on sparse
binary data and show that our score can be evaluated exactly using a single
sparse matrix multiplication, making it possible to apply our algorithm to
very large datasets. We evaluate our algorithm on three datasets: retrieving
movies from EachMovie, finding completions of author sets from the NIPS
dataset, and finding completions of sets of words appearing in the Grolier
encyclopedia. We compare to Google™ Sets and show that Bayesian Sets
gives very reasonable set completions.
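
The claim that the score reduces to a single matrix multiplication for sparse binary data can be sketched as follows. This is an illustrative reconstruction under a Beta-Bernoulli model, not the authors' implementation: the function name, the dense NumPy arrays, and the prior choice (pseudo-counts proportional to empirical feature means, scaled by a hypothetical constant `c`) are assumptions.

```python
import numpy as np

def bayesian_sets_log_scores(X, query_idx, c=2.0):
    """Log-scores log p(x | query) - log p(x) for every row x of binary X."""
    X = np.asarray(X, dtype=float)
    n_query = len(query_idx)

    # Assumed prior: Beta pseudo-counts proportional to feature means,
    # clipped so that every pseudo-count stays strictly positive.
    mean = np.clip(X.mean(axis=0), 1e-6, 1.0 - 1e-6)
    alpha = c * mean
    beta = c * (1.0 - mean)

    # Conjugate update using only the sufficient statistics of the query set.
    s = X[query_idx].sum(axis=0)
    alpha_post = alpha + s
    beta_post = beta + n_query - s

    # The log-score is affine in x: const + x . q
    const = np.sum(np.log(alpha + beta) - np.log(alpha + beta + n_query)
                   + np.log(beta_post) - np.log(beta))
    q = np.log(alpha_post) - np.log(alpha) - np.log(beta_post) + np.log(beta)
    return const + X @ q  # one matrix-vector product scores all items

# Toy data (made up): rows are items, columns are binary features.
X = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 1],
              [0, 0, 1, 0]])
scores = bayesian_sets_log_scores(X, query_idx=[0, 1])
print(np.argsort(-scores))  # items ranked from best to worst match
```

With a sparse matrix type in place of the dense array, the final product would touch only the non-zero entries, which is the property the abstract relies on for large datasets.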

Katherine A. Heller and Zoubin Ghahramani.
**A simple Bayesian
framework for content-based image retrieval**.
In *CVPR*, pages 2110-2117. IEEE Computer Society, 2006.

** Abstract:** We present a Bayesian framework for content-based
image retrieval which models the distribution of color and texture features
within sets of related images. Given a user-specified text query (e.g.
"penguins") the system first extracts a set of images, from a labelled
corpus, corresponding to that query. The distribution over features of these
images is used to compute a Bayesian score for each image in a large
unlabelled corpus. Unlabelled images are then ranked using this score and the
top images are returned. Although the Bayesian score is based on computing
marginal likelihoods, which integrate over model parameters, in the case of
sparse binary data the score reduces to a single matrix-vector multiplication
and is therefore extremely efficient to compute. We show that our method
works surprisingly well despite its simplicity and the fact that no relevance
feedback is used. We compare different choices of features, and evaluate our
results using human subjects.

## Reinforcement Learning and Control

We are interested in understanding the human sensory motor system from a mathematical, computational and engineering point of view. To do this, we need to use concepts from control theory, optimization, machine learning and statistics, as well as experimental methods based on human psychophysics and virtual reality. These formal tools are also useful for advancing robotics and decision theory.

Arunkumar Byravan, Leonard Hasenclever, Piotr Trochim, Mehdi Mirza,
Alessandro Davide Ialongo, Yuval Tassa, Jost Tobias Springenberg, Abbas
Abdolmaleki, Nicolas Heess, Josh Merel, and Martin Riedmiller.
**Evaluating model-based
planning and planner amortization for continuous control**.
In *10th International Conference on Learning Representations*, 2022.

** Abstract:** There is a widespread intuition that model-based
control methods should be able to surpass the data efficiency of model-free
approaches. In this paper we attempt to evaluate this intuition on various
challenging locomotion tasks. We take a hybrid approach, combining model
predictive control (MPC) with a learned model and model-free policy learning;
the learned policy serves as a proposal for MPC. We show that MPC with
learned proposals and models (trained on the fly or transferred from related
tasks) can significantly improve performance and data efficiency with respect
to model-free methods. However, we find that well-tuned model-free agents are
strong baselines even for high DoF control problems. Finally, we show that it
is possible to distil a model-based planner into a policy that amortizes the
planning computation without any loss of performance.

Joseph Marino, Alexandre Piche, Alessandro Davide Ialongo, and Yisong Yue.
**Iterative
amortized policy optimization**.
In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan,
editors, *Advances in Neural Information Processing Systems 34*,
volume 34, pages 15667-15681. Curran Associates, Inc., 2021.

** Abstract:** Policy networks are a central feature of deep
reinforcement learning (RL) algorithms for continuous control, enabling the
estimation and sampling of high-value actions. From the variational inference
perspective on RL, policy networks, when used with entropy or KL
regularization, are a form of amortized optimization, optimizing network
parameters rather than the policy distributions directly. However, direct
amortized mappings can yield suboptimal policy estimates and restricted
distributions, limiting performance and exploration. Given this perspective,
we consider the more flexible class of iterative amortized optimizers. We
demonstrate that the resulting technique, iterative amortized policy
optimization, yields performance improvements over direct amortization on
benchmark continuous control tasks.

Krzysztof Choromanski, David Cheikhi, Jared Davis, Valerii Likhosherstov,
Achille Nazaret, Achraf Bahamou, Xingyou Song, Mrugank Akarte, Jack
Parker-Holder, Jacob Bergquist, Yuan Gao, Aldo Pacchiano, Tamas Sarlos,
Adrian Weller, and Vikas Sindhwani.
**Stochastic flows and geometric
optimization on the orthogonal group**.
In *37th International Conference on Machine Learning*, 2020.

** Abstract:** We present a new class of stochastic,
geometrically-driven optimization algorithms on the orthogonal group O(d) and
naturally reductive homogeneous manifolds obtained from the action of the
rotation group SO(d). We theoretically and experimentally demonstrate that
our methods can be applied in various fields of machine learning including
deep, convolutional and recurrent neural networks, reinforcement learning,
normalizing flows and metric learning. We show an intriguing connection
between efficient stochastic optimization on the orthogonal group and graph
theory (e.g. matching problem, partition functions over graphs,
graph-coloring). We leverage the theory of Lie groups and provide theoretical
results for the designed class of algorithms. We demonstrate broad
applicability of our methods by showing strong performance on the seemingly
unrelated tasks of learning world models to obtain stable policies for the
most difficult Humanoid agent from OpenAI Gym and improving convolutional
neural networks.

Tameem Adel and Adrian Weller.
**TibGM: A
transferable and information-based graphical model approach for reinforcement
learning**.
In *36th International Conference on Machine Learning*, Long Beach, June
2019.

** Abstract:** One of the challenges to reinforcement
learning (RL) is scalable transferability among complex tasks. Incorporating
a graphical model (GM), along with the rich family of related methods, as a
basis for RL frameworks provides potential to address issues such as
transferability, generalisation and exploration. Here we propose a flexible
GM-based RL framework which leverages efficient inference procedures to
enhance generalisation and transfer power. In our proposed transferable and
information-based graphical model framework ‘TibGM’, we show the
equivalence between our mutual information-based objective in the GM, and an
RL consolidated objective consisting of a standard reward maximisation target
and a generalisation/transfer objective. In settings where there is a sparse
or deceptive reward signal, our TibGM framework is flexible enough to
incorporate exploration bonuses depicting intrinsic rewards. We empirically
verify improved performance and exploration power.

Robert Pinsler, Peter Karkus, Andras Kupcsik, David Hsu, and Wee Sun Lee.
**Factored
contextual policy search with Bayesian optimization**.
In *IEEE International Conference on Robotics and Automation*, Montreal,
Canada, May 2019.

** Abstract:** Scarce data is a major
challenge to scaling robot learning to truly complex tasks, as we need to
generalize locally learned policies over different task contexts. Contextual
policy search offers data-efficient learning and generalization by explicitly
conditioning the policy on a parametric context space. In this paper, we
further structure the contextual policy representation. We propose to factor
contexts into two components: target contexts that describe the task
objectives, e.g. target position for throwing a ball; and environment
contexts that characterize the environment, e.g. initial position or mass of
the ball. Our key observation is that experience can be directly generalized
over target contexts. We show that this can be easily exploited in contextual
policy search algorithms. In particular, we apply factorization to a Bayesian
optimization approach to contextual policy search both in sampling-based and
active learning settings. Our simulation results show faster learning and
better generalization in various robotic domains. See our supplementary
video: https://youtu.be/MNTbBAOufDY.

David Janz, Jiri Hron, Przemyslaw Mazur, José Miguel Hernández-Lobato,
Katja Hofmann, and Sebastian Tschiatschek.
**Successor Uncertainties:
exploration and uncertainty in temporal difference learning**.
*NeurIPS*, 2019.

** Abstract:** Posterior sampling for
reinforcement learning (PSRL) is an effective method for balancing
exploration and exploitation in reinforcement learning. Randomised value
functions (RVF) can be viewed as a promising approach to scaling PSRL.
However, we show that most contemporary algorithms combining RVF with neural
network function approximation do not possess the properties which make PSRL
effective, and provably fail in sparse reward problems. Moreover, we find
that propagation of uncertainty, a property of PSRL previously thought
important for exploration, does not preclude this failure. We use these
insights to design Successor Uncertainties (SU), a cheap and easy to
implement RVF algorithm that retains key properties of PSRL. SU is highly
effective on hard tabular exploration benchmarks. Furthermore, on the Atari
2600 domain, it surpasses human performance on 38 of 49 games tested
(achieving a median human normalised score of 2.09), and outperforms its
closest RVF competitor, Bootstrapped DQN, on 36 of those.

Robert Pinsler, Riad Akrour, Takayuki Osa, Jan Peters, and Gerhard Neumann.
**Sample
and feedback efficient hierarchical reinforcement learning from human
preferences**.
In *IEEE International Conference on Robotics and Automation*, Brisbane,
Australia, May 2018.

** Abstract:** While reinforcement
learning has led to promising results in robotics, defining an informative
reward function is challenging. Prior work considered including the human in
the loop to jointly learn the reward function and the optimal policy.
Generating samples from a physical robot and requesting human feedback are
both taxing efforts for which efficiency is critical. We propose to learn
reward functions from both the robot and the human perspectives to improve on
both efficiency metrics. Learning a reward function from the human
perspective increases feedback efficiency by assuming that humans rank
trajectories according to a low-dimensional outcome space. Learning a reward
function from the robot perspective circumvents the need for a dynamics model
while retaining the sample efficiency of model-based approaches. We provide
an algorithm that incorporates bi-perspective reward learning into a general
hierarchical reinforcement learning framework and demonstrate the merits of
our approach on a toy task and a simulated robot grasping task.

Benjamin Eysenbach, Shixiang Gu, Julian Ibarz, and Sergey Levine.
**Leave no trace: Learning to reset
for safe and autonomous reinforcement learning**.
In *6th International Conference on Learning Representations*, Vancouver,
Canada, Apr 2018.

** Abstract:** Deep reinforcement learning
algorithms can learn complex behavioral skills, but real-world application of
these methods requires a large amount of experience to be collected by the
agent. In practical settings, such as robotics, this involves repeatedly
attempting a task, resetting the environment between each attempt. However,
not all tasks are easily or automatically reversible. In practice, this
learning process requires extensive human intervention. In this work, we
propose an autonomous method for safe and efficient reinforcement learning
that simultaneously learns a forward and reset policy, with the reset policy
resetting the environment for a subsequent attempt. By learning a value
function for the reset policy, we can automatically determine when the
forward policy is about to enter a non-reversible state, providing for
uncertainty-aware safety aborts. Our experiments illustrate that proper use
of the reset policy can greatly reduce the number of manual resets required
to learn a task, can reduce the number of unsafe actions that lead to
non-reversible states, and can automatically induce a curriculum.

** Comment:** [Video]

Vitchyr Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine.
**Temporal
difference models: Model-free deep RL for model-based control**.
In *6th International Conference on Learning Representations*, Vancouver,
Canada, Apr 2018.

** Abstract:** Model-free reinforcement
learning (RL) has been proven to be a powerful, general tool for learning
complex behaviors. However, its sample efficiency is often impractically
large for solving challenging real-world problems, even for off-policy
algorithms such as Q-learning. A limiting factor in classic model-free RL is
that the learning signal consists only of scalar rewards, ignoring much of
the rich information contained in state transition tuples. Model-based RL
uses this information, by training a predictive model, but often does not
achieve the same asymptotic performance as model-free RL due to model bias.
We introduce temporal difference models (TDMs), a family of goal-conditioned
value functions that can be trained with model-free learning and used for
model-based control. TDMs combine the benefits of model-free and model-based
RL: they leverage the rich information in state transitions to learn very
efficiently, while still attaining asymptotic performance that exceeds that
of direct model-based RL methods. Our experimental results show that, on a
range of continuous control tasks, TDMs provide a substantial improvement in
efficiency compared to state-of-the-art model-based and model-free
methods.

Mark Rowland, Marc G. Bellemare, Will Dabney, Rémi Munos, and Yee Whye Teh.
**An analysis of categorical
distributional reinforcement learning**.
In *21st International Conference on Artificial Intelligence and
Statistics*, Playa Blanca, Lanzarote, Canary Islands, April 2018.

** Abstract:** Distributional approaches to value-based
reinforcement learning model the entire distribution of returns, rather than
just their expected values, and have recently been shown to yield
state-of-the-art empirical performance. This was demonstrated by the recently
proposed C51 algorithm, based on categorical distributional reinforcement
learning (CDRL) [Bellemare et al., 2017]. However, the theoretical properties
of CDRL algorithms are not yet well understood. In this paper, we introduce a
framework to analyse CDRL algorithms, establish the importance of the
projected distributional Bellman operator in distributional RL, draw
fundamental connections between CDRL and the Cramér distance, and give a
proof of convergence for sample-based categorical distributional
reinforcement learning algorithms.

Will Dabney, Mark Rowland, Marc G. Bellemare, and Rémi Munos.
**Distributional reinforcement
learning with quantile regression**.
In *32nd AAAI Conference on Artificial Intelligence*, New Orleans,
February 2018.

** Abstract:** In reinforcement learning an
agent interacts with the environment by taking actions and observing the next
state and reward. When sampled probabilistically, these state transitions,
rewards, and actions can all induce randomness in the observed long-term
return. Traditionally, reinforcement learning algorithms average over this
randomness to estimate the value function. In this paper, we build on recent
work advocating a distributional approach to reinforcement learning in which
the distribution over returns is modeled explicitly instead of only
estimating the mean. That is, we examine methods of learning the value
distribution instead of the value function. We give results that close a
number of gaps between the theoretical and algorithmic results given by
Bellemare, Dabney, and Munos (2017). First, we extend existing results to the
approximate distribution setting. Second, we present a novel distributional
reinforcement learning algorithm consistent with our theoretical formulation.
Finally, we evaluate this new algorithm on the Atari 2600 games, observing
that it significantly outperforms many of the recent improvements on DQN,
including the related distributional algorithm C51.

Jan-Peter Calliess, Stephen Roberts, Carl Edward Rasmussen, and Jan
Maciejowski.
**Nonlinear set
membership regression with adaptive hyper-parameter estimation for online
learning and control**.
In *Proceedings of the European Control Conference*, 2018.

** Abstract:** Methods known as Lipschitz Interpolation or
Nonlinear Set Membership regression have become established tools for
nonparametric system-identification and data-based control. They utilise
presupposed Lipschitz properties to compute inferences over unobserved
function values. Unfortunately, they rely on the a priori knowledge of a
Lipschitz constant of the underlying target function which serves as a
hyperparameter. We propose a closed-form estimator of the Lipschitz constant
that is robust to bounded observational noise in the data. The merger of
Lipschitz Interpolation with the new hyperparameter estimator gives a new
nonparametric machine learning method for which we derive online learning
convergence guarantees. Furthermore, we apply our learning method to
model-reference adaptive control and provide a convergence guarantee on the
closed-loop dynamics. In a simulated flight manoeuvre control scenario, we
compare the performance of our approach to recently proposed alternative
learning-based controllers.

Paavo Parmas, Carl Edward Rasmussen, Jan Peters, and Kenji Doya.
**PIPPS:
Flexible model-based policy search robust to the curse of chaos**.
In *35th International Conference on Machine Learning*, 2018.

** Abstract:** Previously, the exploding gradient problem has
been explained to be central in deep learning and model-based reinforcement
learning, because it causes numerical issues and instability in optimization.
Our experiments in model-based reinforcement learning imply that the problem
is not just a numerical issue, but it may be caused by a fundamental
chaos-like nature of long chains of nonlinear computations. Not only do the
magnitudes of the gradients become large, the direction of the gradients
becomes essentially random. We show that reparameterization gradients suffer
from the problem, while likelihood ratio gradients are robust. Using our
insights, we develop a model-based policy search framework, Probabilistic
Inference for Particle-Based Policy Search (PIPPS), which is easily
extensible, and allows for almost arbitrary models and policies, while
simultaneously matching the performance of previous data-efficient learning
algorithms. Finally, we invent the total propagation algorithm, which
efficiently computes a union over all pathwise derivative depths during a
single backwards pass, automatically giving greater weight to estimators with
lower variance, sometimes improving over reparameterization gradients by
10^{6} times.

Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Bernhard
Schölkopf, and Sergey Levine.
**Interpolated policy gradient:
Merging on-policy and off-policy policy gradient estimation for deep
reinforcement learning**.
In *Advances in Neural Information Processing Systems 31*, Long Beach,
USA, Dec 2017.

** Abstract:** Off-policy model-free deep
reinforcement learning methods using previously collected data can improve
sample efficiency over on-policy policy gradient techniques. On the other
hand, on-policy algorithms are often more stable and easier to use. This
paper examines, both theoretically and empirically, approaches to merging on-
and off-policy updates for deep reinforcement learning. Theoretical results
show that off-policy updates with a value function estimator can be
interpolated with on-policy policy gradient updates whilst still satisfying
performance bounds. Our analysis uses control variate methods to produce a
family of policy gradient algorithms, with several recently proposed
algorithms being special cases of this family. We then provide an empirical
comparison of these techniques with the remaining algorithmic details fixed,
and show how different mixing of off-policy gradient estimates with on-policy
samples contribute to improvements in empirical performance. The final
algorithm provides a generalization and unification of existing deep policy
gradient techniques, has theoretical guarantees on the bias introduced by
off-policy updates, and improves on the state-of-the-art model-free deep RL
methods on a number of OpenAI Gym continuous control benchmarks.

Rowan McAllister and Carl Edward Rasmussen.
**Data-efficient
reinforcement learning in continuous state-action
Gaussian-POMDPs**.
In *Advances in Neural Information Processing Systems 31*, Long Beach,
California, December 2017.

** Abstract:** We present a
data-efficient reinforcement learning method for continuous state-action
systems under significant observation noise. Data-efficient solutions under
small noise exist, such as PILCO which learns the cartpole swing-up task in
30s. PILCO evaluates policies by planning state-trajectories using a dynamics
model. However, PILCO applies policies to the observed state, therefore
planning in observation space. We extend PILCO with filtering to instead plan
in belief space, consistent with partially observable Markov decision
process (POMDP) planning. This enables data-efficient learning under
significant observation noise, outperforming more naive methods such as
post-hoc application of a filter to policies optimised by the original
(unfiltered) PILCO algorithm. We test our method on the cartpole swing-up
task, which involves nonlinear dynamics and requires nonlinear control.

Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, José Miguel Hernández-Lobato,
Richard E. Turner, and Douglas Eck.
**Sequence tutor: Conservative
fine-tuning of sequence generation models with KL-control**.
In *34th International Conference on Machine Learning*, Sydney,
Australia, Aug 2017.

** Abstract:** This paper proposes a
general method for improving the structure and quality of sequences generated
by a recurrent neural network (RNN), while maintaining information originally
learned from data, as well as sample diversity. An RNN is first pre-trained
on data using maximum likelihood estimation (MLE), and the probability
distribution over the next token in the sequence learned by this model is
treated as a prior policy. Another RNN is then trained using reinforcement
learning (RL) to generate higher-quality outputs that account for
domain-specific incentives while retaining proximity to the prior policy of
the MLE RNN. To formalize this objective, we derive novel off-policy RL
methods for RNNs from KL-control. The effectiveness of the approach is
demonstrated on two applications: 1) generating novel musical melodies, and
2) computational molecular generation. For both problems, we show that the
proposed method improves the desired properties and structure of the
generated sequences, while maintaining information learned from data.

** Comment:** [MIT Technology Review] [Video]

Daniel Limon, Jan-Peter Calliess, and Jan Maciejowski.
**Learning-based nonlinear model predictive control**.
In *IFAC 2017 World Congress*, Toulouse, France, July 2017. doi
10.1016/j.ifacol.2017.08.1050.

** Abstract:** This paper
presents stabilizing Model Predictive Controllers (MPC) in which prediction
models are inferred from experimental data of the inputs and outputs of the
plant. Using a nonparametric machine learning technique called LACKI, the
estimated (possibly nonlinear) model function is provided together with an
estimate of the Hölder constant. Based on these, a number of predictive
controllers with stability guaranteed by design are proposed. Firstly, the
case when the prediction model is estimated off-line is considered, and
robust stability and recursive feasibility are ensured by using tightened
constraints in the optimisation problem. This controller has been extended to
the more interesting and complex case: the online learning of the model,
where the new data collected from feedback is added to enhance the prediction
model. An on-line learning MPC based on a double sequence of predictions is
proposed. Stability of the online learning MPC is proved. These controllers
are illustrated by simulation.

Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine.
**Deep reinforcement learning for
robotic manipulation with asynchronous off-policy updates**.
In *IEEE International Conference on Robotics and Automation*,
Singapore, May 2017.

** Abstract:** Reinforcement learning
holds the promise of enabling autonomous robots to learn large repertoires of
behavioral skills with minimal human intervention. However, robotic
applications of reinforcement learning often compromise the autonomy of the
learning process in favor of achieving training times that are practical for
real physical systems. This typically involves introducing hand-engineered
policy representations and human-supplied demonstrations. Deep reinforcement
learning alleviates this limitation by training general-purpose neural
network policies, but applications of direct deep reinforcement learning
algorithms have so far been restricted to simulated settings and relatively
simple tasks, due to their apparent high sample complexity. In this paper, we
demonstrate that a recent deep reinforcement learning algorithm based on
off-policy training of deep Q-functions can scale to complex 3D manipulation
tasks and can learn deep neural network policies efficiently enough to train
on real physical robots. We demonstrate that the training times can be
further reduced by parallelizing the algorithm across multiple robots which
pool their policy updates asynchronously. Our experimental evaluation shows
that our method can learn a variety of 3D manipulation skills in simulation
and a complex door opening skill on real robots without any prior
demonstrations or manually designed representations.

** Comment:** [Google Blogpost] [MIT Technology Review] [Video]

Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, and
Sergey Levine.
**Q-Prop: Sample-efficient policy
gradient with an off-policy critic**.
In *5th International Conference on Learning Representations*, Toulon,
France, April 2017.

** Abstract:** Model-free deep
reinforcement learning (RL) methods have been successful in a wide variety of
simulated domains. However, a major obstacle facing deep RL in the real world
is their high sample complexity. Batch policy gradient methods offer stable
learning, but at the cost of high variance, which often requires large
batches. TD-style methods, such as off-policy actor-critic and Q-learning,
are more sample-efficient but biased, and often require costly hyperparameter
sweeps to stabilize. In this work, we aim to develop methods that combine the
stability of policy gradients with the efficiency of off-policy RL. We
present Q-Prop, a policy gradient method that uses a Taylor expansion of the
off-policy critic as a control variate. Q-Prop is both sample efficient and
stable, and effectively combines the benefits of on-policy and off-policy
methods. We analyze the connection between Q-Prop and existing model-free
algorithms, and use control variate theory to derive two variants of Q-Prop
with conservative and aggressive adaptation. We show that conservative Q-Prop
provides substantial gains in sample efficiency over trust region policy
optimization (TRPO) with generalized advantage estimation (GAE), and improves
stability over deep deterministic policy gradient (DDPG), the
state-of-the-art on-policy and off-policy methods, on OpenAI Gym's MuJoCo
continuous control environments.

Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine.
**Continuous deep Q-learning with
model-based acceleration**.
In *33rd International Conference on Machine Learning*, New York, USA,
June 2016.

** Abstract:** Model-free reinforcement learning
has been successfully applied to a range of challenging problems, and has
recently been extended to handle large neural network policies and value
functions. However, the sample complexity of model-free algorithms,
particularly when using high-dimensional function approximators, tends to
limit their applicability to physical systems. In this paper, we explore
algorithms and representations to reduce the sample complexity of deep
reinforcement learning for continuous control tasks. We propose two
complementary techniques for improving the efficiency of such algorithms.
First, we derive a continuous variant of the Q-learning algorithm, which we
call normalized advantage functions (NAF), as an alternative to the more
commonly used policy gradient and actor-critic methods. NAF representation
allows us to apply Q-learning with experience replay to continuous tasks, and
substantially improves performance on a set of simulated robotic control
tasks. To further improve the efficiency of our approach, we explore the use
of learned models for accelerating model-free reinforcement learning. We show
that iteratively refitted local linear models are especially effective for
this, and demonstrate substantially faster learning on domains where such
models are applicable.

Jan-Peter Calliess.
**Lazily adapted constant kinky
inference for nonparametric regression and model-reference adaptive
control**.
*arXiv*, arXiv:1701.00178, 2016.

** Abstract:**
Techniques known as Nonlinear Set Membership prediction, Lipschitz
Interpolation or Kinky Inference are approaches to machine learning that
utilise presupposed Lipschitz properties to compute inferences over
unobserved function values. Provided a bound on the true best Lipschitz
constant of the target function is known a priori, they offer convergence
guarantees as well as bounds around the predictions. Considering a more
general setting that builds on Hölder continuity relative to
pseudo-metrics, we propose a method for estimating the Hölder
constant online from function value observations that possibly are corrupted
by bounded observational errors. Utilising this to compute adaptive
parameters within a kinky inference rule gives rise to a nonparametric
machine learning method, for which we establish strong universal
approximation guarantees. That is, we show that our prediction rule can learn
any continuous function in the limit of increasingly dense data to within a
worst-case error bound that depends on the level of observational
uncertainty. We apply our method in the context of nonparametric
model-reference adaptive control (MRAC). Across a range of simulated aircraft
roll-dynamics and performance metrics our approach outperforms recently
proposed alternatives that were based on Gaussian processes and RBF-neural
networks. For discrete-time systems, we provide stability guarantees for our
learning-based controllers both for the batch and the online learning
setting.

Rowan McAllister.
**Bayesian learning for
data-efficient control**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2016.

** Abstract:** Applications to learn control of
unfamiliar dynamical systems with increasing autonomy are ubiquitous. From
robotics, to finance, to industrial processing, autonomous learning helps
obviate a heavy reliance on experts for system identification and controller
design. Often real world systems are nonlinear, stochastic, and expensive to
operate (e.g. slow, energy intensive, prone to wear and tear). Ideally
therefore, nonlinear systems can be identified with minimal system
interaction. This thesis considers data efficient autonomous learning of
control of nonlinear, stochastic systems. Data efficient learning critically
requires probabilistic modelling of dynamics. Traditional control approaches
use deterministic models, which easily overfit data, especially small
datasets. We use probabilistic Bayesian modelling to learn systems from
scratch, similar to the PILCO algorithm, which achieved unprecedented data
efficiency in learning control of several benchmarks. We extend PILCO in
three principal ways. First, we learn control under significant observation
noise by simulating a filtered control process using a tractably analytic
framework of Gaussian distributions. In addition, we develop the 'latent
variable belief Markov decision process' when filters must predict under
real-time constraints. Second, we improve PILCO's data efficiency by
directing exploration with predictive loss uncertainty and Bayesian
optimisation, including a novel approximation to the Gittins index. Third, we
take a step towards data efficient learning of high-dimensional control using
Bayesian neural networks (BNN). Experimentally, we show that although filtering
mitigates adverse effects of observation noise, much greater performance is
achieved when optimising controllers with evaluations faithful to reality: by
simulating closed-loop filtered control if executing closed-loop filtered
control. Thus, controllers are optimised w.r.t. how they are used,
outperforming filters applied to systems optimised by unfiltered simulations.
We show directed exploration improves data efficiency. Lastly, we show BNN
dynamics models are almost as data efficient as Gaussian process models.
Results show data efficient learning of high-dimensional control is possible
as BNNs scale to high-dimensional state inputs.

Marc Peter Deisenroth, Dieter Fox, and Carl Edward Rasmussen.
**Gaussian processes for data-efficient learning in robotics and control**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*,
37:408-423, 2015, doi
10.1109/TPAMI.2013.218.

** Abstract:** Autonomous learning
has been a promising direction in control and robotics for more than a decade
since data-driven learning allows to reduce the amount of engineering
knowledge, which is otherwise required. However, autonomous reinforcement
learning (RL) approaches typically require many interactions with the system
to learn controllers, which is a practical limitation in real systems, such
as robots, where many interactions can be impractical and time consuming. To
address this problem, current learning approaches typically require
task-specific knowledge in form of expert demonstrations, realistic
simulators, pre-shaped policies, or specific knowledge about the underlying
dynamics. In this article, we follow a different approach and speed up
learning by extracting more information from data. In particular, we learn a
probabilistic, non-parametric Gaussian process transition model of the
system. By explicitly incorporating model uncertainty into long-term planning
and controller learning our approach reduces the effects of model errors, a
key problem in model-based learning. Compared to state-of-the-art RL, our
model-based policy search method achieves an unprecedented speed of learning.
We demonstrate its applicability to autonomous learning in real robot and
control tasks.

B. Bischoff, D. Nguyen-Tuong, D. van Hoof, A. McHutchon, Carl Edward Rasmussen,
A. Knoll, and M. P. Deisenroth.
**Policy search
for learning robot control using sparse data**.
In *IEEE International Conference on Robotics and Automation*, pages
3882-3887, Hong Kong, China, 2014. IEEE, doi
10.1109/ICRA.2014.6907422.

** Abstract:** In many complex
robot applications, such as grasping and manipulation, it is difficult to
program desired task solutions beforehand, as robots are within an uncertain
and dynamic environment. In such cases, learning tasks from experience can be
a useful alternative. To obtain a sound learning and generalization
performance, machine learning, especially, reinforcement learning, usually
requires sufficient data. However, in cases where only little data is
available for learning, due to system constraints and practical issues,
reinforcement learning can act suboptimally. In this paper, we investigate
how model-based reinforcement learning, in particular the probabilistic
inference for learning control method (PILCO), can be tailored to cope with
the case of sparse data to speed up learning. The basic idea is to include
further prior knowledge into the learning process. As PILCO is built on the
probabilistic Gaussian processes framework, additional system knowledge can
be incorporated by defining appropriate prior distributions, e.g. a linear
mean Gaussian prior. The resulting PILCO formulation remains in closed form
and analytically tractable. The proposed approach is evaluated in simulation
as well as on a physical robot, the Festo Robotino XT. For the robot
evaluation, we employ the approach for learning an object pick-up task. The
results show that by including prior knowledge, policy learning can be sped
up in presence of sparse data.

Andrew McHutchon.
**Nonlinear modelling and
control using Gaussian processes**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2014.

** Abstract:** In many scientific disciplines it is
often required to make predictions about how a system will behave or to
deduce the correct control values to elicit a particular desired response.
Efficiently solving both of these tasks relies on the construction of a model
capturing the system's operation. In the most interesting situations, the
model needs to capture strongly nonlinear effects and deal with the presence
of uncertainty and noise. Building models for such systems purely based on a
theoretical understanding of underlying physical principles can be infeasibly
complex and require a large number of simplifying assumptions. An alternative
is to use a data-driven approach, which builds a model directly from
observations. A powerful and principled approach to doing this is to use a
Gaussian Process (GP).

In this thesis we start by discussing how GPs can
be applied to data sets which have noise affecting their inputs. We present
the "Noisy Input GP", which uses a simple local-linearisation to refer the
input noise into heteroscedastic output noise, and compare it to other
methods both theoretically and empirically. We show that this technique leads
to an effective model for nonlinear functions with input and output noise. We
then consider the broad topic of GP state space models for application to
dynamical systems. We discuss a very wide variety of approaches for using GPs
in state space models, including introducing a new method based on
moment-matching, which consistently gave the best performance. We analyse the
methods in some detail including providing a systematic comparison between
approximate-analytic and particle methods. To our knowledge such a comparison
has not been provided before in this area. Finally, we investigate an
automatic control learning framework, which uses Gaussian Processes to model
a system for which we wish to design a controller. Controller design for
complex systems is a difficult task and thus a framework which allows an
automatic design directly from data promises to be extremely useful. We
demonstrate that the previously published framework cannot cope with the
presence of observation noise but that the introduction of a state space
model dramatically improves its performance. This contribution, along with
some other suggested improvements, opens the door for this framework to be
used in real-world applications.

Joseph Hall, Carl Edward Rasmussen, and Jan Maciejowski.
**Modelling and
control of nonlinear systems using Gaussian processes with partial model
information**.
In *51st IEEE Conference on Decision and Control*, 2012.

** Abstract:** Gaussian processes are gaining increasing popularity among the
control community, in particular for the modelling of discrete time state
space systems. However, it has not been clear how to incorporate model
information, in the form of known state relationships, when using a Gaussian
process as a predictive model. An obvious example of known prior information
is position and velocity related states. Incorporation of such information
would be beneficial both computationally and for faster dynamics learning.
This paper introduces a method of achieving this, yielding faster dynamics
learning and a reduction in computational effort from O(Dn^{2}) to
O((D-F)n^{2}) in the prediction stage for a system with D states, F
known state relationships and n observations. The effectiveness of the method
is demonstrated through its inclusion in the PILCO learning algorithm with
application to the swing-up and balance of a torque-limited pendulum and the
balancing of a robotic unicycle in simulation.

Marc Peter Deisenroth, Carl Edward Rasmussen, and Dieter Fox.
**Learning to
control a low-cost manipulator using data-efficient reinforcement
learning**.
In *9th International Conference on Robotics: Science & Systems*, Los
Angeles, CA, USA, June 2011.

** Abstract:** Over the last
years, there has been substantial progress in robust manipulation in
unstructured environments. The long-term goal of our work is to get away from
precise, but very expensive robotic systems and to develop affordable,
potentially imprecise, self-adaptive manipulator systems that can
interactively perform tasks such as playing with children. In this paper, we
demonstrate how a low-cost off-the-shelf robotic system can learn closed-loop
policies for a stacking task in only a handful of trials - from scratch. Our
manipulator is inaccurate and provides no pose feedback. For learning a
controller in the work space of a Kinect-style depth camera, we use a
model-based reinforcement learning technique. Our learning method is data
efficient, reduces model bias, and deals with several noise sources in a
principled way during long-term planning. We present a way of incorporating
state-space constraints into the learning process and analyze the learning
gain by exploiting the sequential structure of the stacking task.

** Comment:** project
site

Daniel A. Braun, Pedro A. Ortega, Evangelos Theodorou, and Stefan Schaal.
**Path integral
control and bounded rationality**.
In *2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement
Learning*, 2011.

** Abstract:** Path integral methods have
recently been shown to be applicable to a very general class of optimal
control problems. Here we examine the path integral formalism from a
decision-theoretic point of view, since an optimal controller can always be
regarded as an instance of a perfectly rational decision-maker that chooses
its actions so as to maximize its expected utility. The problem with perfect
rationality is, however, that finding optimal actions is often very difficult
due to prohibitive computational resource costs that are not taken into
account. In contrast, a bounded rational decision-maker has only limited
resources and therefore needs to strike some compromise between the desired
utility and the required resource costs. In particular, we suggest an
information-theoretic measure of resource costs that can be derived
axiomatically. As a consequence we obtain a variational principle for choice
probabilities that trades off maximizing a given utility criterion and
avoiding resource costs that arise due to deviating from initially given
default choice probabilities. The resulting bounded rational policies are in
general probabilistic. We show that the solutions found by the path integral
formalism are such bounded rational policies. Furthermore, we show that the
same formalism generalizes to discrete control problems, leading to linearly
solvable bounded rational control policies in the case of Markov systems.
Importantly, Bellman's optimality principle is not presupposed by this
variational principle, but it can be derived as a limit case. This suggests
that the information theoretic formalization of bounded rationality might
serve as a general principle in control design that unifies a number of
recently reported approximate optimal control methods both in the continuous
and discrete domain.

Marc Peter Deisenroth and Carl Edward Rasmussen.
**PILCO: A
model-based and data-efficient approach to policy search**.
In *28th International Conference on Machine Learning*, 2011.

** Abstract:** In this paper, we introduce PILCO, a practical,
data-efficient model-based policy search method. PILCO reduces model bias,
one of the key problems of model-based reinforcement learning, in a
principled way. By learning a probabilistic dynamics model and explicitly
incorporating model uncertainty into long-term planning, PILCO can cope with
very little data and facilitates learning from scratch in only a few trials.
Policy evaluation is performed in closed form using state-of-the-art
approximate inference. Furthermore, policy gradients are computed
analytically for policy improvement. We report unprecedented learning
efficiency on challenging and high-dimensional control tasks.

** Comment:** web
site

Finale Doshi-Velez and Zoubin Ghahramani.
**A comparison of
human and agent reinforcement learning in partially observable
domains**.
In *33rd Annual Meeting of the Cognitive Science Society*, Boston, MA,
2011.

** Abstract:** It is commonly stated that reinforcement
learning (RL) algorithms learn slower than humans. In this work, we
investigate this claim using two standard problems from the RL literature. We
compare the performance of human subjects to RL techniques. We find that
context (the meaningfulness of the observations) plays a significant role
in the rate of human RL. Moreover, without contextual information, humans
often fare much worse than classic algorithms. Comparing the detailed
responses of humans and RL algorithms, we also find that humans appear to
employ rather different strategies from standard algorithms, even in cases
where they had indistinguishable performance to them. Our research both sheds
light on human RL and provides insights for improving RL algorithms.

Joseph Hall, Carl Edward Rasmussen, and Jan Maciejowski.
**Reinforcement
learning with reference tracking control in continuous state spaces**.
In *Proceedings of 50th IEEE Conference on Decision and Control and European
Control Conference*, 2011.

** Abstract:** The contribution
described in this paper is an algorithm for learning nonlinear, reference
tracking, control policies given no prior knowledge of the dynamical system
and limited interaction with the system through the learning process.
Concepts from the field of reinforcement learning, Bayesian statistics and
classical control have been brought together in the formulation of this
algorithm which can be viewed as a form of indirect self-tuning regulator. On
the task of reference tracking using the inverted pendulum it was shown to
yield generally improved performance on the best controller derived from the
standard linear quadratic method using only 30 s of total interaction with
the system. Finally, the algorithm was shown to work on the double pendulum
proving its ability to solve nontrivial control tasks.

Pedro A. Ortega and Daniel A. Braun.
**Information,
utility and bounded rationality**.
In *The fourth conference on artificial general intelligence*, volume
6830 of *Lecture Notes in Artificial Intelligence*, pages 269-274.
Springer-Verlag, 2011.

** Abstract:** Perfectly rational
decision-makers maximize expected utility, but crucially ignore the resource
costs incurred when determining optimal actions. Here we employ an axiomatic
framework for bounded rational decision-making based on a thermodynamic
interpretation of resource costs as information costs. This leads to a
variational free utility principle akin to thermodynamical free energy that
trades off utility and information costs. We show that bounded optimal
control solutions can be derived from this variational principle, which leads
in general to stochastic policies. Furthermore, we show that risk-sensitive
and robust (minimax) control schemes fall out naturally from this framework
if the environment is considered as a bounded rational and perfectly rational
opponent, respectively. When resource costs are ignored, the maximum expected
utility principle is recovered.

Pedro A. Ortega, Daniel A. Braun, and Simon Godsill.
**Reinforcement
learning and the Bayesian control rule**.
In *The fourth conference on artificial general intelligence*, volume
6830 of *Lecture Notes in Artificial Intelligence*, pages 281-285.
Springer-Verlag, 2011.

** Abstract:** We present an
actor-critic scheme for reinforcement learning in complex domains. The main
contribution is to show that planning and I/O dynamics can be separated such
that an intractable planning problem reduces to a simple multi-armed bandit
problem, where each lever stands for a potentially arbitrarily complex
policy. Furthermore, we use the Bayesian control rule to construct an
adaptive bandit player that is universal with respect to a given class of
optimal bandit players, thus indirectly constructing an adaptive agent that
is universal with respect to a given class of policies.

Pedro A. Ortega.
**A
Unified Framework for Resource-Bounded Agents Interacting with an Unknown
Environment**.
PhD thesis, Department of Engineering, University of Cambridge, 2011.

** Abstract:** The aim of this thesis is to present a
mathematical framework for conceptualizing and constructing adaptive
autonomous systems under resource constraints. The first part of this thesis
contains a concise presentation of the foundations of classical agency:
namely the formalizations of decision making and learning. Decision making
includes: (a) subjective expected utility (SEU) theory, the framework of
decision making under uncertainty; (b) the maximum SEU principle to choose
the optimal solution; and (c) its application to the design of autonomous
systems, culminating in the Bellman optimality equations. Learning includes:
(a) Bayesian probability theory, the theory for reasoning under uncertainty
that extends logic; and (b) Bayes-Optimal agents, the application of Bayesian
probability theory to the design of optimal adaptive agents. Then, two major
problems of the maximum SEU principle are highlighted: (a) the prohibitive
computational costs and (b) the need for the causal precedence of the choice
of the policy. The second part of this thesis tackles the two aforementioned
problems. First, an information-theoretic notion of resources in autonomous
systems is established. Second, a framework for resource-bounded agency is
introduced. This includes: (a) a maximum bounded SEU principle that is
derived from a set of axioms of utility; (b) an axiomatic model of
probabilistic causality, which is applied for the formalization of autonomous
systems having uncertainty over their policy and environment; and (c) the
Bayesian control rule, which is derived from the maximum bounded SEU
principle and the model of causality, implementing a stochastic adaptive
control law that deals with the case where autonomous agents are uncertain
about their policy and environment.

Daniel A. Braun and Pedro A. Ortega.
**A minimum relative
entropy principle for adaptive control in linear quadratic
regulators**.
In *Proceedings of the 7th international conference on informatics in
control, automation and robotics*, page (in press), 2010.

** Abstract:** The design of optimal adaptive controllers is usually based on
heuristics, because solving Bellman's equations over information states is
notoriously intractable. Approximate adaptive controllers often rely on the
principle of certainty-equivalence where the control process deals with
parameter point estimates as if they represented "true" parameter values.
Here we present a stochastic control rule instead where controls are sampled
from a posterior distribution over a set of probabilistic input-output models
and the true model is identified by Bayesian inference. This allows
reformulating the adaptive control problem as an inference and sampling
problem derived from a minimum relative entropy principle. Importantly,
inference and action sampling both work forward in time and hence such a
Bayesian adaptive controller is applicable on-line. We demonstrate the
improved performance that can be achieved by such an approach for linear
quadratic regulator examples.

Marc Peter Deisenroth.
**Efficient reinforcement
learning using Gaussian processes**.
PhD thesis, Karlsruhe Institute of Technology, Karlsruhe, Germany, 2010.

** Abstract:** In many research areas, including control and
medical applications, we face decision-making problems where data are limited
and/or the underlying generative process is complicated and partially
unknown. In these scenarios, we can profit from algorithms that learn from
data and aid decision making.

Reinforcement learning (RL) is a general
computational approach to experience-based goal-directed learning for
sequential decision making under uncertainty. However, RL often lacks
efficiency in terms of the number of required trials when no task-specific
knowledge is available. This lack of efficiency makes RL often inapplicable
to (optimal) control problems. Thus, a central issue in RL is to speed up
learning by extracting more information from available experience.

The
contributions of this dissertation are threefold:

1. We propose PILCO, a
fully Bayesian approach for efficient RL in continuous-valued state and
action spaces when no expert knowledge is available. PILCO is based on
well-established ideas from statistics and machine learning. PILCO's key
ingredient is a probabilistic dynamics model learned from data, which is
implemented by a Gaussian process (GP). The GP carefully quantifies knowledge
by a probability distribution over plausible dynamics models. By averaging
over all these models during long-term planning and decision making, PILCO
takes uncertainties into account in a principled way and, therefore, reduces
model bias, a central problem in model-based RL.

2. Due to its generality
and efficiency, PILCO can be considered a conceptual and practical approach
to jointly learning models and controllers when expert knowledge is difficult
to obtain or simply not available. For this scenario, we investigate PILCO's
properties and its applicability to challenging real and simulated nonlinear
control problems. For example, we consider the tasks of learning to swing up
a double pendulum attached to a cart or to balance a unicycle with five
degrees of freedom. Across all tasks we report unprecedented automation and
an unprecedented learning efficiency for solving these tasks.

3. As a
step toward PILCO's extension to partially observable Markov decision
processes, we propose a principled algorithm for robust filtering and
smoothing in GP dynamic systems. Unlike commonly used Gaussian filters for
nonlinear systems, it relies neither on function linearization nor on
finite-sample representations of densities. Our algorithm profits from exact
moment matching for predictions while keeping all computations analytically
tractable. We present experimental evidence that demonstrates the robustness
and the advantages of our method over unscented Kalman filters, the cubature
Kalman filter, and the extended Kalman filter.

Pedro A. Ortega and Daniel A. Braun.
**An axiomatic formalization of
bounded rationality based on a utility-information equivalence**.
Technical report, Dept. of Engineering, University of Cambridge, 2010.

** Abstract:** Classic decision-theory is based on the maximum
expected utility (MEU) principle, but crucially ignores the resource costs
incurred when determining optimal decisions. Here we propose an axiomatic
framework for bounded decision-making that considers resource costs. Agents
are formalized as probability measures over input-output streams. We
postulate that any such probability measure can be assigned a corresponding
conjugate utility function based on three axioms: utilities should be
real-valued, additive and monotonic mappings of probabilities. We show that
these axioms enforce a unique conversion law between utility and probability
(and thereby, information). Moreover, we show that this relation can be
characterized as a variational principle: given a utility function, its
conjugate probability measure maximizes a free utility functional.
Transformations of probability measures can then be formalized as a change in
free utility due to the addition of new constraints expressed by a target
utility function. Accordingly, one obtains a criterion to choose a
probability measure that trades off the maximization of a target utility
function and the cost of the deviation from a reference distribution. We show
that optimal control, adaptive estimation and adaptive control problems can
be solved this way in a resource-efficient way. When resource costs are
ignored, the MEU principle is recovered. Our formalization might thus provide
a principled approach to bounded rationality that establishes a close link to
information theory.

Pedro A. Ortega and Daniel A. Braun.
**A
Bayesian rule for adaptive control based on causal interventions**.
In *The third conference on artificial general intelligence*, pages
115-120, Paris, 2010. Atlantis Press.

** Abstract:**
Explaining adaptive behavior is a central problem in artificial intelligence
research. Here we formalize adaptive agents as mixture distributions over
sequences of inputs and outputs (I/O). Each distribution of the mixture
constitutes a "possible world", but the agent does not know which of the
possible worlds it is actually facing. The problem is to adapt the I/O stream
in a way that is compatible with the true world. A natural measure of
adaptation can be obtained by the Kullback-Leibler (KL) divergence between
the I/O distribution of the true world and the I/O distribution expected by
the agent that is uncertain about possible worlds. In the case of pure input
streams, the Bayesian mixture provides a well-known solution for this
problem. We show, however, that in the case of I/O streams this solution
breaks down, because outputs are issued by the agent itself and require a
different probabilistic syntax as provided by intervention calculus. Based on
this calculus, we obtain a Bayesian control rule that allows modeling
adaptive behavior with mixture distributions over I/O streams. This rule
might allow for a novel approach to adaptive control based on a minimum
KL-principle.

Pedro A. Ortega and Daniel A. Braun.
**A
conversion between utility and information**.
In *The third conference on artificial general intelligence*, pages
115-120, Paris, 2010. Atlantis Press.

** Abstract:** Rewards
typically express desirabilities or preferences over a set of alternatives.
Here we propose that rewards can be defined for any probability distribution
based on three desiderata, namely that rewards should be real-valued,
additive and order-preserving, where the latter implies that more probable
events should also be more desirable. Our main result states that rewards are
then uniquely determined by the negative information content. To analyze
stochastic processes, we define the utility of a realization as its reward
rate. Under this interpretation, we show that the expected utility of a
stochastic process is its negative entropy rate. Furthermore, we apply our
results to analyze agent-environment interactions. We show that the expected
utility that will actually be achieved by the agent is given by the negative
cross-entropy from the input-output (I/O) distribution of the coupled
interaction system and the agent's I/O distribution. Thus, our results allow
for an information-theoretic interpretation of the notion of utility and the
characterization of agent-environment interactions in terms of entropy
dynamics.

Pedro A. Ortega and Daniel A. Braun.
**A minimum relative entropy
principle for learning and acting**.
*Journal of Artificial Intelligence Research*, 38:475-511, 2010, doi 10.1613/jair.3062.

** Abstract:** This paper proposes a method to construct an
adaptive agent that is universal with respect to a given class of
experts, where each expert is designed specifically for a particular
environment. This adaptive control problem is formalized as the problem of
minimizing the relative entropy of the adaptive agent from the expert that is
most suitable for the unknown environment. If the agent is a passive
observer, then the optimal solution is the well-known Bayesian predictor.
However, if the agent is active, then its past actions need to be treated as
causal interventions on the I/O stream rather than normal probability
conditions. Here it is shown that the solution to this new variational
problem is given by a stochastic controller called the Bayesian control rule,
which implements adaptive behavior as a mixture of experts. Furthermore, it
is shown that under mild assumptions, the Bayesian control rule converges to
the control law of the most suitable expert.

Marc Peter Deisenroth and Carl Edward Rasmussen.
**Efficient
reinforcement learning for motor control**.
In *10th International PhD Workshop on Systems and Control*, Hluboká
nad Vltavou, Czech Republic, September 2009.

** Abstract:**
Artificial learners often require many more trials than humans or animals
when learning motor control tasks in the absence of expert knowledge. We
implement two key ingredients of biological learning systems, generalization
and incorporation of uncertainty into the decision-making process, to speed
up artificial learning. We present a coherent and fully Bayesian framework
that allows for efficient artificial learning in the absence of expert
knowledge. The success of our learning framework is demonstrated on
challenging nonlinear control problems in simulation and in hardware.

Marc Peter Deisenroth and Carl Edward Rasmussen.
**Bayesian inference
for efficient learning in control**.
In *Multidisciplinary Symposium on Reinforcement Learning*,
Montréal, QC, Canada, June 2009.

** Abstract:** In
contrast to humans or animals, artificial learners often require more trials
when learning motor control tasks solely based on experience. Efficient
autonomous learners will reduce the amount of engineering required to solve
control problems. By using probabilistic forward models, we can employ two
key ingredients of biological learning systems to speed up artificial
learning. We present a consistent and coherent Bayesian framework that allows
for efficient autonomous experience-based learning. We demonstrate the
success of our learning algorithm by applying it to challenging nonlinear
control problems in simulation and in hardware.

Marc Peter Deisenroth, Carl Edward Rasmussen, and Jan Peters.
**Gaussian process
dynamic programming**.
*Neurocomputing*, 72(7-9):1508-1524, March 2009, doi
10.1016/j.neucom.2008.12.019.

** Abstract:** Reinforcement
learning (RL) and optimal control of systems with continuous states and
actions require approximation techniques in most interesting cases. In this
article, we introduce Gaussian process dynamic programming (GPDP), an
approximate value function-based RL algorithm. We consider both a classic
optimal control problem, where problem-specific prior knowledge is available,
and a classic RL problem, where only very general priors can be used. For the
classic optimal control problem, GPDP models the unknown value functions with
Gaussian processes and generalizes dynamic programming to continuous-valued
states and actions. For the RL problem, GPDP starts from a given initial
state and explores the state space using Bayesian active learning. To design
a fast learner, available data have to be used efficiently. Hence, we propose
to learn probabilistic models of the a priori unknown transition dynamics and
the value functions on the fly. In both cases, we successfully apply the
resulting continuous-valued controllers to the under-actuated pendulum swing
up and analyze the performances of the suggested algorithms. It turns out
that GPDP uses data very efficiently and can be applied to problems, where
classic dynamic programming would be cumbersome.

** Comment:** code.

Carl Edward Rasmussen and Marc Peter Deisenroth.
**Probabilistic
inference for fast learning in control**.
In S. Girgin, M. Loth, R. Munos, P. Preux, and D. Ryabko, editors, *Recent
Advances in Reinforcement Learning*, volume 5323 of *Lecture Notes in
Computer Science (LNCS)*, pages 229-242, Villeneuve d'Ascq, France,
November 2008. Springer-Verlag.

** Abstract:** We provide a
novel framework for very fast model-based reinforcement learning in
continuous state and action spaces. The framework requires probabilistic
models that explicitly characterize their levels of confidence. Within this
framework, we use flexible, non-parametric models to describe the world based
on previously collected experience. We demonstrate learning on the cart-pole
problem in a setting where we provide very limited prior knowledge about the
task. Learning progresses rapidly, and a good policy is found after only a
handful of iterations.

** Comment:** videos and more. slides.

Marc Peter Deisenroth, Carl Edward Rasmussen, and Jan Peters.
**Model-based
reinforcement learning with continuous states and actions**.
In *Proceedings of the 16th European Symposium on Artificial Neural Networks
(ESANN 2008)*, pages 19-24, Bruges, Belgium, April 2008.

** Abstract:** Finding an optimal policy in a reinforcement learning (RL)
framework with continuous state and action spaces is challenging. Approximate
solutions are often inevitable. GPDP is an approximate dynamic programming
algorithm based on Gaussian process (GP) models for the value functions. In
this paper, we extend GPDP to the case of unknown transition dynamics. After
building a GP model for the transition dynamics, we apply GPDP to this model
and determine a continuous-valued policy in the entire state space. We apply
the resulting controller to the underpowered pendulum swing up. Moreover, we
compare our results on this RL task to a nearly optimal discrete DP solution
in a fully known environment.

Malte Kuß and Carl Edward Rasmussen.
**Assessing
approximations for Gaussian process classification**.
In Y. Weiss, B. Schölkopf, and J. Platt, editors, *Advances in Neural
Information Processing Systems 18*, pages 699-706, Cambridge, MA, USA,
April 2006. The MIT Press.

** Abstract:** Gaussian processes
are attractive models for probabilistic classification but unfortunately
exact inference is analytically intractable. We compare Laplace's method and
Expectation Propagation (EP) focusing on marginal likelihood estimates and
predictive performance. We explain theoretically and corroborate empirically
that EP is superior to Laplace. We also compare to a sophisticated MCMC
scheme and show that EP is surprisingly accurate.

Carl Edward Rasmussen and Malte Kuß.
**Gaussian processes
in reinforcement learning**.
In S. Thrun, L.K. Saul, and B. Schölkopf, editors, *Advances in Neural
Information Processing Systems 16*, pages 751-759, Cambridge, MA, USA,
December 2004. The MIT Press.

** Abstract:** We exploit some
useful properties of Gaussian process (GP) regression models for
reinforcement learning in continuous state spaces and discrete time. We
demonstrate how the GP model allows evaluation of the value function in
closed form. The resulting policy iteration algorithm is demonstrated on a
simple problem with a two dimensional state space. Further, we speculate that
the intrinsic ability of GP models to characterise distributions of functions
would allow the method to capture entire distributions over future values
instead of merely their expectation, which has traditionally been the focus
of much of reinforcement learning.

Juš Kocijan, Roderick Murray-Smith, Carl Edward Rasmussen, and Agathe
Girard.
**Gaussian
process model based predictive control**.
In *American Control Conference*, pages 2214-2219, 2004.

** Abstract:** Gaussian process models provide a probabilistic
non-parametric modelling approach for black-box identification of non-linear
dynamic systems. The Gaussian processes can highlight areas of the input
space where prediction quality is poor, due to the lack of data or its
complexity, by indicating the higher variance around the predicted mean.
Gaussian process models contain noticeably less coefficients to be optimised.
This paper illustrates possible application of Gaussian process models within
model-based predictive control. The extra information provided within
Gaussian process model is used in predictive control, where optimisation of
control signal takes the variance information into account. The predictive
control principle is demonstrated on control of pH process benchmark.

Sebastian Thrun, Yufeng Liu, Daphne Koller, Andrew Y. Ng, Zoubin Ghahramani,
and Hugh F. Durrant-Whyte.
**Simultaneous
localization and mapping with sparse extended information filters**.
*I. J. Robotic Res.*, 23(7-8):693-716, 2004.

** Abstract:** This paper describes a scalable algorithm for the simultaneous
mapping and localization (SLAM) problem. SLAM is the problem of acquiring a
map of a static environment with a mobile robot. The vast majority of SLAM
algorithms are based on the extended Kalman filter (EKF). This paper
advocates an algorithm that relies on the dual of the EKF, the extended
information filter (EIF). We show that when represented in the information
form, map posteriors are dominated by a small number of links that tie
together nearby features in the map. This insight is developed into a sparse
variant of the EIF, called the sparse extended information filters (SEIF).
SEIFs represent maps by graphical networks of features that are locally
interconnected, where links represent relative information between pairs of
nearby features, as well as information about the robot's pose relative to
the map. We show that all essential update equations in SEIFs can be executed
in constant time, irrespective of the size of the map. We also provide
empirical results obtained for a benchmark data set collected in an outdoor
environment, and using a multi-robot mapping simulation.

Roderick Murray-Smith, Daniel Sbarbaro, Carl Edward Rasmussen, and Agathe
Girard.
**Adaptive,
cautious, predictive control with Gaussian process priors**.
In P. Van den Hof, B. Wahlberg, and S. Weiland, editors, *IFAC SYSID
2003*, pages 1195-1200, Oxford, UK, August 2003. Elsevier Science Ltd.

** Abstract:** Nonparametric Gaussian Process models, a Bayesian
statistics approach, are used to implement a nonlinear adaptive control law.
Predictions, including propagation of the state uncertainty, are made over a
k-step horizon. The expected value of a quadratic cost function is minimised,
over this prediction horizon, without ignoring the variance of the model
predictions. The general method and its main features are illustrated on a
simulation example.

Juš Kocijan, Roderick Murray-Smith, Carl Edward Rasmussen, and Bojan
Likar.
**Predictive
control with Gaussian process models**.
In B. Zajc and M. Tkal, editors, *IEEE Region 8 Eurocon 2003: Computer as a
Tool*, pages 352-356, 2003.

** Abstract:** This paper
describes model-based predictive control based on Gaussian processes.
Gaussian process models provide a probabilistic non-parametric modelling
approach for black-box identification of non-linear dynamic systems. It
offers more insight in variance of obtained model response, as well as fewer
parameters to determine than other models. The Gaussian processes can
highlight areas of the input space where prediction quality is poor, due to
the lack of data or its complexity, by indicating the higher variance around
the predicted mean. This property is used in predictive control, where
optimisation of control signal takes the variance information into account.
The predictive control principle is demonstrated on a simulated example of
nonlinear system.

Zoubin Ghahramani and Sam T. Roweis.
**Learning nonlinear
dynamical systems using an EM algorithm**.
In Michael J. Kearns, Sara A. Solla, and David A. Cohn, editors, *NIPS*,
pages 431-437. The MIT Press, 1998.

** Abstract:** The
Expectation Maximization (EM) algorithm is an iterative procedure for maximum
likelihood parameter estimation from data sets with missing or hidden
variables. It has been applied to system identification in linear stochastic
state-space models, where the state variables are hidden from the observer
and both the state and the parameters of the model have to be estimated
simultaneously [9]. We present a generalization of the EM algorithm for
parameter estimation in nonlinear dynamical systems. The "expectation" step
makes use of Extended Kalman Smoothing to estimate the state, while the
"maximization" step re-estimates the parameters using these uncertain state
estimates. In general, the nonlinear maximization step is difficult because
it requires integrating out the uncertainty in the states. However, if
Gaussian radial basis function (RBF) approximators are used to model the
nonlinearities, the integrals become tractable and the maximization step can
be solved via systems of linear equations.

## Time Series Models

Modelling time series and sequential data is an essential part of many different areas of science and engineering, including, for example, signal processing and control, bioinformatics, speech recognition, econometrics and finance. Using basic building blocks such as hidden Markov models, linear Gaussian state-space models, and Bayesian networks, it is possible to develop sophisticated time series models for real-world data. However, learning (parameter inference / system identification) becomes computationally challenging for such sophisticated models.

Wessel P. Bruinsma, Martin Tegnér, and Richard E. Turner.
**Modelling
non-smooth signals with complex spectral structure**.
In *25th International Conference on Artificial Intelligence and Statistics (AISTATS)*, 2022.

** Abstract:** The Gaussian Process
Convolution Model (GPCM; Tobar et al., 2015a) is a model for signals with
complex spectral structure. A significant limitation of the GPCM is that it
assumes a rapidly decaying spectrum: it can only model smooth signals.
Moreover, inference in the GPCM currently requires (1) a mean-field
assumption, resulting in poorly calibrated uncertainties, and (2) a tedious
variational optimisation of large covariance matrices. We redesign the GPCM
model to induce a richer distribution over the spectrum with relaxed
assumptions about smoothness: the Causal Gaussian Process Convolution Model
(CGPCM) introduces a causality assumption into the GPCM, and the Rough
Gaussian Process Convolution Model (RGPCM) can be interpreted as a Bayesian
nonparametric generalisation of the fractional Ornstein–Uhlenbeck process.
We also propose a more effective variational inference scheme, going beyond
the mean-field assumption: we design a Gibbs sampler which directly samples
from the optimal variational solution, circumventing any variational
optimisation entirely. The proposed variations of the GPCM are validated in
experiments on synthetic and real-world data, showing promising results.

Talay M Cheema.
**Contrasting
discrete and continuous methods for Bayesian system identification**.
In *Workshop on Continuous Time Machine Learning at the 39th International
Conference on Machine Learning*, 2022.

** Abstract:** In
recent years, there has been considerable interest in embedding continuous
time methods in machine learning algorithms. In system identification, the
task is to learn a dynamical model from incomplete observation data, and when
prior knowledge is in continuous time – for example, mechanistic
differential equation models – it seems natural to use continuous time
models for learning. Yet when learning flexible, nonlinear, probabilistic
dynamics models, most previous work has focused on discrete time models to
avoid computational, numerical, and mathematical difficulties. In this work
we show, with the aid of small-scale examples, that this mismatch between
model and data generating process can be consequential under certain
circumstances, and we discuss possible modifications to discrete time models
which may better suit them to handling data generated by continuous time
processes.

Alessandro Davide Ialongo.
**Variational
Inference in Dynamical Systems**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2022.

** Abstract:** Dynamical systems are a powerful
formalism to analyse the world around us. Many datasets are sequential in
nature, and can be described by a discrete time evolution law. We are
interested in approaching the analysis of such datasets from a probabilistic
perspective. We would like to maintain justified beliefs about quantities
which, though useful in explaining the behaviour of a system, may not be
observable, as well as about the system's evolution itself, especially in
regimes we have not yet observed in our data. The framework of statistical
inference gives us the tools to do so, yet, for many systems of interest,
performing inference exactly is not computationally or analytically
tractable. The contribution of this thesis, then, is twofold: first, we
uncover two sources of bias in existing variational inference methods applied
to dynamical systems in general, and state space models whose transition
function is drawn from a Gaussian process (GPSSM) in particular. We show bias
can derive from assuming posteriors in non-linear systems to be jointly
Gaussian, and from assuming that we can sever the dependence between latent
states and transition function in state space model posteriors. Second, we
propose methods to address these issues, undoing the resulting biases. We do
this without compromising on computational efficiency or on the ability to
scale to larger datasets and higher dimensions, compared to the methods we
rectify. One method, the Markov Autoregressive Flow (Markov AF) addresses the
Gaussian assumption, by providing a more flexible class of posteriors, based
on normalizing flows, which can be easily evaluated, sampled, and optimised.
The other method, Variationally Coupled Dynamics and Trajectories (VCDT),
tackles the factorisation assumption, leveraging sparse Gaussian processes
and their variational representation to reintroduce dependence between latent
states and the transition function at no extra computational cost. Since the
objective of inference is to maintain calibrated beliefs, if we employed
approximations which are significantly biased in non-linear, noisy systems,
or when there is little data available, we would have failed in our
objective, as those are precisely the regimes in which uncertainty
quantification is all the more important. Hence we think it is essential, if
we wish to act optimally on such beliefs, to uncover, and, if possible, to
correct, all sources of systematic bias in our inference methods.

Talay M Cheema.
**Understanding
local linearisation in variational Gaussian process state space
models**.
In *Time Series Workshop at the 38th International Conference on Machine
Learning*, 2021.

** Abstract:** We describe variational
inference approaches in Gaussian process state space models in terms of local
linearisations of the approximate posterior function. Most previous
approaches have either assumed independence between the posterior dynamics
and latent states (the mean-field (MF) approximation), or optimised free
parameters for both, leading to limited scalability. We use our framework to
prove that (i) there is a theoretical imperative to use non-MF approaches, to
avoid excessive bias in the process noise hyperparameter estimate, and (ii)
we can parameterise only the posterior dynamics without any loss of
performance. Our approach suggests further approximations, based on the
existing rich literature on filtering and smoothing for nonlinear systems,
and unifies approaches for discrete and continuous time models.

Timothy Gebhard, Niki Kilbertus, Ian Harry, and Bernhard Schölkopf.
**Convolutional
neural networks: A magic bullet for gravitational-wave detection?**.
*Physical Review D*, 100(6):063015, September 2019, doi
https://doi.org/10.1103/PhysRevD.100.063015.

** Abstract:** In the last few years, machine learning techniques, in
particular convolutional neural networks, have been investigated as a method
to replace or complement traditional matched filtering techniques that are
used to detect the gravitational-wave signature of merging black holes.
However, to date, these methods have not yet been successfully applied to the
analysis of long stretches of data recorded by the Advanced LIGO and Virgo
gravitational-wave observatories. In this work, we critically examine the use
of convolutional neural networks as a tool to search for merging black holes.
We identify the strengths and limitations of this approach, highlight some
common pitfalls in translating between machine learning and
gravitational-wave astronomy, and discuss the interdisciplinary challenges.
In particular, we explain in detail why convolutional neural networks alone
cannot be used to claim a statistically significant gravitational-wave
detection. However, we demonstrate how they can still be used to rapidly flag
the times of potential signals in the data for a more detailed follow-up. Our
convolutional neural network architecture as well as the proposed performance
metrics are better suited for this task than a standard binary
classification scheme. A detailed evaluation of our approach on Advanced
LIGO data demonstrates the potential of such systems as trigger generators.
Finally, we sound a note of caution by constructing adversarial examples,
which showcase interesting "failure modes" of our model, where inputs with no
visible resemblance to real gravitational-wave signals are identified as such
by the network with high confidence.

Alessandro Davide Ialongo, Mark van der Wilk, James Hensman, and Carl Edward
Rasmussen.
**Overcoming mean-field
approximations in recurrent Gaussian process models**.
In *36th International Conference on Machine Learning*, Long Beach, June
2019.

** Abstract:** We identify a new variational inference
scheme for dynamical systems whose transition function is modelled by a
Gaussian process. Inference in this setting has either employed
computationally intensive MCMC methods, or relied on factorisations of the
variational posterior. As we demonstrate in our experiments, the
factorisation between latent system states and transition function can lead
to a miscalibrated posterior and to learning unnecessarily large noise terms.
We eliminate this factorisation by explicitly modelling the dependence
between state trajectories and the Gaussian process posterior. Samples of the
latent states can then be tractably generated by conditioning on this
representation. The method we obtain (VCDT: variationally coupled dynamics
and trajectories) gives better predictive performance and more calibrated
estimates of the transition function, yet maintains the same time and space
complexities as mean-field methods. Code is available at:
https://github.com/ialong/GPt.

Alessandro Davide Ialongo, Mark van der Wilk, James Hensman, and Carl Edward
Rasmussen.
**Non-factorised variational
inference in dynamical systems**.
In *First Symposium on Advances in Approximate Bayesian Inference*,
Montreal, December 2018.

** Abstract:** We focus on
variational inference in dynamical systems where the discrete time transition
function (or evolution rule) is modelled by a Gaussian process. The dominant
approach so far has been to use a factorised posterior distribution,
decoupling the transition function from the system states. This is not exact
in general and can lead to an overconfident posterior over the transition
function as well as an overestimation of the intrinsic stochasticity of the
system (process noise). We propose a new method that addresses these issues
and incurs no additional computational costs.

Yingzhen Li and Stephan Mandt.
**Disentangled Sequential
Autoencoder**.
In *35th International Conference on Machine Learning*, Stockholm,
Sweden, July 2018.

** Abstract:** We present a VAE
architecture for encoding and generating high dimensional sequential data,
such as video or audio. Our deep generative model learns a latent
representation of the data which is split into a static and dynamic part,
allowing us to approximately disentangle latent time-dependent features
(dynamics) from features which are preserved over time (content). This
architecture gives us partial control over generating content and dynamics by
conditioning on either one of these sets of features. In our experiments on
artificially generated cartoon video clips and voice recordings, we show that
we can convert the content of a given sequence into another one by such
content swapping. For audio, this allows us to convert a male speaker into a
female speaker and vice versa, while for video we can separately manipulate
shapes and dynamics. Furthermore, we give empirical evidence for the
hypothesis that stochastic RNNs as latent state models are more efficient at
compressing and generating long sequences than deterministic ones, which may
be relevant for applications in video compression.

Alessandro Davide Ialongo, Mark van der Wilk, and Carl Edward Rasmussen.
**Closed-form inference and
prediction in Gaussian process state-space models**.
In *NIPS Time Series Workshop 2017*, Long Beach, December 2017.

** Abstract:** We examine an analytic variational inference
scheme for the Gaussian Process State Space Model (GPSSM) - a probabilistic
model for system identification and time-series modelling. Our approach
performs variational inference over both the system states and the transition
function. We exploit Markov structure in the true posterior, as well as an
inducing point approximation to achieve linear time complexity in the length
of the time series. Contrary to previous approaches, no Monte Carlo sampling
is required: inference is cast as a deterministic optimisation problem. In a
number of experiments, we demonstrate the ability to model non-linear
dynamics in the presence of both process and observation noise as well as to
impute missing information (e.g. velocities from raw positions through time),
to de-noise, and to estimate the underlying dimensionality of the system.
Finally, we also introduce a closed-form method for multi-step prediction,
and a novel criterion for assessing the quality of our approximate
posterior.

Rowan McAllister and Carl Edward Rasmussen.
**Data-efficient
reinforcement learning in continuous state-action
Gaussian-POMDPs**.
In *Advances in Neural Information Processing Systems 31*, Long Beach,
California, December 2017.

** Abstract:** We present a
data-efficient reinforcement learning method for continuous state-action
systems under significant observation noise. Data-efficient solutions under
small noise exist, such as PILCO which learns the cartpole swing-up task in
30s. PILCO evaluates policies by planning state-trajectories using a dynamics
model. However, PILCO applies policies to the observed state, therefore
planning in observation space. We extend PILCO with filtering to instead plan
in belief space, consistent with partially observable Markov decision
process (POMDP) planning. This enables data-efficient learning under
significant observation noise, outperforming more naive methods such as
post-hoc application of a filter to policies optimised by the original
(unfiltered) PILCO algorithm. We test our method on the cartpole swing-up
task, which involves nonlinear dynamics and requires nonlinear control.

Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, José Miguel Hernández-Lobato,
Richard E. Turner, and Douglas Eck.
**Sequence Tutor: Conservative
fine-tuning of sequence generation models with KL-control**.
In *34th International Conference on Machine Learning*, Sydney,
Australia, Aug 2017.

** Abstract:** This paper proposes a
general method for improving the structure and quality of sequences generated
by a recurrent neural network (RNN), while maintaining information originally
learned from data, as well as sample diversity. An RNN is first pre-trained
on data using maximum likelihood estimation (MLE), and the probability
distribution over the next token in the sequence learned by this model is
treated as a prior policy. Another RNN is then trained using reinforcement
learning (RL) to generate higher-quality outputs that account for
domain-specific incentives while retaining proximity to the prior policy of
the MLE RNN. To formalize this objective, we derive novel off-policy RL
methods for RNNs from KL-control. The effectiveness of the approach is
demonstrated on two applications: 1) generating novel musical melodies, and
2) computational molecular generation. For both problems, we show that the
proposed method improves the desired properties and structure of the
generated sequences, while maintaining information learned from data.

** Comment:** [MIT
Technology Review] [Video]

Rowan McAllister.
**Bayesian learning for
data-efficient control**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2016.

** Abstract:** Applications to learn control of
unfamiliar dynamical systems with increasing autonomy are ubiquitous. From
robotics, to finance, to industrial processing, autonomous learning helps
obviate a heavy reliance on experts for system identification and controller
design. Often real world systems are nonlinear, stochastic, and expensive to
operate (e.g. slow, energy intensive, prone to wear and tear). Ideally
therefore, nonlinear systems can be identified with minimal system
interaction. This thesis considers data efficient autonomous learning of
control of nonlinear, stochastic systems. Data efficient learning critically
requires probabilistic modelling of dynamics. Traditional control approaches
use deterministic models, which easily overfit data, especially small
datasets. We use probabilistic Bayesian modelling to learn systems from
scratch, similar to the PILCO algorithm, which achieved unprecedented data
efficiency in learning control of several benchmarks. We extend PILCO in
three principal ways. First, we learn control under significant observation
noise by simulating a filtered control process using a tractably analytic
framework of Gaussian distributions. In addition, we develop the 'latent
variable belief Markov decision process' when filters must predict under
real-time constraints. Second, we improve PILCO's data efficiency by
directing exploration with predictive loss uncertainty and Bayesian
optimisation, including a novel approximation to the Gittins index. Third, we
take a step towards data efficient learning of high-dimensional control using
Bayesian neural networks (BNN). Experimentally, we show that although filtering
mitigates adverse effects of observation noise, much greater performance is
achieved when optimising controllers with evaluations faithful to reality: by
simulating closed-loop filtered control if executing closed-loop filtered
control. Thus, controllers are optimised w.r.t. how they are used,
outperforming filters applied to systems optimised by unfiltered simulations.
We show directed exploration improves data efficiency. Lastly, we show BNN
dynamics models are almost as data efficient as Gaussian process models.
Results show data efficient learning of high-dimensional control is possible
as BNNs scale to high-dimensional state inputs.

Shixiang Gu, Zoubin Ghahramani, and Richard E. Turner.
**Neural adaptive sequential Monte
Carlo**.
In *Advances in Neural Information Processing Systems 29*, Montréal,
Canada, Dec 2015.

** Abstract:** Sequential Monte Carlo (SMC),
or particle filtering, is a popular class of methods for sampling from an
intractable target distribution using a sequence of simpler intermediate
distributions. Like other importance sampling-based methods, performance is
critically dependent on the proposal distribution: a bad proposal can lead to
arbitrarily inaccurate estimates of the target distribution. This paper
presents a new method for automatically adapting the proposal using an
approximation of the Kullback-Leibler divergence between the true posterior
and the proposal distribution. The method is very flexible, applicable to any
parameterised proposal distribution and it supports online and batch
variants. We use the new framework to adapt powerful proposal distributions
with rich parameterisations based upon neural networks leading to Neural
Adaptive Sequential Monte Carlo (NASMC). Experiments indicate that NASMC
significantly improves inference in a non-linear state space model
outperforming adaptive proposal methods including the Extended Kalman and
Unscented Particle Filters. Experiments also indicate that improved inference
translates into improved parameter learning when NASMC is used as a
subroutine of Particle Marginal Metropolis Hastings. Finally we show that
NASMC is able to train a neural network-based deep recurrent generative model
achieving results that compete with the state-of-the-art for polyphonic
music modelling. NASMC can be seen as bridging the gap between adaptive SMC
methods and the recent work in scalable, black-box variational inference.
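
The role of the proposal distribution can be seen in the simplest SMC variant, the bootstrap particle filter, which proposes from the transition prior. The sketch below is not NASMC: it is the plain baseline whose proposal an adaptive method would replace with a learned one. The toy nonlinear model, the function name and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def bootstrap_particle_filter(ys, n_particles=500, q=1.0, r=1.0, seed=0):
    """Bootstrap particle filter for a toy nonlinear state-space model.

    Assumed model (illustrative only, not taken from the paper):
        x_t = 0.5 x_{t-1} + 25 x_{t-1} / (1 + x_{t-1}^2) + N(0, q)
        y_t = x_t^2 / 20 + N(0, r)
    Returns the filtering means and an estimate of log p(y_1, ..., y_T).
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, size=n_particles)  # particles from the initial prior
    means, log_evidence = [], 0.0
    for y in ys:
        # Propose from the transition prior (the "bootstrap" proposal).
        x = 0.5 * x + 25.0 * x / (1.0 + x**2) + rng.normal(0.0, np.sqrt(q), n_particles)
        # With this proposal the importance weight is the observation likelihood.
        log_w = -0.5 * (y - x**2 / 20.0) ** 2 / r - 0.5 * np.log(2.0 * np.pi * r)
        shift = log_w.max()                      # subtract the max for numerical stability
        w = np.exp(log_w - shift)
        log_evidence += shift + np.log(w.mean())
        w /= w.sum()
        means.append(np.sum(w * x))              # weighted filtering mean
        # Multinomial resampling to counter weight degeneracy.
        x = x[rng.choice(n_particles, size=n_particles, p=w)]
    return np.array(means), log_evidence

ys = np.array([0.1, 2.0, 5.0, 1.0, 0.3])
means, log_ev = bootstrap_particle_filter(ys)
```

When the observation is informative, the transition prior places few particles where the likelihood is large and most weights collapse towards zero, which is the failure mode a better-adapted proposal is meant to avoid.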

Felipe Tobar, Thang D. Bui, and Richard E. Turner.
**Learning
stationary time series using Gaussian processes with nonparametric
kernels**.
In *Advances in Neural Information Processing Systems 29*, Montréal,
Canada, Dec 2015.

** Abstract:** We introduce the Gaussian
Process Convolution Model (GPCM), a two-stage nonparametric generative
procedure to model stationary signals as the convolution between a
continuous-time white-noise process and a continuous-time linear filter drawn
from a Gaussian process. The GPCM is a continuous-time nonparametric-window
moving average process and, conditionally, is itself a Gaussian process with
a nonparametric kernel defined in a probabilistic fashion. The generative
model can be equivalently considered in the frequency domain, where the power
spectral density of the signal is specified using a Gaussian process. One of
the main contributions of the paper is to develop a novel variational
free-energy approach based on inter-domain inducing variables that efficiently
learns the continuous-time linear filter and infers the driving white-noise
process. In turn, this scheme provides closed-form probabilistic estimates of
the covariance kernel and the noise-free signal both in denoising and
prediction scenarios. Additionally, the variational inference procedure
provides closed-form expressions for the approximate posterior of the
spectral density given the observed data, leading to new Bayesian
nonparametric approaches to spectrum estimation. The proposed GPCM is
validated using synthetic and real-world signals.
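
The two-stage generative procedure can be sketched on a discrete grid: draw a filter from a Gaussian process, then convolve it with white noise. This is a rough illustration only and does not reproduce the paper's inference scheme. The decaying squared-exponential covariance, the grid, the function name and every parameter value below are assumptions.

```python
import numpy as np

def sample_gpcm_like_signal(n=200, dt=0.1, alpha=0.1, gamma=1.0, seed=0):
    """Draw one signal from a discretised filter-times-white-noise model."""
    rng = np.random.default_rng(seed)
    # Time grid for the filter, centred on zero.
    tau = (np.arange(n) - n // 2) * dt
    # Assumed filter covariance: a squared-exponential term multiplied by
    # exp(-alpha * t^2) factors so that sampled filters decay away from zero.
    diff = tau[:, None] - tau[None, :]
    K = np.exp(-alpha * tau[:, None] ** 2 - alpha * tau[None, :] ** 2 - gamma * diff**2)
    # Sample h ~ N(0, K) through an eigendecomposition; clipping guards against
    # tiny negative eigenvalues caused by round-off.
    evals, evecs = np.linalg.eigh(K)
    h = evecs @ (np.sqrt(np.clip(evals, 0.0, None)) * rng.standard_normal(n))
    # Discretised white noise: variance 1/dt so the Riemann sum below
    # approximates the continuous-time convolution integral.
    x = rng.standard_normal(2 * n) / np.sqrt(dt)
    # f(t) = integral of h(t - s) x(s) ds, approximated by dt * sum_s h(t - s) x(s).
    f = dt * np.convolve(x, h, mode="valid")
    return tau, h, f

tau, h, f = sample_gpcm_like_signal()
```

Conditioned on the sampled filter `h`, the signal `f` is a linear map of Gaussian noise and is therefore itself Gaussian, which is the discrete counterpart of the conditional Gaussian process property stated in the abstract.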

Roger Frigola.
**Bayesian time series
learning with Gaussian processes**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2015.

** Abstract:** The analysis of time series data is
important in fields as disparate as the social sciences, biology, engineering
or econometrics. In this dissertation, we present a number of algorithms
designed to learn Bayesian nonparametric models of time series. The goal of
these kinds of models is twofold. First, they aim at making predictions which
quantify the uncertainty due to limitations in the quantity and the quality
of the data. Second, they are flexible enough to model highly complex data
whilst preventing overfitting when the data does not warrant complex
models.

We begin with a unifying literature review on time series models
based on Gaussian processes. Then, we centre our attention on the Gaussian
Process State-Space Model (GP-SSM): a Bayesian nonparametric generalisation
of discrete-time nonlinear state-space models. We present a novel formulation
of the GP-SSM that offers new insights into its properties. We then proceed
to exploit those insights by developing new learning algorithms for the
GP-SSM based on particle Markov chain Monte Carlo and variational
inference.

Finally, we present a filtered nonlinear auto-regressive
model with a simple, robust and fast learning algorithm that makes it well
suited to its application by non-experts on large datasets. Its main
advantage is that it avoids the computationally expensive (and potentially
difficult to tune) smoothing step that is a key part of learning nonlinear
state-space models.

James Robert Lloyd.
**Representation,
learning, description and criticism of probabilistic models with applications
to networks, functions and relational data**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2015.

** Abstract:** This thesis makes contributions to a
variety of aspects of probabilistic inference. When performing probabilistic
inference, one must first represent one’s beliefs with a probability
distribution. Specifying the details of a probability distribution can be a
difficult task in many situations, but when expressing beliefs about complex
data structures it may not even be apparent what form such a distribution
should take. This thesis starts by demonstrating how representation theorems
due to Aldous, Hoover and Kallenberg can be used to specify appropriate
models for data in the form of networks. These theorems are then extended in
order to reveal appropriate probability distributions for arbitrary
relational data or databases. A simpler data structure to specify probability
distributions for is that of functions; many probability distributions for
functions have been used for centuries. We demonstrate that many of these
distributions can be expressed in a common language of Gaussian process
kernels constructed from a few base elements and operators. The structure of
this language allows for the effective automatic construction of
probabilistic models for functions. Furthermore, the formal mathematical
language of kernels can be mapped neatly onto natural language allowing for
automatic descriptions of the automatically constructed models. By further
automating the construction of statistical models, the need to be able to
effectively check or criticise these models becomes greater. This thesis
demonstrates how kernel two sample tests can be used to demonstrate where a
probabilistic model most disagrees with data allowing for targeted
improvements to the model. In proposing a new method of model criticism this
thesis also briefly discusses the philosophy of model criticism within the
context of probabilistic inference.

Felipe Tobar, Petar M. Djurić, and Danilo P. Mandic.
**Unsupervised
state-space modeling using reproducing kernels**.
*IEEE Transactions on Signal Processing*, 63:5210-5221, 2015.

** Abstract:** A novel framework for the design of state-space
models (SSMs) is proposed whereby the state-transition function of the model
is parametrized using reproducing kernels. The nature of SSMs requires
learning a latent function that resides in the state space and for which
input-output sample pairs are not available, thus prohibiting the use of
gradient-based supervised kernel learning. To this end, we then propose to
learn the mixing weights of the kernel estimate by sampling from their
posterior density using Monte Carlo methods. We first introduce an offline
version of the proposed algorithm, followed by an online version which
performs inference on both the parameters and the hidden state through
particle filtering. The accuracy of the estimation of the state-transition
function is first validated on synthetic data. Next, we show that the
proposed algorithm outperforms kernel adaptive filters in the prediction of
real-world time series, while also providing probabilistic estimates, a key
advantage over standard methods.

Felipe Tobar and Richard E. Turner.
**Modelling
of complex signals using Gaussian processes**.
In

** Abstract:** In complex-valued signal processing, estimation
algorithms require complete knowledge (or accurate estimation) of the
second-order statistics; this makes Gaussian processes (GP) well suited for
modelling complex signals, as they are designed in terms of covariance
functions. Dealing with bivariate signals using GPs requires four covariance
matrices, or equivalently, two complex matrices. We propose a GP-based
approach for modelling complex signals, whereby the second-order statistics
are learnt through maximum likelihood; in particular, the complex GP approach
allows for circularity coefficient estimation in a robust manner when the
observed signal is corrupted by (circular) white noise. The proposed model is
validated using climate signals, for both circular and noncircular cases. The
results obtained open new possibilities for collaboration between the complex
signal processing and Gaussian processes communities towards an appealing
representation and statistical description of bivariate signals.

James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, and
Zoubin Ghahramani.
**Automatic construction and
natural-language description of nonparametric regression models**.
In *Association for the Advancement of Artificial Intelligence (AAAI)*,
July 2014.

** Abstract:** This paper presents the beginnings
of an automatic statistician, focusing on regression problems. Our system
explores an open-ended space of statistical models to discover a good
explanation of a data set, and then produces a detailed report with figures
and natural-language text. Our approach treats unknown regression functions
nonparametrically using Gaussian processes, which has two important
consequences. First, Gaussian processes can model functions in terms of
high-level properties (e.g. smoothness, trends, periodicity, changepoints).
Taken together with the compositional structure of our language of models
this allows us to automatically describe functions in simple terms. Second,
the use of flexible nonparametric models and a rich language for composing
them in an open-ended manner also results in state-of-the-art extrapolation
performance evaluated over 13 real time series data sets from various
domains.
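
The compositional structure mentioned in the abstract rests on the fact that sums and products of valid kernels are again valid kernels. A minimal sketch of that idea follows. The helper names, the choice of base kernels and the example composition are assumptions for illustration, not the system's actual grammar or code.

```python
import numpy as np

# Base kernels on one-dimensional inputs, each returning a function k(a, b).
def se(ell=1.0):
    """Squared exponential: smooth functions."""
    return lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def periodic(period=1.0, ell=1.0):
    """Periodic kernel: repeating structure."""
    return lambda a, b: np.exp(
        -2.0 * np.sin(np.pi * (a[:, None] - b[None, :]) / period) ** 2 / ell**2
    )

def linear(c=0.0):
    """Linear kernel: linear trends."""
    return lambda a, b: (a[:, None] - c) * (b[None, :] - c)

# Operators: both preserve positive semi-definiteness.
def add(k1, k2):
    return lambda a, b: k1(a, b) + k2(a, b)

def mul(k1, k2):
    return lambda a, b: k1(a, b) * k2(a, b)

# Reads as: "a linear trend, plus a periodic component whose shape changes slowly".
kernel = add(linear(), mul(periodic(period=1.0), se(ell=5.0)))

x = np.linspace(0.0, 4.0, 50)
K = kernel(x, x)  # 50 x 50 covariance matrix of the composite kernel
```

Because each base kernel corresponds to a high-level property, the expression tree of a composite kernel can be translated term by term into a sentence, as in the comment above the `kernel` definition.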

Thang D. Bui and Richard E. Turner.
**Tree-structured Gaussian process approximations**.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 28*,
volume 28, pages 2213-2221. Curran Associates, Inc., 2014.

** Abstract:** Gaussian process regression can be accelerated by
constructing a small pseudo-dataset to summarize the observed data. This idea
sits at the heart of many approximation schemes, but such an approach
requires the number of pseudo-datapoints to be scaled with the range of the
input space if the accuracy of the approximation is to be maintained. This
presents problems in time-series settings or in spatial datasets where large
numbers of pseudo-datapoints are required since computation typically scales
quadratically with the pseudo-dataset size. In this paper we devise an
approximation whose complexity grows linearly with the number of
pseudo-datapoints. This is achieved by imposing a tree or chain structure on
the pseudo-datapoints and calibrating the approximation using a
Kullback-Leibler (KL) minimization. Inference and learning can then be
performed efficiently using the Gaussian belief propagation algorithm. We
demonstrate the validity of our approach on a set of challenging regression
tasks including missing data imputation for audio and spatial datasets. We
trace out the speed-accuracy trade-off for the new method and show that the
frontier dominates those obtained from a large number of existing
approximation techniques.

Roger Frigola, Yutian Chen, and Carl Edward Rasmussen.
**Variational
Gaussian process state-space models**.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 27*,
2014.

** Abstract:** State-space models have been successfully
used for more than fifty years in different areas of science and engineering.
We present a procedure for efficient variational Bayesian learning of
nonlinear state-space models based on sparse Gaussian processes. The result
of learning is a tractable posterior over nonlinear dynamical systems. In
comparison to conventional parametric models, we offer the possibility to
straightforwardly trade off model capacity and computational cost whilst
avoiding overfitting. Our main algorithm uses a hybrid inference approach
combining variational Bayes and sequential Monte Carlo. We also present
stochastic variational inference and online learning approaches for fast
learning with long time series.
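
For orientation, a Gaussian process state-space model of the kind discussed in this and neighbouring entries is commonly written as below. The notation (mean function $m$, kernel $k$, process noise covariance $Q$, observation parameters $\theta_y$) is assumed here and may differ from that used in the paper.

```latex
\begin{aligned}
f &\sim \mathcal{GP}\big(m(\cdot),\, k(\cdot, \cdot)\big), \\
x_0 &\sim p(x_0), \\
x_t \mid x_{t-1}, f &\sim \mathcal{N}\big(f(x_{t-1}),\, Q\big), \\
y_t \mid x_t &\sim p(y_t \mid x_t, \theta_y).
\end{aligned}
```

The transition function $f$ is shared across all time steps, so the latent states are coupled through it; this coupling is what makes exact inference intractable and motivates the approximate schemes described in these abstracts.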

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl Edward
Rasmussen.
**Identification of Gaussian
process state-space models with particle stochastic approximation
EM**.
In *Proceedings of the 19th World Congress of the International Federation
of Automatic Control (IFAC)*, 2014.

** Abstract:**
Gaussian process state-space models (GP-SSMs) are a very flexible family of
models of nonlinear dynamical systems. They comprise a Bayesian nonparametric
representation of the dynamics of the system and additional
(hyper-)parameters governing the properties of this nonparametric
representation. The Bayesian formalism enables systematic reasoning about the
uncertainty in the system dynamics. We present an approach to maximum
likelihood identification of the parameters in GP-SSMs, while retaining the
full nonparametric description of the dynamics. The method is based on a
stochastic approximation version of the EM algorithm that employs recent
developments in particle Markov chain Monte Carlo for efficient
identification.

Andrew McHutchon.
**Nonlinear modelling and
control using Gaussian processes**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2014.

** Abstract:** In many scientific disciplines it is
often required to make predictions about how a system will behave or to
deduce the correct control values to elicit a particular desired response.
Efficiently solving both of these tasks relies on the construction of a model
capturing the system's operation. In the most interesting situations, the
model needs to capture strongly nonlinear effects and deal with the presence
of uncertainty and noise. Building models for such systems purely based on a
theoretical understanding of underlying physical principles can be infeasibly
complex and require a large number of simplifying assumptions. An alternative
is to use a data-driven approach, which builds a model directly from
observations. A powerful and principled approach to doing this is to use a
Gaussian Process (GP).

In this thesis we start by discussing how GPs can
be applied to data sets which have noise affecting their inputs. We present
the "Noisy Input GP", which uses a simple local-linearisation to refer the
input noise into heteroscedastic output noise, and compare it to other
methods both theoretically and empirically. We show that this technique leads
to an effective model for nonlinear functions with input and output noise. We
then consider the broad topic of GP state space models for application to
dynamical systems. We discuss a very wide variety of approaches for using GPs
in state space models, including introducing a new method based on
moment-matching, which consistently gave the best performance. We analyse the
methods in some detail including providing a systematic comparison between
approximate-analytic and particle methods. To our knowledge such a comparison
has not been provided before in this area. Finally, we investigate an
automatic control learning framework, which uses Gaussian Processes to model
a system for which we wish to design a controller. Controller design for
complex systems is a difficult task and thus a framework which allows an
automatic design directly from data promises to be extremely useful. We
demonstrate that the previously published framework cannot cope with the
presence of observation noise but that the introduction of a state space
model dramatically improves its performance. This contribution, along with
some other suggested improvements, opens the door for this framework to be
used in real-world applications.

Andrew Gordon Wilson.
**Covariance
Kernels for Fast Automatic Pattern Discovery and Extrapolation with
Gaussian Processes**.
PhD thesis, University of Cambridge, Cambridge, UK, 2014.

** Abstract:** Truly intelligent systems are capable of pattern discovery and
extrapolation without human intervention. Bayesian nonparametric models,
which can uniquely represent expressive prior information and detailed
inductive biases, provide a distinct opportunity to develop intelligent
systems, with applications in essentially any learning and prediction
task.

Gaussian processes are rich distributions over functions, which
provide a Bayesian nonparametric approach to smoothing and interpolation. A
covariance kernel determines the support and inductive biases of a Gaussian
process. In this thesis, we introduce new covariance kernels to enable fast
automatic pattern discovery and extrapolation with Gaussian processes.

In
the introductory chapter, we discuss the high level principles behind all of
the models in this thesis: 1) we can typically improve the predictive
performance of a model by accounting for additional structure in data; 2) to
automatically discover rich structure in data, a model must have large
support and the appropriate inductive biases; 3) we most need expressive
models for large datasets, which typically provide more information for
learning structure, and 4) we can often exploit the existing inductive biases
(assumptions) or structure of a model for scalable inference, without the
need for simplifying assumptions.

In the context of this introduction, we
then discuss, in chapter 2, Gaussian processes as kernel machines, and my
views on the future of Gaussian process research.

In chapter 3 we
introduce the Gaussian process regression network (GPRN) framework, a
multi-output Gaussian process method which scales to many output variables,
and accounts for input-dependent correlations between the outputs. Underlying
the GPRN is a highly expressive kernel, formed using an adaptive mixture of
latent basis functions in a neural network like architecture. The GPRN is
capable of discovering expressive structure in data. We use the GPRN to model
the time-varying expression levels of 1000 genes, the spatially varying
concentrations of several distinct heavy metals, and multivariate volatility
(input dependent noise covariances) between returns on equity indices and
currency exchanges, which is particularly valuable for portfolio allocation.
We generalise the GPRN to an adaptive network framework, which does not
depend on Gaussian processes or Bayesian nonparametrics; and we outline
applications for the adaptive network in nuclear magnetic resonance (NMR)
spectroscopy, ensemble learning, and change-point modelling.

In chapter 4
we introduce simple closed form kernels for automatic pattern discovery and
extrapolation. These spectral mixture (SM) kernels are derived by modelling
the spectral density of a kernel (its Fourier transform) using a
scale-location Gaussian mixture. SM kernels form a basis for all stationary
covariances, and can be used as a drop-in replacement for standard kernels,
as they retain simple and exact learning and inference procedures. We use the
SM kernel to discover patterns and perform long range extrapolation on
atmospheric CO2 trends and airline passenger data, as well as on synthetic
examples. We also show that the SM kernel can be used to automatically
reconstruct several standard covariances. The SM kernel and the GPRN are
highly complementary; we show that using the SM kernel with adaptive basis
functions in a GPRN induces an expressive prior over non-stationary
kernels.

In chapter 5 we introduce GPatt, a method for fast
multidimensional pattern extrapolation, particularly suited to image and movie
data. Without human intervention - no hand crafting of kernel features, and
no sophisticated initialisation procedures - we show that GPatt can solve
large scale pattern extrapolation, inpainting and kernel discovery problems,
including a problem with 383,400 training points. GPatt exploits the
structure of a spectral mixture product (SMP) kernel, for fast yet exact
inference procedures. We find that GPatt significantly outperforms popular
alternative scalable Gaussian process methods in speed and accuracy.
Moreover, we discover profound differences between each of these methods,
suggesting expressive kernels, nonparametric representations, and scalable
inference which exploits existing model structure are useful in combination
for modelling large scale multidimensional patterns.

The models in this
dissertation have proven to be scalable and with greatly enhanced predictive
performance over the alternatives: the extra structure being modelled is an
important part of a wide variety of real data - including problems in
econometrics, gene expression, geostatistics, nuclear magnetic resonance
spectroscopy, ensemble learning, multi-output regression, change point
modelling, time series, multivariate volatility, image inpainting, texture
extrapolation, video extrapolation, acoustic modelling, and kernel
discovery.

Andrew Gordon Wilson, Yuting Wu, Daniel J. Holland, Sebastian Nowozin,
Mick D. Mantle, Lynn F. Gladden, and Andrew Blake.
**Bayesian inference for NMR
spectroscopy with applications to chemical quantification**.
*arXiv preprint arXiv:1402.3580*, 2014.

** Abstract:**
Nuclear magnetic resonance (NMR) spectroscopy exploits the magnetic
properties of atomic nuclei to discover the structure, reaction state and
chemical environment of molecules. We propose a probabilistic generative
model and inference procedures for NMR spectroscopy. Specifically, we use a
weighted sum of trigonometric functions undergoing exponential decay to model
free induction decay (FID) signals. We discuss the challenges in estimating
the components of this general model - amplitudes, phase shifts,
frequencies, decay rates, and noise variances - and offer practical
solutions. We compare with conventional Fourier transform spectroscopy for
estimating the relative concentrations of chemicals in a mixture, using
synthetic and experimentally acquired FID signals. We find the proposed model
is particularly robust to low signal to noise ratios (SNR), and overlapping
peaks in the Fourier transform of the FID, enabling accurate predictions
(e.g., 1% error at low SNR) which are not possible with conventional
spectroscopy (5% error).
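
The signal model described in this abstract, a weighted sum of trigonometric functions undergoing exponential decay, can be sketched as follows. This is a minimal illustration only: the function name `fid_signal`, the real-valued cosine form, and all parameter values are assumptions made here, and the paper's exact parameterisation and inference procedure may differ.

```python
import numpy as np

def fid_signal(t, amplitudes, frequencies, phases, decay_rates):
    """Sum of exponentially decaying cosines (hypothetical helper, not from the paper)."""
    t = np.asarray(t, dtype=float)[:, None]           # shape (T, 1)
    a = np.asarray(amplitudes, dtype=float)           # shape (M,)
    f = np.asarray(frequencies, dtype=float)
    p = np.asarray(phases, dtype=float)
    d = np.asarray(decay_rates, dtype=float)
    # Each component: amplitude * cos(2*pi*f*t + phase) * exp(-decay * t)
    return (a * np.cos(2.0 * np.pi * f * t + p) * np.exp(-d * t)).sum(axis=1)

# Illustrative values only: two resonances observed with additive Gaussian noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 512)
clean = fid_signal(t, [1.0, 0.4], [50.0, 120.0], [0.0, 0.3], [3.0, 5.0])
noisy = clean + 0.05 * rng.standard_normal(t.shape)
```

Under this sketch, the relative amplitudes play the role of the relative concentrations that the abstract mentions estimating.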

David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, and
Zoubin Ghahramani.
**Structure
discovery in nonparametric regression through compositional kernel
search**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Despite its
importance, choosing the structural form of the kernel in nonparametric
regression remains a black art. We define a space of kernel structures which
are built compositionally by adding and multiplying a small number of base
kernels. We present a method for searching over this space of structures
which mirrors the scientific discovery process. The learned structures can
often decompose functions into interpretable components and enable long-range
extrapolation on time-series datasets. Our structure search method
outperforms many widely used kernels and kernel combination methods on a
variety of prediction tasks.
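
The idea of building kernel structures by adding and multiplying base kernels can be illustrated with a short sketch. The base kernels chosen here (squared exponential, periodic, linear), the helper names, and the hyperparameter values are assumptions for illustration; the search procedure over structures described in the abstract is not implemented.

```python
import numpy as np

def rbf(lengthscale=1.0):
    # Squared-exponential base kernel on 1-D inputs.
    return lambda x, y: np.exp(-0.5 * (x[:, None] - y[None, :]) ** 2 / lengthscale**2)

def periodic(period=1.0, lengthscale=1.0):
    # Standard periodic base kernel on 1-D inputs.
    return lambda x, y: np.exp(
        -2.0 * np.sin(np.pi * (x[:, None] - y[None, :]) / period) ** 2 / lengthscale**2
    )

def linear():
    # Linear base kernel on 1-D inputs.
    return lambda x, y: x[:, None] * y[None, :]

def add(k1, k2):
    # The sum of two valid kernels is a valid kernel.
    return lambda x, y: k1(x, y) + k2(x, y)

def mul(k1, k2):
    # The elementwise product of two valid kernels is a valid kernel.
    return lambda x, y: k1(x, y) * k2(x, y)

# One composite structure: a locally periodic component plus a linear trend.
kernel = add(mul(rbf(2.0), periodic(1.0)), linear())
x = np.linspace(0.0, 3.0, 20)
K = kernel(x, x)
```

Because sums and products preserve positive semi-definiteness, any expression built this way yields a valid covariance matrix `K`.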

Creighton Heaukulani and Zoubin Ghahramani.
**Dynamic
probabilistic models for latent feature propagation in social
networks**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Current Bayesian
models for dynamic social network data have focused on modelling the
influence of evolving unobserved structure on observed social interactions.
However, an understanding of how observed social relationships from the past
affect future unobserved structure in the network has been neglected. In this
paper, we introduce a new probabilistic model for capturing this phenomenon,
which we call latent feature propagation, in social networks. We demonstrate
our model's capability for inferring such latent structure in varying types
of social network datasets, and experimental studies show this structure
achieves higher predictive performance on link prediction and forecasting
tasks.

Yue Wu, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Dynamic covariance
models for multivariate financial time series**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** The accurate
prediction of time-changing covariances is an important problem in the
modeling of multivariate financial data. However, some of the most popular
models suffer from a) overfitting problems and multiple local optima, b)
failure to capture shifts in market conditions and c) large computational
costs. To address these problems we introduce a novel dynamic model for
time-changing covariances. Over-fitting and local optima are avoided by
following a Bayesian approach instead of computing point estimates. Changes
in market conditions are captured by assuming a diffusion process in
parameter values, and finally computationally efficient and scalable
inference is performed using particle filters. Experiments with financial
data show excellent performance of the proposed method with respect to
current standard models.

Andrew Gordon Wilson and Ryan Prescott Adams.
**Gaussian
process kernels for pattern discovery and extrapolation**.
In *30th International Conference on Machine Learning*, February 18
2013.

** Abstract:** Gaussian processes are rich distributions
over functions, which provide a Bayesian nonparametric approach to smoothing
and interpolation. We introduce simple closed form kernels that can be used
with Gaussian processes to discover patterns and enable extrapolation. These
kernels are derived by modelling a spectral density - the Fourier transform
of a kernel - with a Gaussian mixture. The proposed kernels support a broad
class of stationary covariances, but Gaussian process inference remains
simple and analytic. We demonstrate the proposed kernels by discovering
patterns and performing long range extrapolation on synthetic examples, as
well as atmospheric CO2 trends and airline passenger data. We also show that
we can reconstruct standard covariances within our framework.

** Comment:** arXiv:1302.4245
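
For one-dimensional inputs, modelling the spectral density with a Gaussian mixture gives a kernel that is a weighted sum of Gaussian-damped cosines, k(tau) = sum_q w_q exp(-2 pi^2 tau^2 v_q) cos(2 pi tau mu_q). The sketch below evaluates that form; the function name and the example weights, means, and variances are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def spectral_mixture_kernel(tau, weights, means, variances):
    """1-D spectral mixture kernel evaluated at lags tau (illustrative helper)."""
    tau = np.asarray(tau, dtype=float)[..., None]     # add a trailing mixture axis
    w = np.asarray(weights, dtype=float)
    mu = np.asarray(means, dtype=float)               # spectral means (frequencies)
    v = np.asarray(variances, dtype=float)            # spectral variances
    # Each Gaussian in the spectral density contributes one damped cosine.
    comps = w * np.exp(-2.0 * np.pi**2 * tau**2 * v) * np.cos(2.0 * np.pi * tau * mu)
    return comps.sum(axis=-1)

# Covariance matrix on a grid, using two mixture components (arbitrary values).
x = np.linspace(0.0, 5.0, 50)
K = spectral_mixture_kernel(x[:, None] - x[None, :], [1.0, 0.5], [0.0, 1.0], [0.05, 0.01])
```

A component with zero spectral mean reduces to a squared-exponential term, which is one way to see how standard covariances can be recovered within this family.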

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl Edward
Rasmussen.
**Bayesian inference and learning in
Gaussian process state-space models with particle MCMC**.
In L. Bottou, C.J.C. Burges, Z. Ghahramani, M. Welling, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 26*.
Curran Associates, Inc., 2013.

** Abstract:** State-space
models are successfully used in many areas of science, engineering and
economics to model time series and dynamical systems. We present a fully
Bayesian approach to inference and learning in nonlinear nonparametric
state-space models. We place a Gaussian process prior over the transition
dynamics, resulting in a flexible model able to capture complex dynamical
phenomena. However, to enable efficient inference, we marginalize over the
dynamics of the model and instead infer directly the joint smoothing
distribution through the use of specially tailored Particle Markov Chain
Monte Carlo samplers. Once a sample from the smoothing distribution is
computed, the state transition predictive distribution can be formulated
analytically. We make use of sparse Gaussian process models to greatly reduce
the computational complexity of the approach.

Roger Frigola and Carl Edward Rasmussen.
**Integrated pre-processing for
Bayesian nonlinear system identification with Gaussian processes**.
In *Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on*,
2013.

** Abstract:** We introduce GP-FNARX: a new model for
nonlinear system identification based on a nonlinear autoregressive exogenous
model (NARX) with filtered regressors (F) where the nonlinear regression
problem is tackled using sparse Gaussian processes (GP). We integrate data
pre-processing with system identification into a fully automated procedure
that goes from raw data to an identified model. Both pre-processing
parameters and GP hyper-parameters are tuned by maximizing the marginal
likelihood of the probabilistic model. We obtain a Bayesian model of the
system's dynamics which is able to report its uncertainty in regions where
the data is scarce. The automated approach, the modeling of uncertainty and
its relatively low computational cost make GP-FNARX a good candidate for
applications in robotics and adaptive control.

Andrew Gordon Wilson, Elad Gilboa, Arye Nehorai, and John P Cunningham.
**GPatt: Fast multidimensional
pattern extrapolation with Gaussian processes**.
*arXiv preprint arXiv:1310.5288*, 2013.

** Abstract:**
Gaussian processes are typically used for smoothing and interpolation on
small datasets. We introduce a new Bayesian nonparametric framework - GPatt
- enabling automatic pattern extrapolation with Gaussian processes on large
multidimensional datasets. GPatt unifies and extends highly expressive
kernels and fast exact inference techniques. Without human intervention - no
hand crafting of kernel features, and no sophisticated initialisation
procedures - we show that GPatt can solve large scale pattern extrapolation,
inpainting, and kernel discovery problems, including a problem with 383,400
training points. We find that GPatt significantly outperforms popular
alternative scalable Gaussian process methods in speed and accuracy.
Moreover, we discover profound differences between each of these methods,
suggesting expressive kernels, nonparametric representations, and scalable
inference which exploits model structure are useful in combination for
modelling large scale multidimensional patterns.

John P. Cunningham, Zoubin Ghahramani, and Carl Edward Rasmussen.
**Gaussian
Processes for time-marked time-series data**.
In *15th International Conference on Artificial Intelligence and
Statistics*, pages 255-263, 2012.

** Abstract:** In many
settings, data is collected as multiple time series, where each recorded time
series is an observation of some underlying dynamical process of interest.
These observations are often time-marked with known event times, and one
desires to do a range of standard analyses. When there is only one time
marker, one simply aligns the observations temporally on that marker. When
multiple time-markers are present and are at different times on different
time series observations, these analyses are more difficult. We describe a
Gaussian Process model for analyzing multiple time series with multiple time
markings, and we test it on a variety of data.

Marc Peter Deisenroth, Ryan D. Turner, Marco F. Huber, Uwe D. Hanebeck, and
Carl Edward Rasmussen.
**Robust
filtering and smoothing with Gaussian processes**.
*IEEE Transactions on Automatic Control*, 57(7):1865-1871, 2012, doi
10.1109/TAC.2011.2179426.

** Abstract:** We propose a
principled algorithm for robust Bayesian filtering and smoothing in nonlinear
stochastic dynamic systems when both the transition function and the
measurement function are described by nonparametric Gaussian process (GP)
models. GPs are gaining increasing importance in signal processing, machine
learning, robotics, and control for representing unknown system functions by
posterior probability distributions. This modern way of "system
identification" is more robust than finding point estimates of a parametric
function representation. Our principled filtering/smoothing approach for GP
dynamic systems is based on analytic moment matching in the context of the
forward-backward algorithm. Our numerical evaluations demonstrate the
robustness of the proposed approach in situations where other
state-of-the-art Gaussian filters and smoothers can fail.

Ryan D. Turner and Carl Edward Rasmussen.
**Model based learning
of sigma points in unscented Kalman filtering**.
*Neurocomputing*, 80:47-53, 2012, doi
10.1016/j.neucom.2011.07.029.

** Abstract:** The unscented
Kalman filter (UKF) is a widely used method in control and time series
applications. The UKF suffers from arbitrary parameters necessary for sigma
point placement, potentially causing it to perform poorly in nonlinear
problems. We show how to treat sigma point placement in a UKF as a learning
problem in a model based view. We demonstrate that learning to place the
sigma points correctly from data can make sigma point collapse much less
likely. Learning can result in a significant increase in predictive
performance over default settings of the parameters in the UKF and other
filters designed to avoid the problems of the UKF, such as the GP-ADF. At the
same time, we maintain a lower computational complexity than the other
methods. We call our method UKF-L.
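
The parameters referred to in this abstract control where sigma points are placed. As a point of reference, the sketch below computes sigma points and weights using the standard scaled unscented transform with parameters `alpha`, `beta`, and `kappa`. It shows only the placement rule under fixed, arbitrarily chosen parameter values; the learning of these parameters from data, which is the contribution of UKF-L, is not shown.

```python
import numpy as np

def sigma_points(mean, cov, alpha=1.0, beta=2.0, kappa=1.0):
    """Scaled unscented-transform sigma points and weights (standard textbook form)."""
    mean = np.asarray(mean, dtype=float)
    n = mean.size
    lam = alpha**2 * (n + kappa) - n
    # Columns of the scaled Cholesky factor give the spread directions.
    root = np.linalg.cholesky((n + lam) * np.asarray(cov, dtype=float))
    points = np.vstack([mean, mean + root.T, mean - root.T])   # shape (2n + 1, n)
    w_mean = np.full(2 * n + 1, 1.0 / (2.0 * (n + lam)))
    w_cov = w_mean.copy()
    w_mean[0] = lam / (n + lam)
    w_cov[0] = w_mean[0] + (1.0 - alpha**2 + beta)
    return points, w_mean, w_cov

# Illustrative 2-D Gaussian; the weighted points reproduce its mean and covariance.
m = np.array([1.0, -2.0])
P = np.array([[2.0, 0.5], [0.5, 1.0]])
points, w_mean, w_cov = sigma_points(m, P)
diffs = points - m
recovered_cov = np.einsum("i,ia,ib->ab", w_cov, diffs, diffs)
```

Different parameter settings move the points closer to or further from the mean, which is why treating their choice as a learning problem can matter in nonlinear settings.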

J. H. Macke, L. Busing, J. P. Cunningham, B. M. Yu, K. V. Shenoy, and
M. Sahani.
**Empirical
models of spiking in neural populations**.
In *Advances in Neural Information Processing Systems 24*, pages 1-8,
Granada, Spain, December 2011.

** Abstract:** Neurons in the
neocortex code and compute as part of a locally interconnected population.
Large-scale multi-electrode recording makes it possible to access these
population processes empirically by fitting statistical models to unaveraged
data. What statistical structure best describes the concurrent spiking of
cells within a local network? We argue that in the cortex, where firing
exhibits extensive correlations in both time and space and where a typical
sample of neurons still reflects only a very small fraction of the local
population, the most appropriate model captures shared variability by a
low-dimensional latent process evolving with smooth dynamics, rather than by
putative direct coupling. We test this claim by comparing a latent dynamical
model with realistic spiking observations to coupled generalised linear
spike-response models (GLMs) using cortical recordings. We find that the
latent dynamical approach outperforms the GLM in terms of goodness-of-fit,
and reproduces the temporal correlations in the data more accurately. We also
compare models whose observation models are derived from either Gaussian
or point-process models, finding that the non-Gaussian model provides
slightly better goodness-of-fit and more realistic population spike
counts.

B. Petreska, B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy,
and M. Sahani.
**Dynamical
segmentation of single trials from population neural data**.
In *Advances in Neural Information Processing Systems 24*, pages 1-8,
Granada, Spain, December 2011.

** Abstract:** Simultaneous
recordings of many neurons embedded within a recurrently-connected cortical
network may provide concurrent views into the dynamical processes of that
network, and thus its computational function. In principle, these dynamics
might be identified by purely unsupervised, statistical means. Here, we show
that a Hidden Switching Linear Dynamical Systems (HSLDS) model - in which
multiple linear dynamical laws approximate a nonlinear and potentially
non-stationary dynamical process - is able to distinguish dynamical regimes
within single-trial motor cortical activity associated with the preparation
and initiation of hand movements. The regimes are identified without
reference to behavioural or experimental epochs, but nonetheless transitions
between them correlate strongly with external events whose timing may vary
from trial to trial. The HSLDS model also performs better than recent
comparable models in predicting the firing rate of an isolated neuron based
on the firing rates of others, suggesting that it captures more of the
"shared variance" of the data. Thus, the method is able to trace the
dynamical processes underlying the coordinated evolution of network activity
in a way that appears to reflect its computational role.

Andrew Gordon Wilson, David A Knowles, and Zoubin Ghahramani.
**Gaussian process
regression networks**.
Technical Report arXiv:1110.4411 [stat.ML], Department of Engineering,
University of Cambridge, Cambridge, UK, October 19 2011.

** Abstract:** We introduce a new regression framework, Gaussian process
regression networks (GPRN), which combines the structural properties of
Bayesian neural networks with the non-parametric flexibility of Gaussian
processes. This model accommodates input dependent signal and noise
correlations between multiple response variables, input dependent
length-scales and amplitudes, and heavy-tailed predictive distributions. We
derive both efficient Markov chain Monte Carlo and variational Bayes
inference procedures for this model. We apply GPRN as a multiple output
regression and multivariate volatility model, demonstrating substantially
improved performance over eight popular multiple output (multi-task) Gaussian
process models and three multivariate volatility models on benchmark
datasets, including a 1000 dimensional gene expression dataset.

** Comment:** arXiv:1110.4411

J. P. Cunningham, P. Nuyujukian, V. Gilja, C. A. Chestek, S. I. Ryu, and K. V.
Shenoy.
**A
closed-loop human simulator for investigating the role of feedback-control in
brain-machine interfaces**.
*Journal of Neurophysiology*, 105:1932-1949, 2011.

** Abstract:** Neural prosthetic systems seek to improve the lives of severely
disabled people by decoding neural activity into useful behavioral commands.
These systems and their decoding algorithms are typically developed
"offline", using neural activity previously gathered from a healthy animal,
and the decoded movement is then compared with the true movement that
accompanied the recorded neural activity. However, this offline design and
testing may neglect important features of a real prosthesis, most notably the
critical role of feedback control, which enables the user to adjust neural
activity while using the prosthesis. We hypothesize that understanding and
optimally designing high-performance decoders require an experimental
platform where humans are in closed-loop with the various candidate decode
systems and algorithms. The extent to which the subject
can, for a particular decode system, algorithm, or parameter, engage feedback
and other strategies to improve decode performance remains unexplored. Closed-loop testing may
suggest different choices than offline analyses. Here we ask if a healthy
human subject, using a closed-loop neural prosthesis driven by synthetic
neural activity, can inform system design. We use this online prosthesis
simulator (OPS) to optimize "online" decode performance based on a key
parameter of a current state-of-the-art decode algorithm, the bin width of a
Kalman filter. First, we show that offline and online analyses indeed suggest
different parameter choices. Previous literature and our offline analyses
agree that neural activity should be analyzed in bins of 100- to 300-ms
width. OPS analysis, which incorporates feedback control, suggests that much
shorter bin widths (25-50 ms) yield higher decode performance. Second, we
confirm this surprising finding using a closed-loop rhesus monkey prosthetic
system. These findings illustrate the type of discovery made possible by the
OPS, and so we hypothesize that this novel testing approach will help in the
design of prosthetic systems that will translate well to human patients.

Richard E. Turner and Maneesh Sahani.
**Demodulation
as probabilistic inference**.
*Transactions on Audio, Speech and Language Processing*, 19:2398-2411,
2011.

** Abstract:** Demodulation is an ill-posed problem
whenever both carrier and envelope signals are broadband and unknown. Here,
we approach this problem using the methods of probabilistic inference. The
new approach, called Probabilistic Amplitude Demodulation (PAD), is
computationally challenging but improves on existing methods in a number of
ways. By contrast to previous approaches to demodulation, it satisfies five
key desiderata: PAD has soft constraints because it is probabilistic; PAD is
able to automatically adjust to the signal because it learns parameters; PAD
is user-steerable because the solution can be shaped by user-specific prior
information; PAD is robust to broad-band noise because this is modelled
explicitly; and PAD’s solution is self-consistent, empirically satisfying a
Carrier Identity property. Furthermore, the probabilistic view naturally
encompasses noise and uncertainty, allowing PAD to cope with missing data and
return error bars on carrier and envelope estimates. Finally, we show that
when PAD is applied to a bandpass-filtered signal, the stop-band energy of
the inferred carrier is minimal, making PAD well-suited to sub-band
demodulation.

Richard E. Turner and Maneesh Sahani.
**Probabilistic
amplitude and frequency demodulation**.
In *Advances in Neural Information Processing Systems 24*, pages
981-989. The MIT Press, 2011.

** Abstract:** A number of
recent scientific and engineering problems require signals to be decomposed
into a product of a slowly varying positive envelope and a quickly varying
carrier whose instantaneous frequency also varies slowly over time. Although
signal processing provides algorithms for so-called amplitude- and
frequency-demodulation (AFD), there are well known problems with all of the
existing methods. Motivated by the fact that AFD is ill-posed, we approach
the problem using probabilistic inference. The new approach, called
probabilistic amplitude and frequency demodulation (PAFD), models
instantaneous frequency using an auto-regressive generalization of the von
Mises distribution, and the envelopes using Gaussian auto-regressive dynamics
with a positivity constraint. A novel form of expectation propagation is used
for inference. We demonstrate that although PAFD is computationally
demanding, it outperforms previous approaches on synthetic and real signals
in clean, noisy and missing data settings.

Richard E. Turner and Maneesh Sahani.
**Two
problems with variational expectation maximisation for time-series
models**.
In D. Barber, T. Cemgil, and S. Chiappa, editors, *Bayesian Time series
models*, chapter 5, pages 109-130. Cambridge University Press,
2011.

** Abstract:** Variational methods are a key component
of the approximate inference and learning toolbox. These methods fill an
important middle ground, retaining distributional information about
uncertainty in latent variables, unlike maximum a posteriori methods (MAP),
and yet generally requiring less computational time than Markov chain
Monte Carlo methods. In particular, the variational Expectation Maximisation (vEM)
and variational Bayes algorithms, both involving variational optimisation of
a free-energy, are widely used in time-series modelling. Here, we investigate
the success of vEM in simple probabilistic time-series models. First we
consider the inference step of vEM, and show that a consequence of the
well-known compactness property of variational inference is a failure to
propagate uncertainty in time, thus limiting the usefulness of the retained
distributional information. In particular, the uncertainty may appear to be
smallest precisely when the approximation is poorest. Second, we consider
parameter learning and analytically reveal systematic biases in the
parameters found by vEM. Surprisingly, simpler variational approximations
(such as mean-field) can lead to less bias than more complicated structured
approximations.

Andrew Gordon Wilson and Zoubin Ghahramani.
**Generalised
Wishart processes**.
In *27th Conference on Uncertainty in Artificial Intelligence*, 2011.

** Abstract:** We introduce a new stochastic process called the
generalised Wishart process (GWP). It is a collection of positive
semi-definite random matrices indexed by any arbitrary input variable. We use
this process as a prior over dynamic (e.g. time varying) covariance matrices.
The GWP captures a diverse class of covariance dynamics, naturally handles
missing data, scales nicely with dimension, has easily interpretable
parameters, and can use input variables that include covariates other than
time. We describe how to construct the GWP, introduce general procedures for
inference and prediction, and show that it outperforms its main competitor,
multivariate GARCH, even on financial data that especially suits GARCH.

** Comment:** Supplementary
Material, Best Student Paper Award
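
One way to picture a process of positive semi-definite matrices indexed by an input is to sum outer products of Gaussian process draws, Sigma(x) = sum_i L u_i(x) u_i(x)^T L^T, where each entry of u_i is an independent GP function of x. The sketch below samples such matrices on a grid. The squared-exponential kernel, the jitter term, the function name `sample_gwp`, and all numerical values are assumptions made for illustration; inference and prediction are not shown.

```python
import numpy as np

def sample_gwp(x, L, nu, lengthscale=1.0, seed=0):
    """Draw Sigma(x) on a grid as a sum of nu outer products of GP vectors (sketch)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    D, N = L.shape[0], x.size
    # Squared-exponential covariance over the inputs, with jitter for stability.
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / lengthscale**2) + 1e-6 * np.eye(N)
    chol = np.linalg.cholesky(K)
    # u[i, d, :] is an independent GP draw evaluated at all N inputs.
    u = np.einsum("nm,idm->idn", chol, rng.standard_normal((nu, D, N)))
    Lu = np.einsum("ab,ibn->ian", L, u)               # apply the scale matrix L
    return np.einsum("ian,ibn->nab", Lu, Lu)          # shape (N, D, D)

# Illustrative 2x2 example with nu = 3 degrees of freedom.
x = np.linspace(0.0, 5.0, 20)
L = np.array([[1.0, 0.0], [0.5, 1.0]])
Sigma = sample_gwp(x, L, nu=3)
```

Each `Sigma[n]` is positive semi-definite by construction, and nearby inputs give similar matrices because the underlying GP draws vary smoothly.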

M. Zhao, A. P. Batista, J. P. Cunningham, C. A. Chestek, Z. Rivera-Alvidrez,
R. Kalmar, S. I. Ryu, K. V. Shenoy, and S. Iyengar.
**An
L1-regularized logistic model for detecting short-term neuronal
interactions**.
*Journal of Computational Neuroscience*, 2011, doi
10.1007/s10827-011-0365-5.
In Press.

** Abstract:** Interactions among neurons are a key
component of neural signal processing. Rich neural data sets potentially
containing evidence of interactions can now be collected readily in the
laboratory, but existing analysis methods are often not sufficiently
sensitive and specific to reveal these interactions. Generalized linear
models offer a platform for analyzing multi-electrode recordings of neuronal
spike train data. Here we suggest an L1-regularized logistic regression model
(L1L method) to detect short-term (order of 3 ms) neuronal interactions. We
estimate the parameters in this model using a coordinate descent algorithm,
and determine the optimal tuning parameter using a Bayesian Information
Criterion. Simulation studies show that in general the L1L method has better
sensitivities and specificities than those of the traditional
shuffle-corrected cross-correlogram (covariogram) method. The L1L method is
able to detect excitatory interactions with both high sensitivity and
specificity with reasonably large recordings, even when the magnitude of the
interactions is small; similar results hold for inhibition given sufficiently
high baseline firing rates. Our study also suggests that the false positives
can be further removed by thresholding, because their magnitudes are
typically smaller than true interactions. Simulations also show that the L1L
method is somewhat robust to partially observed networks. We apply the method
to multi-electrode recordings collected in the monkey dorsal premotor cortex
(PMd) while the animal prepares to make reaching arm movements. The results
show that some neurons interact differently depending on task conditions. The
stronger interactions detected with our L1L method were also visible using
the covariogram method.

Ryan Turner, Steven Bottone, and Zoubin Ghahramani.
**Fast online
anomaly detection using scan statistics**.
In Samuel Kaski, David J. Miller, Erkki Oja, and Antti Honkela, editors,
*Machine Learning for Signal Processing (MLSP 2010)*, pages 385-390,
Kittilä, Finland, August 2010.

** Abstract:** We present
methods to do fast online anomaly detection using scan statistics. Scan
statistics have long been used to detect statistically significant bursts of
events. We extend the scan statistics framework to handle many practical
issues that occur in application: dealing with an unknown background rate of
events, allowing for slow natural changes in background frequency, the
inverse problem of finding an unusual lack of events, and setting the test
parameters to maximize power. We demonstrate its use on real and synthetic
data sets with comparison to other methods.
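
As a rough illustration of the basic idea behind a scan statistic, the sketch below slides a fixed-width window over event times, records the largest count, and reports the Poisson tail probability of that count. It is a minimal sketch and not the authors' method: it assumes a known, constant background rate, whereas the paper addresses unknown and slowly changing rates. The names `poisson_tail` and `scan_statistic` are hypothetical, and the returned tail probability is for a single window only, so it is not corrected for scanning over many windows.

```python
import math


def poisson_tail(k, mu):
    """P(N >= k) for N ~ Poisson(mu), via the complement of the CDF."""
    if k <= 0:
        return 1.0
    term = math.exp(-mu)  # P(N = 0)
    cdf = term
    for i in range(1, k):
        term *= mu / i    # P(N = i) from P(N = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def scan_statistic(event_times, window, rate):
    """Return (max_count, window_start, single_window_tail).

    Windows [start, start + window) are anchored at each event time,
    which is enough to find the maximum count over all placements.
    `rate` is the assumed background rate of events per unit time.
    """
    times = sorted(event_times)
    best_count, best_start = 0, None
    j = 0
    for i, start in enumerate(times):
        # Advance j to the first event outside the current window.
        while j < len(times) and times[j] < start + window:
            j += 1
        count = j - i
        if count > best_count:
            best_count, best_start = count, start
    return best_count, best_start, poisson_tail(best_count, rate * window)


if __name__ == "__main__":
    events = [0.5, 3.0, 3.1, 3.2, 3.3, 9.0]
    print(scan_statistic(events, window=1.0, rate=0.5))
```

A small tail probability flags a burst that is unlikely under the assumed background rate; a proper test would also account for the number of windows scanned.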

Ryan Turner and Carl Edward Rasmussen.
**Model based learning
of sigma points in unscented Kalman filtering**.
In Samuel Kaski, David J. Miller, Erkki Oja, and Antti Honkela, editors,
*Machine Learning for Signal Processing (MLSP 2010)*, pages 178-183,
Kittilä, Finland, August 2010.

** Abstract:** The
unscented Kalman filter (UKF) is a widely used method in control and time
series applications. The UKF suffers from arbitrary parameters necessary for
a step known as sigma point placement, causing it to perform poorly in
nonlinear problems. We show how to treat sigma point placement in a UKF as a
learning problem in a model based view. We demonstrate that learning to place
the sigma points correctly from data can make sigma point collapse much less
likely. Learning can result in a significant increase in predictive
performance over default settings of the parameters in the UKF and other
filters designed to avoid the problems of the UKF, such as the GP-ADF. At the
same time, we maintain a lower computational complexity than the other
methods. We call our method UKF-L.
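
To make the role of the sigma point parameters concrete, the sketch below implements the standard scaled unscented transform for a scalar Gaussian. It is a minimal sketch, not UKF-L: the parameters `alpha`, `beta` and `kappa` are fixed by hand here, whereas the paper proposes learning the placement from data. The function name `unscented_transform` and the default parameter values are illustrative assumptions.

```python
import math


def unscented_transform(f, mean, var, alpha=1.0, beta=0.0, kappa=2.0):
    """Propagate N(mean, var) through f using three sigma points (scalar case).

    alpha, beta and kappa are the usual scaling parameters of the scaled
    unscented transform; the defaults are illustrative, not recommendations.
    Returns the approximate mean and variance of f(x).
    """
    n = 1                                    # state dimension
    lam = alpha ** 2 * (n + kappa) - n       # scaling parameter lambda
    spread = math.sqrt((n + lam) * var)      # distance of the outer points
    points = [mean, mean + spread, mean - spread]

    # Weights for the mean and for the covariance estimate.
    w_mean = [lam / (n + lam)] + [1.0 / (2.0 * (n + lam))] * 2
    w_cov = list(w_mean)
    w_cov[0] += 1.0 - alpha ** 2 + beta

    ys = [f(x) for x in points]
    y_mean = sum(w * y for w, y in zip(w_mean, ys))
    y_var = sum(w * (y - y_mean) ** 2 for w, y in zip(w_cov, ys))
    return y_mean, y_var


if __name__ == "__main__":
    print(unscented_transform(lambda x: 2.0 * x + 1.0, 0.5, 0.3))
    print(unscented_transform(lambda x: x * x, 0.0, 1.0))
```

For a linear function the transform is exact for any valid parameter setting. For nonlinear functions the result depends on where the points are placed, which is the sensitivity the abstract refers to.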

Yunus Saatçi, Ryan Turner, and Carl Edward Rasmussen.
**Gaussian process
change point models**.
In *27th International Conference on Machine Learning*, pages 927-934,
Haifa, Israel, June 2010.

** Abstract:** We combine Bayesian
online change point detection with Gaussian processes to create a
nonparametric time series model which can handle change points. The model can
be used to locate change points in an online manner; and, unlike other
Bayesian online change point detection algorithms, is applicable when
temporal correlations in a regime are expected. We show three variations on
how to apply Gaussian processes in the change point context, each with their
own advantages. We present methods to reduce the computational burden of
these models and demonstrate it on several real world data sets.

Ryan Turner, Marc Peter Deisenroth, and Carl Edward Rasmussen.
**State-space
inference and learning with Gaussian processes**.
In Yee Whye Teh and Mike Titterington, editors, *13th International
Conference on Artificial Intelligence and Statistics*, volume 9 of
*W & CP*, pages 868-875, Chia Laguna, Sardinia, Italy, May 13-15
2010. Journal of Machine Learning Research.

** Abstract:**
State-space inference and learning with Gaussian processes (GPs) is an
unsolved problem. We propose a new, general methodology for inference and
learning in nonlinear state-space models that are described probabilistically
by non-parametric GP models. We apply the expectation maximization algorithm
to iterate between inference in the latent state-space and learning the
parameters of the underlying GP dynamics model.

** Comment:** poster.

Richard E. Turner and Maneesh Sahani.
**Statistical
inference for single- and multi-band probabilistic amplitude
demodulation**.
In

** Abstract:** Amplitude demodulation is an ill-posed problem and so it is
natural to treat it from a Bayesian viewpoint, inferring the most likely
carrier and envelope under probabilistic constraints. One such treatment is
Probabilistic Amplitude Demodulation (PAD), which, whilst computationally
more intensive than traditional approaches, offers several advantages. Here
we provide methods for estimating the uncertainty in the PAD-derived
envelopes and carriers, and for learning free-parameters like the time-scale
of the envelope. We show how the probabilistic approach can naturally handle
noisy and missing data. Finally, we indicate how to extend the model to
signals which contain multiple modulators and carriers.

Richard E. Turner.
**Statistical
Models for Natural Sounds**.
PhD thesis, Gatsby Computational Neuroscience Unit, UCL, 2010.

** Abstract:** It is important to understand the rich structure of natural
sounds in order to solve important tasks, like automatic speech recognition,
and to understand auditory processing in the brain. This thesis takes a step
in this direction by characterising the statistics of simple natural sounds.
We focus on the statistics because perception often appears to depend on
them, rather than on the raw waveform. For example the perception of auditory
textures, like running water, wind, fire and rain, depends on
summary-statistics, like the rate of falling rain droplets, rather than on
the exact details of the physical source. In order to analyse the statistics
of sounds accurately it is necessary to improve a number of traditional
signal processing methods, including those for amplitude demodulation,
time-frequency analysis, and sub-band demodulation. These estimation tasks
are ill-posed and therefore it is natural to treat them as Bayesian inference
problems. The new probabilistic versions of these methods have several
advantages. For example, they perform more accurately on natural signals and
are more robust to noise, they can also fill-in missing sections of data, and
provide error-bars. Furthermore, free-parameters can be learned from the
signal. Using these new algorithms we demonstrate that the energy, sparsity,
modulation depth and modulation time-scale in each sub-band of a signal are
critical statistics, together with the dependencies between the sub-band
modulators. In order to validate this claim, a model containing co-modulated
coloured noise carriers is shown to be capable of generating a range of
realistic sounding auditory textures. Finally, we explored the connection
between the statistics of natural sounds and perception. We demonstrate that
inference in the model for auditory textures qualitatively replicates the
primitive grouping rules that listeners use to understand simple acoustic
scenes. This suggests that the auditory system is optimised for the
statistics of natural sounds.

Andrew Gordon Wilson and Zoubin Ghahramani.
**Copula
processes**.
In *Advances in Neural Information Processing Systems 23*, 2010.
Spotlight.

** Abstract:** We define a copula process which
describes the dependencies between arbitrarily many random variables
independently of their marginal distributions. As an example, we develop a
stochastic volatility model, Gaussian Copula Process Volatility (GCPV), to
predict the latent standard deviations of a sequence of random variables. To
make predictions we use Bayesian inference, with the Laplace approximation,
and with Markov chain Monte Carlo as an alternative. We find our model can
outperform GARCH on simulated and financial data. And unlike GARCH, GCPV can
easily handle missing data, incorporate covariates other than time, and model
a rich class of covariance structures.

** Comment:** Supplementary
Material, slides.

Ryan Turner, Marc Peter Deisenroth, and Carl Edward Rasmussen.
**System
identification in Gaussian process dynamical systems**.
In Dilan Görür, editor, *NIPS Workshop on Nonparametric Bayes*,
Whistler, BC, Canada, December 2009.

** Comment:** poster.

Ryan Turner, Yunus Saatçi, and Carl Edward Rasmussen.
**Adaptive
sequential Bayesian change point detection**.
In Zaïd Harchaoui, editor, *NIPS Workshop on Temporal Segmentation*,
Whistler, BC, Canada, December 2009.

** Abstract:** Real-world
time series are often nonstationary with respect to the parameters of some
underlying prediction model (UPM). Furthermore, it is often desirable to
adapt the UPM to incoming regime changes as soon as possible, necessitating
sequential inference about change point locations. A Bayesian algorithm for
online change point detection (BOCPD) has been introduced recently by Adams
and MacKay (2007). In this algorithm, uncertainty about the last change point
location is updated sequentially, and is integrated out to make online
predictions robust to parameter changes. BOCPD requires a set of fixed
hyper-parameters which allow the user to fully specify the hazard function
for change points and the prior distribution over the parameters of the UPM.
In practice, finding the "right" hyper-parameters can be quite difficult. We
therefore extend BOCPD by introducing hyper-parameter learning, without
sacrificing the online nature of the algorithm. Hyper-parameter learning is
performed by optimizing the marginal likelihood of the BOCPD model, a
closed-form quantity which can be computed sequentially. We illustrate
performance on three real-world datasets.
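
The run-length recursion that BOCPD is built on can be sketched as follows. This is a minimal sketch of the basic algorithm with fixed hyper-parameters and does not include the hyper-parameter learning proposed in the paper. It assumes a constant hazard rate and, as the UPM, a Gaussian with known noise variance and a conjugate Normal prior on the mean. The function names and default values are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def bocpd(data, hazard=0.01, mu0=0.0, var0=100.0, noise_var=1.0):
    """Return the run-length posterior after each observation.

    history[t][r] is the probability that the current run contains r
    observations, given data[0..t]. All hyper-parameters are fixed here.
    """
    run_probs = [1.0]                   # before any data, run length is 0
    means, variances = [mu0], [var0]    # posterior over the mean per run length
    history = []
    for x in data:
        # Predictive density of x under each run-length hypothesis.
        pred = [normal_pdf(x, m, v + noise_var) for m, v in zip(means, variances)]
        # Either the run grows by one, or a change point resets it to zero.
        growth = [p * pr * (1.0 - hazard) for p, pr in zip(run_probs, pred)]
        change = sum(p * pr * hazard for p, pr in zip(run_probs, pred))
        joint = [change] + growth
        total = sum(joint)
        run_probs = [j / total for j in joint]
        # Conjugate update of the mean posterior for every surviving run;
        # the new run (length 0) starts again from the prior.
        new_means, new_vars = [mu0], [var0]
        for m, v in zip(means, variances):
            post_var = 1.0 / (1.0 / v + 1.0 / noise_var)
            new_means.append(post_var * (m / v + x / noise_var))
            new_vars.append(post_var)
        means, variances = new_means, new_vars
        history.append(run_probs)
    return history


if __name__ == "__main__":
    series = [0.0] * 30 + [10.0] * 15
    final = bocpd(series)[-1]
    print(max(range(len(final)), key=final.__getitem__))
```

The normalising constant `total` at each step is the one-step predictive likelihood, so summing its logarithm over time gives the marginal likelihood that the abstract describes optimising.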

O. Stegle, K. Denby, S. McHattie, A. Meade, D. Wild, Z. Ghahramani, and
K. Borgwardt.
**Discovering
temporal patterns of differential gene expression in microarray time
series**.
In *German Conference on Bioinformatics*, pages 133-142, Halle,
Germany, September 2009.

** Abstract:** A wealth of time
series of microarray measurements have become available over recent years.
Several two-sample tests for detecting differential gene expression in these
time series have been defined, but they can only answer the question
*whether* a gene is differentially expressed across the whole time
series, not *in which intervals* it is differentially expressed. In this
article, we propose a Gaussian process based approach for studying these
dynamics of differential gene expression. In experiments on *Arabidopsis
thaliana* gene expression levels, our novel technique helps us to uncover
that the family of WRKY transcription factors appears to be involved in the
early response to infection by a fungal pathogen.

Marc Peter Deisenroth, Marco F. Huber, and Uwe D. Hanebeck.
**Analytic
moment-based Gaussian process filtering**.
In Léon Bottou and Michael Littman, editors, *26th International
Conference on Machine Learning*, pages 225-232, Montréal, QC,
Canada, June 2009. Omnipress.

** Abstract:** We propose an
analytic moment-based filter for nonlinear stochastic dynamic systems modeled
by Gaussian processes. Exact expressions for the expected value and the
covariance matrix are provided for both the prediction step and the filter
step, where an additional Gaussian assumption is exploited in the latter
case. Our filter does not require further approximations. In particular, it
avoids finite-sample approximations. We compare the filter to a variety of
Gaussian filters, that is, the EKF, the UKF, and the recent GP-UKF proposed
by Ko et al. (2007).

** Comment:** With corrections. code.

T. Stepleton, Z. Ghahramani, G. Gordon, and T.-S. Lee.
**The block
diagonal infinite hidden Markov model**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 552-559,
Clearwater Beach, FL, USA, April 2009. Microtome Publishing (paper) Journal
of Machine Learning Research.
ISSN 1938-7228.

** Abstract:** The Infinite Hidden Markov Model
(IHMM) extends hidden Markov models to have a countably infinite number of
hidden states (Beal et al., 2002; Teh et al.,
2006). We present a generalization of this framework that introduces nearly
block-diagonal structure in the transitions between the hidden states, where
blocks correspond to "sub-behaviors" exhibited by data sequences. In
identifying such structure, the model classifies, or partitions, sequence
data according to these sub-behaviors in an unsupervised way. We present an
application of this model to artificial data, a video gesture classification
task, and a musical theme labeling task, and show that components of the
model can also be applied to graph segmentation.

Pietro Berkes, Richard E. Turner, and Maneesh Sahani.
**A
structured model of video reproduces primary visual cortical
organisation**.
*PLoS Computational Biology*, 5(9), 09 2009, doi
10.1371/journal.pcbi.1000495.

** Abstract:** The visual
system must learn to infer the presence of objects and features in the world
from the images it encounters, and as such it must, either implicitly or
explicitly, model the way these elements interact to create the image. Do the
response properties of cells in the mammalian visual system reflect this
constraint? To address this question, we constructed a probabilistic model in
which the identity and attributes of simple visual elements were represented
explicitly and learnt the parameters of this model from unparsed, natural
video sequences. After learning, the behaviour and grouping of variables in
the probabilistic model corresponded closely to functional and anatomical
properties of simple and complex cells in the primary visual cortex (V1). In
particular, feature identity variables were activated in a way that resembled
the activity of complex cells, while feature attribute variables responded
much like simple cells. Furthermore, the grouping of the attributes within
the model closely parallelled the reported anatomical grouping of simple
cells in cat V1. Thus, this generative model makes explicit an interpretation
of complex and simple cells as elements in the segmentation of a visual scene
into basic independent features, along with a parametrisation of their
moment-by-moment appearances. We speculate that such a segmentation may form
the initial stage of a hierarchical system that progressively separates the
identity and appearance of more articulated visual elements, culminating in
view-invariant object recognition.

J. Van Gael, Y.W. Teh, and Z. Ghahramani.
**The infinite
factorial hidden Markov model**.
In D. Koller, D. Schuurmans, L. Bottou, and Y. Bengio, editors, *Advances in
Neural Information Processing Systems 21*, volume 21, pages
1697-1704, Cambridge, MA, USA, December 2008. The MIT Press.

** Abstract:** The infinite factorial hidden Markov model is a
non-parametric extension of the factorial hidden Markov model. Our model
defines a probability distribution over an infinite number of independent
binary hidden Markov chains which together produce an observable sequence of
random variables. Central to our model is a new type of non-parametric prior
distribution inspired by the Indian Buffet Process which we call the
*Indian Buffet Markov Process*.

Richard E. Turner and Maneesh Sahani.
**Modeling
natural sounds with modulation cascade processes**.
In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, *Advances in
Neural Information Processing Systems 20*, volume 20. The MIT Press, 2008.

** Abstract:** Natural sounds are
structured on many time-scales. A typical segment of speech, for example,
contains features that span four orders of magnitude: Sentences ($\sim 1$ s);
phonemes ($\sim 10^{-1}$ s); glottal pulses ($\sim 10^{-2}$ s); and formants
($\sim 10^{-3}$ s). The auditory system uses information from each of these
time-scales to solve complicated tasks such as auditory scene analysis [1].
One route toward understanding how auditory processing accomplishes this
analysis is to build neuroscience-inspired algorithms which solve similar
tasks and to compare the properties of these algorithms with properties of
auditory processing. There is however a discord: Current machine-audition
algorithms largely concentrate on the shorter time-scale structures in
sounds, and the longer structures are ignored. The reason for this is
two-fold. Firstly, it is a difficult technical problem to construct an
algorithm that utilises both sorts of information. Secondly, it is
computationally demanding to simultaneously process data both at high
resolution (to extract short temporal information) and for long duration (to
extract long temporal information). The contribution of this work is to
develop a new statistical model for natural sounds that captures structure
across a wide range of time-scales, and to provide efficient learning and
inference algorithms. We demonstrate the success of this approach on a
missing data task.

Jurgen Van Gael, Yunus Saatçi, Yee-Whye Teh, and Zoubin Ghahramani.
**Beam sampling
for the infinite hidden Markov model**.
In *25th International Conference on Machine Learning*, volume 25,
pages 1088-1095, Helsinki, Finland, 2008. Association for Computing
Machinery.

** Abstract:** The infinite hidden Markov model is
a non-parametric extension of the widely used hidden Markov model. Our paper
introduces a new inference algorithm for the infinite Hidden Markov model
called beam sampling. Beam sampling combines slice sampling, which limits the
number of states considered at each time step to a finite number, with
dynamic programming, which samples whole state trajectories efficiently. Our
algorithm typically outperforms the Gibbs sampler and is more robust. We
present applications of iHMM inference using the beam sampler on changepoint
detection and text prediction problems.

Richard E. Turner and M. Sahani.
**Probabilistic
amplitude demodulation**.
In *7th International Conference on Independent Component Analysis and
Signal Separation*, pages 544-551, 2007.

** Abstract:**
Auditory scene analysis is extremely challenging. One approach, perhaps that
adopted by the brain, is to shape useful representations of sounds on prior
knowledge about their statistical structure. For example, sounds with
harmonic sections are common and so time-frequency representations are
efficient. Most current representations concentrate on the shorter
components. Here, we propose representations for structures on longer
time-scales, like the phonemes and sentences of speech. We decompose a sound
into a product of processes, each with its own characteristic time-scale.
This demodulation cascade relates to classical amplitude demodulation, but
traditional algorithms fail to realise the representation fully. A n