
Matej Balog, Ilya Tolstikhin, and Bernhard Schölkopf.
**Differentially
private database release via kernel mean embeddings**.
In *35th International Conference on Machine Learning*, Stockholm,
Sweden, July 2018.

** Abstract:** We lay theoretical
foundations for new database release mechanisms that allow third-parties to
construct consistent estimators of population statistics, while ensuring that
the privacy of each individual contributing to the database is protected. The
proposed framework rests on two main ideas. First, releasing (an estimate of)
the kernel mean embedding of the data generating random variable instead of
the database itself still allows third-parties to construct consistent
estimators of a wide class of population statistics. Second, the algorithm
can satisfy the definition of differential privacy by basing the released
kernel mean embedding on entirely synthetic data points, while controlling
accuracy through the metric available in a Reproducing Kernel Hilbert Space.
We describe two instantiations of the proposed framework, suitable under
different scenarios, and prove theoretical results guaranteeing differential
privacy of the resulting algorithms and the consistency of estimators
constructed from their outputs.

** Comment:** [arXiv]
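
As a rough illustration of the "metric available in a Reproducing Kernel Hilbert Space" mentioned in the abstract above, the sketch below computes the RKHS distance between the empirical kernel mean embeddings of two point sets. The Gaussian kernel, the bandwidth, the function names, and the toy data are all assumptions made for this example. It does not implement the paper's release mechanisms and provides no differential privacy guarantee.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as tuples of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, kernel):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def rkhs_distance(xs, ys, kernel=gaussian_kernel):
    """RKHS distance between the empirical kernel mean embeddings of xs and ys.

    Uses ||mu_x - mu_y||^2 = <mu_x, mu_x> - 2 <mu_x, mu_y> + <mu_y, mu_y>.
    """
    squared = (
        mean_kernel(xs, xs, kernel)
        - 2.0 * mean_kernel(xs, ys, kernel)
        + mean_kernel(ys, ys, kernel)
    )
    return math.sqrt(max(squared, 0.0))  # clamp tiny negative rounding errors


# Hypothetical toy data: a "private" sample and two candidate synthetic sets.
private_data = [(0.0,), (0.5,), (1.0,), (1.5,)]
synthetic_close = [(0.1,), (0.6,), (0.9,), (1.4,)]
synthetic_far = [(5.0,), (5.5,), (6.0,), (6.5,)]

print(rkhs_distance(private_data, synthetic_close))
print(rkhs_distance(private_data, synthetic_far))
```

A synthetic set whose embedding is close to that of the original data in this distance supports similar estimates of kernel-based statistics, which is the sense in which accuracy can be controlled.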

Matej Balog, Nilesh Tripuraneni, Zoubin Ghahramani, and Adrian Weller.
**Lost
relatives of the Gumbel trick**.
In *34th International Conference on Machine Learning*, Sydney,
Australia, August 2017.

** Abstract:** The Gumbel trick is a
method to sample from a discrete probability distribution, or to estimate its
normalizing partition function. The method relies on repeatedly applying a
random perturbation to the distribution in a particular way, each time
solving for the most likely configuration. We derive an entire family of
related methods, of which the Gumbel trick is one member, and show that the
new methods have superior properties in several settings with minimal
additional computational cost. In particular, for the Gumbel trick to yield
computational benefits for discrete graphical models, Gumbel perturbations on
all configurations are typically replaced with so-called low-rank
perturbations. We show how a subfamily of our new methods adapts to this
setting, proving new upper and lower bounds on the log partition function and
deriving a family of sequential samplers for the Gibbs distribution. Finally,
we balance the discussion by showing how the simpler analytical form of the
Gumbel trick enables additional theoretical results.

** Comment:** [arXiv] [Poster]
[Slides]
[Code]
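
The basic Gumbel trick summarised in the abstract above can be sketched in a few lines. This is a minimal illustration for a small, fully enumerated distribution, assuming standard Gumbel(0, 1) noise and subtracting the Euler-Mascheroni constant to correct the mean. The function names and the toy potentials are hypothetical, and none of the paper's new estimators or low-rank perturbations are shown.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # mean of a standard Gumbel(0, 1) variate


def sample_gumbel(rng):
    """Draw a standard Gumbel(0, 1) variate by inverse transform sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_sample(log_potentials, rng):
    """Perturb every configuration and return (argmax index, max value).

    The argmax is distributed according to softmax(log_potentials).
    """
    perturbed = [phi + sample_gumbel(rng) for phi in log_potentials]
    best = max(range(len(perturbed)), key=perturbed.__getitem__)
    return best, perturbed[best]


def estimate_log_partition(log_potentials, num_samples, rng):
    """Estimate log Z as the average perturbed maximum minus Euler's constant."""
    total = sum(
        gumbel_max_sample(log_potentials, rng)[1] for _ in range(num_samples)
    )
    return total / num_samples - EULER_GAMMA


rng = random.Random(0)
log_potentials = [0.0, 1.0, 2.0]  # hypothetical unnormalised log-probabilities
true_log_z = math.log(sum(math.exp(phi) for phi in log_potentials))
print(true_log_z, estimate_log_partition(log_potentials, 20000, rng))
```

Perturbing every configuration, as done here, is only feasible for tiny state spaces, which is why the abstract discusses low-rank perturbations for graphical models.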

Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and
Daniel Tarlow.
**DeepCoder:
Learning to write programs**.
In *5th International Conference on Learning Representations*, Toulon,
France, April 2017.

** Abstract:** We develop a first line of
attack for solving programming competition-style problems from input-output
examples using deep learning. The approach is to train a neural network to
predict properties of the program that generated the outputs from the inputs.
We use the neural network's predictions to augment search techniques from the
programming languages community, including enumerative search and an
SMT-based solver. Empirically, we show that our approach leads to an order of
magnitude speedup over the strong non-augmented baselines and a Recurrent
Neural Network approach, and that we are able to solve problems of difficulty
comparable to the simplest problems on programming competition websites.

Matej Balog, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M. Roy, and
Yee Whye Teh.
**The
Mondrian kernel**.
In *32nd Conference on Uncertainty in Artificial Intelligence*, pages
32-41, Jersey City, New Jersey, USA, June 2016.

**
Abstract:** We introduce the Mondrian kernel, a fast random feature
approximation to the Laplace kernel. It is suitable for both batch and online
learning, and admits a fast kernel-width-selection procedure as the random
features can be re-used efficiently for all kernel widths. The features are
constructed by sampling trees via a Mondrian process [Roy and Teh, 2009], and
we highlight the connection to Mondrian forests [Lakshminarayanan et al.,
2014], where trees are also sampled via a Mondrian process, but fit
independently. This link provides a new insight into the relationship between
kernel methods and random forests.

** Comment:** [Supplementary
Material] [arXiv] [Poster]
[Slides]
[Code]
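
To make the random feature idea in the abstract above concrete, here is a minimal one-dimensional sketch. It relies on the fact that a Mondrian process on an interval reduces to a Poisson process of cut points whose rate equals the lifetime, so two inputs share a cell with probability exp(-lifetime * |x - y|), which is the Laplace kernel. The function names, the interval, and the parameter values are assumptions for this example; the multi-dimensional tree sampling and the kernel-width-selection procedure from the paper are not shown.

```python
import bisect
import math
import random


def sample_cuts(lifetime, lower, upper, rng):
    """Sorted cut points of a 1-D Mondrian process on [lower, upper].

    In one dimension the cuts form a Poisson process with rate `lifetime`,
    so they are generated here from exponential gaps.
    """
    cuts = []
    position = lower + rng.expovariate(lifetime)
    while position < upper:
        cuts.append(position)
        position += rng.expovariate(lifetime)
    return cuts


def cell_index(x, cuts):
    """Index of the partition cell that contains x."""
    return bisect.bisect_right(cuts, x)


def mondrian_kernel_estimate(x, y, partitions):
    """Fraction of sampled partitions in which x and y fall into the same cell."""
    same = sum(cell_index(x, cuts) == cell_index(y, cuts) for cuts in partitions)
    return same / len(partitions)


def laplace_kernel(x, y, lifetime):
    """Laplace kernel with inverse width equal to the Mondrian lifetime."""
    return math.exp(-lifetime * abs(x - y))


rng = random.Random(0)
lifetime = 2.0  # hypothetical value, plays the role of the inverse kernel width
partitions = [sample_cuts(lifetime, 0.0, 1.0, rng) for _ in range(20000)]
print(mondrian_kernel_estimate(0.2, 0.5, partitions), laplace_kernel(0.2, 0.5, lifetime))
```

Each partition acts as one random feature (a one-hot indicator of the cell), and averaging over partitions gives the kernel approximation.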

Jonathan Gordon, John Bronskill, Matthias Bauer, Sebastian Nowozin, and Richard
Turner.
**Meta-learning
probabilistic inference for prediction**.
In *7th International Conference on Learning Representations*, New Orleans,
April 2019.

** Abstract:** This paper introduces a
new framework for data efficient and versatile learning. Specifically: 1) We
develop ML-PIP, a general framework for Meta-Learning approximate
Probabilistic Inference for Prediction. ML-PIP extends existing probabilistic
interpretations of meta-learning to cover a broad class of methods. 2) We
introduce Versa, an instance of the framework employing a flexible and
versatile amortization network that takes few-shot learning datasets as
inputs, with arbitrary numbers of shots, and outputs a distribution over
task-specific parameters in a single forward pass. Versa substitutes
optimization at test time with forward passes through inference networks,
amortizing the cost of inference and relieving the need for second
derivatives during training. 3) We evaluate Versa on benchmark datasets
where the method sets new state-of-the-art results and can handle arbitrary
numbers of shots and, for classification, arbitrary numbers of classes at
train and test time. The power of the approach is then demonstrated through a
challenging few-shot ShapeNet view reconstruction task.

Matthias Stephan Bauer, Mark van der Wilk, and Carl Edward Rasmussen.
**Understanding
probabilistic sparse Gaussian process approximations**.
In *Advances in Neural Information Processing Systems 29*, 2016.

** Abstract:** Good sparse approximations are essential for
practical inference in Gaussian Processes as the computational cost of exact
methods is prohibitive for large datasets. The Fully Independent Training
Conditional (FITC) and the Variational Free Energy (VFE) approximations are
two recent popular methods. Despite superficial similarities, these
approximations have surprisingly different theoretical properties and behave
differently in practice. We thoroughly investigate the two methods for
regression both analytically and through illustrative examples, and draw
conclusions to guide practical application.

** Comment:** arXiv

C. Lippert, Z. Ghahramani, and K. Borgwardt.
**Gene function
prediction from synthetic lethality networks via ranking on demand**.
*Bioinformatics*, 26:912-918, 2010.

** Abstract:**
Motivation: Synthetic lethal interactions represent pairs of genes whose
individual mutations are not lethal, while the double mutation of both genes
does incur lethality. Several studies have shown a correlation between
functional similarity of genes and their distances in networks based on
synthetic lethal interactions. However, there is a lack of algorithms for
predicting gene function from synthetic lethality interaction networks.

Results: In this article, we present a novel technique called kernelROD for
gene function prediction from synthetic lethal interaction networks based on
kernel machines. We apply our novel algorithm to Gene Ontology functional
annotation prediction in yeast. Our experiments show that our method leads to
improved gene function prediction compared with state-of-the-art competitors
and that combining genetic and congruence networks leads to a further
improvement in prediction accuracy.

O. Stegle, K. J. Denby, E. J. Cooke, D. L. Wild, Z. Ghahramani, and K. M.
Borgwardt.
**A
robust Bayesian two-sample test for detecting intervals of differential
gene expression in microarray time series**.
*Journal of Computational Biology*, 17(3):1-13, 2010, doi
10.1089/cmb.2009.0175.

** Abstract:** Understanding the
regulatory mechanisms that are responsible for an organism's response to
environmental change is an important issue in molecular biology. A first and
important step towards this goal is to detect genes whose expression levels
are affected by altered external conditions. A range of methods to test for
differential gene expression, both in static as well as in time-course
experiments, have been proposed. While these tests answer the question
*whether* a gene is differentially expressed, they do not explicitly
address the question *when* a gene is differentially expressed, although
this information may provide insights into the course and causal structure of
regulatory programs. In this article, we propose a two-sample test for
identifying intervals of differential gene expression in microarray time
series. Our approach is based on Gaussian process regression, can deal with
arbitrary numbers of replicates, and is robust with respect to outliers. We
apply our algorithm to study the response of *Arabidopsis thaliana*
genes to an infection by a fungal pathogen using a microarray time series
dataset covering 30,336 gene probes at 24 observed time points. In
classification experiments, our test compares favorably with existing methods
and provides additional insights into time-dependent differential
expression.

O. Stegle, K. Denby, S. McHattie, A. Meade, D. Wild, Z. Ghahramani, and
K. Borgwardt.
**Discovering
temporal patterns of differential gene expression in microarray time
series**.
In *German Conference on Bioinformatics*, pages 133-142, Halle,
Germany, September 2009.

** Abstract:** A wealth of time
series of microarray measurements have become available over recent years.
Several two-sample tests for detecting differential gene expression in these
time series have been defined, but they can only answer the question
*whether* a gene is differentially expressed across the whole time
series, not *in which intervals* it is differentially expressed. In this
article, we propose a Gaussian process based approach for studying these
dynamics of differential gene expression. In experiments on *Arabidopsis
thaliana* gene expression levels, our novel technique helps us to uncover
that the family of WRKY transcription factors appears to be involved in the
early response to infection by a fungal pathogen.

C. Lippert, O. Stegle, Z. Ghahramani, and K. Borgwardt.
**A kernel
method for unsupervised structured network inference**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 368-375,
Clearwater Beach, FL, USA, April 2009. Journal of Machine Learning Research.
ISSN: 1938-7228.

** Abstract:** Network inference is the problem
of inferring edges between a set of real-world objects, for instance,
interactions between pairs of proteins in bioinformatics. Current
kernel-based approaches to this problem share a set of common features: (i)
they are supervised and hence require labeled training data; (ii) edges in
the network are treated as mutually independent and hence topological
properties are largely ignored; (iii) they lack a statistical interpretation.
We argue that these common assumptions are often undesirable for network
inference, and propose (i) an unsupervised kernel method (ii) that takes the
global structure of the network into account and (iii) is statistically
motivated. We show that our approach can explain commonly used heuristics in
statistical terms. In experiments on social networks, different variants of
our method demonstrate appealing predictive performance.

Karsten M. Borgwardt and Zoubin Ghahramani.
**Bayesian two-sample
tests**.
*arXiv*, abs/0906.4032, 2009.

** Abstract:** In this
paper, we present two classes of Bayesian approaches to the two-sample
problem. Our first class of methods extends the Bayesian t-test to include
all parametric models in the exponential family and their conjugate priors.
Our second class of methods uses Dirichlet process mixtures (DPM) of such
conjugate-exponential distributions as flexible nonparametric priors over the
unknown distributions.

O. Stegle, K. Denby, David L. Wild, Zoubin Ghahramani, and Karsten Borgwardt.
**A robust
Bayesian two-sample test for detecting intervals of differential gene
expression in microarray time series**.
In *13th Annual International Conference on Research in Computational
Molecular Biology (RECOMB 2009)*, volume 5541 of *Lecture Notes in
Bioinformatics*, pages 201-216, Tucson, AZ, USA, 2009. Springer-Verlag,
doi
10.1007/978-3-642-02008-7_14.

** Abstract:** Understanding
the regulatory mechanisms that are responsible for an organism's response to
environmental changes is an important question in molecular biology. A first
and important step towards this goal is to detect genes whose expression
levels are affected by altered external conditions. A range of methods to
test for differential gene expression, both in static as well as in
time-course experiments, have been proposed. While these tests answer the
question *whether* a gene is differentially expressed, they do not
explicitly address the question *when* a gene is differentially
expressed, although this information may provide insights into the course and
causal structure of regulatory programs. In this article, we propose a
two-sample test for identifying *intervals* of differential gene
expression in microarray time series. Our approach is based on Gaussian
process regression, can deal with arbitrary numbers of replicates and is
robust with respect to outliers. We apply our algorithm to study the response
of *Arabidopsis thaliana* genes to an infection by a fungal pathogen
using a microarray time series dataset covering 30,336 gene probes at 24 time
points. In classification experiments our test compares favorably with
existing methods and provides additional insights into time-dependent
differential expression.

C. Hübler, K. Borgwardt, H.-P. Kriegel, and Z. Ghahramani.
**Metropolis
algorithms for representative subgraph sampling**.
In *Proceedings of 8th IEEE International Conference on Data Mining (ICDM
2008)*, pages 283-292, Pisa, Italy, December 2008. IEEE.
ISSN: 1550-4786.

** Abstract:** While data mining in
chemoinformatics studied graph data with dozens of nodes, systems biology and
the Internet are now generating graph data with thousands and millions of
nodes. Hence data mining faces the algorithmic challenge of coping with this
significant increase in graph size: Classic algorithms for data analysis are
often too expensive and too slow on large graphs.

While one strategy to
overcome this problem is to design novel efficient algorithms, the other is
to 'reduce' the size of the large graph by sampling. This is the scope of
this paper: We will present novel Metropolis algorithms for sampling a
'representative' small subgraph from the original large graph, with
'representative' describing the requirement that the sample shall preserve
crucial graph properties of the original graph. In our experiments, we
improve over the pioneering work of Leskovec and Faloutsos (KDD 2006), by
producing representative subgraph samples that are both smaller and of higher
quality than those produced by other methods from the literature.

Sébastien Bratières, Novi Quadrianto, Sebastian Nowozin, and Zoubin
Ghahramani.
**Scalable
Gaussian Process structured prediction for grid factor graph
applications**.
In *31st International Conference on Machine Learning*, 2014.

** Abstract:** Structured prediction is an important and well
studied problem with many applications across machine learning. GPstruct is a
recently proposed structured prediction model that offers appealing
properties such as being kernelised, non-parametric, and supporting Bayesian
inference (Bratières et al. 2013). The model places a Gaussian process prior
over energy functions which describe relationships between input variables
and structured output variables. However, the memory demand of GPstruct is
quadratic in the number of latent variables and training runtime scales
cubically. This prevents GPstruct from being applied to problems involving
grid factor graphs, which are prevalent in computer vision and spatial
statistics applications. Here we explore a scalable approach to learning
GPstruct models based on ensemble learning, with weak learners (predictors)
trained on subsets of the latent variables and bootstrap data, which can
easily be distributed. We show experiments with 4M latent variables on image
segmentation. Our method outperforms widely-used conditional random field
models trained with pseudo-likelihood. Moreover, in image segmentation
problems it improves over recent state-of-the-art marginal optimisation
methods in terms of predictive performance and uncertainty calibration.
Finally, it generalises well on all training set sizes.

Sébastien Bratières, Novi Quadrianto, and Zoubin Ghahramani.
**Bayesian
structured prediction using Gaussian processes**.
*arXiv*, abs/1307.3846, 2013.

** Abstract:** We introduce
a conceptually novel structured prediction model, GPstruct, which is
kernelized, non-parametric and Bayesian, by design. We motivate the model
with respect to existing approaches, among others, conditional random fields
(CRFs), maximum margin Markov networks (M3N), and structured support vector
machines (SVMstruct), which embody only a subset of its properties. We
present an inference procedure based on Markov Chain Monte Carlo. The
framework can be instantiated for a wide range of structured objects such as
linear chains, trees, grids, and other general graphs. As a proof of concept,
the model is benchmarked on several natural language processing tasks and a
video gesture segmentation task involving a linear chain structure. We show
prediction accuracies for GPstruct which are comparable to or exceeding those
of CRFs and SVMstruct.

Sébastien Bratières, Jurgen Van Gael, Andreas Vlachos, and Zoubin
Ghahramani.
**Scaling the
iHMM: Parallelization versus Hadoop**.
In *Proceedings of the 2010 10th IEEE International Conference on Computer
and Information Technology*, pages 1235-1240, Bradford, UK, 2010. IEEE
Computer Society, doi
10.1109/CIT.2010.223.

** Abstract:** This paper compares
parallel and distributed implementations of an iterative, Gibbs sampling,
machine learning algorithm. Distributed implementations run under Hadoop on
facility computing clouds. The probabilistic model under study is the
infinite HMM (Beal, Ghahramani and Rasmussen,
2002), in which parameters are learnt using an instance of blocked Gibbs
sampling, with a step consisting of a dynamic program. We apply this model to
learn part-of-speech tags from newswire text in an unsupervised fashion.
However, our focus here is on runtime performance, as opposed to NLP-relevant
scores, embodied by iteration duration, ease of development, deployment and
debugging.

Cuong V. Nguyen, Yingzhen Li, Thang D. Bui, and Richard E. Turner.
**Variational
Continual Learning**.
In *Sixth International Conference on Learning Representations*,
Vancouver, Canada, May 2018.

** Abstract:** This paper develops
variational continual learning (VCL), a simple but general framework for
continual learning that fuses online variational inference (VI) and recent
advances in Monte Carlo VI for neural networks. The framework can
successfully train both deep discriminative models and deep generative models
in complex continual learning settings where existing tasks evolve over time
and entirely new tasks emerge. Experimental results show that variational
continual learning outperforms state-of-the-art continual learning methods on
a variety of tasks, avoiding catastrophic forgetting in a fully automatic
way.

Thang D. Bui, Cuong V. Nguyen, and Richard E. Turner.
**Streaming
sparse Gaussian process approximations**.
In *Advances in Neural Information Processing Systems 31*,
volume 31, Long Beach, California, USA, December 2017.

**
Abstract:** Sparse approximations for Gaussian process models provide a
suite of methods that enable these models to be deployed in the large-data regime
and enable analytic intractabilities to be sidestepped. However, the field
lacks a principled method to handle streaming data in which the posterior
distribution over function values and the hyperparameters are updated in an
online fashion. The small number of existing approaches either use suboptimal
hand-crafted heuristics for hyperparameter learning, or suffer from
catastrophic forgetting or slow updating when new data arrive. This paper
develops a new principled framework for deploying Gaussian process
probabilistic models in the streaming setting, providing principled methods
for learning hyperparameters and optimising pseudo-input locations. The
proposed framework is experimentally validated using synthetic and real-world
datasets.

** Comment:** The first two authors contributed equally.

Thang D. Bui, Josiah Yan, and Richard E. Turner.
**A unifying framework for
Gaussian process pseudo-point approximations using power expectation
propagation**.
*Journal of Machine Learning Research*, 18(104):1-72, 2017.

** Abstract:** Gaussian processes (GPs) are
flexible distributions over functions that enable high-level assumptions
about unknown functions to be encoded in a parsimonious, flexible and general
way. Although elegant, the application of GPs is limited by computational and
analytical intractabilities that arise when data are sufficiently numerous or
when employing non-Gaussian models. Consequently, a wealth of GP
approximation schemes have been developed over the last 15 years to address
these key limitations. Many of these schemes employ a small set of pseudo
data points to summarise the actual data. In this paper we develop a new
pseudo-point approximation framework using Power Expectation Propagation
(Power EP) that unifies a large number of these pseudo-point approximations.
Unlike much of the previous venerable work in this area, the new framework is
built on standard methods for approximate inference (variational free-energy,
EP and Power EP methods) rather than employing approximations to the
probabilistic generative model itself. In this way all of the approximation
is performed at `inference time' rather than at `modelling time', resolving
awkward philosophical and empirical questions that trouble previous
approaches. Crucially, we demonstrate that the new framework includes new
pseudo-point approximation methods that outperform current approaches on
regression and classification tasks.

Thang D. Bui, Daniel Hernández-Lobato, José Miguel Hernández-Lobato,
Yingzhen Li, and Richard E. Turner.
**Deep Gaussian
processes for regression using approximate expectation propagation**.
In *33rd International Conference on Machine Learning*, New York, USA,
June 2016.

** Abstract:** Deep Gaussian processes (DGPs) are
multi-layer hierarchical generalisations of Gaussian processes (GPs) and are
formally equivalent to neural networks with multiple, infinitely wide hidden
layers. DGPs are nonparametric probabilistic models and as such are arguably
more flexible, have a greater capacity to generalise, and provide better
calibrated uncertainty estimates than alternative deep models. This paper
develops a new approximate Bayesian learning scheme that enables DGPs to be
applied to a range of medium to large scale regression problems for the first
time. The new method uses an approximate Expectation Propagation procedure
and a novel and efficient extension of the probabilistic backpropagation
algorithm for learning. We evaluate the new method for non-linear regression
on eleven real-world datasets, showing that it always outperforms GP
regression and is almost always better than state-of-the-art deterministic
and sampling-based approximate inference methods for Bayesian neural
networks. As a by-product, this work provides a comprehensive analysis of six
approximate Bayesian methods for training neural networks.

José Miguel Hernández-Lobato, Yingzhen Li, Mark Rowland, Thang D. Bui,
Daniel Hernández-Lobato, and Richard E. Turner.
**Black-box
Alpha Divergence Minimization**.
In *33rd International Conference on Machine Learning*, New York, USA,
June 2016.

** Abstract:** Black-box alpha (BB-$α$) is a
new approximate inference method based on the minimization of
$α$-divergences. BB-$α$ scales to large datasets because it can be
implemented using stochastic gradient descent. BB-$α$ can be applied to
complex probabilistic models with little effort since it only requires as
input the likelihood function and its gradients. These gradients can be
easily obtained using automatic differentiation. By changing the divergence
parameter $α$, the method is able to interpolate between variational Bayes
(VB) ($α \rightarrow 0$) and an algorithm similar to expectation
propagation (EP) ($α = 1$). Experiments on probit regression and neural
network regression and classification problems show that BB-$α$ with
non-standard settings of $α$, such as $α = 0.5$, usually produces
better predictions than with $α \rightarrow 0$ (VB) or $α = 1$
(EP).

Felipe Tobar, Thang D. Bui, and Richard E. Turner.
**Learning
stationary time series using Gaussian processes with nonparametric
kernels**.
In *Advances in Neural Information Processing Systems 29*, Montréal,
Canada, December 2015.

** Abstract:** We introduce the Gaussian
Process Convolution Model (GPCM), a two-stage nonparametric generative
procedure to model stationary signals as the convolution between a
continuous-time white-noise process and a continuous-time linear filter drawn
from a Gaussian process. The GPCM is a continuous-time nonparametric-window
moving average process and, conditionally, is itself a Gaussian process with
a nonparametric kernel defined in a probabilistic fashion. The generative
model can be equivalently considered in the frequency domain, where the power
spectral density of the signal is specified using a Gaussian process. One of
the main contributions of the paper is to develop a novel variational
free-energy approach based on inter-domain inducing variables that efficiently
learns the continuous-time linear filter and infers the driving white-noise
process. In turn, this scheme provides closed-form probabilistic estimates of
the covariance kernel and the noise-free signal both in denoising and
prediction scenarios. Additionally, the variational inference procedure
provides closed-form expressions for the approximate posterior of the
spectral density given the observed data, leading to new Bayesian
nonparametric approaches to spectrum estimation. The proposed GPCM is
validated using synthetic and real-world signals.

Thang D. Bui and Richard E. Turner.
**Tree-structured Gaussian process approximations**.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 28*,
volume 28, pages 2213-2221. Curran Associates, Inc., 2014.

** Abstract:** Gaussian process regression can be accelerated by
constructing a small pseudo-dataset to summarize the observed data. This idea
sits at the heart of many approximation schemes, but such an approach
requires the number of pseudo-datapoints to be scaled with the range of the
input space if the accuracy of the approximation is to be maintained. This
presents problems in time-series settings or in spatial datasets where large
numbers of pseudo-datapoints are required since computation typically scales
quadratically with the pseudo-dataset size. In this paper we devise an
approximation whose complexity grows linearly with the number of
pseudo-datapoints. This is achieved by imposing a tree or chain structure on
the pseudo-datapoints and calibrating the approximation using a
Kullback-Leibler (KL) minimization. Inference and learning can then be
performed efficiently using the Gaussian belief propagation algorithm. We
demonstrate the validity of our approach on a set of challenging regression
tasks including missing data imputation for audio and spatial datasets. We
trace out the speed-accuracy trade-off for the new method and show that the
frontier dominates those obtained from a large number of existing
approximation techniques.

David R. Burt, Carl Edward Rasmussen, and Mark van der Wilk.
**Convergence
of sparse variational inference in Gaussian processes regression**.
*Journal of Machine Learning Research*, 21, 2020.

**
Abstract:** Gaussian processes are distributions over functions that are
versatile and mathematically convenient priors in Bayesian modelling.
However, their use is often impeded for data with large numbers of
observations, N, due to the cubic (in N) cost of matrix operations used in
exact inference. Many solutions have been proposed that rely on $M \ll N$
inducing variables to form an approximation at a cost of $O(NM^2)$.
While the computational cost appears linear in N, the true complexity depends
on how M must scale with N to ensure a certain quality of the approximation.
In this work, we investigate upper and lower bounds on how M needs to grow
with N to ensure high quality approximations. We show that we can make the
KL-divergence between the approximate model and the exact posterior
arbitrarily small for a Gaussian-noise regression model with $M \ll N$.
Specifically, for the popular squared exponential kernel and D-dimensional
Gaussian distributed covariates, $M = O((\log N)^D)$ suffices and a
method with an overall computational cost of
$O(N (\log N)^{2D} (\log \log N)^2)$ can be used to perform inference.

David R Burt, Carl Edward Rasmussen, and Mark van der Wilk.
**Rates of convergence for sparse
variational Gaussian process regression**.
*arXiv*, 2019.

** Abstract:** Excellent variational
approximations to Gaussian process posteriors have been developed which avoid
the $O(N^3)$ scaling with dataset size N. They reduce the
computational cost to $O(NM^2)$, with $M \ll N$ being the number of
inducing variables, which summarise the process. While the computational cost
seems to be linear in N, the true complexity of the algorithm depends on how
M must increase to ensure a certain quality of approximation. We address this
by characterising the behavior of an upper bound on the KL divergence to the
posterior. We show that with high probability the KL divergence can be made
arbitrarily small by growing M more slowly than N. A particular case of
interest is that for regression with normally distributed inputs in
D-dimensions with the popular Squared Exponential kernel,
$M = O(\log^D N)$ is sufficient. Our results show that as datasets grow,
Gaussian process posteriors can truly be approximated cheaply, and provide a
concrete rule for how to increase M in continual learning scenarios.

Jan-Peter Calliess, Stephen Roberts, Carl Edward Rasmussen, and Jan
Maciejowski.
**Nonlinear set
membership regression with adaptive hyper-parameter estimation for online
learning and control**.
In *Proceedings of the European Control Conference*, 2018.

** Abstract:** Methods known as Lipschitz Interpolation or
Nonlinear Set Membership regression have become established tools for
nonparametric system-identification and data-based control. They utilise
presupposed Lipschitz properties to compute inferences over unobserved
function values. Unfortunately, they rely on the a priori knowledge of a
Lipschitz constant of the underlying target function which serves as a
hyperparameter. We propose a closed-form estimator of the Lipschitz constant
that is robust to bounded observational noise in the data. The merger of
Lipschitz Interpolation with the new hyperparameter estimator gives a new
nonparametric machine learning method for which we derive online learning
convergence guarantees. Furthermore, we apply our learning method to
model-reference adaptive control and provide a convergence guarantee on the
closed-loop dynamics. In a simulated flight manoeuvre control scenario, we
compare the performance of our approach to recently proposed alternative
learning-based controllers.

Daniel Limon, Jan-Peter Calliess, and Jan Maciejowski.
**Learning-based nonlinear model predictive control**.
In *IFAC 2017 World Congress*, Toulouse, France, July 2017. doi
10.1016/j.ifacol.2017.08.1050.

** Abstract:** This paper
presents stabilizing Model Predictive Controllers (MPC) in which prediction
models are inferred from experimental data of the inputs and outputs of the
plant. Using a nonparametric machine learning technique called LACKI, the
estimated (possibly nonlinear) model function together with an estimate of
its Hölder constant is provided. Based on these, a number of predictive
controllers with stability guaranteed by design are proposed. Firstly, the
case when the prediction model is estimated off-line is considered and
robust stability and recursive feasibility are ensured by using tightened
constraints in the optimisation problem. This controller has been extended to
the more interesting and complex case: the online learning of the model,
where the new data collected from feedback is added to enhance the prediction
model. An on-line learning MPC based on a double sequence of predictions is
proposed. Stability of the online learning MPC is proved. These controllers
are illustrated by simulation.

Jan-Peter Calliess.
**Lipschitz
optimisation for Lipschitz interpolation**.
In *2017 American Control Conference (ACC 2017)*, Seattle, WA, USA, May
2017.

** Abstract:** Techniques known as Nonlinear Set
Membership prediction, Kinky Inference or Lipschitz Interpolation are fast
and numerically robust approaches to nonparametric machine learning that have
been proposed to be utilised in the context of system identification and
learning-based control. They utilise presupposed Lipschitz properties in
order to compute inferences over unobserved function values. Unfortunately,
most of these approaches rely on exact knowledge about the input space metric
as well as about the Lipschitz constant. Furthermore, existing techniques to
estimate the Lipschitz constants from the data are not robust to noise or
seem to be ad-hoc and typically are decoupled from the ultimate learning and
prediction task. To overcome these limitations, we propose an approach for
optimising parameters of the presupposed metrics by minimising validation set
prediction errors. To avoid poor performance due to local minima, we propose
to utilise Lipschitz properties of the optimisation objective to ensure
global optimisation success. The resulting approach is a new flexible method
for nonparametric black-box learning. We illustrate its competitiveness on a
set of benchmark problems.

Jan-Peter Calliess, Nathan Korda, and Geoffrey J. Gordon.
**A distributed
mechanism for multi-agent convex optimisation and coordination with no-regret
learners**.
In *Workshop on Learning, Inference and Control of Multi-Agent Systems,
NIPS*, Barcelona, Spain, December 2016.

** Abstract:** We
develop an indirect mechanism for coordinated, distributed multi-agent
optimisation and decision-making. Our approach extends previous work in
no-regret learning based mechanism design and renders it applicable to
partial information settings. We consider planning problems that can be
stated as a collection of single-agent convex programmes coupled by common
soft constraints. A key idea is to recast the joint optimisation problem as
distributed learning in a repeated game between the original agents and a
newly introduced group of adversarial agents who influence prices for
decisions and facilitate coordination. Under the weak behavioural assumption
that all agents employ selfish, sub-linear regret algorithms in the course of
the repeated game, we guarantee that our mechanism can achieve design goals
such as social optimality (efficiency) and Nash-equilibrium convergence to
within an error which approaches zero as the agents gain experience. Our
error bounds are deterministic or probabilistic, depending on the nature of
the regret bounds available for the algorithms employed by the agents. We
illustrate our method in an emissions market application.

Jan-Peter Calliess.
**Lazily adapted constant kinky
inference for nonparametric regression and model-reference adaptive
control**.
*arXiv*, arXiv:1701.00178, 2016.

** Abstract:**
Techniques known as Nonlinear Set Membership prediction, Lipschitz
Interpolation or Kinky Inference are approaches to machine learning that
utilise presupposed Lipschitz properties to compute inferences over
unobserved function values. Provided a bound on the true best Lipschitz
constant of the target function is known a priori, they offer convergence
guarantees as well as bounds around the predictions. Considering a more
general setting that builds on Hölder continuity relative to
pseudo-metrics, we propose a method for estimating the Hölder
constant online from function value observations that possibly are corrupted
by bounded observational errors. Utilising this to compute adaptive
parameters within a kinky inference rule gives rise to a nonparametric
machine learning method, for which we establish strong universal
approximation guarantees. That is, we show that our prediction rule can learn
any continuous function in the limit of increasingly dense data to within a
worst-case error bound that depends on the level of observational
uncertainty. We apply our method in the context of nonparametric
model-reference adaptive control (MRAC). Across a range of simulated aircraft
roll-dynamics and performance metrics our approach outperforms recently
proposed alternatives that were based on Gaussian processes and RBF-neural
networks. For discrete-time systems, we provide stability guarantees for our
learning-based controllers both for the batch and the online learning
setting.

Jan-Peter Calliess.
**Bayesian
Lipschitz constant estimation and quadrature**.
In *Workshop on Probabilistic Integration, NIPS*, Montreal, Canada,
December 2015.

** Abstract:** Lipschitz quadrature methods
provide an approach to one-dimensional numerical integration on bounded
domains. On the basis of the assumption that the integrand is Lipschitz
continuous with a known Lipschitz constant, these quadrature rules can
provide a tight error bound around their integral estimates and utilise the
Lipschitz constant to guide exploration in the context of adaptive
quadrature. In this paper, we outline our ongoing work on extending this
approach to settings where the Lipschitz constant is probabilistically
uncertain. As the key component, we introduce a Bayesian approach for
updating a subjectively probabilistic belief of the Lipschitz constant.
Combined with any Lipschitz quadrature rule, we obtain an approach for
translating a sample into an integral estimate with probabilistic uncertainty
intervals. The paper concludes with an illustration of the approach followed
by a discussion of open issues and future work.

Yarin Gal, Yutian Chen, and Zoubin Ghahramani.
**Latent
Gaussian processes for distribution estimation of multivariate categorical
data**.
In *Proceedings of the 32nd International Conference on Machine Learning
(ICML-15)*, pages 645-654, 2015.

** Abstract:**
Multivariate categorical data occur in many applications of machine learning.
One of the main difficulties with these vectors of categorical variables is
sparsity. The number of possible observations grows exponentially with vector
length, but dataset diversity might be poor in comparison. Recent models have
gained significant improvement in supervised tasks with this data. These
models embed observations in a continuous space to capture similarities
between them. Building on these ideas we propose a Bayesian model for the
unsupervised task of distribution estimation of multivariate categorical
data. We model vectors of categorical variables as generated from a
non-linear transformation of a continuous latent space. Non-linearity
captures multi-modality in the distribution. The continuous representation
addresses sparsity. Our model ties together many existing models, linking the
linear categorical latent Gaussian model, the Gaussian process latent
variable model, and Gaussian process classification. We derive inference for
our model based on recent developments in sampling based variational
inference. We show empirically that the model outperforms its linear and
discrete counterparts in imputation tasks of sparse data.

Anoop Korattikara, Yutian Chen, and Max Welling.
**Austerity
in MCMC land: Cutting the Metropolis-Hastings budget**.
In *31st International Conference on Machine Learning*, pages 181-189,
Beijing, China, June 2014.

** Abstract:** Can we make Bayesian
posterior MCMC sampling more efficient when faced with very large datasets?
We argue that computing the likelihood for N datapoints in the
Metropolis-Hastings (MH) test to reach a single binary decision is
computationally inefficient. We introduce an approximate MH rule based on a
sequential hypothesis test that allows us to accept or reject samples with
high confidence using only a fraction of the data required for the exact MH
rule. While this method introduces an asymptotic bias, we show that this bias
can be controlled and is more than offset by a decrease in variance due to
our ability to draw more samples per unit of time.

** Comment:** supplementary

Roger Frigola, Yutian Chen, and Carl E. Rasmussen.
**Variational
Gaussian process state-space models**.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 27*,
2014.

** Abstract:** State-space models have been successfully
used for more than fifty years in different areas of science and engineering.
We present a procedure for efficient variational Bayesian learning of
nonlinear state-space models based on sparse Gaussian processes. The result
of learning is a tractable posterior over nonlinear dynamical systems. In
comparison to conventional parametric models, we offer the possibility to
straightforwardly trade off model capacity and computational cost whilst
avoiding overfitting. Our main algorithm uses a hybrid inference approach
combining variational Bayes and sequential Monte Carlo. We also present
stochastic variational inference and online learning approaches for fast
learning with long time series.

E. Gilboa, Yunus Saatçi, and John P. Cunningham.
**Scaling multidimensional inference for structured Gaussian processes**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*, pages
424-436, 2015, doi
10.1109/TPAMI.2013.192.

** Abstract:** Exact Gaussian
process (GP) regression has O(N^{3}) runtime for data size N, making
it intractable for large N. Many algorithms for improving GP scaling
approximate the covariance with lower rank matrices. Other work has exploited
structure inherent in particular covariance functions, including GPs with
implied Markov structure, and inputs on a lattice (both enable O(N) or O(N
log N) runtime). However, these GP advances have not been well extended to
the multidimensional input setting, despite the preponderance of
multidimensional applications. This paper introduces and tests three novel
extensions of structured GPs to multidimensional inputs, for models with
additive and multiplicative kernels. First we present a new method for
inference in additive GPs, showing a novel connection between the classic
backfitting method and the Bayesian framework. We extend this model using two
advances: a variant of projection pursuit regression, and a Laplace
approximation for non-Gaussian observations. Lastly, for multiplicative
kernel structure, we present a novel method for GPs with inputs on a
multidimensional grid. We illustrate the power of these three advances on
several data sets, achieving performance equal to or very close to the naive
GP at orders of magnitude less cost.

** Comment:** arXiv

E. Gilboa, Yunus Saatçi, and John P. Cunningham.
**Scaling
multidimensional Gaussian processes using projected additive
approximations**.
In *30th International Conference on Machine Learning*, 2013.

** Abstract:** Exact Gaussian Process (GP) regression has
O(N^{3}) runtime for data size N, making it intractable for large N.
Many algorithms for improving GP scaling approximate the covariance with
lower rank matrices. Other work has exploited structure inherent in
particular covariance functions, including GPs with implied Markov structure,
and equispaced inputs (both enable O(N) runtime). However, these GP advances
have not been extended to the multidimensional input setting, despite the
preponderance of multidimensional applications. This paper introduces and
tests novel extensions of structured GPs to multidimensional inputs. We
present new methods for additive GPs, showing a novel connection between the
classic backfitting method and the Bayesian framework. To achieve optimal
accuracy-complexity tradeoff, we extend this model with a novel variant of
projection pursuit regression. Our primary result – projection pursuit
Gaussian Process Regression – shows orders of magnitude speedup while
preserving high accuracy. The natural second and third steps include
non-Gaussian observations and higher dimensional equispaced grid methods. We
introduce novel techniques to address both of these necessary directions. We
thoroughly illustrate the power of these three advances on several datasets,
achieving close performance to the naive Full GP at orders of magnitude less
cost.

Andrew Gordon Wilson, Elad Gilboa, Arye Nehorai, and John P Cunningham.
**GPatt: Fast multidimensional
pattern extrapolation with Gaussian processes**.
*arXiv preprint arXiv:1310.5288*, 2013.

** Abstract:**
Gaussian processes are typically used for smoothing and interpolation on
small datasets. We introduce a new Bayesian nonparametric framework, GPatt,
enabling automatic pattern extrapolation with Gaussian processes on large
multidimensional datasets. GPatt unifies and extends highly expressive
kernels and fast exact inference techniques. Without human intervention - no
hand crafting of kernel features, and no sophisticated initialisation
procedures - we show that GPatt can solve large scale pattern extrapolation,
inpainting, and kernel discovery problems, including a problem with 383,400
training points. We find that GPatt significantly outperforms popular
alternative scalable Gaussian process methods in speed and accuracy.
Moreover, we discover profound differences between each of these methods,
suggesting expressive kernels, nonparametric representations, and scalable
inference which exploits model structure are useful in combination for
modelling large scale multidimensional patterns.

John P. Cunningham, Zoubin Ghahramani, and Carl Edward Rasmussen.
**Gaussian
Processes for time-marked time-series data**.
In *15th International Conference on Artificial Intelligence and
Statistics*, pages 255-263, 2012.

** Abstract:** In many
settings, data is collected as multiple time series, where each recorded time
series is an observation of some underlying dynamical process of interest.
These observations are often time-marked with known event times, and one
desires to do a range of standard analyses. When there is only one time
marker, one simply aligns the observations temporally on that marker. When
multiple time-markers are present and are at different times on different
time series observations, these analyses are more difficult. We describe a
Gaussian Process model for analyzing multiple time series with multiple time
markings, and we test it on a variety of data.

J. H. Macke, L. Busing, J. P. Cunningham, B. M. Yu, K. V. Shenoy, and
M. Sahani.
**Empirical
models of spiking in neural populations**.
In *Advances in Neural Information Processing Systems 25*, pages 1-8,
Granada, Spain, December 2011.

** Abstract:** Neurons in the
neocortex code and compute as part of a locally interconnected population.
Large-scale multi-electrode recording makes it possible to access these
population processes empirically by fitting statistical models to unaveraged
data. What statistical structure best describes the concurrent spiking of
cells within a local network? We argue that in the cortex, where firing
exhibits extensive correlations in both time and space and where a typical
sample of neurons still reflects only a very small fraction of the local
population, the most appropriate model captures shared variability by a
low-dimensional latent process evolving with smooth dynamics, rather than by
putative direct coupling. We test this claim by comparing a latent dynamical
model with realistic spiking observations to coupled generalised linear
spike-response models (GLMs) using cortical recordings. We find that the
latent dynamical approach outperforms the GLM in terms of goodness-of-fit,
and reproduces the temporal correlations in the data more accurately. We also
compare models whose observation models are derived from either Gaussian
or point-process models, finding that the non-Gaussian model provides
slightly better goodness-of-fit and more realistic population spike
counts.

B. Petreska, B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy,
and M. Sahani.
**Dynamical
segmentation of single trials from population neural data**.
In *Advances in Neural Information Processing Systems 25*, pages 1-8,
Granada, Spain, December 2011.

** Abstract:** Simultaneous
recordings of many neurons embedded within a recurrently-connected cortical
network may provide concurrent views into the dynamical processes of that
network, and thus its computational function. In principle, these dynamics
might be identified by purely unsupervised, statistical means. Here, we show
that a Hidden Switching Linear Dynamical Systems (HSLDS) model - in which
multiple linear dynamical laws approximate a nonlinear and potentially
non-stationary dynamical process - is able to distinguish dynamical regimes
within single-trial motor cortical activity associated with the preparation
and initiation of hand movements. The regimes are identified without
reference to behavioural or experimental epochs, but nonetheless transitions
between them correlate strongly with external events whose timing may vary
from trial to trial. The HSLDS model also performs better than recent
comparable models in predicting the firing rate of an isolated neuron based
on the firing rates of others, suggesting that it captures more of the
"shared variance" of the data. Thus, the method is able to trace the
dynamical processes underlying the coordinated evolution of network activity
in a way that appears to reflect its computational role.

J. P. Cunningham, P. Nuyujukian, V. Gilja, C. A. Chestek, S. I. Ryu, and K. V.
Shenoy.
**A
closed-loop human simulator for investigating the role of feedback-control in
brain-machine interfaces**.
*Journal of Neurophysiology*, 105:1932-1949, 2011.

** Abstract:** Neural prosthetic systems seek to improve the lives of severely
disabled people by decoding neural activity into useful behavioral commands.
These systems and their decoding algorithms are typically developed
"offline", using neural activity previously gathered from a healthy animal,
and the decoded movement is then compared with the true movement that
accompanied the recorded neural activity. However, this offline design and
testing may neglect important features of a real prosthesis, most notably the
critical role of feedback control, which enables the user to adjust neural
activity while using the prosthesis. We hypothesize that understanding and
optimally designing high-performance decoders require an experimental
platform where humans are in closed-loop with the various candidate decode
systems and algorithms. The extent to which the subject can, for a
particular decode system, algorithm, or parameter, engage feedback and other
strategies to improve decode performance remains unexplored. Closed-loop testing may
suggest different choices than offline analyses. Here we ask if a healthy
human subject, using a closed-loop neural prosthesis driven by synthetic
neural activity, can inform system design. We use this online prosthesis
simulator (OPS) to optimize "online" decode performance based on a key
parameter of a current state-of-the-art decode algorithm, the bin width of a
Kalman filter. First, we show that offline and online analyses indeed suggest
different parameter choices. Previous literature and our offline analyses
agree that neural activity should be analyzed in bins of 100- to 300-ms
width. OPS analysis, which incorporates feedback control, suggests that much
shorter bin widths (25-50 ms) yield higher decode performance. Second, we
confirm this surprising finding using a closed-loop rhesus monkey prosthetic
system. These findings illustrate the type of discovery made possible by the
OPS, and so we hypothesize that this novel testing approach will help in the
design of prosthetic systems that will translate well to human patients.

M. Zhao, A. P. Batista, J. P. Cunningham, C. A. Chestek, Z. Rivera-Alvidrez,
R. Kalmar, S. I. Ryu, K. V. Shenoy, and S. Iyengar.
**An
L1-regularized logistic model for detecting short-term neuronal
interactions**.
*Journal of Computational Neuroscience*, 2011, doi
10.1007/s10827-011-0365-5.
In Press.

** Abstract:** Interactions among neurons are a key
component of neural signal processing. Rich neural data sets potentially
containing evidence of interactions can now be collected readily in the
laboratory, but existing analysis methods are often not sufficiently
sensitive and specific to reveal these interactions. Generalized linear
models offer a platform for analyzing multi-electrode recordings of neuronal
spike train data. Here we suggest an L1-regularized logistic regression model
(L1L method) to detect short-term (order of 3 ms) neuronal interactions. We
estimate the parameters in this model using a coordinate descent algorithm,
and determine the optimal tuning parameter using a Bayesian Information
Criterion. Simulation studies show that in general the L1L method has better
sensitivities and specificities than those of the traditional
shuffle-corrected cross-correlogram (covariogram) method. The L1L method is
able to detect excitatory interactions with both high sensitivity and
specificity with reasonably large recordings, even when the magnitude of the
interactions is small; similar results hold for inhibition given sufficiently
high baseline firing rates. Our study also suggests that the false positives
can be further removed by thresholding, because their magnitudes are
typically smaller than true interactions. Simulations also show that the L1L
method is somewhat robust to partially observed networks. We apply the method
to multi-electrode recordings collected in the monkey dorsal premotor cortex
(PMd) while the animal prepares to make reaching arm movements. The results
show that some neurons interact differently depending on task conditions. The
stronger interactions detected with our L1L method were also visible using
the covariogram method.

M. M. Churchland, J. P. Cunningham, M. T. Kaufman, S. I. Ryu, and K. V. Shenoy.
**Cortical
preparatory activity: Representation of movement or first cog in a dynamical
machine?**
*Neuron*, 68:387-400, 2010.

** Abstract:** The motor
cortices are active during both movement and movement preparation. A common
assumption is that preparatory activity constitutes a subthreshold form of
movement activity: a neuron active during rightward movements becomes
modestly active during preparation of a rightward movement. We asked whether
this pattern of activity is, in fact, observed. We found that it was not: at
the level of a single neuron, preparatory tuning was weakly correlated with
movement-period tuning. Yet, somewhat paradoxically, preparatory tuning could
be captured by a preferred direction in an abstract "space" that described
the population-level pattern of movement activity. In fact, this relationship
accounted for preparatory responses better than did traditional tuning
models. These results are expected if preparatory activity provides the
initial state of a dynamical system whose evolution produces movement
activity. Our results thus suggest that preparatory activity may not
represent specific factors, and may instead play a more mechanistic role.

M. M. Churchland, B. M. Yu, J. P. Cunningham, L. P. Sugrue, M. R. Cohen, G. S.
Corrado, W. T. Newsome, A. M. Clark, P. Hosseini, B. B. Scott, D. C. Bradley,
M. A. Smith, A. Kohn, J. A. Movshon, K. M. Armstrong, T. Moore, S. W. Chang,
L. H. Snyder, S. G. Lisberger, N. J. Priebe, I. M. Finn, D. Ferster, S. I.
Ryu, G. Santhanam, M. Sahani, and K. V. Shenoy.
**Stimulus
onset quashes neural variability: a widespread cortical phenomenon**.
*Nature Neuroscience*, 13:369-378, 2010.

** Abstract:**
Neural responses are typically characterized by computing the mean firing
rate, but response variability can exist across trials. Many studies have
examined the effect of a stimulus on the mean response, but few have examined
the effect on response variability. We measured neural variability in 13
extracellularly recorded datasets and one intracellularly recorded dataset
from seven areas spanning the four cortical lobes in monkeys and cats. In
every case, stimulus onset caused a decline in neural variability. This
occurred even when the stimulus produced little change in mean firing rate.
The variability decline was observed in membrane potential recordings, in the
spiking of individual neurons and in correlated spiking variability measured
with implanted 96-electrode arrays. The variability decline was observed for
all stimuli tested, regardless of whether the animal was awake, behaving or
anaesthetized. This widespread variability decline suggests a rather general
property of cortex, that its state is stabilized by an input.

B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, and
M. Sahani.
**Gaussian-process
factor analysis for low-dimensional single-trial analysis of neural
population activity**.
In *Advances in Neural Information Processing Systems 21*, pages 1-8,
Vancouver, BC, December 2009.

** Abstract:** We consider the
problem of extracting smooth, low-dimensional neural trajectories that
summarize the activity recorded simultaneously from many neurons on
individual experimental trials. Beyond the benefit of visualizing the
high-dimensional, noisy spiking activity in a compact form, such trajectories
can offer insight into the dynamics of the neural circuitry underlying the
recorded activity. Current methods for extracting neural trajectories involve
a two-stage process: the spike trains are first smoothed over time, then a
static dimensionality-reduction technique is applied. We first describe
extensions of the two-stage methods that allow the degree of smoothing to be
chosen in a principled way and that account for spiking variability, which
may vary both across neurons and across time. We then present a novel method
for extracting neural trajectories, Gaussian-process factor analysis (GPFA),
which unifies the smoothing and dimensionality-reduction operations in a
common probabilistic framework. We applied these methods to the activity of
61 neurons recorded simultaneously in macaque premotor and motor cortices
during reach planning and execution. By adopting a goodness-of-fit metric
that measures how well the activity of each neuron can be predicted by all
other recorded neurons, we found that the proposed extensions improved the
predictive ability of the two-stage methods. The predictive ability was
further improved by going to GPFA. From the extracted trajectories, we
directly observed a convergence in neural state during motor planning, an
effect that was shown indirectly by previous studies. We then show how such
methods can be a powerful tool for relating the spiking activity across a
neural population to the subject's behavior on a single-trial basis. Finally,
to assess how well the proposed methods characterize neural population
activity when the underlying time course is known, we performed simulations
that revealed that GPFA performed tens of percent better than the best
two-stage method.

C. Chang, J. P. Cunningham, and G. Glover.
**Influence
of heart rate on the BOLD signal: the cardiac response function**.
*NeuroImage*, 44:857-869, 2009.

** Abstract:** It has
previously been shown that low-frequency fluctuations in both respiratory
volume and cardiac rate can induce changes in the blood-oxygen level
dependent (BOLD) signal. Such physiological noise can obscure the detection
of neural activation using fMRI, and it is therefore important to model and
remove the effects of this noise. While a hemodynamic response function
relating respiratory variation (RV) and the BOLD signal has been described,
no such mapping for heart rate (HR) has been proposed. In the current study,
the effects of RV and HR are simultaneously deconvolved from resting state
fMRI. It is demonstrated that a convolution model including RV and HR can
explain significantly more variance in gray matter BOLD signal than a model
that includes RV alone, and an average HR response function is proposed that
well characterizes our subject population. It is observed that the voxel-wise
morphology of the deconvolved RV responses is preserved when HR is included
in the model, and that its form is adequately modeled by Birn et al.'s
previously described respiration response function. Furthermore, it is shown
that modeling out RV and HR can significantly alter functional connectivity
maps of the default-mode network.

B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, and
M. Sahani.
**Gaussian-process
factor analysis for low-dimensional single-trial analysis of neural
population activity**.
*Journal of Neurophysiology*, 102:614-635, 2009.

** Abstract:** We consider the problem of extracting smooth, low-dimensional
neural trajectories that summarize the activity recorded simultaneously from
many neurons on individual experimental trials. Beyond the benefit of
visualizing the high-dimensional, noisy spiking activity in a compact form,
such trajectories can offer insight into the dynamics of the neural circuitry
underlying the recorded activity. Current methods for extracting neural
trajectories involve a two-stage process: the spike trains are first smoothed
over time, then a static dimensionality-reduction technique is applied. We
first describe extensions of the two-stage methods that allow the degree of
smoothing to be chosen in a principled way and that account for spiking
variability, which may vary both across neurons and across time. We then
present a novel method for extracting neural trajectories - Gaussian-process
factor analysis (GPFA) - which unifies the smoothing and
dimensionality-reduction operations in a common probabilistic framework. We applied these
methods to the activity of 61 neurons recorded simultaneously in macaque
premotor and motor cortices during reach planning and execution. By adopting
a goodness-of-fit metric that measures how well the activity of each neuron
can be predicted by all other recorded neurons, we found that the proposed
extensions improved the predictive ability of the two-stage methods. The
predictive ability was further improved by going to GPFA. From the extracted
trajectories, we directly observed a convergence in neural state during motor
planning, an effect that was shown indirectly by previous studies. We then
show how such methods can be a powerful tool for relating the spiking
activity across a neural population to the subject's behavior on a
single-trial basis. Finally, to assess how well the proposed methods
characterize neural population activity when the underlying time course is
known, we performed simulations that revealed that GPFA performed tens of
percent better than the best two-stage method.

J. P. Cunningham, B. M. Yu, K. V. Shenoy, and M. Sahani.
**Inferring
neural firing rates from spike trains using Gaussian processes**.
In *Advances in Neural Information Processing Systems 20*, pages 1-8,
Vancouver, BC, December 2008.

** Abstract:** Neural spike
trains present challenges to analytical efforts due to their noisy, spiking
nature. Many studies of neuroscientific and neural prosthetic importance rely
on a smoothed, denoised estimate of the spike train's underlying firing rate.
Current techniques to find time-varying firing rates require ad hoc choices
of parameters, offer no confidence intervals on their estimates, and can
obscure potentially important single trial variability. We present a new
method, based on a Gaussian Process prior, for inferring probabilistically
optimal estimates of firing rate functions underlying single or multiple
neural spike trains. We test the performance of the method on simulated data
and experimentally gathered neural spike trains, and we demonstrate
improvements over conventional estimators.

** Comment:** Spotlight Presentation

J. P. Cunningham, K. V. Shenoy, and M. Sahani.
**Fast
Gaussian process methods for point process intensity estimation**.
In *25th International Conference on Machine Learning*, pages 1-8,
Helsinki, Finland, June 2008.

** Abstract:** Point processes
are difficult to analyze because they provide only a sparse and noisy
observation of the intensity function driving the process. Gaussian Processes
offer an attractive framework within which to infer underlying intensity
functions. The result of this inference is a continuous function defined
across time that is typically more amenable to analytical efforts. However, a
naive implementation will become computationally infeasible in any problem of
reasonable size, both in memory and run time requirements. We demonstrate
problem specific methods for a class of renewal processes that eliminate the
memory burden and reduce the solve time by orders of magnitude.

J. P. Cunningham.
**Derivation
of Expectation Propagation for "fast Gaussian process methods for point
process intensity estimation"**.
Technical report, Stanford University, 2008.

** Abstract:** We
derive the Expectation Propagation algorithm updates for approximating the
posterior distribution on intensity in a conditionally inhomogeneous gamma
interval process with a Gaussian Process prior (GP IGIP), a model which
appeared in Cunningham, Shenoy, Sahani (2008) ICML.

Alex Davies.
**Effective
implementation of Gaussian process regression for machine learning**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2015.

** Abstract:** This thesis presents frameworks for the
effective implementation of Gaussian process regression for machine learning.
It addresses this in three parts: effective iterative methods for calculating
the predictive distribution and derivatives of a Gaussian process with fixed
hyper-parameters; three broad classes of kernels of controllable
complexity that allow for an order of magnitude scaling in the previous
framework; and an investigation into alternative objective functions and
improved derivatives for the optimization of model hyper-parameters.

Alex Davies and Zoubin Ghahramani.
**The random forest
kernel and other kernels for big data from random partitions**.
*arXiv*, abs/1402.4293, 2014.

** Abstract:** We present
Random Partition Kernels, a new class of kernels derived by demonstrating a
natural connection between random partitions of objects and kernels between
those objects. We show how the construction can be used to create kernels
from methods that would not normally be viewed as random partitions, such as
Random Forest. To demonstrate the potential of this method, we propose two
new kernels, the Random Forest Kernel and the Fast Cluster Kernel, and show
that these kernels consistently outperform standard kernels on problems
involving real-world datasets. Finally, we show how the form of these kernels
lends itself to a natural approximation that is appropriate for certain
big data problems, allowing O(N) inference in methods such as Gaussian
Processes, Support Vector Machines and Kernel PCA.
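
As a rough illustration of the link between random partitions and kernels described in this abstract, the sketch below estimates a kernel value as the fraction of sampled partitions in which two points land in the same cell. The partition scheme (assigning each point to the nearest of a few randomly chosen centres), the one-dimensional data, and all function names are assumptions made for illustration; the sketch is not claimed to reproduce the paper's Random Forest Kernel or Fast Cluster Kernel.

```python
import random


def sample_partition(points, n_centres, rng):
    """Toy random partition: label each 1-D point by its nearest random centre."""
    centres = rng.sample(points, n_centres)
    return [min(range(n_centres), key=lambda c: abs(p - centres[c])) for p in points]


def random_partition_kernel(points, n_partitions=200, n_centres=3, seed=0):
    """Estimate K[i][j] as the fraction of partitions in which i and j share a cell."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_partitions):
        labels = sample_partition(points, n_centres, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / n_partitions for c in row] for row in counts]


# Hypothetical data: two tight groups and one outlier.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
K = random_partition_kernel(points)
```

Each sampled partition contributes a block-structured 0/1 matrix, so the average is symmetric with ones on the diagonal; nearby points receive larger values than distant ones.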

Simon Lacoste-Julien, Konstantina Palla, Alex Davies, Gjergji Kasneci, Thore
Graepel, and Zoubin Ghahramani.
**Sigma: simple
greedy matching for aligning large knowledge bases**.
In *KDD*, pages 572-580. Association for Computing Machinery, 2013.

** Abstract:** The Internet has enabled the creation of a
growing number of large-scale knowledge bases in a variety of domains
containing complementary information. Tools for automatically aligning these
knowledge bases would make it possible to unify many sources of structured
knowledge and answer complex queries. However, the efficient alignment of
large-scale knowledge bases still poses a considerable challenge. Here, we
present Simple Greedy Matching (SiGMa), a simple algorithm for aligning
knowledge bases with millions of entities and facts. SiGMa is an iterative
propagation algorithm which leverages both the structural information from
the relationship graph as well as flexible similarity measures between entity
properties in a greedy local search, thus making it scalable. Despite its
greedy nature, our experiments indicate that SiGMa can efficiently match some
of the world's largest knowledge bases with high precision. We provide
additional experiments on benchmark datasets which demonstrate that SiGMa can
outperform state-of-the-art approaches both in accuracy and efficiency.

A. Davies and Z. Ghahramani.
**Language-independent
Bayesian sentiment mining of twitter**.
In *The Fifth Workshop on Social Network Mining and Analysis
(SNA-KDD 2011)*, August 2011.

** Abstract:** This paper
outlines a new language-independent model for sentiment analysis of short,
social-network statuses. We demonstrate this on data from Twitter, modelling
happy vs sad sentiment, and show that in some circumstances this outperforms
similar Naive Bayes models by more than 10%. We also propose an extension to
allow the modelling of different sentiment distributions in different
geographic regions, while incorporating information from neighbouring
regions. We outline the considerations when creating a system analysing
Twitter data and present a scalable system of data acquisition and
prediction that can monitor the sentiment of tweets in real time.

Roberto Calandra, Jan Peters, Carl Edward Rasmussen, and Marc Peter Deisenroth.
**Manifold
Gaussian processes for regression**.
In *International Joint Conference on Neural Networks*, 2016.

** Abstract:** Off-the-shelf Gaussian Process (GP) covariance
functions encode smoothness assumptions on the structure of the function to
be modeled. To model complex and nondifferentiable functions, these
smoothness assumptions are often too restrictive. One way to alleviate this
limitation is to find a different representation of the data by introducing a
feature space. This feature space is often learned in an unsupervised way,
which might lead to data representations that are not useful for the overall
regression task. In this paper, we propose Manifold Gaussian Processes, a
novel supervised method that jointly learns a transformation of the data into
a feature space and a GP regression from the feature space to observed space.
The Manifold GP is a full GP and allows learning data representations that
are useful for the overall regression task. As a proof-of-concept, we
evaluate our approach on complex non-smooth functions where standard GPs
perform poorly, such as step functions and robotics tasks with contacts.

Marc Peter Deisenroth, Dieter Fox, and Carl Edward Rasmussen.
**Gaussian processes for data-efficient learning in robotics and control**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*,
37:408-423, 2015, doi
10.1109/TPAMI.2013.218.

** Abstract:** Autonomous learning
has been a promising direction in control and robotics for more than a decade
since data-driven learning reduces the amount of engineering
knowledge that is otherwise required. However, autonomous reinforcement
learning (RL) approaches typically require many interactions with the system
to learn controllers, which is a practical limitation in real systems, such
as robots, where many interactions can be impractical and time consuming. To
address this problem, current learning approaches typically require
task-specific knowledge in the form of expert demonstrations, realistic
simulators, pre-shaped policies, or specific knowledge about the underlying
dynamics. In this article, we follow a different approach and speed up
learning by extracting more information from data. In particular, we learn a
probabilistic, non-parametric Gaussian process transition model of the
system. By explicitly incorporating model uncertainty into long-term planning
and controller learning, our approach reduces the effects of model errors, a
key problem in model-based learning. Compared to state-of-the-art RL, our
model-based policy search method achieves an unprecedented speed of learning.
We demonstrate its applicability to autonomous learning in real robot and
control tasks.

B. Bischoff, D. Nguyen-Tuong, D. van Hoof, A. McHutchon, Carl Edward Rasmussen,
A. Knoll, and M. P. Deisenroth.
**Policy search
for learning robot control using sparse data**.
In *IEEE International Conference on Robotics and Automation*, pages
3882-3887, Hong Kong, China, 2014. IEEE, doi
10.1109/ICRA.2014.6907422.

** Abstract:** In many complex
robot applications, such as grasping and manipulation, it is difficult to
program desired task solutions beforehand, as robots are within an uncertain
and dynamic environment. In such cases, learning tasks from experience can be
a useful alternative. To obtain a sound learning and generalization
performance, machine learning, especially reinforcement learning, usually
requires sufficient data. However, in cases where only little data is
available for learning, due to system constraints and practical issues,
reinforcement learning can act suboptimally. In this paper, we investigate
how model-based reinforcement learning, in particular the probabilistic
inference for learning control method (PILCO), can be tailored to cope with
the case of sparse data to speed up learning. The basic idea is to include
further prior knowledge into the learning process. As PILCO is built on the
probabilistic Gaussian processes framework, additional system knowledge can
be incorporated by defining appropriate prior distributions, e.g. a linear
mean Gaussian prior. The resulting PILCO formulation remains in closed form
and analytically tractable. The proposed approach is evaluated in simulation
as well as on a physical robot, the Festo Robotino XT. For the robot
evaluation, we employ the approach for learning an object pick-up task. The
results show that by including prior knowledge, policy learning can be sped
up in the presence of sparse data.

Marc Peter Deisenroth, Ryan D. Turner, Marco F. Huber, Uwe D. Hanebeck, and
Carl Edward Rasmussen.
**Robust
filtering and smoothing with Gaussian processes**.
*IEEE Transactions on Automatic Control*, 57(7):1865-1871, 2012, doi
10.1109/TAC.2011.2179426.

** Abstract:** We propose a
principled algorithm for robust Bayesian filtering and smoothing in nonlinear
stochastic dynamic systems when both the transition function and the
measurement function are described by nonparametric Gaussian process (GP)
models. GPs are gaining increasing importance in signal processing, machine
learning, robotics, and control for representing unknown system functions by
posterior probability distributions. This modern way of "system
identification" is more robust than finding point estimates of a parametric
function representation. Our principled filtering/smoothing approach for GP
dynamic systems is based on analytic moment matching in the context of the
forward-backward algorithm. Our numerical evaluations demonstrate the
robustness of the proposed approach in situations where other
state-of-the-art Gaussian filters and smoothers can fail.

Marc Peter Deisenroth, Carl Edward Rasmussen, and Dieter Fox.
**Learning to
control a low-cost manipulator using data-efficient reinforcement
learning**.
In *9th International Conference on Robotics: Science & Systems*, Los
Angeles, CA, USA, June 2011.

** Abstract:** Over the last
years, there has been substantial progress in robust manipulation in
unstructured environments. The long-term goal of our work is to get away from
precise, but very expensive robotic systems and to develop affordable,
potentially imprecise, self-adaptive manipulator systems that can
interactively perform tasks such as playing with children. In this paper, we
demonstrate how a low-cost off-the-shelf robotic system can learn closed-loop
policies for a stacking task in only a handful of trials - from scratch. Our
manipulator is inaccurate and provides no pose feedback. For learning a
controller in the work space of a Kinect-style depth camera, we use a
model-based reinforcement learning technique. Our learning method is data
efficient, reduces model bias, and deals with several noise sources in a
principled way during long-term planning. We present a way of incorporating
state-space constraints into the learning process and analyze the learning
gain by exploiting the sequential structure of the stacking task.

** Comment:** project
site

Marc Peter Deisenroth and Carl Edward Rasmussen.
**PILCO: A
model-based and data-efficient approach to policy search**.
In *28th International Conference on Machine Learning*, 2011.

** Abstract:** In this paper, we introduce PILCO, a practical,
data-efficient model-based policy search method. PILCO reduces model bias,
one of the key problems of model-based reinforcement learning, in a
principled way. By learning a probabilistic dynamics model and explicitly
incorporating model uncertainty into long-term planning, PILCO can cope with
very little data and facilitates learning from scratch in only a few trials.
Policy evaluation is performed in closed form using state-of-the-art
approximate inference. Furthermore, policy gradients are computed
analytically for policy improvement. We report unprecedented learning
efficiency on challenging and high-dimensional control tasks.

** Comment:** web
site

Ryan Turner, Marc Peter Deisenroth, and Carl Edward Rasmussen.
**State-space
inference and learning with Gaussian processes**.
In Yee Whye Teh and Mike Titterington, editors, *13th International
Conference on Artificial Intelligence and Statistics*, volume 9 of
*W & CP*, pages 868-875, Chia Laguna, Sardinia, Italy, May 13-15
2010. Journal of Machine Learning Research.

** Abstract:**
State-space inference and learning with Gaussian processes (GPs) is an
unsolved problem. We propose a new, general methodology for inference and
learning in nonlinear state-space models that are described probabilistically
by non-parametric GP models. We apply the expectation maximization algorithm
to iterate between inference in the latent state-space and learning the
parameters of the underlying GP dynamics model.

** Comment:** poster.

Marc Peter Deisenroth.
**Efficient reinforcement
learning using Gaussian processes**.
PhD thesis, Karlsruhe Institute of Technology, Karlsruhe, Germany, 2010.

** Abstract:** In many research areas, including control and
medical applications, we face decision-making problems where data are limited
and/or the underlying generative process is complicated and partially
unknown. In these scenarios, we can profit from algorithms that learn from
data and aid decision making.

Reinforcement learning (RL) is a general
computational approach to experience-based goal-directed learning for
sequential decision making under uncertainty. However, RL often lacks
efficiency in terms of the number of required trials when no task-specific
knowledge is available. This lack of efficiency makes RL often inapplicable
to (optimal) control problems. Thus, a central issue in RL is to speed up
learning by extracting more information from available experience.

The
contributions of this dissertation are threefold:

1. We propose PILCO, a
fully Bayesian approach for efficient RL in continuous-valued state and
action spaces when no expert knowledge is available. PILCO is based on
well-established ideas from statistics and machine learning. PILCO's key
ingredient is a probabilistic dynamics model learned from data, which is
implemented by a Gaussian process (GP). The GP carefully quantifies knowledge
by a probability distribution over plausible dynamics models. By averaging
over all these models during long-term planning and decision making, PILCO
takes uncertainties into account in a principled way and, therefore, reduces
model bias, a central problem in model-based RL.

2. Due to its generality
and efficiency, PILCO can be considered a conceptual and practical approach
to jointly learning models and controllers when expert knowledge is difficult
to obtain or simply not available. For this scenario, we investigate PILCO's
properties and its applicability to challenging real and simulated nonlinear
control problems. For example, we consider the tasks of learning to swing up
a double pendulum attached to a cart or to balance a unicycle with five
degrees of freedom. Across all tasks we report unprecedented automation and
an unprecedented learning efficiency for solving these tasks.

3. As a
step toward PILCO's extension to partially observable Markov decision
processes, we propose a principled algorithm for robust filtering and
smoothing in GP dynamic systems. Unlike commonly used Gaussian filters for
nonlinear systems, it relies neither on function linearization nor on
finite-sample representations of densities. Our algorithm profits from exact
moment matching for predictions while keeping all computations analytically
tractable. We present experimental evidence that demonstrates the robustness
and the advantages of our method over unscented Kalman filters, the cubature
Kalman filter, and the extended Kalman filter.

Ryan Turner, Marc Peter Deisenroth, and Carl Edward Rasmussen.
**System
identification in Gaussian process dynamical systems**.
In Dilan Görür, editor, *NIPS Workshop on Nonparametric Bayes*,
Whistler, BC, Canada, December 2009.

** Comment:** poster.

Marc Peter Deisenroth and Carl Edward Rasmussen.
**Efficient
reinforcement learning for motor control**.
In *10th International PhD Workshop on Systems and Control*, Hluboká
nad Vltavou, Czech Republic, September 2009.

** Abstract:**
Artificial learners often require many more trials than humans or animals
when learning motor control tasks in the absence of expert knowledge. We
implement two key ingredients of biological learning systems, generalization
and incorporation of uncertainty into the decision-making process, to speed
up artificial learning. We present a coherent and fully Bayesian framework
that allows for efficient artificial learning in the absence of expert
knowledge. The success of our learning framework is demonstrated on
challenging nonlinear control problems in simulation and in hardware.

Marc Peter Deisenroth, Marco F. Huber, and Uwe D. Hanebeck.
**Analytic
moment-based Gaussian process filtering**.
In Léon Bottou and Michael Littman, editors, *26th International
Conference on Machine Learning*, pages 225-232, Montréal, QC,
Canada, June 2009. Omnipress.

** Abstract:** We propose an
analytic moment-based filter for nonlinear stochastic dynamic systems modeled
by Gaussian processes. Exact expressions for the expected value and the
covariance matrix are provided for both the prediction step and the filter
step, where an additional Gaussian assumption is exploited in the latter
case. Our filter does not require further approximations. In particular, it
avoids finite-sample approximations. We compare the filter to a variety of
Gaussian filters, that is, the EKF, the UKF, and the recent GP-UKF proposed
by Ko et al. (2007).

** Comment:** With corrections. code.

Marc Peter Deisenroth and Carl Edward Rasmussen.
**Bayesian inference
for efficient learning in control**.
In *Multidisciplinary Symposium on Reinforcement Learning*,
Montréal, QC, Canada, June 2009.

** Abstract:** In
contrast to humans or animals, artificial learners often require more trials
when learning motor control tasks solely based on experience. Efficient
autonomous learners will reduce the amount of engineering required to solve
control problems. By using probabilistic forward models, we can employ two
key ingredients of biological learning systems to speed up artificial
learning. We present a consistent and coherent Bayesian framework that allows
for efficient autonomous experience-based learning. We demonstrate the
success of our learning algorithm by applying it to challenging nonlinear
control problems in simulation and in hardware.

Marc Peter Deisenroth, Carl Edward Rasmussen, and Jan Peters.
**Gaussian process
dynamic programming**.
*Neurocomputing*, 72(7-9):1508-1524, March 2009, doi
10.1016/j.neucom.2008.12.019.

** Abstract:** Reinforcement
learning (RL) and optimal control of systems with continuous states and
actions require approximation techniques in most interesting cases. In this
article, we introduce Gaussian process dynamic programming (GPDP), an
approximate value function-based RL algorithm. We consider both a classic
optimal control problem, where problem-specific prior knowledge is available,
and a classic RL problem, where only very general priors can be used. For the
classic optimal control problem, GPDP models the unknown value functions with
Gaussian processes and generalizes dynamic programming to continuous-valued
states and actions. For the RL problem, GPDP starts from a given initial
state and explores the state space using Bayesian active learning. To design
a fast learner, available data have to be used efficiently. Hence, we propose
to learn probabilistic models of the a priori unknown transition dynamics and
the value functions on the fly. In both cases, we successfully apply the
resulting continuous-valued controllers to the under-actuated pendulum swing
up and analyze the performances of the suggested algorithms. It turns out
that GPDP uses data very efficiently and can be applied to problems where
classic dynamic programming would be cumbersome.

** Comment:** code.

Carl Edward Rasmussen and Marc Peter Deisenroth.
**Probabilistic
inference for fast learning in control**.
In S. Girgin, M. Loth, R. Munos, P. Preux, and D. Ryabko, editors, *Recent
Advances in Reinforcement Learning*, volume 5323 of *Lecture Notes in
Computer Science (LNCS)*, pages 229-242, Villeneuve d'Ascq, France,
November 2008. Springer-Verlag.

** Abstract:** We provide a
novel framework for very fast model-based reinforcement learning in
continuous state and action spaces. The framework requires probabilistic
models that explicitly characterize their levels of confidence. Within this
framework, we use flexible, non-parametric models to describe the world based
on previously collected experience. We demonstrate learning on the cart-pole
problem in a setting where we provide very limited prior knowledge about the
task. Learning progresses rapidly, and a good policy is found after only a
handful of iterations.

** Comment:** videos and more. slides.

Marc Peter Deisenroth, Jan Peters, and Carl Edward Rasmussen.
**Approximate
dynamic programming with Gaussian processes**.
In *2008 American Control Conference (ACC 2008)*, pages 4480-4485,
Seattle, WA, USA, June 2008.

** Abstract:** In general, it is
difficult to determine an optimal closed-loop policy in nonlinear control
problems with continuous-valued state and control domains. Hence,
approximations are often inevitable. The standard method of discretizing
states and controls suffers from the curse of dimensionality and strongly
depends on the chosen temporal sampling rate. The paper introduces Gaussian
Process Dynamic Programming (GPDP). In GPDP, value functions in the Bellman
recursion of the dynamic programming algorithm are modeled using Gaussian
processes. GPDP returns an optimal state-feedback for a finite set of states.
Based on these outcomes, we learn a possibly discontinuous closed-loop policy
on the entire state space by switching between two independently trained
Gaussian processes.

** Comment:** code.

Marc Peter Deisenroth, Carl Edward Rasmussen, and Jan Peters.
**Model-based
reinforcement learning with continuous states and actions**.
In *Proceedings of the 16th European Symposium on Artificial Neural Networks
(ESANN 2008)*, pages 19-24, Bruges, Belgium, April 2008.

** Abstract:** Finding an optimal policy in a reinforcement learning (RL)
framework with continuous state and action spaces is challenging. Approximate
solutions are often inevitable. GPDP is an approximate dynamic programming
algorithm based on Gaussian process (GP) models for the value functions. In
this paper, we extend GPDP to the case of unknown transition dynamics. After
building a GP model for the transition dynamics, we apply GPDP to this model
and determine a continuous-valued policy in the entire state space. We apply
the resulting controller to the underpowered pendulum swing up. Moreover, we
compare our results on this RL task to a nearly optimal discrete DP solution
in a fully known environment.

James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, and
Zoubin Ghahramani.
**Automatic construction and
natural-language description of nonparametric regression models**.
In *Association for the Advancement of Artificial Intelligence (AAAI)*,
July 2014.

** Abstract:** This paper presents the beginnings
of an automatic statistician, focusing on regression problems. Our system
explores an open-ended space of statistical models to discover a good
explanation of a data set, and then produces a detailed report with figures
and natural-language text. Our approach treats unknown regression functions
nonparametrically using Gaussian processes, which has two important
consequences. First, Gaussian processes can model functions in terms of
high-level properties (e.g. smoothness, trends, periodicity, changepoints).
Taken together with the compositional structure of our language of models
this allows us to automatically describe functions in simple terms. Second,
the use of flexible nonparametric models and a rich language for composing
them in an open-ended manner also results in state-of-the-art extrapolation
performance evaluated over 13 real time series data sets from various
domains.

Michael Schober, David Duvenaud, and Philipp Hennig.
**Probabilistic ODE solvers with
Runge-Kutta means**.
*arXiv preprint arXiv:1406.2582*, June 2014.

** Abstract:** Runge-Kutta methods are the classic family of solvers for
ordinary differential equations (ODEs), and the basis for the
state-of-the-art. Like most numerical methods, they return point estimates.
We construct a family of probabilistic numerical methods that instead return
a Gauss-Markov process defining a probability distribution over the ODE
solution. In contrast to prior work, we construct this family such that
posterior means match the outputs of the Runge-Kutta family exactly, thus
inheriting their proven good properties. Remaining degrees of freedom not
identified by the match to Runge-Kutta are chosen such that the posterior
probability measure fits the observed structure of the ODE. Our results shed
light on the structure of Runge-Kutta solvers from a new direction, provide a
richer, probabilistic output, have low computational cost, and raise new
research questions.

David Duvenaud, Oren Rippel, Ryan P. Adams, and Zoubin Ghahramani.
**Avoiding pathologies in very
deep networks**.
In *17th International Conference on Artificial Intelligence and
Statistics*, Reykjavik, Iceland, April 2014.

** Abstract:** Choosing appropriate architectures and regularization
strategies for deep networks is crucial to good predictive performance. To
shed light on this problem, we analyze the analogous problem of constructing
useful priors on compositions of functions. Specifically, we study the deep
Gaussian process, a type of infinitely-wide, deep neural network. We show
that in standard architectures, the representational capacity of the network
tends to capture fewer degrees of freedom as the number of layers increases,
retaining only a single degree of freedom in the limit. We propose an
alternate network architecture which does not suffer from this pathology. We
also examine deep covariance functions, obtained by composing infinitely many
feature transforms. Lastly, we characterize the class of models obtained by
performing dropout on Gaussian processes.

Tomoharu Iwata, David Duvenaud, and Zoubin Ghahramani.
**Warped mixtures for nonparametric
cluster shapes**.
In *29th Conference on Uncertainty in Artificial Intelligence*,
Bellevue, Washington, July 2013.

** Abstract:** A mixture of
Gaussians fit to a single curved or heavy-tailed cluster will report that the
data contains many clusters. To produce more appropriate clusterings, we
introduce a model which warps a latent mixture of Gaussians to produce
nonparametric cluster shapes. The possibly low-dimensional latent mixture
model allows us to summarize the properties of the high-dimensional clusters
(or density manifolds) describing the data. The number of manifolds, as well
as the shape and dimension of each manifold is automatically inferred. We
derive a simple inference scheme for this model which analytically integrates
out both the mixture parameters and the warping function. We show that our
model is effective for density estimation, performs better than infinite
Gaussian mixture models at recovering the true number of clusters, and
produces interpretable summaries of high-dimensional datasets.

David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, and
Zoubin Ghahramani.
**Structure
discovery in nonparametric regression through compositional kernel
search**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Despite its
importance, choosing the structural form of the kernel in nonparametric
regression remains a black art. We define a space of kernel structures which
are built compositionally by adding and multiplying a small number of base
kernels. We present a method for searching over this space of structures
which mirrors the scientific discovery process. The learned structures can
often decompose functions into interpretable components and enable long-range
extrapolation on time-series datasets. Our structure search method
outperforms many widely used kernels and kernel combination methods on a
variety of prediction tasks.
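
To make the idea of building kernels "compositionally by adding and multiplying a small number of base kernels" concrete, here is a minimal sketch of such a composition. The particular base kernels (squared-exponential, periodic, linear), their parameter values, and the helper names `add` and `mul` are assumptions chosen for illustration; the search procedure over structures described in the abstract is not implemented here.

```python
import math


def se(lengthscale):
    """Squared-exponential base kernel on scalar inputs."""
    return lambda x, y: math.exp(-0.5 * ((x - y) / lengthscale) ** 2)


def periodic(period, lengthscale):
    """Periodic base kernel on scalar inputs."""
    return lambda x, y: math.exp(
        -2.0 * math.sin(math.pi * abs(x - y) / period) ** 2 / lengthscale ** 2
    )


def linear():
    """Linear base kernel on scalar inputs."""
    return lambda x, y: x * y


def add(k1, k2):
    """Sum of two kernels, which is again a valid kernel."""
    return lambda x, y: k1(x, y) + k2(x, y)


def mul(k1, k2):
    """Product of two kernels, which is again a valid kernel."""
    return lambda x, y: k1(x, y) * k2(x, y)


# Hypothetical composite: a linear trend plus a locally periodic component.
kernel = add(linear(), mul(se(2.0), periodic(1.0, 0.5)))
```

Because sums and products of kernels are themselves kernels, any expression built this way can be used directly as a Gaussian process covariance function, and each additive term can be read as a separate component of the modelled function.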

Michael A. Osborne, David Duvenaud, Roman Garnett, Carl Edward Rasmussen,
Stephen J. Roberts, and Zoubin Ghahramani.
**Active
learning of model evidence using Bayesian quadrature**.
In *Advances in Neural Information Processing Systems 25*, pages 46-54,
Lake Tahoe, California, USA, December 2012.

** Abstract:**
Numerical integration is a key component of many problems in scientific
computing, statistical modelling, and machine learning. Bayesian Quadrature
is a model-based method for numerical integration which, relative to standard
Monte Carlo methods, offers increased sample efficiency and a more robust
estimate of the uncertainty in the estimated integral. We propose a novel
Bayesian Quadrature approach for numerical integration when the integrand is
non-negative, such as the case of computing the marginal likelihood,
predictive distribution, or normalising constant of a probabilistic model.
Our approach approximately marginalises the quadrature model’s
hyperparameters in closed form, and introduces an active learning scheme to
optimally select function evaluations, as opposed to using Monte Carlo
samples. We demonstrate our method on both a number of synthetic benchmarks
and a real scientific problem from astronomy.

Ferenc Huszár and David Duvenaud.
**Optimally-weighted herding is
Bayesian quadrature**.
In *28th Conference on Uncertainty in Artificial Intelligence*, pages
377-385, Catalina Island, California, July 2012.

**
Abstract:** Herding and kernel herding are deterministic methods of
choosing samples which summarise a probability distribution. A related task
is choosing samples for estimating integrals using Bayesian quadrature. We
show that the criterion minimised when selecting samples in kernel herding is
equivalent to the posterior variance in Bayesian quadrature. We then show
that sequential Bayesian quadrature can be viewed as a weighted version of
kernel herding which achieves performance superior to any other weighted
herding method. We demonstrate empirically a rate of convergence faster than
O(1/N). Our results also imply an upper bound on the empirical error of the
Bayesian quadrature estimate.

David Duvenaud, Hannes Nickisch, and Carl Edward Rasmussen.
**Additive
Gaussian processes**.
In *Advances in Neural Information Processing Systems 24*, pages
226-234, Granada, Spain, 2011.

** Abstract:** We introduce a
Gaussian process model of functions which are additive. An additive function
is one which decomposes into a sum of low-dimensional functions, each
depending on only a subset of the input variables. Additive GPs generalize
both Generalized Additive Models, and the standard GP models which use
squared-exponential kernels. Hyperparameter learning in this model can be
seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an
expressive but tractable parameterization of the kernel function, which
allows efficient evaluation of all input interaction terms, whose number is
exponential in the input dimension. The additional structure discoverable by
this model results in increased interpretability, as well as state-of-the-art
predictive power in regression tasks.
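
The remark about evaluating exponentially many interaction terms efficiently can be made concrete with a small sketch. It assumes that the order-d interaction term is the sum over all size-d subsets of input dimensions of products of one-dimensional base kernels, which is the d-th elementary symmetric polynomial of the base kernel values. The recurrence below is a standard dynamic program for those polynomials and may differ from the computation used in the paper; the function names and the squared-exponential base kernel are assumptions for this example.

```python
import math

def elementary_symmetric(values):
    """Return [e_0, e_1, ..., e_D] for a list of D numbers in O(D^2) time."""
    e = [1.0] + [0.0] * len(values)
    for count, z in enumerate(values, start=1):
        # Update from high order to low so each value is used once per term.
        for d in range(count, 0, -1):
            e[d] += z * e[d - 1]
    return e

def additive_kernel(x, y, order_variances, lengthscales):
    """Sum over interaction orders d = 1..D of variance_d * e_d(z_1..z_D)."""
    # One-dimensional squared-exponential base kernel per input dimension.
    z = [
        math.exp(-0.5 * ((a - b) / l) ** 2)
        for a, b, l in zip(x, y, lengthscales)
    ]
    e = elementary_symmetric(z)
    return sum(v * e[d] for d, v in enumerate(order_variances, start=1))
```

Although the number of interaction terms grows as 2^D, the recurrence touches each order once per input dimension, so a kernel evaluation costs O(D^2).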

Gintare Karolina Dziugaite, Daniel M. Roy, and Zoubin Ghahramani.
**Training
generative neural networks via Maximum Mean Discrepancy
optimization**.
In *31st Conference on Uncertainty in Artificial Intelligence*, pages
258-267, Amsterdam, The Netherlands, July 2015.

**
Abstract:** We consider training a deep neural network to generate samples
from an unknown distribution given i.i.d. data. We frame learning as an
optimization minimizing a two-sample test statistic: informally speaking, a
good generator network produces samples that cause a two-sample test to fail
to reject the null hypothesis. As our two-sample test statistic, we use an
unbiased estimate of the maximum mean discrepancy, which is the centerpiece
of the nonparametric kernel two-sample test proposed by Gretton et al.
(2012). We compare to the adversarial nets framework introduced by Goodfellow
et al. (2014), in which learning is a two-player game between a generator
network and an adversarial discriminator network, both trained to outwit the
other. From this perspective, the MMD statistic plays the role of the
discriminator. In addition to empirical comparisons, we prove bounds on the
generalization error incurred by optimizing the empirical MMD.
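
For readers unfamiliar with the statistic, here is a minimal sketch of the standard unbiased estimate of the squared maximum mean discrepancy between two samples. The Gaussian kernel, its bandwidth, and the function names are assumptions made for this example; the training loop of the generator network is not shown.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel between two points given as equal-length tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd2_unbiased(xs, ys, kernel=gaussian_kernel):
    """Unbiased estimate of MMD^2 between samples xs and ys.

    Within-sample sums exclude the diagonal terms (i == j), which is
    what removes the bias; the estimate can therefore be negative.
    """
    m, n = len(xs), len(ys)
    if m < 2 or n < 2:
        raise ValueError("each sample needs at least two points")
    xx = sum(kernel(xs[i], xs[j]) for i in range(m) for j in range(m) if i != j)
    yy = sum(kernel(ys[i], ys[j]) for i in range(n) for j in range(n) if i != j)
    xy = sum(kernel(x, y) for x in xs for y in ys)
    return xx / (m * (m - 1)) + yy / (n * (n - 1)) - 2.0 * xy / (m * n)
```

Values near zero (or slightly negative) indicate that the two samples are hard to tell apart under the chosen kernel; well-separated samples give clearly positive values.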

Frederik Eaton and Zoubin Ghahramani.
**Model reductions
for inference: Generality of pairwise, binary, and planar factor
graphs**.
*Neural Computation*, 25(5):1213-1260, 2013.

**
Abstract:** We offer a solution to the problem of efficiently translating
algorithms between different types of discrete statistical model. We
investigate the expressive power of three classes of model (those with binary
variables, with pairwise factors, and with planar topology) as well as their
four intersections. We formalize a notion of "simple reduction" for the
problem of inferring marginal probabilities and consider whether it is
possible to "simply reduce" marginal inference from general discrete factor
graphs to factor graphs in each of these seven subclasses. We characterize
the reducibility of each class, showing in particular that the class of
binary pairwise factor graphs is able to simply reduce only positive models.
We also exhibit a continuous "spectral reduction" based on polynomial
interpolation, which overcomes this limitation. Experiments assess the
performance of standard approximate inference algorithms on the outputs of
our reductions.

Frederik Eaton and Zoubin Ghahramani.
**Choosing a variable
to clamp: Approximate inference using conditioned belief
propagation**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 145-152,
Clearwater Beach, FL, USA, April 2009. Journal of Machine Learning
Research.

** Abstract:** In this paper we propose an algorithm
for approximate inference on graphical models based on belief propagation
(BP). Our algorithm is an approximate version of Cutset Conditioning, in
which a subset of variables is instantiated to make the rest of the graph
singly connected. We relax the constraint of single-connectedness, and select
variables one at a time for conditioning, running belief propagation after
each selection. We consider the problem of determining the best variable to
clamp at each level of recursion, and propose a fast heuristic which applies
back-propagation to the BP updates. We demonstrate that the heuristic
performs better than selecting variables at random, and give experimental
results which show that it performs competitively with existing approximate
inference algorithms.

** Comment:** Code (in C++
based on libDAI).

Alexandre Khae Wu Navarro, Jes Frellsen, and Richard E. Turner.
**The Multivariate Generalised
von Mises distribution: Inference and applications**.
January 2017.

** Abstract:** Circular variables arise in a
multitude of data-modelling contexts ranging from robotics to the social
sciences, but they have been largely overlooked by the machine learning
community. This paper partially redresses this imbalance by extending some
standard probabilistic modelling tools to the circular domain. First we
introduce a new multivariate distribution over circular variables, called the
multivariate Generalised von Mises (mGvM) distribution. This distribution can
be constructed by restricting and renormalising a general multivariate
Gaussian distribution to the unit hyper-torus. Previously proposed
multivariate circular distributions are shown to be special cases of this
construction. Second, we introduce a new probabilistic model for circular
regression, that is inspired by Gaussian Processes, and a method for
probabilistic principal component analysis with circular hidden variables.
These models can leverage standard modelling tools (e.g. covariance functions
and methods for automatic relevance determination). Third, we show that the
posterior distribution in these models is a mGvM distribution which enables
development of an efficient variational free-energy scheme for performing
approximate inference and approximate maximum-likelihood learning.

Jes Frellsen, Ole Winther, Zoubin Ghahramani, and Jesper Ferkinghoff-Borg.
**Bayesian generalised ensemble Markov chain Monte Carlo**.
In *19th International Conference on Artificial Intelligence and
Statistics*, Cadiz, Spain, May 2016.

** Abstract:**
Bayesian generalised ensemble (BayesGE) is a new method that addresses two
major drawbacks of standard Markov chain Monte Carlo algorithms for inference
in high-dimensional probability models: inapplicability to estimate the
partition function, and poor mixing properties. BayesGE uses a Bayesian
approach to iteratively update the belief about the density of states
(distribution of the log likelihood under the prior) for the model, with the
dual purpose of enhancing the sampling efficiency and making the estimation of
the partition function tractable. We benchmark BayesGE on Ising and Potts
systems and show that it compares favourably to existing state-of-the-art
methods.

Wouter Boomsma, Pengfei Tian, Jes Frellsen, Jesper Ferkinghoff-Borg, Thomas
Hamelryck, Kresten Lindorff-Larsen, and Michele Vendruscolo.
**Equilibrium simulations of proteins using molecular fragment replacement and
NMR chemical shifts**.
*Proceedings of the National Academy of Sciences*, 111(38):13852-13857,
2014, doi
10.1073/pnas.1404948111.

** Abstract:** Methods of protein
structure determination based on NMR chemical shifts are becoming
increasingly common. The most widely used approaches adopt the molecular
fragment replacement strategy, in which structural fragments are repeatedly
reassembled into different complete conformations in molecular simulations.
Although these approaches are effective in generating individual structures
consistent with the chemical shift data, they do not enable the sampling of
the conformational space of proteins with correct statistical weights. Here,
we present a method of molecular fragment replacement that makes it possible
to perform equilibrium simulations of proteins, and hence to determine their
free energy landscapes. This strategy is based on the encoding of the
chemical shift information in a probabilistic model in Markov chain Monte
Carlo simulations. First, we demonstrate that with this approach it is
possible to fold proteins to their native states starting from extended
structures. Second, we show that the method satisfies the detailed balance
condition and hence it can be used to carry out an equilibrium sampling from
the Boltzmann distribution corresponding to the force field used in the
simulations. Third, by comparing the results of simulations carried out with
and without chemical shift restraints we describe quantitatively the effects
that these restraints have on the free energy landscapes of proteins. Taken
together, these results demonstrate that the molecular fragment replacement
strategy can be used in combination with chemical shift information to
characterize not only the native structures of proteins but also their
conformational fluctuations.

Jes Frellsen, Thomas Hamelryck, and Jesper Ferkinghoff-Borg.
**Combining the
multicanonical ensemble with generative probabilistic models of local
biomolecular structure**.
In *Proceedings of the 59th World Statistics Congress of the
International Statistical Institute*, pages 139-144, Hong Kong,
2014.

** Abstract:** Markov chain Monte Carlo is a powerful
tool for sampling complex systems such as large biomolecular structures.
However, the standard Metropolis-Hastings algorithm suffers from a number of
deficiencies when applied to systems with rugged free-energy landscapes. Some
of these deficiencies can be addressed with the multicanonical ensemble. In
this paper we will present two strategies for applying the multicanonical
ensemble to distributions constructed from generative probabilistic models of
local biomolecular structure. In particular, we will describe how to use the
multicanonical ensemble efficiently in conjunction with the reference ratio
method.

Peter Kerpedjiev, Jes Frellsen, Stinus Lindgreen, and Anders Krogh.
**Adaptable probabilistic mapping of short reads using position specific
scoring matrices**.
*BMC bioinformatics*, 15:100, 2014, doi
10.1186/1471-2105-15-100.

** Abstract:** BACKGROUND:
Modern DNA sequencing methods produce vast amounts of data that often
requires mapping to a reference genome. Most existing programs use the number
of mismatches between the read and the genome as a measure of quality. This
approach is without a statistical foundation and can for some data types
result in many wrongly mapped reads. Here we present a probabilistic mapping
method based on position-specific scoring matrices, which can take into
account not only the quality scores of the reads but also user-specified
models of evolution and data-specific biases. RESULTS: We show how evolution,
data-specific biases, and sequencing errors are naturally dealt with
probabilistically. Our method achieves better results than Bowtie and BWA on
simulated and real ancient and PAR-CLIP reads, as well as on simulated reads
from the AT rich organism P. falciparum, when modeling the biases of these
data. For simulated Illumina reads, the method has consistently higher
sensitivity for both single-end and paired-end data. We also show that our
probabilistic approach can limit the problem of random matches from short
reads of contamination and that it improves the mapping of real reads from
one organism (D. melanogaster) to a related genome (D. simulans). CONCLUSION:
The presented work is an implementation of a novel approach to short read
mapping where quality scores, prior mismatch probabilities and mapping
qualities are handled in a statistically sound manner. The resulting
implementation provides not only a tool for biologists working with low
quality and/or biased sequencing data but also a demonstration of the
feasibility of using a probability based alignment method on real and
simulated data sets.

** Comment:** Peter Kerpedjiev and Jes Frellsen contributed
equally. Additional resources are available at bwa-pssm.binf.ku.dk

Peter Menzel, Jes Frellsen, Mireya Plass, Simon H. Rasmussen, and Anders
Krogh.
**On the accuracy of short read mapping**.
In *Deep Sequencing Data Analysis*, volume 1038, pages 39-59. Springer,
2013.

** Abstract:** The development of high-throughput
sequencing technologies has revolutionized the way we study genomes and gene
regulation. In a single experiment, millions of reads are produced. To gain
knowledge from these experiments the first thing to be done is finding the
genomic origin of the reads, i.e., mapping the reads to a reference genome.
In this new situation, conventional alignment tools are obsolete, as they
cannot handle this huge amount of data in a reasonable amount of time. Thus,
new mapping algorithms have been developed, which are fast at the expense of
a small decrease in accuracy. In this chapter we discuss the current problems
in short read mapping and show that mapping reads correctly is a nontrivial
task. Through simple experiments with both real and synthetic data, we
demonstrate that different mappers can give different results depending on
the type of data, and that a considerable fraction of uniquely mapped reads
is potentially mapped to an incorrect location. Furthermore, we provide
simple statistical results on the expected number of random matches in a
genome (E-value) and the probability of a random match as a function of read
length. Finally, we show that quality scores contain valuable information for
mapping and why mapping quality should be evaluated in a probabilistic
manner. In the end, we discuss the potential of improving the performance of
current methods by considering these quality scores in a probabilistic
mapping program.

** Comment:** Peter Menzel and Jes Frellsen contributed
equally.
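
The E-value mentioned in this abstract can be illustrated with a back-of-envelope model. The sketch below assumes a genome of independent, uniformly distributed bases and counts exact matches only; this is a textbook simplification and not necessarily the formula derived in the chapter. The function and parameter names are hypothetical.

```python
import math

def expected_random_matches(read_length, genome_length, both_strands=False):
    """Expected number of exact chance matches of a read in a random genome.

    Assumes i.i.d. uniform bases, so a read matches at any given
    position with probability 4 ** -read_length.
    """
    positions = max(genome_length - read_length + 1, 0)
    expected = positions * 4.0 ** (-read_length)
    return 2.0 * expected if both_strands else expected

def prob_random_match(read_length, genome_length, both_strands=False):
    """Poisson approximation to the probability of at least one chance match."""
    e_value = expected_random_matches(read_length, genome_length, both_strands)
    return 1.0 - math.exp(-e_value)
```

Under these assumptions the expected count falls by a factor of four for every additional base in the read, which is why very short reads are prone to random matches in large genomes.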

Roger Frigola.
**Bayesian time series
learning with Gaussian processes**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2015.

** Abstract:** The analysis of time series data is
important in fields as disparate as the social sciences, biology, engineering
or econometrics. In this dissertation, we present a number of algorithms
designed to learn Bayesian nonparametric models of time series. The goal of
these kinds of models is twofold. First, they aim at making predictions which
quantify the uncertainty due to limitations in the quantity and the quality
of the data. Second, they are flexible enough to model highly complex data
whilst preventing overfitting when the data does not warrant complex
models.

We begin with a unifying literature review on time series models
based on Gaussian processes. Then, we centre our attention on the Gaussian
Process State-Space Model (GP-SSM): a Bayesian nonparametric generalisation
of discrete-time nonlinear state-space models. We present a novel formulation
of the GP-SSM that offers new insights into its properties. We then proceed
to exploit those insights by developing new learning algorithms for the
GP-SSM based on particle Markov chain Monte Carlo and variational
inference.

Finally, we present a filtered nonlinear auto-regressive
model with a simple, robust and fast learning algorithm that makes it well
suited to its application by non-experts on large datasets. Its main
advantage is that it avoids the computationally expensive (and potentially
difficult to tune) smoothing step that is a key part of learning nonlinear
state-space models.

Roger Frigola, Yutian Chen, and Carl E. Rasmussen.
**Variational
Gaussian process state-space models**.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 27*,
2014.

** Abstract:** State-space models have been successfully
used for more than fifty years in different areas of science and engineering.
We present a procedure for efficient variational Bayesian learning of
nonlinear state-space models based on sparse Gaussian processes. The result
of learning is a tractable posterior over nonlinear dynamical systems. In
comparison to conventional parametric models, we offer the possibility to
straightforwardly trade off model capacity and computational cost whilst
avoiding overfitting. Our main algorithm uses a hybrid inference approach
combining variational Bayes and sequential Monte Carlo. We also present
stochastic variational inference and online learning approaches for fast
learning with long time series.

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl Edward
Rasmussen.
**Identification of Gaussian
process state-space models with particle stochastic approximation
EM**.
In *Proceedings of the 19th World Congress of the International Federation
of Automatic Control (IFAC)*, 2014.

** Abstract:**
Gaussian process state-space models (GP-SSMs) are a very flexible family of
models of nonlinear dynamical systems. They comprise a Bayesian nonparametric
representation of the dynamics of the system and additional
(hyper-)parameters governing the properties of this nonparametric
representation. The Bayesian formalism enables systematic reasoning about the
uncertainty in the system dynamics. We present an approach to maximum
likelihood identification of the parameters in GP-SSMs, while retaining the
full nonparametric description of the dynamics. The method is based on a
stochastic approximation version of the EM algorithm that employs recent
developments in particle Markov chain Monte Carlo for efficient
identification.

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl Edward
Rasmussen.
**Bayesian inference and learning in
Gaussian process state-space models with particle MCMC**.
In L. Bottou, C.J.C. Burges, Z. Ghahramani, M. Welling, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 26*.
Curran Associates, Inc., 2013.

** Abstract:** State-space
models are successfully used in many areas of science, engineering and
economics to model time series and dynamical systems. We present a fully
Bayesian approach to inference and learning in nonlinear nonparametric
state-space models. We place a Gaussian process prior over the transition
dynamics, resulting in a flexible model able to capture complex dynamical
phenomena. However, to enable efficient inference, we marginalize over the
dynamics of the model and instead infer directly the joint smoothing
distribution through the use of specially tailored Particle Markov Chain
Monte Carlo samplers. Once a sample from the smoothing distribution is
computed, the state transition predictive distribution can be formulated
analytically. We make use of sparse Gaussian process models to greatly reduce
the computational complexity of the approach.

Roger Frigola and Carl Edward Rasmussen.
**Integrated pre-processing for
Bayesian nonlinear system identification with Gaussian processes**.
In *Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on*,
2013.

** Abstract:** We introduce GP-FNARX: a new model for
nonlinear system identification based on a nonlinear autoregressive exogenous
model (NARX) with filtered regressors (F) where the nonlinear regression
problem is tackled using sparse Gaussian processes (GP). We integrate data
pre-processing with system identification into a fully automated procedure
that goes from raw data to an identified model. Both pre-processing
parameters and GP hyper-parameters are tuned by maximizing the marginal
likelihood of the probabilistic model. We obtain a Bayesian model of the
system's dynamics which is able to report its uncertainty in regions where
the data is scarce. The automated approach, the modeling of uncertainty and
its relatively low computational cost make GP-FNARX a good candidate for
applications in robotics and adaptive control.

David A. Knowles, Jurgen Van Gael, and Zoubin Ghahramani.
**Message passing
algorithms for the Dirichlet diffusion tree**.
In *28th International Conference on Machine Learning*, 2011.

** Abstract:** We demonstrate efficient approximate inference
for the Dirichlet Diffusion Tree (Neal, 2003), a Bayesian nonparametric prior
over tree structures. Although DDTs provide a powerful and elegant approach
for modeling hierarchies, they haven't seen much use to date. One problem is
the computational cost of MCMC inference. We provide the first deterministic
approximate inference methods for DDT models and show excellent performance
compared to the MCMC alternative. We present message passing algorithms to
approximate the Bayesian model evidence for a specific tree. This is used to
drive sequential tree building and greedy search to find optimal tree
structures, corresponding to hierarchical clusterings of the data. We
demonstrate appropriate observation models for continuous and binary data.
The empirical performance of our method is very close to the computationally
expensive MCMC alternative on a density estimation problem, and significantly
outperforms kernel density estimators.

** Comment:** web site

G. Kasneci, J. Van Gael, T. Graepel, and R. Herbrich.
**Bayesian
knowledge corroboration with logical rules and user feedback**.
In *European Conference on Machine Learning (ECML)*, Barcelona, Spain,
September 2010.

** Abstract:** Current knowledge bases suffer
from either low coverage or low accuracy. The underlying hypothesis of this
work is that user feedback can greatly improve the quality of automatically
extracted knowledge bases. The feedback could help quantify the uncertainty
associated with the stored statements and would enable mechanisms for
searching, ranking and reasoning at entity-relationship level. Most
importantly, a principled model for exploiting user feedback to learn the
truth values of statements in the knowledge base would be a major step
forward in addressing the issue of knowledge base curation. We present a
family of probabilistic graphical models that builds on user feedback and
logical inference rules derived from the popular Semantic-Web formalism of
RDFS [1]. Through internal inference and belief propagation, these models can
learn both the truth
values of the statements in the knowledge base and the reliabilities of the
users who give feedback. We demonstrate the viability of our approach in
extensive experiments on real-world datasets, with feedback collected from
Amazon Mechanical Turk.

C. Rotsos, J. Van Gael, A.W. Moore, and Z. Ghahramani.
**Traffic
classification in information poor environments**.
In *1st International Workshop on Traffic Analysis and Classification (IWCMC
'10)*, Caen, France, July 2010.

** Abstract:** Traffic
classification using machine learning continues to be an active research
area. The majority of work in this area uses *off-the-shelf* machine
learning tools and treats them as *black-box* classifiers. This approach
turns all the modelling complexity into a feature selection problem. In this
paper, we build a problem-specific solution to the traffic classification
problem by designing a custom probabilistic graphical model. Graphical models
are a modular framework to design classifiers which incorporate
domain-specific knowledge. More specifically, our solution introduces
semi-supervised learning which means we learn from both labelled and
unlabelled traffic flows. We show that our solution performs competitively
compared to previous approaches while using less data and simpler
features.

Sébastien Bratières, Jurgen Van Gael, Andreas Vlachos, and Zoubin
Ghahramani.
**Scaling the
iHMM: Parallelization versus Hadoop**.
In *Proceedings of the 2010 10th IEEE International Conference on Computer
and Information Technology*, pages 1235-1240, Bradford, UK, 2010. IEEE
Computer Society, doi
10.1109/CIT.2010.223.

** Abstract:** This paper compares
parallel and distributed implementations of an iterative, Gibbs sampling,
machine learning algorithm. Distributed implementations run under Hadoop on
facility computing clouds. The probabilistic model under study is the
infinite HMM (Beal, Ghahramani and Rasmussen, 2002), in which parameters are
learnt using an instance blocked Gibbs
sampling, with a step consisting of a dynamic program. We apply this model to
learn part-of-speech tags from newswire text in an unsupervised fashion.
However, our focus here is on runtime performance, as opposed to NLP-relevant
scores, embodied by iteration duration, ease of development, deployment and
debugging.

Charalampos Rotsos, Jurgen Van Gael, Andrew W. Moore, and Zoubin Ghahramani.
**Probabilistic
graphical models for semi-supervised traffic classification**.
In *The 6th International Wireless Communications and Mobile Computing
Conference*, pages 752-757, Caen, France, 2010.

**
Abstract:** Traffic classification using machine learning continues to be
an active research area. The majority of work in this area uses off-the-shelf
machine learning tools and treats them as black-box classifiers. This
approach turns all the modelling complexity into a feature selection problem.
In this paper, we build a problem-specific solution to the traffic
classification problem by designing a custom probabilistic graphical model.
Graphical models are a modular framework to design classifiers which
incorporate domain-specific knowledge. More specifically, our solution
introduces semi-supervised learning which means we learn from both labelled
and unlabelled traffic flows. We show that our solution performs
competitively compared to previous approaches while using less data and
simpler features.

J. Van Gael, A. Vlachos, and Z. Ghahramani.
**The infinite
HMM for unsupervised PoS tagging**.
In *Proceedings of the 2009 Conference on Empirical Methods in Natural
Language Processing (EMNLP)*, pages 678-687, Singapore, August 2009.
Association for Computational Linguistics.

** Abstract:** We
extend previous work on fully unsupervised part-of-speech tagging. Using a
non-parametric version of the HMM, called the infinite HMM (iHMM), we address
the problem of choosing the number of hidden states in unsupervised Markov
models for PoS tagging. We experiment with two non-parametric priors, the
Dirichlet and Pitman-Yor processes, on the Wall Street Journal dataset using
a parallelized implementation of an iHMM inference algorithm. We evaluate the
results with a variety of clustering evaluation metrics and achieve
equivalent or better performances than previously reported. Building on this
promising result we evaluate the output of the unsupervised PoS tagger as a
direct replacement for the output of a fully supervised PoS tagger for the
task of shallow parsing and compare the two evaluations.

F. Doshi-Velez, K.T. Miller, J. Van Gael, and Y.W. Teh.
**Variational
inference for the Indian buffet process**.
In *12th International Conference on Artificial Intelligence and
Statistics*, volume 5, pages 137-144, Clearwater Beach, FL, USA,
April 2009. Journal of Machine Learning Research.

**
Abstract:** The Indian Buffet Process (IBP) is a nonparametric prior for
latent feature models in which observations are influenced by a combination
of hidden features. For example, images may be composed of several objects
and sounds may consist of several notes. Latent feature models seek to infer
these unobserved features from a set of observations; the IBP provides a
principled prior in situations where the number of hidden features is
unknown. Current inference methods for the IBP have all relied on sampling.
While these methods are guaranteed to be accurate in the limit, samplers for
the IBP tend to mix slowly in practice. We develop a deterministic
variational method for inference in the IBP based on a truncated
stick-breaking approximation, provide theoretical bounds on the truncation
error, and evaluate our method in several data regimes.

Finale Doshi-Velez, Kurt T. Miller, Jurgen Van Gael, and Yee Whye Teh.
**Variational
inference for the Indian buffet process**.
Technical Report CBL-2009-001, University of Cambridge, Computational and
Biological Learning Laboratory, Department of Engineering, April 2009.

** Abstract:** The Indian Buffet Process (IBP) is a
nonparametric prior for latent feature models in which observations are
influenced by a combination of hidden features. For example, images may be
composed of several objects and sounds may consist of several notes. Latent
feature models seek to infer these unobserved features from a set of
observations; the IBP provides a principled prior in situations where the
number of hidden features is unknown. Current inference methods for the IBP
have all relied on sampling. While these methods are guaranteed to be
accurate in the limit, samplers for the IBP tend to mix slowly in practice.
We develop a deterministic variational method for inference in the IBP based
on truncating to finite models, provide theoretical bounds on the
truncation error, and evaluate our method in several data regimes. This
technical report is a longer version of Doshi-Velez et al. (2009).

J. Van Gael, Y.W. Teh, and Z. Ghahramani.
**The infinite
factorial hidden Markov model**.
In D. Koller, D. Schuurmans, L. Bottou, and Y. Bengio, editors, *Advances in
Neural Information Processing Systems 21*, volume 21, pages
1697-1704, Cambridge, MA, USA, December 2008. The MIT Press.

** Abstract:** The infinite factorial hidden Markov model is a
non-parametric extension of the factorial hidden Markov model. Our model
defines a probability distribution over an infinite number of independent
binary hidden Markov chains which together produce an observable sequence of
random variables. Central to our model is a new type of non-parametric prior
distribution inspired by the Indian Buffet Process which we call the
*Indian Buffet Markov Process*.

Jurgen Van Gael, Yunus Saatçi, Yee-Whye Teh, and Zoubin Ghahramani.
**Beam sampling
for the infinite hidden Markov model**.
In *25th International Conference on Machine Learning*, volume 25,
pages 1088-1095, Helsinki, Finland, 2008. Association for Computing
Machinery.

** Abstract:** The infinite hidden Markov model is
a non-parametric extension of the widely used hidden Markov model. Our paper
introduces a new inference algorithm for the infinite hidden Markov model
called beam sampling. Beam sampling combines slice sampling, which limits the
number of states considered at each time step to a finite number, with
dynamic programming, which samples whole state trajectories efficiently. Our
algorithm typically outperforms the Gibbs sampler and is more robust. We
present applications of iHMM inference using the beam sampler on changepoint
detection and text prediction problems.

A.B. Goldberg, X. Zhu, J. Van Gael, and D. Andrzejewski.
**Improving
diversity in ranking using absorbing random walks**.
In *Proceedings of NAACL HLT*, pages 97-104, Rochester, NY, USA, April
2007.

J. Van Gael and X. Zhu.
**Correlation
clustering for crosslingual link detection**.
In Manuela M. Veloso, editor, *International Joint Conference on Artificial
Intelligence*, pages 1744-1749, Hyderabad, India, January 2007.

** Abstract:** The crosslingual link detection problem calls for
identifying news articles in multiple languages that report on the same news
event. This paper presents a novel approach based on constrained clustering.
We discuss a general way for constrained clustering using a recent,
graph-based clustering framework called correlation clustering. We introduce
a correlation clustering implementation that features linear program chunking
to allow processing larger datasets. We show how to apply the correlation
clustering algorithm to the crosslingual link detection problem and present
experimental results that show correlation clustering improves upon the
hierarchical clustering approaches commonly used in link detection, and upon
hierarchical clustering approaches that take constraints into account.

A. B. Goldberg, D. Andrzejewski, J. Van Gael, B. Settles, X. Zhu, and
M. Craven.
**Ranking
biomedical passages for relevance and diversity: University of Wisconsin,
Madison at TREC Genomics 2006**.
In *Proceedings of the Fifteenth Text REtrieval Conference (TREC 2006)*,
Gaithersburg, MD, USA, November 2006.

** Abstract:** We report
on the University of Wisconsin, Madison's experience in the TREC Genomics
2006 track, which asks participants to retrieve passages from scientific
articles that satisfy biologists' information needs. An emphasis is placed on
returning relevant passages that discuss different aspects of the topic.
Using an off-the-shelf information retrieval (IR) engine, we focused on query
generation and reranking query results to encourage relevance and diversity.
For query generation, we automatically identify noun phrases from the topic
descriptions, and use online resources to gather synonyms as expansion terms.
Our first submission uses the baseline IR engine results. We rerank the
passages using a naive clustering-based approach in our second run, and we
test GRASSHOPPER, a novel graph-theoretic algorithm based on absorbing random
walks, in our third run. While our aspect-level results appear to compare
favorably with other participants on average, our query generation techniques
failed to produce adequate query results for several topics, causing our
passage and document-level evaluation scores to suffer. Furthermore, we
surprisingly achieved higher aspect-level scores using the initial ranking
than our methods aimed specifically at promoting diversity. While this sounds
discouraging, we have several ideas as to why this happened and hope to
produce new methods that correct these shortcomings.

Yingzhen Li and Yarin Gal.
**Dropout Inference
in Bayesian Neural Networks with Alpha-divergences**.
In *34th International Conference on Machine Learning*, Sydney,
Australia, August 2017.

** Abstract:** To obtain uncertainty
estimates with real-world Bayesian deep learning models, practical inference
approximations are needed. Dropout variational inference (VI) for example has
been used for machine vision and medical applications, but VI can severely
underestimate model uncertainty. Alpha-divergences are alternative
divergences to VI’s KL objective, which are able to avoid VI’s
uncertainty underestimation. But these are hard to use in practice: existing
techniques can only use Gaussian approximating distributions, and require
existing models to be changed radically, thus are of limited use for
practitioners. We propose a re-parametrisation of the alpha-divergence
objectives, deriving a simple inference technique which, together with
dropout, can be easily implemented with existing models by simply changing
the loss of the model. We demonstrate improved uncertainty estimates and
accuracy compared to VI in dropout networks. We study our model’s epistemic
uncertainty far away from the data using adversarial images, showing that
these can be distinguished from non-adversarial images by examining our
model’s uncertainty.

Rowan McAllister, Yarin Gal, Alex Kendall, Mark van der Wilk, Amar Shah,
Roberto Cipolla, and Adrian Weller.
**Concrete problems
for autonomous vehicle safety: Advantages of Bayesian deep
learning**.
In *International Joint Conference on Artificial Intelligence*,
Melbourne, Australia, August 2017.

** Abstract:** Autonomous
vehicle (AV) software is typically composed of a pipeline of individual
components, linking sensor inputs to motor outputs. Erroneous component
outputs propagate downstream, hence safe AV software must consider the
ultimate effect of each component's errors. Further, improving safety alone
is not sufficient. Passengers must also feel safe to trust and use AV
systems. To address such concerns, we investigate three under-explored themes
for AV research: safety, interpretability, and compliance. Safety can be
improved by quantifying the uncertainties of component outputs and
propagating them forward through the pipeline. Interpretability is concerned
with explaining what the AV observes and why it makes the decisions it does,
building reassurance with the passenger. Compliance refers to maintaining
some control for the passenger. We discuss open challenges for research
within these themes. We highlight the need for concrete evaluation metrics,
propose example problems, and highlight possible solutions.

Yarin Gal, Yutian Chen, and Zoubin Ghahramani.
**Latent
Gaussian processes for distribution estimation of multivariate categorical
data**.
In *Proceedings of the 32nd International Conference on Machine Learning
(ICML-15)*, pages 645-654, 2015.

** Abstract:**
Multivariate categorical data occur in many applications of machine learning.
One of the main difficulties with these vectors of categorical variables is
sparsity. The number of possible observations grows exponentially with vector
length, but dataset diversity might be poor in comparison. Recent models have
gained significant improvement in supervised tasks with this data. These
models embed observations in a continuous space to capture similarities
between them. Building on these ideas we propose a Bayesian model for the
unsupervised task of distribution estimation of multivariate categorical
data. We model vectors of categorical variables as generated from a
non-linear transformation of a continuous latent space. Non-linearity
captures multi-modality in the distribution. The continuous representation
addresses sparsity. Our model ties together many existing models, linking the
linear categorical latent Gaussian model, the Gaussian process latent
variable model, and Gaussian process classification. We derive inference for
our model based on recent developments in sampling based variational
inference. We show empirically that the model outperforms its linear and
discrete counterparts in imputation tasks of sparse data.

Yarin Gal and Richard Turner.
**Improving the
Gaussian process sparse spectrum approximation by representing uncertainty
in frequency inputs**.
In *Proceedings of the 32nd International Conference on Machine Learning
(ICML-15)*, pages 655-664, 2015.

** Abstract:** Standard
sparse pseudo-input approximations to the Gaussian process (GP) cannot handle
complex functions well. Sparse spectrum alternatives attempt to answer this
but are known to over-fit. We suggest the use of variational inference for
the sparse spectrum approximation to avoid both issues. We model the
covariance function with a finite Fourier series approximation and treat it
as a random variable. The random covariance function has a posterior, on
which a variational distribution is placed. The variational distribution
transforms the random covariance function to fit the data. We study the
properties of our approximate inference, compare it to alternative ones, and
extend it to the distributed and stochastic domains. Our approximation
captures complex functions better than standard approaches and avoids
over-fitting.

Yarin Gal and Zoubin Ghahramani.
**Pitfalls in the
use of parallel inference for the Dirichlet process**.
In *Proceedings of the 31st International Conference on Machine Learning
(ICML-14)*, 2014.

** Abstract:** Recent work done by
Lovell, Adams, and Mansingka (2012) and Williamson, Dubey, and Xing (2013)
has suggested an alternative parametrisation for the Dirichlet process in
order to derive non-approximate parallel MCMC inference for it – work which
has been picked up and implemented in several different fields. In this paper
we show that the approach suggested is impractical due to an extremely
unbalanced distribution of the data. We characterise the requirements of
efficient parallel inference for the Dirichlet process and show that the
proposed inference fails most of these requirements (while approximate
approaches often satisfy most of them). We present both theoretical and
experimental evidence, analysing the load balance for the inference and
showing that it is independent of the size of the dataset and the number of
nodes available in the parallel implementation. We end with suggestions of
alternative paths of research for efficient non-approximate parallel
inference for the Dirichlet process.

Yarin Gal, Mark van der Wilk, and Carl Rasmussen.
**Distributed
variational inference in sparse Gaussian process regression and latent
variable models**.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 27*, pages
3257-3265. Curran Associates, Inc., 2014.

** Abstract:**
Gaussian processes (GPs) are a powerful tool for probabilistic inference over
functions. They have been applied to both regression and non-linear
dimensionality reduction, and offer desirable properties such as uncertainty
estimates, robustness to over-fitting, and principled ways for tuning
hyper-parameters. However the scalability of these models to big datasets
remains an active topic of research. We introduce a novel re-parametrisation
of variational inference for sparse GP regression and latent variable models
that allows for an efficient distributed algorithm. This is done by
exploiting the decoupling of the data given the inducing points to
re-formulate the evidence lower bound in a Map-Reduce setting. We show that
the inference scales well with data and computational resources, while
preserving a balanced distribution of the load among the nodes. We further
demonstrate the utility in scaling Gaussian processes to big data. We show
that GP performance improves with increasing amounts of data in regression
(on flight data with 2 million records) and latent variable modelling (on
MNIST). The results show that GPs perform better than many common models
often used for big data.

Yarin Gal and Phil Blunsom.
**A systematic
Bayesian treatment of the IBM alignment models**.
In *Proceedings of the 2013 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies*.
Association for Computational Linguistics, 2013.

** Abstract:** The dominant yet ageing IBM and HMM word alignment models
underpin most popular Statistical Machine Translation implementations in use
today. Though beset by the limitations of implausible independence
assumptions, intractable optimisation problems, and an excess of tunable
parameters, these models provide a scalable and reliable starting point for
inducing translation systems. In this paper we build upon this venerable base
by recasting these models in the non-parametric Bayesian framework. By
replacing the categorical distributions at their core with hierarchical
Pitman-Yor processes, and through the use of collapsed Gibbs sampling, we
provide a more flexible formulation and sidestep the original heuristic
optimisation techniques. The resulting models are highly extendible,
naturally permitting the introduction of phrasal dependencies. We present
extensive experimental results showing improvements in both AER and BLEU when
benchmarked against Giza++, including significant improvements over IBM model
4.

Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison.
**Deep convolutional networks as
shallow Gaussian processes**.
In *International Conference on Learning Representations (ICLR)*,
2019.

** Abstract:** We show that the output of a (residual)
convolutional neural network (CNN) with an appropriate prior over the weights
and biases is a Gaussian process (GP) in the limit of infinitely many
convolutional filters, extending similar results for dense networks. For a
CNN, the equivalent kernel can be computed exactly and, unlike "deep
kernels", has very few parameters: only the hyperparameters of the original
CNN. Further, we show that this kernel has two properties that allow it to be
computed efficiently; the cost of evaluating the kernel for a pair of images
is similar to a single forward pass through the original CNN with only one
filter per layer. The kernel equivalent to a 32-layer ResNet obtains 0.84%
classification error on MNIST, a new record for GPs with a comparable number
of parameters.

Martin Trapp, Robert Peharz, Hong Ge, Franz Pernkopf, and Carl Edward
Rasmussen.
**Deep
structured mixtures of Gaussian processes**.
In *23rd International Conference on Artificial Intelligence and
Statistics*, Online, August 2020.

** Abstract:** Gaussian
Processes (GPs) are powerful non-parametric Bayesian regression models that
allow exact posterior inference, but exhibit high computational and memory
costs. In order to improve scalability of GPs, approximate posterior
inference is frequently employed, where a prominent class of approximation
techniques is based on local GP experts. However, local-expert techniques
proposed so far are either not well-principled, come with limited
approximation guarantees, or lead to intractable models. In this paper, we
introduce deep structured mixtures of GP experts, a stochastic process model
which i) allows exact posterior inference, ii) has attractive computational
and memory costs, and iii) when used as GP approximation, captures predictive
uncertainties consistently better than previous expert-based approximations.
In a variety of experiments, we show that deep structured mixtures have a low
approximation error and often perform competitively with or outperform prior
work.

Martin Trapp, Robert Peharz, Hong Ge, Franz Pernkopf, and Zoubin Ghahramani.
**Bayesian
learning of sum-product networks**.
In *Advances in Neural Information Processing Systems 33*, Vancouver,
December 2019.

** Abstract:** Sum-product networks (SPNs) are
flexible density estimators and have received significant attention due to
their attractive inference properties. While parameter learning in SPNs is
well developed, structure learning leaves something to be desired: Even
though there is a plethora of SPN structure learners, most of them are
somewhat ad-hoc and based on intuition rather than a clear learning
principle. In this paper, we introduce a well-principled Bayesian framework
for SPN structure learning. First, we decompose the problem into i) laying
out a computational graph, and ii) learning the so-called scope function over
the graph. The first is rather unproblematic and akin to neural network
architecture validation. The second represents the effective structure of the
SPN and needs to respect the usual structural constraints in SPN, i.e.
completeness and decomposability. While representing and learning the scope
function is somewhat involved in general, in this paper, we propose a natural
parametrisation for an important and widely used special case of SPNs. These
structural parameters are incorporated into a Bayesian model, such that
simultaneous structure and parameter learning is cast into monolithic
Bayesian posterior inference. In various experiments, our Bayesian SPNs often
improve test likelihoods over greedy SPN learners. Further, since the
Bayesian framework protects against overfitting, we can evaluate
hyper-parameters directly on the Bayesian model score, waiving the need for a
separate validation set, which is especially beneficial in low data regimes.
Bayesian SPNs can be applied to heterogeneous domains and can easily be
extended to nonparametric formulations. Moreover, our Bayesian approach is
the first that consistently and robustly learns SPN structures under
missing data.

Adam Ścibior, Ohad Kammar, Matthijs Vákár, Sam Staton, Hongseok Yang,
Yufei Cai, Klaus Ostermann, Sean K. Moss, Chris Heunen, and Zoubin
Ghahramani.
**Denotational
validation of higher-order Bayesian inference**.
*Proceedings of the ACM on Programming Languages*, 2, 2018.

** Abstract:** We present a modular semantic account of Bayesian
inference algorithms for probabilistic programming languages, as used in data
science and machine learning. Sophisticated inference algorithms are often
explained in terms of composition of smaller parts. However, neither their
theoretical justification nor their implementation reflects this modularity.
We show how to conceptualise and analyse such inference algorithms as
manipulating intermediate representations of probabilistic programs using
higher-order functions and inductive types, and their denotational semantics.
Semantic accounts of continuous distributions use measurable spaces. However,
our use of higher-order functions presents a substantial technical
difficulty: it is impossible to define a measurable space structure over the
collection of measurable functions between arbitrary measurable spaces that
is compatible with standard operations on those functions, such as function
application. We overcome this difficulty using quasi-Borel spaces, a recently
proposed mathematical structure that supports both function spaces and
continuous distributions. We define a class of semantic structures for
representing probabilistic programs, and semantic validity criteria for
transformations of these representations in terms of distribution
preservation. We develop a collection of building blocks for composing
representations. We use these building blocks to validate common inference
algorithms such as Sequential Monte Carlo and Markov Chain Monte Carlo. To
emphasize the connection between the semantic manipulation and its
traditional measure theoretic origins, we use Kock’s synthetic measure
theory. We demonstrate its usefulness by proving a quasi-Borel counterpart to
the Metropolis-Hastings-Green theorem.

Ben Bloem-Reddy, Emile Mathieu, Adam Foster, Tom Rainforth, Hong Ge, Maria
Lomeli, and Zoubin Ghahramani.
**Sampling and inference
for discrete random probability measures in probabilistic programs**.
In *NIPS workshop on Advances in Approximate Inference*,
California, United States, December 2017.

** Abstract:** We
consider the problem of sampling a sequence from a discrete random
probability measure (RPM) with countable support, under (probabilistic)
constraints of finite memory and computation. A canonical example is sampling
from the Dirichlet Process, which can be accomplished using its well-known
stick-breaking representation and lazy initialization of its atoms. We show
that efficient lazy initialization is possible if and only if a size-biased
representation of the discrete RPM is known. For models constructed from such
discrete RPMs, we consider the implications for generic particle-based
inference methods in probabilistic programming systems. To demonstrate, we
implement posterior inference for Normalized Inverse Gaussian Process mixture
models in Turing.

Nilesh Tripuraneni, Shixiang Gu, Hong Ge, and Zoubin Ghahramani.
**Particle
Gibbs for infinite hidden Markov models**.
In *Advances in Neural Information Processing Systems 29*, Montreal,
Canada, May 2015.

** Abstract:** Infinite Hidden Markov Models
(iHMMs) are an attractive, nonparametric generalization of the classical
Hidden Markov Model which can automatically infer the number of hidden states
in the system. However, due to the infinite-dimensional nature of the
transition dynamics, performing inference in the iHMM is difficult. In this
paper, we present an infinite-state Particle Gibbs (PG) algorithm to
resample state trajectories for the iHMM. The proposed algorithm uses an
efficient proposal optimized for iHMMs and leverages ancestor sampling to
improve the mixing of the standard PG algorithm. Our algorithm demonstrates
significant convergence improvements on synthetic and real world data
sets.

Robert Peharz, Steven Lang, Antonio Vergari, Karl Stelzner, Alejandro Molina,
Martin Trapp, Guy Van den Broeck, Kristian Kersting, and Zoubin Ghahramani.
**Einsum networks: Fast and
scalable learning of tractable probabilistic circuits**.
In *37th International Conference on Machine Learning*, Online, July
2020.

** Abstract:** Probabilistic circuits (PCs) are a
promising avenue for probabilistic modeling, as they permit a wide range of
exact and efficient inference routines. Recent "deep-learning-style"
implementations of PCs strive for a better scalability, but are still
difficult to train on real-world data, due to their sparsely connected
computational graphs. In this paper, we propose Einsum Networks (EiNets), a
novel implementation design for PCs, improving prior art in several regards.
At their core, EiNets combine a large number of arithmetic operations in a
single monolithic einsum-operation, leading to speedups and memory savings of
up to two orders of magnitude, in comparison to previous implementations. As
an algorithmic contribution, we show that the implementation of
Expectation-Maximization (EM) can be simplified for PCs, by leveraging
automatic differentiation. Furthermore, we demonstrate that EiNets scale well
to datasets which were previously out of reach, such as SVHN and CelebA, and
that they can be used as faithful generative image models.

Tameem Adel, Isabel Valera, Zoubin Ghahramani, and Adrian Weller.
**One-network
adversarial fairness**.
In *33rd AAAI Conference on Artificial Intelligence*, Hawaii, January
2019.

** Abstract:** There is currently a great expansion of
the impact of machine learning algorithms on our lives, prompting the need
for objectives other than pure performance, including fairness. Fairness here
means that the outcome of an automated decision-making system should not
discriminate between subgroups characterized by sensitive attributes such as
gender or race. Given any existing differentiable classifier, we make only
slight adjustments to the architecture including adding a new hidden layer,
in order to enable the concurrent adversarial optimization for fairness and
accuracy. Our framework provides one way to quantify the tradeoff between
fairness and accuracy, while also leading to strong empirical
performance.

Tameem Adel, Zoubin Ghahramani, and Adrian Weller.
**Discovering
interpretable representations for both deep generative and discriminative
models**.
In *35th International Conference on Machine Learning*, Stockholm,
Sweden, July 2018.

** Abstract:** Interpretability of
representations in both deep generative and discriminative models is highly
desirable. Current methods jointly optimize an objective combining accuracy
and interpretability. However, this may reduce accuracy, and is not
applicable to already trained models. We propose two interpretability
frameworks. First, we provide an interpretable lens for an existing model. We
use a generative model which takes as input the representation in an existing
(generative or discriminative) model, weakly supervised by limited side
information. Applying a flexible and invertible transformation to the input
leads to an interpretable representation with no loss in accuracy. We extend
the approach using an active learning strategy to choose the most useful side
information to obtain, allowing a human to guide what "interpretable" means.
Our second framework relies on joint optimization for a representation which
is both maximally informative about the side information and maximally
compressive about the non-interpretable data factors. This leads to a novel
perspective on the relationship between compression and regularization. We
also propose a new interpretability evaluation metric based on our framework.
Empirically, we achieve state-of-the-art results on three datasets using the
two proposed algorithms.

Maria Lomeli, Mark Rowland, Arthur Gretton, and Zoubin Ghahramani.
**Antithetic and Monte Carlo
kernel estimators for partial rankings**.
*arXiv preprint arXiv:1807.00400*, 2018.

** Abstract:**
In the modern age, rankings data is ubiquitous and it is useful for a variety
of applications such as recommender systems, multi-object tracking and
preference learning. However, most rankings data encountered in the real
world is incomplete, which prevents the direct application of existing
modelling tools for complete rankings. Our contribution is a novel way to
extend kernel methods for complete rankings to partial rankings, via
consistent Monte Carlo estimators for Gram matrices: matrices of kernel
values between pairs of observations. We also present a novel variance
reduction scheme based on an antithetic variate construction between
permutations to obtain an improved estimator for the Mallows kernel. The
corresponding antithetic kernel estimator has lower variance and we
demonstrate empirically that it has a better performance in a variety of
Machine Learning tasks. Both kernel estimators are based on extending kernel
mean embeddings to the embedding of a set of full rankings consistent with an
observed partial ranking. They form a computationally tractable alternative
to previous approaches for partial rankings data. An overview of the existing
kernels and metrics for permutations is also provided.

Adam Ścibior, Ohad Kammar, and Zoubin Ghahramani.
**Functional programming
for modular Bayesian inference**.
*Proceedings of the ACM on Programming Languages*, 2, 2018.

** Abstract:** We present an architectural design of a library
for Bayesian modelling and inference in modern functional programming
languages. The novel aspect of our approach is the modular implementation of
existing state-of-the-art inference algorithms. Our design relies on three
inherently functional features: higher-order functions, inductive data-types,
and support for either type-classes or an expressive module system. We
provide a performant Haskell implementation of this architecture,
demonstrating that high-level and modular probabilistic programming can be
added as a library in sufficiently expressive languages. We review the core
abstractions in this architecture: inference representations, inference
transformations, and inference representation transformers. We then implement
concrete instances of these abstractions, counterparts to particle filters
and Metropolis-Hastings samplers, which form the basic building blocks of our
library. By composing these building blocks we obtain state-of-the-art
inference algorithms: Resample-Move Sequential Monte Carlo, Particle Marginal
Metropolis-Hastings, and Sequential Monte Carlo Squared. We evaluate our
implementation against existing probabilistic programming systems and find it
is already competitively performant, although we conjecture that existing
functional programming optimisation techniques could reduce the overhead
associated with the abstractions we use. We show that our modular design
enables deterministic testing of inherently stochastic Monte Carlo
algorithms. Finally, we demonstrate using OCaml that an expressive module
system can also implement our design.

Adam Ścibior, Ohad Kammar, Matthijs Vákár, Sam Staton, Hongseok Yang,
Yufei Cai, Klaus Ostermann, Sean K. Moss, Chris Heunen, and Zoubin
Ghahramani.
**Denotational
validation of higher-order Bayesian inference**.
*Proceedings of the ACM on Programming Languages*, 2, 2018.

** Abstract:** We present a modular semantic account of Bayesian
inference algorithms for probabilistic programming languages, as used in data
science and machine learning. Sophisticated inference algorithms are often
explained in terms of composition of smaller parts. However, neither their
theoretical justification nor their implementation reflects this modularity.
We show how to conceptualise and analyse such inference algorithms as
manipulating intermediate representations of probabilistic programs using
higher-order functions and inductive types, and their denotational semantics.
Semantic accounts of continuous distributions use measurable spaces. However,
our use of higher-order functions presents a substantial technical
difficulty: it is impossible to define a measurable space structure over the
collection of measurable functions between arbitrary measurable spaces that
is compatible with standard operations on those functions, such as function
application. We overcome this difficulty using quasi-Borel spaces, a recently
proposed mathematical structure that supports both function spaces and
continuous distributions. We define a class of semantic structures for
representing probabilistic programs, and semantic validity criteria for
transformations of these representations in terms of distribution
preservation. We develop a collection of building blocks for composing
representations. We use these building blocks to validate common inference
algorithms such as Sequential Monte Carlo and Markov Chain Monte Carlo. To
emphasize the connection between the semantic manipulation and its
traditional measure theoretic origins, we use Kock’s synthetic measure
theory. We demonstrate its usefulness by proving a quasi-Borel counterpart to
the Metropolis-Hastings-Green theorem.

Ben Bloem-Reddy, Emile Mathieu, Adam Foster, Tom Rainforth, Hong Ge, Maria
Lomeli, and Zoubin Ghahramani.
**Sampling and inference
for discrete random probability measures in probabilistic programs**.
In *NIPS workshop on Advances in Approximate Inference*,
California, United States, December 2017.

** Abstract:** We
consider the problem of sampling a sequence from a discrete random
probability measure (RPM) with countable support, under (probabilistic)
constraints of finite memory and computation. A canonical example is sampling
from the Dirichlet Process, which can be accomplished using its well-known
stick-breaking representation and lazy initialization of its atoms. We show
that efficient lazy initialization is possible if and only if a size-biased
representation of the discrete RPM is known. For models constructed from such
discrete RPMs, we consider the implications for generic particle-based
inference methods in probabilistic programming systems. To demonstrate, we
implement posterior inference for Normalized Inverse Gaussian Process mixture
models in Turing.
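
As a rough illustration of the lazy stick-breaking idea mentioned in the abstract above, here is a minimal Python sketch. It is not the authors' Turing implementation; the class name `LazyStickBreakingDP` and the `base_sampler` callable are hypothetical names chosen for this example. Stick fractions and atoms are only created when a draw first reaches them, so memory grows with the number of atoms actually visited.

```python
import random


class LazyStickBreakingDP:
    """Lazily instantiated draw G ~ DP(alpha, base) via stick-breaking."""

    def __init__(self, alpha, base_sampler, rng):
        self.alpha = alpha
        self.base_sampler = base_sampler  # hypothetical callable: rng -> atom
        self.rng = rng
        self.fractions = []  # stick fractions v_k ~ Beta(1, alpha)
        self.atoms = []      # atom locations, drawn only when first needed

    def sample_index(self):
        k = 0
        while True:
            if k == len(self.fractions):
                # Break a new stick only when a draw actually reaches it.
                self.fractions.append(self.rng.betavariate(1.0, self.alpha))
                self.atoms.append(self.base_sampler(self.rng))
            # Stop at stick k with probability v_k, otherwise move on, so that
            # P(index = k) = v_k * prod_{j<k} (1 - v_j).
            if self.rng.random() < self.fractions[k]:
                return k
            k += 1

    def sample(self):
        return self.atoms[self.sample_index()]
```

Repeated calls to `sample()` on one object are draws from the same random measure, so values repeat across calls, while only finitely many atoms are ever stored.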

Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Bernhard
Schölkopf, and Sergey Levine.
**Interpolated policy gradient:
Merging on-policy and off-policy policy gradient estimation for deep
reinforcement learning**.
In *Advances in Neural Information Processing Systems 30*, Long Beach,
USA, Dec 2017.

** Abstract:** Off-policy model-free deep
reinforcement learning methods using previously collected data can improve
sample efficiency over on-policy policy gradient techniques. On the other
hand, on-policy algorithms are often more stable and easier to use. This
paper examines, both theoretically and empirically, approaches to merging on-
and off-policy updates for deep reinforcement learning. Theoretical results
show that off-policy updates with a value function estimator can be
interpolated with on-policy policy gradient updates whilst still satisfying
performance bounds. Our analysis uses control variate methods to produce a
family of policy gradient algorithms, with several recently proposed
algorithms being special cases of this family. We then provide an empirical
comparison of these techniques with the remaining algorithmic details fixed,
and show how different mixing of off-policy gradient estimates with on-policy
samples contributes to improvements in empirical performance. The final
algorithm provides a generalization and unification of existing deep policy
gradient techniques, has theoretical guarantees on the bias introduced by
off-policy updates, and improves on the state-of-the-art model-free deep RL
methods on a number of OpenAI Gym continuous control benchmarks.

Matej Balog, Nilesh Tripuraneni, Zoubin Ghahramani, and Adrian Weller.
**Lost
relatives of the Gumbel trick**.
In *34th International Conference on Machine Learning*, Sydney,
Australia, August 2017.

** Abstract:** The Gumbel trick is a
method to sample from a discrete probability distribution, or to estimate its
normalizing partition function. The method relies on repeatedly applying a
random perturbation to the distribution in a particular way, each time
solving for the most likely configuration. We derive an entire family of
related methods, of which the Gumbel trick is one member, and show that the
new methods have superior properties in several settings with minimal
additional computational cost. In particular, for the Gumbel trick to yield
computational benefits for discrete graphical models, Gumbel perturbations on
all configurations are typically replaced with so-called low-rank
perturbations. We show how a subfamily of our new methods adapts to this
setting, proving new upper and lower bounds on the log partition function and
deriving a family of sequential samplers for the Gibbs distribution. Finally,
we balance the discussion by showing how the simpler analytical form of the
Gumbel trick enables additional theoretical results.

** Comment:** [arXiv] [Poster]
[Slides]
[Code]
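
To make the basic Gumbel trick described in the abstract above concrete, here is a minimal Python sketch for a small discrete distribution given by unnormalized log-potentials. It covers only the classical trick with full perturbations of every configuration, not the new family of methods or the low-rank perturbations from the paper; the function names are hypothetical. It relies on the standard fact that the maximum of the Gumbel-perturbed potentials is itself Gumbel distributed with location log Z, so its mean is log Z plus the Euler-Mascheroni constant.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # mean of a standard Gumbel variable


def standard_gumbel(rng):
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max(potentials, rng):
    """Perturb each log-potential and return (argmax index, max value)."""
    perturbed = [phi + standard_gumbel(rng) for phi in potentials]
    best = max(range(len(potentials)), key=perturbed.__getitem__)
    return best, perturbed[best]


def estimate_log_partition(potentials, num_samples, rng):
    """Average the perturbed maxima and subtract the Gumbel mean."""
    total = sum(gumbel_max(potentials, rng)[1] for _ in range(num_samples))
    return total / num_samples - EULER_GAMMA
```

The index returned by `gumbel_max` is an exact sample from the distribution proportional to `exp(potentials[i])`, and `estimate_log_partition` is an unbiased Monte Carlo estimate of its log partition function.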

Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, and
Sergey Levine.
**Q-prop: Sample-efficient policy
gradient with an off-policy critic**.
In *5th International Conference on Learning Representations*, Toulon,
France, April 2017.

** Abstract:** Model-free deep
reinforcement learning (RL) methods have been successful in a wide variety of
simulated domains. However, a major obstacle facing deep RL in the real world
is their high sample complexity. Batch policy gradient methods offer stable
learning, but at the cost of high variance, which often requires large
batches. TD-style methods, such as off-policy actor-critic and Q-learning,
are more sample-efficient but biased, and often require costly hyperparameter
sweeps to stabilize. In this work, we aim to develop methods that combine the
stability of policy gradients with the efficiency of off-policy RL. We
present Q-Prop, a policy gradient method that uses a Taylor expansion of the
off-policy critic as a control variate. Q-Prop is both sample efficient and
stable, and effectively combines the benefits of on-policy and off-policy
methods. We analyze the connection between Q-Prop and existing model-free
algorithms, and use control variate theory to derive two variants of Q-Prop
with conservative and aggressive adaptation. We show that conservative Q-Prop
provides substantial gains in sample efficiency over trust region policy
optimization (TRPO) with generalized advantage estimation (GAE), and improves
stability over deep deterministic policy gradient (DDPG), the
state-of-the-art on-policy and off-policy methods, on OpenAI Gym's MuJoCo
continuous control environments.

Nilesh Tripuraneni, Mark Rowland, Zoubin Ghahramani, and Richard E. Turner.
**Magnetic
Hamiltonian Monte Carlo**.
In *34th International Conference on Machine Learning*, 2017.

** Abstract:** Hamiltonian Monte Carlo (HMC) exploits
Hamiltonian dynamics to construct efficient proposals for Markov chain Monte
Carlo (MCMC). In this paper, we present a generalization of HMC which
exploits non-canonical Hamiltonian dynamics. We refer to this algorithm as
magnetic HMC, since in 3 dimensions a subset of the dynamics map onto the
mechanics of a charged particle coupled to a magnetic field. We establish a
theoretical basis for the use of non-canonical Hamiltonian dynamics in MCMC,
and construct a symplectic, leapfrog-like integrator allowing for the
implementation of magnetic HMC. Finally, we exhibit several examples where
these non-canonical dynamics can lead to improved mixing of magnetic HMC
relative to ordinary HMC.

Matej Balog, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M. Roy, and
Yee Whye Teh.
**The
Mondrian kernel**.
In *32nd Conference on Uncertainty in Artificial Intelligence*, pages
32-41, Jersey City, New Jersey, USA, June 2016.

** Abstract:** We introduce the Mondrian kernel, a fast random feature
approximation to the Laplace kernel. It is suitable for both batch and online
learning, and admits a fast kernel-width-selection procedure as the random
features can be re-used efficiently for all kernel widths. The features are
constructed by sampling trees via a Mondrian process [Roy and Teh, 2009], and
we highlight the connection to Mondrian forests [Lakshminarayanan et al.,
2014], where trees are also sampled via a Mondrian process, but fit
independently. This link provides a new insight into the relationship between
kernel methods and random forests.

** Comment:** [Supplementary
Material] [arXiv] [Poster]
[Slides]
[Code]

Jes Frellsen, Ole Winther, Zoubin Ghahramani, and Jesper Ferkinghoff-Borg.
**Bayesian generalised ensemble Markov chain Monte Carlo**.
In *19th International Conference on Artificial Intelligence and
Statistics*, Cadiz, Spain, May 2016.

** Abstract:**
Bayesian generalised ensemble (BayesGE) is a new method that addresses two
major drawbacks of standard Markov chain Monte Carlo algorithms for inference
in high-dimensional probability models: inapplicability to estimate the
partition function, and poor mixing properties. BayesGE uses a Bayesian
approach to iteratively update the belief about the density of states
(distribution of the log likelihood under the prior) for the model, with the
dual purpose of enhancing the sampling efficiency and making the estimation of
the partition function tractable. We benchmark BayesGE on Ising and Potts
systems and show that it compares favourably to existing state-of-the-art
methods.

Alexander G D G Matthews, James Hensman, Richard E. Turner, and Zoubin
Ghahramani.
**On Sparse Variational methods
and the Kullback-Leibler divergence between stochastic processes**.
In *19th International Conference on Artificial Intelligence and
Statistics*, Cadiz, Spain, May 2016.

** Abstract:** The
variational framework for learning inducing variables (Titsias, 2009a) has
had a large impact on the Gaussian process literature. The framework may be
interpreted as minimizing a rigorously defined Kullback-Leibler divergence
between the approximating and posterior processes. To our knowledge this
connection has thus far gone unremarked in the literature. In this paper we
give a substantial generalization of the literature on this topic. We give a
new proof of the result for infinite index sets which allows inducing points
that are not data points and likelihoods that depend on all function values.
We then discuss augmented index sets and show that, contrary to previous
works, marginal consistency of augmentation is not enough to guarantee
consistency of variational inference with the original model. We then
characterize an extra condition where such a guarantee is obtainable. Finally
we show how our framework sheds light on interdomain sparse approximations
and sparse approximations for Cox processes.

Shixiang Gu, Zoubin Ghahramani, and Richard E. Turner.
**Neural adaptive sequential Monte
Carlo**.
In *Advances in Neural Information Processing Systems 28*, Montréal,
Canada, Dec 2015.

** Abstract:** Sequential Monte Carlo (SMC),
or particle filtering, is a popular class of methods for sampling from an
intractable target distribution using a sequence of simpler intermediate
distributions. Like other importance sampling-based methods, performance is
critically dependent on the proposal distribution: a bad proposal can lead to
arbitrarily inaccurate estimates of the target distribution. This paper
presents a new method for automatically adapting the proposal using an
approximation of the Kullback-Leibler divergence between the true posterior
and the proposal distribution. The method is very flexible, applicable to any
parameterised proposal distribution and it supports online and batch
variants. We use the new framework to adapt powerful proposal distributions
with rich parameterisations based upon neural networks leading to Neural
Adaptive Sequential Monte Carlo (NASMC). Experiments indicate that NASMC
significantly improves inference in a non-linear state space model
outperforming adaptive proposal methods including the Extended Kalman and
Unscented Particle Filters. Experiments also indicate that improved inference
translates into improved parameter learning when NASMC is used as a
subroutine of Particle Marginal Metropolis Hastings. Finally we show that
NASMC is able to train a neural network-based deep recurrent generative model
achieving results that compete with the state-of-the-art for polyphonic
music modelling. NASMC can be seen as bridging the gap between adaptive SMC
methods and the recent work in scalable, black-box variational inference.

James Hensman, Alexander G D G Matthews, Maurizio Filippone, and Zoubin
Ghahramani.
**MCMC
for Variationally Sparse Gaussian Processes**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2015.

** Abstract:** Gaussian
process (GP) models form a core part of probabilistic machine learning.
Considerable research effort has been made into attacking three issues with
GP models: how to compute efficiently when the number of data is large; how
to approximate the posterior when the likelihood is not Gaussian and how to
estimate covariance function parameter posteriors. This paper simultaneously
addresses these, using a variational approximation to the posterior which is
sparse in support of the function but otherwise free-form. The result is a
Hybrid Monte-Carlo sampling scheme which allows for a non-Gaussian
approximation over the function values and covariance parameters
simultaneously, with efficient computations based on inducing-point sparse
GPs. Code to replicate each experiment in this paper will be available
shortly.

James Robert Lloyd and Zoubin Ghahramani.
**Statistical
model criticism using kernel two sample tests**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2015.

** Abstract:** We propose an
exploratory approach to statistical model criticism using maximum mean
discrepancy (MMD) two sample tests. Typical approaches to model criticism
require a practitioner to select a statistic by which to measure
discrepancies between data and a statistical model. MMD two sample tests are
instead constructed as an analytic maximisation over a large space of
possible statistics and therefore automatically select the statistic which
most shows any discrepancy. We demonstrate on synthetic data that the
selected statistic, called the witness function, can be used to identify
where a statistical model most misrepresents the data it was trained on. We
then apply the procedure to real data where the models being assessed are
restricted Boltzmann machines, deep belief networks and Gaussian process
regression and demonstrate the ways in which these models fail to capture the
properties of the data they are trained on.

Gintare Karolina Dziugaite, Daniel M. Roy, and Zoubin Ghahramani.
**Training
generative neural networks via Maximum Mean Discrepancy
optimization**.
In *31st Conference on Uncertainty in Artificial Intelligence*, pages
258-267, Amsterdam, The Netherlands, July 2015.

**
Abstract:** We consider training a deep neural network to generate samples
from an unknown distribution given i.i.d. data. We frame learning as an
optimization minimizing a two-sample test statistic; informally speaking, a
good generator network produces samples that cause a two-sample test to fail
to reject the null hypothesis. As our two-sample test statistic, we use an
unbiased estimate of the maximum mean discrepancy, which is the centerpiece
of the nonparametric kernel two-sample test proposed by Gretton et al.
(2012). We compare to the adversarial nets framework introduced by Goodfellow
et al. (2014), in which learning is a two-player game between a generator
network and an adversarial discriminator network, both trained to outwit the
other. From this perspective, the MMD statistic plays the role of the
discriminator. In addition to empirical comparisons, we prove bounds on the
generalization error incurred by optimizing the empirical MMD.
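
For reference, the unbiased estimate of the squared maximum mean discrepancy mentioned in the abstract above can be sketched in a few lines of Python. This is only an illustration of the statistic on scalar samples, not the authors' training code; the Gaussian kernel, its bandwidth, and the function names are assumptions made for this example.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd2_unbiased(xs, ys, kernel=gaussian_kernel):
    """Unbiased estimate of MMD^2 between two samples of scalars."""
    m, n = len(xs), len(ys)
    # Within-sample sums skip the i == j terms; this removes the bias.
    k_xx = sum(kernel(xs[i], xs[j]) for i in range(m) for j in range(m) if i != j)
    k_yy = sum(kernel(ys[i], ys[j]) for i in range(n) for j in range(n) if i != j)
    k_xy = sum(kernel(x, y) for x in xs for y in ys)
    return k_xx / (m * (m - 1)) + k_yy / (n * (n - 1)) - 2.0 * k_xy / (m * n)
```

The estimate is close to zero (and can be slightly negative) when both samples come from the same distribution, and grows as the distributions move apart, which is what makes it usable as a training objective for a generator.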

James Hensman, Alexander G D G Matthews, and Zoubin Ghahramani.
**Scalable
Variational Gaussian Process Classification**.
In *18th International Conference on Artificial Intelligence and
Statistics*, pages 1-9, San Diego, California, USA, May 2015.

** Abstract:** Gaussian process classification is a popular
method with a number of appealing properties. We show how to scale the model
within a variational inducing point framework, out-performing the state of
the art on benchmark datasets. Importantly, the variational formulation can be
exploited to allow classification in problems with millions of data points,
as we demonstrate in experiments.

Nilesh Tripuraneni, Shixiang Gu, Hong Ge, and Zoubin Ghahramani.
**Particle
Gibbs for infinite hidden Markov models**.
In *Advances in Neural Information Processing Systems 28*, Montreal,
Canada, May 2015.

** Abstract:** Infinite Hidden Markov Models
(iHMMs) are an attractive, nonparametric generalization of the classical
Hidden Markov Model which can automatically infer the number of hidden states
in the system. However, due to the infinite-dimensional nature of the
transition dynamics, performing inference in the iHMM is difficult. In this
paper, we present an infinite-state Particle Gibbs (PG) algorithm to
resample state trajectories for the iHMM. The proposed algorithm uses an
efficient proposal optimized for iHMMs and leverages ancestor sampling to
improve the mixing of the standard PG algorithm. Our algorithm demonstrates
significant convergence improvements on synthetic and real world data
sets.

Christian Steinruecken, Zoubin Ghahramani, and David MacKay.
**Improving PPM
with dynamic parameter updates**.
In Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, and James A.
Storer, editors, *Proceedings of the Data Compression Conference*.
IEEE Computer Society, April 2015.

** Abstract:** This article
makes several improvements to the classic PPM algorithm, resulting in a new
algorithm with superior compression effectiveness on human text. The key
differences of our algorithm to classic PPM are that (A) rather than the
original escape mechanism, we use a generalised blending method with explicit
hyper-parameters that control the way symbol counts are combined to form
predictions; (B) different hyper-parameters are used for classes of different
contexts; and (C) these hyper-parameters are updated dynamically using
gradient information. The resulting algorithm (PPM-DP) compresses human text
better than all currently published variants of PPM, CTW, DMC, LZ, CSE and
BWT, with runtime only slightly slower than classic PPM.

Yarin Gal, Yutian Chen, and Zoubin Ghahramani.
**Latent
Gaussian processes for distribution estimation of multivariate categorical
data**.
In *Proceedings of the 32nd International Conference on Machine Learning
(ICML-15)*, pages 645-654, 2015.

** Abstract:**
Multivariate categorical data occur in many applications of machine learning.
One of the main difficulties with these vectors of categorical variables is
sparsity. The number of possible observations grows exponentially with vector
length, but dataset diversity might be poor in comparison. Recent models have
gained significant improvement in supervised tasks with this data. These
models embed observations in a continuous space to capture similarities
between them. Building on these ideas we propose a Bayesian model for the
unsupervised task of distribution estimation of multivariate categorical
data. We model vectors of categorical variables as generated from a
non-linear transformation of a continuous latent space. Non-linearity
captures multi-modality in the distribution. The continuous representation
addresses sparsity. Our model ties together many existing models, linking the
linear categorical latent Gaussian model, the Gaussian process latent
variable model, and Gaussian process classification. We derive inference for
our model based on recent developments in sampling based variational
inference. We show empirically that the model outperforms its linear and
discrete counterparts in imputation tasks of sparse data.

Zoubin Ghahramani.
**Probabilistic
machine learning and artificial intelligence**.
*Nature*, 521:452–459, 2015, doi
10.1038/nature14541.

** Abstract:** How can a machine
learn from experience? Probabilistic modelling provides a framework for
understanding what learning is, and has therefore emerged as one of the
principal theoretical and practical approaches for designing machines that
learn from data acquired through experience. The probabilistic framework,
which describes how to represent and manipulate uncertainty about models and
predictions, has a central role in scientific data analysis, machine
learning, robotics, cognitive science and artificial intelligence. This
Review provides an introduction to this framework, and discusses some of the
state-of-the-art advances in the field, namely, probabilistic programming,
Bayesian optimization, data compression and automatic model discovery.

José Miguel Hernández-Lobato, Michael A. Gelbart, Matthew W. Hoffman,
Ryan P. Adams, and Zoubin Ghahramani.
**Predictive
entropy search for Bayesian optimization with unknown constraints**.
In *32nd International Conference on Machine Learning*, pages
1699-1707, 2015.

** Abstract:** Unknown constraints arise in
many types of expensive black-box optimization problems. Several methods have
been proposed recently for performing Bayesian optimization with constraints,
based on the expected improvement (EI) heuristic. However, EI can lead to
pathologies when used with constraints. For example, in the case of decoupled
constraints—i.e., when one can independently evaluate the objective or the
constraints—EI can encounter a pathology that prevents exploration.
Additionally, computing EI requires a current best solution, which may not
exist if none of the data collected so far satisfy the constraints. By
contrast, information-based approaches do not suffer from these failure
modes. In this paper, we present a new information-based method called
Predictive Entropy Search with Constraints (PESC). We analyze the performance
of PESC and show that it compares favorably to EI-based approaches on
synthetic and benchmark problems, as well as several real-world examples. We
demonstrate that PESC is an effective algorithm that provides a promising
direction towards a unified solution for constrained Bayesian
optimization.

Tomoharu Iwata, James Robert Lloyd, and Zoubin Ghahramani.
**Unsupervised
many-to-many object matching for relational data**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*,
2015.

** Abstract:** We propose a method for unsupervised
many-to-many object matching from multiple networks, which is the task of
finding correspondences between groups of nodes in different networks. For
example, the proposed method can discover shared word groups from
multi-lingual document-word networks without cross-language alignment
information. We assume that multiple networks share groups, and each group
has its own interaction pattern with other groups. Using infinite relational
models with this assumption, objects in different networks are clustered into
common groups depending on their interaction patterns, discovering a
matching. The effectiveness of the proposed method is experimentally
demonstrated by using synthetic and real relational data sets, which include
applications to cross-domain recommendation without shared user/item
identifiers and multi-lingual word clustering.

Adam Ścibior, Zoubin Ghahramani, and Andrew D. Gordon.
**Practical
probabilistic programming with monads**.
In *Proceedings of the 8th ACM SIGPLAN Symposium on Haskell*.
Association for Computing Machinery, 2015, doi
10.1145/2804302.2804317.

** Abstract:** The machine
learning community has recently shown a lot of interest in practical
probabilistic programming systems that target the problem of Bayesian
inference. Such systems come in different forms, but they all express
probabilistic models as computational processes using syntax resembling
programming languages. In the functional programming community monads are
known to offer a convenient and elegant abstraction for programming with
probability distributions, but their use is often limited to very simple
inference problems. We show that it is possible to use the monad abstraction
to construct probabilistic models for machine learning, while still offering
good performance of inference in challenging models. We use a GADT as an
underlying representation of a probability distribution and apply Sequential
Monte Carlo-based methods to achieve efficient inference. We define a formal
semantics via measure theory. We demonstrate a clean and elegant
implementation that achieves performance comparable with Anglican, a
state-of-the-art probabilistic programming system.

Yue Gu, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Gaussian
process volatility model**.
In *Advances in Neural Information Processing Systems 27*, pages 1-9,
Montreal, Canada, December 2014.

** Abstract:** The prediction
of time-changing variances is an important task in the modeling of financial
data. Standard econometric models are often limited as they assume rigid
functional relationships for the evolution of the variance. Moreover,
functional parameters are usually learned by maximum likelihood, which can
lead to overfitting. To address these problems we introduce GP-Vol, a novel
non-parametric model for time-changing variances based on Gaussian Processes.
This new model can capture highly flexible functional relationships for the
variances. Furthermore, we introduce a new online algorithm for fast
inference in GP-Vol. This method is much faster than current offline
inference procedures and it avoids overfitting problems by following a fully
Bayesian approach. Experiments with financial data show that GP-Vol performs
significantly better than current standard alternatives.

José Miguel Hernández-Lobato, Matthew W. Hoffman, and Zoubin Ghahramani.
**Predictive
entropy search for efficient global optimization of black-box
functions**.
In *Advances in Neural Information Processing Systems 27*, pages 1-9,
Montreal, Canada, December 2014.

** Abstract:** We propose a
novel information-theoretic approach for Bayesian optimization called
Predictive Entropy Search (PES). At each iteration, PES selects the next
evaluation point that maximizes the expected information gained with respect
to the global maximum. PES codifies this intractable acquisition function in
terms of the expected reduction in the differential entropy of the predictive
distribution. This reformulation allows PES to obtain approximations that are
both more accurate and efficient than other alternatives such as Entropy
Search (ES). Furthermore, PES can easily perform a fully Bayesian treatment
of the model hyperparameters while ES cannot. We evaluate PES in both
synthetic and real-world applications, including optimization problems in
machine learning, finance, biotechnology, and robotics. We show that the
increased accuracy of PES leads to significant gains in optimization
performance.

Alexander G D G Matthews, James Hensman, and Zoubin Ghahramani.
**Comparing
lower bounds on the entropy of mixture distributions for use in variational
inference**.
In *NIPS workshop on Advances in Variational Inference*,
Montreal, Canada, December 2014.

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani.
**Beta diffusion trees and
hierarchical feature allocations**.
Technical report, Dept. of Engineering, University of Cambridge, August 2014.

** Abstract:** We define the beta diffusion tree, a random tree
structure with a set of leaves that defines a collection of overlapping
subsets of objects, known as a feature allocation. A generative process for
the tree structure is defined in terms of particles (representing the
objects) diffusing in some continuous space, analogously to the Dirichlet
diffusion tree (Neal, 2003b), which defines a tree structure over partitions
(i.e., non-overlapping subsets) of the objects. Unlike in the Dirichlet
diffusion tree, multiple copies of a particle may exist and diffuse along
multiple branches in the beta diffusion tree, and an object may therefore
belong to multiple subsets of particles. We demonstrate how to build a
hierarchically-clustered factor analysis model with the beta diffusion tree
and how to perform inference over the random tree structures with a Markov
chain Monte Carlo algorithm. We conclude with several numerical experiments
on missing data problems with data sets of gene expression microarrays,
international development statistics, and intranational socioeconomic
measurements.

James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, and
Zoubin Ghahramani.
**Automatic construction and
natural-language description of nonparametric regression models**.
In *Association for the Advancement of Artificial Intelligence (AAAI)*,
July 2014.

** Abstract:** This paper presents the beginnings
of an automatic statistician, focusing on regression problems. Our system
explores an open-ended space of statistical models to discover a good
explanation of a data set, and then produces a detailed report with figures
and natural-language text. Our approach treats unknown regression functions
nonparametrically using Gaussian processes, which has two important
consequences. First, Gaussian processes can model functions in terms of
high-level properties (e.g. smoothness, trends, periodicity, changepoints).
Taken together with the compositional structure of our language of models
this allows us to automatically describe functions in simple terms. Second,
the use of flexible nonparametric models and a rich language for composing
them in an open-ended manner also results in state-of-the-art extrapolation
performance evaluated over 13 real time series data sets from various
domains.

Neil Houlsby, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Cold-start
active learning with robust ordinal matrix factorization**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We present a new matrix
factorization model for rating data and a corresponding active learning
strategy to address the cold-start problem. Cold-start is one of the most
challenging tasks for recommender systems: what to recommend with new users
or items for which one has little or no data. An approach is to use active
learning to collect the most useful initial ratings. However, the performance
of active learning depends strongly upon having accurate estimates of i) the
uncertainty in model parameters and ii) the intrinsic noisiness of the data.
To achieve these estimates we propose a heteroskedastic Bayesian model for
ordinal matrix factorization. We also present a computationally efficient
framework for Bayesian active learning with this type of complex
probabilistic model. This algorithm successfully distinguishes between
informative and noisy data points. Our model yields state-of-the-art
predictive performance and, coupled with our active learning strategy,
enables us to gain useful information in the cold-start setting from the very
first active sample.

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani.
**Beta
diffusion trees**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We define the beta diffusion tree, a
random tree structure with a set of leaves that defines a collection of
overlapping subsets of objects, known as a feature allocation. The generative
process for the tree is defined in terms of particles (representing the
objects) diffusing in some continuous space, analogously to the Dirichlet and
Pitman–Yor diffusion trees (Neal, 2003b; Knowles & Ghahramani, 2011), both
of which define tree structures over clusters of the particles. With the beta
diffusion tree, however, multiple copies of a particle may exist and diffuse
to multiple locations in the continuous space, resulting in (a random number
of) possibly overlapping clusters of the objects. We demonstrate how to build
a hierarchically-clustered factor analysis model with the beta diffusion tree
and how to perform inference over the random tree structures with a Markov
chain Monte Carlo algorithm. We conclude with several numerical experiments
on missing data problems with data sets of gene expression arrays,
international development statistics, and intranational socioeconomic
measurements.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Probabilistic
matrix factorization with non-random missing data**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We propose a probabilistic matrix
factorization model for collaborative filtering that learns from data that is
missing not at random (MNAR). Matrix factorization models exhibit
state-of-the-art predictive performance in collaborative filtering. However,
these models usually assume that the data is missing at random (MAR), and
this is rarely the case. For example, the data is not MAR if users rate items
they like more than ones they dislike. When the MAR assumption is incorrect,
inferences are biased and predictive performance can suffer. Therefore, we
model both the generative process for the data and the missing data
mechanism. By learning these two models jointly we obtain improved
performance over state-of-the-art methods when predicting the ratings and
when modeling the data observation process. We present the first viable MF
model for MNAR data. Our results are promising and we expect that further
research on MNAR models will yield large gains in collaborative
filtering.
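
The bias that the abstract attributes to a wrong MAR assumption can be illustrated with a small simulation. This is a minimal sketch and not the model from the paper: the uniform rating distribution and the observation probability `r / 5` are assumptions chosen only to make the effect visible.

```python
import random

def simulate_mnar(n_ratings=100_000, seed=0):
    """Compare the true mean rating with the mean of the observed ratings
    when the chance of observing a rating grows with its value (MNAR)."""
    rng = random.Random(seed)
    # Assumption: true ratings are uniform on 1..5.
    ratings = [rng.randint(1, 5) for _ in range(n_ratings)]
    # Assumption: a rating r is observed with probability r / 5,
    # so items that a user likes are rated more often.
    observed = [r for r in ratings if rng.random() < r / 5]
    true_mean = sum(ratings) / len(ratings)
    observed_mean = sum(observed) / len(observed)
    return true_mean, observed_mean

true_mean, observed_mean = simulate_mnar()
print(f"true mean: {true_mean:.2f}, observed mean: {observed_mean:.2f}")
```

Under these assumptions the true mean is 3 in expectation, while the mean of the observed ratings is 11/3, so an estimator that treats the observed ratings as MAR overestimates how much users like the items.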

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Stochastic
inference for scalable probabilistic modeling of binary matrices**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** Fully observed large binary matrices
appear in a wide variety of contexts. To model them, probabilistic matrix
factorization (PMF) methods are an attractive solution. However, current
batch algorithms for PMF can be inefficient because they need to analyze the
entire data matrix before producing any parameter updates. We derive an
efficient stochastic inference algorithm for PMF models of fully observed
binary matrices. Our method exhibits faster convergence rates than more
expensive batch approaches and has better predictive performance than
scalable alternatives. The proposed method includes new data subsampling
strategies which produce large gains over standard uniform subsampling. We
also address the task of automatically selecting the size of the minibatches
of data used by our method. For this, we derive an algorithm that adjusts
this hyper-parameter online.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani.
**A reversible
infinite HMM using normalised random measures**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We present a nonparametric prior
over reversible Markov chains. We use completely random measures,
specifically gamma processes, to construct a countably infinite graph with
weighted edges. By enforcing symmetry to make the edges undirected we define
a prior over random walks on graphs that results in a reversible Markov
chain. The resulting prior over infinite transition matrices is closely
related to the hierarchical Dirichlet process but enforces reversibility. A
reinforcement scheme has recently been proposed with similar properties, but
the de Finetti measure is not well characterised. We take the alternative
approach of explicitly constructing the mixing measure, which allows more
straightforward and efficient inference at the cost of no longer having a
closed form predictive distribution. We use our process to construct a
reversible infinite HMM which we apply to two real datasets, one from
epigenomics and one ion channel recording.

David Duvenaud, Oren Rippel, Ryan P. Adams, and Zoubin Ghahramani.
**Avoiding pathologies in very
deep networks**.
In *17th International Conference on Artificial Intelligence and
Statistics*, Reykjavik, Iceland, April 2014.

**
Abstract:** Choosing appropriate architectures and regularization
strategies for deep networks is crucial to good predictive performance. To
shed light on this problem, we analyze the analogous problem of constructing
useful priors on compositions of functions. Specifically, we study the deep
Gaussian process, a type of infinitely-wide, deep neural network. We show
that in standard architectures, the representational capacity of the network
tends to capture fewer degrees of freedom as the number of layers increases,
retaining only a single degree of freedom in the limit. We propose an
alternate network architecture which does not suffer from this pathology. We
also examine deep covariance functions, obtained by composing infinitely many
feature transforms. Lastly, we characterize the class of models obtained by
performing dropout on Gaussian processes.

Sébastien Bratières, Novi Quadrianto, Sebastian Nowozin, and Zoubin
Ghahramani.
**Scalable
Gaussian Process structured prediction for grid factor graph
applications**.
In *31st International Conference on Machine Learning*, 2014.

** Abstract:** Structured prediction is an important and well
studied problem with many applications across machine learning. GPstruct is a
recently proposed structured prediction model that offers appealing
properties such as being kernelised, non-parametric, and supporting Bayesian
inference (Bratières et al. 2013). The model places a Gaussian process prior
over energy functions which describe relationships between input variables
and structured output variables. However, the memory demand of GPstruct is
quadratic in the number of latent variables and training runtime scales
cubically. This prevents GPstruct from being applied to problems involving
grid factor graphs, which are prevalent in computer vision and spatial
statistics applications. Here we explore a scalable approach to learning
GPstruct models based on ensemble learning, with weak learners (predictors)
trained on subsets of the latent variables and bootstrap data, which can
easily be distributed. We show experiments with 4M latent variables on image
segmentation. Our method outperforms widely-used conditional random field
models trained with pseudo-likelihood. Moreover, in image segmentation
problems it improves over recent state-of-the-art marginal optimisation
methods in terms of predictive performance and uncertainty calibration.
Finally, it generalises well on all training set sizes.

Alex Davies and Zoubin Ghahramani.
**The random forest
kernel and other kernels for big data from random partitions**.
*arXiv*, abs/1402.4293, 2014.

** Abstract:** We present
Random Partition Kernels, a new class of kernels derived by demonstrating a
natural connection between random partitions of objects and kernels between
those objects. We show how the construction can be used to create kernels
from methods that would not normally be viewed as random partitions, such as
Random Forest. To demonstrate the potential of this method, we propose two
new kernels, the Random Forest Kernel and the Fast Cluster Kernel, and show
that these kernels consistently outperform standard kernels on problems
involving real-world datasets. Finally, we show how the form of these kernels
lends itself to a natural approximation that is appropriate for certain
big data problems, allowing O(N) inference in methods such as Gaussian
Processes, Support Vector Machines and Kernel PCA.
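
The connection between random partitions and kernels can be sketched by taking the kernel value of two objects to be the fraction of sampled partitions in which they share a cell. The code below is an illustrative sketch, not the authors' implementation: the partition sampler (nearest of a few randomly chosen centers, on 1-D data) and all parameter values are assumptions.

```python
import random

def random_partition(points, n_centers, rng):
    """Assumed sampler: label each 1-D point by its nearest random center."""
    centers = rng.sample(points, n_centers)
    return [min(range(n_centers), key=lambda c: abs(p - centers[c]))
            for p in points]

def partition_kernel(points, n_partitions=200, n_centers=3, seed=0):
    """Estimate K[i][j] as the fraction of sampled partitions in which
    points i and j fall into the same cell."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_partitions):
        labels = random_partition(points, n_centers, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / n_partitions for c in row] for row in counts]

# Two well-separated groups of points (hypothetical data).
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
K = partition_kernel(points)
print(K[0][1], K[0][3])  # within-group vs. across-group similarity
```

Each sampled partition contributes a block-structured 0/1 matrix, so the average is symmetric and positive semi-definite by construction.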

Yarin Gal and Zoubin Ghahramani.
**Pitfalls in the
use of parallel inference for the Dirichlet process**.
In *Proceedings of the 31st International Conference on Machine Learning
(ICML-14)*, 2014.

** Abstract:** Recent work done by
Lovell, Adams, and Mansinghka (2012) and Williamson, Dubey, and Xing (2013)
has suggested an alternative parametrisation for the Dirichlet process in
order to derive non-approximate parallel MCMC inference for it – work which
has been picked up and implemented in several different fields. In this paper
we show that the approach suggested is impractical due to an extremely
unbalanced distribution of the data. We characterise the requirements of
efficient parallel inference for the Dirichlet process and show that the
proposed inference fails most of these requirements (while approximate
approaches often satisfy most of them). We present both theoretical and
experimental evidence, analysing the load balance for the inference and
showing that it is independent of the size of the dataset and the number of
nodes available in the parallel implementation. We end with suggestions of
alternative paths of research for efficient non-approximate parallel
inference for the Dirichlet process.

David Lopez-Paz, Suvrit Sra, Alex J. Smola, Zoubin Ghahramani, and Bernhard
Schölkopf.
**Randomized
nonlinear component analysis**.
In *ICML*, volume 29 of *JMLR Proceedings*. JMLR.org,
2014.

** Abstract:** Classical techniques such as Principal
Component Analysis (PCA) and Canonical Correlation Analysis (CCA) are
ubiquitous in statistics. However, these techniques only reveal linear
relationships in data. Although nonlinear variants of PCA and CCA have been
proposed, they are computationally prohibitive in the large scale. In a
separate strand of recent research, randomized methods have been proposed to
construct features that help reveal nonlinear patterns in data. For basic
tasks such as regression or classification, random features exhibit little or
no loss in performance, while achieving dramatic savings in computational
requirements. In this paper we leverage randomness to design scalable new
variants of nonlinear PCA and CCA; our ideas also extend to key multivariate
analysis tools such as spectral clustering or LDA. We demonstrate our
algorithms through experiments on real-world data, on which we compare
against the state-of-the-art. Code in R implementing our methods is provided
in the Appendix.

Alexander G. D. G. Matthews and Zoubin Ghahramani.
**Classification using log
Gaussian Cox processes**.
*arXiv preprint arXiv:1405.4141*, 2014.

** Abstract:**
McCullagh and Yang (2006) suggest a family of classification algorithms
based on Cox processes. We further investigate the log Gaussian variant which
has a number of appealing properties. Conditioned on the covariates, the
distribution over labels is given by a type of conditional Markov random
field. In the supervised case, computation of the predictive probability of a
single test point scales linearly with the number of training points and the
multiclass generalization is straightforward. We show new links between the
supervised method and classical nonparametric methods. We give a detailed
analysis of the pairwise graph representable Markov random field, which we
use to extend the model to semi-supervised learning problems, and propose an
inference method based on graph min-cuts. We give the first experimental
analysis on supervised and semi-supervised datasets and show good empirical
performance.

Amar Shah, Andrew Gordon Wilson, and Zoubin Ghahramani.
**Student-t
processes as alternatives to Gaussian processes**.
In *AISTATS*, JMLR Proceedings. JMLR.org, 2014.

**
Abstract:** We investigate the Student-t process as an alternative to the
Gaussian process as a nonparametric prior over functions. We derive closed
form expressions for the marginal likelihood and predictive distribution of a
Student-t process, by integrating away an inverse Wishart process prior over
the covariance kernel of a Gaussian process model. We show surprising
equivalences between different hierarchical Gaussian process models leading
to Student-t processes, and derive a new sampling scheme for the inverse
Wishart process, which helps elucidate these equivalences. Overall, we show
that a Student-t process can retain the attractive properties of a Gaussian
process - a nonparametric representation, analytic marginal and predictive
distributions, and easy model selection through covariance kernels - but has
enhanced flexibility, and predictive covariances that, unlike a Gaussian
process, explicitly depend on the values of training observations. We verify
empirically that a Student-t process is especially useful in situations where
there are changes in covariance structure, or in applications like Bayesian
optimization, where accurate predictive covariances are critical for good
performance. These advantages come at no additional computational cost over
Gaussian processes.

Tomoharu Iwata, David Duvenaud, and Zoubin Ghahramani.
**Warped mixtures for nonparametric
cluster shapes**.
In *29th Conference on Uncertainty in Artificial Intelligence*,
Bellevue, Washington, July 2013.

** Abstract:** A mixture of
Gaussians fit to a single curved or heavy-tailed cluster will report that the
data contains many clusters. To produce more appropriate clusterings, we
introduce a model which warps a latent mixture of Gaussians to produce
nonparametric cluster shapes. The possibly low-dimensional latent mixture
model allows us to summarize the properties of the high-dimensional clusters
(or density manifolds) describing the data. The number of manifolds, as well
as the shape and dimension of each manifold is automatically inferred. We
derive a simple inference scheme for this model which analytically integrates
out both the mixture parameters and the warping function. We show that our
model is effective for density estimation, performs better than infinite
Gaussian mixture models at recovering the true number of clusters, and
produces interpretable summaries of high-dimensional datasets.

Novi Quadrianto, Viktoriia Sharmanska, David A. Knowles, and Zoubin Ghahramani.
**The supervised
IBP: Neighbourhood preserving infinite latent feature models**.
In *29th Conference on Uncertainty in Artificial Intelligence*,
Bellevue, USA, July 2013.

** Abstract:** We propose a
probabilistic model to infer supervised latent variables in the Hamming space
from observed data. Our model allows simultaneous inference of the number of
binary latent variables, and their values. The latent variables preserve
neighbourhood structure of the data in the sense that objects in the same
semantic concept have similar latent values, and objects in different
concepts have dissimilar latent values. We formulate the supervised infinite
latent variable problem based on an intuitive principle of pulling objects
together if they are of the same type, and pushing them apart if they are
not. We then combine this principle with a flexible Indian Buffet Process
prior on the latent variables. We show that the inferred supervised latent
variables can be directly used to perform a nearest neighbour search for the
purpose of retrieval. We introduce a new application of dynamically extending
hash codes, and show how to effectively couple the structure of the hash
codes with continuously growing structure of the neighbourhood preserving
infinite latent feature space.

David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, and
Zoubin Ghahramani.
**Structure
discovery in nonparametric regression through compositional kernel
search**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Despite its
importance, choosing the structural form of the kernel in nonparametric
regression remains a black art. We define a space of kernel structures which
are built compositionally by adding and multiplying a small number of base
kernels. We present a method for searching over this space of structures
which mirrors the scientific discovery process. The learned structures can
often decompose functions into interpretable components and enable long-range
extrapolation on time-series datasets. Our structure search method
outperforms many widely used kernels and kernel combination methods on a
variety of prediction tasks.
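
Building kernels compositionally by adding and multiplying base kernels can be sketched as follows. This is a minimal illustration on scalar inputs, not the search procedure from the paper: the choice of base kernels, their parameterisation, and the hyperparameter values are assumptions.

```python
import math

def squared_exp(lengthscale=1.0):
    """Squared-exponential base kernel on scalars."""
    return lambda x, y: math.exp(-0.5 * ((x - y) / lengthscale) ** 2)

def periodic(period=1.0, lengthscale=1.0):
    """Periodic base kernel on scalars."""
    return lambda x, y: math.exp(
        -2.0 * math.sin(math.pi * abs(x - y) / period) ** 2 / lengthscale ** 2
    )

def linear():
    """Linear base kernel on scalars."""
    return lambda x, y: x * y

def add(k1, k2):
    """Sum of two kernels, which is again a valid kernel."""
    return lambda x, y: k1(x, y) + k2(x, y)

def mul(k1, k2):
    """Product of two kernels, which is again a valid kernel."""
    return lambda x, y: k1(x, y) * k2(x, y)

# Example structure: a locally periodic component plus a linear trend.
kernel = add(mul(squared_exp(10.0), periodic(1.0)), linear())
print(kernel(1.0, 3.0))
```

A structure search would score many such expressions, for example by marginal likelihood, and expand the best one; that step is omitted here.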

Creighton Heaukulani and Zoubin Ghahramani.
**Dynamic
probabilistic models for latent feature propagation in social
networks**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Current Bayesian
models for dynamic social network data have focused on modelling the
influence of evolving unobserved structure on observed social interactions.
However, an understanding of how observed social relationships from the past
affect future unobserved structure in the network has been neglected. In this
paper, we introduce a new probabilistic model for capturing this phenomenon,
which we call latent feature propagation, in social networks. We demonstrate
our model's capability for inferring such latent structure in varying types
of social network datasets, and experimental studies show this structure
achieves higher predictive performance on link prediction and forecasting
tasks.

David Lopez-Paz, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Gaussian
process vine copulas for multivariate dependence**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Copulas allow marginal distributions to be learned
separately from the multivariate dependence structure
(copula) that links them together into a density function. Vine
factorizations ease the learning of high-dimensional copulas by constructing
a hierarchy of conditional bivariate copulas. However, to simplify inference,
it is common to assume that each of these conditional bivariate copulas is
independent from its conditioning variables. In this paper, we relax this
assumption by discovering the latent functions that specify the shape of a
conditional copula given its conditioning variables. We learn these functions
by following a Bayesian approach based on sparse Gaussian processes with
expectation propagation for scalable, approximate inference. Experiments on
real-world datasets show that, when modeling all conditional dependencies, we
obtain better estimates of the underlying copula of the data.

Yue Wu, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Dynamic covariance
models for multivariate financial time series**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** The accurate
prediction of time-changing covariances is an important problem in the
modeling of multivariate financial data. However, some of the most popular
models suffer from a) overfitting problems and multiple local optima, b)
failure to capture shifts in market conditions and c) large computational
costs. To address these problems we introduce a novel dynamic model for
time-changing covariances. Over-fitting and local optima are avoided by
following a Bayesian approach instead of computing point estimates. Changes
in market conditions are captured by assuming a diffusion process in
parameter values, and finally computationally efficient and scalable
inference is performed using particle filters. Experiments with financial
data show excellent performance of the proposed method with respect to
current standard models.

Jacob Andreas and Zoubin Ghahramani.
**A generative model
of vector space semantics**.
*ACL 2013*, page 91, 2013.

** Abstract:** We present
a novel compositional, generative model for vector space representations of
meaning. This model reformulates earlier tensor-based approaches to vector
space semantics as a top-down process, and provides efficient algorithms for
transformation from natural language to vectors and from vectors to natural
language. We describe procedures for estimating the parameters of the model
from positive examples of similar phrases, and from distributional
representations, then use these procedures to obtain similarity judgments for
a set of adjective-noun pairs. The model’s estimation of the similarity of
these pairs correlates well with human annotations, demonstrating a
substantial improvement over several existing compositional approaches in
both settings.

Sébastien Bratières, Novi Quadrianto, and Zoubin Ghahramani.
**Bayesian
structured prediction using Gaussian processes**.
*arXiv*, abs/1307.3846, 2013.

** Abstract:** We introduce
a conceptually novel structured prediction model, GPstruct, which is
kernelized, non-parametric and Bayesian, by design. We motivate the model
with respect to existing approaches, among others, conditional random fields
(CRFs), maximum margin Markov networks (M3N), and structured support vector
machines (SVMstruct), which embody only a subset of its properties. We
present an inference procedure based on Markov Chain Monte Carlo. The
framework can be instantiated for a wide range of structured objects such as
linear chains, trees, grids, and other general graphs. As a proof of concept,
the model is benchmarked on several natural language processing tasks and a
video gesture segmentation task involving a linear chain structure. We show
prediction accuracies for GPstruct which are comparable to or exceeding those
of CRFs and SVMstruct.

Konstantinos Bousmalis, Stefanos Zafeiriou, Louis-Philippe Morency, Maja
Pantic, and Zoubin Ghahramani.
**Variational
hidden conditional random fields with coupled Dirichlet process
mixtures**.
In Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen, and Filip
Zelezný, editors, *ECML/PKDD*, volume 8189 of *Lecture Notes in
Computer Science*, pages 531-547. Springer, 2013.

**
Abstract:** Hidden Conditional Random Fields (HCRFs) are discriminative
latent variable models which have been shown to successfully learn the hidden
structure of a given classification problem. An infinite HCRF is an HCRF with
a countably infinite number of hidden states, which rids us not only of the
necessity to specify a priori a fixed number of hidden states available but
also of the problem of overfitting. Markov chain Monte Carlo (MCMC) sampling
algorithms are often employed for inference in such models. However,
convergence of such algorithms is rather difficult to verify, and as the
complexity of the task at hand increases, the computational cost of such
algorithms often becomes prohibitive. These limitations can be overcome by
variational techniques. In this paper, we present a generalized framework for
infinite HCRF models, and a novel variational inference approach on a model
based on coupled Dirichlet Process Mixtures, the HCRF-DPM. We show that the
variational HCRF-DPM is able to converge to a correct number of represented
hidden states, and performs as well as the best parametric HCRFs (chosen
via cross-validation) for the difficult tasks of recognizing instances of
agreement, disagreement, and pain in audiovisual sequences.

Frederik Eaton and Zoubin Ghahramani.
**Model reductions
for inference: Generality of pairwise, binary, and planar factor
graphs**.
*Neural Computation*, 25(5):1213-1260, 2013.

**
Abstract:** We offer a solution to the problem of efficiently translating
algorithms between different types of discrete statistical model. We
investigate the expressive power of three classes of model (those with binary
variables, with pairwise factors, and with planar topology) as well as their
four intersections. We formalize a notion of "simple reduction" for the
problem of inferring marginal probabilities and consider whether it is
possible to "simply reduce" marginal inference from general discrete factor
graphs to factor graphs in each of these seven subclasses. We characterize
the reducibility of each class, showing in particular that the class of
binary pairwise factor graphs is able to simply reduce only positive models.
We also exhibit a continuous "spectral reduction" based on polynomial
interpolation, which overcomes this limitation. Experiments assess the
performance of standard approximate inference algorithms on the outputs of
our reductions.

Zoubin Ghahramani.
**Bayesian
nonparametrics and the probabilistic approach to modelling**.
*Philosophical Transactions of the Royal Society A*, 2013.

** Abstract:** Modelling is fundamental to many fields of
science and engineering. A model can be thought of as a representation of
possible data one could predict from a system. The probabilistic approach to
modelling uses probability theory to express all aspects of uncertainty in
the model. The probabilistic approach is synonymous with Bayesian modelling,
which simply uses the rules of probability theory in order to make
predictions, compare alternative models, and learn model parameters and
structure from data. This simple and elegant framework is most powerful when
coupled with flexible probabilistic models. Flexibility is achieved through
the use of Bayesian nonparametrics. This article provides an overview of
probabilistic modelling and an accessible survey of some of the main tools in
Bayesian nonparametrics. The survey covers the use of Bayesian nonparametrics
for modelling unknown functions, density estimation, clustering, time series
modelling, and representing sparsity, hierarchies, and covariance structure.
More specifically it gives brief non-technical overviews of Gaussian
processes, Dirichlet processes, infinite hidden Markov models, Indian buffet
processes, Kingman's coalescent, Dirichlet diffusion trees, and Wishart
processes.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Stochastic
inference for scalable probabilistic modeling of binary matrices**.
In *NIPS Workshop on Randomized Methods for Machine Learning*, 2013.

** Abstract:** Fully observed large binary matrices appear in a
wide variety of contexts. To model them, probabilistic matrix factorization
(PMF) methods are an attractive solution. However, current batch algorithms
for PMF can be inefficient since they need to analyze the entire data matrix
before producing any parameter updates. We derive an efficient stochastic
inference algorithm for PMF models of fully observed binary matrices. Our
method exhibits faster convergence rates than more expensive batch approaches
and has better predictive performance than scalable alternatives. The
proposed method includes new data subsampling strategies which produce large
gains over standard uniform subsampling. We also address the task of
automatically selecting the size of the minibatches of data and we propose an
algorithm that adjusts this hyper-parameter in an online manner.

Tomoharu Iwata, Neil Houlsby, and Zoubin Ghahramani.
**Active
learning for interactive visualization**.
In *16th International Conference on Artificial Intelligence and
Statistics*, 2013.

** Abstract:** Many automatic
visualization methods have been proposed. However, a visualization that is
automatically generated might be different to how a user wants to arrange the
objects in visualization space. By allowing users to re-locate objects in the
embedding space of the visualization, they can adjust the visualization to
their preference. We propose an active learning framework for interactive
visualization which selects objects for the user to re-locate so that they
can obtain their desired visualization by re-locating as few as possible. The
framework is based on an information theoretic criterion, which favors
objects that reduce the uncertainty of the visualization. We present a
concrete application of the proposed framework to the Laplacian eigenmap
visualization method. We demonstrate experimentally that the proposed
framework yields the desired visualization with fewer user interactions than
existing methods.

Tomoharu Iwata, Amar Shah, and Zoubin Ghahramani.
**Discovering
latent influence in online social activities via shared cascade Poisson
processes**.
In *KDD*, pages 266-274. Association for Computing Machinery, 2013.

** Abstract:** Many people share their activities with others
through online communities. These shared activities have an impact on other
users' activities. For example, users are likely to become interested in
items that are adopted (e.g. liked, bought and shared) by their friends. In
this paper, we propose a probabilistic model for discovering latent influence
from sequences of item adoption events. An inhomogeneous Poisson process is
used for modeling a sequence, in which adoption by a user triggers the
subsequent adoption of the same item by other users. For modeling adoption of
multiple items, we employ multiple inhomogeneous Poisson processes, which
share parameters, such as influence for each user and relations between
users. The proposed model can be used for finding influential users,
discovering relations between users and predicting item popularity in the
future. We present an efficient Bayesian inference procedure of the proposed
model based on the stochastic EM algorithm. The effectiveness of the proposed
model is demonstrated by using real data sets in a social bookmark sharing
service.

Simon Lacoste-Julien, Konstantina Palla, Alex Davies, Gjergji Kasneci, Thore
Graepel, and Zoubin Ghahramani.
**SiGMa: simple
greedy matching for aligning large knowledge bases**.
In *KDD*, pages 572-580. Association for Computing Machinery, 2013.

** Abstract:** The Internet has enabled the creation of a
growing number of large-scale knowledge bases in a variety of domains
containing complementary information. Tools for automatically aligning these
knowledge bases would make it possible to unify many sources of structured
knowledge and answer complex queries. However, the efficient alignment of
large-scale knowledge bases still poses a considerable challenge. Here, we
present Simple Greedy Matching (SiGMa), a simple algorithm for aligning
knowledge bases with millions of entities and facts. SiGMa is an iterative
propagation algorithm which leverages both the structural information from
the relationship graph as well as flexible similarity measures between entity
properties in a greedy local search, thus making it scalable. Despite its
greedy nature, our experiments indicate that SiGMa can efficiently match some
of the world's largest knowledge bases with high precision. We provide
additional experiments on benchmark datasets which demonstrate that SiGMa can
outperform state-of-the-art approaches both in accuracy and efficiency.

Colorado Reed and Zoubin Ghahramani.
**Scaling the
Indian buffet process via submodular maximization**.
In *ICML*, volume 28 of *JMLR Proceedings*, pages
1013-1021. JMLR.org, 2013.

** Abstract:** Inference for
latent feature models is inherently difficult as the inference space grows
exponentially with the size of the input data and number of latent features.
In this work, we use the maximization-expectation framework of Kurihara &
Welling (2008) to perform approximate MAP inference for linear-Gaussian latent
feature models with an Indian Buffet Process (IBP) prior. This formulation
yields a submodular function of the features that corresponds to a lower
bound on the model evidence. By adding a constant to this function, we obtain
a nonnegative submodular function that can be maximized via a greedy
algorithm that obtains at least a one-third approximation to the optimal
solution. Our inference method scales linearly with the size of the input
data, and we show the efficacy of our method on the largest datasets
currently analyzed using an IBP model.

Amar Shah and Zoubin Ghahramani.
**Determinantal
clustering processes - A nonparametric Bayesian approach to kernel based
semi-supervised clustering**.
*UAI*, 2013.

** Abstract:** Semi-supervised clustering is
the task of clustering data points into clusters where only a fraction of the
points are labelled. The true number of clusters in the data is often unknown
and most models require this parameter as an input. Dirichlet process mixture
models are appealing as they can infer the number of clusters from the data.
However, these models do not deal with high dimensional data well and can
encounter difficulties in inference. We present a novel nonparametric
Bayesian kernel based method to cluster data points without the need to
prespecify the number of clusters or to model complicated densities from
which data points are assumed to be generated. The key insight is to use
determinants of submatrices of a kernel matrix as a measure of how close
together a set of points are. We explore some theoretical properties of the
model and derive a natural Gibbs based algorithm with MCMC hyperparameter
learning. The model is implemented on a variety of synthetic and real world
data sets.

James Robert Lloyd, Peter Orbanz, Zoubin Ghahramani, and Daniel M. Roy.
**Random
function priors for exchangeable arrays with applications to graphs and
relational data**.
In *Advances in Neural Information Processing Systems 25*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:** A
fundamental problem in the analysis of structured relational data like
graphs, networks, databases, and matrices is to extract a summary of the
common structure underlying relations between individual entities. Relational
data are typically encoded in the form of arrays; invariance to the ordering
of rows and columns corresponds to exchangeable arrays. Results in
probability theory due to Aldous, Hoover and Kallenberg show that
exchangeable arrays can be represented in terms of a random measurable
function which constitutes the natural model parameter in a Bayesian model.
We obtain a flexible yet simple Bayesian nonparametric model by placing a
Gaussian process prior on the parameter function. Efficient inference
utilises elliptical slice sampling combined with a random sparse
approximation to the Gaussian process. We demonstrate applications of the
model to network data and clarify its relation to models in the literature,
several of which emerge as special cases.

Michael A. Osborne, David Duvenaud, Roman Garnett, Carl Edward Rasmussen,
Stephen J. Roberts, and Zoubin Ghahramani.
**Active
learning of model evidence using Bayesian quadrature**.
In *Advances in Neural Information Processing Systems 25*, pages 46-54,
Lake Tahoe, California, USA, December 2012.

** Abstract:**
Numerical integration is a key component of many problems in scientific
computing, statistical modelling, and machine learning. Bayesian Quadrature
is a model-based method for numerical integration which, relative to standard
Monte Carlo methods, offers increased sample efficiency and a more robust
estimate of the uncertainty in the estimated integral. We propose a novel
Bayesian Quadrature approach for numerical integration when the integrand is
non-negative, such as the case of computing the marginal likelihood,
predictive distribution, or normalising constant of a probabilistic model.
Our approach approximately marginalises the quadrature model’s
hyperparameters in closed form, and introduces an active learning scheme to
optimally select function evaluations, as opposed to using Monte Carlo
samples. We demonstrate our method on both a number of synthetic benchmarks
and a real scientific problem from astronomy.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani.
**A
nonparametric variable clustering model**.
In *Advances in Neural Information Processing Systems 25*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:**
Factor analysis models effectively summarise the covariance structure of high
dimensional data, but the solutions are typically hard to interpret. This
motivates attempting to find a disjoint partition, i.e. a simple clustering,
of observed variables into highly correlated subsets. We introduce a Bayesian
non-parametric approach to this problem, and demonstrate advantages over
heuristic methods proposed to date. Our Dirichlet process variable clustering
(DPVC) model can discover block-diagonal covariance structures in data. We
evaluate our method on both synthetic and gene expression analysis
problems.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani.
**An infinite latent attribute
model for network data**.
In *29th International Conference on Machine Learning*, Edinburgh,
Scotland, June 2012.

** Abstract:** Latent variable models for
network data extract a summary of the relational structure underlying an
observed network. The simplest possible models subdivide nodes of the network
into clusters; the probability of a link between any two nodes then depends
only on their cluster assignment. Currently available models can be
classified by whether clusters are disjoint or are allowed to overlap. These
models can explain a "flat" clustering structure. Hierarchical Bayesian
models provide a natural approach to capture more complex dependencies. We
propose a model in which objects are characterised by a latent feature
vector. Each feature is itself partitioned into disjoint groups
(subclusters), corresponding to a second layer of hierarchy. In experimental
comparisons, the model achieves significantly improved predictive performance
on social and biological link prediction tasks. The results indicate that
models with a single layer hierarchy over-simplify real networks.

Andrew Gordon Wilson, David A. Knowles, and Zoubin Ghahramani.
**Gaussian process regression
networks**.
In *29th International Conference on Machine Learning*, Edinburgh,
Scotland, June 2012.

** Abstract:** We introduce a new
regression framework, Gaussian process regression networks (GPRN), which
combines the structural properties of Bayesian neural networks with the
nonparametric flexibility of Gaussian processes. GPRN accommodates input
(predictor) dependent signal and noise correlations between multiple output
(response) variables, input dependent length-scales and amplitudes, and
heavy-tailed predictive distributions. We derive both elliptical slice
sampling and variational Bayes inference procedures for GPRN. We apply GPRN
as a multiple output regression and multivariate volatility model,
demonstrating substantially improved performance over eight popular multiple
output (multi-task) Gaussian process models and three multivariate volatility
models on real datasets, including a 1000 dimensional gene expression
dataset.

A. Bahramisharif, M. A. J. van Gerven, J-M. Schoffelen, Z. Ghahramani, and
T. Heskes.
**The dynamic
beamformer**.
In G. Langs et al., editors, *Machine Learning in Interpretation of
Neuroimaging (MLINI) 2011 LNAI 7263*, pages 148-155, 2012.

** Abstract:** Beamforming is one of the most commonly used
methods for estimating the active neural sources from the MEG or EEG sensor
readings. The basic assumption in beamforming is that the sources are
uncorrelated, which allows for estimating each source independently of the
others. In this paper, we incorporate the independence assumption of the
standard beamformer in a linear dynamical system, thereby introducing the
dynamic beamformer. Using empirical data, we show that the dynamic beamformer
outperforms the standard beamformer in predicting the condition of interest
which strongly suggests that it also outperforms the standard method in
localizing the active neural generators.

John P. Cunningham, Zoubin Ghahramani, and Carl Edward Rasmussen.
**Gaussian
Processes for time-marked time-series data**.
In *15th International Conference on Artificial Intelligence and
Statistics*, pages 255-263, 2012.

** Abstract:** In many
settings, data is collected as multiple time series, where each recorded time
series is an observation of some underlying dynamical process of interest.
These observations are often time-marked with known event times, and one
desires to do a range of standard analyses. When there is only one time
marker, one simply aligns the observations temporally on that marker. When
multiple time-markers are present and are at different times on different
time series observations, these analyses are more difficult. We describe a
Gaussian Process model for analyzing multiple time series with multiple time
markings, and we test it on a variety of data.

Neil Houlsby, Jose Miguel Hernández-Lobato, Ferenc Huszár, and Zoubin
Ghahramani.
**Collaborative
Gaussian processes for preference learning**.
In *Advances in Neural Information Processing Systems 25*, pages
2096-2104. Curran Associates, Inc., 2012.

** Abstract:** We
present a new model based on Gaussian processes (GPs) for learning pairwise
preferences expressed by multiple users. Inference is simplified by using a
*preference kernel* for GPs which allows us to combine supervised GP
learning of user preferences with unsupervised dimensionality reduction for
multi-user systems. The model not only exploits collaborative information
from the shared structure in user behavior, but may also incorporate user
features if they are available. Approximate inference is implemented using a
combination of expectation propagation and variational Bayes. Finally, we
present an efficient active learning strategy for querying preferences. The
proposed technique performs favorably on real-world data against
state-of-the-art multi-user preference learning algorithms.

Hyun-Chul Kim and Zoubin Ghahramani.
**Bayesian
classifier combination**.
In *15th International Conference on Artificial Intelligence and
Statistics*, 2012.

** Abstract:** Bayesian model averaging
linearly mixes the probabilistic predictions of multiple models, each
weighted by its posterior probability. This is the coherent Bayesian way of
combining multiple models only under certain restrictive assumptions, which
we outline. We explore a general framework for Bayesian model combination
(which differs from model averaging) in the context of classification. This
framework explicitly models the relationship between each model’s output
and the unknown true label. The framework does not require that the models be
probabilistic (they can even be human assessors), that they share prior
information or receive the same training data, or that they be independent in
their errors. Finally, the Bayesian combiner does not need to believe any of
the models is in fact correct. We test several variants of this classifier
combination procedure starting from a classic statistical model proposed by
Dawid and Skene (1979) and using MCMC to add more complex but important
features to the model. Comparisons on several data sets to simpler methods
like majority voting show that the Bayesian methods not only perform well but
result in interpretable diagnostics on the data points and the models.
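
For reference, the model averaging baseline described in the first sentence of this abstract can be sketched as follows. This shows only plain Bayesian model averaging, not the paper's combination framework. The function names, log evidences, and class probabilities are hypothetical values chosen for illustration.

```python
import math

def posterior_model_weights(log_evidences, log_priors=None):
    """Posterior model probabilities from log marginal likelihoods.

    Uses a uniform prior over models when `log_priors` is None.
    """
    if log_priors is None:
        log_priors = [0.0] * len(log_evidences)
    log_joint = [e + p for e, p in zip(log_evidences, log_priors)]
    shift = max(log_joint)  # subtract the max for numerical stability
    unnormalised = [math.exp(v - shift) for v in log_joint]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

def bayesian_model_average(class_probs_per_model, weights):
    """Linear mixture of per-model class probabilities."""
    num_classes = len(class_probs_per_model[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, class_probs_per_model))
        for c in range(num_classes)
    ]

# Hypothetical inputs: three models, two classes.
log_evidences = [-10.0, -10.0, -12.0]
class_probs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]

weights = posterior_model_weights(log_evidences)
averaged = bayesian_model_average(class_probs, weights)
print(weights, averaged)
```

The weights depend only on how well each model explains the training data as a whole, which is one reason averaging can differ from a combiner that models each classifier's relationship to the true label.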

P. Kirk, J. E. Griffin, R. S. Savage, Z. Ghahramani, and D. L. Wild.
**Bayesian
correlated clustering to integrate multiple datasets**.
*Bioinformatics*, 2012.

** Abstract:** Motivation: The
integration of multiple datasets remains a key challenge in systems biology
and genomic medicine. Modern high-throughput technologies generate a broad
array of different data types, providing distinct – but often complementary
– information. We present a Bayesian method for the unsupervised
integrative modelling of multiple datasets, which we refer to as MDI
(Multiple Dataset Integration). MDI can integrate information from a wide
range of different datasets and data types simultaneously (including the
ability to model time series data explicitly using Gaussian processes). Each
dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture
model, with dependencies between these models captured via parameters that
describe the agreement among the datasets.

Results: Using a set of 6
artificially constructed time series datasets, we show that MDI is able to
integrate a significant number of datasets simultaneously, and that it
successfully captures the underlying structural similarity between the
datasets. We also analyse a variety of real S. cerevisiae datasets. In the
2-dataset case, we show that MDI’s performance is comparable to the present
state of the art. We then move beyond the capabilities of current approaches
and integrate gene expression, ChIP-chip and protein-protein interaction
data, to identify a set of protein complexes for which genes are co-regulated
during the cell cycle. Comparisons to other unsupervised data integration
techniques – as well as to non-integrative approaches – demonstrate that
MDI is very competitive, while also providing information that would be
difficult or impossible to extract using other methods.

** Comment:** This paper is available from the Bioinformatics
site and a Matlab implementation of MDI is available from this site.

Paul D. W. Kirk, Jim E. Griffin, Richard S. Savage, Zoubin Ghahramani, and
David L. Wild.
**Bayesian
correlated clustering to integrate multiple datasets**.
*Bioinformatics*, 28(24):3290-3297, 2012.

** Abstract:**
MOTIVATION: The integration of multiple datasets remains a key challenge in
systems biology and genomic medicine. Modern high-throughput technologies
generate a broad array of different data types, providing distinct, but often
complementary, information. We present a Bayesian method for the unsupervised
integrative modelling of multiple datasets, which we refer to as MDI
(Multiple Dataset Integration). MDI can integrate information from a wide
range of different datasets and data types simultaneously (including the
ability to model time series data explicitly using Gaussian processes). Each
dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture
model, with dependencies between these models captured through parameters
that describe the agreement among the datasets. RESULTS: Using a set of six
artificially constructed time series datasets, we show that MDI is able to
integrate a significant number of datasets simultaneously, and that it
successfully captures the underlying structural similarity between the
datasets. We also analyse a variety of real Saccharomyces cerevisiae
datasets. In the two-dataset case, we show that MDI's performance is
comparable with the present state-of-the-art. We then move beyond the
capabilities of current approaches and integrate gene expression, chromatin
immunoprecipitation-chip and protein-protein interaction data, to identify a
set of protein complexes for which genes are co-regulated during the cell
cycle. Comparisons to other unsupervised data integration techniques, as well
as to non-integrative approaches, demonstrate that MDI is competitive, while
also providing information that would be difficult or impossible to extract
using other methods.

Shakir Mohamed, Katherine A. Heller, and Zoubin Ghahramani.
**Bayesian and
L1 approaches for sparse unsupervised learning**.
In

** Abstract:** The use of L1 regularisation for sparse learning
has generated immense research interest, with many successful applications in
diverse areas such as signal acquisition, image coding, genomics and
collaborative filtering. While existing work highlights the many advantages
of L1 methods, in this paper we find that L1 regularisation often
dramatically under-performs in terms of predictive performance when compared
to other methods for inferring sparsity. We focus on unsupervised latent
variable models, and develop L1 minimising factor models, Bayesian variants
of "L1", and Bayesian models with a stronger L0-like sparsity induced through
spike-and-slab distributions. These spike-and-slab Bayesian factor models
encourage sparsity while accounting for uncertainty in a principled manner,
and avoid unnecessary shrinkage of non-zero values. We demonstrate on a
number of data sets that in practice spike-and-slab Bayesian methods
out-perform L1 minimisation, even on a computational budget. We thus
highlight the need to re-assess the wide use of L1 methods in
sparsity-reliant applications, particularly when we care about generalising
to previously unseen data, and provide an alternative that, over many varying
conditions, provides improved generalisation performance.

Donglin Niu, Jennifer G. Dy, and Z. Ghahramani.
**A nonparametric
Bayesian model for multiple clustering with overlapping feature
views**.
In *15th International Conference on Artificial Intelligence and
Statistics*, 2012.

** Abstract:** Most clustering
algorithms produce a single clustering solution. This is inadequate for many
data sets that are multi-faceted and can be grouped and interpreted in many
different ways. Moreover, for high-dimensional data, different features may
be relevant or irrelevant to each clustering solution, suggesting the need
for feature selection in clustering. Features relevant to one clustering
interpretation may be different from the ones relevant for an alternative
interpretation or view of the data. In this paper, we introduce a
probabilistic nonparametric Bayesian model that can discover multiple
clustering solutions from data and the feature subsets that are relevant for
the clusters in each view. In our model, the features in different views may
be shared and therefore the sets of relevant features are allowed to overlap.
We model feature relevance to each view using an Indian Buffet Process and
the cluster membership in each view using a Chinese Restaurant Process. We
provide an inference approach to learn the latent parameters corresponding to
this multiple partitioning problem. Our model not only learns the features
and clusters in each view but also automatically learns the number of
clusters, number of views and number of features in each view.

Barnabas Poczos, Zoubin Ghahramani, and Jeff Schneider.
**Copula-based
kernel dependency measures**.
In *29th International Conference on Machine Learning*, 2012.

** Abstract:** The paper presents a new copula based method for
measuring dependence between random variables. Our approach extends the
Maximum Mean Discrepancy to the copula of the joint distribution. We prove
that this approach has several advantageous properties. Similarly to Shannon
mutual information, the proposed dependence measure is invariant to any
strictly increasing transformation of the marginal variables. This is
important in many applications, for example in feature selection. The
estimator is consistent, robust to outliers, and uses rank statistics only.
We derive upper bounds on the convergence rate and propose independence tests
too. We illustrate the theoretical contributions through a series of
experiments in feature selection and low-dimensional embedding of
distributions.
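
The invariance property mentioned above follows from working with ranks. The sketch below, which assumes no ties in the data and uses made-up values, maps a sample to its empirical copula values (normalised ranks) and checks that a strictly increasing transformation leaves them unchanged. It illustrates only the rank step, not the paper's dependence measure.

```python
import math

def empirical_copula_transform(values):
    """Map each value to rank / n, assuming there are no ties."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank / n
    return ranks

# Hypothetical sample of one marginal variable.
x = [0.3, -1.2, 2.5, 0.0, 1.1]

u = empirical_copula_transform(x)
# exp is strictly increasing, so the ordering of the sample is preserved.
u_transformed = empirical_copula_transform([math.exp(v) for v in x])

print(u)
print(u == u_transformed)
```

Any statistic computed from `u` alone inherits this invariance, which is why a rank-based estimator is unaffected by monotone rescaling of the marginals.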

Jacob Steinhardt and Zoubin Ghahramani.
**Flexible martingale
priors for deep hierarchies**.
In *15th International Conference on Artificial Intelligence and
Statistics*, 2012.

** Abstract:** When building priors
over trees for Bayesian hierarchical models, there is a tension between
maintaining desirable theoretical properties such as infinite exchangeability
and important practical properties such as the ability to increase the depth
of the tree to accommodate new data. We resolve this tension by presenting a
family of infinitely exchangeable priors over discrete tree structures that
allows the depth of the tree to grow with the data, and then showing that our
family contains all hierarchical models with certain mild symmetry
properties. We also show that deep hierarchical models are in general
intimately tied to a process called a martingale, and use Doob’s martingale
convergence theorem to demonstrate some unexpected properties of deep
hierarchies.

Kyung-Ah Sohn, Zoubin Ghahramani, and Eric P. Xing.
**Robust estimation
of local genetic ancestry in admixed populations using a non-parametric
Bayesian approach**.
*Genetics*, 191(4), 2012.

** Abstract:** We present a new
haplotype-based approach for inferring local genetic ancestry of individuals
in an admixed population. Most existing approaches for local ancestry
estimation ignore the latent genetic relatedness between ancestral
populations and treat them as independent. In this paper, we exploit such
information by building an inheritance model that describes both the
ancestral populations and the admixed population jointly in a unified
framework. Based on an assumption that the common hypothetical founder
haplotypes give rise to both the ancestral and admixed population haplotypes,
we employ an infinite hidden Markov model to characterize each ancestral
population and further extend it to generate the admixed population. Through
an effective utilization of the population structural information under a
principled nonparametric Bayesian framework, the resulting model is
significantly less sensitive to the choice and the amount of training data
for ancestral populations than state-of-the-art algorithms. We also improve
the robustness under deviation from common modeling assumptions by
incorporating population-specific scale parameters that allow variable
recombination rates in different populations. Our method is applicable to an
admixed population from an arbitrary number of ancestral populations and also
performs competitively in terms of spurious ancestry proportions under a
general multi-way admixture assumption. We validate the proposed method by
simulation under various admixing scenarios and present empirical analysis
results on a worldwide distributed dataset from the Human Genome Diversity
Project.

** Comment:** doi: 10.1534/genetics.112.140228

Andrew Gordon Wilson and Zoubin Ghahramani.
**Modelling input
varying correlations between multiple responses**.
In Peter A. Flach, Tijl De Bie, and Nello Cristianini, editors,
*ECML/PKDD*, volume 7524 of *Lecture Notes in Computer
Science*, pages 858-861. Springer, 2012.

** Abstract:**
We introduced a generalised Wishart process (GWP) for modelling input
dependent covariance matrices Σ(x), allowing one to model input varying
correlations and uncertainties between multiple response variables. The GWP
can naturally scale to thousands of response variables, as opposed to
competing multivariate volatility models which are typically intractable for
greater than 5 response variables. The GWP can also naturally capture a rich
class of covariance dynamics - periodicity, Brownian motion, smoothness,
... - through a covariance kernel.

Yichuan Zhang, Charles A. Sutton, Amos J. Storkey, and Zoubin Ghahramani.
**Continuous
relaxations for discrete Hamiltonian Monte Carlo**.
In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges,
Léon Bottou, and Kilian Q. Weinberger, editors, *NIPS*, pages
3203-3211, 2012.

** Abstract:** Continuous relaxations play
an important role in discrete optimization, but have not seen much use in
approximate probabilistic inference. Here we show that a general form of the
Gaussian Integral Trick makes it possible to transform a wide class of
discrete variable undirected models into fully continuous systems. The
continuous representation allows the use of gradient-based Hamiltonian Monte
Carlo for inference, results in new ways of estimating normalization
constants (partition functions), and in general opens up a number of new
avenues for inference in difficult discrete systems. We demonstrate some of
these continuous relaxation inference algorithms on a number of illustrative
problems.

Andrew Gordon Wilson, David A Knowles, and Zoubin Ghahramani.
**Gaussian process
regression networks**.
Technical Report arXiv:1110.4411 [stat.ML], Department of Engineering,
University of Cambridge, Cambridge, UK, October 19 2011.

** Abstract:** We introduce a new regression framework, Gaussian process
regression networks (GPRN), which combines the structural properties of
Bayesian neural networks with the non-parametric flexibility of Gaussian
processes. This model accommodates input dependent signal and noise
correlations between multiple response variables, input dependent
length-scales and amplitudes, and heavy-tailed predictive distributions. We
derive both efficient Markov chain Monte Carlo and variational Bayes
inference procedures for this model. We apply GPRN as a multiple output
regression and multivariate volatility model, demonstrating substantially
improved performance over eight popular multiple output (multi-task) Gaussian
process models and three multivariate volatility models on benchmark
datasets, including a 1000 dimensional gene expression dataset.

** Comment:** arXiv:1110.4411

A. Davies and Z. Ghahramani.
**Language-independent
Bayesian sentiment mining of twitter**.
In *The Fifth Workshop on Social Network Mining and Analysis
(SNA-KDD 2011)*, August 2011.

** Abstract:** This paper
outlines a new language-independent model for sentiment analysis of short,
social-network statuses. We demonstrate this on data from Twitter, modelling
happy vs sad sentiment, and show that in some circumstances this outperforms
similar Naive Bayes models by more than 10%. We also propose an extension to
allow the modelling of different sentiment distributions in different
geographic regions, while incorporating information from neighbouring
regions. We outline the considerations when creating a system analysing
Twitter data and present a scalable system of data acquisition and
prediction that can monitor the sentiment of tweets in real time.

Thomas L. Griffiths and Zoubin Ghahramani.
**The Indian buffet
process: An introduction and review**.
*Journal of Machine Learning Research*, 12:1185-1224, April 2011.

** Abstract:** The Indian buffet process is a stochastic process
defining a probability distribution over equivalence classes of sparse binary
matrices with a finite number of rows and an unbounded number of columns.
This distribution is suitable for use as a prior in probabilistic models that
represent objects using a potentially infinite array of features, or that
involve bipartite graphs in which the size of at least one class of nodes is
unknown. We give a detailed derivation of this distribution, and illustrate
its use as a prior in an infinite latent feature model. We then review recent
applications of the Indian buffet process in machine learning, discuss its
extensions, and summarize its connections to other stochastic processes.
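
The distribution described above is commonly introduced through a sequential "restaurant" construction: object i takes each existing feature k with probability m_k / i, where m_k counts earlier objects with that feature, and then adds Poisson(alpha / i) new features. A minimal sketch of that construction follows; the function names, the seed, and the value of alpha are illustrative assumptions, and the Poisson sampler is a simple textbook method rather than anything taken from the paper.

```python
import math
import random

def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for small rates."""
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_ibp(num_objects, alpha, rng):
    """Draw a binary feature matrix from the sequential IBP construction."""
    feature_counts = []  # m_k: how many objects so far have feature k
    rows = []
    for i in range(1, num_objects + 1):
        row = []
        for k, m_k in enumerate(feature_counts):
            take = rng.random() < m_k / i
            row.append(1 if take else 0)
            if take:
                feature_counts[k] += 1
        num_new = sample_poisson(alpha / i, rng)
        row.extend([1] * num_new)
        feature_counts.extend([1] * num_new)
        rows.append(row)
    # Pad earlier rows with zeros for features introduced later.
    width = len(feature_counts)
    return [row + [0] * (width - len(row)) for row in rows]

Z = sample_ibp(num_objects=10, alpha=2.0, rng=random.Random(0))
for row in Z:
    print("".join(str(v) for v in row))
```

The number of columns is not fixed in advance: it grows with the number of objects, which is what makes the process usable as a prior over a potentially unbounded set of features.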

Simon Lacoste-Julien, Ferenc Huszár, and Zoubin Ghahramani.
**Approximate
inference for the loss-calibrated Bayesian**.
In Geoff Gordon and David Dunson, editors, *14th International Conference on
Artificial Intelligence and Statistics*, volume 15, pages 416-424,
Fort Lauderdale, FL, USA, April 2011. Journal of Machine Learning Research.

** Abstract:** We consider the problem of approximate inference
in the context of Bayesian decision theory. Traditional approaches focus on
approximating general properties of the posterior, ignoring the decision task
- and associated losses - for which the posterior could be used. We argue
that this can be suboptimal and propose instead to *loss-calibrate* the
approximate inference methods with respect to the decision task at hand. We
present a general framework rooted in Bayesian decision theory to analyze
approximate inference from the perspective of losses, opening up several
research directions. As a first loss-calibrated approximate inference
attempt, we propose an EM-like algorithm on the Bayesian posterior risk and
show how it can improve a standard approach to Gaussian process
classification when losses are asymmetric.

Joshua Abbott, Katherine A. Heller, Zoubin Ghahramani, and Thomas L. Griffiths.
**Testing a
Bayesian measure of representativeness using a large image
database**.
In *Advances in Neural Information Processing Systems 24*, Cambridge,
MA, USA, 2011. The MIT Press.

** Abstract:** How do people
determine which elements of a set are most representative of that set? We
extend an existing Bayesian measure of representativeness, which indicates
the representativeness of a sample from a distribution, to define a measure
of the representativeness of an item to a set. We show that this measure is
formally related to a machine learning method known as Bayesian Sets.
Building on this connection, we derive an analytic expression for the
representativeness of objects described by a sparse vector of binary
features. We then apply this measure to a large database of images, using it
to determine which images are the most representative members of different
sets. Comparing the resulting predictions to human judgments of
representativeness provides a test of this measure with naturalistic stimuli,
and illustrates how databases that are more commonly used in computer vision
and machine learning can be used to evaluate psychological theories.

Finale Doshi-Velez and Zoubin Ghahramani.
**A comparison of
human and agent reinforcement learning in partially observable
domains**.
In *33rd Annual Meeting of the Cognitive Science Society*, Boston, MA,
2011.

** Abstract:** It is commonly stated that reinforcement
learning (RL) algorithms learn slower than humans. In this work, we
investigate this claim using two standard problems from the RL literature. We
compare the performance of human subjects to RL techniques. We find that
context (the meaningfulness of the observations) plays a significant role
in the rate of human RL. Moreover, without contextual information, humans
often fare much worse than classic algorithms. Comparing the detailed
responses of humans and RL algorithms, we also find that humans appear to
employ rather different strategies from standard algorithms, even in cases
where they had indistinguishable performance to them. Our research both sheds
light on human RL and provides insights for improving RL algorithms.

Neil Houlsby, Ferenc Huszar, Zoubin Ghahramani, and Máté Lengyel.
**Bayesian active
learning for classification and preference learning**.
*arXiv*, abs/1112.5745, 2011.

** Abstract:** Information
theoretic active learning has been widely studied for probabilistic models.
For simple regression an optimal myopic policy is easily tractable. However,
for other tasks and with more complex models, such as classification with
nonparametric models, the optimal solution is harder to compute. Current
approaches make approximations to achieve tractability. We propose an
approach that expresses information gain in terms of predictive entropies,
and apply this method to the Gaussian Process Classifier (GPC). Our approach
makes minimal approximations to the full information theoretic objective. Our
experimental performance compares favourably to many popular active learning
algorithms, and has equal or lower computational complexity. We compare well
to decision theoretic approaches also, which are privy to more information
and require much more computational time. Secondly, by developing further a
reformulation of binary preference learning to a classification problem, we
extend our algorithm to Gaussian Process preference learning.
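
Expressing information gain through predictive entropies rests on a standard identity: the mutual information between a label and the model parameters equals the entropy of the averaged prediction minus the average entropy of the individual predictions. The sketch below evaluates that quantity for a binary label from a handful of sampled predictive probabilities. The sampling approach and the example numbers are assumptions for illustration; the paper's treatment of the Gaussian Process Classifier is not reproduced here.

```python
import math

def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def information_gain(sampled_probs):
    """Entropy of the mean prediction minus the mean predictive entropy.

    `sampled_probs` holds P(y = 1 | x, theta_s) for parameter samples theta_s.
    """
    mean_p = sum(sampled_probs) / len(sampled_probs)
    expected_entropy = sum(binary_entropy(p) for p in sampled_probs) / len(sampled_probs)
    return binary_entropy(mean_p) - expected_entropy

# Hypothetical candidate inputs.
# All parameter samples agree the label is a coin flip: nothing to learn.
gain_agree = information_gain([0.5, 0.5, 0.5])
# Samples are individually certain but contradict each other: most informative.
gain_disagree = information_gain([0.0, 1.0])

print(gain_agree, gain_disagree)
```

Both candidates have the same averaged prediction of 0.5, yet only the second is worth querying, which is the distinction that plain predictive uncertainty misses.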

David A. Knowles and Zoubin Ghahramani.
**Nonparametric
Bayesian sparse factor models with application to gene expression
modelling**.
*Annals of Applied Statistics*, 5(2B):1534-1552, 2011.

** Abstract:** A nonparametric Bayesian extension of Factor Analysis (FA) is
proposed where observed data Y is modeled as a linear superposition, G, of a
potentially infinite number of hidden factors, X. The Indian Buffet Process
(IBP) is used as a prior on G to incorporate sparsity and to allow the number
of latent features to be inferred. The model's utility for modeling gene
expression data is investigated using randomly generated data sets based on a
known sparse connectivity matrix for E. coli, and on three biological data
sets of increasing complexity.

David A. Knowles and Zoubin Ghahramani.
**Pitman-Yor
diffusion trees**.
In *27th Conference on Uncertainty in Artificial Intelligence*, 2011.

** Abstract:** We introduce the Pitman Yor Diffusion Tree (PYDT)
for hierarchical clustering, a generalization of the Dirichlet Diffusion Tree
(Neal, 2001) which removes the restriction to binary branching structure. The
generative process is described and shown to result in an exchangeable
distribution over data points. We prove some theoretical properties of the
model and then present two inference methods: a collapsed MCMC sampler which
allows us to model uncertainty over tree structures, and a computationally
efficient greedy Bayesian EM search algorithm. Both algorithms use message
passing on the tree structure. The utility of the model and algorithms is
demonstrated on synthetic and real world data, both continuous and
binary.

** Comment:** web site

David A. Knowles, Jurgen Van Gael, and Zoubin Ghahramani.
**Message passing
algorithms for the Dirichlet diffusion tree**.
In *28th International Conference on Machine Learning*, 2011.

** Abstract:** We demonstrate efficient approximate inference
for the Dirichlet Diffusion Tree (Neal, 2003), a Bayesian nonparametric prior
over tree structures. Although DDTs provide a powerful and elegant approach
for modeling hierarchies they haven't seen much use to date. One problem is
the computational cost of MCMC inference. We provide the first deterministic
approximate inference methods for DDT models and show excellent performance
compared to the MCMC alternative. We present message passing algorithms to
approximate the Bayesian model evidence for a specific tree. This is used to
drive sequential tree building and greedy search to find optimal tree
structures, corresponding to hierarchical clusterings of the data. We
demonstrate appropriate observation models for continuous and binary data.
The empirical performance of our method is very close to the computationally
expensive MCMC alternative on a density estimation problem, and significantly
outperforms kernel density estimators.

** Comment:** web site

Andrew Gordon Wilson and Zoubin Ghahramani.
**Generalised
Wishart processes**.
In *27th Conference on Uncertainty in Artificial Intelligence*, 2011.

** Abstract:** We introduce a new stochastic process called the
generalised Wishart process (GWP). It is a collection of positive
semi-definite random matrices indexed by any arbitrary input variable. We use
this process as a prior over dynamic (e.g. time varying) covariance matrices.
The GWP captures a diverse class of covariance dynamics, naturally handles
missing data, scales nicely with dimension, has easily interpretable
parameters, and can use input variables that include covariates other than
time. We describe how to construct the GWP, introduce general procedures for
inference and prediction, and show that it outperforms its main competitor,
multivariate GARCH, even on financial data that especially suits GARCH.

** Comment:** Supplementary
Material, Best Student Paper Award

Ryan Turner, Steven Bottone, and Zoubin Ghahramani.
**Fast online
anomaly detection using scan statistics**.
In Samuel Kaski, David J. Miller, Erkki Oja, and Antti Honkela, editors,
*Machine Learning for Signal Processing (MLSP 2010)*, pages 385-390,
Kittilä, Finland, August 2010.

** Abstract:** We present
methods to do fast online anomaly detection using scan statistics. Scan
statistics have long been used to detect statistically significant bursts of
events. We extend the scan statistics framework to handle many practical
issues that occur in application: dealing with an unknown background rate of
events, allowing for slow natural changes in background frequency, the
inverse problem of finding an unusual lack of events, and setting the test
parameters to maximize power. We demonstrate its use on real and synthetic
data sets with comparison to other methods.

Y. Guan, J. G. Dy, D. Niu, and Z. Ghahramani.
**Variational
inference for nonparametric multiple clustering**.
In *KDD10 Workshop on Discovering, Summarizing, and Using Multiple
Clusterings*, Washington, DC, USA, July 2010.

**
Abstract:** Most clustering algorithms produce a single clustering
solution. Similarly, feature selection for clustering tries to find one
feature subset where one interesting clustering solution resides. However, a
single data set may be multi-faceted and can be grouped and interpreted in
many different ways, especially for high dimensional data, where feature
selection is typically needed. Moreover, different clustering solutions are
interesting for different purposes. Instead of committing to one clustering
solution, in this paper we introduce a probabilistic nonparametric Bayesian
model that can discover several possible clustering solutions and the feature
subset views that generated each cluster partitioning simultaneously. We
provide a variational inference approach to learn the features and clustering
partitions in each view. Our model allows us not only to learn the multiple
clusterings and views but also allows us to automatically learn the number of
views and the number of clusters in each view.

C. Rotsos, J. Van Gael, A.W. Moore, and Z. Ghahramani.
**Traffic
classification in information poor environments**.
In *1st International Workshop on Traffic Analysis and Classification (IWCMC
'10)*, Caen, France, July 2010.

** Abstract:** Traffic
classification using machine learning continues to be an active research
area. The majority of work in this area uses *off-the-shelf* machine
learning tools and treats them as *black-box* classifiers. This approach
turns all the modelling complexity into a feature selection problem. In this
paper, we build a problem-specific solution to the traffic classification
problem by designing a custom probabilistic graphical model. Graphical models
are a modular framework to design classifiers which incorporate
domain-specific knowledge. More specifically, our solution introduces
semi-supervised learning which means we learn from both labelled and
unlabelled traffic flows. We show that our solution performs competitively
compared to previous approaches while using less data and simpler
features.

R. P. Adams, H. Wallach, and Zoubin Ghahramani.
**Learning the
structure of deep sparse graphical models**.
In Yee Whye Teh and Mike Titterington, editors, *13th International
Conference on Artificial Intelligence and Statistics*, pages 1-8, Chia
Laguna, Sardinia, Italy, May 2010.

** Abstract:** Deep belief
networks are a powerful way to model complex probability distributions.
However, it is difficult to learn the structure of a belief network,
particularly one with hidden units. The Indian buffet process has been used
as a nonparametric Bayesian prior on the structure of a directed belief
network with a single infinitely wide hidden layer. Here, we introduce the
cascading Indian buffet process (CIBP), which provides a prior on the
structure of a layered, directed belief network that is unbounded in both
depth and width, yet allows tractable inference. We use the CIBP prior with
the nonlinear Gaussian belief network framework to allow each unit to vary
its behavior between discrete and continuous representations. We use Markov
chain Monte Carlo for inference in this model and explore the structures
learned on image data.

** Comment:** Winner of the Best Paper Award

Sinead Williamson, Peter Orbanz, and Zoubin Ghahramani.
**Dependent
Indian buffet processes**.
In *13th International Conference on Artificial Intelligence and
Statistics*, volume 9 of *W & CP*, pages 924-931, Chia
Laguna, Sardinia, Italy, May 2010.

** Abstract:** Latent
variable models represent hidden structure in observational data. To account
for the distribution of the observational data changing over time, space or
some other covariate, we need generalizations of latent variable models that
explicitly capture this dependency on the covariate. A variety of such
generalizations has been proposed for latent variable models based on the
Dirichlet process. We address dependency on covariates in binary latent
feature models, by introducing a dependent Indian Buffet Process. The model
generates a binary random matrix with an unbounded number of columns for each
value of the covariate. Evolution of the binary matrices over the covariate
set is controlled by a hierarchical Gaussian process model. The choice of
covariance functions controls the dependence structure and exchangeability
properties of the model. We derive a Markov Chain Monte Carlo sampling
algorithm for Bayesian inference, and provide experiments on both synthetic
and real-world data. The experimental results show that explicit modeling of
dependencies significantly improves accuracy of predictions.

R. P. Adams, Zoubin Ghahramani, and Michael I. Jordan.
**Tree-structured
stick breaking for hierarchical data**.
In *Advances in Neural Information Processing Systems 23*. The MIT
Press, 2010.

** Abstract:** Many data are naturally modeled by
an unobserved hierarchical structure. In this paper we propose a flexible
nonparametric prior over unknown data hierarchies. The approach uses nested
stick-breaking processes to allow for trees of unbounded width and depth,
where data can live at any node and are infinitely exchangeable. One can view
our model as providing infinite mixtures where the components have a
dependency structure corresponding to an evolutionary diffusion down a tree.
By using a stick-breaking approach, we can apply Markov chain Monte Carlo
methods based on slice sampling to perform Bayesian inference and simulate
from the posterior distribution on trees. We apply our method to hierarchical
clustering of images and topic modeling of text data.

Sébastien Bratières, Jurgen Van Gael, Andreas Vlachos, and Zoubin
Ghahramani.
**Scaling the
iHMM: Parallelization versus Hadoop**.
In *Proceedings of the 2010 10th IEEE International Conference on Computer
and Information Technology*, pages 1235-1240, Bradford, UK, 2010. IEEE
Computer Society, doi
10.1109/CIT.2010.223.

** Abstract:** This paper compares
parallel and distributed implementations of an iterative, Gibbs sampling,
machine learning algorithm. Distributed implementations run under Hadoop on
facility computing clouds. The probabilistic model under study is the
infinite HMM (Beal, Ghahramani and Rasmussen,
2002), in which parameters are learnt using an instance blocked Gibbs
sampling, with a step consisting of a dynamic program. We apply this model to
learn part-of-speech tags from newswire text in an unsupervised fashion.
However, our focus here is on runtime performance, as opposed to NLP-relevant
scores, embodied by iteration duration, ease of development, deployment and
debugging.

J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani.
**Kronecker
graphs: An approach to modeling networks**.
*Journal of Machine Learning Research*, 11(Feb):985-1042, 2010.

** Abstract:** How can we generate realistic networks? In
addition, how can we do so with a mathematically tractable model that allows
for rigorous analysis of network properties? Real networks exhibit a long
list of surprising properties: Heavy tails for the in- and out-degree
distribution, heavy tails for the eigenvalues and eigenvectors, small
diameters, and densification and shrinking diameters over time. Current
network models and generators either fail to match several of the above
properties, are complicated to analyze mathematically, or both. Here we
propose a generative model for networks that is both mathematically tractable
and can generate networks that have all the above mentioned structural
properties. Our main idea here is to use a non-standard matrix operation, the
Kronecker product, to generate graphs which we refer to as "Kronecker
graphs".

First, we show that Kronecker graphs naturally obey common
network properties. In fact, we rigorously prove that they do so. We also
provide empirical evidence showing that Kronecker graphs can effectively
model the structure of real networks.

We then present KRONFIT, a fast and
scalable algorithm for fitting the Kronecker graph generation model to large
real networks. A naive approach to fitting would take super-exponential time.
In contrast, KRONFIT takes linear time, by exploiting the structure of
Kronecker matrix multiplication and by using statistical simulation
techniques. Experiments on a wide range of large real and synthetic networks
show that KRONFIT finds accurate parameters that very well mimic the
properties of target networks. In fact, using just four parameters we can
accurately model several aspects of global network structure. Once fitted,
the model parameters can be used to gain insights about the network
structure, and the resulting synthetic graphs can be used for null-models,
anonymization, extrapolations, and graph summarization.
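
As a rough illustration of the idea in this abstract, the sketch below builds a graph from repeated Kronecker products of a small initiator matrix. It is a minimal sketch under stated assumptions, not the authors' implementation and not KRONFIT: the function names, the 2x2 initiator values, and the step of treating each entry of the Kronecker power as an independent edge probability are all assumptions made here for illustration. A 2x2 initiator has four entries, which is one way to read the abstract's "just four parameters".

```python
import numpy as np


def kronecker_power(initiator, k):
    """Return the k-th Kronecker power of a square initiator matrix (k >= 1)."""
    initiator = np.asarray(initiator, dtype=float)
    result = initiator.copy()
    for _ in range(k - 1):
        # Each step multiplies the number of nodes by initiator.shape[0].
        result = np.kron(result, initiator)
    return result


def sample_kronecker_graph(initiator, k, seed=0):
    """Sample a 0/1 adjacency matrix, treating each entry of the Kronecker
    power as an independent edge probability (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    probs = kronecker_power(initiator, k)
    return (rng.random(probs.shape) < probs).astype(int)


# Hypothetical 2x2 initiator: four parameters in total.
initiator = np.array([[0.9, 0.5],
                      [0.5, 0.1]])
adjacency = sample_kronecker_graph(initiator, k=3)  # 2**3 = 8 nodes
```

With an initiator of size n and k Kronecker steps, the resulting graph has n**k nodes, so this dense construction is only practical for small k.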

C. Lippert, Z. Ghahramani, and K. Borgwardt.
**Gene function
prediction from synthetic lethality networks via ranking on demand**.
*Bioinformatics*, 26:912-918, 2010.

** Abstract:**
Motivation: Synthetic lethal interactions represent pairs of genes whose
individual mutations are not lethal, while the double mutation of both genes
does incur lethality. Several studies have shown a correlation between
functional similarity of genes and their distances in networks based on
synthetic lethal interactions. However, there is a lack of algorithms for
predicting gene function from synthetic lethality interaction networks.

Results: In this article, we present a novel technique called kernelROD for
gene function prediction from synthetic lethal interaction networks based on
kernel machines. We apply our novel algorithm to Gene Ontology functional
annotation prediction in yeast. Our experiments show that our method leads to
improved gene function prediction compared with state-of-the-art competitors
and that combining genetic and congruence networks leads to a further
improvement in prediction accuracy.

Charalampos Rotsos, Jurgen Van Gael, Andrew W. Moore, and Zoubin Ghahramani.
**Probabilistic
graphical models for semi-supervised traffic classification**.
In *The 6th International Wireless Communications and Mobile Computing
Conference*, pages 752-757, Caen, France, 2010.

**
Abstract:** Traffic classification using machine learning continues to be
an active research area. The majority of work in this area uses off-the-shelf
machine learning tools and treats them as black-box classifiers. This
approach turns all the modelling complexity into a feature selection problem.
In this paper, we build a problem-specific solution to the traffic
classification problem by designing a custom probabilistic graphical model.
Graphical models are a modular framework to design classifiers which
incorporate domain-specific knowledge. More specifically, our solution
introduces semi-supervised learning which means we learn from both labelled
and unlabelled traffic flows. We show that our solution performs
competitively compared to previous approaches while using less data and
simpler features.

O. Stegle, K. J. Denby, E. J. Cooke, D. L. Wild, Z. Ghahramani, and K. M.
Borgwardt.
**A
robust Bayesian two-sample test for detecting intervals of differential
gene expression in microarray time series**.
*Journal of Computational Biology*, 17(3):1-13, 2010, doi
10.1089/cmb.2009.0175.

** Abstract:** Understanding the
regulatory mechanisms that are responsible for an organism's response to
environmental change is an important issue in molecular biology. A first and
important step towards this goal is to detect genes whose expression levels
are affected by altered external conditions. A range of methods to test for
differential gene expression, both in static as well as in time-course
experiments, have been proposed. While these tests answer the question
*whether* a gene is differentially expressed, they do not explicitly
address the question *when* a gene is differentially expressed, although
this information may provide insights into the course and causal structure of
regulatory programs. In this article, we propose a two-sample test for
identifying intervals of differential gene expression in microarray time
series. Our approach is based on Gaussian process regression, can deal with
arbitrary numbers of replicates, and is robust with respect to outliers. We
apply our algorithm to study the response of *Arabidopsis thaliana*
genes to an infection by a fungal pathogen using a microarray time series
dataset covering 30,336 gene probes at 24 observed time points. In
classification experiments, our test compares favorably with existing methods
and provides additional insights into time-dependent differential
expression.

R. S. Savage, Z. Ghahramani, J. E. Griffin, B. de la Cruz, and D. L. Wild.
**Discovering
transcriptional modules by Bayesian data integration**.
*Bioinformatics*, 26:i158-i167, 2010.

** Abstract:**
Motivation: We present a method for directly inferring transcriptional
modules (TMs) by integrating gene expression and transcription factor binding
(ChIP-chip) data. Our model extends a hierarchical Dirichlet process mixture
model to allow data fusion on a gene-by-gene basis. This encodes the
intuition that co-expression and co-regulation are not necessarily equivalent
and hence we do not expect all genes to group similarly in both datasets. In
particular, it allows us to identify the subset of genes that share the same
structure of transcriptional modules in both datasets.

Results: We find
that by working on a gene-by-gene basis, our model is able to extract
clusters with greater functional coherence than existing methods. By
combining gene expression and transcription factor binding (ChIP-chip) data
in this way, we are better able to determine the groups of genes that are
most likely to represent underlying TMs.

Availability: If interested in
the code for the work presented in this article, please contact the
authors.

R. Silva, K. A. Heller, Z. Ghahramani, and E. M. Airoldi.
**Ranking
relations using analogies in biological and information networks**.
*Annals of Applied Statistics*, 4(2):615-644, 2010.

**
Abstract:** Analogical reasoning depends fundamentally on the ability to
learn and generalize about relations between objects. We develop an approach
to relational learning which, given a set of pairs of objects S = A(1):B(1),
A(2):B(2), ..., A(N):B(N), measures how well other pairs A:B fit in with the
set S. Our work addresses the question: is the relation between objects A and
B analogous to those relations found in S? Such questions are particularly
relevant in information retrieval, where an investigator might want to search
for analogous pairs of objects that match the query set of interest. There
are many ways in which objects can be related, making the task of measuring
analogies very challenging. Our approach combines a similarity measure on
function spaces with Bayesian analysis to produce a ranking. It requires data
containing features of the objects of interest and a link matrix specifying
which relationships exist; no further attributes of such relationships are
necessary. We illustrate the potential of our method on text analysis and
information networks. An application on discovering functional interactions
between pairs of proteins is discussed in detail, where we show that our
approach can work in practice even if a small set of protein pairs is
provided.

Andreas Vlachos, Zoubin Ghahramani, and Ted Briscoe.
**Active learning
for constrained Dirichlet process mixture models**.
In *Proceedings of the 2010 Workshop on Geometrical Models of Natural
Language Semantics*, pages 57-61, Uppsala, Sweden, 2010.

**
Abstract:** Recent work applied Dirichlet Process Mixture Models to the
task of verb clustering, incorporating supervision in the form of must-links
and cannot-links constraints between instances. In this work, we introduce an
active learning approach for constraint selection employing uncertainty-based
sampling. We achieve substantial improvements over random selection on two
datasets.

Andrew Gordon Wilson and Zoubin Ghahramani.
**Copula
processes**.
In *Advances in Neural Information Processing Systems 23*, 2010.
Spotlight.

** Abstract:** We define a copula process which
describes the dependencies between arbitrarily many random variables
independently of their marginal distributions. As an example, we develop a
stochastic volatility model, Gaussian Copula Process Volatility (GCPV), to
predict the latent standard deviations of a sequence of random variables. To
make predictions we use Bayesian inference, with the Laplace approximation,
and with Markov chain Monte Carlo as an alternative. We find our model can
outperform GARCH on simulated and financial data. And unlike GARCH, GCPV can
easily handle missing data, incorporate covariates other than time, and model
a rich class of covariance structures.

** Comment:** Supplementary
Material, slides.

Finale Doshi-Velez, David Knowles, Shakir Mohamed, and Zoubin Ghahramani.
**Large scale
non-parametric inference: Data parallelisation in the Indian buffet
process**.
In *Advances in Neural Information Processing Systems 23*, pages
1294-1302, Cambridge, MA, USA, December 2009. The MIT Press.

** Abstract:** Nonparametric Bayesian models provide a framework
for flexible probabilistic modelling of complex datasets. Unfortunately, the
high-dimensional averages required for Bayesian methods can be slow,
especially with the unbounded representations used by nonparametric models.
We address the challenge of scaling Bayesian inference to the increasingly
large datasets found in real-world applications. We focus on parallelisation
of inference in the Indian Buffet Process (IBP), which allows data points to
have an unbounded number of sparse latent features. Our novel MCMC sampler
divides a large data set between multiple processors and uses message passing
to compute the global likelihoods and posteriors. This algorithm, the first
parallel inference scheme for IBP-based models, scales to datasets orders of
magnitude larger than have previously been possible.

Shakir Mohamed, Katherine A. Heller, and Zoubin Ghahramani.
**Bayesian
exponential family PCA**.
In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, *Advances in
Neural Information Processing Systems 21*, pages 1089-1096, Cambridge,
MA, USA, December 2009. The MIT Press.

** Abstract:**
Principal Components Analysis (PCA) has become established as one of the key
tools for dimensionality reduction when dealing with real valued data.
Approaches such as exponential family PCA and non-negative matrix
factorisation have successfully extended PCA to non-Gaussian data types, but
these techniques fail to take advantage of Bayesian inference and can suffer
from problems of overfitting and poor generalisation. This paper presents a
fully probabilistic approach to PCA, which is generalised to the exponential
family, based on Hybrid Monte Carlo sampling. We describe the model which is
based on a factorisation of the observed data matrix, and show performance of
the model on both synthetic and real data.

** Comment:** spotlight.

O. Stegle, K. Denby, S. McHattie, A. Meade, D. Wild, Z. Ghahramani, and
K. Borgwardt.
**Discovering
temporal patterns of differential gene expression in microarray time
series**.
In *German Conference on Bioinformatics*, pages 133-142, Halle,
Germany, September 2009.

** Abstract:** A wealth of time
series of microarray measurements have become available over recent years.
Several two-sample tests for detecting differential gene expression in these
time series have been defined, but they can only answer the question
*whether* a gene is differentially expressed across the whole time
series, not *in which intervals* it is differentially expressed. In this
article, we propose a Gaussian process based approach for studying these
dynamics of differential gene expression. In experiments on *Arabidopsis
thaliana* gene expression levels, our novel technique helps us to uncover
that the family of WRKY transcription factors appears to be involved in the
early response to infection by a fungal pathogen.

R. Savage, K. A. Heller, Y. Xu, Zoubin Ghahramani, W. Truman, M. Grant,
K. Denby, and D. L. Wild.
**R/BHC: fast
Bayesian hierarchical clustering for microarray data**.
*BMC Bioinformatics*, 10(242):1-9, August 2009, doi
10.1186/1471-2105-10-242.

** Abstract:** Background:
Although the use of clustering methods has rapidly become one of the standard
computational approaches in the literature of microarray gene expression data
analysis, little attention has been paid to uncertainty in the results
obtained.

Results: We present an R/Bioconductor port of a fast novel
algorithm for Bayesian agglomerative hierarchical clustering and demonstrate
its use in clustering gene expression microarray data. The method performs
bottom-up hierarchical clustering, using a Dirichlet Process (infinite
mixture) to model uncertainty in the data and Bayesian model selection to
decide at each step which clusters to merge.

Conclusion: Biologically
plausible results are presented from a well studied data set: expression
profiles of *A. thaliana* subjected to a variety of biotic and abiotic
stresses. Our method avoids several limitations of traditional methods, for
example how many clusters there should be and how to choose a principled
distance metric.

J. Van Gael, A. Vlachos, and Z. Ghahramani.
**The infinite
HMM for unsupervised PoS tagging**.
In *Proceedings of the 2009 Conference on Empirical Methods in Natural
Language Processing (EMNLP)*, pages 678-687, Singapore, August 2009.
Association for Computational Linguistics.

** Abstract:** We
extend previous work on fully unsupervised part-of-speech tagging. Using a
non-parametric version of the HMM, called the infinite HMM (iHMM), we address
the problem of choosing the number of hidden states in unsupervised Markov
models for PoS tagging. We experiment with two non-parametric priors, the
Dirichlet and Pitman-Yor processes, on the Wall Street Journal dataset using
a parallelized implementation of an iHMM inference algorithm. We evaluate the
results with a variety of clustering evaluation metrics and achieve
equivalent or better performances than previously reported. Building on this
promising result we evaluate the output of the unsupervised PoS tagger as a
direct replacement for the output of a fully supervised PoS tagger for the
task of shallow parsing and compare the two evaluations.

R. Adams and Zoubin Ghahramani.
**Archipelago:
nonparametric Bayesian semi-supervised learning**.
In Léon Bottou and Michael Littman, editors, *26th International
Conference on Machine Learning*, pages 1-8, Montréal, QC, Canada,
June 2009. Omnipress.

** Abstract:** Semi-supervised learning
(SSL) is classification where additional unlabeled data can be used to
improve accuracy. Generative approaches are appealing in this situation, as a
model of the data's probability density can assist in identifying clusters.
Nonparametric Bayesian methods, while ideal in theory due to their principled
motivations, have been difficult to apply to SSL in practice. We present a
nonparametric Bayesian method that uses Gaussian processes for the generative
model, avoiding many of the problems associated with Dirichlet process
mixture models. Our model is fully generative and we take advantage of recent
advances in Markov chain Monte Carlo algorithms to provide a practical
inference method. Our method compares favorably to competing approaches on
synthetic and real-world multi-class data.

** Comment:** This paper was awarded Honourable Mention for
Best Paper at ICML 2009.

F. Doshi-Velez and Z. Ghahramani.
**Correlated
non-parametric latent feature models**.
In *Conference on Uncertainty in Artificial Intelligence (UAI 2009)*,
pages 143-150, Montréal, QC, Canada, June 2009. AUAI Press.

** Abstract:** We are often interested in explaining data
through a set of hidden factors or features. To allow for an unknown number
of such hidden features, one can use the IBP: a non-parametric latent feature
model that does not bound the number of active features in a dataset.
However, the IBP assumes that all latent features are uncorrelated, making it
inadequate for many real-world problems. We introduce a framework for
correlated non-parametric feature models, generalising the IBP. We use this
framework to generate several specific models and demonstrate applications on
real-world datasets.

Finale Doshi-Velez and Zoubin Ghahramani.
**Accelerated Gibbs
sampling for the Indian buffet process**.
In Léon Bottou and Michael Littman, editors, *26th International
Conference on Machine Learning*, pages 273-280, Montréal, QC,
Canada, June 2009. Omnipress.

** Abstract:** We often seek to
identify co-occurring hidden features in a set of observations. The Indian
Buffet Process (IBP) provides a non-parametric prior on the features present
in each observation, but current inference techniques for the IBP often scale
poorly. The collapsed Gibbs sampler for the IBP has a running time cubic in
the number of observations, and the uncollapsed Gibbs sampler, while linear,
is often slow to mix. We present a new linear-time collapsed Gibbs sampler
for conjugate likelihood models and demonstrate its efficacy on large
real-world datasets.

R. Silva and Z. Ghahramani.
**The hidden life of
latent variables: Bayesian learning with mixed graph models**.
*Journal of Machine Learning Research*, 10:1187-1238, June 2009.

** Abstract:** Directed acyclic graphs (DAGs) have been widely
used as a representation of conditional independence in machine learning and
statistics. Moreover, hidden or latent variables are often an important
component of graphical models. However, DAG models suffer from an important
limitation: the family of DAGs is not closed under marginalization of hidden
variables. This means that in general we cannot use a DAG to represent the
independencies over a subset of variables in a larger DAG. Directed mixed
graphs (DMGs) are a representation that includes DAGs as a special case, and
overcomes this limitation. This paper introduces algorithms for performing
Bayesian inference in Gaussian and probit DMG models. An important
requirement for inference is the specification of the distribution over
parameters of the models. We introduce a new distribution for covariance
matrices of Gaussian DMGs. We discuss and illustrate how several Bayesian
machine learning tasks can benefit from the principle presented here: the
power to model dependencies that are generated from hidden variables, but
without necessarily modeling such variables explicitly.

W. Chu and Z. Ghahramani.
**Probabilistic models
for incomplete multi-dimensional arrays**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 89-96,
Clearwater Beach, FL, USA, April 2009. Microtome Publishing (paper) Journal
of Machine Learning Research.
ISSN: 1938-7228.

** Abstract:** In multiway data, each sample is
measured by multiple sets of correlated attributes. We develop a
probabilistic framework for modeling structural dependency from partially
observed multi-dimensional array data, known as pTucker. Latent components
associated with individual array dimensions are jointly retrieved while the
core tensor is integrated out. The resulting algorithm is capable of handling
large-scale data sets. We verify the usefulness of this approach by comparing
against classical models on applications to modeling amino acid fluorescence,
collaborative filtering and a number of benchmark multiway array data.

Frederik Eaton and Zoubin Ghahramani.
**Choosing a variable
to clamp: Approximate inference using conditioned belief
propagation**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 145-152,
Clearwater Beach, FL, USA, April 2009. Journal of Machine Learning
Research.

** Abstract:** In this paper we propose an algorithm
for approximate inference on graphical models based on belief propagation
(BP). Our algorithm is an approximate version of Cutset Conditioning, in
which a subset of variables is instantiated to make the rest of the graph
singly connected. We relax the constraint of single-connectedness, and select
variables one at a time for conditioning, running belief propagation after
each selection. We consider the problem of determining the best variable to
clamp at each level of recursion, and propose a fast heuristic which applies
back-propagation to the BP updates. We demonstrate that the heuristic
performs better than selecting variables at random, and give experimental
results which show that it performs competitively with existing approximate
inference algorithms.

** Comment:** Code (in C++
based on libDAI).

C. Lippert, O. Stegle, Z. Ghahramani, and K. Borgwardt.
**A kernel
method for unsupervised structured network inference**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 368-375,
Clearwater Beach, FL, USA, April 2009. Journal of Machine Learning Research.
ISSN: 1938-7228.

** Abstract:** Network inference is the problem
of inferring edges between a set of real-world objects, for instance,
interactions between pairs of proteins in bioinformatics. Current
kernel-based approaches to this problem share a set of common features: (i)
they are supervised and hence require labeled training data; (ii) edges in
the network are treated as mutually independent and hence topological
properties are largely ignored; (iii) they lack a statistical interpretation.
We argue that these common assumptions are often undesirable for network
inference, and propose (i) an unsupervised kernel method (ii) that takes the
global structure of the network into account and (iii) is statistically
motivated. We show that our approach can explain commonly used heuristics in
statistical terms. In experiments on social networks, different variants of
our method demonstrate appealing predictive performance.

R. Silva and Z. Ghahramani.
**Factorial mixture
of Gaussians and the marginal independence model**.
In *12th International Conference on Artificial Intelligence and
Statistics*, volume 5, pages 520-527, Clearwater Beach, FL, USA,
April 2009. Journal of Machine Learning Research.
ISSN: 1938-7228.

** Abstract:** Marginal independence
constraints play an important role in learning with graphical models. One way
of parameterizing a model of marginal independencies is by building a latent
variable model where two independent observed variables have no common latent
source. In sparse domains, however, it might be advantageous to model the
marginal observed distribution directly, without explicitly including latent
variables in the model. There have been recent advances in Gaussian and
binary models of marginal independence, but no models with non-linear
dependencies between continuous variables have been proposed so far. In this
paper, we describe how to generalize the Gaussian model of marginal
independencies based on mixtures, and how to learn parameters. This requires
a non-standard parameterization and raises difficult non-linear optimization
issues.

** Comment:** Code at http://www.homepages.ucl.ac.uk/~ucgtrbd/code/fmog-version0.zip

T. Stepleton, Z. Ghahramani, G. Gordon, and T.-S. Lee.
**The block
diagonal infinite hidden Markov model**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 552-559,
Clearwater Beach, FL, USA, April 2009. Microtome Publishing (paper), Journal
of Machine Learning Research (online).
ISSN 1938-7228.

** Abstract:** The Infinite Hidden Markov Model
(IHMM) extends hidden Markov models to have a countably infinite number of
hidden states (Beal et al., 2002; Teh et al.,
2006). We present a generalization of this framework that introduces nearly
block-diagonal structure in the transitions between the hidden states, where
blocks correspond to "sub-behaviors" exhibited by data sequences. In
identifying such structure, the model classifies, or partitions, sequence
data according to these sub-behaviors in an unsupervised way. We present an
application of this model to artificial data, a video gesture classification
task, and a musical theme labeling task, and show that components of the
model can also be applied to graph segmentation.

Yang Xu, Katherine A. Heller, and Zoubin Ghahramani.
**Tree-based
inference for Dirichlet process mixtures**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 623-630,
Clearwater Beach, FL, USA, April 2009. Microtome Publishing (paper), Journal
of Machine Learning Research (online).
ISSN 1938-7228.

** Abstract:** The Dirichlet process mixture
(DPM) is a widely used model for clustering and for general nonparametric
Bayesian density estimation. Unfortunately, like in many statistical models,
exact inference in a DPM is intractable, and approximate methods are needed
to perform efficient inference. While most attention in the literature has
been placed on Markov chain Monte Carlo (MCMC) [1, 2, 3], variational
Bayesian (VB) [4] and collapsed variational methods [5], [6] recently
introduced a novel class of approximation for DPMs based on Bayesian
hierarchical clustering (BHC). These tree-based combinatorial approximations
efficiently sum over exponentially many ways of partitioning the data and
offer a novel lower bound on the marginal likelihood of the DPM [6]. In this
paper we make the following contributions: (1) We show empirically that the
BHC lower bounds are substantially tighter than the bounds given by VB [4]
and by collapsed variational methods [5] on synthetic and real datasets. (2)
We also show that BHC offers a more accurate predictive performance on these
datasets. (3) We further improve the tree-based lower bounds with an
algorithm that efficiently sums contributions from alternative trees. (4) We
present a fast approximate method for BHC. Our results suggest that our
combinatorial approximate inference methods and lower bounds may be useful
not only in DPMs but in other models as well.

A. Vlachos, A. Korhonen, and Z. Ghahramani.
**Unsupervised and
constrained Dirichlet process mixture models for verb clustering**.
In *4th Workshop on Statistical Machine Translation, EACL '09*, Athens,
Greece, March 2009.

** Abstract:** In this work, we apply
Dirichlet Process Mixture Models (DPMMs) to a learning task in natural
language processing (NLP): lexical-semantic verb clustering. We thoroughly
evaluate a method of guiding DPMMs towards a particular clustering solution
using pairwise constraints. The quantitative and qualitative evaluation
performed highlights the benefits of both standard and constrained DPMMs
compared to previously used approaches. In addition, it sheds light on the
use of evaluation measures and their practical application.

Karsten M. Borgwardt and Zoubin Ghahramani.
**Bayesian two-sample
tests**.
*arXiv*, abs/0906.4032, 2009.

** Abstract:** In this
paper, we present two classes of Bayesian approaches to the two-sample
problem. Our first class of methods extends the Bayesian t-test to include
all parametric models in the exponential family and their conjugate priors.
Our second class of methods uses Dirichlet process mixtures (DPM) of such
conjugate-exponential distributions as flexible nonparametric priors over the
unknown distributions.

Finale Doshi-Velez and Zoubin Ghahramani.
**Accelerated
sampling for the Indian buffet process**.
In Andrea Pohoreckyj Danyluk, Léon Bottou, and Michael L. Littman, editors,
*ICML*, volume 382 of *ACM International Conference Proceeding
Series*, page 35. ACM, 2009.

** Abstract:** We often
seek to identify co-occurring hidden features in a set of observations. The
Indian Buffet Process (IBP) provides a nonparametric prior on the features
present in each observation, but current inference techniques for the IBP
often scale poorly. The collapsed Gibbs sampler for the IBP has a running
time cubic in the number of observations, and the uncollapsed Gibbs sampler,
while linear, is often slow to mix. We present a new linear-time collapsed
Gibbs sampler for conjugate likelihood models and demonstrate its efficacy on
large real-world datasets.

Carl Edward Rasmussen, Bernhard J. de la Cruz, Zoubin Ghahramani, and David L.
Wild.
**Modeling and visualizing
uncertainty in gene expression clusters using Dirichlet process
mixtures**.
*IEEE/ACM Transactions on Computational Biology and Bioinformatics*,
6(4):615-628, 2009, doi
10.1109/TCBB.2007.70269.

** Abstract:** Although the use
of clustering methods has rapidly become one of the standard computational
approaches in the literature of microarray gene expression data, little
attention has been paid to uncertainty in the results obtained. Dirichlet
process mixture (DPM) models provide a nonparametric Bayesian alternative to
the bootstrap approach to modeling uncertainty in gene expression clustering.
Most previously published applications of Bayesian model-based clustering
methods have been to short time series data. In this paper, we present a case
study of the application of nonparametric Bayesian clustering methods to the
clustering of high-dimensional non-time-series gene expression data using full
Gaussian covariances. We use the probability that two genes belong to the
same cluster in a DPM model as a measure of the similarity of these gene
expression profiles. Conversely, this probability can be used to define a
dissimilarity measure, which, for the purposes of visualization, can be input
to one of the standard linkage algorithms used for hierarchical clustering.
Biologically plausible results are obtained from the Rosetta compendium of
expression profiles which extend previously published cluster analyses of
this data.

O. Stegle, K. Denby, David L. Wild, Zoubin Ghahramani, and Karsten Borgwardt.
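**A robust Bayesian two-sample test for detecting intervals of differential
gene expression in microarray time series**.
In *13th Annual International Conference on Research in Computational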
Molecular Biology (RECOMB 2009)*, volume 5541 of *Lecture Notes in
Bioinformatics*, pages 201-216, Tucson, AZ, USA, 2009. Springer-Verlag,
doi
10.1007/978-3-642-02008-7_14.

** Abstract:** Understanding
the regulatory mechanisms that are responsible for an organism's response to
environmental changes is an important question in molecular biology. A first
and important step towards this goal is to detect genes whose expression
levels are affected by altered external conditions. A range of methods to
test for differential gene expression, both in static as well as in
time-course experiments, have been proposed. While these tests answer the
question *whether* a gene is differentially expressed, they do not
explicitly address the question *when* a gene is differentially
expressed, although this information may provide insights into the course and
causal structure of regulatory programs. In this article, we propose a
two-sample test for identifying *intervals* of differential gene
expression in microarray time series. Our approach is based on Gaussian
process regression, can deal with arbitrary numbers of replicates and is
robust with respect to outliers. We apply our algorithm to study the response
of *Arabidopsis thaliana* genes to an infection by a fungal pathogen
using a microarray time series dataset covering 30,336 gene probes at 24 time
points. In classification experiments our test compares favorably with
existing methods and provides additional insights into time-dependent
differential expression.

Andreas Vlachos, Anna Korhonen, and Zoubin Ghahramani.
**Unsupervised and
constrained Dirichlet process mixture models for verb clustering**.
In *Proceedings of the workshop on geometrical models of natural language
semantics*, pages 74-82. Association for Computational Linguistics,
2009.

** Abstract:** In this work, we apply Dirichlet Process
Mixture Models (DPMMs) to a learning task in natural language processing
(NLP): lexical-semantic verb clustering. We thoroughly evaluate a method of
guiding DPMMs towards a particular clustering solution using pairwise
constraints. The quantitative and qualitative evaluation performed
highlights the benefits of both standard and constrained DPMMs compared to
previously used approaches. In addition, it sheds light on the use of
evaluation measures and their practical application.

C. Hübler, K. Borgwardt, H.-P. Kriegel, and Z. Ghahramani.
**Metropolis
algorithms for representative subgraph sampling**.
In *Proceedings of 8th IEEE International Conference on Data Mining (ICDM
2008)*, pages 283-292, Pisa, Italy, December 2008. IEEE.
ISSN: 1550-4786.

** Abstract:** While data mining in
chemoinformatics studied graph data with dozens of nodes, systems biology and
the Internet are now generating graph data with thousands and millions of
nodes. Hence data mining faces the algorithmic challenge of coping with this
significant increase in graph size: Classic algorithms for data analysis are
often too expensive and too slow on large graphs.

While one strategy to
overcome this problem is to design novel efficient algorithms, the other is
to 'reduce' the size of the large graph by sampling. This is the scope of
this paper: We will present novel Metropolis algorithms for sampling a
'representative' small subgraph from the original large graph, with
'representative' describing the requirement that the sample shall preserve
crucial graph properties of the original graph. In our experiments, we
improve over the pioneering work of Leskovec and Faloutsos (KDD 2006), by
producing representative subgraph samples that are both smaller and of higher
quality than those produced by other methods from the literature.

H. Kim and Zoubin Ghahramani.
**Outlier robust
Gaussian process classification**.
In L. Niels da Vitoria, editor, *Structural, Syntactic and Statistical
Pattern Recognition*, volume 5342 of *Lecture Notes in Computer
Science (LNCS)*, pages 896-905, Berlin, Germany, December 2008. Springer
Berlin / Heidelberg.

** Abstract:** Gaussian process
classifiers (GPCs) are a fully statistical model for kernel classification.
We present a form of GPC which is robust to labeling errors in the data set.
This model allows label noise not only near the class boundaries, but also
far from the class boundaries which can result from mistakes in labelling or
gross errors in measuring the input features. We derive an outlier robust
algorithm for training this model which alternates iterations based on the EP
approximation and hyperparameter updates until convergence. We show the
usefulness of the proposed algorithm with model selection method through
simulation results.

R. Silva, W. Chu, and Zoubin Ghahramani.
**Hidden common
cause relations in relational learning**.
In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, *Advances in
Neural Information Processing Systems 20*, pages 1345-1352, Cambridge,
MA, USA, December 2008. The MIT Press.

** Abstract:** When
predicting class labels for objects within a relational database, it is often
helpful to consider a model for relationships: this allows for information
between class labels to be shared and to improve prediction performance.
However, there are different ways by which objects can be related within a
relational database. One traditional way corresponds to a Markov network
structure: each existing relation is represented by an undirected edge. This
encodes that, conditioned on input features, each object label is independent
of other object labels given its neighbors in the graph. However, there is no
reason why Markov networks should be the only representation of choice for
symmetric dependence structures. Here we discuss the case when relationships
are postulated to exist due to *hidden common causes*. We discuss how
the resulting graphical model differs from Markov networks, and how it
describes different types of real-world relational processes. A Bayesian
nonparametric classification model is built upon this graphical
representation and evaluated with several empirical studies.

** Comment:** Code at http://www.homepages.ucl.ac.uk/~ucgtrbd/code/xgp

J.M. Sung, Z. Ghahramani, and S.Y. Bang.
**Second-order
latent space variational Bayes for approximate Bayesian
inference**.
*IEEE Signal Processing Letters*, 15:918-921, December 2008.

** Abstract:** In this letter, we consider a variational
approximate Bayesian inference framework, latent-space variational Bayes
(LSVB), in the general context of conjugate-exponential family models with
latent variables. In the LSVB approach, we integrate out model parameters in
an exact way and then perform the variational inference over only the latent
variables. It can be shown that LSVB can achieve better estimates of the
model evidence as well as the distribution over the latent variables than the
popular variational Bayesian expectation-maximization (VBEM). However, the
distribution over the latent variables in LSVB has to be approximated in
practice. As an approximate implementation of LSVB, we propose a second-order
LSVB (SoLSVB) method. In particular, VBEM can be derived as a special case of
a first-order approximation in LSVB. SoLSVB can capture higher order
statistics neglected in VBEM and can therefore achieve a better
approximation. Examples of Gaussian mixture models are used to illustrate the
comparison between our method and VBEM, demonstrating the improvement.

J. Van Gael, Y.W. Teh, and Z. Ghahramani.
**The infinite
factorial hidden Markov model**.
In D. Koller, D. Schuurmans, L. Bottou, and Y. Bengio, editors, *Advances in
Neural Information Processing Systems 21*, volume 21, pages
1697-1704, Cambridge, MA, USA, December 2008. The MIT Press.

** Abstract:** The infinite factorial hidden Markov model is a
non-parametric extension of the factorial hidden Markov model. Our model
defines a probability distribution over an infinite number of independent
binary hidden Markov chains which together produce an observable sequence of
random variables. Central to our model is a new type of non-parametric prior
distribution inspired by the Indian Buffet Process which we call the
*Indian Buffet Markov Process*.

J. Zhang, Z. Ghahramani, and Y. Yang.
**Flexible latent
variable models for multi-task learning**.
*Machine Learning*, 73(3):221-242, December 2008.

**
Abstract:** Given multiple prediction problems such as regression and
classification, we are interested in a joint inference framework which can
effectively borrow information among tasks to improve the prediction
accuracy, especially when the number of training examples per problem is
small. In this paper we propose a probabilistic framework which can support a
set of latent variable models for different multi-task learning scenarios. We
show that the framework is a generalization of standard learning methods for
single prediction problems and it can effectively model the shared structure
among different prediction tasks. Furthermore, we present efficient
algorithms for the empirical Bayes method as well as point estimation. Our
experiments on both simulated datasets and real world classification datasets
show the effectiveness of the proposed models in two evaluation settings:
standard multi-task learning setting and transfer learning setting.

J.M. Sung, Z. Ghahramani, and S.Y. Bang.
**Latent space
variational Bayes**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*,
30(12):2236-2242, November 2008.

** Abstract:** Variational
Bayesian Expectation-Maximization (VBEM), an approximate inference method for
probabilistic models based on factorizing over latent variables and model
parameters, has been a standard technique for practical Bayesian inference.
In this paper, we introduce a more general approximate inference framework
for conjugate-exponential family models, which we call Latent-Space
Variational Bayes (LSVB). In this approach, we integrate out the model
parameters in an exact way, leaving only the latent variables. It can be
shown that the LSVB approach gives better estimates of the model evidence as
well as the distribution over the latent variables than the VBEM approach,
but, in practice, the distribution over the latent variables has to be
approximated. As a practical implementation, we present a First-order LSVB
(FoLSVB) algorithm to approximate the distribution over the latent variables.
From this approximate distribution, one can also estimate the model evidence
and the posterior over the model parameters. The FoLSVB algorithm is directly
comparable to the VBEM algorithm and has the same computational complexity.
We discuss how LSVB generalizes the recently proposed collapsed variational
methods to general conjugate-exponential families. Examples based on mixtures
of Gaussians and mixtures of Bernoullis with synthetic and real-world data
sets are used to illustrate some advantages of our method over VBEM.

Katherine A. Heller, Sinead Williamson, and Zoubin Ghahramani.
**Statistical
models for partial membership**.
In Andrew McCallum and Sam Roweis, editors, *25th International Conference
on Machine Learning*, pages 392-399, Helsinki, Finland, July 2008.
Omnipress.

** Abstract:** We present a principled Bayesian
framework for modeling partial memberships of data points to clusters. Unlike
a standard mixture model which assumes that each data point belongs to one
and only one mixture component, or cluster, a partial membership model allows
data points to have fractional membership in multiple clusters. Algorithms
which assign data points partial memberships to clusters can be useful for
tasks such as clustering genes based on microarray data (Gasch & Eisen,
2002). Our Bayesian Partial Membership Model (BPM) uses exponential family
distributions to model each cluster, and a product of these distributions,
with weighted parameters, to model each datapoint. Here the weights
correspond to the degree to which the datapoint belongs to each cluster. All
parameters in the BPM are continuous, so we can use Hybrid Monte Carlo to
perform inference and learning. We discuss relationships between the BPM and
Latent Dirichlet Allocation, Mixed Membership models, Exponential Family PCA,
and fuzzy clustering. Lastly, we show some experimental results and discuss
nonparametric extensions to our model.

A. Vlachos, Z. Ghahramani, and A. Korhonen.
**Dirichlet
process mixture models for verb clustering**.
In Guillaume Bouchard, Hal Daumé III, Marc Dymetman, and Yee Whye Teh,
editors, *ICML Workshop on Prior Knowledge for Text and Language
Processing*, pages 43-48, Helsinki, Finland, July 2008.

**
Abstract:** In this work we apply Dirichlet Process Mixture Models to a
learning task in natural language processing (NLP): lexical-semantic verb
clustering. We assess the performance on a dataset based on Levin's (1993)
verb classes using the recently introduced V-measure metric. In addition, we
present a method to add human supervision to the model in order to influence the
solution with respect to some prior knowledge. The quantitative evaluation
performed highlights the benefits of the chosen method compared to previously
used clustering approaches.

Hyun-Chul Kim and Zoubin Ghahramani.
**Outlier robust
Gaussian process classification**.
In Niels da Vitoria Lobo, Takis Kasparis, Fabio Roli, James Tin-Yau Kwok,
Michael Georgiopoulos, Georgios C. Anagnostopoulos, and Marco Loog, editors,
*SSPR/SPR*, volume 5342 of *Lecture Notes in Computer Science*,
pages 896-905. Springer, 2008.

** Abstract:** Gaussian
process classifiers (GPCs) are a fully statistical model for kernel
classification. We present a form of GPC which is robust to labeling errors
in the data set. This model allows label noise not only near the class
boundaries, but also far from the class boundaries which can result from
mistakes in labelling or gross errors in measuring the input features. We
derive an outlier robust algorithm for training this model which alternates
iterations based on the EP approximation and hyperparameter updates until
convergence. We show the usefulness of the proposed algorithm with model
selection method through simulation results.

Andreas Vlachos, Zoubin Ghahramani, and Anna Korhonen.
**Dirichlet
process mixture models for verb clustering**.
In *Proceedings of the ICML workshop on Prior Knowledge for Text and
Language*, 2008.

** Abstract:** In this work we apply
Dirichlet Process Mixture Models to a learning task in natural language
processing (NLP): lexical-semantic verb clustering. We assess the performance
on a dataset based on Levin’s (1993) verb classes using the recently
introduced V-measure metric. In addition, we present a method to add human
supervision to the model in order to influence the solution with respect
to some prior knowledge. The quantitative evaluation performed highlights the
benefits of the chosen method compared to previously used clustering
approaches.

Jurgen Van Gael, Yunus Saatçi, Yee-Whye Teh, and Zoubin Ghahramani.
**Beam sampling
for the infinite hidden Markov model**.
In *25th International Conference on Machine Learning*, volume 25,
pages 1088-1095, Helsinki, Finland, 2008. Association for Computing
Machinery.

** Abstract:** The infinite hidden Markov model is
a non-parametric extension of the widely used hidden Markov model. Our paper
introduces a new inference algorithm for the infinite Hidden Markov model
called beam sampling. Beam sampling combines slice sampling, which limits the
number of states considered at each time step to a finite number, with
dynamic programming, which samples whole state trajectories efficiently. Our
algorithm typically outperforms the Gibbs sampler and is more robust. We
present applications of iHMM inference using the beam sampler on changepoint
detection and text prediction problems.

Sinead Williamson and Zoubin Ghahramani.
**Probabilistic models
for data combination in recommender systems**.
In *Learning from Multiple Sources Workshop, NIPS Conference*, Whistler,
Canada, 2008.

W. Chu, V. Sindhwani, Z. Ghahramani, and S. Keerthi.
**Relational
learning with Gaussian processes**.
In B. Schölkopf, J. Platt, and T. Hofmann, editors, *Advances in Neural
Information Processing Systems 19*, volume 19 of *Bradford
Books*, pages 289-296, Cambridge, MA, USA, September 2007. The MIT
Press.
Online contents gives pages 314-321, and 289-296 on pdf of contents.

** Abstract:** Correlation between instances is often modelled
via a kernel function using input attributes of the instances. Relational
knowledge can further reveal additional pairwise correlations between
variables of interest. In this paper, we develop a class of models which
incorporates both reciprocal relational information and input attributes
using Gaussian process techniques. This approach provides a novel
non-parametric Bayesian framework with a data-dependent prior for supervised
learning tasks. We also apply this framework to semi-supervised learning.
Experimental results on several real world data sets verify the usefulness of
this algorithm.

David Knowles and Zoubin Ghahramani.
**Infinite sparse
factor analysis and infinite independent components analysis**.
In *7th International Conference on Independent Component Analysis and
Signal Separation*, pages 381-388, London, UK, September 2007. Springer,
doi
10.1007/978-3-540-74494-8_48.

** Abstract:** A
nonparametric Bayesian extension of Independent Components Analysis (ICA) is
proposed where observed data Y is modelled as a linear superposition, G, of a
potentially infinite number of hidden sources, X. Whether a given source is
active for a specific data point is specified by an infinite binary matrix,
Z. The resulting sparse representation allows increased data reduction
compared to standard ICA. We define a prior on Z using the Indian Buffet
Process (IBP). We describe four variants of the model, with Gaussian or
Laplacian priors on X and the one or two-parameter IBPs. We demonstrate
Bayesian inference under these models using a Markov chain Monte Carlo (MCMC)
algorithm on synthetic and gene expression data and compare to standard ICA
algorithms.

E. Meeds, Z. Ghahramani, R. Neal, and S.T. Roweis.
**Modelling
dyadic data with binary latent factors**.
In B. Schölkopf, J. Platt, and T. Hofmann, editors, *Advances in Neural
Information Processing Systems 19*, Bradford Books, pages 977-984,
Cambridge, MA, USA, September 2007. The MIT Press.
Online contents gives pages 1002-1009, and 977-984 on pdf contents.

** Abstract:** We introduce binary matrix factorization, a novel
model for unsupervised matrix decomposition. The decomposition is learned by
fitting a non-parametric Bayesian probabilistic model with binary latent
variables to a matrix of dyadic data. Unlike bi-clustering models, which
assign each row or column to a single cluster based on a categorical hidden
feature, our binary feature model reflects the prior belief that items and
attributes can be associated with more than one latent cluster at a time. We
provide simple learning and inference rules for this new model and show how
to extend it to an infinite model in which the number of features is not a
priori fixed but is allowed to grow with the size of the data.

F. Pérez-Cruz, Zoubin Ghahramani, and M. Pontil.
**Conditional
graphical models**.
In G. H. Bakir, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, and
S. V. N. Vishwanathan, editors, *Predicting Structured Data*, pages
265-282. The MIT Press, Cambridge, MA, USA, September 2007.
Chapter 12.

** Abstract:** In this chapter we propose a
modification of CRF-like algorithms that allows for solving large-scale
structured classification problems. Our approach consists in upper bounding
the CRF functional in order to decompose its training into independent
optimisation problems per clique. Furthermore we show that each sub-problem
corresponds to solving a multiclass learning task in each clique, which
enlarges the applicability of these tools for large-scale structural learning
problems. Before presenting the Conditional Graphical Model (CGM), as we
refer to this procedure, we review the family of CRF algorithms. We
concentrate on the best known procedures and standard generalisations of
CRFs. The objective of this introduction is analysing from the same
viewpoint the proposed solutions in the literature to tackle this problem,
which allows comparing their different features. We complete the chapter with
a case study, in which we show the possibility to work with large-scale
problems using CGM and that the obtained performance is comparable to the
result with CRF-like algorithms.

Z. Ghahramani, T.L. Griffiths, and P. Sollich.
**Bayesian
nonparametric latent feature models (with discussion)**.
In J.M. Bernardo, M.J. Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A.F.M.
Smith, and M. West, editors, *Bayesian Statistics 8*, pages 201-226,
Oxford, UK, July 2007. Oxford University Press.

** Abstract:**
We describe a flexible nonparametric approach to latent variable modelling in
which the number of latent variables is unbounded. This approach is based on
a probability distribution over equivalence classes of binary matrices with a
finite number of rows, corresponding to the data points, and an unbounded
number of columns, corresponding to the latent variables. Each data point can
be associated with a subset of the possible latent variables, which we refer
to as the latent features of that data point. The binary variables in the
matrix indicate which latent feature is possessed by which data point, and
there is a potentially infinite array of features. We derive the distribution
over unbounded binary matrices by taking the limit of a distribution over
N×K binary matrices as K→∞. We define a simple generative
process for this distribution which we call the Indian buffet process (IBP;
Griffiths and Ghahramani, 2005, 2006) by analogy
to the Chinese restaurant process (Aldous, 1985; Pitman, 2002). The IBP has a
single hyperparameter which controls both the number of features per object
and the total number of features. We describe a two-parameter generalization
of the IBP which has additional flexibility, independently controlling the
number of features per object and the total number of features in the matrix.
The use of this distribution as a prior in an infinite latent feature model
is illustrated, and Markov chain Monte Carlo algorithms for inference are
described.

** Comment:** Includes discussion by David Dunson, and
rejoinder.

Katherine A. Heller and Zoubin Ghahramani.
**A nonparametric
Bayesian approach to modeling overlapping clusters**.
In Marina Meila and Xiaotong Shen, editors, *AISTATS*, volume 2 of
*JMLR Proceedings*, pages 187-194. JMLR.org, 2007.

**
Abstract:** Although clustering data into mutually exclusive partitions has
been an extremely successful approach to unsupervised learning, there are
many situations in which a richer model is needed to fully represent the
data. This is the case in problems where data points actually simultaneously
belong to multiple, overlapping clusters. For example a particular gene may
have several functions, therefore belonging to several distinct clusters of
genes, and a biologist may want to discover these through unsupervised
modeling of gene expression data. We present a new nonparametric Bayesian
method, the Infinite Overlapping Mixture Model (IOMM), for modeling
overlapping clusters. The IOMM uses exponential family distributions to model
each cluster and forms an overlapping mixture by taking products of such
distributions, much like products of experts (Hinton, 2002). The IOMM allows
an unbounded number of clusters, and assignments of points to (multiple)
clusters are modeled using an Indian Buffet Process (IBP; Griffiths and
Ghahramani, 2006). The IOMM has the desirable properties of being able to
focus in on overlapping regions while maintaining the ability to model a
potentially infinite number of clusters which may overlap. We derive MCMC
inference algorithms for the IOMM and show that these can be used to cluster
movies into multiple genres.

Edward Snelson and Zoubin Ghahramani.
**Local and global
sparse Gaussian process approximations**.
In M. Meila and X. Shen, editors, *11th International Conference on
Artificial Intelligence and Statistics*. Omnipress, 2007.

**
Abstract:** Gaussian process (GP) models are flexible probabilistic
nonparametric models for regression, classification and other tasks.
Unfortunately they suffer from computational intractability for large data
sets. Over the past decade there have been many different approximations
developed to reduce this cost. Most of these can be termed global
approximations, in that they try to summarize all the training data via a
small set of support points. A different approach is that of local
regression, where many local experts account for their own part of space. In
this paper we start by investigating the regimes in which these different
approaches work well or fail. We then proceed to develop a new sparse GP
approximation which is a combination of both the global and local approaches.
Theoretically we show that it is derived as a natural extension of the
framework developed by Quiñonero-Candela and
Rasmussen for sparse GP approximations. We demonstrate the benefits of
the combined approximation on some 1D examples for illustration, and on some
large real-world data sets.

Ricardo Silva, Katherine A. Heller, and Zoubin Ghahramani.
**Analogical
reasoning with relational Bayesian sets**.
In Marina Meila and Xiaotong Shen, editors, *AISTATS*, volume 2 of
*JMLR Proceedings*, pages 500-507. JMLR.org, 2007.

**
Abstract:** Analogical reasoning depends fundamentally on the ability to
learn and generalize about relations between objects. There are many ways in
which objects can be related, making automated analogical reasoning very
challenging. Here we develop an approach which, given a set of pairs of
related objects S = {A1:B1, A2:B2, ..., AN:BN}, measures how well other pairs
A:B fit in with the set S. This addresses the question: is the relation
between objects A and B analogous to those relations found in S? We recast
this classical problem as a problem of Bayesian analysis of relational
data. This problem is non-trivial because direct similarity between
objects is not a good way of measuring analogies. For instance, the analogy
between an electron around the nucleus of an atom and a planet around the Sun
is hardly justified by isolated, non-relational, comparisons of an electron
to a planet, and a nucleus to the Sun. We develop a generative model for
predicting the existence of relationships and extend the framework of
Ghahramani and Heller (2005) to provide a Bayesian measure for how analogous
a relation is to other relations. This sheds new light on an old problem,
which we motivate and illustrate through practical applications in
exploratory data analysis.

Yee Whye Teh, Dilan Görür, and Zoubin Ghahramani.
**Stick-breaking
construction for the Indian buffet process**.
In Marina Meila and Xiaotong Shen, editors, *AISTATS*, volume 2 of
*JMLR Proceedings*, pages 556-563. JMLR.org, 2007.

**
Abstract:** The Indian buffet process (IBP) is a Bayesian nonparametric
distribution whereby objects are modelled using an unbounded number of latent
features. In this paper we derive a stick-breaking representation for the
IBP. Based on this new representation, we develop slice samplers for the IBP
that are efficient, easy to implement and are more generally applicable than
the currently available Gibbs sampler. This representation, along with the
work of Thibaux and Jordan, also illuminates interesting theoretical
connections between the IBP, Chinese restaurant processes, Beta processes and
Dirichlet processes.

T. L. Griffiths and Z. Ghahramani.
**Infinite latent
feature models and the Indian Buffet Process**.
In Y. Weiss, B. Schölkopf, and J. Platt, editors, *Advances in Neural
Information Processing Systems 18*, pages 475-482, Cambridge, MA, USA,
December 2006. The MIT Press.

** Abstract:** We define a
probability distribution over equivalence classes of binary matrices with a
finite number of rows and an unbounded number of columns. This distribution
is suitable for use as a prior in probabilistic models that represent objects
using a potentially infinite array of features. We identify a simple
generative process that results in the same distribution over equivalence
classes, which we call the Indian buffet process. We illustrate the use of
this distribution as a prior in an infinite latent feature model, deriving a
Markov chain Monte Carlo algorithm for inference in this model and applying
the algorithm to an image dataset.
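
The generative process mentioned in this abstract can be sketched in a few lines. The code below is a minimal illustration of the standard one-parameter "restaurant" description of the IBP, not the authors' implementation: customer i takes each previously sampled dish k with probability m_k / i, where m_k counts earlier customers who took it, and then tries Poisson(alpha / i) new dishes. The names `sample_ibp` and `sample_poisson` are hypothetical helpers introduced here.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from Poisson(rate) with Knuth's method (adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_ibp(num_objects, alpha, seed=0):
    """Return a binary feature matrix (list of rows) drawn from the IBP.

    Illustrative sketch: columns appear in the order their dish was first
    taken, so no left-ordering of equivalence classes is applied.
    """
    rng = random.Random(seed)
    dish_counts = []  # m_k: how many customers so far took dish k
    rows = []
    for i in range(1, num_objects + 1):
        # Existing dish k is taken with probability m_k / i.
        row = [1 if rng.random() < m / i else 0 for m in dish_counts]
        # The customer then tries Poisson(alpha / i) new dishes.
        num_new = sample_poisson(alpha / i, rng)
        row.extend([1] * num_new)
        dish_counts = [m + z for m, z in zip(dish_counts, row)] + [1] * num_new
        rows.append(row)
    # Pad earlier rows with zeros so the matrix is rectangular.
    width = len(dish_counts)
    return [row + [0] * (width - len(row)) for row in rows]


if __name__ == "__main__":
    for row in sample_ibp(num_objects=6, alpha=2.0, seed=1):
        print("".join(str(v) for v in row))
```

Each printed row is one object and each column one latent feature; larger values of `alpha` tend to produce more columns.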

Zoubin Ghahramani and Katherine A. Heller.
**Bayesian
sets**.
In Y. Weiss, B. Schölkopf, and J. Platt, editors, *Advances in Neural
Information Processing Systems 18*, pages 435-442, Cambridge, MA, USA,
December 2006. The MIT Press.

** Abstract:** Inspired by
"Google™ Sets", we consider the problem of retrieving items from a
concept or cluster, given a query consisting of a few items from that
cluster. We formulate this as a Bayesian inference problem and describe a
very simple algorithm for solving it. Our algorithm uses a model-based
concept of a cluster and ranks items using a score which evaluates the
marginal probability that each item belongs to a cluster containing the query
items. For exponential family models with conjugate priors this marginal
probability is a simple function of sufficient statistics. We focus on sparse
binary data and show that our score can be evaluated exactly using a single
sparse matrix multiplication, making it possible to apply our algorithm to
very large datasets. We evaluate our algorithm on three datasets: retrieving
movies from EachMovie, finding completions of author sets from the NIPS
dataset, and finding completions of sets of words appearing in the Grolier
encyclopedia. We compare to Google™ Sets and show that Bayesian Sets
gives very reasonable set completions.
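
As a rough illustration of why the score is cheap for binary data, the sketch below assumes independent Bernoulli features with Beta(alpha_j, beta_j) priors. Under that assumption the log score log p(x | query) - log p(x) is linear in x, so scoring every item amounts to one matrix-vector product plus a constant. The function name `bayesian_sets_scores` and the hyperparameter values in the usage example are hypothetical, and dense Python lists stand in for a sparse matrix.

```python
import math


def bayesian_sets_scores(query, items, alpha, beta):
    """Log scores log p(x | query) - log p(x) for binary item vectors.

    query, items: lists of 0/1 feature vectors; alpha, beta: per-feature
    Beta prior hyperparameters (assumed values, chosen by the caller).
    """
    n = len(query)
    constant = 0.0
    weights = []
    for j in range(len(alpha)):
        ones = sum(x[j] for x in query)
        alpha_post = alpha[j] + ones      # prior count plus observed ones
        beta_post = beta[j] + n - ones    # prior count plus observed zeros
        constant += (math.log(alpha[j] + beta[j])
                     - math.log(alpha[j] + beta[j] + n)
                     + math.log(beta_post) - math.log(beta[j]))
        weights.append(math.log(alpha_post) - math.log(alpha[j])
                       - math.log(beta_post) + math.log(beta[j]))
    # Linear in x: constant + weights . x (a sparse product in practice).
    return [constant + sum(w for w, xj in zip(weights, x) if xj)
            for x in items]


if __name__ == "__main__":
    query = [[1, 1, 0], [1, 0, 0]]
    items = [[1, 1, 0], [0, 0, 1]]
    scores = bayesian_sets_scores(query, items, [1.0] * 3, [1.0] * 3)
    print(scores)  # the first item shares features with the query
```

Items are then ranked by decreasing score; an item that shares features with the query receives a higher score than one that does not.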

Arik Azran and Zoubin Ghahramani.
**A new approach to
data driven clustering**.
In William Cohen and Andrew Moore, editors, *23rd International Conference
on Machine Learning*, pages 57-64, Pittsburgh, PA, USA, June 2006.
Omnipress.

** Abstract:** We consider the problem of
clustering in its most basic form where only a local metric on the data space
is given. No parametric statistical model is assumed, and the number of
clusters is learned from the data. We introduce, analyze and demonstrate a
novel approach to clustering where data points are viewed as nodes of a
graph, and pairwise similarities are used to derive a transition probability
matrix P for a Markov random walk between them. The algorithm automatically
reveals structure at increasing scales by varying the number of steps taken
by this random walk. Points are represented as rows of P^{t}, which are the
t-step distributions of the walk starting at that point; these distributions
are then clustered using a KL-minimizing iterative algorithm. Both the number
of clusters, and the number of steps that best reveal it, are found by
optimizing spectral properties of P.
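
The representation described above can be illustrated with a small sketch. It only covers the first stage, turning pairwise similarities into a row-stochastic matrix P and reading off the rows of P^t as t-step distributions; the KL-based clustering step and the spectral selection of t and the number of clusters are not shown. The function names and the toy similarity matrix are assumptions made for illustration.

```python
def transition_matrix(similarity):
    """Row-normalize a non-negative similarity matrix into P."""
    return [[w / sum(row) for w in row] for row in similarity]


def mat_mul(a, b):
    """Plain dense matrix product for small lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def t_step_distributions(p, t):
    """Rows of P^t: where the walk is after t steps from each start point."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, p)
    return result


if __name__ == "__main__":
    # Toy data: points 0,1 are similar to each other, as are points 2,3.
    similarity = [[1.0, 0.9, 0.01, 0.01],
                  [0.9, 1.0, 0.01, 0.01],
                  [0.01, 0.01, 1.0, 0.9],
                  [0.01, 0.01, 0.9, 1.0]]
    for row in t_step_distributions(transition_matrix(similarity), 3):
        print([round(v, 3) for v in row])
```

For small t, points in the same group have nearly identical rows, while for very large t all rows approach the stationary distribution, which is why the number of steps acts as a scale parameter.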

Arik Azran and Zoubin Ghahramani.
**Spectral methods
for automatic multiscale data clustering**.
In *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*,
pages 190-197, New York, NY, USA, June 2006. IEEE Computer Society, doi
10.1109/CVPR.2006.289.

** Abstract:** Spectral clustering
is a simple yet powerful method for finding structure in data using spectral
properties of an associated pairwise similarity matrix. This paper provides
new insights into how the method works and uses these to derive new
algorithms which, given the data alone, automatically learn different plausible
data partitionings. The main theoretical contribution is a generalization of
a key result in the field, the multicut lemma (Meila 2001). We use this
generalization to derive two algorithms. The first uses the eigenvalues of a
given affinity matrix to infer the number of clusters in data, and the second
combines learning the affinity matrix with inferring the number of clusters.
A hierarchical implementation of the algorithms is also derived. The
algorithms are theoretically motivated and demonstrated on nontrivial data
sets.

Wei Chu, Zoubin Ghahramani, Roland Krause, and David L. Wild.
**Identifying
protein complexes in high-throughput protein interaction screens using an
infinite latent feature model**.
In Russ B. Altman, Tiffany Murray, Teri E. Klein, A. Keith Dunker, and Lawrence
Hunter, editors, *Pacific Symposium on Biocomputing*, pages 231-242.
World Scientific, 2006.

** Abstract:** We propose a Bayesian
approach to identify protein complexes and their constituents from
high-throughput protein-protein interaction screens. An infinite latent
feature model that allows for multi-complex membership by individual proteins
is coupled with a graph diffusion kernel that evaluates the likelihood of two
proteins belonging to the same complex. Gibbs sampling is then used to infer
a catalog of protein complexes from the interaction screen data. An advantage
of this model is that it places no prior constraints on the number of
complexes and automatically infers the number of significant complexes from
the data. Validation results using affinity purification/mass spectrometry
experimental data from yeast RNA-processing complexes indicate that our
method is capable of partitioning the data in a biologically meaningful way.
A supplementary web site containing larger versions of the figures is
available at http://public.kgi.edu/wild/PSBO6/index.html.

Wei Chu, Zoubin Ghahramani, Alexei A. Podtelezhnikov, and David L. Wild.
**Bayesian
segmental models with multiple sequence alignment profiles for protein
secondary structure and contact map prediction**.
*IEEE/ACM Trans. Comput. Biology Bioinform.*, 3(2):98-113, 2006.

** Abstract:** In this paper, we develop a segmental semi-Markov
model (SSMM) for protein secondary structure prediction which incorporates
multiple sequence alignment profiles with the purpose of improving the
predictive performance. The segmental model is a generalization of the hidden
Markov model where a hidden state generates segments of various length and
secondary structure type. A novel parameterized model is proposed for the
likelihood function that explicitly represents multiple sequence alignment
profiles to capture the segmental conformation. Numerical results on
benchmark data sets show that incorporating the profiles results in
substantial improvements and the generalization performance is promising. By
incorporating the information from long range interactions in beta-sheets,
this model is also capable of carrying out inference on contact maps. This is
an important advantage of probabilistic generative models over the
traditional discriminative approach to protein secondary structure
prediction. The Web server of our algorithm and supplementary materials are
available at http://public.kgi.edu/-wild/bsm.html.

Katherine A. Heller and Zoubin Ghahramani.
**A simple Bayesian
framework for content-based image retrieval**.
In *CVPR*, pages 2110-2117. IEEE Computer Society, 2006.

** Abstract:** We present a Bayesian framework for content-based
image retrieval which models the distribution of color and texture features
within sets of related images. Given a user-specified text query (e.g.
"penguins") the system first extracts a set of images, from a labelled
corpus, corresponding to that query. The distribution over features of these
images is used to compute a Bayesian score for each image in a large
unlabelled corpus. Unlabelled images are then ranked using this score and the
top images are returned. Although the Bayesian score is based on computing
marginal likelihoods, which integrate over model parameters, in the case of
sparse binary data the score reduces to a single matrix-vector multiplication
and is therefore extremely efficient to compute. We show that our method
works surprisingly well despite its simplicity and the fact that no relevance
feedback is used. We compare different choices of features, and evaluate our
results using human subjects.

Hyun-Chul Kim and Zoubin Ghahramani.
**Bayesian Gaussian
process classification with the EM-EP algorithm**.
*IEEE Trans. Pattern Anal. Mach. Intell.*, 28(12):1948-1959, 2006.

** Abstract:** Gaussian process classifiers (GPCs) are Bayesian
probabilistic kernel classifiers. In GPCs, the probability of belonging to a
certain class at an input location is monotonically related to the value of
some latent function at that location. Starting from a Gaussian process prior
over this latent function, data are used to infer both the posterior over the
latent function and the values of hyperparameters to determine various
aspects of the function. Recently, the expectation propagation (EP) approach
has been proposed to infer the posterior over the latent function. Based on
this work, we present an approximate EM algorithm, the EM-EP algorithm, to
learn both the latent function and the hyperparameters. This algorithm is
found to converge in practice and provides an efficient Bayesian framework
for learning hyperparameters of the kernel. A multiclass extension of the
EM-EP algorithm for GPCs is also derived. In the experimental results, the
EM-EP algorithms are as good as or better than other methods for GPCs or Support
Vector Machines (SVMs) with cross-validation.

Hyun-Chul Kim, Daijin Kim, Zoubin Ghahramani, and Sung Yang Bang.
**Appearance-based
gender classification with Gaussian processes**.
*Pattern Recognition Letters*, 27(6):618-626, 2006.

**
Abstract:** This paper concerns the gender classification task of
discriminating between images of faces of men and women from face images. In
appearance-based approaches, the initial images are preprocessed (e.g.
normalized) and input into classifiers. Recently, support vector machines
(SVMs) which are popular kernel classifiers have been applied to gender
classification and have shown excellent performance. SVMs have difficulty in
determining the hyperparameters in kernels (using cross-validation). We
propose to use Gaussian process classifiers (GPCs) which are Bayesian kernel
classifiers. The main advantage of GPCs over SVMs is that they determine the
hyperparameters of the kernel based on a Bayesian model selection criterion.
The experimental results show that our methods outperformed SVMs with
cross-validation on most of the data sets. Moreover, the kernel hyperparameters
found by GPCs using Bayesian methods can be used to improve SVM
performance.

Hyun-Chul Kim, Daijin Kim, Zoubin Ghahramani, and Sung Yang Bang.
**Gender
classification with Bayesian kernel methods**.
In William W. Cohen and Andrew Moore, editors, *IJCNN*, volume 148 of
*ACM International Conference Proceeding Series*, pages 3371-3376.
Association for Computing Machinery, 2006.

** Abstract:** We
consider the gender classification task of discriminating between images of
faces of men and women from face images. In appearance-based approaches, the
initial images are preprocessed (e.g. normalized) and input into classifiers.
Recently, SVMs which are popular kernel classifiers have been applied to
gender classification and have shown excellent performance. We propose to use
Gaussian process classifiers (GPCs), a Bayesian kernel method,
for gender classification. The main advantage of Bayesian kernel methods such
as GPCs over SVMs is that they determine the hyperparameters of the kernel
based on a Bayesian model selection criterion. Our results show that GPCs
outperformed SVMs with cross-validation.

Iain Murray, Zoubin Ghahramani, and David J. C. MacKay.
**MCMC for
doubly-intractable distributions**.
In *UAI*. AUAI Press, 2006.

** Abstract:** Markov Chain
Monte Carlo (MCMC) algorithms are routinely used to draw samples from
distributions with intractable normalization constants. However, standard
MCMC algorithms do not apply to doubly-intractable distributions in which
there are additional parameter-dependent normalization terms; for example,
the posterior over parameters of an undirected graphical model. An ingenious
auxiliary-variable scheme (Møller et al., 2004) offers a solution: exact
sampling (Propp and Wilson, 1996) is used to sample from a
Metropolis-Hastings proposal for which the acceptance probability is
tractable. Unfortunately the acceptance probability of these expensive
updates can be low. This paper provides a generalization of Møller et al.
(2004) and a new MCMC algorithm, which obtains better acceptance
probabilities for the same amount of exact sampling, and removes the need to
estimate model parameters before sampling begins.

Ricardo Silva and Zoubin Ghahramani.
**Bayesian inference
for Gaussian mixed graph models**.
In *UAI*. AUAI Press, 2006.

** Abstract:** We introduce
priors and algorithms to perform Bayesian inference in Gaussian models
defined by acyclic directed mixed graphs. Such a class of graphs, composed of
directed and bi-directed edges, is a representation of conditional
independencies that is closed under marginalization and arises naturally from
causal models which allow for unmeasured confounding. Monte Carlo methods and
a variational approximation for such models are presented. Our algorithms for
Bayesian inference allow the evaluation of posterior distributions for
several quantities of interest, including causal effects that are not
identifiable from data alone but could otherwise be inferred where
informative prior knowledge about confounding is available.

Edward Snelson and Zoubin Ghahramani.
**Sparse Gaussian
processes using pseudo-inputs**.
In Y. Weiss, B. Schölkopf, and J. Platt, editors, *Advances in Neural
Information Processing Systems 18*, pages 1257-1264. The MIT Press,
Cambridge, MA, 2006.

** Abstract:** We present a new Gaussian
process (GP) regression model whose covariance is parameterized by the
locations of M pseudo-input points, which we learn by a gradient based
optimization. We take M<<N, where N is the number of real data points,
and hence obtain a sparse regression method which has O(NM^{2})
training cost and O(M^{2}) prediction cost per test case. We also
find hyperparameters of the covariance function in the same joint
optimization. The method can be viewed as a Bayesian regression model with
particular input dependent noise. The method turns out to be closely related
to several other sparse GP approaches, and we discuss the relation in detail.
We finally demonstrate its performance on some large data sets, and make a
direct comparison to other sparse GP methods. We show that our method can
match full GP performance with small M, i.e. very sparse solutions, and it
significantly outperforms other approaches in this regime.

Edward Snelson and Zoubin Ghahramani.
**Variable
noise and dimensionality reduction for sparse Gaussian processes**.
In R. Dechter and T. S. Richardson, editors, *22nd Conference on Uncertainty
in Artificial Intelligence*. AUAI Press, 2006.

**
Abstract:** The sparse pseudo-input Gaussian process (SPGP) is a new
approximation method for speeding up GP regression in the case of a large
number of data points N. The approximation is controlled by the gradient
optimization of a small set of M pseudo-inputs, thereby reducing complexity
from O(N^{3}) to O(NM^{2}). One limitation of the SPGP is
that this optimization space becomes impractically big for high dimensional
data sets. This paper addresses this limitation by performing automatic
dimensionality reduction. A projection of the input space to a low
dimensional space is learned in a supervised manner, alongside the
pseudo-inputs, which now live in this reduced space. The paper also
investigates the suitability of the SPGP for modeling data with
input-dependent noise. A further extension of the model is made to make it
even more powerful in this regard - we learn an uncertainty parameter for
each pseudo-input. The combination of sparsity, reduced dimension, and
input-dependent noise makes it possible to apply GPs to much larger and more
complex data sets than was previously practical. We demonstrate the benefits
of these methods on several synthetic and real world problems.

Frank Wood, Thomas L. Griffiths, and Zoubin Ghahramani.
**A non-parametric
Bayesian method for inferring hidden causes**.
In *UAI*. AUAI Press, 2006.

** Abstract:** We present a
non-parametric Bayesian approach to structure learning with hidden causes.
Previous Bayesian treatments of this problem define a prior over the number
of hidden causes and use algorithms such as reversible jump Markov chain
Monte Carlo to move between solutions. In contrast, we assume that the number
of hidden causes is unbounded, but only a finite number influence observable
variables. This makes it possible to use a Gibbs sampler to approximate the
distribution over causal structures. We evaluate the performance of both
approaches in discovering hidden causes in simulated data, and use our
non-parametric approach to discover hidden causes in a real medical
dataset.

Edward Snelson and Zoubin Ghahramani.
**Compact
approximations to Bayesian predictive distributions**.
In *22nd International Conference on Machine Learning*, Bonn, Germany,
August 2005. Omnipress.

** Abstract:** We provide a general
framework for learning precise, compact, and fast representations of the
Bayesian predictive distribution for a model. This framework is based on
minimizing the KL divergence between the true predictive density and a
suitable compact approximation. We consider various methods for doing this,
both sampling based approximations, and deterministic approximations such as
expectation propagation. These methods are tested on a mixture of Gaussians
model for density estimation and on binary linear classification, with both
synthetic data sets for visualization and several real data sets. Our results
show significant reductions in prediction time and memory footprint.

Matthew J. Beal, Francesco Falciani, Zoubin Ghahramani, Claudia Rangel, and
David L. Wild.
**A Bayesian
approach to reconstructing genetic regulatory networks with hidden
factors**.
*Bioinformatics*, 21(3):349-356, 2005.

** Abstract:**
Motivation: We have used state-space models (SSMs) to reverse engineer
transcriptional networks from highly replicated gene expression profiling
time series data obtained from a well-established model of T cell activation.
SSMs are a class of dynamic Bayesian networks in which the observed
measurements depend on some hidden state variables that evolve according to
Markovian dynamics. These hidden variables can capture effects that cannot be
directly measured in a gene expression profiling experiment, for example:
genes that have not been included in the microarray, levels of regulatory
proteins, the effects of mRNA and protein degradation, etc. Results: We have
approached the problem of inferring the model structure of these state-space
models using both classical and Bayesian methods. In our previous work, a
bootstrap procedure was used to derive classical confidence intervals for
parameters representing `gene-gene' interactions over time. In this article,
variational approximations are used to perform the analogous model selection
task in the Bayesian context. Certain interactions are present in both the
classical and the Bayesian analyses of these regulatory networks. The
resulting models place JunB and JunD at the centre of the mechanisms that
control apoptosis and proliferation. These mechanisms are key for clonal
expansion and for controlling the long term behavior (e.g. programmed cell
death) of these cells.

Wei Chu and Zoubin Ghahramani.
**Gaussian processes
for ordinal regression**.
*Journal of Machine Learning Research*, 6:1019-1041, 2005.

** Abstract:** We present a probabilistic kernel approach to
ordinal regression based on Gaussian processes. A threshold model that
generalizes the probit function is used as the likelihood function for
ordinal variables. Two inference techniques, based on the Laplace
approximation and the expectation propagation algorithm respectively, are
derived for hyperparameter learning and model selection. We compare these two
Gaussian process approaches with a previous ordinal regression method based
on support vector machines on some benchmark and real-world data sets,
including applications of ordinal regression to collaborative filtering and
gene expression analysis. Experimental results on these data sets verify the
usefulness of our approach.
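
To make the threshold likelihood concrete, the sketch below assumes the common form in which category j has probability Phi((b_j - f) / sigma) - Phi((b_{j-1} - f) / sigma), with ordered thresholds b_1 < ... < b_{r-1}, b_0 = -infinity and b_r = +infinity. With a single threshold at zero this reduces to the probit function. The function names and the example thresholds are assumptions for illustration, and inference over the latent function is not shown.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ordinal_likelihood(f, thresholds, noise_std=1.0):
    """Probabilities of each ordinal category given latent value f.

    thresholds: increasing interior thresholds b_1 < ... < b_{r-1};
    the outer bounds are taken to be -inf and +inf.
    """
    bounds = [-math.inf] + list(thresholds) + [math.inf]
    cdf = [normal_cdf((b - f) / noise_std) for b in bounds]
    return [upper - lower for lower, upper in zip(cdf, cdf[1:])]


if __name__ == "__main__":
    for f in (-2.0, 0.0, 2.0):
        probs = ordinal_likelihood(f, thresholds=[-1.0, 1.0])
        print(f, [round(p, 3) for p in probs])
```

As the latent value f increases, probability mass shifts from the lowest category towards the highest, which is what ties the ordering of the labels to the latent function.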

Wei Chu and Zoubin Ghahramani.
**Preference learning
with Gaussian processes**.
In Luc De Raedt and Stefan Wrobel, editors, *ICML*, volume 119 of
*ACM International Conference Proceeding Series*, pages 137-144. ACM,
2005.

** Abstract:** In this paper, we propose a probabilistic
kernel approach to preference learning based on Gaussian processes. A new
likelihood function is proposed to capture the preference relations in the
Bayesian framework. The generalized formulation is also applicable to tackle
many multiclass problems. The overall approach has the advantages of Bayesian
methods for model selection and probabilistic prediction. Experimental
results compared against the constraint classification approach on several
benchmark datasets verify the usefulness of this algorithm.

Wei Chu, Zoubin Ghahramani, Francesco Falciani, and David L. Wild.
**Biomarker
discovery in microarray gene expression data with Gaussian
processes**.
*Bioinformatics*, 21(16):3385-3393, 2005.

** Abstract:**
MOTIVATION: In clinical practice, pathological phenotypes are often labelled
with ordinal scales rather than binary, e.g. the Gleason grading system for
tumour cell differentiation. However, in the literature of microarray
analysis, these ordinal labels have been rarely treated in a principled way.
This paper describes a gene selection algorithm based on Gaussian processes
to discover consistent gene expression patterns associated with ordinal
clinical phenotypes. The technique of automatic relevance determination is
applied to represent the significance level of the genes in a Bayesian
inference framework. RESULTS: The usefulness of the proposed algorithm for
ordinal labels is demonstrated by the gene expression signature associated
with the Gleason score for prostate cancer data. Our results demonstrate how
multi-gene markers that may be initially developed with a diagnostic or
prognostic application in mind are also useful as an investigative tool to
reveal associations between specific molecular and cellular events and
features of tumour physiology. Our algorithm can also be applied to
microarray data with binary labels with results comparable to other methods
in the literature.

Katherine A. Heller and Zoubin Ghahramani.
**Bayesian
hierarchical clustering**.
In Luc De Raedt and Stefan Wrobel, editors, *ICML*, volume 119 of
*ACM International Conference Proceeding Series*, pages 297-304.
Association for Computing Machinery, 2005.

** Abstract:** We
present a novel algorithm for agglomerative hierarchical clustering based on
evaluating marginal likelihoods of a probabilistic model. This algorithm has
several advantages over traditional distance-based agglomerative clustering
algorithms. (1) It defines a probabilistic model of the data which can be
used to compute the predictive distribution of a test point and the
probability of it belonging to any of the existing clusters in the tree. (2)
It uses a model-based criterion to decide on merging clusters rather than an
ad-hoc distance metric. (3) Bayesian hypothesis testing is used to decide
which merges are advantageous and to output the recommended depth of the
tree. (4) The algorithm can be interpreted as a novel fast bottom-up
approximate inference method for a Dirichlet process (i.e. countably
infinite) mixture model (DPM). It provides a new lower bound on the marginal
likelihood of a DPM by summing over exponentially many clusterings of the
data in polynomial time. We describe procedures for learning the model
hyperparameters, computing the predictive distribution, and extensions to
the algorithm. Experimental results on synthetic and real-world data sets
demonstrate useful properties of the algorithm.

Iain Murray, David J. C. MacKay, Zoubin Ghahramani, and John Skilling.
**Nested sampling
for Potts models**.
In *NIPS*, 2005.

** Abstract:** Nested sampling is a new
Monte Carlo method by Skilling intended for general Bayesian computation.
Nested sampling provides a robust alternative to annealing-based methods for
computing normalizing constants. It can also generate estimates of other
quantities such as posterior expectations. The key technical requirement is
an ability to draw samples uniformly from the prior subject to a constraint
on the likelihood. We provide a demonstration with the Potts model, an
undirected graphical model.

JaeMo Sung, Sung Yang Bang, Seungjin Choi, and Zoubin Ghahramani.
**U-likelihood and
u-updating algorithms: Statistical inference in latent variable
models**.
In João Gama, Rui Camacho, Pavel Brazdil, Alípio Jorge, and Luís
Torgo, editors, *ECML*, volume 3720 of *Lecture Notes in Computer
Science*, pages 377-388. Springer, 2005.

** Abstract:**
In this paper we consider latent variable models and introduce a new
U-likelihood concept for estimating the distribution over hidden variables.
One can derive an estimate of parameters from this distribution. Our approach
differs from the Bayesian and Maximum Likelihood (ML) approaches. It gives an
alternative to Bayesian inference when we don't want to define a prior over
parameters and gives an alternative to the ML method when we want a better
estimate of the distribution over hidden variables. As a practical
implementation, we present a U-updating algorithm based on the mean field
theory to approximate the distribution over hidden variables from the
U-likelihood. This algorithm captures some of the correlations among hidden
variables by estimating reaction terms. Those reaction terms are found to
penalize the likelihood. We show that the U-updating algorithm becomes the EM
algorithm as a special case in the large sample limit. The useful behavior of
our method is confirmed for the case of mixture of Gaussians by comparing to
the EM algorithm.

Jian Zhang, Zoubin Ghahramani, and Yiming Yang.
**Learning
multiple related tasks using latent independent component analysis**.
In *NIPS*, 2005.

** Abstract:** We propose a
probabilistic model based on Independent Component Analysis for learning
multiple related tasks. In our model the task parameters are assumed to be
generated from independent sources which account for the relatedness of the
tasks. We use Laplace distributions to model hidden sources which makes it
possible to identify the hidden, independent components instead of just
modeling correlations. Furthermore, our model enjoys a sparsity property
which makes it both parsimonious and robust. We also propose efficient
algorithms for both empirical Bayes method and point estimation. Our
experimental results on two multi-label text classification data sets show
that the proposed approach is promising.

Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani.
**Warped Gaussian
processes**.
In S. Thrun, L. Saul, and B. Schölkopf, editors, *Advances in Neural
Information Processing Systems 16*, pages 337-344, Cambridge, MA, USA,
December 2004. The MIT Press.

** Abstract:** We generalise
the Gaussian process (GP) framework for regression by learning a nonlinear
transformation of the GP outputs. This allows for non-Gaussian processes and
non-Gaussian noise. The learning algorithm chooses a nonlinear transformation
such that transformed data is well-modelled by a GP. This can be seen as
including a preprocessing transformation as an integral part of the
probabilistic modelling problem, rather than as an ad-hoc step. We
demonstrate on several real regression problems that learning the
transformation can lead to significantly better performance than using a
regular GP, or a GP with a fixed transformation.

Philip E. Bourne, C. K. J. Allerston, Werner G. Krebs, Wilfred W. Li, Ilya N.
Shindyalov, Adam Godzik, Iddo Friedberg, Tong Liu, David L. Wild, Seungwoo
Hwang, Zoubin Ghahramani, Li Chen, and John D. Westbrook.
**The status of
structural genomics defined through the analysis of current targets and
structures**.
In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany A. Jung, and
Teri E. Klein, editors, *Pacific Symposium on Biocomputing*, pages
375-386. World Scientific, 2004.

** Abstract:** Structural
genomics (large-scale macromolecular 3-dimensional structure determination) is
unique in that major participants report scientific progress on a weekly
basis. The target database (TargetDB) maintained by the Protein Data Bank
(http://targetdb.pdb.org) reports this progress through the status of each
protein sequence (target) under consideration by the major structural
genomics centers worldwide. Hence, TargetDB provides a unique opportunity to
analyze the potential impact that this major initiative provides to
scientists interested in the sequence-structure-function-disease paradigm.
Here we report such an analysis with a focus on: (i) temporal
characteristics: how is the project doing and what can we expect in the
future? (ii) target characteristics: what are the predicted functions of the
proteins targeted by structural genomics and how biased is the target set
when compared to the PDB and to predictions across complete genomes? (iii)
structures solved: what are the characteristics of structures solved thus far
and what do they contribute? The analysis required a more extensive database
of structure predictions using different methods integrated with data from
other sources. This database, associated tools and related data sources are
available from http://spam.sdsc.edu.

Wei Chu, Zoubin Ghahramani, and David L. Wild.
**A graphical
model for protein secondary structure prediction**.
In Carla E. Brodley, editor, *ICML*, volume 69 of *ACM
International Conference Proceeding Series*. ACM, 2004.

**
Abstract:** In this paper, we present a graphical model for protein
secondary structure prediction. This model extends segmental semi-Markov
models (SSMM) to exploit multiple sequence alignment profiles which contain
information from evolutionarily related sequences. A novel parameterized
model is proposed as the likelihood function for the SSMM to capture the
segmental conformation. By incorporating the information from long range
interactions in β-sheets, this model is capable of carrying out inference on
contact maps. The numerical results on benchmark data sets show that
incorporating the profiles results in substantial improvements and the
generalization performance is promising.

Wei Chu, Zoubin Ghahramani, and David L. Wild.
**Protein
secondary structure prediction using sigmoid belief networks to parameterize
segmental semi-Markov models**.
In *ESANN*, pages 81-86, 2004.

** Abstract:** In this
paper, we merge the parametric structure of neural networks into a segmental
semi-Markov model to set up a Bayesian framework for protein structure
prediction. The parametric model, which can also be regarded as an extension
of a sigmoid belief network, captures the underlying dependency in residue
sequences. The results of numerical experiments indicate the usefulness of
this approach.

A. Dubey, S. Hwang, C. Rangel, Carl Edward Rasmussen, Zoubin Ghahramani, and
David L. Wild.
**Clustering
protein sequence and structure space with infinite Gaussian mixture
models**.
In *Pacific Symposium on Biocomputing 2004*, pages 399-410, Singapore,
2004. World Scientific Publishing.

** Abstract:** We describe
a novel approach to the problem of automatically clustering protein sequences
and discovering protein families, subfamilies etc., based on the theory of
infinite Gaussian mixture models. This method allows the data itself to
dictate how many mixture components are required to model it, and provides a
measure of the probability that two proteins belong to the same cluster. We
illustrate our methods with application to three data sets: globin sequences,
globin sequences with known three-dimensional structures and G-protein coupled
receptor sequences. The consistency of the clusters indicates that our
method is producing biologically meaningful results, which provide a very
good indication of the underlying families and subfamilies. With the
inclusion of secondary structure and residue solvent accessibility
information, we obtain a classification of sequences of known structure which
reflects and extends their SCOP classifications.

Ananya Dubey, Seungwoo Hwang, Claudia Rangel, Carl Edward Rasmussen, Zoubin
Ghahramani, and David L. Wild.
**Clustering
protein sequence and structure space with infinite Gaussian mixture
models**.
In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany A. Jung, and
Teri E. Klein, editors, *Pacific Symposium on Biocomputing*, pages
399-410. World Scientific, 2004.

** Abstract:** We describe a
novel approach to the problem of automatically clustering protein sequences
and discovering protein families, subfamilies etc., based on the theory of
infinite Gaussian mixture models. This method allows the data itself to
dictate how many mixture components are required to model it, and provides a
measure of the probability that two proteins belong to the same cluster. We
illustrate our methods with application to three data sets: globin sequences,
globin sequences with known three-dimensional structures and G-protein
coupled receptor sequences. The consistency of the clusters indicates that our
method is producing biologically meaningful results, which provide a very
good indication of the underlying families and subfamilies. With the
inclusion of secondary structure and residue solvent accessibility
information, we obtain a classification of sequences of known structure which
both reflects and extends their SCOP classifications. A supplementary web
site containing larger versions of the figures is available at
http://public.kgi.edu/~wild/PSB04/index.html

Iain Murray and Zoubin Ghahramani.
**Bayesian learning
in undirected graphical models: Approximate MCMC algorithms**.
In David Maxwell Chickering and Joseph Y. Halpern, editors, *UAI*, pages
392-399. AUAI Press, 2004.

** Abstract:** Bayesian learning
in undirected graphical models (computing posterior distributions over
parameters and predictive quantities) is exceptionally difficult. We
conjecture that for general undirected models, there are no tractable MCMC
(Markov Chain Monte Carlo) schemes giving the correct equilibrium
distribution over parameters. While this intractability, due to the partition
function, is familiar to those performing parameter optimisation, Bayesian
learning of posterior distributions over undirected model parameters has been
unexplored and poses novel challenges. We propose several approximate MCMC
schemes and test on fully observed binary models (Boltzmann machines) for a
small coronary heart disease data set and larger artificial systems. While
approximations must perform well on the model, their interaction with the
sampling scheme is also important. Samplers based on variational mean-field
approximations generally performed poorly; more advanced methods using loopy
propagation, brief sampling and stochastic dynamics lead to acceptable
parameter posteriors. Finally, we demonstrate these techniques on a Markov
random field with hidden variables.

Yuan (Alan) Qi, Thomas P. Minka, Rosalind W. Picard, and Zoubin Ghahramani.
**Predictive
automatic relevance determination by expectation propagation**.
In Carla E. Brodley, editor, *ICML*, volume 69 of *ACM
International Conference Proceeding Series*. Association for Computing
Machinery, 2004.

** Abstract:** In many real-world
classification problems the input contains a large number of potentially
irrelevant features. This paper proposes a new Bayesian framework for
determining the relevance of input features. This approach extends one of the
most successful Bayesian methods for feature selection and sparse learning,
known as Automatic Relevance Determination (ARD). ARD finds the relevance of
features by optimizing the model marginal likelihood, also known as the
evidence. We show that this can lead to overfitting. To address this problem,
we propose Predictive ARD based on estimating the predictive performance of
the classifier. While the actual leave-one-out predictive performance is
generally very costly to compute, the expectation propagation (EP) algorithm
proposed by Minka provides an estimate of this predictive performance as a
side-effect of its iterations. We exploit this in our algorithm to do feature
selection, and to select data points in a sparse Bayesian kernel classifier.
Moreover, we provide two other improvements to previous algorithms, by
replacing Laplace's approximation with the generally more accurate EP, and by
incorporating the fast optimization algorithm proposed by Faul and Tipping.
Our experiments show that our method based on the EP estimate of predictive
performance is more accurate on test data than relevance determination by
optimizing the evidence.

Claudia Rangel, John Angus, Zoubin Ghahramani, Maria Lioumi, Elizabeth
Sotheran, Alessia Gaiba, David L. Wild, and Francesco Falciani.
**Modeling T-cell
activation using gene expression profiling and state-space models**.
*Bioinformatics*, 20(9):1361-1372, 2004.

** Abstract:**
Motivation: We have used state-space models to reverse engineer
transcriptional networks from highly replicated gene expression profiling
time series data obtained from a well-established model of T-cell activation.
State space models are a class of dynamic Bayesian networks that assume that
the observed measurements depend on some hidden state variables that evolve
according to Markovian dynamics. These hidden variables can capture effects
that cannot be measured in a gene expression profiling experiment, e.g. genes
that have not been included in the microarray, levels of regulatory proteins,
the effects of messenger RNA and protein degradation, etc. Results: Bootstrap
confidence intervals are developed for parameters representing 'gene-gene'
interactions over time. Our models represent the dynamics of T-cell
activation and provide a methodology for the development of rational and
experimentally testable hypotheses. Availability: Supplementary data and
Matlab computer source code will be made available on the web at the URL
given below. Supplementary information: .

Sebastian Thrun, Yufeng Liu, Daphne Koller, Andrew Y. Ng, Zoubin Ghahramani,
and Hugh F. Durrant-Whyte.
**Simultaneous
localization and mapping with sparse extended information filters**.
*I. J. Robotic Res.*, 23(7-8):693-716, 2004.

**
Abstract:** This paper describes a scalable algorithm for the simultaneous
mapping and localization (SLAM) problem. SLAM is the problem of acquiring a
map of a static environment with a mobile robot. The vast majority of SLAM
algorithms are based on the extended Kalman filter (EKF). This paper
advocates an algorithm that relies on the dual of the EKF, the extended
information filter (EIF). We show that when represented in the information
form, map posteriors are dominated by a small number of links that tie
together nearby features in the map. This insight is developed into a sparse
variant of the EIF, called the sparse extended information filter (SEIF).
SEIFs represent maps by graphical networks of features that are locally
interconnected, where links represent relative information between pairs of
nearby features, as well as information about the robot's pose relative to
the map. We show that all essential update equations in SEIFs can be executed
in constant time, irrespective of the size of the map. We also provide
empirical results obtained for a benchmark data set collected in an outdoor
environment, and using a multi-robot mapping simulation.

Jian Zhang, Zoubin Ghahramani, and Yiming Yang.
**A probabilistic
model for online document clustering with application to novelty
detection**.
In Sebastian Thrun, Lawrence K. Saul, and Bernhard Schölkopf, editors,
*NIPS*. MIT Press, 2004.

** Abstract:** In this paper
we propose a probabilistic model for online document clustering. We use
a non-parametric Dirichlet process prior to model the growing number of
clusters, and use a prior based on a general English language model as the
base distribution to handle the generation of novel clusters. Furthermore,
cluster uncertainty is modeled with a Bayesian Dirichlet-multinomial
distribution. We use an empirical Bayes method to estimate hyperparameters
based on a historical
dataset. Our probabilistic model is applied to the novelty detection task in
Topic Detection and Tracking (TDT) and compared with existing approaches in
the literature.
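
As a rough illustration of how a Dirichlet process prior lets the number of clusters grow with the data, the sketch below samples cluster assignments from the Chinese restaurant process, the standard sequential view of that prior. It covers only the prior: the language-model base distribution and the Dirichlet-multinomial likelihood described in the abstract are left out, and the function name `crp_assignments` and the parameters `alpha` and `seed` are assumptions made for this example, not names taken from the paper.

```python
import random


def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster labels for n_items from a Chinese restaurant process.

    alpha > 0 is the concentration parameter (hypothetical name): larger
    values make opening a new cluster more likely.
    """
    rng = random.Random(seed)
    counts = []        # counts[k] = number of items currently in cluster k
    assignments = []   # assignments[i] = cluster label of item i
    for _ in range(n_items):
        # Existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, counts


assignments, counts = crp_assignments(n_items=50, alpha=1.0)
print(len(counts), "clusters for 50 items")
```

In an online setting each arriving document would combine these prior weights with a likelihood term before choosing a cluster; that step is omitted here.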

Xiaojin Zhu, Jaz S. Kandola, Zoubin Ghahramani, and John D. Lafferty.
**Nonparametric
transforms of graph kernels for semi-supervised learning**.
In Sebastian Thrun, Lawrence K. Saul, and Bernhard Schölkopf, editors,
*NIPS*. MIT Press, 2004.

** Abstract:** We present an
algorithm based on convex optimization for constructing kernels for
semi-supervised learning. The kernel matrices are derived from the spectral
decomposition of graph Laplacians, and combine labeled and unlabeled data in
a systematic fashion. Unlike previous work using diffusion kernels and
Gaussian random field kernels, a nonparametric kernel approach is presented
that incorporates order constraints during optimization. This results in
flexible kernels and avoids the need to choose among different parametric
forms. Our approach relies on a quadratically constrained quadratic program
(QCQP), and is computationally feasible for large datasets. We evaluate the
kernels on real datasets using support vector machines, with encouraging
results.

Carl Edward Rasmussen and Zoubin Ghahramani.
**Bayesian Monte
Carlo**.
In S. Becker, S. Thrun, and K. Obermayer, editors, *Advances in Neural
Information Processing Systems 15*, pages 489-496, Cambridge, MA, USA,
December 2003. The MIT Press.

** Abstract:** We investigate
Bayesian alternatives to classical Monte Carlo methods for evaluating
integrals. Bayesian Monte Carlo (BMC) allows the incorporation of prior
knowledge, such as smoothness of the integrand, into the estimation. In a
simple problem we show that this outperforms any classical importance
sampling method. We also attempt more challenging multidimensional integrals
involved in computing marginal likelihoods of statistical models (a.k.a.
partition functions and model evidences). We find that Bayesian Monte Carlo
outperformed Annealed Importance Sampling, although for very high dimensional
problems or problems with massive multimodality BMC may be less adequate. One
advantage of the Bayesian approach to Monte Carlo is that samples can be
drawn from any distribution. This allows for the possibility of active design
of sample points so as to maximise information gain.

Zoubin Ghahramani.
**Unsupervised
Learning**.
In Olivier Bousquet, Ulrike von Luxburg, and Gunnar Rätsch, editors,
*Advanced Lectures on Machine Learning*, volume 3176 of *Lecture
Notes in Computer Science*, pages 72-112. Springer, 2003.

** Abstract:** We give a tutorial and overview of the field of
unsupervised learning from the perspective of statistical modelling.
Unsupervised learning can be motivated from information theoretic and
Bayesian principles. We briefly review basic models in unsupervised learning,
including factor analysis, PCA, mixtures of Gaussians, ICA, hidden Markov
models, state-space models, and many variants and extensions. We derive the
EM algorithm and give an overview of fundamental concepts in graphical
models, and inference algorithms on graphs. This is followed by a quick tour
of approximate Bayesian inference, including Markov chain Monte Carlo (MCMC),
Laplace approximation, BIC, variational approximations, and expectation
propagation (EP). The aim of this chapter is to provide a high-level view of
the field. Along the way, many state-of-the-art ideas and future directions
are also reviewed.

Ruslan Salakhutdinov, Sam T. Roweis, and Zoubin Ghahramani.
**On the
convergence of bound optimization algorithms**.
In Christopher Meek and Uffe Kjærulff, editors, *UAI*, pages
509-516. Morgan Kaufmann, 2003.

** Abstract:** Many
practitioners who use EM and related algorithms complain that they are
sometimes slow. When does this happen, and what can be done about it? In this
paper, we study the general class of bound optimization algorithms -
including EM, Iterative Scaling, Non-negative Matrix Factorization, CCCP -
and their relationship to direct optimization algorithms such as
gradient-based methods for parameter learning. We derive a general
relationship between the updates performed by bound optimization methods and
those of gradient and second-order methods and identify analytic conditions
under which bound optimization algorithms exhibit quasi-Newton behavior, and
under which they possess poor, first-order convergence. Based on this
analysis, we consider several specific algorithms, interpret and analyze
their convergence properties and provide some recipes for preprocessing input
to these algorithms to yield faster convergence behavior. We report empirical
results supporting our analysis and showing that simple data preprocessing
can result in dramatically improved performance of bound optimizers in
practice.

Ruslan Salakhutdinov, Sam T. Roweis, and Zoubin Ghahramani.
**Optimization
with EM and expectation-conjugate-gradient**.
In Tom Fawcett and Nina Mishra, editors, *ICML*, pages 672-679. AAAI
Press, 2003.

** Abstract:** We show a close relationship
between the Expectation-Maximization (EM) algorithm and direct optimization
algorithms such as gradient-based methods for parameter learning. We identify
analytic conditions under which EM exhibits Newton-like behavior, and
conditions under which it possesses poor, first-order convergence. Based on
this analysis, we propose two novel algorithms for maximum likelihood
estimation of latent variable models, and report empirical results showing
that, as predicted by theory, the proposed new algorithms can substantially
outperform standard EM in terms of speed of convergence in certain cases.

Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty.
**Semi-supervised
learning using Gaussian fields and harmonic functions**.
In Tom Fawcett and Nina Mishra, editors, *ICML*, pages 912-919. AAAI
Press, 2003.

** Abstract:** An approach to semi-supervised
learning is proposed that is based on a Gaussian random field model. Labeled
and unlabeled data are represented as vertices in a weighted graph, with edge
weights encoding the similarity between instances. The learning problem is
then formulated in terms of a Gaussian random field on this graph, where the
mean of the field is characterized in terms of harmonic functions, and is
efficiently obtained using matrix methods or belief propagation. The
resulting learning algorithms have intimate connections with random walks,
electric networks, and spectral graph theory. We discuss methods to
incorporate class priors and the predictions of classifiers obtained by
supervised learning. We also propose a method of parameter learning by
entropy minimization, and show the algorithm's ability to perform feature
selection. Promising experimental results are presented for synthetic data,
digit classification, and text classification tasks.
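
The matrix-method route to the harmonic solution mentioned in the abstract can be sketched as follows, assuming the usual formulation in which the graph Laplacian is L = D - W and the unlabeled values solve L_uu f_u = W_ul f_l. The function name `harmonic_labels` and the toy chain graph are assumptions made for this illustration; class priors, belief propagation and parameter learning are not covered.

```python
import numpy as np


def harmonic_labels(W, labeled_idx, f_labeled):
    """Solve for the harmonic function on the unlabeled vertices.

    W is a symmetric non-negative weight matrix, labeled_idx the indices
    of labeled vertices, f_labeled their (real-valued) labels.
    """
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    labeled_idx = np.asarray(labeled_idx)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)

    # Graph Laplacian L = D - W, with D the diagonal degree matrix.
    L = np.diag(W.sum(axis=1)) - W
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]

    # Each unlabeled value becomes the weighted average of its neighbours.
    f_u = np.linalg.solve(L_uu, W_ul @ np.asarray(f_labeled, dtype=float))
    return unlabeled_idx, f_u


# Toy example: a chain 0 - 1 - 2 - 3 with unit weights,
# vertex 0 labeled 0.0 and vertex 3 labeled 1.0.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
idx, f_u = harmonic_labels(W, [0, 3], [0.0, 1.0])
print(idx, f_u)  # vertices 1 and 2 get values 1/3 and 2/3
```

On the chain the solution interpolates linearly between the two labeled endpoints, which matches the random-walk reading: the value at a vertex is the probability that a walk started there reaches the vertex labeled 1 before the vertex labeled 0.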

Matthew J. Beal, Zoubin Ghahramani, and Carl Edward Rasmussen.
**The infinite
hidden Markov model**.
In *Advances in Neural Information Processing Systems 14*, pages
577-584, Cambridge, MA, USA, December 2002. The MIT Press.

**
Abstract:** We show that it is possible to extend hidden Markov models to
have a countably infinite number of hidden states. By using the theory of
Dirichlet processes we can implicitly integrate out the infinitely many
transition parameters, leaving only three hyperparameters which can be
learned from data. These three hyperparameters define a hierarchical
Dirichlet process capable of capturing a rich set of transition dynamics. The
three hyperparameters control the time scale of the dynamics, the sparsity of
the underlying state-transition matrix, and the expected number of distinct
hidden states in a finite sequence. In this framework it is also natural to
allow the alphabet of emitted symbols to be infinite - consider, for
example, symbols being possible words appearing in English text.

Carl Edward Rasmussen and Zoubin Ghahramani.
**Infinite mixtures of
Gaussian process experts**.
In *Advances in Neural Information Processing Systems 14*, pages
881-888, Cambridge, MA, USA, December 2002. The MIT Press.

**
Abstract:** We present an extension to the Mixture of Experts (ME) model,
where the individual experts are Gaussian Process (GP) regression models.
Using an input-dependent adaptation of the Dirichlet Process, we implement a
gating network for an infinite number of Experts. Inference in this model may
be done efficiently using a Markov Chain relying on Gibbs sampling. The model
allows the effective covariance function to vary with the inputs, and may
handle large datasets - thus potentially overcoming two of the biggest
hurdles with GP models. Simulations show the viability of this approach.

Rong Jin and Zoubin Ghahramani.
**Learning with
multiple labels**.
In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors,
*NIPS*, pages 897-904. MIT Press, 2002.

** Abstract:**
In this paper, we study a special kind of learning problem in which each
training instance is given a set of (or distribution over) candidate class
labels and only one of the candidate labels is the correct one. Such a
problem can occur, e.g., in an information retrieval setting where a set of
words is associated with an image, or if class labels are organized
hierarchically. We propose a novel discriminative approach for handling the
ambiguity of class labels in the training examples. The experiments with the
proposed approach over five different UCI datasets show that our approach is
able to find the correct label among the set of candidate labels and actually
achieve performance close to the case when each training instance is given a
single correct label. In contrast, naïve methods degrade rapidly as more
ambiguity is introduced into the labels.

A. Raval, Zoubin Ghahramani, and David L. Wild.
**A Bayesian
network model for protein fold and remote homologue recognition**.
*Bioinformatics*, 18(6):788-801, 2002.

** Abstract:**
Motivation: The Bayesian network approach is a framework which combines
graphical representation and probability theory, which includes, as a special
case, hidden Markov models. Hidden Markov models trained on amino acid
sequence or secondary structure data alone have been shown to have potential
for addressing the problem of protein fold and superfamily classification.
Results: This paper describes a novel implementation of a Bayesian network
which simultaneously learns amino acid sequence, secondary structure and
residue accessibility for proteins of known three-dimensional structure. An
awareness of the errors inherent in predicted secondary structure may be
incorporated into the model by means of a confusion matrix. Training and
validation data have been derived for a number of protein superfamilies from
the Structural Classification of Proteins (SCOP) database. Cross validation
results using posterior probability classification demonstrate that the
Bayesian network performs better in classifying proteins of known structural
superfamily than a hidden Markov model trained on amino acid sequences
alone.

Naonori Ueda and Zoubin Ghahramani.
**Bayesian model
search for mixture models based on optimizing variational bounds**.
*Neural Networks*, 15(10):1223-1241, 2002.

**
Abstract:** When learning a mixture model, we suffer from the local optima
and model structure determination problems. In this paper, we present a
method for simultaneously solving these problems based on the variational
Bayesian (VB) framework. First, in the VB framework, we derive an objective
function that can simultaneously optimize both model parameter distributions
and model structure. Next, focusing on mixture models, we present a
deterministic algorithm to approximately optimize the objective function by
using the idea of the split and merge operations which we previously proposed
within the maximum likelihood framework. Then, we apply the method to mixture
of experts (MoE) models to experimentally show that the proposed method can
find the optimal number of experts of a MoE while avoiding local maxima. ©
2002 Elsevier Science Ltd. All rights reserved.

Carl Edward Rasmussen and Zoubin Ghahramani.
**Occam's
razor**.
In *Advances in Neural Information Processing Systems 13*, pages
294-300, Cambridge, MA, USA, December 2001. The MIT Press.

**
Abstract:** The Bayesian paradigm apparently only sometimes gives rise to
Occam's Razor; at other times very large models perform well. We give simple
examples of both kinds of behaviour. The two views are reconciled when
measuring complexity of functions, rather than of the machinery used to
implement them. We analyze the complexity of functions for some
linear-in-the-parameter models that are equivalent to Gaussian Processes, and always find
Occam's Razor at work.

Zoubin Ghahramani.
**An introduction to
hidden Markov models and Bayesian networks**.
*IJPRAI*, 15(1):9-42, 2001.

** Abstract:** We provide a
tutorial on learning and inference in hidden Markov models in the context of
the recent literature on Bayesian networks. This perspective makes it
possible to consider novel generalizations to hidden Markov models with
multiple hidden state variables, multiscale representations, and mixed
discrete and continuous variables. Although exact inference in these
generalizations is usually intractable, one can use approximate
inference algorithms such as Markov chain sampling and variational methods.
We describe how such methods are applied to these generalized hidden Markov
models. We conclude this review with a discussion of Bayesian methods for
model selection in generalized HMMs.

Nicholas J. Adams, Amos J. Storkey, Christopher K. I. Williams, and Zoubin
Ghahramani.
**MFDTs: Mean
field dynamic trees**.
In *International Conference on Pattern Recognition*, volume 3,
pages 151-154, 2000.

** Abstract:** Tree structured belief
networks are attractive for image segmentation tasks. However, networks with
fixed architectures are not very suitable as they lead to blocky artefacts,
which led to the introduction of dynamic trees (DTs). The dynamic tree
architecture provides a prior distribution over tree structures, and simulated
annealing (SA) was used to search for structures with high posterior
probability. In this paper we introduce a mean field approach to inference in
DTs. We find that the mean field method captures the posterior better than
just using the maximum a posteriori solution found by SA.

Zoubin Ghahramani and Matthew J. Beal.
**Propagation
algorithms for variational Bayesian learning**.
In Michael J. Kearns, Sara A. Solla, and David A. Cohn, editors, *NIPS*,
pages 507-513. The MIT Press, 2000.

** Abstract:**
Variational approximations are becoming a widespread tool for Bayesian
learning of graphical models. We provide some theoretical results for the
variational updates in a very general family of conjugate-exponential
graphical models. We show how the belief propagation and the junction tree
algorithms can be used in the inference step of variational Bayesian
learning. Applying these results to the Bayesian analysis of linear-Gaussian
state-space models we obtain a learning procedure that exploits the Kalman
smoothing propagation, while integrating over all model parameters. We
demonstrate how this can be used to infer the hidden state dimensionality of
the state-space model in a variety of synthetic problems and one real
high-dimensional data set.

Zoubin Ghahramani and Geoffrey E. Hinton.
**Variational
learning for switching state-space models**.
*Neural Computation*, 12(4):831-864, 2000.

**
Abstract:** We introduce a new statistical model for time series which
iteratively segments data into regimes with approximately linear dynamics and
learns the parameters of each of these linear regimes. This model combines
and generalizes two of the most widely used stochastic time series
models (hidden Markov models and linear dynamical systems) and is closely
related to models that are widely used in the control and econometrics
literatures. It can also be derived by extending the mixture of experts
neural network (Jacobs et al., 1991) to its fully dynamical version, in which
both expert and gating networks are recurrent. Inferring the posterior
probabilities of the hidden states of this model is computationally
intractable, and therefore the exact Expectation Maximization (EM) algorithm
cannot be applied. However, we present a variational approximation that
maximizes a lower bound on the log likelihood and makes use of both the
forward-backward recursions for hidden Markov models and the Kalman filter
recursions for linear dynamical systems. We tested the algorithm both on
artificial data sets and on a natural data set of respiration force from a
patient with sleep apnea. The results suggest that variational approximations
are a viable method for inference and learning in switching state-space
models.

Naonori Ueda, Ryohei Nakano, Zoubin Ghahramani, and Geoffrey E. Hinton.
**SMEM algorithm
for mixture models**.
*Neural Computation*, 12(9):2109-2128, 2000.

**
Abstract:** We present a split-and-merge expectation-maximization (SMEM)
algorithm to overcome the local maxima problem in parameter estimation of
finite mixture models. In the case of mixture models, local maxima often
involve having too many components of a mixture model in one part of the
space and too few in another, widely separated part of the space. To escape
from such configurations, we repeatedly perform simultaneous split-and-merge
operations using a new criterion for efficiently selecting the
split-and-merge candidates. We apply the proposed algorithm to the training
of Gaussian mixtures and mixtures of factor analyzers using synthetic and
real data and show the effectiveness of using the split-and-merge operations
to improve the likelihood of both the training data and of held-out test
data. We also show the practical usefulness of the proposed algorithm by
applying it to image compression and pattern recognition problems.

Naonori Ueda, Ryohei Nakano, Zoubin Ghahramani, and Geoffrey E. Hinton.
**Split and merge
EM algorithm for improving Gaussian mixture density estimates**.
*VLSI Signal Processing*, 26(1-2):133-140, 2000.

**
Abstract:** We present a split and merge EM algorithm to overcome the local
maximum problem in Gaussian mixture density estimation. Nonglobal maxima
often involve having too many Gaussians in one part of the space and too few
in another, widely separated part of the space. To escape from such
configurations we repeatedly perform split and merge operations using a new
criterion for efficiently selecting the split and merge candidates.
Experimental results on synthetic and real data show the effectiveness of
using the split and merge operations to improve the likelihood of both the
training data and of held-out test data.

Zoubin Ghahramani and Matthew J. Beal.
**Variational
inference for Bayesian mixtures of factor analysers**.
In Michael J. Kearns, Sara A. Solla, and David A. Cohn, editors, *NIPS*,
pages 449-455. The MIT Press, 1999.

** Abstract:** We present
an algorithm that infers the model structure of a mixture of factor analysers
using an efficient and deterministic variational approximation to full
Bayesian integration over model parameters. This procedure can automatically
determine the optimal number of components and the local dimensionality of
each component (i.e. the number of factors in each factor analyser).
Alternatively it can be used to infer posterior distributions over number of
components and dimensionalities. Since all parameters are integrated out the
method is not prone to overfitting. Using a stochastic procedure for adding
components it is possible to perform the variational optimisation
incrementally and to avoid local maxima. Results show that the method works
very well in practice and correctly infers the number and dimensionality of
nontrivial synthetic examples. By importance sampling from the variational
approximation we show how to obtain unbiased estimates of the true evidence,
the exact predictive density, and the KL divergence between the variational
posterior and the true posterior, not only in this model but for variational
approximations in general.

Zoubin Ghahramani, Alexander T Korenberg, and Geoffrey E Hinton.
**Scaling in a
hierarchical unsupervised network**.
In *Artificial Neural Networks, 1999. ICANN 99. Ninth International
Conference on (Conf. Publ. No. 470)*, volume 1, pages 13-18. IET,
1999.

** Abstract:** A persistent worry with computational
models of unsupervised learning is that learning will become more difficult
as the problem is scaled. We examine this issue in the context of a novel
hierarchical, generative model that can be viewed as a non-linear
generalization of factor analysis and can be implemented in a neural network.
The model performs perceptual inference in a probabilistically consistent
manner by using top-down, bottom-up and lateral connections. These
connections can be learned using simple rules that require only locally
available information. We first demonstrate that the model can extract a
sparse, distributed, hierarchical representation of global disparity from
simplified random-dot stereograms. We then investigate some of the scaling
properties of the algorithm on this problem and find that: (1) increasing
the image size leads to faster and more reliable learning; (2) increasing the
depth of the network from one to two hidden layers leads to better
representations at the first hidden layer; and (3) once one part of the
network has discovered how to represent disparity, it 'supervises' other
parts of the network, greatly speeding up their learning.

Geoffrey E. Hinton, Zoubin Ghahramani, and Yee Whye Teh.
**Learning to
parse images**.
In Michael J. Kearns, Sara A. Solla, and David A. Cohn, editors, *NIPS*,
pages 463-469. The MIT Press, 1999.

** Abstract:** We
describe a class of probabilistic models that we call credibility networks.
Using parse trees as internal representations of images, credibility networks
are able to perform segmentation and recognition simultaneously, removing the
need for ad hoc segmentation heuristics. Promising results in the problem of
segmenting handwritten digits were obtained.

Michael I. Jordan, Zoubin Ghahramani, Tommi Jaakkola, and Lawrence K. Saul.
**An introduction
to variational methods for graphical models**.
*Machine Learning*, 37(2):183-233, 1999.

** Abstract:**
This paper presents a tutorial introduction to the use of variational methods
for inference and learning in graphical models (Bayesian networks and Markov
random fields). We present a number of examples of graphical models,
including the QMR-DT database, the sigmoid belief network, the Boltzmann
machine, and several variants of hidden Markov models, in which it is
infeasible to run exact inference algorithms. We then introduce variational
methods, which exploit laws of large numbers to transform the original
graphical model into a simplified graphical model in which inference is
efficient. Inference in the simplified model provides bounds on probabilities
of interest in the original model. We describe a general framework for
generating variational transformations based on convex duality. Finally we
return to the examples and demonstrate how variational algorithms can be
formulated in each case.

Sam T. Roweis and Zoubin Ghahramani.
**A unifying review
of linear Gaussian models**.
*Neural Computation*, 11(2):305-345, 1999.

**
Abstract:** Factor analysis, principal component analysis, mixtures of
gaussian clusters, vector quantization, Kalman filter models, and hidden
Markov models can all be unified as variations of unsupervised learning under
a single basic generative model. This is achieved by collecting together
disparate observations and derivations made by many previous authors and
introducing a new way of linking discrete and continuous state models using a
simple nonlinearity. Through the use of other nonlinearities, we show how
independent component analysis is also a variation of the same basic
generative model. We show that factor analysis and mixtures of gaussians can
be implemented in autoencoder neural networks and learned using squared error
plus the same regularization term. We introduce a new model for static data,
known as sensible principal component analysis, as well as a novel concept of
spatially adaptive observation noise. We also review some of the literature
involving global and local mixtures of the basic models and provide
pseudocode for inference and learning for all the basic models.

Zoubin Ghahramani and Sam T. Roweis.
**Learning nonlinear
dynamical systems using an EM algorithm**.
In Michael J. Kearns, Sara A. Solla, and David A. Cohn, editors, *NIPS*,
pages 431-437. The MIT Press, 1998.

** Abstract:** The
Expectation Maximization (EM) algorithm is an iterative procedure for maximum
likelihood parameter estimation from data sets with missing or hidden
variables. It has been applied to system identification in linear stochastic
state-space models, where the state variables are hidden from the observer
and both the state and the parameters of the model have to be estimated
simultaneously [9]. We present a generalization of the EM algorithm for
parameter estimation in nonlinear dynamical systems. The "expectation" step
makes use of Extended Kalman Smoothing to estimate the state, while the
"maximization" step re-estimates the parameters using these uncertain state
estimates. In general, the nonlinear maximization step is difficult because
it requires integrating out the uncertainty in the states. However, if
Gaussian radial basis function (RBF) approximators are used to model the
nonlinearities, the integrals become tractable and the maximization step can
be solved via systems of linear equations.

Naonori Ueda, Ryohei Nakano, Zoubin Ghahramani, and Geoffrey E. Hinton.
**SMEM algorithm
for mixture models**.
In Michael J. Kearns, Sara A. Solla, and David A. Cohn, editors, *NIPS*,
pages 599-605. The MIT Press, 1998.

** Abstract:** We present
a split-and-merge expectation-maximization (SMEM) algorithm to overcome the
local maxima problem in parameter estimation of finite mixture models. In the
case of mixture models, local maxima often involve having too many components
of a mixture model in one part of the space and too few in another, widely
separated part of the space. To escape from such configurations, we
repeatedly perform simultaneous split-and-merge operations using a new
criterion for efficiently selecting the split-and-merge candidates. We apply
the proposed algorithm to the training of gaussian mixtures and mixtures of
factor analyzers using synthetic and real data and show the effectiveness of
using the split-and-merge operations to improve the likelihood of both the
training data and of held-out test data. We also show the practical
usefulness of the proposed algorithm by applying it to image compression and
pattern recognition problems.

Zoubin Ghahramani and Geoffrey E. Hinton.
**Hierarchical
non-linear factor analysis and topographic maps**.
In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors,
*NIPS*. The MIT Press, 1997.

** Abstract:** We first
describe a hierarchical, generative model that can be viewed as a non-linear
generalisation of factor analysis and can be implemented in a neural network.
The model performs perceptual inference in a probabilistically consistent
manner by using top-down, bottom-up and lateral connections. These
connections can be learned using simple rules that require only locally
available information. We then show how to incorporate lateral connections
into the generative model. The model extracts a sparse, distributed,
hierarchical representation of depth from simplified random-dot stereograms
and the localised disparity detectors in the first hidden layer form a
topographic map. When presented with image patches from natural scenes, the
model develops topographically organised local feature detectors.

Zoubin Ghahramani.
**Learning dynamic
Bayesian networks**.
In C. Lee Giles and Marco Gori, editors, *Adaptive Processing of Sequences
and Data Structures*, volume 1387 of *Lecture Notes in Computer
Science*, pages 168-197. Springer, 1997.

** Abstract:**
Bayesian networks are a concise graphical formalism for describing
probabilistic models. We have provided a brief tutorial of methods for
learning and inference in dynamic Bayesian networks. In many of the
interesting models, beyond the simple linear dynamical system or hidden
Markov model, the calculations required for inference are intractable. Two
different approaches for handling this intractability are Monte Carlo methods
such as Gibbs sampling, and variational methods. An especially promising
variational approach is based on exploiting tractable substructures in the
Bayesian network.

Zoubin Ghahramani and Michael I. Jordan.
**Factorial hidden
Markov models**.
*Machine Learning*, 29(2-3):245-273, 1997.

**
Abstract:** Hidden Markov models (HMMs) have proven to be one of the most
widely used tools for learning probabilistic models of time series data. In
an HMM, information about the past is conveyed through a single discrete
variable, the hidden state. We discuss a generalization of HMMs in which
this state is factored into multiple state variables and is therefore
represented in a distributed manner. We describe an exact algorithm for
inferring the posterior probabilities of the hidden state variables given the
observations, and relate it to the forward-backward algorithm for HMMs and
to algorithms for more general graphical models. Due to the combinatorial
nature of the hidden state representation, this exact algorithm is
intractable. As in other intractable systems, approximate inference can be
carried out using Gibbs sampling or variational methods. Within the
variational framework, we present a structured approximation in which the
state variables are decoupled, yielding a tractable algorithm for learning
the parameters of the model. Empirical comparisons suggest that these
approximations are efficient and provide accurate alternatives to the exact
methods. Finally, we use the structured approximation to model Bach's
chorales and show that factorial HMMs can capture statistical structure in
this data set which an unconstrained HMM cannot.

Geoffrey E Hinton and Zoubin Ghahramani.
**Generative models
for discovering sparse distributed representations**.
*Philosophical Transactions of the Royal Society of London. Series B:
Biological Sciences*, 352(1358):1177-1190, 1997.

**
Abstract:** We describe a hierarchical, generative model that can be viewed
as a nonlinear generalization of factor analysis and can be implemented in a
neural network. The model uses bottom–up, top–down and lateral
connections to perform Bayesian perceptual inference correctly. Once
perceptual inference has been performed the connection strengths can be
updated using a very simple learning rule that only requires locally
available information. We demonstrate that the network learns to extract
sparse, distributed, hierarchical representations.

David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan.
**Active learning
with statistical models**.
*J. Artif. Intell. Res. (JAIR)*, 4:129-145, 1996.

**
Abstract:** For many types of machine learning algorithms, one can compute
the statistically "optimal" way to select training data. In this paper, we
review how optimal data selection techniques have been used with feedforward
neural networks. We then show how the same principles may be used to select
data for two alternative, statistically-based learning architectures:
mixtures of Gaussians and locally weighted regression. While the techniques
for neural networks are computationally expensive and approximate, the
techniques for mixtures of Gaussians and locally weighted regression are both
efficient and accurate. Empirically, we observe that the optimality criterion
sharply decreases the number of training examples the learner needs in order
to achieve good performance.

Michael I. Jordan, Zoubin Ghahramani, and Lawrence K. Saul.
**Hidden Markov
decision trees**.
In Michael Mozer, Michael I. Jordan, and Thomas Petsche, editors,
*NIPS*, pages 501-507. MIT Press, 1996.

** Abstract:**
We study a time series model that can be viewed as a decision tree with
Markov temporal structure. The model is intractable for exact calculations,
thus we utilize variational approximations. We consider three different
distributions for the approximation: one in which the Markov calculations are
performed exactly and the layers of the decision tree are decoupled, one in
which the decision tree calculations are performed exactly and the time steps
of the Markov chain are decoupled, and one in which a Viterbi-like assumption
is made to pick out a single most likely state sequence. We present
simulation results for artificial data and the Bach chorales.

Carl Edward Rasmussen, Radford M. Neal, Geoffrey E. Hinton, Drew van Camp, Mike
Revow, Zoubin Ghahramani, Rafal Kustra, and Robert Tibshirani.
**The DELVE
manual**, 1996.

** Abstract:** DELVE - Data for
Evaluating Learning in Valid Experiments - is a collection of datasets from
many sources, an environment within which this data can be used to assess the
performance of methods for learning relationships from data, and a repository
for the results of such experiments.

** Comment:** The delve website.

Zoubin Ghahramani and Michael I. Jordan.
**Factorial hidden
Markov models**.
In David S. Touretzky, Michael Mozer, and Michael E. Hasselmo, editors,
*NIPS*, pages 472-478. MIT Press, 1995.

** Abstract:**
Hidden Markov models (HMMs) have proven to be one of the most widely used
tools for learning probabilistic models of time series data. In an HMM,
information about the past is conveyed through a single discrete
variable, the hidden state. We discuss a generalization of HMMs in which
this state is factored into multiple state variables and is therefore
represented in a distributed manner. We describe an exact algorithm for
inferring the posterior probabilities of the hidden state variables given the
observations, and relate it to the forward-backward algorithm for HMMs and
to algorithms for more general graphical models. Due to the combinatorial
nature of the hidden state representation, this exact algorithm is
intractable. As in other intractable systems, approximate inference can be
carried out using Gibbs sampling or variational methods. Within the
variational framework, we present a structured approximation in which the
state variables are decoupled, yielding a tractable algorithm for learning
the parameters of the model. Empirical comparisons suggest that these
approximations are efficient and provide accurate alternatives to the exact
methods. Finally, we use the structured approximation to model Bach's
chorales and show that factorial HMMs can capture statistical structure in
this data set which an unconstrained HMM cannot.

David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan.
**Active learning
with statistical models**.
In Gerald Tesauro, David S. Touretzky, and Todd K. Leen, editors, *Advances
in Neural Information Processing Systems 7*, pages 705-712. MIT Press,
1994.

** Abstract:** For many types of machine learning
algorithms, one can compute the statistically "optimal" way to select
training data. In this paper, we review how optimal data selection techniques
have been used with feedforward neural networks. We then show how the same
principles may be used to select data for two alternative,
statistically-based learning architectures: mixtures of Gaussians and locally
weighted regression. While the techniques for neural networks are
computationally expensive and approximate, the techniques for mixtures of
Gaussians and locally weighted regression are both efficient and accurate.
Empirically, we observe that the optimality criterion sharply decreases the
number of training examples the learner needs in order to achieve good
performance.

Zoubin Ghahramani.
**Factorial learning and
the EM algorithm**.
In Gerald Tesauro, David S. Touretzky, and Todd K. Leen, editors, *Advances
in Neural Information Processing Systems 7*, pages 617-624. MIT Press,
1994.

** Abstract:** Many real world learning problems are
best characterized by an interaction of multiple independent causes or
factors. Discovering such causal structure from the data is the focus of this
paper. Based on Zemel and Hinton's cooperative vector quantizer (CVQ)
architecture, an unsupervised learning algorithm is derived from the
Expectation-Maximization (EM) framework. Due to the combinatorial nature of
the data generation process, the exact E-step is computationally intractable.
Two alternative methods for computing the E-step are proposed: Gibbs sampling
and mean-field approximation, and some promising empirical results are
presented.

Zoubin Ghahramani, Daniel M. Wolpert, and Michael I. Jordan.
**Computational
structure of coordinate transformations: A generalization study**.
In Gerald Tesauro, David S. Touretzky, and Todd K. Leen, editors, *Advances
in Neural Information Processing Systems 7*, pages 1125-1132. MIT Press,
1994.

** Abstract:** One of the fundamental properties that
both neural networks and the central nervous system share is the ability to
learn and generalize from examples. While this property has been studied
extensively in the neural network literature it has not been thoroughly
explored in human perceptual and motor learning. We have chosen a coordinate
transformation system, the visuomotor map which transforms visual
coordinates into motor coordinates, to study the generalization effects of
learning new input-output pairs. Using a paradigm of computer controlled
altered visual feedback, we have studied the generalization of the visuomotor
map subsequent to both local and context-dependent remappings. A local
remapping of one or two input-output pairs induced a significant global, yet
decaying, change in the visuomotor map, suggesting a representation for the
map composed of units with large functional receptive fields. Our study of
context-dependent remappings indicated that a single point in visual space
can be mapped to two different finger locations depending on a context
variable, the starting point of the movement. Furthermore, as the context is
varied there is a gradual shift between the two remappings, consistent with
two visuomotor modules being learned and gated smoothly with the context.

Daniel M. Wolpert, Zoubin Ghahramani, and Michael I. Jordan.
**Forward dynamic
models in human motor control: Psychophysical evidence**.
In Gerald Tesauro, David S. Touretzky, and Todd K. Leen, editors, *Advances
in Neural Information Processing Systems 7*, pages 43-50. MIT Press,
1994.

** Abstract:** Based on computational principles, with
as yet no direct experimental validation, it has been proposed that the
central nervous system (CNS) uses an internal model to simulate the dynamic
behavior of the motor system in planning, control and learning. We present
experimental results and simulations based on a novel approach that
investigates the temporal propagation of errors in the sensorimotor
integration process. Our results provide direct support for the existence of
an internal model.

Zoubin Ghahramani and Michael I. Jordan.
**Supervised learning
from incomplete data via an EM approach**.
In Jack D. Cowan, Gerald Tesauro, and Joshua Alspector, editors, *NIPS*,
pages 120-127. Morgan Kaufmann, 1993.

** Abstract:**
Real-world learning tasks may involve high-dimensional data sets with
arbitrary patterns of missing data. In this paper we present a framework
based on maximum likelihood density estimation for learning from such data
sets. We use mixture models for the density estimates and make two distinct
appeals to the Expectation-Maximization (EM) principle (Dempster et al., 1977)
in deriving a learning algorithm: EM is used both for the estimation of
mixture components and for coping with missing data. The resulting algorithm
is applicable to a wide range of supervised as well as unsupervised learning
problems. Results from a classification benchmark, the iris data set, are
presented.

Benjamin Eysenbach, Shixiang Gu, Julian Ibarz, and Sergey Levine.
**Leave no trace: Learning to reset
for safe and autonomous reinforcement learning**.
In *6th International Conference on Learning Representations*, Vancouver,
Canada, April 2018.

** Abstract:** Deep reinforcement learning
algorithms can learn complex behavioral skills, but real-world application of
these methods requires a large amount of experience to be collected by the
agent. In practical settings, such as robotics, this involves repeatedly
attempting a task, resetting the environment between each attempt. However,
not all tasks are easily or automatically reversible. In practice, this
learning process requires extensive human intervention. In this work, we
propose an autonomous method for safe and efficient reinforcement learning
that simultaneously learns a forward and reset policy, with the reset policy
resetting the environment for a subsequent attempt. By learning a value
function for the reset policy, we can automatically determine when the
forward policy is about to enter a non-reversible state, providing for
uncertainty-aware safety aborts. Our experiments illustrate that proper use
of the reset policy can greatly reduce the number of manual resets required
to learn a task, can reduce the number of unsafe actions that lead to
non-reversible states, and can automatically induce a curriculum.

** Comment:** [Video]

Vitchyr Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine.
**Temporal
difference models: Model-free deep RL for model-based control**.
In *6th International Conference on Learning Representations*, Vancouver,
Canada, April 2018.

** Abstract:** Model-free reinforcement
learning (RL) has been proven to be a powerful, general tool for learning
complex behaviors. However, its sample efficiency is often impractically
large for solving challenging real-world problems, even for off-policy
algorithms such as Q-learning. A limiting factor in classic model-free RL is
that the learning signal consists only of scalar rewards, ignoring much of
the rich information contained in state transition tuples. Model-based RL
uses this information, by training a predictive model, but often does not
achieve the same asymptotic performance as model-free RL due to model bias.
We introduce temporal difference models (TDMs), a family of goal-conditioned
value functions that can be trained with model-free learning and used for
model-based control. TDMs combine the benefits of model-free and model-based
RL: they leverage the rich information in state transitions to learn very
efficiently, while still attaining asymptotic performance that exceeds that
of direct model-based RL methods. Our experimental results show that, on a
range of continuous control tasks, TDMs provide a substantial improvement in
efficiency compared to state-of-the-art model-based and model-free
methods.

Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Bernhard
Schölkopf, and Sergey Levine.
**Interpolated policy gradient:
Merging on-policy and off-policy policy gradient estimation for deep
reinforcement learning**.
In *Advances in Neural Information Processing Systems 30*, Long Beach,
USA, December 2017.

** Abstract:** Off-policy model-free deep
reinforcement learning methods using previously collected data can improve
sample efficiency over on-policy policy gradient techniques. On the other
hand, on-policy algorithms are often more stable and easier to use. This
paper examines, both theoretically and empirically, approaches to merging on-
and off-policy updates for deep reinforcement learning. Theoretical results
show that off-policy updates with a value function estimator can be
interpolated with on-policy policy gradient updates whilst still satisfying
performance bounds. Our analysis uses control variate methods to produce a
family of policy gradient algorithms, with several recently proposed
algorithms being special cases of this family. We then provide an empirical
comparison of these techniques with the remaining algorithmic details fixed,
and show how different mixing of off-policy gradient estimates with on-policy
samples contribute to improvements in empirical performance. The final
algorithm provides a generalization and unification of existing deep policy
gradient techniques, has theoretical guarantees on the bias introduced by
off-policy updates, and improves on the state-of-the-art model-free deep RL
methods on a number of OpenAI Gym continuous control benchmarks.

Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, José Miguel Hernández-Lobato,
Richard E. Turner, and Douglas Eck.
**Sequence tutor: Conservative
fine-tuning of sequence generation models with KL-control**.
In *34th International Conference on Machine Learning*, Sydney,
Australia, August 2017.

** Abstract:** This paper proposes a
general method for improving the structure and quality of sequences generated
by a recurrent neural network (RNN), while maintaining information originally
learned from data, as well as sample diversity. An RNN is first pre-trained
on data using maximum likelihood estimation (MLE), and the probability
distribution over the next token in the sequence learned by this model is
treated as a prior policy. Another RNN is then trained using reinforcement
learning (RL) to generate higher-quality outputs that account for
domain-specific incentives while retaining proximity to the prior policy of
the MLE RNN. To formalize this objective, we derive novel off-policy RL
methods for RNNs from KL-control. The effectiveness of the approach is
demonstrated on two applications: 1) generating novel musical melodies, and
2) computational molecular generation. For both problems, we show that the
proposed method improves the desired properties and structure of the
generated sequences, while maintaining information learned from data.

** Comment:** [MIT Technology Review] [Video]

Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine.
**Deep reinforcement learning for
robotic manipulation with asynchronous off-policy updates**.
In *IEEE International Conference on Robotics and Automation*,
Singapore, May 2017.

** Abstract:** Reinforcement learning
holds the promise of enabling autonomous robots to learn large repertoires of
behavioral skills with minimal human intervention. However, robotic
applications of reinforcement learning often compromise the autonomy of the
learning process in favor of achieving training times that are practical for
real physical systems. This typically involves introducing hand-engineered
policy representations and human-supplied demonstrations. Deep reinforcement
learning alleviates this limitation by training general-purpose neural
network policies, but applications of direct deep reinforcement learning
algorithms have so far been restricted to simulated settings and relatively
simple tasks, due to their apparent high sample complexity. In this paper, we
demonstrate that a recent deep reinforcement learning algorithm based on
off-policy training of deep Q-functions can scale to complex 3D manipulation
tasks and can learn deep neural network policies efficiently enough to train
on real physical robots. We demonstrate that the training times can be
further reduced by parallelizing the algorithm across multiple robots which
pool their policy updates asynchronously. Our experimental evaluation shows
that our method can learn a variety of 3D manipulation skills in simulation
and a complex door opening skill on real robots without any prior
demonstrations or manually designed representations.

** Comment:** [Google Blogpost] [MIT Technology Review] [Video]

Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, and
Sergey Levine.
**Q-Prop: Sample-efficient policy
gradient with an off-policy critic**.
In *5th International Conference on Learning Representations*, Toulon,
France, April 2017.

** Abstract:** Model-free deep
reinforcement learning (RL) methods have been successful in a wide variety of
simulated domains. However, a major obstacle facing deep RL in the real world
is their high sample complexity. Batch policy gradient methods offer stable
learning, but at the cost of high variance, which often requires large
batches. TD-style methods, such as off-policy actor-critic and Q-learning,
are more sample-efficient but biased, and often require costly hyperparameter
sweeps to stabilize. In this work, we aim to develop methods that combine the
stability of policy gradients with the efficiency of off-policy RL. We
present Q-Prop, a policy gradient method that uses a Taylor expansion of the
off-policy critic as a control variate. Q-Prop is both sample efficient and
stable, and effectively combines the benefits of on-policy and off-policy
methods. We analyze the connection between Q-Prop and existing model-free
algorithms, and use control variate theory to derive two variants of Q-Prop
with conservative and aggressive adaptation. We show that conservative Q-Prop
provides substantial gains in sample efficiency over trust region policy
optimization (TRPO) with generalized advantage estimation (GAE), and improves
stability over deep deterministic policy gradient (DDPG), the
state-of-the-art on-policy and off-policy methods, on OpenAI Gym's MuJoCo
continuous control environments.

Eric Jang, Shixiang Gu, and Ben Poole.
**Categorical reparameterization
with Gumbel-Softmax**.
In *5th International Conference on Learning Representations*, Toulon,
France, April 2017.

** Abstract:** Categorical variables are a
natural choice for representing discrete structure in the world. However,
stochastic neural networks rarely use categorical latent variables due to the
inability to backpropagate through samples. In this work, we present an
efficient gradient estimator that replaces the non-differentiable sample from
a categorical distribution with a differentiable sample from a novel
Gumbel-Softmax distribution. This distribution has the essential property
that it can be smoothly annealed into a categorical distribution. We show
that our Gumbel-Softmax estimator outperforms state-of-the-art gradient
estimators on structured output prediction and unsupervised generative
modeling tasks with categorical latent variables, and enables large speedups
on semi-supervised classification.
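The sampling step described in the abstract above can be sketched as follows. This is a minimal illustration, assuming the standard construction of adding Gumbel(0, 1) noise to the log-probabilities and applying a temperature-scaled softmax; the function name `gumbel_softmax_sample` and its arguments are hypothetical, and gradient computation is left out because that would require an automatic differentiation library.

```python
import math
import random


def gumbel_softmax_sample(probs, temperature, rng):
    """Draw one relaxed one-hot sample for a categorical distribution.

    probs: class probabilities (all strictly positive in this sketch).
    temperature: positive scalar; smaller values give samples closer to one-hot.
    rng: a random.Random instance used as the noise source.
    """
    logits = []
    for p in probs:
        # Gumbel(0, 1) noise by inverse transform; clamp u away from 0.
        u = max(rng.random(), 1e-12)
        g = -math.log(-math.log(u))
        logits.append((math.log(p) + g) / temperature)
    # Numerically stable softmax over the perturbed, scaled log-probabilities.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
sample = gumbel_softmax_sample([0.1, 0.3, 0.6], temperature=0.5, rng=rng)
```

The largest entry of each sample is a draw from the original categorical distribution (the Gumbel-max trick), while the sample as a whole varies smoothly with the log-probabilities, which is the property that permits backpropagation.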

Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine.
**Continuous deep Q-learning with
model-based acceleration**.
In *33rd International Conference on Machine Learning*, New York, USA,
June 2016.

** Abstract:** Model-free reinforcement learning
has been successfully applied to a range of challenging problems, and has
recently been extended to handle large neural network policies and value
functions. However, the sample complexity of model-free algorithms,
particularly when using high-dimensional function approximators, tends to
limit their applicability to physical systems. In this paper, we explore
algorithms and representations to reduce the sample complexity of deep
reinforcement learning for continuous control tasks. We propose two
complementary techniques for improving the efficiency of such algorithms.
First, we derive a continuous variant of the Q-learning algorithm, which we
call normalized advantage functions (NAF), as an alternative to the more
commonly used policy gradient and actor-critic methods. NAF representation
allows us to apply Q-learning with experience replay to continuous tasks, and
substantially improves performance on a set of simulated robotic control
tasks. To further improve the efficiency of our approach, we explore the use
of learned models for accelerating model-free reinforcement learning. We show
that iteratively refitted local linear models are especially effective for
this, and demonstrate substantially faster learning on domains where such
models are applicable.

Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih.
**MuProp: Unbiased backpropagation
for stochastic neural networks**.
In *4th International Conference on Learning Representations*, San Juan,
Puerto Rico, May 2016.

** Abstract:** Deep neural networks are
powerful parametric models that can be trained efficiently using the
backpropagation algorithm. Stochastic neural networks combine the power of
large parametric functions with that of graphical models, which makes it
possible to learn very complex distributions. However, as backpropagation is
not directly applicable to stochastic networks that include discrete sampling
operations within their computational graph, training such networks remains
difficult. We present MuProp, an unbiased gradient estimator for stochastic
networks, designed to make this task easier. MuProp improves on the
likelihood-ratio estimator by reducing its variance using a control variate
based on the first-order Taylor expansion of a mean-field network. Crucially,
unlike prior attempts at using backpropagation for training stochastic
networks, the resulting estimator is unbiased and well behaved. Our
experiments on structured output prediction and discrete latent variable
modeling demonstrate that MuProp yields consistently good performance across
a range of difficult tasks.

Shixiang Gu, Zoubin Ghahramani, and Richard E. Turner.
**Neural adaptive sequential Monte
Carlo**.
In *Advances in Neural Information Processing Systems 28*, Montréal,
Canada, Dec 2015.

** Abstract:** Sequential Monte Carlo (SMC),
or particle filtering, is a popular class of methods for sampling from an
intractable target distribution using a sequence of simpler intermediate
distributions. Like other importance sampling-based methods, performance is
critically dependent on the proposal distribution: a bad proposal can lead to
arbitrarily inaccurate estimates of the target distribution. This paper
presents a new method for automatically adapting the proposal using an
approximation of the Kullback-Leibler divergence between the true posterior
and the proposal distribution. The method is very flexible, applicable to any
parameterised proposal distribution and it supports online and batch
variants. We use the new framework to adapt powerful proposal distributions
with rich parameterisations based upon neural networks leading to Neural
Adaptive Sequential Monte Carlo (NASMC). Experiments indicate that NASMC
significantly improves inference in a non-linear state space model
outperforming adaptive proposal methods including the Extended Kalman and
Unscented Particle Filters. Experiments also indicate that improved inference
translates into improved parameter learning when NASMC is used as a
subroutine of Particle Marginal Metropolis Hastings. Finally we show that
NASMC is able to train a neural network-based deep recurrent generative model
achieving results that compete with the state-of-the-art for polymorphic
music modelling. NASMC can be seen as bridging the gap between adaptive SMC
methods and the recent work in scalable, black-box variational inference.

Nilesh Tripuraneni, Shixiang Gu, Hong Ge, and Zoubin Ghahramani.
**Particle
Gibbs for infinite hidden Markov models**.
In *Advances in Neural Information Processing Systems 28*, Montréal,
Canada, May 2015.

** Abstract:** Infinite Hidden Markov Models
(iHMMs) are an attractive, nonparametric generalization of the classical
Hidden Markov Model which can automatically infer the number of hidden states
in the system. However, due to the infinite-dimensional nature of the
transition dynamics, performing inference in the iHMM is difficult. In this
paper, we present an infinite-state Particle Gibbs (PG) algorithm to
resample state trajectories for the iHMM. The proposed algorithm uses an
efficient proposal optimized for iHMMs and leverages ancestor sampling to
improve the mixing of the standard PG algorithm. Our algorithm demonstrates
significant convergence improvements on synthetic and real world data
sets.

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani.
**Beta diffusion trees and
hierarchical feature allocations**.
Technical report, Dept. of Engineering, University of Cambridge, August 2014.

** Abstract:** We define the beta diffusion tree, a random tree
structure with a set of leaves that defines a collection of overlapping
subsets of objects, known as a feature allocation. A generative process for
the tree structure is defined in terms of particles (representing the
objects) diffusing in some continuous space, analogously to the Dirichlet
diffusion tree (Neal, 2003b), which defines a tree structure over partitions
(i.e., non-overlapping subsets) of the objects. Unlike in the Dirichlet
diffusion tree, multiple copies of a particle may exist and diffuse along
multiple branches in the beta diffusion tree, and an object may therefore
belong to multiple subsets of particles. We demonstrate how to build a
hierarchically-clustered factor analysis model with the beta diffusion tree
and how to perform inference over the random tree structures with a Markov
chain Monte Carlo algorithm. We conclude with several numerical experiments
on missing data problems with data sets of gene expression microarrays,
international development statistics, and intranational socioeconomic
measurements.

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani.
**Beta
diffusion trees**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We define the beta diffusion tree, a
random tree structure with a set of leaves that defines a collection of
overlapping subsets of objects, known as a feature allocation. The generative
process for the tree is defined in terms of particles (representing the
objects) diffusing in some continuous space, analogously to the Dirichlet and
Pitman–Yor diffusion trees (Neal, 2003b; Knowles & Ghahramani, 2011), both
of which define tree structures over clusters of the particles. With the beta
diffusion tree, however, multiple copies of a particle may exist and diffuse
to multiple locations in the continuous space, resulting in (a random number
of) possibly overlapping clusters of the objects. We demonstrate how to build
a hierarchically-clustered factor analysis model with the beta diffusion tree
and how to perform inference over the random tree structures with a Markov
chain Monte Carlo algorithm. We conclude with several numerical experiments
on missing data problems with data sets of gene expression arrays,
international development statistics, and intranational socioeconomic
measurements.

Creighton Heaukulani and Daniel M. Roy.
**The combinatorial structure of beta
negative binomial processes**.
Technical report, Dept. of Engineering, University of Cambridge, March 2014.

** Abstract:** We characterize the combinatorial structure of
conditionally-i.i.d. sequences of negative binomial processes with a common
beta process base measure. In Bayesian nonparametric applications, such
processes have served as models for unknown multisets of a measurable space.
Previous work has characterized random subsets arising from
conditionally-i.i.d. sequences of Bernoulli processes with a common beta
process base measure. In this case, the combinatorial structure is described
by the Indian buffet process. Our results give a count analogue of the Indian
buffet process, which we call a negative binomial Indian buffet process. As
an intermediate step toward this goal, we provide constructions for the beta
negative binomial process that avoid a representation of the underlying beta
process base measure.

Creighton Heaukulani and Zoubin Ghahramani.
**Dynamic
probabilistic models for latent feature propagation in social
networks**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Current Bayesian
models for dynamic social network data have focused on modelling the
influence of evolving unobserved structure on observed social interactions.
However, an understanding of how observed social relationships from the past
affect future unobserved structure in the network has been neglected. In this
paper, we introduce a new probabilistic model for capturing this phenomenon,
which we call latent feature propagation, in social networks. We demonstrate
our model's capability for inferring such latent structure in varying types
of social network datasets, and experimental studies show this structure
achieves higher predictive performance on link prediction and forecasting
tasks.

Yingzhen Li and Stephan Mandt.
**Disentangled Sequential
Autoencoder**.
In *35th International Conference on Machine Learning*, Stockholm,
Sweden, July 2018.

** Abstract:** We present a VAE
architecture for encoding and generating high dimensional sequential data,
such as video or audio. Our deep generative model learns a latent
representation of the data which is split into a static and dynamic part,
allowing us to approximately disentangle latent time-dependent features
(dynamics) from features which are preserved over time (content). This
architecture gives us partial control over generating content and dynamics by
conditioning on either one of these sets of features. In our experiments on
artificially generated cartoon video clips and voice recordings, we show that
we can convert the content of a given sequence into another one by such
content swapping. For audio, this allows us to convert a male speaker into a
female speaker and vice versa, while for video we can separately manipulate
shapes and dynamics. Furthermore, we give empirical evidence for the
hypothesis that stochastic RNNs as latent state models are more efficient at
compressing and generating long sequences than deterministic ones, which may
be relevant for applications in video compression.

Yingzhen Li and Richard E. Turner.
**Gradient Estimators
for Implicit Models**.
In *Sixth International Conference on Learning Representations*,
Vancouver, Canada, May 2018.

** Abstract:** Implicit models,
which allow for the generation of samples but not for point-wise evaluation
of probabilities, are omnipresent in real-world problems tackled by machine
learning and a hot topic of current research. Some examples include data
simulators that are widely used in engineering and scientific research,
generative adversarial networks (GANs) for image synthesis, and
hot-off-the-press approximate inference techniques relying on implicit
distributions. The majority of existing approaches to learning implicit
models rely on approximating the intractable distribution or optimisation
objective for gradient-based optimisation, which is liable to produce
inaccurate updates and thus poor models. This paper alleviates the need for
such approximations by proposing the Stein gradient estimator, which directly
estimates the score function of the implicitly defined distribution. The
efficacy of the proposed estimator is empirically demonstrated by examples
that include meta-learning for approximate inference and entropy regularised
GANs that provide improved sample diversities.

Cuong V. Nguyen, Yingzhen Li, Thang D. Bui, and Richard E. Turner.
**Variational
Continual Learning**.
In *Sixth International Conference on Learning Representations*,
Vancouver, Canada, May 2018.

** Abstract:** This paper develops
variational continual learning (VCL), a simple but general framework for
continual learning that fuses online variational inference (VI) and recent
advances in Monte Carlo VI for neural networks. The framework can
successfully train both deep discriminative models and deep generative models
in complex continual learning settings where existing tasks evolve over time
and entirely new tasks emerge. Experimental results show that variational
continual learning outperforms state-of-the-art continual learning methods on
a variety of tasks, avoiding catastrophic forgetting in a fully automatic
way.

Yingzhen Li and Yarin Gal.
**Dropout Inference
in Bayesian Neural Networks with Alpha-divergences**.
In *34th International Conference on Machine Learning*, Sydney,
Australia, Aug 2017.

** Abstract:** To obtain uncertainty
estimates with real-world Bayesian deep learning models, practical inference
approximations are needed. Dropout variational inference (VI) for example has
been used for machine vision and medical applications, but VI can severely
underestimate model uncertainty. Alpha-divergences are alternative
divergences to VI’s KL objective, which are able to avoid VI’s
uncertainty underestimation. But these are hard to use in practice: existing
techniques can only use Gaussian approximating distributions, and require
existing models to be changed radically, thus are of limited use for
practitioners. We propose a re-parametrisation of the alpha-divergence
objectives, deriving a simple inference technique which, together with
dropout, can be easily implemented with existing models by simply changing
the loss of the model. We demonstrate improved uncertainty estimates and
accuracy compared to VI in dropout networks. We study our model’s epistemic
uncertainty far away from the data using adversarial images, showing that
these can be distinguished from non-adversarial images by examining our
model’s uncertainty.

Yingzhen Li and Richard E. Turner.
**Rényi
Divergence Variational Inference**.
In *Advances in Neural Information Processing Systems 29*, Barcelona,
Spain, Dec 2016.

** Abstract:** This paper introduces the
variational Rényi bound (VR) that extends traditional variational
inference to Rényi's alpha-divergences. This new family of variational
methods unifies a number of existing approaches, and enables a smooth
interpolation from the evidence lower-bound to the log (marginal) likelihood
that is controlled by the value of alpha that parametrises the divergence.
The reparameterization trick, Monte Carlo approximation and stochastic
optimisation methods are deployed to obtain a tractable and unified framework
for optimisation. We further consider negative alpha values and propose a
novel variational inference method as a new special case in the proposed
framework. Experiments on Bayesian neural networks and variational
auto-encoders demonstrate the wide applicability of the VR bound.
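
For reference, the variational Rényi bound mentioned in this abstract is commonly written as below. The notation is assumed here rather than taken from this page: $x$ denotes the observed data, $θ$ the latent variables or parameters, and $q(θ)$ the approximating distribution.

$$
\mathcal{L}_{α}(q; x) = \frac{1}{1 - α} \log \mathbb{E}_{q(θ)}\left[ \left( \frac{p(θ, x)}{q(θ)} \right)^{1 - α} \right]
$$

Under this form, the limit $α \rightarrow 1$ recovers the standard evidence lower bound, and $α \rightarrow 0$ gives the log marginal likelihood $\log p(x)$ when $q$ covers the support of the posterior, which matches the interpolation the abstract describes.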

Thang D. Bui, Daniel Hernández-Lobato, José Miguel Hernández-Lobato,
Yingzhen Li, and Richard E. Turner.
**Deep Gaussian
processes for regression using approximate expectation propagation**.
In *33rd International Conference on Machine Learning*, New York, USA,
June 2016.

** Abstract:** Deep Gaussian processes (DGPs) are
multi-layer hierarchical generalisations of Gaussian processes (GPs) and are
formally equivalent to neural networks with multiple, infinitely wide hidden
layers. DGPs are nonparametric probabilistic models and as such are arguably
more flexible, have a greater capacity to generalise, and provide better
calibrated uncertainty estimates than alternative deep models. This paper
develops a new approximate Bayesian learning scheme that enables DGPs to be
applied to a range of medium to large scale regression problems for the first
time. The new method uses an approximate Expectation Propagation procedure
and a novel and efficient extension of the probabilistic backpropagation
algorithm for learning. We evaluate the new method for non-linear regression
on eleven real-world datasets, showing that it always outperforms GP
regression and is almost always better than state-of-the-art deterministic
and sampling-based approximate inference methods for Bayesian neural
networks. As a by-product, this work provides a comprehensive analysis of six
approximate Bayesian methods for training neural networks.

José Miguel Hernández-Lobato, Yingzhen Li, Mark Rowland, Thang D. Bui,
Daniel Hernández-Lobato, and Richard E. Turner.
**Black-box
Alpha Divergence Minimization**.
In *33rd International Conference on Machine Learning*, New York, USA,
June 2016.

** Abstract:** Black-box alpha (BB-$α$) is a
new approximate inference method based on the minimization of
$α$-divergences. BB-$α$ scales to large datasets because it can be
implemented using stochastic gradient descent. BB-$α$ can be applied to
complex probabilistic models with little effort since it only requires as
input the likelihood function and its gradients. These gradients can be
easily obtained using automatic differentiation. By changing the divergence
parameter $α$, the method is able to interpolate between variational Bayes
(VB) ($α \rightarrow 0$) and an algorithm similar to expectation
propagation (EP) ($α = 1$). Experiments on probit regression and neural
network regression and classification problems show that BB-$α$ with
non-standard settings of $α$, such as $α = 0.5$, usually produces
better predictions than with $α \rightarrow 0$ (VB) or $α = 1$
(EP).

Yingzhen Li, José Miguel Hernández-Lobato, and Richard E. Turner.
**Stochastic Expectation
Propagation**.
In *Advances in Neural Information Processing Systems 28*, Montréal,
Canada, Dec 2015.

** Abstract:** Expectation propagation (EP)
is a deterministic approximation algorithm that is often used to perform
approximate Bayesian parameter learning. EP approximates the full intractable
posterior distribution through a set of local approximations that are
iteratively refined for each datapoint. EP can offer analytic and
computational advantages over other approximations, such as Variational
Inference (VI), and is the method of choice for a number of models. The local
nature of EP appears to make it an ideal candidate for performing Bayesian
learning on large models in large-scale dataset settings. However, EP has a
crucial limitation in this context: the number of approximating factors needs to
increase with the number of data-points, N, which often entails a
prohibitively large memory overhead. This paper presents an extension to EP,
called stochastic expectation propagation (SEP), that maintains a global
posterior approximation (like VI) but updates it in a local way (like EP).
Experiments on a number of canonical learning problems using synthetic and
real-world datasets indicate that SEP performs almost as well as full EP, but
reduces the memory consumption by a factor of N. SEP is therefore ideally
suited to performing approximate Bayesian learning in the large model, large
dataset setting.

Shakir Mohamed, Katherine A. Heller, and Zoubin Ghahramani.
**Bayesian and
L1 approaches for sparse unsupervised learning**.
In

** Abstract:** The use of L1 regularisation for sparse learning
has generated immense research interest, with many successful applications in
diverse areas such as signal acquisition, image coding, genomics and
collaborative filtering. While existing work highlights the many advantages
of L1 methods, in this paper we find that L1 regularisation often
dramatically under-performs in terms of predictive performance when compared
to other methods for inferring sparsity. We focus on unsupervised latent
variable models, and develop L1 minimising factor models, Bayesian variants
of "L1", and Bayesian models with a stronger L0-like sparsity induced through
spike-and-slab distributions. These spike-and-slab Bayesian factor models
encourage sparsity while accounting for uncertainty in a principled manner,
and avoid unnecessary shrinkage of non-zero values. We demonstrate on a
number of data sets that in practice spike-and-slab Bayesian methods
out-perform L1 minimisation, even on a computational budget. We thus
highlight the need to re-assess the wide use of L1 methods in
sparsity-reliant applications, particularly when we care about generalising
to previously unseen data, and provide an alternative that, over many varying
conditions, provides improved generalisation performance.

Joshua Abbott, Katherine A. Heller, Zoubin Ghahramani, and Thomas L. Griffiths.
**Testing a
Bayesian measure of representativeness using a large image
database**.
In *Advances in Neural Information Processing Systems 24*, Cambridge,
MA, USA, 2011. The MIT Press.

** Abstract:** How do people
determine which elements of a set are most representative of that set? We
extend an existing Bayesian measure of representativeness, which indicates
the representativeness of a sample from a distribution, to define a measure
of the representativeness of an item to a set. We show that this measure is
formally related to a machine learning method known as Bayesian Sets.
Building on this connection, we derive an analytic expression for the
representativeness of objects described by a sparse vector of binary
features. We then apply this measure to a large database of images, using it
to determine which images are the most representative members of different
sets. Comparing the resulting predictions to human judgments of
representativeness provides a test of this measure with naturalistic stimuli,
and illustrates how databases that are more commonly used in computer vision
and machine learning can be used to evaluate psychological theories.

Sinead Williamson, Katherine A. Heller, C. Wang, and D. M. Blei.
**The IBP
compound Dirichlet process and its application to focused topic
modeling**.
In *27th International Conference on Machine Learning*, pages
1151-1158, Haifa, Israel, June 2010.

** Abstract:** The
hierarchical Dirichlet process (HDP) is a Bayesian nonparametric mixed
membership model - each data point is modeled with a collection of
components of different proportions. Though powerful, the HDP makes an
assumption that the probability of a component being exhibited by a data
point is positively correlated with its proportion within that data point.
This might be an undesirable assumption. For example, in topic modeling, a
topic (component) might be rare throughout the corpus but dominant within
those documents (data points) where it occurs. We develop the IBP compound
Dirichlet process (ICD), a Bayesian nonparametric prior that decouples
across-data prevalence and within-data proportion in a mixed membership
model. The ICD combines properties from the HDP and the Indian buffet process
(IBP), a Bayesian nonparametric prior on binary matrices. The ICD assigns a
subset of the shared mixture components to each data point. This subset, the
data point's "focus", is determined independently from the amount that each
of its components contributes. We develop an ICD mixture model for text, the
focused topic model (FTM), and show superior performance over the HDP-based
topic model.

R. Silva, K. A. Heller, Z. Ghahramani, and E. M. Airoldi.
**Ranking
relations using analogies in biological and information networks**.
*Annals of Applied Statistics*, 4(2):615-644, 2010.

** Abstract:** Analogical reasoning depends fundamentally on the ability to
learn and generalize about relations between objects. We develop an approach
to relational learning which, given a set of pairs of objects S = A(1):B(1),
A(2):B(2), ..., A(N):B(N), measures how well other pairs A:B fit in with the
set S. Our work addresses the question: is the relation between objects A and
B analogous to those relations found in S? Such questions are particularly
relevant in information retrieval, where an investigator might want to search
for analogous pairs of objects that match the query set of interest. There
are many ways in which objects can be related, making the task of measuring
analogies very challenging. Our approach combines a similarity measure on
function spaces with Bayesian analysis to produce a ranking. It requires data
containing features of the objects of interest and a link matrix specifying
which relationships exist; no further attributes of such relationships are
necessary. We illustrate the potential of our method on text analysis and
information networks. An application on discovering functional interactions
between pairs of proteins is discussed in detail, where we show that our
approach can work in practice even if a small set of protein pairs is
provided.

Shakir Mohamed, Katherine A. Heller, and Zoubin Ghahramani.
**Bayesian
exponential family PCA**.
In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, *Advances in
Neural Information Processing Systems 21*, pages 1089-1096, Cambridge,
MA, USA, December 2009. The MIT Press.

** Abstract:**
Principal Components Analysis (PCA) has become established as one of the key
tools for dimensionality reduction when dealing with real valued data.
Approaches such as exponential family PCA and non-negative matrix
factorisation have successfully extended PCA to non-Gaussian data types, but
these techniques fail to take advantage of Bayesian inference and can suffer
from problems of overfitting and poor generalisation. This paper presents a
fully probabilistic approach to PCA, which is generalised to the exponential
family, based on Hybrid Monte Carlo sampling. We describe the model which is
based on a factorisation of the observed data matrix, and show performance of
the model on both synthetic and real data.

** Comment:** spotlight.

R. Savage, K. A. Heller, Y. Xu, Zoubin Ghahramani, W. Truman, M. Grant,
K. Denby, and D. L. Wild.
**R/BHC: fast
Bayesian hierarchical clustering for microarray data**.
*BMC Bioinformatics 2009*, 10(242):1-9, August 2009, doi
10.1186/1471-2105-10-242.

** Abstract:** Background:
Although the use of clustering methods has rapidly become one of the standard
computational approaches in the literature of microarray gene expression data
analysis, little attention has been paid to uncertainty in the results
obtained.

Results: We present an R/Bioconductor port of a fast novel
algorithm for Bayesian agglomerative hierarchical clustering and demonstrate
its use in clustering gene expression microarray data. The method performs
bottom-up hierarchical clustering, using a Dirichlet Process (infinite
mixture) to model uncertainty in the data and Bayesian model selection to
decide at each step which clusters to merge.

Conclusion: Biologically
plausible results are presented from a well studied data set: expression
profiles of *A. thaliana* subjected to a variety of biotic and abiotic
stresses. Our method avoids several limitations of traditional methods, for
example how many clusters there should be and how to choose a principled
distance metric.

Yang Xu, Katherine A. Heller, and Zoubin Ghahramani.
**Tree-based
inference for Dirichlet process mixtures**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 623-630,
Clearwater Beach, FL, USA, April 2009. Microtome Publishing (paper), Journal
of Machine Learning Research (online).
ISSN 1938-7228.

** Abstract:** The Dirichlet process mixture
(DPM) is a widely used model for clustering and for general nonparametric
Bayesian density estimation. Unfortunately, like in many statistical models,
exact inference in a DPM is intractable, and approximate methods are needed
to perform efficient inference. While most attention in the literature has
been placed on Markov chain Monte Carlo (MCMC) [1, 2, 3], variational
Bayesian (VB) [4] and collapsed variational methods [5], [6] recently
introduced a novel class of approximation for DPMs based on Bayesian
hierarchical clustering (BHC). These tree-based combinatorial approximations
efficiently sum over exponentially many ways of partitioning the data and
offer a novel lower bound on the marginal likelihood of the DPM [6]. In this
paper we make the following contributions: (1) We show empirically that the
BHC lower bounds are substantially tighter than the bounds given by VB [4]
and by collapsed variational methods [5] on synthetic and real datasets. (2)
We also show that BHC offers a more accurate predictive performance on these
datasets. (3) We further improve the tree-based lower bounds with an
algorithm that efficiently sums contributions from alternative trees. (4) We
present a fast approximate method for BHC. Our results suggest that our
combinatorial approximate inference methods and lower bounds may be useful
not only in DPMs but in other models as well.

Katherine A. Heller, Sinead Williamson, and Zoubin Ghahramani.
**Statistical
models for partial membership**.
In Andrew McCallum and Sam Roweis, editors, *25th International Conference
on Machine Learning*, pages 392-399, Helsinki, Finland, July 2008.
Omnipress.

** Abstract:** We present a principled Bayesian
framework for modeling partial memberships of data points to clusters. Unlike
a standard mixture model which assumes that each data point belongs to one
and only one mixture component, or cluster, a partial membership model allows
data points to have fractional membership in multiple clusters. Algorithms
which assign data points partial memberships to clusters can be useful for
tasks such as clustering genes based on microarray data (Gasch & Eisen,
2002). Our Bayesian Partial Membership Model (BPM) uses exponential family
distributions to model each cluster, and a product of these distributions,
with weighted parameters, to model each datapoint. Here the weights
correspond to the degree to which the datapoint belongs to each cluster. All
parameters in the BPM are continuous, so we can use Hybrid Monte Carlo to
perform inference and learning. We discuss relationships between the BPM and
Latent Dirichlet Allocation, Mixed Membership models, Exponential Family PCA,
and fuzzy clustering. Lastly, we show some experimental results and discuss
nonparametric extensions to our model.

Katherine A. Heller and Zoubin Ghahramani.
**A nonparametric
Bayesian approach to modeling overlapping clusters**.
In Marina Meila and Xiaotong Shen, editors, *AISTATS*, volume 2 of
*JMLR Proceedings*, pages 187-194. JMLR.org, 2007.

** Abstract:** Although clustering data into mutually exclusive partitions has
been an extremely successful approach to unsupervised learning, there are
many situations in which a richer model is needed to fully represent the
data. This is the case in problems where data points actually simultaneously
belong to multiple, overlapping clusters. For example a particular gene may
have several functions, therefore belonging to several distinct clusters of
genes, and a biologist may want to discover these through unsupervised
modeling of gene expression data. We present a new nonparametric Bayesian
method, the Infinite Overlapping Mixture Model (IOMM), for modeling
overlapping clusters. The IOMM uses exponential family distributions to model
each cluster and forms an overlapping mixture by taking products of such
distributions, much like products of experts (Hinton, 2002). The IOMM allows
an unbounded number of clusters, and assignments of points to (multiple)
clusters are modeled using an Indian Buffet Process (IBP) (Griffiths and
Ghahramani, 2006). The IOMM has the desirable properties of being able to
focus in on overlapping regions while maintaining the ability to model a
potentially infinite number of clusters which may overlap. We derive MCMC
inference algorithms for the IOMM and show that these can be used to cluster
movies into multiple genres.

Ricardo Silva, Katherine A. Heller, and Zoubin Ghahramani.
**Analogical
reasoning with relational Bayesian sets**.
In Marina Meila and Xiaotong Shen, editors, *AISTATS*, volume 2 of
*JMLR Proceedings*, pages 500-507. JMLR.org, 2007.

**
Abstract:** Analogical reasoning depends fundamentally on the ability to
learn and generalize about relations between objects. There are many ways in
which objects can be related, making automated analogical reasoning very
challenging. Here we develop an approach which, given a set of pairs of
related objects S = A1:B1, A2:B2, ..., AN:BN, measures how well other pairs
A:B fit in with the set S. This addresses the question: is the relation
between objects A and B analogous to those relations found in S? We recast
this classical problem as a problem of Bayesian analysis of relational
data. This problem is non-trivial because direct similarity between
objects is not a good way of measuring analogies. For instance, the analogy
between an electron around the nucleus of an atom and a planet around the Sun
is hardly justified by isolated, non-relational, comparisons of an electron
to a planet, and a nucleus to the Sun. We develop a generative model for
predicting the existence of relationships and extend the framework of
Ghahramani and Heller (2005) to provide a Bayesian measure for how analogous
a relation is to other relations. This sheds new light on an old problem,
which we motivate and illustrate through practical applications in
exploratory data analysis.

Zoubin Ghahramani and Katherine A. Heller.
**Bayesian
sets**.
In Y. Weiss, B. Schölkopf, and J. Platt, editors, *Advances in Neural
Information Processing Systems 18*, pages 435-442, Cambridge, MA, USA,
December 2006. The MIT Press.

** Abstract:** Inspired by
"Google™ Sets", we consider the problem of retrieving items from a
concept or cluster, given a query consisting of a few items from that
cluster. We formulate this as a Bayesian inference problem and describe a
very simple algorithm for solving it. Our algorithm uses a model-based
concept of a cluster and ranks items using a score which evaluates the
marginal probability that each item belongs to a cluster containing the query
items. For exponential family models with conjugate priors this marginal
probability is a simple function of sufficient statistics. We focus on sparse
binary data and show that our score can be evaluated exactly using a single
sparse matrix multiplication, making it possible to apply our algorithm to
very large datasets. We evaluate our algorithm on three datasets: retrieving
movies from EachMovie, finding completions of author sets from the NIPS
dataset, and finding completions of sets of words appearing in the Grolier
encyclopedia. We compare to Google™ Sets and show that Bayesian Sets
gives very reasonable set completions.
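
As a rough illustration of the scoring idea in the abstract above, the following sketch computes a log score for binary data under a Beta-Bernoulli model, where the score is the ratio p(x | query) / p(x). The function name `bayesian_sets_scores`, the hyperparameter scale `c`, the clipping constant, and the toy matrix are assumptions made for this example and are not taken from the paper's code. A dense NumPy array is used for simplicity; with a sparse matrix type the same product would be a sparse matrix-vector multiplication.

```python
import numpy as np

def bayesian_sets_scores(X, query_idx, c=2.0):
    """Log score log[p(x | query) / p(x)] for every row x of a binary matrix X.

    Assumes independent Bernoulli features with Beta(alpha_j, beta_j) priors.
    Setting the priors from the feature means scaled by c is an assumption.
    """
    X = np.asarray(X, dtype=float)
    n_query = len(query_idx)

    # Prior hyperparameters; clip the means so that no log(0) occurs.
    mean = np.clip(X.mean(axis=0), 1e-6, 1.0 - 1e-6)
    alpha = c * mean
    beta = c * (1.0 - mean)

    # Posterior hyperparameters after observing the query items.
    s = X[query_idx].sum(axis=0)
    alpha_post = alpha + s
    beta_post = beta + n_query - s

    # The log score is affine in x: const + x . q
    const = np.sum(np.log(alpha + beta) - np.log(alpha + beta + n_query)
                   + np.log(beta_post) - np.log(beta))
    q = np.log(alpha_post) - np.log(alpha) - np.log(beta_post) + np.log(beta)

    # One matrix-vector product scores every item at once.
    return const + X @ q

# Toy data: items 0-2 share features 0-1, items 3-5 share features 2-3.
X = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
scores = bayesian_sets_scores(X, query_idx=[0, 1])
ranking = np.argsort(-scores)  # highest-scoring items first
```

In this toy example the items that share the query's features receive higher scores than the items from the other group, which is the behaviour a set-completion ranking would be expected to show.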

Katherine A. Heller and Zoubin Ghahramani.
**A simple Bayesian
framework for content-based image retrieval**.
In *CVPR*, pages 2110-2117. IEEE Computer Society, 2006.

** Abstract:** We present a Bayesian framework for content-based
image retrieval which models the distribution of color and texture features
within sets of related images. Given a user-specified text query (e.g.
"penguins") the system first extracts a set of images, from a labelled
corpus, corresponding to that query. The distribution over features of these
images is used to compute a Bayesian score for each image in a large
unlabelled corpus. Unlabelled images are then ranked using this score and the
top images are returned. Although the Bayesian score is based on computing
marginal likelihoods, which integrate over model parameters, in the case of
sparse binary data the score reduces to a single matrix-vector multiplication
and is therefore extremely efficient to compute. We show that our method
works surprisingly well despite its simplicity and the fact that no relevance
feedback is used. We compare different choices of features, and evaluate our
results using human subjects.

Katherine A. Heller and Zoubin Ghahramani.
**Bayesian
hierarchical clustering**.
In Luc De Raedt and Stefan Wrobel, editors, *ICML*, volume 119 of
*ACM International Conference Proceeding Series*, pages 297-304.
Association for Computing Machinery, 2005.

** Abstract:** We
present a novel algorithm for agglomerative hierarchical clustering based on
evaluating marginal likelihoods of a probabilistic model. This algorithm has
several advantages over traditional distance-based agglomerative clustering
algorithms. (1) It defines a probabilistic model of the data which can be
used to compute the predictive distribution of a test point and the
probability of it belonging to any of the existing clusters in the tree. (2)
It uses a model-based criterion to decide on merging clusters rather than an
ad-hoc distance metric. (3) Bayesian hypothesis testing is used to decide
which merges are advantageous and to output the recommended depth of the
tree. (4) The algorithm can be interpreted as a novel fast bottom-up
approximate inference method for a Dirichlet process (i.e. countably
infinite) mixture model (DPM). It provides a new lower bound on the marginal
likelihood of a DPM by summing over exponentially many clusterings of the
data in polynomial time. We describe procedures for learning the model
hyperparameters, computing the predictive distribution, and extensions to
the algorithm. Experimental results on synthetic and real-world data sets
demonstrate useful properties of the algorithm.

Eric Nalisnick, José Miguel Hernández-Lobato, and Padhraic Smyth.
**Dropout as a
structured shrinkage prior**.
In *36th International Conference on Machine Learning*, Long Beach, June
2019.

** Abstract:** Dropout regularization of deep neural
networks has been a mysterious yet effective tool to prevent overfitting.
Explanations for its success range from the prevention of co-adapted weights
to it being a form of cheap Bayesian inference. We propose a novel framework
for understanding multiplicative noise in neural networks, considering
continuous distributions as well as Bernoulli noise (i.e. dropout). We show
that multiplicative noise induces structured shrinkage priors on a network's
weights. We derive the equivalence through reparametrization properties of
scale mixtures and without invoking any approximations. Given the
equivalence, we then show that dropout's Monte Carlo training objective
approximates marginal MAP estimation. We leverage these insights to propose a
novel shrinkage framework for resnets, terming the prior 'automatic depth
determination' as it is the natural analog of automatic relevance
determination for network depth. Lastly, we investigate two inference
strategies that improve upon the aforementioned MAP approximation in
regression benchmarks.

Robert Pinsler, Jonathan Gordon, Eric Nalisnick, and José Miguel
Hernández-Lobato.
**Bayesian
batch active learning as sparse subset approximation**.
In *Advances in Neural Information Processing Systems 33*, 2019.

** Abstract:** Leveraging the wealth of unlabeled data produced
in recent years provides great potential for improving supervised models.
When the cost of acquiring labels is high, probabilistic active learning
methods can be used to greedily select the most informative data points to be
labeled. However, for many large-scale problems standard greedy procedures
become computationally infeasible and suffer from negligible model change. In
this paper, we introduce a novel Bayesian batch active learning approach that
mitigates these issues. Our approach is motivated by approximating the
complete data posterior of the model parameters. While naive batch
construction methods result in correlated queries, our algorithm produces
diverse batches that enable efficient active learning at scale. We derive
interpretable closed-form solutions akin to existing active learning
procedures for linear models, and generalize to arbitrary models using random
projections. We demonstrate the benefits of our approach on several
large-scale regression and classification tasks.

Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, José Miguel Hernández-Lobato,
Richard E. Turner, and Douglas Eck.
**Sequence tutor: Conservative
fine-tuning of sequence generation models with KL-control**.
In *34th International Conference on Machine Learning*, Sydney,
Australia, August 2017.

** Abstract:** This paper proposes a
general method for improving the structure and quality of sequences generated
by a recurrent neural network (RNN), while maintaining information originally
learned from data, as well as sample diversity. An RNN is first pre-trained
on data using maximum likelihood estimation (MLE), and the probability
distribution over the next token in the sequence learned by this model is
treated as a prior policy. Another RNN is then trained using reinforcement
learning (RL) to generate higher-quality outputs that account for
domain-specific incentives while retaining proximity to the prior policy of
the MLE RNN. To formalize this objective, we derive novel off-policy RL
methods for RNNs from KL-control. The effectiveness of the approach is
demonstrated on two applications: 1) generating novel musical melodies, and
2) computational molecular generation. For both problems, we show that the
proposed method improves the desired properties and structure of the
generated sequences, while maintaining information learned from data.

** Comment:** [MIT
Technology Review] [Video]

Thang D. Bui, Daniel Hernández-Lobato, José Miguel Hernández-Lobato,
Yingzhen Li, and Richard E. Turner.
**Deep Gaussian
processes for regression using approximate expectation propagation**.
In *33rd International Conference on Machine Learning*, New York, USA,
June 2016.

** Abstract:** Deep Gaussian processes (DGPs) are
multi-layer hierarchical generalisations of Gaussian processes (GPs) and are
formally equivalent to neural networks with multiple, infinitely wide hidden
layers. DGPs are nonparametric probabilistic models and as such are arguably
more flexible, have a greater capacity to generalise, and provide better
calibrated uncertainty estimates than alternative deep models. This paper
develops a new approximate Bayesian learning scheme that enables DGPs to be
applied to a range of medium to large scale regression problems for the first
time. The new method uses an approximate Expectation Propagation procedure
and a novel and efficient extension of the probabilistic backpropagation
algorithm for learning. We evaluate the new method for non-linear regression
on eleven real-world datasets, showing that it always outperforms GP
regression and is almost always better than state-of-the-art deterministic
and sampling-based approximate inference methods for Bayesian neural
networks. As a by-product, this work provides a comprehensive analysis of six
approximate Bayesian methods for training neural networks.

José Miguel Hernández-Lobato, Yingzhen Li, Mark Rowland, Thang D. Bui,
Daniel Hernández-Lobato, and Richard E. Turner.
**Black-box
Alpha Divergence Minimization**.
In *33rd International Conference on Machine Learning*, New York, USA,
June 2016.

** Abstract:** Black-box alpha (BB-$α$) is a
new approximate inference method based on the minimization of
$α$-divergences. BB-$α$ scales to large datasets because it can be
implemented using stochastic gradient descent. BB-$α$ can be applied to
complex probabilistic models with little effort since it only requires as
input the likelihood function and its gradients. These gradients can be
easily obtained using automatic differentiation. By changing the divergence
parameter $α$, the method is able to interpolate between variational Bayes
(VB) ($α \rightarrow 0$) and an algorithm similar to expectation
propagation (EP) ($α = 1$). Experiments on probit regression and neural
network regression and classification problems show that BB-$α$ with
non-standard settings of $α$, such as $α = 0.5$, usually produces
better predictions than with $α \rightarrow 0$ (VB) or $α = 1$
(EP).

Yingzhen Li, José Miguel Hernández-Lobato, and Richard E. Turner.
**Stochastic Expectation
Propagation**.
In *Advances in Neural Information Processing Systems 28*, Montréal,
Canada, December 2015.

** Abstract:** Expectation propagation (EP)
is a deterministic approximation algorithm that is often used to perform
approximate Bayesian parameter learning. EP approximates the full intractable
posterior distribution through a set of local approximations that are
iteratively refined for each datapoint. EP can offer analytic and
computational advantages over other approximations, such as Variational
Inference (VI), and is the method of choice for a number of models. The local
nature of EP appears to make it an ideal candidate for performing Bayesian
learning on large models in large-scale dataset settings. However, EP has a
crucial limitation in this context: the number of approximating factors needs to
increase with the number of datapoints, N, which often entails a
prohibitively large memory overhead. This paper presents an extension to EP,
called stochastic expectation propagation (SEP), that maintains a global
posterior approximation (like VI) but updates it in a local way (like EP).
Experiments on a number of canonical learning problems using synthetic and
real-world datasets indicate that SEP performs almost as well as full EP, but
reduces the memory consumption by a factor of N. SEP is therefore ideally
suited to performing approximate Bayesian learning in the large model, large
dataset setting.

José Miguel Hernández-Lobato, Michael A. Gelbart, Matthew W. Hoffman,
Ryan P. Adams, and Zoubin Ghahramani.
**Predictive
entropy search for Bayesian optimization with unknown constraints**.
In *32nd International Conference on Machine Learning*, pages
1699-1707, 2015.

** Abstract:** Unknown constraints arise in
many types of expensive black-box optimization problems. Several methods have
been proposed recently for performing Bayesian optimization with constraints,
based on the expected improvement (EI) heuristic. However, EI can lead to
pathologies when used with constraints. For example, in the case of decoupled
constraints—i.e., when one can independently evaluate the objective or the
constraints—EI can encounter a pathology that prevents exploration.
Additionally, computing EI requires a current best solution, which may not
exist if none of the data collected so far satisfy the constraints. By
contrast, information-based approaches do not suffer from these failure
modes. In this paper, we present a new information-based method called
Predictive Entropy Search with Constraints (PESC). We analyze the performance
of PESC and show that it compares favorably to EI-based approaches on
synthetic and benchmark problems, as well as several real-world examples. We
demonstrate that PESC is an effective algorithm that provides a promising
direction towards a unified solution for constrained Bayesian
optimization.

Yue Wu, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Gaussian
process volatility model**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2014.

** Abstract:** The prediction
of time-changing variances is an important task in the modeling of financial
data. Standard econometric models are often limited as they assume rigid
functional relationships for the evolution of the variance. Moreover,
functional parameters are usually learned by maximum likelihood, which can
lead to overfitting. To address these problems we introduce GP-Vol, a novel
non-parametric model for time-changing variances based on Gaussian Processes.
This new model can capture highly flexible functional relationships for the
variances. Furthermore, we introduce a new online algorithm for fast
inference in GP-Vol. This method is much faster than current offline
inference procedures and it avoids overfitting problems by following a fully
Bayesian approach. Experiments with financial data show that GP-Vol performs
significantly better than current standard alternatives.

José Miguel Hernández-Lobato, Matthew W. Hoffman, and Zoubin Ghahramani.
**Predictive
entropy search for efficient global optimization of black-box
functions**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2014.

** Abstract:** We propose a
novel information-theoretic approach for Bayesian optimization called
Predictive Entropy Search (PES). At each iteration, PES selects the next
evaluation point that maximizes the expected information gained with respect
to the global maximum. PES codifies this intractable acquisition function in
terms of the expected reduction in the differential entropy of the predictive
distribution. This reformulation allows PES to obtain approximations that are
both more accurate and efficient than other alternatives such as Entropy
Search (ES). Furthermore, PES can easily perform a fully Bayesian treatment
of the model hyperparameters while ES cannot. We evaluate PES in both
synthetic and real-world applications, including optimization problems in
machine learning, finance, biotechnology, and robotics. We show that the
increased accuracy of PES leads to significant gains in optimization
performance.

Neil Houlsby, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Cold-start
active learning with robust ordinal matrix factorization**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We present a new matrix
factorization model for rating data and a corresponding active learning
strategy to address the cold-start problem. Cold-start is one of the most
challenging tasks for recommender systems: what to recommend with new users
or items for which one has little or no data. An approach is to use active
learning to collect the most useful initial ratings. However, the performance
of active learning depends strongly upon having accurate estimates of i) the
uncertainty in model parameters and ii) the intrinsic noisiness of the data.
To achieve these estimates we propose a heteroskedastic Bayesian model for
ordinal matrix factorization. We also present a computationally efficient
framework for Bayesian active learning with this type of complex
probabilistic model. This algorithm successfully distinguishes between
informative and noisy data points. Our model yields state-of-the-art
predictive performance and, coupled with our active learning strategy,
enables us to gain useful information in the cold-start setting from the very
first active sample.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Probabilistic
matrix factorization with non-random missing data**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We propose a probabilistic matrix
factorization model for collaborative filtering that learns from data that is
missing not at random (MNAR). Matrix factorization models exhibit
state-of-the-art predictive performance in collaborative filtering. However,
these models usually assume that the data is missing at random (MAR), and
this is rarely the case. For example, the data is not MAR if users rate items
they like more than ones they dislike. When the MAR assumption is incorrect,
inferences are biased and predictive performance can suffer. Therefore, we
model both the generative process for the data and the missing data
mechanism. By learning these two models jointly we obtain improved
performance over state-of-the-art methods when predicting the ratings and
when modeling the data observation process. We present the first viable MF
model for MNAR data. Our results are promising and we expect that further
research on MNAR models will yield large gains in collaborative
filtering.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Stochastic
inference for scalable probabilistic modeling of binary matrices**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** Fully observed large binary matrices
appear in a wide variety of contexts. To model them, probabilistic matrix
factorization (PMF) methods are an attractive solution. However, current
batch algorithms for PMF can be inefficient because they need to analyze the
entire data matrix before producing any parameter updates. We derive an
efficient stochastic inference algorithm for PMF models of fully observed
binary matrices. Our method exhibits faster convergence rates than more
expensive batch approaches and has better predictive performance than
scalable alternatives. The proposed method includes new data subsampling
strategies which produce large gains over standard uniform subsampling. We
also address the task of automatically selecting the size of the minibatches
of data used by our method. For this, we derive an algorithm that adjusts
this hyper-parameter online.

Daniel Hernández-Lobato and José Miguel Hernández-Lobato.
**Learning
feature selection dependencies in multi-task learning**.
In *Advances in Neural Information Processing Systems 27*, pages 1-9,
Lake Tahoe, California, USA, December 2013.

** Abstract:** A
probabilistic model based on the horseshoe prior is proposed for learning
dependencies in the process of identifying relevant features for prediction.
Exact inference is intractable in this model. However, expectation
propagation offers an approximate alternative. Because the process of
estimating feature selection dependencies may suffer from over-fitting in the
model proposed, additional data from a multi-task learning scenario are
considered for induction. The same model can be used in this setting with few
modifications. Furthermore, the assumptions made are less restrictive than in
other multi-task methods: The different tasks must share feature selection
dependencies, but can have different relevant features and model
coefficients. Experiments with real and synthetic data show that this model
performs better than other multi-task alternatives from the literature. The
experiments also show that the model is able to induce suitable feature
selection dependencies for the problems considered, only from the training
data.

José Miguel Hernández-Lobato, James Robert Lloyd, and Daniel
Hernández-Lobato.
**Gaussian process
conditional copulas with applications to financial time series**.
In *Advances in Neural Information Processing Systems 27*, pages 1-9,
Lake Tahoe, California, USA, December 2013.

** Abstract:** The
estimation of dependencies between multiple variables is a central problem in
the analysis of financial time series. A common approach is to express these
dependencies in terms of a copula function. Typically the copula function is
assumed to be constant but this may be inaccurate when there are covariates
that could have a large influence on the dependence structure of the data. To
account for this, a Bayesian framework for the estimation of conditional
copulas is proposed. In this framework the parameters of a copula are
non-linearly related to some arbitrary conditioning variables. We evaluate
the ability of our method to predict time-varying dependencies on several
equities and currencies and observe consistent performance gains compared to
static copula models and other time-varying copula methods.

Daniel Hernández-Lobato, José Miguel Hernández-Lobato, and Pierre Dupont.
**Generalized
spike-and-slab priors for Bayesian group feature selection using
expectation propagation**.
*Journal of Machine Learning Research*, 14:1891-1945, July 2013.

** Abstract:** We describe a Bayesian method for group feature
selection in linear regression problems. The method is based on a generalized
version of the standard spike-and-slab prior distribution which is often used
for individual feature selection. Exact Bayesian inference under the prior
considered is infeasible for typical regression problems. However,
approximate inference can be carried out efficiently using Expectation
Propagation (EP). A detailed analysis of the generalized spike-and-slab
prior shows that it is well suited for regression problems that are sparse at
the group level. Furthermore, this prior can be used to introduce prior
knowledge about specific groups of features that are a priori believed to be
more relevant. An experimental evaluation compares the performance of the
proposed method with those of group LASSO, Bayesian group LASSO,
automatic relevance determination and additional variants used for group
feature selection. The results of these experiments show that a model based
on the generalized spike-and-slab prior and the EP algorithm has
state-of-the-art prediction performance in the problems analyzed.
Furthermore, this model is also very useful to carry out sequential
experimental design (also known as active learning), where the data instances
that are most informative are iteratively included in the training set,
reducing the number of instances needed to obtain a particular level of
prediction accuracy.

David Lopez-Paz, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Gaussian
process vine copulas for multivariate dependence**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Copulas make it possible to learn
marginal distributions separately from the multivariate dependence structure
(copula) that links them together into a density function. Vine
factorizations ease the learning of high-dimensional copulas by constructing
a hierarchy of conditional bivariate copulas. However, to simplify inference,
it is common to assume that each of these conditional bivariate copulas is
independent from its conditioning variables. In this paper, we relax this
assumption by discovering the latent functions that specify the shape of a
conditional copula given its conditioning variables. We learn these functions
by following a Bayesian approach based on sparse Gaussian processes with
expectation propagation for scalable, approximate inference. Experiments on
real-world datasets show that, when modeling all conditional dependencies, we
obtain better estimates of the underlying copula of the data.

Yue Wu, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Dynamic covariance
models for multivariate financial time series**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** The accurate
prediction of time-changing covariances is an important problem in the
modeling of multivariate financial data. However, some of the most popular
models suffer from a) overfitting problems and multiple local optima, b)
failure to capture shifts in market conditions and c) large computational
costs. To address these problems we introduce a novel dynamic model for
time-changing covariances. Over-fitting and local optima are avoided by
following a Bayesian approach instead of computing point estimates. Changes
in market conditions are captured by assuming a diffusion process in
parameter values, and finally computationally efficient and scalable
inference is performed using particle filters. Experiments with financial
data show excellent performance of the proposed method with respect to
current standard models.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Stochastic
inference for scalable probabilistic modeling of binary matrices**.
In *NIPS Workshop on Randomized Methods for Machine Learning*, 2013.

** Abstract:** Fully observed large binary matrices appear in a
wide variety of contexts. To model them, probabilistic matrix factorization
(PMF) methods are an attractive solution. However, current batch algorithms
for PMF can be inefficient since they need to analyze the entire data matrix
before producing any parameter updates. We derive an efficient stochastic
inference algorithm for PMF models of fully observed binary matrices. Our
method exhibits faster convergence rates than more expensive batch approaches
and has better predictive performance than scalable alternatives. The
proposed method includes new data subsampling strategies which produce large
gains over standard uniform subsampling. We also address the task of
automatically selecting the size of the minibatches of data and we propose an
algorithm that adjusts this hyper-parameter in an online manner.

Michael Kaschesky, Pawel Sobkowicz, José Miguel Hernández-Lobato, Guillaume
Bouchard, Cedric Archambeau, Nicolas Scharioth, Robert Manchin, Adrian
Gschwend, and Reinhard Riedl.
**Bringing
representativeness into social media monitoring and analysis**.
In *46th Hawaii International Conference on System Sciences*, Manoa,
Hawaii, 2013.

** Abstract:** The opinions, expectations and
behavior of citizens are increasingly reflected online - therefore, mining
the internet for such data can enhance decision-making in public policy,
communications, marketing, finance and other fields. However, to come closer
to the representativeness of classic opinion surveys there is a lack of
knowledge about the sociodemographic characteristics of those voicing
opinions on the internet. This paper proposes to calibrate online opinions
aggregated from multiple and heterogeneous data sources with traditional
surveys enhanced with rich socio-demographic information to enable insights
into which opinions are expressed on the internet by specific segments of
society. The goal of this research is to provide professionals in citizen-
and consumer-centered domains with more concise near real-time intelligence
on online opinions. To become effective, the methodologies presented in this
paper must be integrated into a coherent decision support system.

David Lopez-Paz, José Miguel Hernández-Lobato, and Bernhard Schölkopf.
**Semi-supervised
domain adaptation with non-parametric copulas**.
In *Advances in Neural Information Processing Systems 26*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:** A
new framework based on the theory of copulas is proposed to address
semi-supervised domain adaptation problems. The presented method factorizes
any multivariate density into a product of marginal distributions and
bivariate copula functions. Therefore, changes in each of these factors can
be detected and corrected to adapt a density model across different learning
domains. Importantly, we introduce a novel vine copula model, which allows
for this factorization in a non-parametric manner. Experimental results on
regression problems with real-world data illustrate the efficacy of the
proposed approach when compared to state-of-the-art techniques.

Neil Houlsby, José Miguel Hernández-Lobato, Ferenc Huszár, and Zoubin
Ghahramani.
**Collaborative
Gaussian processes for preference learning**.
In *Advances in Neural Information Processing Systems 26*, pages
2096-2104. Curran Associates, Inc., 2012.

** Abstract:** We
present a new model based on Gaussian processes (GPs) for learning pairwise
preferences expressed by multiple users. Inference is simplified by using a
*preference kernel* for GPs which allows us to combine supervised GP
learning of user preferences with unsupervised dimensionality reduction for
multi-user systems. The model not only exploits collaborative information
from the shared structure in user behavior, but may also incorporate user
features if they are available. Approximate inference is implemented using a
combination of expectation propagation and variational Bayes. Finally, we
present an efficient active learning strategy for querying preferences. The
proposed technique performs favorably on real-world data against
state-of-the-art multi-user preference learning algorithms.

Daniel Hernández-Lobato, José Miguel Hernández-Lobato, and Pierre Dupont.
**Robust
multi-class Gaussian process classification**.
In *Advances in Neural Information Processing Systems 25*, 2011.

** Abstract:** Multi-class Gaussian Process Classifiers (MGPCs)
are often affected by overfitting problems when labeling errors occur far
from the decision boundaries. To prevent this, we investigate a robust MGPC
(RMGPC) which considers labeling errors independently of their distance to
the decision boundaries. Expectation propagation is used for approximate
inference. Experiments with several datasets in which noise is injected in
the labels illustrate the benefits of RMGPC. This method performs better than
other Gaussian process alternatives based on considering latent Gaussian
noise or heavy-tailed processes. When no noise is injected in the labels,
RMGPC still performs as well as or better than the other methods. Finally, we show
how RMGPC can be used for successfully identifying data instances which are
difficult to classify correctly in practice.

Moein Khajehnejad, Ahmad Asgharian Rezaei, Mahmoudreza Babaei, Jessica
Hoffmann, Mahdi Jalili, and Adrian Weller.
**Adversarial graph embeddings for
fair influence maximization over social networks**.
In *International Joint Conference on Artificial Intelligence*, 2020.

** Abstract:** Influence maximization is a widely studied topic
in network science, where the aim is to reach the maximum possible number of
nodes, while only targeting a small initial set of individuals. It has
critical applications in many fields, including viral marketing, information
propagation, news dissemination, and vaccinations. However, the objective
does not usually take into account whether the final set of influenced nodes
is fair with respect to sensitive attributes, such as race or gender. Here we
address fair influence maximization, aiming to reach minorities more
equitably. We introduce Adversarial Graph Embeddings: we co-train an
auto-encoder for graph embedding and a discriminator to discern sensitive
attributes. This leads to embeddings which are similarly distributed across
sensitive attributes. We then find a good initial set by clustering the
embeddings. We believe we are the first to use embeddings for the task of
fair influence maximization. While there are typically trade-offs between
fairness and influence maximization objectives, our experiments on synthetic
and real-world datasets show that our approach dramatically reduces disparity
while remaining competitive with state-of-the-art influence maximization
methods.

José Miguel Hernández-Lobato, Michael A. Gelbart, Matthew W. Hoffman,
Ryan P. Adams, and Zoubin Ghahramani.
**Predictive
entropy search for Bayesian optimization with unknown constraints**.
In *32nd International Conference on Machine Learning*, pages
1699-1707, 2015.

** Abstract:** Unknown constraints arise in
many types of expensive black-box optimization problems. Several methods have
been proposed recently for performing Bayesian optimization with constraints,
based on the expected improvement (EI) heuristic. However, EI can lead to
pathologies when used with constraints. For example, in the case of decoupled
constraints—i.e., when one can independently evaluate the objective or the
constraints—EI can encounter a pathology that prevents exploration.
Additionally, computing EI requires a current best solution, which may not
exist if none of the data collected so far satisfy the constraints. By
contrast, information-based approaches do not suffer from these failure
modes. In this paper, we present a new information-based method called
Predictive Entropy Search with Constraints (PESC). We analyze the performance
of PESC and show that it compares favorably to EI-based approaches on
synthetic and benchmark problems, as well as several real-world examples. We
demonstrate that PESC is an effective algorithm that provides a promising
direction towards a unified solution for constrained Bayesian
optimization.

José Miguel Hernández-Lobato, Matthew W. Hoffman, and Zoubin Ghahramani.
**Predictive
entropy search for efficient global optimization of black-box
functions**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2014.

** Abstract:** We propose a
novel information-theoretic approach for Bayesian optimization called
Predictive Entropy Search (PES). At each iteration, PES selects the next
evaluation point that maximizes the expected information gained with respect
to the global maximum. PES codifies this intractable acquisition function in
terms of the expected reduction in the differential entropy of the predictive
distribution. This reformulation allows PES to obtain approximations that are
both more accurate and efficient than other alternatives such as Entropy
Search (ES). Furthermore, PES can easily perform a fully Bayesian treatment
of the model hyperparameters while ES cannot. We evaluate PES in both
synthetic and real-world applications, including optimization problems in
machine learning, finance, biotechnology, and robotics. We show that the
increased accuracy of PES leads to significant gains in optimization
performance.

Matthew W Hoffman, Bobak Shahriari, and Nando de Freitas.
**On
correlation and budget constraints in model-based bandit optimization with
application to automatic machine learning**.
In *17th International Conference on Artificial Intelligence and
Statistics*, pages 365-374, Reykjavik, Iceland, April 2014.

** Abstract:** We address the problem of finding the maximizer
of a nonlinear function that can only be evaluated, subject to noise, at a
finite number of query locations. Further, we will assume that there is a
constraint on the total number of permitted function evaluations. We
introduce a Bayesian approach for this problem and show that it empirically
outperforms both the existing frequentist counterpart and other Bayesian
optimization methods. The Bayesian approach places emphasis on detailed
modelling, including the modelling of correlations among the arms. As a
result, it can perform well in situations where the number of arms is much
larger than the number of allowed function evaluations, whereas the
frequentist counterpart is inapplicable. This feature enables us to develop
and deploy practical applications, such as automatic machine learning
toolboxes. The paper presents comprehensive comparisons of the proposed
approach with many Bayesian and bandit optimization techniques, the first
comparison of many of these methods in the literature.

Neil Houlsby, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Cold-start
active learning with robust ordinal matrix factorization**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We present a new matrix
factorization model for rating data and a corresponding active learning
strategy to address the cold-start problem. Cold-start is one of the most
challenging tasks for recommender systems: what to recommend with new users
or items for which one has little or no data. An approach is to use active
learning to collect the most useful initial ratings. However, the performance
of active learning depends strongly upon having accurate estimates of i) the
uncertainty in model parameters and ii) the intrinsic noisiness of the data.
To achieve these estimates we propose a heteroskedastic Bayesian model for
ordinal matrix factorization. We also present a computationally efficient
framework for Bayesian active learning with this type of complex
probabilistic model. This algorithm successfully distinguishes between
informative and noisy data points. Our model yields state-of-the-art
predictive performance and, coupled with our active learning strategy,
enables us to gain useful information in the cold-start setting from the very
first active sample.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Probabilistic
matrix factorization with non-random missing data**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We propose a probabilistic matrix
factorization model for collaborative filtering that learns from data that is
missing not at random (MNAR). Matrix factorization models exhibit
state-of-the-art predictive performance in collaborative filtering. However,
these models usually assume that the data is missing at random (MAR), and
this is rarely the case. For example, the data is not MAR if users rate items
they like more than ones they dislike. When the MAR assumption is incorrect,
inferences are biased and predictive performance can suffer. Therefore, we
model both the generative process for the data and the missing data
mechanism. By learning these two models jointly we obtain improved
performance over state-of-the-art methods when predicting the ratings and
when modeling the data observation process. We present the first viable MF
model for MNAR data. Our results are promising and we expect that further
research on MNAR models will yield large gains in collaborative
filtering.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Stochastic
inference for scalable probabilistic modeling of binary matrices**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** Fully observed large binary matrices
appear in a wide variety of contexts. To model them, probabilistic matrix
factorization (PMF) methods are an attractive solution. However, current
batch algorithms for PMF can be inefficient because they need to analyze the
entire data matrix before producing any parameter updates. We derive an
efficient stochastic inference algorithm for PMF models of fully observed
binary matrices. Our method exhibits faster convergence rates than more
expensive batch approaches and has better predictive performance than
scalable alternatives. The proposed method includes new data subsampling
strategies which produce large gains over standard uniform subsampling. We
also address the task of automatically selecting the size of the minibatches
of data used by our method. For this, we derive an algorithm that adjusts
this hyper-parameter online.

Neil Houlsby and Massimiliano Ciaramita.
**A
scalable Gibbs sampler for probabilistic entity linking**.
In *36th European Conference on Information Retrieval*, pages 335-346.
Springer, 2014.

** Abstract:** Entity linking involves
labeling phrases in text with their referent entities, such as Wikipedia or
Freebase entries. This task is challenging due to the large number of
possible entities, in the millions, and heavy-tailed mention ambiguity. We
formulate the problem in terms of probabilistic inference within a topic
model, where each topic is associated with a Wikipedia article. To deal with
the large number of topics we propose a novel efficient Gibbs sampling scheme
which can also incorporate side information, such as the Wikipedia graph.
This conceptually simple probabilistic approach achieves state-of-the-art
performance in entity-linking on the Aida-CoNLL dataset.

Neil Houlsby and Guy Houlsby.
**Statistical
fitting of undrained strength data**.
*Geotechnique*, 63(13):1253-1263, 2013, doi
10.1680/geot.13.P.007.

** Abstract:** We describe an
approach, based on Bayesian statistical methods, that allows the fitting of a
design profile to a set of measurements of undrained strengths. In particular
we allow for the automatic determination of not only the positions of
boundaries between geological units, but also the selection of the number of
units to model the data in an appropriate way.

Neil Houlsby, Ferenc Huszár, Mohammad M Ghassemi, Gergő Orbán,
Daniel M Wolpert, and Máté Lengyel.
**Cognitive
tomography reveals complex task-independent mental representations**.
*Current Biology*, 23(21):2169-2175, 2013, doi
10.1016/j.cub.2013.09.012.

** Abstract:** Humans develop
rich mental representations that guide their behavior in a variety of
every-day tasks. However, it is unknown whether these representations, often
formalized as priors in Bayesian inference, are specific for each task or
subserve multiple tasks. Current approaches cannot distinguish between these
two possibilities because they cannot extract comparable representations
across different tasks. Here, we develop a novel method, termed cognitive
tomography, that can extract complex, multi-dimensional priors across tasks.
We apply this method to human judgments in two qualitatively different tasks,
familiarity and odd-one-out, involving an ecologically relevant set of
stimuli, human faces. We show that priors over faces are structurally complex
and vary dramatically across subjects, but are invariant across the tasks
within each subject. The priors we extract from each task allow us to predict
with high precision the behavior of subjects for novel stimuli both in the
same task as well as in the other task. Our results provide the first
evidence for a single high-dimensional structured representation of a
naturalistic stimulus set that guides behavior in multiple tasks. Moreover,
the representations estimated by cognitive tomography can provide
independent, behavior-based regressors for elucidating the neural correlates
of complex naturalistic priors.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Stochastic
inference for scalable probabilistic modeling of binary matrices**.
In *NIPS Workshop on Randomized Methods for Machine Learning*, 2013.

** Abstract:** Fully observed large binary matrices appear in a
wide variety of contexts. To model them, probabilistic matrix factorization
(PMF) methods are an attractive solution. However, current batch algorithms
for PMF can be inefficient since they need to analyze the entire data matrix
before producing any parameter updates. We derive an efficient stochastic
inference algorithm for PMF models of fully observed binary matrices. Our
method exhibits faster convergence rates than more expensive batch approaches
and has better predictive performance than scalable alternatives. The
proposed method includes new data subsampling strategies which produce large
gains over standard uniform subsampling. We also address the task of
automatically selecting the size of the minibatches of data and we propose an
algorithm that adjusts this hyper-parameter in an online manner.

Tomoharu Iwata, Neil Houlsby, and Zoubin Ghahramani.
**Active
learning for interactive visualization**.
In *16th International Conference on Artificial Intelligence and
Statistics*, 2013.

** Abstract:** Many automatic
visualization methods have been proposed. However, a visualization that is
automatically generated might be different to how a user wants to arrange the
objects in visualization space. By allowing users to re-locate objects in the
embedding space of the visualization, they can adjust the visualization to
their preference. We propose an active learning framework for interactive
visualization which selects objects for the user to re-locate so that they
can obtain their desired visualization by re-locating as few as possible. The
framework is based on an information theoretic criterion, which favors
objects that reduce the uncertainty of the visualization. We present a
concrete application of the proposed framework to the Laplacian eigenmap
visualization method. We demonstrate experimentally that the proposed
framework yields the desired visualization with fewer user interactions than
existing methods.

Konstantin Kravtsov, Stanislav Straupe, Igor Radchenko, Neil Houlsby, Ferenc
Huszár, and Sergey Kulik.
**Experimental
adaptive Bayesian tomography**.
*Physical Review A*, 87(6):062122, 2013.

** Abstract:**
We report an experimental realization of an adaptive quantum state tomography
protocol. Our method takes advantage of a Bayesian approach to statistical
inference and is naturally tailored for adaptive strategies. For pure states
we observe close to $N^{-1}$ scaling of infidelity with overall number of
registered events, while best non-adaptive protocols allow for $N^{-1/2}$
scaling only. Experiments are performed for polarization qubits, but the
approach is readily adapted to any dimension.

Ferenc Huszár and Neil Houlsby.
**Adaptive Bayesian
quantum tomography**.
*Physical Review A*, 85(5):052120, 2012.

** Abstract:**
In this paper we revisit the problem of optimal design of quantum tomographic
experiments. In contrast to previous approaches where an optimal set of
measurements is decided in advance of the experiment, we allow for
measurements to be adaptively and efficiently re-optimised depending on data
collected so far. We develop an adaptive statistical framework based on
Bayesian inference and Shannon's information, and demonstrate a ten-fold
reduction in the total number of measurements required as compared to
non-adaptive methods, including mutually unbiased bases.

Neil Houlsby, José Miguel Hernández-Lobato, Ferenc Huszár, and Zoubin
Ghahramani.
**Collaborative
Gaussian processes for preference learning**.
In *Advances in Neural Information Processing Systems 26*, pages
2096-2104. Curran Associates, Inc., 2012.

** Abstract:** We
present a new model based on Gaussian processes (GPs) for learning pairwise
preferences expressed by multiple users. Inference is simplified by using a
*preference kernel* for GPs which allows us to combine supervised GP
learning of user preferences with unsupervised dimensionality reduction for
multi-user systems. The model not only exploits collaborative information
from the shared structure in user behavior, but may also incorporate user
features if they are available. Approximate inference is implemented using a
combination of expectation propagation and variational Bayes. Finally, we
present an efficient active learning strategy for querying preferences. The
proposed technique performs favorably on real-world data against
state-of-the-art multi-user preference learning algorithms.

Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel.
**Bayesian active
learning for classification and preference learning**.
*arXiv*, abs/1112.5745, 2011.

** Abstract:** Information
theoretic active learning has been widely studied for probabilistic models.
For simple regression an optimal myopic policy is easily tractable. However,
for other tasks and with more complex models, such as classification with
nonparametric models, the optimal solution is harder to compute. Current
approaches make approximations to achieve tractability. We propose an
approach that expresses information gain in terms of predictive entropies,
and apply this method to the Gaussian Process Classifier (GPC). Our approach
makes minimal approximations to the full information theoretic objective. Our
experimental performance compares favourably to many popular active learning
algorithms, and has equal or lower computational complexity. We compare well
to decision theoretic approaches also, which are privy to more information
and require much more computational time. Secondly, by developing further a
reformulation of binary preference learning to a classification problem, we
extend our algorithm to Gaussian Process preference learning.
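
As a rough illustration of expressing information gain through predictive entropies, the sketch below scores candidate inputs by the entropy of the averaged prediction minus the average entropy of the individual predictions. It assumes binary labels and a set of posterior samples of the predictive probability, which is a simplification chosen for this example and not the Gaussian process approximation used in the paper; the names `binary_entropy`, `information_gain`, and `prob_samples` are hypothetical.

```python
import numpy as np


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))


def information_gain(prob_samples):
    """Score each candidate input by an entropy difference.

    prob_samples has shape (S, N): entry [s, n] is P(y = 1 | x_n, theta_s)
    for posterior sample theta_s (assumed to be supplied by the caller).
    """
    # Entropy of the prediction after averaging over posterior samples.
    marginal_entropy = binary_entropy(prob_samples.mean(axis=0))
    # Average entropy of the prediction under each individual sample.
    expected_entropy = binary_entropy(prob_samples).mean(axis=0)
    return marginal_entropy - expected_entropy


# Two posterior samples, two candidate inputs.
# Input 0: the samples disagree (0.9 vs 0.1). Input 1: both say 0.5.
prob_samples = np.array([[0.9, 0.5],
                         [0.1, 0.5]])
scores = information_gain(prob_samples)
best = int(np.argmax(scores))
```

Under these assumptions the first input scores higher: the samples disagree about it, so a label would be informative about the parameters, whereas the second input is uncertain under every sample and a label would teach nothing.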

Neil Houlsby, Ferenc Huszár, Mohammad M Ghassemi, Gergő Orbán,
Daniel M Wolpert, and Máté Lengyel.
**Cognitive
tomography reveals complex task-independent mental representations**.
*Current Biology*, 23(21):2169-2175, 2013, doi
10.1016/j.cub.2013.09.012.

** Abstract:** Humans develop
rich mental representations that guide their behavior in a variety of
every-day tasks. However, it is unknown whether these representations, often
formalized as priors in Bayesian inference, are specific for each task or
subserve multiple tasks. Current approaches cannot distinguish between these
two possibilities because they cannot extract comparable representations
across different tasks. Here, we develop a novel method, termed cognitive
tomography, that can extract complex, multi-dimensional priors across tasks.
We apply this method to human judgments in two qualitatively different tasks,
familiarity and odd-one-out, involving an ecologically relevant set of
stimuli, human faces. We show that priors over faces are structurally complex
and vary dramatically across subjects, but are invariant across the tasks
within each subject. The priors we extract from each task allow us to predict
with high precision the behavior of subjects for novel stimuli both in the
same task as well as in the other task. Our results provide the first
evidence for a single high-dimensional structured representation of a
naturalistic stimulus set that guides behavior in multiple tasks. Moreover,
the representations estimated by cognitive tomography can provide
independent, behavior-based regressors for elucidating the neural correlates
of complex naturalistic priors.

Konstantin Kravtsov, Stanislav Straupe, Igor Radchenko, Neil Houlsby, Ferenc
Huszár, and Sergey Kulik.
**Experimental
adaptive Bayesian tomography**.
*Physical Review A*, 87(6):062122, 2013.

** Abstract:**
We report an experimental realization of an adaptive quantum state tomography
protocol. Our method takes advantage of a Bayesian approach to statistical
inference and is naturally tailored for adaptive strategies. For pure states
we observe close to $N^{-1}$ scaling of infidelity with overall number of
registered events, while best non-adaptive protocols allow for $N^{-1/2}$
scaling only. Experiments are performed for polarization qubits, but the
approach is readily adapted to any dimension.

Ferenc Huszár and David Duvenaud.
**Optimally-weighted herding is
Bayesian quadrature**.
In *28th Conference on Uncertainty in Artificial Intelligence*, pages
377-385, Catalina Island, California, July 2012.

** Abstract:** Herding and kernel herding are deterministic methods of
choosing samples which summarise a probability distribution. A related task
is choosing samples for estimating integrals using Bayesian quadrature. We
show that the criterion minimised when selecting samples in kernel herding is
equivalent to the posterior variance in Bayesian quadrature. We then show
that sequential Bayesian quadrature can be viewed as a weighted version of
kernel herding which achieves performance superior to any other weighted
herding method. We demonstrate empirically a rate of convergence faster than
O(1/N). Our results also imply an upper bound on the empirical error of the
Bayesian quadrature estimate.
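
For readers unfamiliar with kernel herding, the following is a minimal sketch of the standard greedy, uniformly weighted selection rule: each new point maximises the kernel mean embedding of the target minus the average kernel similarity to the points already chosen. It assumes a one-dimensional problem, an RBF kernel, a finite candidate grid, and a mean embedding estimated from samples; all of these, and the names `rbf_kernel` and `kernel_herding`, are choices made for this example. It does not implement the optimally weighted variant discussed in the abstract.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0):
    """RBF kernel matrix between 1-D arrays a (length n) and b (length m)."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / lengthscale**2)


def kernel_herding(candidates, target_samples, num_points):
    """Greedily pick num_points values from candidates (repeats allowed)."""
    # Empirical kernel mean embedding of the target at each candidate.
    mean_embedding = rbf_kernel(candidates, target_samples).mean(axis=1)
    gram = rbf_kernel(candidates, candidates)
    kernel_sum = np.zeros(len(candidates))  # sum of k(x, x_i) over chosen x_i
    chosen = []
    for t in range(num_points):
        # Herding criterion: embedding minus scaled similarity to chosen points.
        score = mean_embedding - kernel_sum / (t + 1)
        idx = int(np.argmax(score))
        chosen.append(idx)
        kernel_sum += gram[:, idx]
    return candidates[np.array(chosen)]


rng = np.random.default_rng(0)
target_samples = rng.normal(size=2000)   # stand-in for the target distribution
candidates = np.linspace(-4.0, 4.0, 401)
points = kernel_herding(candidates, target_samples, num_points=10)
```

The first selected point lands near the mode of the embedding, and later points are pushed away from earlier ones by the repulsion term, which is what lets the set summarise the distribution.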

Ferenc Huszár and Neil Houlsby.
**Adaptive Bayesian
quantum tomography**.
*Physical Review A*, 85(5):052120, 2012.

** Abstract:**
In this paper we revisit the problem of optimal design of quantum tomographic
experiments. In contrast to previous approaches where an optimal set of
measurements is decided in advance of the experiment, we allow for
measurements to be adaptively and efficiently re-optimised depending on data
collected so far. We develop an adaptive statistical framework based on
Bayesian inference and Shannon's information, and demonstrate a ten-fold
reduction in the total number of measurements required as compared to
non-adaptive methods, including mutually unbiased bases.

Neil Houlsby, José Miguel Hernández-Lobato, Ferenc Huszár, and Zoubin
Ghahramani.
**Collaborative
Gaussian processes for preference learning**.
In *Advances in Neural Information Processing Systems 26*, pages
2096-2104. Curran Associates, Inc., 2012.

** Abstract:** We
present a new model based on Gaussian processes (GPs) for learning pairwise
preferences expressed by multiple users. Inference is simplified by using a
*preference kernel* for GPs which allows us to combine supervised GP
learning of user preferences with unsupervised dimensionality reduction for
multi-user systems. The model not only exploits collaborative information
from the shared structure in user behavior, but may also incorporate user
features if they are available. Approximate inference is implemented using a
combination of expectation propagation and variational Bayes. Finally, we
present an efficient active learning strategy for querying preferences. The
proposed technique performs favorably on real-world data against
state-of-the-art multi-user preference learning algorithms.

Simon Lacoste-Julien, Ferenc Huszár, and Zoubin Ghahramani.
**Approximate
inference for the loss-calibrated Bayesian**.
In Geoff Gordon and David Dunson, editors, *14th International Conference on
Artificial Intelligence and Statistics*, volume 15, pages 416-424,
Fort Lauderdale, FL, USA, April 2011. Journal of Machine Learning Research.

** Abstract:** We consider the problem of approximate inference
in the context of Bayesian decision theory. Traditional approaches focus on
approximating general properties of the posterior, ignoring the decision task
- and associated losses - for which the posterior could be used. We argue
that this can be suboptimal and propose instead to *loss-calibrate* the
approximate inference methods with respect to the decision task at hand. We
present a general framework rooted in Bayesian decision theory to analyze
approximate inference from the perspective of losses, opening up several
research directions. As a first loss-calibrated approximate inference
attempt, we propose an EM-like algorithm on the Bayesian posterior risk and
show how it can improve a standard approach to Gaussian process
classification when losses are asymmetric.

Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel.
**Bayesian active
learning for classification and preference learning**.
*arXiv*, abs/1112.5745, 2011.

** Abstract:** Information
theoretic active learning has been widely studied for probabilistic models.
For simple regression an optimal myopic policy is easily tractable. However,
for other tasks and with more complex models, such as classification with
nonparametric models, the optimal solution is harder to compute. Current
approaches make approximations to achieve tractability. We propose an
approach that expresses information gain in terms of predictive entropies,
and apply this method to the Gaussian Process Classifier (GPC). Our approach
makes minimal approximations to the full information theoretic objective. Our
experimental performance compares favourably to many popular active learning
algorithms, and has equal or lower computational complexity. We compare well
to decision theoretic approaches also, which are privy to more information
and require much more computational time. Secondly, by developing further a
reformulation of binary preference learning to a classification problem, we
extend our algorithm to Gaussian Process preference learning.

Ferenc Huszár and Simon Lacoste-Julien.
**A kernel approach to
tractable Bayesian nonparametrics**.
Technical report, University of Cambridge, 2011.

** Abstract:**
Inference in popular nonparametric Bayesian models typically relies on
sampling or other approximations. This paper presents a general methodology
for constructing novel tractable nonparametric Bayesian methods by applying
the kernel trick to inference in a parametric Bayesian model. For example,
Gaussian process regression can be derived this way from Bayesian linear
regression. Despite the success of the Gaussian process framework, the kernel
trick is rarely explicitly considered in the Bayesian literature. In this
paper, we aim to fill this gap and demonstrate the potential of applying the
kernel trick to tractable Bayesian parametric models in a wider context than
just regression. As an example, we present an intuitive Bayesian kernel
machine for density estimation that is obtained by applying the kernel trick
to a Gaussian generative model in feature space.

** Comment:** arXiv:1103.1761

Ferenc Huszár, Uta Noppeney, and Máté Lengyel.
**Mind reading by
machine learning: A doubly Bayesian method for inferring mental
representations**.
In S. Ohlsson and R. Catrambone, editors, *The Proceedings of the 32nd
Annual Meeting of the Cognitive Science Society*, Austin, TX, USA, August
2010. The Cognitive Science Society.

** Abstract:** A central
challenge in cognitive science is to measure and quantify the mental
representations humans develop - in other words, to 'read' subjects' minds.
In order to eliminate potential biases in reporting mental contents due to
verbal elaboration, subjects' responses in experiments are often limited to
binary decisions or discrete choices that do not require conscious reflection
upon their mental contents. However, it is unclear what such impoverished
data can tell us about the potential richness and dynamics of subjects'
mental representations. To address this problem, we used ideal observer
models that formalise choice behaviour as (quasi-)Bayes-optimal, given
subjects' representations in long-term memory, acquired through prior
learning, and the stimuli currently available to them. Bayesian inversion of
such ideal observer models allowed us to infer subjects' mental
representation from their choice behaviour in a variety of psychophysical
tasks. The inferred mental representations also allowed us to predict future
choices of subjects with reasonable accuracy, even in tasks that were
different from those in which the representations were estimated. These
results demonstrate a significant potential in standard binary decision tasks
to recover detailed information about subjects' mental representations.

** Comment:** Supplementary material available here.

Alessandro Davide Ialongo, Mark van der Wilk, James Hensman, and Carl Edward
Rasmussen.
**Overcoming mean-field
approximations in recurrent Gaussian process models**.
In *36th International Conference on Machine Learning*, Long Beach, June
2019.

** Abstract:** We identify a new variational inference
scheme for dynamical systems whose transition function is modelled by a
Gaussian process. Inference in this setting has either employed
computationally intensive MCMC methods, or relied on factorisations of the
variational posterior. As we demonstrate in our experiments, the
factorisation between latent system states and transition function can lead
to a miscalibrated posterior and to learning unnecessarily large noise terms.
We eliminate this factorisation by explicitly modelling the dependence
between state trajectories and the Gaussian process posterior. Samples of the
latent states can then be tractably generated by conditioning on this
representation. The method we obtain (VCDT: variationally coupled dynamics
and trajectories) gives better predictive performance and more calibrated
estimates of the transition function, yet maintains the same time and space
complexities as mean-field methods. Code is available at:
https://github.com/ialong/GPt.

Alessandro Davide Ialongo, Mark van der Wilk, James Hensman, and Carl Edward
Rasmussen.
**Non-factorised variational
inference in dynamical systems**.
In *First Symposium on Advances in Approximate Bayesian Inference*,
Montreal, December 2018.

** Abstract:** We focus on
variational inference in dynamical systems where the discrete time transition
function (or evolution rule) is modelled by a Gaussian process. The dominant
approach so far has been to use a factorised posterior distribution,
decoupling the transition function from the system states. This is not exact
in general and can lead to an overconfident posterior over the transition
function as well as an overestimation of the intrinsic stochasticity of the
system (process noise). We propose a new method that addresses these issues
and incurs no additional computational costs.

Carlo Ciliberto, Mark Herbster, Alessandro Davide Ialongo, Massimiliano Pontil,
Andrea Rocchetto, Simone Severini, and Leonard Wossnig.
**Quantum
machine learning: a classical perspective**.
In *Proc. R. Soc. A*, volume 474, page 20170551. The Royal Society,
2018.

** Abstract:** Recently, increased computational power
and data availability, as well as algorithmic advances, have led machine
learning techniques to impressive results in regression, classification,
data-generation and reinforcement learning tasks. Despite these successes,
the proximity to the physical limits of chip fabrication alongside the
increasing size of datasets are motivating a growing number of researchers to
explore the possibility of harnessing the power of quantum computation to
speed up classical machine learning algorithms. Here we review the literature
in quantum machine learning and discuss perspectives for a mixed readership
of classical machine learning and quantum computation experts. Particular
emphasis will be placed on clarifying the limitations of quantum algorithms,
how they compare with their best classical counterparts and why quantum
resources are expected to provide advantages for learning problems. Learning
in the presence of noise and certain computationally hard problems in machine
learning are identified as promising directions for the field. Practical
questions, like how to upload classical data into quantum form, will also be
addressed.

Alessandro Davide Ialongo, Mark van der Wilk, and Carl Edward Rasmussen.
**Closed-form inference and
prediction in Gaussian process state-space models**.
In *NIPS Time Series Workshop 2017*, Long Beach, December 2017.

** Abstract:** We examine an analytic variational inference
scheme for the Gaussian Process State Space Model (GPSSM) - a probabilistic
model for system identification and time-series modelling. Our approach
performs variational inference over both the system states and the transition
function. We exploit Markov structure in the true posterior, as well as an
inducing point approximation to achieve linear time complexity in the length
of the time series. Contrary to previous approaches, no Monte Carlo sampling
is required: inference is cast as a deterministic optimisation problem. In a
number of experiments, we demonstrate the ability to model non-linear
dynamics in the presence of both process and observation noise as well as to
impute missing information (e.g. velocities from raw positions through time),
to de-noise, and to estimate the underlying dimensionality of the system.
Finally, we also introduce a closed-form method for multi-step prediction,
and a novel criterion for assessing the quality of our approximate
posterior.

Niki Kilbertus, Manuel Gomez Rodriguez, Bernhard Schölkopf, Krikamol Muandet,
and Isabel Valera.
**Fair decisions
despite imperfect predictions**.
In Silvia Chiappa and Roberto Calandra, editors, *23rd International
Conference on Artificial Intelligence and Statistics*, volume 108 of
*Proceedings of Machine Learning Research*, pages 277-287. PMLR,
26-28 Aug 2020.

** Abstract:** Consequential decisions are
increasingly informed by sophisticated data-driven predictive models.
However, consistently learning accurate predictive models requires access to
ground truth labels. Unfortunately, in practice, labels may only exist
conditional on certain decisions—if a loan is denied, there is not even an
option for the individual to pay back the loan. In this paper, we show that,
in this selective labels setting, learning to predict is suboptimal in terms
of both fairness and utility. To avoid this undesirable behavior, we propose
to directly learn stochastic decision policies that maximize utility under
fairness constraints. In the context of fair machine learning, our results
suggest the need for a paradigm shift from "learning to predict" to "learning
to decide". Experiments on synthetic and real-world data illustrate the
favorable properties of learning to decide, in terms of both utility and
fairness.

Timothy Gebhard, Niki Kilbertus, Ian Harry, and Bernhard Schölkopf.
**Convolutional
neural networks: A magic bullet for gravitational-wave detection?**
*Physical Review D*, 100(6):063015, September 2019, doi
https://doi.org/10.1103/PhysRevD.100.063015.

**
Abstract:** In the last few years, machine learning techniques, in
particular convolutional neural networks, have been investigated as a method
to replace or complement traditional matched filtering techniques that are
used to detect the gravitational-wave signature of merging black holes.
However, to date, these methods have not yet been successfully applied to the
analysis of long stretches of data recorded by the Advanced LIGO and Virgo
gravitational-wave observatories. In this work, we critically examine the use
of convolutional neural networks as a tool to search for merging black holes.
We identify the strengths and limitations of this approach, highlight some
common pitfalls in translating between machine learning and
gravitational-wave astronomy, and discuss the interdisciplinary challenges.
In particular, we explain in detail why convolutional neural networks alone
cannot be used to claim a statistically significant gravitational-wave
detection. However, we demonstrate how they can still be used to rapidly flag
the times of potential signals in the data for a more detailed follow-up. Our
convolutional neural network architecture as well as the proposed performance
metrics are better suited for this task than a standard binary
classification scheme. A detailed evaluation of our approach on Advanced
LIGO data demonstrates the potential of such systems as trigger generators.
Finally, we sound a note of caution by constructing adversarial examples,
which showcase interesting "failure modes" of our model, where inputs with no
visible resemblance to real gravitational-wave signals are identified as such
by the network with high confidence.

Niki Kilbertus, Phil Ball, Matt Kusner, Adrian Weller, and Ricardo Silva.
**The sensitivity of counterfactual
fairness to unmeasured confounding**.
In *35th Conference on Uncertainty in Artificial Intelligence*, Tel
Aviv, July 2019.

** Abstract:** Causal approaches to fairness
have seen substantial recent interest, both from the machine learning
community and from wider parties interested in ethical prediction algorithms.
In no small part, this has been due to the fact that causal models allow one
to simultaneously leverage data and expert knowledge to remove discriminatory
effects from predictions. However, one of the primary assumptions in causal
modeling is that you know the causal graph. This introduces a new opportunity
for bias, caused by misspecifying the causal model. One common way for
misspecification to occur is via unmeasured confounding: the true causal
effect between variables is partially described by unobserved quantities. In
this work we design tools to assess the sensitivity of fairness measures to
this confounding for the popular class of non-linear additive noise models
(ANMs). Specifically, we give a procedure for computing the maximum
difference between two counterfactually fair predictors, where one has become
biased due to confounding. For the case of bivariate confounding our
technique can be swiftly computed via a sequence of closed-form updates. For
multivariate confounding we give an algorithm that can be efficiently solved
via automatic differentiation. We demonstrate our new sensitivity analysis
tools in real-world fairness scenarios to assess the bias arising from
confounding.

Niki Kilbertus, Adria Gascon, Matt Kusner, Michael Veale, Krishna P. Gummadi,
and Adrian Weller.
**Blind
justice: Fairness with encrypted sensitive attributes**.
In *35th International Conference on Machine Learning*, Stockholm,
Sweden, July 2018.

** Abstract:** Recent work has explored how
to train machine learning models which do not discriminate against any
subgroup of the population as determined by sensitive attributes such as
gender or race. To avoid disparate treatment, sensitive attributes should not
be considered. On the other hand, in order to avoid disparate impact,
sensitive attributes must be examined — e.g., in order to learn a fair
model, or to check if a given model is fair. We introduce methods from secure
multi-party computation which allow us to avoid both. By encrypting sensitive
attributes, we show how an outcome based fair model may be learned, checked,
or have its outputs verified and held to account, without users revealing
their sensitive attributes.

Giambattista Parascandolo, Niki Kilbertus, Mateo Rojas-Carulla, and Bernhard
Schölkopf.
**Learning
independent causal mechanisms**.
In *35th International Conference on Machine Learning*, Stockholm,
Sweden, July 2018.

** Abstract:** Statistical learning relies
upon data sampled from a distribution, and we usually do not care what
actually generated it in the first place. From the point of view of causal
modeling, the structure of each distribution is induced by physical
mechanisms that give rise to dependences between observables. Mechanisms,
however, can be meaningful autonomous modules of generative models that make
sense beyond a particular entailed data distribution, lending themselves to
transfer between problems. We develop an algorithm to recover a set of
independent (inverse) mechanisms from a set of transformed data points. The
approach is unsupervised and based on a set of experts that compete for data
generated by the mechanisms, driving specialization. We analyze the proposed
method in a series of experiments on image data. Each expert learns to map a
subset of the transformed data back to a reference distribution. The learned
mechanisms generalize to novel domains. We discuss implications for transfer
learning and links to recent trends in generative modeling.

Niki Kilbertus, Mateo Rojas-Carulla, Giambattista Parascandolo, Moritz Hardt,
Dominik Janzing, and Bernhard Schölkopf.
**Avoiding discrimination
through causal reasoning**.
In *Advances in Neural Information Processing Systems 30*, Long Beach,
California, December 2017.

** Abstract:** Recent work on
fairness in machine learning has focused on various statistical
discrimination criteria and how they trade off. Most of these criteria are
observational: They depend only on the joint distribution of predictor,
protected attribute, features, and outcome. While convenient to work with,
observational criteria have severe inherent limitations that prevent them
from resolving matters of fairness conclusively. Going beyond observational
criteria, we frame the problem of discrimination based on protected
attributes in the language of causal reasoning. This viewpoint shifts
attention from "What is the right fairness criterion?" to "What do we want
to assume about our model of the causal data generating process?" Through
the lens of causality, we make several contributions. First, we crisply
articulate why and when observational criteria fail, thus formalizing what
was before a matter of opinion. Second, our approach exposes previously
ignored subtleties and why they are fundamental to the problem. Finally, we
put forward natural causal non-discrimination criteria and develop algorithms
that satisfy them.

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani.
**Beta diffusion trees and
hierarchical feature allocations**.
Technical report, Dept. of Engineering, University of Cambridge, August 2014.

** Abstract:** We define the beta diffusion tree, a random tree
structure with a set of leaves that defines a collection of overlapping
subsets of objects, known as a feature allocation. A generative process for
the tree structure is defined in terms of particles (representing the
objects) diffusing in some continuous space, analogously to the Dirichlet
diffusion tree (Neal, 2003b), which defines a tree structure over partitions
(i.e., non-overlapping subsets) of the objects. Unlike in the Dirichlet
diffusion tree, multiple copies of a particle may exist and diffuse along
multiple branches in the beta diffusion tree, and an object may therefore
belong to multiple subsets of particles. We demonstrate how to build a
hierarchically-clustered factor analysis model with the beta diffusion tree
and how to perform inference over the random tree structures with a Markov
chain Monte Carlo algorithm. We conclude with several numerical experiments
on missing data problems with data sets of gene expression microarrays,
international development statistics, and intranational socioeconomic
measurements.

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani.
**Beta
diffusion trees**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We define the beta diffusion tree, a
random tree structure with a set of leaves that defines a collection of
overlapping subsets of objects, known as a feature allocation. The generative
process for the tree is defined in terms of particles (representing the
objects) diffusing in some continuous space, analogously to the Dirichlet and
Pitman–Yor diffusion trees (Neal, 2003b; Knowles & Ghahramani, 2011), both
of which define tree structures over clusters of the particles. With the beta
diffusion tree, however, multiple copies of a particle may exist and diffuse
to multiple locations in the continuous space, resulting in (a random number
of) possibly overlapping clusters of the objects. We demonstrate how to build
a hierarchically-clustered factor analysis model with the beta diffusion tree
and how to perform inference over the random tree structures with a Markov
chain Monte Carlo algorithm. We conclude with several numerical experiments
on missing data problems with data sets of gene expression arrays,
international development statistics, and intranational socioeconomic
measurements.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani.
**A reversible
infinite HMM using normalised random measures**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We present a nonparametric prior
over reversible Markov chains. We use completely random measures,
specifically gamma processes, to construct a countably infinite graph with
weighted edges. By enforcing symmetry to make the edges undirected we define
a prior over random walks on graphs that results in a reversible Markov
chain. The resulting prior over infinite transition matrices is closely
related to the hierarchical Dirichlet process but enforces reversibility. A
reinforcement scheme has recently been proposed with similar properties, but
the de Finetti measure is not well characterised. We take the alternative
approach of explicitly constructing the mixing measure, which allows more
straightforward and efficient inference at the cost of no longer having a
closed form predictive distribution. We use our process to construct a
reversible infinite HMM which we apply to two real datasets, one from
epigenomics and one ion channel recording.

Novi Quadrianto, Viktoriia Sharmanska, David A. Knowles, and Zoubin Ghahramani.
**The supervised
IBP: Neighbourhood preserving infinite latent feature models**.
In *29th Conference on Uncertainty in Artificial Intelligence*,
Bellevue, USA, July 2013.

** Abstract:** We propose a
probabilistic model to infer supervised latent variables in the Hamming space
from observed data. Our model allows simultaneous inference of the number of
binary latent variables, and their values. The latent variables preserve
neighbourhood structure of the data in the sense that objects in the same
semantic concept have similar latent values, and objects in different
concepts have dissimilar latent values. We formulate the supervised infinite
latent variable problem based on an intuitive principle of pulling objects
together if they are of the same type, and pushing them apart if they are
not. We then combine this principle with a flexible Indian Buffet Process
prior on the latent variables. We show that the inferred supervised latent
variables can be directly used to perform a nearest neighbour search for the
purpose of retrieval. We introduce a new application of dynamically extending
hash codes, and show how to effectively couple the structure of the hash
codes with the continuously growing structure of the neighbourhood preserving
infinite latent feature space.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani.
**A
nonparametric variable clustering model**.
In *Advances in Neural Information Processing Systems 26*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:**
Factor analysis models effectively summarise the covariance structure of high
dimensional data, but the solutions are typically hard to interpret. This
motivates attempting to find a disjoint partition, i.e. a simple clustering,
of observed variables into highly correlated subsets. We introduce a Bayesian
non-parametric approach to this problem, and demonstrate advantages over
heuristic methods proposed to date. Our Dirichlet process variable clustering
(DPVC) model can discover block-diagonal covariance structures in data. We
evaluate our method on both synthetic and gene expression analysis
problems.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani.
**An infinite latent attribute
model for network data**.
In *29th International Conference on Machine Learning*, Edinburgh,
Scotland, June 2012.

** Abstract:** Latent variable models for
network data extract a summary of the relational structure underlying an
observed network. The simplest possible models subdivide nodes of the network
into clusters; the probability of a link between any two nodes then depends
only on their cluster assignment. Currently available models can be
classified by whether clusters are disjoint or are allowed to overlap. These
models can explain a "flat" clustering structure. Hierarchical Bayesian
models provide a natural approach to capture more complex dependencies. We
propose a model in which objects are characterised by a latent feature
vector. Each feature is itself partitioned into disjoint groups
(subclusters), corresponding to a second layer of hierarchy. In experimental
comparisons, the model achieves significantly improved predictive performance
on social and biological link prediction tasks. The results indicate that
models with a single layer hierarchy over-simplify real networks.

Andrew Gordon Wilson, David A. Knowles, and Zoubin Ghahramani.
**Gaussian process regression
networks**.
In *29th International Conference on Machine Learning*, Edinburgh,
Scotland, June 2012.

** Abstract:** We introduce a new
regression framework, Gaussian process regression networks (GPRN), which
combines the structural properties of Bayesian neural networks with the
nonparametric flexibility of Gaussian processes. GPRN accommodates input
(predictor) dependent signal and noise correlations between multiple output
(response) variables, input dependent length-scales and amplitudes, and
heavy-tailed predictive distributions. We derive both elliptical slice
sampling and variational Bayes inference procedures for GPRN. We apply GPRN
as a multiple output regression and multivariate volatility model,
demonstrating substantially improved performance over eight popular multiple
output (multi-task) Gaussian process models and three multivariate volatility
models on real datasets, including a 1000 dimensional gene expression
dataset.

Andrew Gordon Wilson, David A. Knowles, and Zoubin Ghahramani.
**Gaussian process
regression networks**.
Technical Report arXiv:1110.4411 [stat.ML], Department of Engineering,
University of Cambridge, Cambridge, UK, October 19 2011.

**
Abstract:** We introduce a new regression framework, Gaussian process
regression networks (GPRN), which combines the structural properties of
Bayesian neural networks with the non-parametric flexibility of Gaussian
processes. This model accommodates input dependent signal and noise
correlations between multiple response variables, input dependent
length-scales and amplitudes, and heavy-tailed predictive distributions. We
derive both efficient Markov chain Monte Carlo and variational Bayes
inference procedures for this model. We apply GPRN as a multiple output
regression and multivariate volatility model, demonstrating substantially
improved performance over eight popular multiple output (multi-task) Gaussian
process models and three multivariate volatility models on benchmark
datasets, including a 1000 dimensional gene expression dataset.

** Comment:** arXiv:1110.4411

David A. Knowles and Zoubin Ghahramani.
**Nonparametric
Bayesian sparse factor models with application to gene expression
modelling**.
*Annals of Applied Statistics*, 5(2B):1534-1552, 2011.

**
Abstract:** A nonparametric Bayesian extension of Factor Analysis (FA) is
proposed where observed data Y is modeled as a linear superposition, G, of a
potentially infinite number of hidden factors, X. The Indian Buffet Process
(IBP) is used as a prior on G to incorporate sparsity and to allow the number
of latent features to be inferred. The model's utility for modeling gene
expression data is investigated using randomly generated data sets based on a
known sparse connectivity matrix for E. coli, and on three biological data
sets of increasing complexity.

David A. Knowles and Zoubin Ghahramani.
**Pitman-Yor
diffusion trees**.
In *27th Conference on Uncertainty in Artificial Intelligence*, 2011.

** Abstract:** We introduce the Pitman-Yor Diffusion Tree (PYDT)
for hierarchical clustering, a generalization of the Dirichlet Diffusion Tree
(Neal, 2001) which removes the restriction to binary branching structure. The
generative process is described and shown to result in an exchangeable
distribution over data points. We prove some theoretical properties of the
model and then present two inference methods: a collapsed MCMC sampler which
allows us to model uncertainty over tree structures, and a computationally
efficient greedy Bayesian EM search algorithm. Both algorithms use message
passing on the tree structure. The utility of the model and algorithms is
demonstrated on synthetic and real world data, both continuous and
binary.

** Comment:** web site

David A. Knowles, Jurgen Van Gael, and Zoubin Ghahramani.
**Message passing
algorithms for the Dirichlet diffusion tree**.
In *28th International Conference on Machine Learning*, 2011.

** Abstract:** We demonstrate efficient approximate inference
for the Dirichlet Diffusion Tree (Neal, 2003), a Bayesian nonparametric prior
over tree structures. Although DDTs provide a powerful and elegant approach
for modeling hierarchies, they haven't seen much use to date. One problem is
the computational cost of MCMC inference. We provide the first deterministic
approximate inference methods for DDT models and show excellent performance
compared to the MCMC alternative. We present message passing algorithms to
approximate the Bayesian model evidence for a specific tree. This is used to
drive sequential tree building and greedy search to find optimal tree
structures, corresponding to hierarchical clusterings of the data. We
demonstrate appropriate observation models for continuous and binary data.
The empirical performance of our method is very close to the computationally
expensive MCMC alternative on a density estimation problem, and significantly
outperforms kernel density estimators.

** Comment:** web site

David A. Knowles and Thomas P. Minka.
**Non-conjugate
variational message passing for multinomial and binary regression**.
In *Advances in Neural Information Processing Systems 25*, 2011.

** Abstract:** Variational Message Passing (VMP) is an
algorithmic implementation of the Variational Bayes (VB) method which applies
only in the special case of conjugate exponential family models. We propose
an extension to VMP, which we refer to as Non-conjugate Variational Message
Passing (NCVMP) which aims to alleviate this restriction while maintaining
modularity, allowing choice in how expectations are calculated, and
integrating into an existing message-passing framework: Infer.NET. We
demonstrate NCVMP on logistic binary and multinomial regression. In the
multinomial case we introduce a novel variational bound for the softmax
factor which is tighter than other commonly used bounds whilst maintaining
computational tractability.

** Comment:** web site supplementary

David A. Knowles, Leopold Parts, Daniel Glass, and John M. Winn.
**Inferring a
measure of physiological age from multiple ageing related phenotypes**.
In *NIPS Workshop: From Statistical Genetics to Predictive Models in
Personalized Medicine*, 2011.

** Abstract:** What is
ageing? One definition is simultaneous degradation of multiple organ systems.
Can an individual be said to be "old" or "young" for their (chronological)
age in a scientifically meaningful way? We investigate these questions using
ageing related phenotypes measured on the 12,000 female twins in the Twins UK
study. We propose a simple linear model of ageing, which allows a latent
adjustment to be made to an individual's chronological age to give her
"physiological age", shared across the observed phenotypes. We note problems
with the analysis resulting from the linearity assumption and show how to
alleviate these issues using a non-linear extension. We find more gene
expression probes are significantly associated with our measurement of
physiological age than with chronological age.

** Comment:** web site

Mehregan Movassagh, Mun-Kit Choy, David A. Knowles, Lina Cordeddu, Syed Haider,
Thomas Down, Lee Siggens, Ana Vujic, Ilenia Simeoni, Chris Penkett, Martin
Goddard, Pietro Lio, Martin Bennett, and Roger Foo.
**Distinct
epigenomic features in human cardiomyopathy**.
*Circulation, American Heart Association*, 2011.

**
Abstract:** Background. The epigenome refers to marks on the genome
including DNA methylation and histone modifications that regulate the
expression of underlying genes. A consistent profile of gene expression
changes in end-stage cardiomyopathy led us to hypothesise that distinct
global patterns of the epigenome may also exist. Methods and Results. We
constructed genome-wide maps of DNA methylation and Histone-3 Lysine-36
tri-methylation (H3K36me3)-enrichment for cardiomyopathic and normal human
hearts. 506Mb of sequence per library was generated by high-throughput
sequencing, covering 24 million out of the 28 million CG di-nucleotides in
the human genome. DNA methylation was significantly different in promoter
CpG-islands (CGI), intra-genic CGI, gene bodies and H3K36me3-enriched regions
of the genome. Moreover DNA methylation differences were present in promoters
of upregulated genes but not down-regulated genes. The profile of
H3K36me3-enrichment itself was also significantly different in protein-coding
regions of the genome. Conclusions. Distinct epigenomic patterns exist in
important DNA elements of the human cardiac genome in end-stage
cardiomyopathy. If epigenomic patterns track with disease progression, assays
for the epigenome may be more useful than quantification of mRNA for
assessing prognosis in heart failure. These results open up an important new
horizon of research and further studies will be needed to determine how
epigenomics contribute to altered gene expression in cardiomyopathy.

Cornelia Schone, Anne Venner, David A. Knowles, Mahesh M Karnani, and Denis
Burdakov.
**Dichotomous
cellular properties of mouse orexin/hypocretin neurons**.
*The Journal of Physiology*, 2011.

** Abstract:**
Hypothalamic hypocretin/orexin (hcrt/orx) neurons recently emerged as
critical regulators of sleep-wake cycles, reward-seeking, and body energy
balance. However, at the level of cellular and network properties, it remains
unclear whether hcrt/orx neurons are one homogenous population, or whether
there are several distinct types of hcrt/orx cells. Here, we collated diverse
structural and functional information about individual hcrt/orx neurons in
mouse brain slices, by combining patch-clamp analysis of spike firing,
membrane currents, and synaptic inputs with confocal imaging of cell shape
and subsequent 3-dimensional Sholl analysis of dendritic architecture.
Statistical cluster analysis of intrinsic firing properties revealed that
hcrt/orx neurons fall into two distinct types. These two cell types also
differ in the complexity of their dendritic arbour, the strength of AMPA and
GABAA receptor-mediated synaptic drive that they receive, and the density of
low-threshold, 4-aminopyridine-sensitive, transient K+ current. Our results
provide quantitative evidence that, at the cellular level, the mouse hcrt/orx
system is composed of two classes of neurons with different firing
properties, morphologies, and synaptic input organization.

Daniel Glass, Leopold Parts, David A. Knowles, Abraham Aviv, and Tim D.
Spector.
**No correlation between childhood maltreatment and telomere length**.
*Biological Psychiatry*, 68(6):21-22, 2010.

**
Abstract:** Telomeres are lengths of repetitive DNA that cap the ends of
chromosomes. They protect the ends of the chromosome and shorten with each
cell division. Short leukocyte telomere length has been related to a number
of age-related diseases. In addition, shorter telomere length has been
associated with environmental factors such as smoking and lack of exercise.
In a recent issue of Biological Psychiatry, Tyrka et al. (4) published a
report suggesting a link between maltreatment in childhood and telomere
shortening in 31 subjects. Individuals who had suffered maltreatment had
telomere length .70 +/- .24 compared with 1.02 +/- .52 in individuals who had
not been abused.

David A. Knowles, Leopold Parts, Daniel Glass, and John M. Winn.
**Modeling skin
and ageing phenotypes using latent variable models in Infer.NET**.
In *NIPS Workshop: Predictive Models in Personalized Medicine Workshop*,
2010.

** Abstract:** We demonstrate and compare three
unsupervised Bayesian latent variable models implemented in Infer.NET for
biomedical data modeling of 42 skin and ageing phenotypes measured on the
12,000 female twins in the Twins UK study. We address various data modeling
problems, including high missingness, heterogeneous data, and repeat
observations. We compare the proposed models in terms of their performance at
predicting disease labels and symptoms from available explanatory variables,
concluding that factor analysis type models have the strongest statistical
performance in this setting. We show that such models can be combined with
regression components for improved interpretability.

** Comment:** web site

Finale Doshi-Velez, David Knowles, Shakir Mohamed, and Zoubin Ghahramani.
**Large scale
non-parametric inference: Data parallelisation in the Indian buffet
process**.
In *Advances in Neural Information Processing Systems 23*, pages
1294-1302, Cambridge, MA, USA, December 2009. The MIT Press.

** Abstract:** Nonparametric Bayesian models provide a framework
for flexible probabilistic modelling of complex datasets. Unfortunately, the
high-dimensional averages required for Bayesian methods can be slow,
especially with the unbounded representations used by nonparametric models.
We address the challenge of scaling Bayesian inference to the increasingly
large datasets found in real-world applications. We focus on parallelisation
of inference in the Indian Buffet Process (IBP), which allows data points to
have an unbounded number of sparse latent features. Our novel MCMC sampler
divides a large data set between multiple processors and uses message passing
to compute the global likelihoods and posteriors. This algorithm, the first
parallel inference scheme for IBP-based models, scales to datasets orders of
magnitude larger than have previously been possible.

David A. Knowles and Susan Holmes.
**Statistical tools
for ultra-deep pyrosequencing of fast evolving viruses**.
In *NIPS Workshop: Computational Biology*, 2009.

**
Abstract:** We aim to detect minor variant Hepatitis B viruses (HBV) in 38
pyrosequencing samples from infected individuals. Errors involved in the
amplification and ultra deep pyrosequencing (UDPS) of these samples are
characterised using HBV plasmid controls. Homopolymeric regions and quality
scores are found to be significant covariates in determining insertion and
deletion (indel) error rates, but not mismatch rates which depend on the
nucleotide transition matrix. This knowledge is used to derive two methods
for classifying genuine mutations: a hypothesis testing framework and a
mixture model. Using an approximate "ground truth" from a limiting dilution
Sanger sequencing run, these methods are shown to outperform the naive
percentage threshold approach. The possibility of early stage PCR errors
becoming significant is investigated by simulation, which underlines the
importance of the initial copy number.

** Comment:** web site

David Knowles and Zoubin Ghahramani.
**Infinite sparse
factor analysis and infinite independent components analysis**.
In *7th International Conference on Independent Component Analysis and
Signal Separation*, pages 381-388, London, UK, September 2007. Springer,
doi:10.1007/978-3-540-74494-8_48.

** Abstract:** A
nonparametric Bayesian extension of Independent Components Analysis (ICA) is
proposed where observed data Y is modelled as a linear superposition, G, of a
potentially infinite number of hidden sources, X. Whether a given source is
active for a specific data point is specified by an infinite binary matrix,
Z. The resulting sparse representation allows increased data reduction
compared to standard ICA. We define a prior on Z using the Indian Buffet
Process (IBP). We describe four variants of the model, with Gaussian or
Laplacian priors on X and the one or two-parameter IBPs. We demonstrate
Bayesian inference under these models using a Markov chain Monte Carlo (MCMC)
algorithm on synthetic and gene expression data and compare to standard ICA
algorithms.

Manon Kok and Arno Solin.
**Scalable magnetic field SLAM in
3D using Gaussian process maps**.
In *Proceedings of the 21st International Conference on Information Fusion
(accepted for publication)*, Cambridge, UK, July 2018.

** Abstract:** We present a method for scalable and fully 3D magnetic field
simultaneous localisation and mapping (SLAM) using local anomalies in the
magnetic field as a source of position information. These anomalies are due
to the presence of ferromagnetic material in the structure of buildings and
in objects such as furniture. We represent the magnetic field map using a
Gaussian process model and take well-known physical properties of the
magnetic field into account. We build local magnetic field maps using
three-dimensional hexagonal block tiling. To make our approach
computationally tractable we use reduced-rank Gaussian process regression in
combination with a Rao-Blackwellised particle filter. We show that it is
possible to obtain accurate position and orientation estimates using
measurements from a smartphone, and that our approach provides a scalable
magnetic SLAM algorithm in terms of both computational complexity and map
storage.

Martin A. Skoglund, Zoran Sjanic, and Manon Kok.
**On
orientation estimation using iterative methods in Euclidean space**.
In *Proceedings of the 20th International Conference on Information
Fusion*, Xi'an, China, July 2017. doi:10.23919/ICIF.2017.8009830.

** Abstract:** This paper
presents three iterative methods for orientation estimation. The first two
are based on iterated Extended Kalman filter (IEKF) formulations with
different state representations. The first is using the well-known unit
quaternion as state (q-IEKF) while the other is using orientation deviation
which we call IMEKF. The third method is based on nonlinear least squares
(NLS) estimation of the angular velocity which is used to parametrise the
orientation. The results are obtained using Monte Carlo simulations and the
comparison is done with the non-iterative EKF and multiplicative EKF (MEKF)
as baseline. The result clearly shows that the IMEKF and the NLS-based method
are superior to q-IEKF and all three outperform the non-iterative
methods.

Manon Kok, Jeroen D. Hol, and Thomas B. Schön.
**Using
inertial sensors for position and orientation estimation**.
*Foundations and Trends in Signal Processing*, 11(1-2):1-153, 2017.

** Abstract:** In recent years, MEMS inertial sensors (3D
accelerometers and 3D gyroscopes) have become widely available due to their
small size and low cost. Inertial sensor measurements are obtained at high
sampling rates and can be integrated to obtain position and orientation
information. These estimates are accurate on a short time scale, but suffer
from integration drift over longer time scales. To overcome this issue,
inertial sensors are typically combined with additional sensors and models.
In this tutorial we focus on the signal processing aspects of position and
orientation estimation using inertial sensors. We discuss different modeling
choices and a selected number of important algorithms. The algorithms include
optimization-based smoothing and filtering as well as computationally cheaper
extended Kalman filter and complementary filter implementations. The quality
of their estimates is illustrated using both experimental and simulated
data.

** Comment:** arXiv

Simon Lacoste-Julien, Konstantina Palla, Alex Davies, Gjergji Kasneci, Thore
Graepel, and Zoubin Ghahramani.
**SiGMa: simple
greedy matching for aligning large knowledge bases**.
In *KDD*, pages 572-580. Association for Computing Machinery, 2013.

** Abstract:** The Internet has enabled the creation of a
growing number of large-scale knowledge bases in a variety of domains
containing complementary information. Tools for automatically aligning these
knowledge bases would make it possible to unify many sources of structured
knowledge and answer complex queries. However, the efficient alignment of
large-scale knowledge bases still poses a considerable challenge. Here, we
present Simple Greedy Matching (SiGMa), a simple algorithm for aligning
knowledge bases with millions of entities and facts. SiGMa is an iterative
propagation algorithm which leverages both the structural information from
the relationship graph as well as flexible similarity measures between entity
properties in a greedy local search, thus making it scalable. Despite its
greedy nature, our experiments indicate that SiGMa can efficiently match some
of the world's largest knowledge bases with high precision. We provide
additional experiments on benchmark datasets which demonstrate that SiGMa can
outperform state-of-the-art approaches both in accuracy and efficiency.

Simon Lacoste-Julien, Ferenc Huszár, and Zoubin Ghahramani.
**Approximate
inference for the loss-calibrated Bayesian**.
In Geoff Gordon and David Dunson, editors, *14th International Conference on
Artificial Intelligence and Statistics*, volume 15, pages 416-424,
Fort Lauderdale, FL, USA, April 2011. Journal of Machine Learning Research.

** Abstract:** We consider the problem of approximate inference
in the context of Bayesian decision theory. Traditional approaches focus on
approximating general properties of the posterior, ignoring the decision task
- and associated losses - for which the posterior could be used. We argue
that this can be suboptimal and propose instead to *loss-calibrate* the
approximate inference methods with respect to the decision task at hand. We
present a general framework rooted in Bayesian decision theory to analyze
approximate inference from the perspective of losses, opening up several
research directions. As a first loss-calibrated approximate inference
attempt, we propose an EM-like algorithm on the Bayesian posterior risk and
show how it can improve a standard approach to Gaussian process
classification when losses are asymmetric.

Ferenc Huszár and Simon Lacoste-Julien.
**A kernel approach to
tractable Bayesian nonparametrics**.
Technical report, University of Cambridge, 2011.

** Abstract:**
Inference in popular nonparametric Bayesian models typically relies on
sampling or other approximations. This paper presents a general methodology
for constructing novel tractable nonparametric Bayesian methods by applying
the kernel trick to inference in a parametric Bayesian model. For example,
Gaussian process regression can be derived this way from Bayesian linear
regression. Despite the success of the Gaussian process framework, the kernel
trick is rarely explicitly considered in the Bayesian literature. In this
paper, we aim to fill this gap and demonstrate the potential of applying the
kernel trick to tractable Bayesian parametric models in a wider context than
just regression. As an example, we present an intuitive Bayesian kernel
machine for density estimation that is obtained by applying the kernel trick
to a Gaussian generative model in feature space.

** Comment:** arXiv:1103.1761

James Robert Lloyd and Zoubin Ghahramani.
**Statistical
model criticism using kernel two sample tests**.
In *Advances in Neural Information Processing Systems 29*, pages 1-9,
Montreal, Canada, December 2015.

** Abstract:** We propose an
exploratory approach to statistical model criticism using maximum mean
discrepancy (MMD) two sample tests. Typical approaches to model criticism
require a practitioner to select a statistic by which to measure
discrepancies between data and a statistical model. MMD two sample tests are
instead constructed as an analytic maximisation over a large space of
possible statistics and therefore automatically select the statistic which
most shows any discrepancy. We demonstrate on synthetic data that the
selected statistic, called the witness function, can be used to identify
where a statistical model most misrepresents the data it was trained on. We
then apply the procedure to real data where the models being assessed are
restricted Boltzmann machines, deep belief networks and Gaussian process
regression and demonstrate the ways in which these models fail to capture the
properties of the data they are trained on.

Tomoharu Iwata, James Robert Lloyd, and Zoubin Ghahramani.
**Unsupervised
many-to-many object matching for relational data**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*,
2015.

** Abstract:** We propose a method for unsupervised
many-to-many object matching from multiple networks, which is the task of
finding correspondences between groups of nodes in different networks. For
example, the proposed method can discover shared word groups from
multi-lingual document-word networks without cross-language alignment
information. We assume that multiple networks share groups, and each group
has its own interaction pattern with other groups. Using infinite relational
models with this assumption, objects in different networks are clustered into
common groups depending on their interaction patterns, discovering a
matching. The effectiveness of the proposed method is experimentally
demonstrated by using synthetic and real relational data sets, which include
applications to cross-domain recommendation without shared user/item
identifiers and multi-lingual word clustering.

James Robert Lloyd.
**Representation,
learning, description and criticism of probabilistic models with applications
to networks, functions and relational data**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2015.

** Abstract:** This thesis makes contributions to a
variety of aspects of probabilistic inference. When performing probabilistic
inference, one must first represent one’s beliefs with a probability
distribution. Specifying the details of a probability distribution can be a
difficult task in many situations, but when expressing beliefs about complex
data structures it may not even be apparent what form such a distribution
should take. This thesis starts by demonstrating how representation theorems
due to Aldous, Hoover and Kallenberg can be used to specify appropriate
models for data in the form of networks. These theorems are then extended in
order to reveal appropriate probability distributions for arbitrary
relational data or databases. A simpler data structure to specify probability
distributions for is that of functions; many probability distributions for
functions have been used for centuries. We demonstrate that many of these
distributions can be expressed in a common language of Gaussian process
kernels constructed from a few base elements and operators. The structure of
this language allows for the effective automatic construction of
probabilistic models for functions. Furthermore, the formal mathematical
language of kernels can be mapped neatly onto natural language allowing for
automatic descriptions of the automatically constructed models. By further
automating the construction of statistical models, the need to be able to
effectively check or criticise these models becomes greater. This thesis
demonstrates how kernel two sample tests can be used to demonstrate where a
probabilistic model most disagrees with data allowing for targeted
improvements to the model. In proposing a new method of model criticism this
thesis also briefly discusses the philosophy of model criticism within the
context of probabilistic inference.

James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, and
Zoubin Ghahramani.
**Automatic construction and
natural-language description of nonparametric regression models**.
In *Association for the Advancement of Artificial Intelligence (AAAI)*,
July 2014.

** Abstract:** This paper presents the beginnings
of an automatic statistician, focusing on regression problems. Our system
explores an open-ended space of statistical models to discover a good
explanation of a data set, and then produces a detailed report with figures
and natural-language text. Our approach treats unknown regression functions
nonparametrically using Gaussian processes, which has two important
consequences. First, Gaussian processes can model functions in terms of
high-level properties (e.g. smoothness, trends, periodicity, changepoints).
Taken together with the compositional structure of our language of models
this allows us to automatically describe functions in simple terms. Second,
the use of flexible nonparametric models and a rich language for composing
them in an open-ended manner also results in state-of-the-art extrapolation
performance evaluated over 13 real time series data sets from various
domains.

José Miguel Hernández-Lobato, James Robert Lloyd, and Daniel
Hernández-Lobato.
**Gaussian process
conditional copulas with applications to financial time series**.
In *Advances in Neural Information Processing Systems 27*, pages 1-9,
Lake Tahoe, California, USA, December 2013.

** Abstract:** The
estimation of dependencies between multiple variables is a central problem in
the analysis of financial time series. A common approach is to express these
dependencies in terms of a copula function. Typically the copula function is
assumed to be constant but this may be inaccurate when there are covariates
that could have a large influence on the dependence structure of the data. To
account for this, a Bayesian framework for the estimation of conditional
copulas is proposed. In this framework the parameters of a copula are
non-linearly related to some arbitrary conditioning variables. We evaluate
the ability of our method to predict time-varying dependencies on several
equities and currencies and observe consistent performance gains compared to
static copula models and other time-varying copula methods.

David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, and
Zoubin Ghahramani.
**Structure
discovery in nonparametric regression through compositional kernel
search**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Despite its
importance, choosing the structural form of the kernel in nonparametric
regression remains a black art. We define a space of kernel structures which
are built compositionally by adding and multiplying a small number of base
kernels. We present a method for searching over this space of structures
which mirrors the scientific discovery process. The learned structures can
often decompose functions into interpretable components and enable long-range
extrapolation on time-series datasets. Our structure search method
outperforms many widely used kernels and kernel combination methods on a
variety of prediction tasks.

James Robert Lloyd.
**GEFCom2012
hierarchical load forecasting: Gradient boosting machines and Gaussian
processes**.
*International Journal of Forecasting*, 2013.

** Abstract:** This report discusses methods for forecasting hourly loads of a
US utility as part of the load forecasting track of the Global Energy
Forecasting Competition 2012 hosted on Kaggle. The methods described
(gradient boosting machines and Gaussian processes) are generic machine
learning / regression algorithms and few domain specific adjustments were
made. Despite this, the algorithms were able to produce highly competitive
predictions and hopefully they can inspire more refined techniques to
compete with state-of-the-art load forecasting methodologies.

James Robert Lloyd, Peter Orbanz, Zoubin Ghahramani, and Daniel M. Roy.
**Random
function priors for exchangeable arrays with applications to graphs and
relational data**.
In *Advances in Neural Information Processing Systems 26*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:** A
fundamental problem in the analysis of structured relational data like
graphs, networks, databases, and matrices is to extract a summary of the
common structure underlying relations between individual entities. Relational
data are typically encoded in the form of arrays; invariance to the ordering
of rows and columns corresponds to exchangeable arrays. Results in
probability theory due to Aldous, Hoover and Kallenberg show that
exchangeable arrays can be represented in terms of a random measurable
function which constitutes the natural model parameter in a Bayesian model.
We obtain a flexible yet simple Bayesian nonparametric model by placing a
Gaussian process prior on the parameter function. Efficient inference
utilises elliptical slice sampling combined with a random sparse
approximation to the Gaussian process. We demonstrate applications of the
model to network data and clarify its relation to models in the literature,
several of which emerge as special cases.

Maria Lomeli, Mark Rowland, Arthur Gretton, and Zoubin Ghahramani.
**Antithetic and Monte Carlo
kernel estimators for partial rankings**.
*arXiv preprint arXiv:1807.00400*, 2018.

** Abstract:**
In the modern age, rankings data is ubiquitous and it is useful for a variety
of applications such as recommender systems, multi-object tracking and
preference learning. However, most rankings data encountered in the real
world is incomplete, which prevents the direct application of existing
modelling tools for complete rankings. Our contribution is a novel way to
extend kernel methods for complete rankings to partial rankings, via
consistent Monte Carlo estimators for Gram matrices: matrices of kernel
values between pairs of observations. We also present a novel variance
reduction scheme based on an antithetic variate construction between
permutations to obtain an improved estimator for the Mallows kernel. The
corresponding antithetic kernel estimator has lower variance and we
demonstrate empirically that it has a better performance in a variety of
Machine Learning tasks. Both kernel estimators are based on extending kernel
mean embeddings to the embedding of a set of full rankings consistent with an
observed partial ranking. They form a computationally tractable alternative
to previous approaches for partial rankings data. An overview of the existing
kernels and metrics for permutations is also provided.

Ben Bloem-Reddy, Emile Mathieu, Adam Foster, Tom Rainforth, Hong Ge, Maria
Lomeli, and Zoubin Ghahramani.
**Sampling and inference
for discrete random probability measures in probabilistic programs**.
In *NIPS workshop on Advances in Approximate Inference*,
California, United States, December 2017.

** Abstract:** We
consider the problem of sampling a sequence from a discrete random probability
measure (RPM) with countable support, under (probabilistic)
constraints of finite memory and computation. A canonical example is sampling
from the Dirichlet Process, which can be accomplished using its well-known
stick-breaking representation and lazy initialization of its atoms. We show
that efficient lazy initialization is possible if and only if a size-biased
representation of the discrete RPM is known. For models constructed from such
discrete RPMs, we consider the implications for generic particle-based
inference methods in probabilistic programming systems. To demonstrate, we
implement posterior inference for Normalized Inverse Gaussian Process mixture
models in Turing.

Maria Lomeli, Stefano Favaro, and Yee Whye Teh.
**A
marginal sampler for sigma-Stable Poisson-Kingman mixture models**.
*Journal of Computational and Graphical Statistics*, 26:44-53, 2017.

** Abstract:** We investigate the class of sigma-stable
Poisson-Kingman random probability measures (RPMs) in the context of Bayesian
nonparametric mixture modeling. This is a large class of discrete RPMs, which
encompasses most of the popular discrete RPMs used in Bayesian
nonparametrics, such as the Dirichlet process, Pitman-Yor process, the
normalized inverse Gaussian process, and the normalized generalized Gamma
process. We show how certain sampling properties and marginal
characterizations of sigma-stable Poisson-Kingman RPMs can be usefully
exploited for devising a Markov chain Monte Carlo (MCMC) algorithm for
performing posterior inference with a Bayesian nonparametric mixture model.
Specifically, we introduce a novel and efficient MCMC sampling scheme in an
augmented space that has a small number of auxiliary variables per iteration.
We apply our sampling scheme to density estimation and clustering tasks
with unidimensional and multidimensional datasets, and compare it against
competing MCMC sampling schemes. Supplementary materials for this article are
available online.

Maria Lomeli.
**General Bayesian inference
schemes in infinite mixture models**.
PhD thesis, University College London, Gatsby Unit, London, UK, 2017.

** Abstract:** Bayesian statistical models allow us to formalise
our knowledge about the world and reason about our uncertainty, but there is
a need for better procedures to accurately encode its complexity. One way to
do so is through compositional models, which are formed by combining blocks
consisting of simpler models. One can increase the complexity of the
compositional model by either stacking more blocks or by using a
not-so-simple model as a building block. This thesis is an example of the
latter. One first aim is to expand the choice of Bayesian nonparametric (BNP)
blocks for constructing tractable compositional models. So far, most of the
models that have a Bayesian nonparametric component use a Dirichlet Process
or a Pitman-Yor process because of the availability of tractable and compact
representations. This thesis shows how to overcome certain intractabilities
in order to obtain analogous compact representations for the class of
Poisson-Kingman priors which includes the Dirichlet and Pitman-Yor processes.
A major impediment to the widespread use of Bayesian nonparametric building
blocks is that inference is often costly, intractable or difficult to carry
out. This is an active research area since dealing with the model's infinite
dimensional component forbids the direct use of standard simulation-based
methods. The main contribution of this thesis is a variety of inference
schemes that tackle this problem: Markov chain Monte Carlo and Sequential
Monte Carlo methods, which are exact inference schemes since they target the
true posterior. The contributions of this thesis, in a larger context,
provide general purpose exact inference schemes in the flavour of
probabilistic programming: the user is able to choose from a variety of
models, focusing only on the modelling part. Indeed, if the wide enough class
of Poisson-Kingman priors is used as one of our blocks, this objective is
achieved.

Maria Lomeli, Stefano Favaro, and Yee Whye Teh.
**A
hybrid sampler for Poisson-Kingman mixture models**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2015.

** Abstract:** This paper
concerns the introduction of a new Markov Chain Monte Carlo scheme for
posterior sampling in Bayesian nonparametric mixture models with priors that
belong to the general Poisson-Kingman class. We present a novel and compact
way of representing the infinite dimensional component of the model such that
while explicitly representing this infinite component it has less memory and
storage requirements than previous MCMC schemes. We describe comparative
simulation results demonstrating the efficacy of the proposed MCMC algorithm
against existing marginal and conditional MCMC samplers.

Stefano Favaro, Maria Lomeli, and Yee Whye Teh.
**On a
class of sigma-Stable Poisson-Kingman models and an effective marginalised
sampler**.
*Statistics and Computing*, 25:67-78, 2015.

** Abstract:** We investigate the use of a large class of discrete random
probability measures, which is referred to as the class Q, in the context
of Bayesian nonparametric mixture modeling. The class Q encompasses both the
two-parameter Poisson-Dirichlet process and the normalized generalized
Gamma process, thus allowing us to comparatively study the inferential
advantages of these two well-known nonparametric priors. Apart from a highly
flexible parameterization, the distinguishing feature of the class Q is the
availability of a tractable posterior distribution. This feature, in turn,
leads to an efficient marginal MCMC algorithm for posterior sampling
within the framework of mixture models. We demonstrate the efficacy of our
modeling framework on both one-dimensional and multi-dimensional
datasets.

Stefano Favaro, Maria Lomeli, Bernardo Nipoti, and Yee Whye Teh.
**Stick-breaking
representations of sigma-Stable Poisson-Kingman models**.
*Electronic Journal of Statistics*, 8:1063-1085, 2014.

** Abstract:** In this paper we investigate the stick-breaking representation
for the class of sigma-Stable Poisson-Kingman models, also known as
Gibbs-type random probability measures. This class includes as special cases
most of the discrete priors commonly used in Bayesian nonparametrics, such as
the two parameter Poisson-Dirichlet process and the normalized generalized
Gamma process. Under the assumption sigma=u/v, for any coprime integers
1<=u

Dino Sejdinovic, Heiko Strathmann, Maria Lomeli, Christophe Andrieu, and Arthur
Gretton.
**Kernel adaptive
Metropolis-Hastings**.
In *31st International Conference on Machine Learning*, pages 1-9,
Beijing, China, June 2012.

** Abstract:** A Kernel Adaptive
Metropolis-Hastings algorithm is introduced, for the purpose of sampling
from a target distribution with strongly nonlinear support. The algorithm
embeds the trajectory of the Markov chain into a reproducing kernel
Hilbert space (RKHS), such that the feature space covariance of the samples
informs the choice of proposal. The procedure is computationally efficient
and straightforward to implement, since the RKHS moves can be integrated
out analytically: our proposal distribution in the original space is a
normal distribution whose mean and covariance depend on where the current
sample lies in the support of the target distribution, and adapts to its
local covariance structure. Furthermore, the procedure requires neither
gradients nor any other higher order information about the target, making
it particularly attractive for contexts such as Pseudo-Marginal MCMC.
Kernel Adaptive Metropolis-Hastings outperforms competing fixed and adaptive
samplers on multivariate, highly nonlinear target distributions, arising
in both real-world and synthetic examples.

David Lopez-Paz, Suvrit Sra, Alex J. Smola, Zoubin Ghahramani, and Bernhard
Schölkopf.
**Randomized
nonlinear component analysis**.
In *ICML*, volume 29 of *JMLR Proceedings*. JMLR.org,
2014.

** Abstract:** Classical techniques such as Principal
Component Analysis (PCA) and Canonical Correlation Analysis (CCA) are
ubiquitous in statistics. However, these techniques only reveal linear
relationships in data. Although nonlinear variants of PCA and CCA have been
proposed, they are computationally prohibitive in the large scale. In a
separate strand of recent research, randomized methods have been proposed to
construct features that help reveal nonlinear patterns in data. For basic
tasks such as regression or classification, random features exhibit little or
no loss in performance, while achieving dramatic savings in computational
requirements. In this paper we leverage randomness to design scalable new
variants of nonlinear PCA and CCA; our ideas also extend to key multivariate
analysis tools such as spectral clustering or LDA. We demonstrate our
algorithms through experiments on real-world data, on which we compare
against the state-of-the-art. Code in R implementing our methods is provided
in the Appendix.

David Lopez-Paz, Philipp Hennig, and Bernhard Schölkopf.
**The randomized dependence
coefficient**.
In *Advances in Neural Information Processing Systems 27*, pages 1-9,
Lake Tahoe, California, USA, December 2013.

** Abstract:** We
introduce the Randomized Dependence Coefficient (RDC), a measure of
non-linear dependence between random variables of arbitrary dimension based
on the Hirschfeld-Gebelein-Rényi Maximum Correlation Coefficient. RDC
is defined in terms of correlation of random non-linear copula projections;
it is invariant with respect to marginal distribution transformations, has
low computational cost and is easy to implement: just five lines of R code,
included at the end of the paper.

David Lopez-Paz, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Gaussian
process vine copulas for multivariate dependence**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Copulas allow one to learn
marginal distributions separately from the multivariate dependence structure
(copula) that links them together into a density function. Vine
factorizations ease the learning of high-dimensional copulas by constructing
a hierarchy of conditional bivariate copulas. However, to simplify inference,
it is common to assume that each of these conditional bivariate copulas is
independent from its conditioning variables. In this paper, we relax this
assumption by discovering the latent functions that specify the shape of a
conditional copula given its conditioning variables. We learn these functions
by following a Bayesian approach based on sparse Gaussian processes with
expectation propagation for scalable, approximate inference. Experiments on
real-world datasets show that, when modeling all conditional dependencies, we
obtain better estimates of the underlying copula of the data.

David Lopez-Paz, José Miguel Hernández-Lobato, and Bernhard Schölkopf.
**Semi-supervised
domain adaptation with non-parametric copulas**.
In *Advances in Neural Information Processing Systems 26*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:** A
new framework based on the theory of copulas is proposed to address
semisupervised domain adaptation problems. The presented method factorizes
any multivariate density into a product of marginal distributions and
bivariate copula functions. Therefore, changes in each of these factors can
be detected and corrected to adapt a density model across different learning
domains. Importantly, we introduce a novel vine copula model, which allows
for this factorization in a non-parametric manner. Experimental results on
regression problems with real-world data illustrate the efficacy of the
proposed approach when compared to state-of-the-art techniques.

Alexander G. D. G. Matthews, James Hensman, Richard E. Turner, and Zoubin
Ghahramani.
**On Sparse Variational methods
and the Kullback-Leibler divergence between stochastic processes**.
In *19th International Conference on Artificial Intelligence and
Statistics*, Cadiz, Spain, May 2016.

** Abstract:** The
variational framework for learning inducing variables (Titsias, 2009a) has
had a large impact on the Gaussian process literature. The framework may be
interpreted as minimizing a rigorously defined Kullback-Leibler divergence
between the approximating and posterior processes. To our knowledge this
connection has thus far gone unremarked in the literature. In this paper we
give a substantial generalization of the literature on this topic. We give a
new proof of the result for infinite index sets which allows inducing points
that are not data points and likelihoods that depend on all function values.
We then discuss augmented index sets and show that, contrary to previous
works, marginal consistency of augmentation is not enough to guarantee
consistency of variational inference with the original model. We then
characterize an extra condition where such a guarantee is obtainable. Finally,
we show how our framework sheds light on interdomain sparse approximations
and sparse approximations for Cox processes.

James Hensman, Alexander G. D. G. Matthews, Maurizio Filippone, and Zoubin
Ghahramani.
**MCMC
for Variationally Sparse Gaussian Processes**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2015.

** Abstract:** Gaussian
process (GP) models form a core part of probabilistic machine learning.
Considerable research effort has been made into attacking three issues with
GP models: how to compute efficiently when the number of data is large; how
to approximate the posterior when the likelihood is not Gaussian; and how to
estimate covariance function parameter posteriors. This paper simultaneously
addresses these, using a variational approximation to the posterior which is
sparse in support of the function but otherwise free-form. The result is a
Hybrid Monte-Carlo sampling scheme which allows for a non-Gaussian
approximation over the function values and covariance parameters
simultaneously, with efficient computations based on inducing-point sparse
GPs. Code to replicate each experiment in this paper will be available
shortly.

James Hensman, Alexander G. D. G. Matthews, and Zoubin Ghahramani.
**Scalable
Variational Gaussian Process Classification**.
In *18th International Conference on Artificial Intelligence and
Statistics*, pages 1-9, San Diego, California, USA, May 2015.

** Abstract:** Gaussian process classification is a popular
method with a number of appealing properties. We show how to scale the model
within a variational inducing point framework, out-performing the state of
the art on benchmark datasets. Importantly, the variational formulation can be
exploited to allow classification in problems with millions of data points,
as we demonstrate in experiments.

Alexander G. D. G. Matthews, James Hensman, and Zoubin Ghahramani.
**Comparing
lower bounds on the entropy of mixture distributions for use in variational
inference**.
In *NIPS workshop on Advances in Variational Inference*,
Montreal, Canada, December 2014.

Alexander G. D. G. Matthews and Zoubin Ghahramani.
**Classification using log
Gaussian Cox processes**.
*arXiv preprint arXiv:1405.4141*, 2014.

** Abstract:**
McCullagh and Yang (2006) suggest a family of classification algorithms
based on Cox processes. We further investigate the log Gaussian variant which
has a number of appealing properties. Conditioned on the covariates, the
distribution over labels is given by a type of conditional Markov random
field. In the supervised case, computation of the predictive probability of a
single test point scales linearly with the number of training points and the
multiclass generalization is straightforward. We show new links between the
supervised method and classical nonparametric methods. We give a detailed
analysis of the pairwise graph representable Markov random field, which we
use to extend the model to semi-supervised learning problems, and propose an
inference method based on graph min-cuts. We give the first experimental
analysis on supervised and semi-supervised datasets and show good empirical
performance.

Rowan McAllister and Carl Edward Rasmussen.
**Data-efficient
reinforcement learning in continuous state-action
Gaussian-POMDPs**.
In *Advances in Neural Information Processing Systems 30*, Long Beach,
California, December 2017.

** Abstract:** We present a
data-efficient reinforcement learning method for continuous state-action
systems under significant observation noise. Data-efficient solutions under
small noise exist, such as PILCO which learns the cartpole swing-up task in
30s. PILCO evaluates policies by planning state-trajectories using a dynamics
model. However, PILCO applies policies to the observed state, therefore
planning in observation space. We extend PILCO with filtering to instead plan
in belief space, consistent with partially observable Markov decision
process (POMDP) planning. This enables data-efficient learning under
significant observation noise, outperforming more naive methods such as
post-hoc application of a filter to policies optimised by the original
(unfiltered) PILCO algorithm. We test our method on the cartpole swing-up
task, which involves nonlinear dynamics and requires nonlinear control.

Rowan McAllister, Yarin Gal, Alex Kendall, Mark van der Wilk, Amar Shah,
Roberto Cipolla, and Adrian Weller.
**Concrete problems
for autonomous vehicle safety: Advantages of Bayesian deep
learning**.
In *International Joint Conference on Artificial Intelligence*,
Melbourne, Australia, August 2017.

** Abstract:** Autonomous
vehicle (AV) software is typically composed of a pipeline of individual
components, linking sensor inputs to motor outputs. Erroneous component
outputs propagate downstream, hence safe AV software must consider the
ultimate effect of each component's errors. Further, improving safety alone
is not sufficient. Passengers must also feel safe to trust and use AV
systems. To address such concerns, we investigate three under-explored themes
for AV research: safety, interpretability, and compliance. Safety can be
improved by quantifying the uncertainties of component outputs and
propagating them forward through the pipeline. Interpretability is concerned
with explaining what the AV observes and why it makes the decisions it does,
building reassurance with the passenger. Compliance refers to maintaining
some control for the passenger. We discuss open challenges for research
within these themes. We highlight the need for concrete evaluation metrics,
propose example problems, and highlight possible solutions.

Rowan McAllister.
**Bayesian learning for
data-efficient control**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2016.

** Abstract:** Applications to learn control of
unfamiliar dynamical systems with increasing autonomy are ubiquitous. From
robotics, to finance, to industrial processing, autonomous learning helps
obviate a heavy reliance on experts for system identification and controller
design. Often real world systems are nonlinear, stochastic, and expensive to
operate (e.g. slow, energy intensive, prone to wear and tear). Ideally
therefore, nonlinear systems can be identified with minimal system
interaction. This thesis considers data efficient autonomous learning of
control of nonlinear, stochastic systems. Data efficient learning critically
requires probabilistic modelling of dynamics. Traditional control approaches
use deterministic models, which easily overfit data, especially small
datasets. We use probabilistic Bayesian modelling to learn systems from
scratch, similar to the PILCO algorithm, which achieved unprecedented data
efficiency in learning control of several benchmarks. We extend PILCO in
three principal ways. First, we learn control under significant observation
noise by simulating a filtered control process using a tractably analytic
framework of Gaussian distributions. In addition, we develop the "latent
variable belief Markov decision process" when filters must predict under
real-time constraints. Second, we improve PILCO's data efficiency by
directing exploration with predictive loss uncertainty and Bayesian
optimisation, including a novel approximation to the Gittins index. Third, we
take a step towards data efficient learning of high-dimensional control using
Bayesian neural networks (BNN). Experimentally, we show that although filtering
mitigates adverse effects of observation noise, much greater performance is
achieved when optimising controllers with evaluations faithful to reality: by
simulating closed-loop filtered control if executing closed-loop filtered
control. Thus, controllers are optimised w.r.t. how they are used,
outperforming filters applied to systems optimised by unfiltered simulations.
We show directed exploration improves data efficiency. Lastly, we show BNN
dynamics models are almost as data efficient as Gaussian process models.
Results show data efficient learning of high-dimensional control is possible
as BNNs scale to high-dimensional state inputs.

B. Bischoff, D. Nguyen-Tuong, D. van Hoof, A. McHutchon, Carl Edward Rasmussen,
A. Knoll, and M. P. Deisenroth.
**Policy search
for learning robot control using sparse data**.
In *IEEE International Conference on Robotics and Automation*, pages
3882-3887, Hong Kong, China, 2014. IEEE. doi:
10.1109/ICRA.2014.6907422.

** Abstract:** In many complex
robot applications, such as grasping and manipulation, it is difficult to
program desired task solutions beforehand, as robots are within an uncertain
and dynamic environment. In such cases, learning tasks from experience can be
a useful alternative. To obtain a sound learning and generalization
performance, machine learning, especially reinforcement learning, usually
requires sufficient data. However, in cases where only little data is
available for learning, due to system constraints and practical issues,
reinforcement learning can act suboptimally. In this paper, we investigate
how model-based reinforcement learning, in particular the probabilistic
inference for learning control method (PILCO), can be tailored to cope with
the case of sparse data to speed up learning. The basic idea is to include
further prior knowledge into the learning process. As PILCO is built on the
probabilistic Gaussian processes framework, additional system knowledge can
be incorporated by defining appropriate prior distributions, e.g. a linear
mean Gaussian prior. The resulting PILCO formulation remains in closed form
and analytically tractable. The proposed approach is evaluated in simulation
as well as on a physical robot, the Festo Robotino XT. For the robot
evaluation, we employ the approach for learning an object pick-up task. The
results show that by including prior knowledge, policy learning can be sped
up in the presence of sparse data.

Andrew McHutchon.
**Nonlinear modelling and
control using Gaussian processes**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2014.

** Abstract:** In many scientific disciplines it is
often required to make predictions about how a system will behave or to
deduce the correct control values to elicit a particular desired response.
Efficiently solving both of these tasks relies on the construction of a model
capturing the system's operation. In the most interesting situations, the
model needs to capture strongly nonlinear effects and deal with the presence
of uncertainty and noise. Building models for such systems purely based on a
theoretical understanding of underlying physical principles can be infeasibly
complex and require a large number of simplifying assumptions. An alternative
is to use a data-driven approach, which builds a model directly from
observations. A powerful and principled approach to doing this is to use a
Gaussian Process (GP).

In this thesis we start by discussing how GPs can
be applied to data sets which have noise affecting their inputs. We present
the "Noisy Input GP", which uses a simple local-linearisation to refer the
input noise into heteroscedastic output noise, and compare it to other
methods both theoretically and empirically. We show that this technique leads
to an effective model for nonlinear functions with input and output noise. We
then consider the broad topic of GP state space models for application to
dynamical systems. We discuss a very wide variety of approaches for using GPs
in state space models, including introducing a new method based on
moment-matching, which consistently gave the best performance. We analyse the
methods in some detail including providing a systematic comparison between
approximate-analytic and particle methods. To our knowledge such a comparison
has not been provided before in this area. Finally, we investigate an
automatic control learning framework, which uses Gaussian Processes to model
a system for which we wish to design a controller. Controller design for
complex systems is a difficult task and thus a framework which allows an
automatic design directly from data promises to be extremely useful. We
demonstrate that the previously published framework cannot cope with the
presence of observation noise but that the introduction of a state space
model dramatically improves its performance. This contribution, along with
some other suggested improvements, opens the door for this framework to be
used in real-world applications.

Andrew McHutchon and Carl Edward Rasmussen.
**Gaussian process
training with input noise**.
In J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 24*, pages
1341-1349, Granada, Spain, 2011. Curran Associates, Inc.

** Abstract:** In standard Gaussian Process regression input locations are
assumed to be noise free. We present a simple yet effective GP model for
training on input points corrupted by i.i.d. Gaussian noise. To make
computations tractable we use a local linear expansion about each input
point. This allows the input noise to be recast as output noise proportional
to the squared gradient of the GP posterior mean. The input noise
hyperparameters are trained alongside other hyperparameters by the usual
method of maximisation of the marginal likelihood, and allow estimation of
the noise levels on each input dimension. Training uses an iterative scheme,
which alternates between optimising the hyperparameters and calculating the
posterior gradient. Analytic predictive moments can then be found for
Gaussian distributed test points. We compare our model to others over a range
of different regression problems and show that it improves over current
methods.
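
The local linear expansion mentioned in this abstract can be sketched as follows. The notation is assumed here for illustration: $x$ is the observed (noisy) input, $\epsilon_x \sim \mathcal{N}(0, \Sigma_x)$ the input noise, $\epsilon_y \sim \mathcal{N}(0, \sigma_y^2)$ the output noise, and $\partial_{\bar f}(x)$ the gradient of the GP posterior mean at $x$.

$$
y \;=\; f(x - \epsilon_x) + \epsilon_y \;\approx\; f(x) - \epsilon_x^{\top}\,\partial_{\bar f}(x) + \epsilon_y ,
$$

$$
\operatorname{Var}\big[\,y \mid f(x)\,\big] \;\approx\; \sigma_y^2 + \partial_{\bar f}(x)^{\top}\, \Sigma_x\, \partial_{\bar f}(x).
$$

Under this approximation the input noise acts as an extra, input-dependent output noise term that grows with the squared slope of the posterior mean, which is the recasting the abstract describes.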

Shakir Mohamed, Katherine A. Heller, and Zoubin Ghahramani.
**Bayesian and
L1 approaches for sparse unsupervised learning**.
In

** Abstract:** The use of L1 regularisation for sparse learning
has generated immense research interest, with many successful applications in
diverse areas such as signal acquisition, image coding, genomics and
collaborative filtering. While existing work highlights the many advantages
of L1 methods, in this paper we find that L1 regularisation often
dramatically under-performs in terms of predictive performance when compared
to other methods for inferring sparsity. We focus on unsupervised latent
variable models, and develop L1 minimising factor models, Bayesian variants
of "L1", and Bayesian models with a stronger L0-like sparsity induced through
spike-and-slab distributions. These spike-and-slab Bayesian factor models
encourage sparsity while accounting for uncertainty in a principled manner,
and avoid unnecessary shrinkage of non-zero values. We demonstrate on a
number of data sets that in practice spike-and-slab Bayesian methods
out-perform L1 minimisation, even on a computational budget. We thus
highlight the need to re-assess the wide use of L1 methods in
sparsity-reliant applications, particularly when we care about generalising
to previously unseen data, and provide an alternative that, over many varying
conditions, provides improved generalisation performance.

Shakir Mohamed.
**Generalised Bayesian
matrix factorisation models**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2011.

** Abstract:** Factor analysis and related models for
probabilistic matrix factorisation are of central importance to the
unsupervised analysis of data, with a colourful history more than a century
long. Probabilistic models for matrix factorisation allow us to explore the
underlying structure in data, and have relevance in a vast number of
application areas including collaborative filtering, source separation,
missing data imputation, gene expression analysis, information retrieval,
computational finance and computer vision, amongst others.

This thesis
develops generalisations of matrix factorisation models that advance our
understanding and enhance the applicability of this important class of
models. The generalisation of models for matrix factorisation focuses on
three concerns: widening the applicability of latent variable models to the
diverse types of data that are currently available; considering alternative
structural forms in the underlying representations that are inferred; and
including higher order data structures into the matrix factorisation
framework. These three issues reflect the reality of modern data analysis and
we develop new models that allow for a principled exploration and use of data
in these settings. We place emphasis on Bayesian approaches to learning and
the advantages that come with the Bayesian methodology. Our port of departure
is a generalisation of latent variable models to members of the exponential
family of distributions. This generalisation allows for the analysis of data
that may be real-valued, binary, counts, non-negative or a heterogeneous set
of these data types. The model unifies various existing models and constructs
for unsupervised settings, the complementary framework to the generalised
linear models in regression.

Moving to structural considerations, we
develop Bayesian methods for learning sparse latent representations. We
define ideas of weakly and strongly sparse vectors and investigate the
classes of prior distributions that give rise to these forms of sparsity,
namely the scale-mixture of Gaussians and the spike-and-slab distribution.
Based on these sparsity favouring priors, we develop and compare methods for
sparse matrix factorisation and present the first comparison of these sparse
learning approaches. As a second structural consideration, we develop models
with the ability to generate correlated binary vectors. Moment-matching is
used to allow binary data with specified correlation to be generated, based
on dichotomisation of the Gaussian distribution. We then develop a novel and
simple method for binary PCA based on Gaussian dichotomisation. The third
generalisation considers the extension of matrix factorisation models to
multi-dimensional arrays of data that are increasingly prevalent. We develop
the first Bayesian model for non-negative tensor factorisation and explore
the relationship between this model and the previously described models for
matrix factorisation.
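
To make the idea of generating correlated binary data by dichotomising a Gaussian concrete, here is a minimal sketch for the simplest case only: two bits, each with mean 0.5, obtained by thresholding a standard bivariate Gaussian at zero. The function names and the closed-form inversion `sin(pi * r / 2)` are assumptions chosen for this illustration (the inversion follows from the standard orthant probability of a bivariate Gaussian); it is not the thesis's general moment-matching procedure.

```python
import math
import random

def latent_correlation(binary_corr):
    """Latent Gaussian correlation that yields the requested correlation
    between two zero-thresholded bits (each bit has mean 0.5)."""
    return math.sin(math.pi * binary_corr / 2.0)

def sample_correlated_bits(binary_corr, n, seed=0):
    """Draw n pairs of bits by thresholding a correlated Gaussian pair at 0."""
    rng = random.Random(seed)
    rho = latent_correlation(binary_corr)
    samples = []
    for _ in range(n):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        z1 = e1
        z2 = rho * e1 + math.sqrt(1.0 - rho * rho) * e2  # corr(z1, z2) = rho
        samples.append((int(z1 > 0), int(z2 > 0)))
    return samples

def empirical_correlation(samples):
    """Pearson correlation between the two bit columns."""
    n = len(samples)
    m1 = sum(a for a, _ in samples) / n
    m2 = sum(b for _, b in samples) / n
    cov = sum((a - m1) * (b - m2) for a, b in samples) / n
    return cov / math.sqrt(m1 * (1 - m1) * m2 * (1 - m2))

bits = sample_correlated_bits(0.5, 20000)
print(round(empirical_correlation(bits), 2))
```

With enough samples the printed value should be close to the requested 0.5; other marginals or higher dimensions need a numerical moment-matching step that this sketch omits.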

Finale Doshi-Velez, David Knowles, Shakir Mohamed, and Zoubin Ghahramani.
**Large scale
non-parametric inference: Data parallelisation in the Indian buffet
process**.
In *Advances in Neural Information Processing Systems 23*, pages
1294-1302, Cambridge, MA, USA, December 2009. The MIT Press.

** Abstract:** Nonparametric Bayesian models provide a framework
for flexible probabilistic modelling of complex datasets. Unfortunately, the
high-dimensional averages required for Bayesian methods can be slow,
especially with the unbounded representations used by nonparametric models.
We address the challenge of scaling Bayesian inference to the increasingly
large datasets found in real-world applications. We focus on parallelisation
of inference in the Indian Buffet Process (IBP), which allows data points to
have an unbounded number of sparse latent features. Our novel MCMC sampler
divides a large data set between multiple processors and uses message passing
to compute the global likelihoods and posteriors. This algorithm, the first
parallel inference scheme for IBP-based models, scales to datasets orders of
magnitude larger than have previously been possible.

Shakir Mohamed, Katherine A. Heller, and Zoubin Ghahramani.
**Bayesian
exponential family PCA**.
In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, *Advances in
Neural Information Processing Systems 21*, pages 1089-1096, Cambridge,
MA, USA, December 2009. The MIT Press.

** Abstract:**
Principal Components Analysis (PCA) has become established as one of the key
tools for dimensionality reduction when dealing with real valued data.
Approaches such as exponential family PCA and non-negative matrix
factorisation have successfully extended PCA to non-Gaussian data types, but
these techniques fail to take advantage of Bayesian inference and can suffer
from problems of overfitting and poor generalisation. This paper presents a
fully probabilistic approach to PCA, which is generalised to the exponential
family, based on Hybrid Monte Carlo sampling. We describe the model which is
based on a factorisation of the observed data matrix, and show performance of
the model on both synthetic and real data.

** Comment:** spotlight.

Mikkel N. Schmidt and Shakir Mohamed.
**Probabilistic
non-negative tensor factorization using Markov chain Monte
Carlo**.
In *European Signal Processing Conference (EUSIPCO)*, pages 1918-1922,
Glasgow, Scotland, August 2009.

** Abstract:** We present a
probabilistic model for learning non-negative tensor factorizations (NTF), in
which the tensor factors are latent variables associated with each data
dimension. The non-negativity constraint for the latent factors is handled by
choosing priors with support on the non-negative numbers. Two Bayesian
inference procedures based on Markov chain Monte Carlo sampling are
described: Gibbs sampling and Hamiltonian Markov chain Monte Carlo. We
evaluate the model on two food science data sets, and show that the
probabilistic NTF model leads to better predictions and avoids overfitting
compared to existing NTF approaches.

** Comment:** Rated by reviewers amongst the top 5% of the
presented papers.

Alexandre Khae Wu Navarro, Jes Frellsen, and Richard E. Turner.
**The Multivariate Generalised
von Mises distribution: Inference and applications**.
January 2017.

** Abstract:** Circular variables arise in a
multitude of data-modelling contexts ranging from robotics to the social
sciences, but they have been largely overlooked by the machine learning
community. This paper partially redresses this imbalance by extending some
standard probabilistic modelling tools to the circular domain. First we
introduce a new multivariate distribution over circular variables, called the
multivariate Generalised von Mises (mGvM) distribution. This distribution can
be constructed by restricting and renormalising a general multivariate
Gaussian distribution to the unit hyper-torus. Previously proposed
multivariate circular distributions are shown to be special cases of this
construction. Second, we introduce a new probabilistic model for circular
regression, that is inspired by Gaussian Processes, and a method for
probabilistic principal component analysis with circular hidden variables.
These models can leverage standard modelling tools (e.g. covariance functions
and methods for automatic relevance determination). Third, we show that the
posterior distribution in these models is an mGvM distribution, which enables
development of an efficient variational free-energy scheme for performing
approximate inference and approximate maximum-likelihood learning.

James Robert Lloyd, Peter Orbanz, Zoubin Ghahramani, and Daniel M. Roy.
**Random
function priors for exchangeable arrays with applications to graphs and
relational data**.
In *Advances in Neural Information Processing Systems 25*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:** A
fundamental problem in the analysis of structured relational data like
graphs, networks, databases, and matrices is to extract a summary of the
common structure underlying relations between individual entities. Relational
data are typically encoded in the form of arrays; invariance to the ordering
of rows and columns corresponds to exchangeable arrays. Results in
probability theory due to Aldous, Hoover and Kallenberg show that
exchangeable arrays can be represented in terms of a random measurable
function which constitutes the natural model parameter in a Bayesian model.
We obtain a flexible yet simple Bayesian nonparametric model by placing a
Gaussian process prior on the parameter function. Efficient inference
utilises elliptical slice sampling combined with a random sparse
approximation to the Gaussian process. We demonstrate applications of the
model to network data and clarify its relation to models in the literature,
several of which emerge as special cases.

Peter Orbanz.
**Projective limit
random probabilities on Polish spaces**.
*Electron. J. Stat.*, 5:1354-1373, 2011.

** Abstract:** A pivotal problem in Bayesian nonparametrics is the
construction of prior distributions on the space M(V) of probability measures
on a given domain V. In principle, such distributions on the
infinite-dimensional space M(V) can be constructed from their
finite-dimensional marginals—the most prominent example being the
construction of the Dirichlet process from finite-dimensional Dirichlet
distributions. This approach is both intuitive and applicable to the
construction of arbitrary distributions on M(V), but also hamstrung by a
number of technical difficulties. We show how these difficulties can be
resolved if the domain V is a Polish topological space, and give a
representation theorem directly applicable to the construction of any
probability distribution on M(V) whose first moment measure is well-defined.
The proof draws on a projective limit theorem of Bochner, and on properties
of set functions on Polish spaces to establish countable additivity of the
resulting random probabilities.

Sinead Williamson, Peter Orbanz, and Zoubin Ghahramani.
**Dependent
Indian buffet processes**.
In *13th International Conference on Artificial Intelligence and
Statistics*, volume 9 of *W & CP*, pages 924-931, Chia
Laguna, Sardinia, Italy, May 2010.

** Abstract:** Latent
variable models represent hidden structure in observational data. To account
for the distribution of the observational data changing over time, space or
some other covariate, we need generalizations of latent variable models that
explicitly capture this dependency on the covariate. A variety of such
generalizations has been proposed for latent variable models based on the
Dirichlet process. We address dependency on covariates in binary latent
feature models, by introducing a dependent Indian Buffet Process. The model
generates a binary random matrix with an unbounded number of columns for each
value of the covariate. Evolution of the binary matrices over the covariate
set is controlled by a hierarchical Gaussian process model. The choice of
covariance functions controls the dependence structure and exchangeability
properties of the model. We derive a Markov Chain Monte Carlo sampling
algorithm for Bayesian inference, and provide experiments on both synthetic
and real-world data. The experimental results show that explicit modeling of
dependencies significantly improves accuracy of predictions.

Peter Orbanz and Yee-Whye Teh.
**Bayesian nonparametric models**.
In *Encyclopedia of Machine Learning*. Springer, 2010.

Peter Orbanz.
**Construction of
nonparametric Bayesian models from parametric Bayes equations**.
In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta,
editors, *Advances in Neural Information Processing Systems 22*, pages
1392-1400. The MIT Press, 2009.

** Abstract:** We consider
the general problem of constructing nonparametric Bayesian models on
infinite-dimensional random objects, such as functions, infinite graphs or
infinite permutations. The problem has generated much interest in machine
learning, where it is treated heuristically, but has not been studied in full
generality in nonparametric Bayesian statistics, which tends to focus on
models over probability distributions. Our approach applies a standard tool
of stochastic process theory, the construction of stochastic processes from
their finite-dimensional marginal distributions. The main contribution of the
paper is a generalization of the classic Kolmogorov extension theorem to
conditional probabilities. This extension allows a rigorous construction of
nonparametric Bayesian models from systems of finite-dimensional, parametric
Bayes equations. Using this approach, we show (i) how existence of a
conjugate posterior for the nonparametric model can be guaranteed by choosing
conjugate finite-dimensional models in the construction, (ii) how the mapping
to the posterior parameters of the nonparametric model can be explicitly
determined, and (iii) that the construction of conjugate models in essence
requires the finite-dimensional models to be in the exponential family. As an
application of our constructive framework, we derive a model on infinite
permutations, the nonparametric Bayesian analogue of a model recently
proposed for the analysis of rank data.

** Comment:** Supplements (proofs) and techreport version

Daniel A. Braun, Pedro A. Ortega, Evangelos Theodorou, and Stefan Schaal.
**Path integral
control and bounded rationality**.
In *2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement
Learning*, 2011.

** Abstract:** Path integral methods have
recently been shown to be applicable to a very general class of optimal
control problems. Here we examine the path integral formalism from a
decision-theoretic point of view, since an optimal controller can always be
regarded as an instance of a perfectly rational decision-maker that chooses
its actions so as to maximize its expected utility. The problem with perfect
rationality is, however, that finding optimal actions is often very difficult
due to prohibitive computational resource costs that are not taken into
account. In contrast, a bounded rational decision-maker has only limited
resources and therefore needs to strike some compromise between the desired
utility and the required resource costs. In particular, we suggest an
information-theoretic measure of resource costs that can be derived
axiomatically. As a consequence we obtain a variational principle for choice
probabilities that trades off maximizing a given utility criterion and
avoiding resource costs that arise due to deviating from initially given
default choice probabilities. The resulting bounded rational policies are in
general probabilistic. We show that the solutions found by the path integral
formalism are such bounded rational policies. Furthermore, we show that the
same formalism generalizes to discrete control problems, leading to linearly
solvable bounded rational control policies in the case of Markov systems.
Importantly, Bellman's optimality principle is not presupposed by this
variational principle, but it can be derived as a limit case. This suggests
that the information theoretic formalization of bounded rationality might
serve as a general principle in control design that unifies a number of
recently reported approximate optimal control methods both in the continuous
and discrete domain.

Daniel A. Braun, Pedro A. Ortega, and Daniel M. Wolpert.
**Motor coordination:
When two have to act as one**.
*Special issue of Experimental Brain Research on Joint Action*, 2011.

** Abstract:** Trying to pass someone walking toward you in a
narrow corridor is a familiar example of a two-person motor game that
requires coordination. In this study, we investigate coordination in
sensorimotor tasks that correspond to classic coordination games with
multiple Nash equilibria, such as "choosing sides", "stag hunt", "chicken",
and "battle of sexes". In these tasks, subjects made reaching movements
reflecting their continuously evolving "decisions" while they received a
continuous payoff in the form of a resistive force counteracting their
movements. Successful coordination required two subjects to "choose" the same
Nash equilibrium in this force-payoff landscape within a single reach. We
found that on the majority of trials coordination was achieved. Compared to
the proportion of trials in which miscoordination occurred, successful
coordination was characterized by several distinct features: an increased
mutual information between the players' movement endpoints, an increased
joint entropy during the movements, and by differences in the timing of the
players' responses. Moreover, we found that the probability of successful
coordination depends on the players' initial distance from the Nash
equilibria. Our results suggest that two-person coordination arises naturally
in motor interactions and is facilitated by favorable initial positions,
stereotypical motor pattern, and differences in response times.

Pedro A. Ortega and Daniel A. Braun.
**Information,
utility and bounded rationality**.
In *The fourth conference on artificial general intelligence*, volume
6830 of *Lecture Notes on Artificial Intelligence*, pages 269-274.
Springer-Verlag, 2011.

** Abstract:** Perfectly rational
decision-makers maximize expected utility, but crucially ignore the resource
costs incurred when determining optimal actions. Here we employ an axiomatic
framework for bounded rational decision-making based on a thermodynamic
interpretation of resource costs as information costs. This leads to a
variational free utility principle akin to thermodynamical free energy that
trades off utility and information costs. We show that bounded optimal
control solutions can be derived from this variational principle, which leads
in general to stochastic policies. Furthermore, we show that risk-sensitive
and robust (minimax) control schemes fall out naturally from this framework
if the environment is considered as a bounded rational and perfectly rational
opponent, respectively. When resource costs are ignored, the maximum expected
utility principle is recovered.

Pedro A. Ortega, Daniel A. Braun, and Simon Godsill.
**Reinforcement
learning and the Bayesian control rule**.
In *The fourth conference on artificial general intelligence*, volume
6830 of *Lecture Notes on Artificial Intelligence*, pages 281-285.
Springer-Verlag, 2011.

** Abstract:** We present an
actor-critic scheme for reinforcement learning in complex domains. The main
contribution is to show that planning and I/O dynamics can be separated such
that an intractable planning problem reduces to a simple multi-armed bandit
problem, where each lever stands for a potentially arbitrarily complex
policy. Furthermore, we use the Bayesian control rule to construct an
adaptive bandit player that is universal with respect to a given class of
optimal bandit players, thus indirectly constructing an adaptive agent that
is universal with respect to a given class of policies.

Pedro A. Ortega.
**A
Unified Framework for Resource-Bounded Agents Interacting with an Unknown
Environment**.
PhD thesis, Department of Engineering, University of Cambridge, 2011.

** Abstract:** The aim of this thesis is to present a
mathematical framework for conceptualizing and constructing adaptive
autonomous systems under resource constraints. The first part of this thesis
contains a concise presentation of the foundations of classical agency:
namely the formalizations of decision making and learning. Decision making
includes: (a) subjective expected utility (SEU) theory, the framework of
decision making under uncertainty; (b) the maximum SEU principle to choose
the optimal solution; and (c) its application to the design of autonomous
systems, culminating in the Bellman optimality equations. Learning includes:
(a) Bayesian probability theory, the theory for reasoning under uncertainty
that extends logic; and (b) Bayes-Optimal agents, the application of Bayesian
probability theory to the design of optimal adaptive agents. Then, two major
problems of the maximum SEU principle are highlighted: (a) the prohibitive
computational costs and (b) the need for the causal precedence of the choice
of the policy. The second part of this thesis tackles the two aforementioned
problems. First, an information-theoretic notion of resources in autonomous
systems is established. Second, a framework for resource-bounded agency is
introduced. This includes: (a) a maximum bounded SEU principle that is
derived from a set of axioms of utility; (b) an axiomatic model of
probabilistic causality, which is applied for the formalization of autonomous
systems having uncertainty over their policy and environment; and (c) the
Bayesian control rule, which is derived from the maximum bounded SEU
principle and the model of causality, implementing a stochastic adaptive
control law that deals with the case where autonomous agents are uncertain
about their policy and environment.

Daniel A. Braun and Pedro A. Ortega.
**A minimum relative
entropy principle for adaptive control in linear quadratic
regulators**.
In *Proceedings of the 7th international conference on informatics in
control, automation and robotics*, page (in press), 2010.

**
Abstract:** The design of optimal adaptive controllers is usually based on
heuristics, because solving Bellman's equations over information states is
notoriously intractable. Approximate adaptive controllers often rely on the
principle of certainty-equivalence where the control process deals with
parameter point estimates as if they represented "true" parameter values.
Here we present a stochastic control rule instead where controls are sampled
from a posterior distribution over a set of probabilistic input-output models
and the true model is identified by Bayesian inference. This allows
reformulating the adaptive control problem as an inference and sampling
problem derived from a minimum relative entropy principle. Importantly,
inference and action sampling both work forward in time and hence such a
Bayesian adaptive controller is applicable on-line. We demonstrate the
improved performance that can be achieved by such an approach for linear
quadratic regulator examples.

Pedro A. Ortega and Daniel A. Braun.
**An axiomatic formalization of
bounded rationality based on a utility-information equivalence**.
Technical report, Dept. of Engineering, University of Cambridge, 2010.

** Abstract:** Classic decision-theory is based on the maximum
expected utility (MEU) principle, but crucially ignores the resource costs
incurred when determining optimal decisions. Here we propose an axiomatic
framework for bounded decision-making that considers resource costs. Agents
are formalized as probability measures over input-output streams. We
postulate that any such probability measure can be assigned a corresponding
conjugate utility function based on three axioms: utilities should be
real-valued, additive and monotonic mappings of probabilities. We show that
these axioms enforce a unique conversion law between utility and probability
(and thereby, information). Moreover, we show that this relation can be
characterized as a variational principle: given a utility function, its
conjugate probability measure maximizes a free utility functional.
Transformations of probability measures can then be formalized as a change in
free utility due to the addition of new constraints expressed by a target
utility function. Accordingly, one obtains a criterion to choose a
probability measure that trades off the maximization of a target utility
function and the cost of the deviation from a reference distribution. We show
that optimal control, adaptive estimation and adaptive control problems can
be solved this way in a resource-efficient way. When resource costs are
ignored, the MEU principle is recovered. Our formalization might thus provide
a principled approach to bounded rationality that establishes a close link to
information theory.

Pedro A. Ortega and Daniel A. Braun.
**A
Bayesian rule for adaptive control based on causal interventions**.
In *The third conference on artificial general intelligence*, pages
115-120, Paris, 2010. Atlantis Press.

** Abstract:**
Explaining adaptive behavior is a central problem in artificial intelligence
research. Here we formalize adaptive agents as mixture distributions over
sequences of inputs and outputs (I/O). Each distribution of the mixture
constitutes a "possible world", but the agent does not know which of the
possible worlds it is actually facing. The problem is to adapt the I/O stream
in a way that is compatible with the true world. A natural measure of
adaptation can be obtained by the Kullback-Leibler (KL) divergence between
the I/O distribution of the true world and the I/O distribution expected by
the agent that is uncertain about possible worlds. In the case of pure input
streams, the Bayesian mixture provides a well-known solution for this
problem. We show, however, that in the case of I/O streams this solution
breaks down, because outputs are issued by the agent itself and require a
different probabilistic syntax as provided by intervention calculus. Based on
this calculus, we obtain a Bayesian control rule that allows modeling
adaptive behavior with mixture distributions over I/O streams. This rule
might allow for a novel approach to adaptive control based on a minimum
KL-principle.

Pedro A. Ortega and Daniel A. Braun.
**A
conversion between utility and information**.
In *The third conference on artificial general intelligence*, pages
115-120, Paris, 2010. Atlantis Press.

** Abstract:** Rewards
typically express desirabilities or preferences over a set of alternatives.
Here we propose that rewards can be defined for any probability distribution
based on three desiderata, namely that rewards should be real-valued,
additive and order-preserving, where the latter implies that more probable
events should also be more desirable. Our main result states that rewards are
then uniquely determined by the negative information content. To analyze
stochastic processes, we define the utility of a realization as its reward
rate. Under this interpretation, we show that the expected utility of a
stochastic process is its negative entropy rate. Furthermore, we apply our
results to analyze agent-environment interactions. We show that the expected
utility that will actually be achieved by the agent is given by the negative
cross-entropy from the input-output (I/O) distribution of the coupled
interaction system and the agent's I/O distribution. Thus, our results allow
for an information-theoretic interpretation of the notion of utility and the
characterization of agent-environment interactions in terms of entropy
dynamics.

Pedro A. Ortega and Daniel A. Braun.
**A minimum relative entropy
principle for learning and acting**.
*Journal of Artificial Intelligence Research*, 38:475-511, 2010, doi 10.1613/jair.3062.

** Abstract:** This paper proposes a method to construct an
adaptive agent that is universal with respect to a given class of
experts, where each expert is designed specifically for a particular
environment. This adaptive control problem is formalized as the problem of
minimizing the relative entropy of the adaptive agent from the expert that is
most suitable for the unknown environment. If the agent is a passive
observer, then the optimal solution is the well-known Bayesian predictor.
However, if the agent is active, then its past actions need to be treated as
causal interventions on the I/O stream rather than normal probability
conditions. Here it is shown that the solution to this new variational
problem is given by a stochastic controller called the Bayesian control rule,
which implements adaptive behavior as a mixture of experts. Furthermore, it
is shown that under mild assumptions, the Bayesian control rule converges to
the control law of the most suitable expert.

Daniel A. Braun, Pedro A. Ortega, and Daniel M. Wolpert.
**Nash equilibria in
multi-agent motor interactions**.
*PLoS Computational Biology*, 5(8), 2009.

** Abstract:**
Social interactions in classic cognitive games like the
ultimatum game or the prisoner's dilemma typically lead to Nash equilibria
when multiple competitive decision makers with perfect knowledge select
optimal strategies. However, in evolutionary game theory it has been shown
that Nash equilibria can also arise as attractors in dynamical systems that
can describe, for example, the population dynamics of microorganisms. Similar
to such evolutionary dynamics, we find that Nash equilibria arise naturally
in motor interactions in which players vie for control and try to minimize
effort. When confronted with sensorimotor interaction tasks that correspond
to the classical prisoner's dilemma and the rope-pulling game, two-player
motor interactions led predominantly to Nash solutions. In contrast, when a
single player took both roles, playing the sensorimotor game bimanually,
cooperative solutions were found. Our methodology opens up a new avenue for
the study of human motor interactions within a game theoretic framework,
suggesting that the coupling of motor systems can lead to game theoretic
solutions.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani.
**A reversible
infinite hmm using normalised random measures**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We present a nonparametric prior
over reversible Markov chains. We use completely random measures,
specifically gamma processes, to construct a countably infinite graph with
weighted edges. By enforcing symmetry to make the edges undirected we define
a prior over random walks on graphs that results in a reversible Markov
chain. The resulting prior over infinite transition matrices is closely
related to the hierarchical Dirichlet process but enforces reversibility. A
reinforcement scheme has recently been proposed with similar properties, but
the de Finetti measure is not well characterised. We take the alternative
approach of explicitly constructing the mixing measure, which allows more
straightforward and efficient inference at the cost of no longer having a
closed form predictive distribution. We use our process to construct a
reversible infinite HMM which we apply to two real datasets, one from
epigenomics and one ion channel recording.

**Sigma: simple
greedy matching for aligning large knowledge bases**.
In *KDD*, pages 572-580. Association for Computing Machinery, 2013.

** Abstract:** The Internet has enabled the creation of a
growing number of large-scale knowledge bases in a variety of domains
containing complementary information. Tools for automatically aligning these
knowledge bases would make it possible to unify many sources of structured
knowledge and answer complex queries. However, the efficient alignment of
large-scale knowledge bases still poses a considerable challenge. Here, we
present Simple Greedy Matching (SiGMa), a simple algorithm for aligning
knowledge bases with millions of entities and facts. SiGMa is an iterative
propagation algorithm which leverages both the structural information from
the relationship graph as well as flexible similarity measures between entity
properties in a greedy local search, thus making it scalable. Despite its
greedy nature, our experiments indicate that SiGMa can efficiently match some
of the world's largest knowledge bases with high precision. We provide
additional experiments on benchmark datasets which demonstrate that SiGMa can
outperform state-of-the-art approaches both in accuracy and efficiency.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani.
**A
nonparametric variable clustering model**.
In *Advances in Neural Information Processing Systems 26*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:**
Factor analysis models effectively summarise the covariance structure of high
dimensional data, but the solutions are typically hard to interpret. This
motivates attempting to find a disjoint partition, i.e. a simple clustering,
of observed variables into highly correlated subsets. We introduce a Bayesian
non-parametric approach to this problem, and demonstrate advantages over
heuristic methods proposed to date. Our Dirichlet process variable clustering
(DPVC) model can discover block-diagonal covariance structures in data. We
evaluate our method on both synthetic and gene expression analysis
problems.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani.
**An infinite latent attribute
model for network data**.
In *29th International Conference on Machine Learning*, Edinburgh,
Scotland, June 2012.

** Abstract:** Latent variable models for
network data extract a summary of the relational structure underlying an
observed network. The simplest possible models subdivide nodes of the network
into clusters; the probability of a link between any two nodes then depends
only on their cluster assignment. Currently available models can be
classified by whether clusters are disjoint or are allowed to overlap. These
models can explain a "flat" clustering structure. Hierarchical Bayesian
models provide a natural approach to capture more complex dependencies. We
propose a model in which objects are characterised by a latent feature
vector. Each feature is itself partitioned into disjoint groups
(subclusters), corresponding to a second layer of hierarchy. In experimental
comparisons, the model achieves significantly improved predictive performance
on social and biological link prediction tasks. The results indicate that
models with a single layer hierarchy over-simplify real networks.

Martin Trapp, Robert Peharz, Hong Ge, Franz Pernkopf, and Carl Edward
Rasmussen.
**Deep
structured mixtures of gaussian processes**.
In *23rd International Conference on Artificial Intelligence and
Statistics*, Online, August 2020.

** Abstract:** Gaussian
Processes (GPs) are powerful non-parametric Bayesian regression models that
allow exact posterior inference, but exhibit high computational and memory
costs. In order to improve scalability of GPs, approximate posterior
inference is frequently employed, where a prominent class of approximation
techniques is based on local GP experts. However, local-expert techniques
proposed so far are either not well-principled, come with limited
approximation guarantees, or lead to intractable models. In this paper, we
introduce deep structured mixtures of GP experts, a stochastic process model
which i) allows exact posterior inference, ii) has attractive computational
and memory costs, and iii) when used as GP approximation, captures predictive
uncertainties consistently better than previous expert-based approximations.
In a variety of experiments, we show that deep structured mixtures have a low
approximation error and often perform competitively with or outperform prior
work.

Robert Peharz, Steven Lang, Antonio Vergari, Karl Stelzner, Alejandro Molina,
Martin Trapp, Guy Van den Broeck, Kristian Kersting, and Zoubin Ghahramani.
**Einsum networks: Fast and
scalable learning of tractable probabilistic circuits**.
In *37th International Conference on Machine Learning*, Online, July
2020.

** Abstract:** Probabilistic circuits (PCs) are a
promising avenue for probabilistic modeling, as they permit a wide range of
exact and efficient inference routines. Recent "deep-learning-style"
implementations of PCs strive for a better scalability, but are still
difficult to train on real-world data, due to their sparsely connected
computational graphs. In this paper, we propose Einsum Networks (EiNets), a
novel implementation design for PCs, improving prior art in several regards.
At their core, EiNets combine a large number of arithmetic operations in a
single monolithic einsum-operation, leading to speedups and memory savings of
up to two orders of magnitude, in comparison to previous implementations. As
an algorithmic contribution, we show that the implementation of
Expectation-Maximization (EM) can be simplified for PCs, by leveraging
automatic differentiation. Furthermore, we demonstrate that EiNets scale well
to datasets which were previously out of reach, such as SVHN and CelebA, and
that they can be used as faithful generative image models.

Martin Trapp, Robert Peharz, Hong Ge, Franz Pernkopf, and Zoubin Ghahramani.
**Bayesian
learning of sum-product networks**.
In *Advances in Neural Information Processing Systems 33*, Vancouver,
December 2019.

** Abstract:** Sum-product networks (SPNs) are
flexible density estimators and have received significant attention due to
their attractive inference properties. While parameter learning in SPNs is
well developed, structure learning leaves something to be desired: Even
though there is a plethora of SPN structure learners, most of them are
somewhat ad-hoc and based on intuition rather than a clear learning
principle. In this paper, we introduce a well-principled Bayesian framework
for SPN structure learning. First, we decompose the problem into i) laying
out a computational graph, and ii) learning the so-called scope function over
the graph. The first is rather unproblematic and akin to neural network
architecture validation. The second represents the effective structure of the
SPN and needs to respect the usual structural constraints in SPN, i.e.
completeness and decomposability. While representing and learning the scope
function is somewhat involved in general, in this paper, we propose a natural
parametrisation for an important and widely used special case of SPNs. These
structural parameters are incorporated into a Bayesian model, such that
simultaneous structure and parameter learning is cast into monolithic
Bayesian posterior inference. In various experiments, our Bayesian SPNs often
improve test likelihoods over greedy SPN learners. Further, since the
Bayesian framework protects against overfitting, we can evaluate
hyper-parameters directly on the Bayesian model score, waiving the need for a
separate validation set, which is especially beneficial in low data regimes.
Bayesian SPNs can be applied to heterogeneous domains and can easily be
extended to nonparametric formulations. Moreover, our Bayesian approach is
the first that consistently and robustly learns SPN structures under
missing data.

Antonio Vergari, Robert Peharz, Nicola Di Mauro, Alejandro Molina, Kristian
Kersting, and Floriana Esposito.
**Sum-product
autoencoding: Encoding and decoding representations using sum-product
networks**.
In *32nd AAAI Conference on Artificial Intelligence*, New Orleans, USA,
February 2018.

** Abstract:** Sum-Product Networks
(SPNs) are a deep probabilistic architecture that up to now has been
successfully employed for tractable inference. Here, we extend their scope
towards unsupervised representation learning: we encode samples into
continuous and categorical embeddings and show that they can also be decoded
back into the original input space by leveraging MPE inference. We
characterize when this Sum-Product Autoencoding (SPAE) leads to equivalent
reconstructions and extend it towards dealing with missing embedding
information. Our experimental results on several multilabel classification
problems demonstrate that SPAE is competitive with state-of-the-art
autoencoder architectures, even if the SPNs were never trained to reconstruct
their inputs.

Martin Trapp, Tamas Madl, Robert Peharz, Franz Pernkopf, and Robert Trappl.
**Safe
semi-supervised learning of sum-product networks**.
In *33rd Conference on Uncertainty in Artificial Intelligence*, Sydney,
Australia, August 2017.

** Abstract:** In several domains
obtaining class annotations is expensive while at the same time unlabelled
data are abundant. While most semi-supervised approaches enforce restrictive
assumptions on the data distribution, recent work has managed to learn
semi-supervised models in a non-restrictive regime. However, so far such
approaches have only been proposed for linear models. In this work, we
introduce semi-supervised parameter learning for Sum-Product Networks (SPNs).
SPNs are deep probabilistic models admitting inference in linear time in
the number of network edges. Our approach has several advantages, as it (1)
allows generative and discriminative semi-supervised learning, (2) guarantees
that adding unlabelled data can increase, but not degrade, the performance
(safe), and (3) is computationally efficient and does not enforce restrictive
assumptions on the data distribution. We show on a variety of data sets that
safe semi-supervised learning with SPNs is competitive compared to
state-of-the-art and can lead to a better generative and discriminative
objective value than a purely supervised approach.

Robert Pinsler, Peter Karkus, Andras Kupcsik, David Hsu, and Wee Sun Lee.
**Factored
contextual policy search with Bayesian optimization**.
In *IEEE International Conference on Robotics and Automation*, Montreal,
Canada, May 2019.

** Abstract:** Scarce data is a major
challenge to scaling robot learning to truly complex tasks, as we need to
generalize locally learned policies over different task contexts. Contextual
policy search offers data-efficient learning and generalization by explicitly
conditioning the policy on a parametric context space. In this paper, we
further structure the contextual policy representation. We propose to factor
contexts into two components: target contexts that describe the task
objectives, e.g. target position for throwing a ball; and environment
contexts that characterize the environment, e.g. initial position or mass of
the ball. Our key observation is that experience can be directly generalized
over target contexts. We show that this can be easily exploited in contextual
policy search algorithms. In particular, we apply factorization to a Bayesian
optimization approach to contextual policy search both in sampling-based and
active learning settings. Our simulation results show faster learning and
better generalization in various robotic domains. See our supplementary
video: https://youtu.be/MNTbBAOufDY.

Robert Pinsler, Jonathan Gordon, Eric Nalisnick, and Jose Miguel
Hernández-Lobato.
**Bayesian
batch active learning as sparse subset approximation**.
In *Advances in Neural Information Processing Systems 33*, 2019.

** Abstract:** Leveraging the wealth of unlabeled data produced
in recent years provides great potential for improving supervised models.
When the cost of acquiring labels is high, probabilistic active learning
methods can be used to greedily select the most informative data points to be
labeled. However, for many large-scale problems standard greedy procedures
become computationally infeasible and suffer from negligible model change. In
this paper, we introduce a novel Bayesian batch active learning approach that
mitigates these issues. Our approach is motivated by approximating the
complete data posterior of the model parameters. While naive batch
construction methods result in correlated queries, our algorithm produces
diverse batches that enable efficient active learning at scale. We derive
interpretable closed-form solutions akin to existing active learning
procedures for linear models, and generalize to arbitrary models using random
projections. We demonstrate the benefits of our approach on several
large-scale regression and classification tasks.

Robert Pinsler, Riad Akrour, Takayuki Osa, Jan Peters, and Gerhard Neumann.
**Sample
and feedback efficient hierarchical reinforcement learning from human
preferences**.
In *IEEE International Conference on Robotics and Automation*, Brisbane,
Australia, May 2018.

** Abstract:** While reinforcement
learning has led to promising results in robotics, defining an informative
reward function is challenging. Prior work considered including the human in
the loop to jointly learn the reward function and the optimal policy.
Generating samples from a physical robot and requesting human feedback are
both taxing efforts for which efficiency is critical. We propose to learn
reward functions from both the robot and the human perspectives to improve on
both efficiency metrics. Learning a reward function from the human
perspective increases feedback efficiency by assuming that humans rank
trajectories according to a low-dimensional outcome space. Learning a reward
function from the robot perspective circumvents the need for a dynamics model
while retaining the sample efficiency of model-based approaches. We provide
an algorithm that incorporates bi-perspective reward learning into a general
hierarchical reinforcement learning framework and demonstrate the merits of
our approach on a toy task and a simulated robot grasping task.

Sébastien Bratières, Novi Quadrianto, Sebastian Nowozin, and Zoubin
Ghahramani.
**Scalable
Gaussian Process structured prediction for grid factor graph
applications**.
In *31st International Conference on Machine Learning*, 2014.

** Abstract:** Structured prediction is an important and well
studied problem with many applications across machine learning. GPstruct is a
recently proposed structured prediction model that offers appealing
properties such as being kernelised, non-parametric, and supporting Bayesian
inference (Bratières et al. 2013). The model places a Gaussian process prior
over energy functions which describe relationships between input variables
and structured output variables. However, the memory demand of GPstruct is
quadratic in the number of latent variables and training runtime scales
cubically. This prevents GPstruct from being applied to problems involving
grid factor graphs, which are prevalent in computer vision and spatial
statistics applications. Here we explore a scalable approach to learning
GPstruct models based on ensemble learning, with weak learners (predictors)
trained on subsets of the latent variables and bootstrap data, which can
easily be distributed. We show experiments with 4M latent variables on image
segmentation. Our method outperforms widely-used conditional random field
models trained with pseudo-likelihood. Moreover, in image segmentation
problems it improves over recent state-of-the-art marginal optimisation
methods in terms of predictive performance and uncertainty calibration.
Finally, it generalises well on all training set sizes.

Novi Quadrianto, Viktoriia Sharmanska, David A. Knowles, and Zoubin Ghahramani.
**The supervised
IBP: Neighbourhood preserving infinite latent feature models**.
In *29th Conference on Uncertainty in Artificial Intelligence*,
Bellevue, USA, July 2013.

** Abstract:** We propose a
probabilistic model to infer supervised latent variables in the Hamming space
from observed data. Our model allows simultaneous inference of the number of
binary latent variables, and their values. The latent variables preserve
neighbourhood structure of the data in the sense that objects in the same
semantic concept have similar latent values, and objects in different
concepts have dissimilar latent values. We formulate the supervised infinite
latent variable problem based on an intuitive principle of pulling objects
together if they are of the same type, and pushing them apart if they are
not. We then combine this principle with a flexible Indian Buffet Process
prior on the latent variables. We show that the inferred supervised latent
variables can be directly used to perform a nearest neighbour search for the
purpose of retrieval. We introduce a new application of dynamically extending
hash codes, and show how to effectively couple the structure of the hash
codes with the continuously growing structure of the neighbourhood preserving
infinite latent feature space.

Sébastien Bratières, Novi Quadrianto, and Zoubin Ghahramani.
**Bayesian
structured prediction using Gaussian processes**.
*arXiv*, abs/1307.3846, 2013.

** Abstract:** We introduce
a conceptually novel structured prediction model, GPstruct, which is
kernelized, non-parametric and Bayesian, by design. We motivate the model
with respect to existing approaches, among others, conditional random fields
(CRFs), maximum margin Markov networks (M3N), and structured support vector
machines (SVMstruct), which embody only a subset of its properties. We
present an inference procedure based on Markov Chain Monte Carlo. The
framework can be instantiated for a wide range of structured objects such as
linear chains, trees, grids, and other general graphs. As a proof of concept,
the model is benchmarked on several natural language processing tasks and a
video gesture segmentation task involving a linear chain structure. We show
prediction accuracies for GPstruct which are comparable to or exceeding those
of CRFs and SVMstruct.

Viktoriia Sharmanska, Novi Quadrianto, and Christoph Lampert.
**Learning to rank
using privileged information**.
In *International Conference on Computer Vision*, 2013.

** Abstract:** Many computer vision problems have an asymmetric distribution
of information between training and test time. In this work, we study the
case where we are given additional information about the training data, which
however will not be available at test time. This situation is called learning
using privileged information (LUPI). We introduce two maximum-margin
techniques that are able to make use of this additional source of
information, and we show that the framework is applicable to several
scenarios that have been studied in computer vision before. Experiments with
attributes, bounding boxes, image tags and rationales as additional
information in object classification show promising results.

Novi Quadrianto, Chao Chen, and Christoph Lampert.
**The most
persistent soft-clique in a set of sampled graphs**.
In *29th International Conference on Machine Learning*, Edinburgh,
Scotland, June 2012.

** Abstract:** When searching for
characteristic subpatterns in potentially noisy graph data, it appears
self-evident that having multiple observations would be better than having
just one. However, it turns out that the inconsistencies introduced when
different graph instances have different edge sets pose a serious challenge.
In this work we address this challenge for the problem of finding maximum
weighted cliques. We introduce the concept of most persistent soft-clique.
This is a subset of vertices that 1) is almost fully or at least densely
connected, 2) occurs in all or almost all graph instances, and 3) has the
maximum weight. We present a measure of clique-ness that essentially counts
the number of edges missing to make a subset of vertices into a clique. With
this measure, we show that the problem of finding the most persistent
soft-clique can be cast either as: a) a max-min two person game
optimization problem, or b) a min-min soft margin optimization problem. Both
formulations lead to the same solution when using a partial Lagrangian method
to solve the optimization problems. By experiments on synthetic data and on
real social network data we show that the proposed method is able to reliably
find soft cliques in graph data, even if it is distorted by random noise or
unreliable observations.

Viktoriia Sharmanska, Novi Quadrianto, and Christoph Lampert.
**Augmented
attributes representations**.
In *12th European Conference on Computer Vision*, pages 242-255,
2012.

** Abstract:** We propose a new learning method to infer
a mid-level feature representation that combines the advantage of semantic
attribute representations with the higher expressive power of non-semantic
features. The idea lies in augmenting an existing attribute-based
representation with additional dimensions for which an autoencoder model is
coupled with a large-margin principle. This construction allows a smooth
transition between the zero-shot regime with no training example, the
unsupervised regime with training examples but without class labels, and the
supervised regime with training examples and with class labels. The resulting
optimization problem can be solved efficiently, because several of the
necessary steps have closed-form solutions. Through extensive experiments we
show that the augmented representation achieves better results in terms of
object categorization accuracy than the semantic representation alone.

Tatiana Tommasi, Novi Quadrianto, Barbara Caputo, and Christoph Lampert.
**Beyond dataset
bias: Multi-task unaligned shared knowledge transfer**.
In *11th Asian Conference on Computer Vision*, 2012.

** Abstract:** Many visual datasets are traditionally used to analyze the
performance of different learning techniques. The evaluation is usually done
within each dataset, therefore it is questionable if such results are a
reliable indicator of true generalization ability. We propose here an
algorithm to exploit the existing data resources when learning on a new
multiclass problem. Our main idea is to identify an image representation that
decomposes orthogonally into two subspaces: a part specific to each dataset,
and a part generic to, and therefore shared between, all the considered
source sets. This allows us to use the generic representation as un-biased
reference knowledge for a novel classification task. By casting the method in
the multi-view setting, we also make it possible to use different features
for different databases. We call the algorithm MUST, Multitask Unaligned
Shared knowledge Transfer. Through extensive experiments on five public
datasets, we show that MUST consistently improves the cross-dataset
generalization performance.

Martin Trapp, Robert Peharz, Hong Ge, Franz Pernkopf, and Carl Edward
Rasmussen.
**Deep
structured mixtures of Gaussian processes**.
In *23rd International Conference on Artificial Intelligence and
Statistics*, Online, August 2020.

** Abstract:** Gaussian
Processes (GPs) are powerful non-parametric Bayesian regression models that
allow exact posterior inference, but exhibit high computational and memory
costs. In order to improve scalability of GPs, approximate posterior
inference is frequently employed, where a prominent class of approximation
techniques is based on local GP experts. However, local-expert techniques
proposed so far are either not well-principled, come with limited
approximation guarantees, or lead to intractable models. In this paper, we
introduce deep structured mixtures of GP experts, a stochastic process model
which i) allows exact posterior inference, ii) has attractive computational
and memory costs, and iii) when used as GP approximation, captures predictive
uncertainties consistently better than previous expert-based approximations.
In a variety of experiments, we show that deep structured mixtures have a low
approximation error and often perform competitively with or outperform prior
work.

David R. Burt, Carl Edward Rasmussen, and Mark van der Wilk.
**Convergence
of sparse variational inference in Gaussian processes regression**.
*Journal of Machine Learning Research*, 21, 2020.

** Abstract:** Gaussian processes are distributions over functions that are
versatile and mathematically convenient priors in Bayesian modelling.
However, their use is often impeded for data with large numbers of
observations, N, due to the cubic (in N) cost of matrix operations used in
exact inference. Many solutions have been proposed that rely on M << N
inducing variables to form an approximation at a cost of O(NM^{2}).
While the computational cost appears linear in N, the true complexity depends
on how M must scale with N to ensure a certain quality of the approximation.
In this work, we investigate upper and lower bounds on how M needs to grow
with N to ensure high quality approximations. We show that we can make the
KL-divergence between the approximate model and the exact posterior
arbitrarily small for a Gaussian-noise regression model with M << N.
Specifically, for the popular squared exponential kernel and D-dimensional
Gaussian distributed covariates, M = O((log N)^{D}) suffice and a
method with an overall computational cost of O(N(log N)^{2D}(log log
N)^{2}) can be used to perform inference.
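
As a rough reading aid (not taken from the paper), the dominant part of the quoted cost can be recovered by substituting the stated growth rate of M into the O(NM^{2}) cost of sparse inference. The additional (log log N)^{2} factor in the abstract does not follow from this substitution alone and is left as stated by the authors.

```latex
\[
  M = O\big((\log N)^{D}\big)
  \quad\Longrightarrow\quad
  O\big(NM^{2}\big) = O\big(N(\log N)^{2D}\big).
\]
```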

Alessandro Davide Ialongo, Mark van der Wilk, James Hensman, and Carl Edward
Rasmussen.
**Overcoming mean-field
approximations in recurrent Gaussian process models**.
In *36th International Conference on Machine Learning*, Long Beach, June
2019.

** Abstract:** We identify a new variational inference
scheme for dynamical systems whose transition function is modelled by a
Gaussian process. Inference in this setting has either employed
computationally intensive MCMC methods, or relied on factorisations of the
variational posterior. As we demonstrate in our experiments, the
factorisation between latent system states and transition function can lead
to a miscalibrated posterior and to learning unnecessarily large noise terms.
We eliminate this factorisation by explicitly modelling the dependence
between state trajectories and the Gaussian process posterior. Samples of the
latent states can then be tractably generated by conditioning on this
representation. The method we obtain (VCDT: variationally coupled dynamics
and trajectories) gives better predictive performance and more calibrated
estimates of the transition function, yet maintains the same time and space
complexities as mean-field methods. Code is available at:
https://github.com/ialong/GPt.

David R. Burt, Carl Edward Rasmussen, and Mark van der Wilk.
**Rates of convergence for sparse
variational Gaussian process regression**.
*arXiv*, 2019.

** Abstract:** Excellent variational
approximations to Gaussian process posteriors have been developed which avoid
the O(N^{3}) scaling with dataset size N. They reduce the
computational cost to O(NM^{2}), with M≪N being the number of
inducing variables, which summarise the process. While the computational cost
seems to be linear in N, the true complexity of the algorithm depends on how
M must increase to ensure a certain quality of approximation. We address this
by characterising the behavior of an upper bound on the KL divergence to the
posterior. We show that with high probability the KL divergence can be made
arbitrarily small by growing M more slowly than N. A particular case of
interest is that for regression with normally distributed inputs in
D-dimensions with the popular Squared Exponential kernel,
M=O(log^{D}N) is sufficient. Our results show that as datasets grow,
Gaussian process posteriors can truly be approximated cheaply, and provide a
concrete rule for how to increase M in continual learning scenarios.

Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison.
**Deep convolutional networks as
shallow Gaussian processes**.
In *International Conference on Learning Representations (ICLR)*,
2019.

** Abstract:** We show that the output of a (residual)
convolutional neural network (CNN) with an appropriate prior over the weights
and biases is a Gaussian process (GP) in the limit of infinitely many
convolutional filters, extending similar results for dense networks. For a
CNN, the equivalent kernel can be computed exactly and, unlike "deep
kernels", has very few parameters: only the hyperparameters of the original
CNN. Further, we show that this kernel has two properties that allow it to be
computed efficiently; the cost of evaluating the kernel for a pair of images
is similar to a single forward pass through the original CNN with only one
filter per layer. The kernel equivalent to a 32-layer ResNet obtains 0.84%
classification error on MNIST, a new record for GPs with a comparable number
of parameters.

Alessandro Davide Ialongo, Mark van der Wilk, James Hensman, and Carl Edward
Rasmussen.
**Non-factorised variational
inference in dynamical systems**.
In *First Symposium on Advances in Approximate Bayesian Inference*,
Montreal, December 2018.

** Abstract:** We focus on
variational inference in dynamical systems where the discrete time transition
function (or evolution rule) is modelled by a Gaussian process. The dominant
approach so far has been to use a factorised posterior distribution,
decoupling the transition function from the system states. This is not exact
in general and can lead to an overconfident posterior over the transition
function as well as an overestimation of the intrinsic stochasticity of the
system (process noise). We propose a new method that addresses these issues
and incurs no additional computational costs.

Jan-Peter Calliess, Stephen Roberts, Carl Edward Rasmussen, and Jan
Maciejowski.
**Nonlinear set
membership regression with adaptive hyper-parameter estimation for online
learning and control**.
In *Proceedings of the European Control Conference*, 2018.

** Abstract:** Methods known as Lipschitz Interpolation or
Nonlinear Set Membership regression have become established tools for
nonparametric system-identification and data-based control. They utilise
presupposed Lipschitz properties to compute inferences over unobserved
function values. Unfortunately, they rely on the a priori knowledge of a
Lipschitz constant of the underlying target function which serves as a
hyperparameter. We propose a closed-form estimator of the Lipschitz constant
that is robust to bounded observational noise in the data. The merger of
Lipschitz Interpolation with the new hyperparameter estimator gives a new
nonparametric machine learning method for which we derive online learning
convergence guarantees. Furthermore, we apply our learning method to
model-reference adaptive control and provide a convergence guarantee on the
closed-loop dynamics. In a simulated flight manoeuvre control scenario, we
compare the performance of our approach to recently proposed alternative
learning-based controllers.
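
The inference step described above can be illustrated with a minimal sketch of textbook Lipschitz interpolation in the noise-free case. It assumes the Lipschitz constant is already known and does not reproduce the paper's closed-form estimator or its handling of bounded noise. The function name `lipschitz_interpolation` and its arguments are hypothetical, chosen only for this illustration.

```python
import numpy as np


def lipschitz_interpolation(x_train, y_train, x_query, lipschitz_const):
    """Return (prediction, lower, upper) for each query point.

    Assumes noise-free observations and a known Lipschitz constant
    (both are simplifying assumptions made for this sketch).
    """
    x_train = np.asarray(x_train, dtype=float).reshape(len(x_train), -1)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float).reshape(len(x_query), -1)

    # Pairwise Euclidean distances, shape (n_query, n_train).
    dists = np.linalg.norm(x_query[:, None, :] - x_train[None, :, :], axis=-1)

    # Tightest ceiling and floor consistent with the data and the constant.
    upper = np.min(y_train[None, :] + lipschitz_const * dists, axis=1)
    lower = np.max(y_train[None, :] - lipschitz_const * dists, axis=1)

    # Predict the midpoint of the feasible interval.
    return 0.5 * (upper + lower), lower, upper


pred, lower, upper = lipschitz_interpolation([0.0, 1.0], [0.0, 1.0], [0.5], 2.0)
print(pred, lower, upper)
```

The gap between `upper` and `lower` shrinks as data accumulate near the query point, which is why the quality of the assumed constant matters: too small a value makes the bounds invalid, too large a value makes them loose.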

Paavo Parmas, Carl Edward Rasmussen, Jan Peters, and Kenji Doya.
**PIPPS:
Flexible model-based policy search robust to the curse of chaos**.
In *35th International Conference on Machine Learning*, 2018.

** Abstract:** Previously, the exploding gradient problem has
been explained to be central in deep learning and model-based reinforcement
learning, because it causes numerical issues and instability in optimization.
Our experiments in model-based reinforcement learning imply that the problem
is not just a numerical issue, but it may be caused by a fundamental
chaos-like nature of long chains of nonlinear computations. Not only do the
magnitudes of the gradients become large, the direction of the gradients
becomes essentially random. We show that reparameterization gradients suffer
from the problem, while likelihood ratio gradients are robust. Using our
insights, we develop a model-based policy search framework, Probabilistic
Inference for Particle-Based Policy Search (PIPPS), which is easily
extensible, and allows for almost arbitrary models and policies, while
simultaneously matching the performance of previous data-efficient learning
algorithms. Finally, we invent the total propagation algorithm, which
efficiently computes a union over all pathwise derivative depths during a
single backwards pass, automatically giving greater weight to estimators with
lower variance, sometimes improving over reparameterization gradients by
10^{6} times.
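
For readers unfamiliar with the two gradient estimators named above, the following toy sketch compares them on a one-dimensional Gaussian, estimating the derivative of E[x^2] with respect to the mean (true value 2·mu). It is not PIPPS or the total propagation algorithm; the function name `gradient_estimates` and the test function x^2 are assumptions made for illustration. In this smooth one-step example the reparameterization estimator typically has the lower variance; the abstract's claim concerns long chains of nonlinear computations, which this sketch does not reproduce.

```python
import numpy as np


def gradient_estimates(mu, sigma, n_samples, seed=0):
    """Monte Carlo estimates of d/dmu E[x^2] for x ~ N(mu, sigma^2)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)
    x = mu + sigma * eps  # reparameterized sample

    # Reparameterization gradient: differentiate f(mu + sigma * eps) w.r.t. mu.
    reparam = 2.0 * x

    # Likelihood ratio gradient: f(x) times d/dmu log N(x; mu, sigma^2).
    likelihood_ratio = (x ** 2) * (x - mu) / sigma ** 2

    return reparam.mean(), likelihood_ratio.mean()


rp, lr = gradient_estimates(mu=1.0, sigma=1.0, n_samples=200_000)
print(rp, lr)  # both should be close to the true gradient 2 * mu = 2
```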

Alessandro Davide Ialongo, Mark van der Wilk, and Carl Edward Rasmussen.
**Closed-form inference and
prediction in Gaussian process state-space models**.
In *NIPS Time Series Workshop 2017*, Long Beach, December 2017.

** Abstract:** We examine an analytic variational inference
scheme for the Gaussian Process State Space Model (GPSSM) - a probabilistic
model for system identification and time-series modelling. Our approach
performs variational inference over both the system states and the transition
function. We exploit Markov structure in the true posterior, as well as an
inducing point approximation to achieve linear time complexity in the length
of the time series. Contrary to previous approaches, no Monte Carlo sampling
is required: inference is cast as a deterministic optimisation problem. In a
number of experiments, we demonstrate the ability to model non-linear
dynamics in the presence of both process and observation noise as well as to
impute missing information (e.g. velocities from raw positions through time),
to de-noise, and to estimate the underlying dimensionality of the system.
Finally, we also introduce a closed-form method for multi-step prediction,
and a novel criterion for assessing the quality of our approximate
posterior.

Rowan McAllister and Carl Edward Rasmussen.
**Data-efficient
reinforcement learning in continuous state-action
Gaussian-POMDPs**.
In *Advances in Neural Information Processing Systems 30*, Long Beach,
California, December 2017.

** Abstract:** We present a
data-efficient reinforcement learning method for continuous state-action
systems under significant observation noise. Data-efficient solutions under
small noise exist, such as PILCO which learns the cartpole swing-up task in
30s. PILCO evaluates policies by planning state-trajectories using a dynamics
model. However, PILCO applies policies to the observed state, therefore
planning in observation space. We extend PILCO with filtering to instead plan
in belief space, consistent with partially observable Markov decision
process (POMDP) planning. This enables data-efficient learning under
significant observation noise, outperforming more naive methods such as
post-hoc application of a filter to policies optimised by the original
(unfiltered) PILCO algorithm. We test our method on the cartpole swing-up
task, which involves nonlinear dynamics and requires nonlinear control.

Mark van der Wilk, Carl Edward Rasmussen, and James Hensman.
**Convolutional
Gaussian processes**.
In *Advances in Neural Information Processing Systems 30*, 2017.

** Abstract:** We present a practical way of introducing
convolutional structure into Gaussian processes, making them more suited to
high-dimensional inputs like images. The main contribution of our work is the
construction of an inter-domain inducing point approximation that is
well-tailored to the convolutional kernel. This allows us to gain the
generalisation benefit of a convolutional kernel, together with fast but
accurate posterior inference. We investigate several variations of the
convolutional kernel, and apply it to MNIST and CIFAR-10, which have both
been known to be challenging for Gaussian processes. We also show how the
marginal likelihood can be used to find an optimal weighting between
convolutional and RBF kernels to further improve performance. We hope that
this illustration of the usefulness of a marginal likelihood will help
automate discovering architectures in larger models.

** Comment:** arXiv

Matthias Stephan Bauer, Mark van der Wilk, and Carl Edward Rasmussen.
**Understanding
probabilistic sparse Gaussian process approximations**.
In *Advances in Neural Information Processing Systems 29*, 2016.

** Abstract:** Good sparse approximations are essential for
practical inference in Gaussian Processes as the computational cost of exact
methods is prohibitive for large datasets. The Fully Independent Training
Conditional (FITC) and the Variational Free Energy (VFE) approximations are
two recent popular methods. Despite superficial similarities, these
approximations have surprisingly different theoretical properties and behave
differently in practice. We thoroughly investigate the two methods for
regression both analytically and through illustrative examples, and draw
conclusions to guide practical application.

** Comment:** arXiv

Roberto Calandra, Jan Peters, Carl Edward Rasmussen, and Marc Peter Deisenroth.
**Manifold
Gaussian processes for regression**.
In *International Joint Conference on Neural Networks*, 2016.

** Abstract:** Off-the-shelf Gaussian Process (GP) covariance
functions encode smoothness assumptions on the structure of the function to
be modeled. To model complex and nondifferentiable functions, these
smoothness assumptions are often too restrictive. One way to alleviate this
limitation is to find a different representation of the data by introducing a
feature space. This feature space is often learned in an unsupervised way,
which might lead to data representations that are not useful for the overall
regression task. In this paper, we propose Manifold Gaussian Processes, a
novel supervised method that jointly learns a transformation of the data into
a feature space and a GP regression from the feature space to observed space.
The Manifold GP is a full GP and allows learning data representations that
are useful for the overall regression task. As a proof-of-concept, we
evaluate our approach on complex non-smooth functions where standard GPs
perform poorly, such as step functions and robotics tasks with contacts.

Marc Peter Deisenroth, Dieter Fox, and Carl Edward Rasmussen.
**Gaussian processes for data-efficient learning in robotics and control**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*,
37:408-423, 2015, doi
10.1109/TPAMI.2013.218.

** Abstract:** Autonomous learning
has been a promising direction in control and robotics for more than a decade
since data-driven learning reduces the amount of engineering
knowledge that is otherwise required. However, autonomous reinforcement
learning (RL) approaches typically require many interactions with the system
to learn controllers, which is a practical limitation in real systems, such
as robots, where many interactions can be impractical and time consuming. To
address this problem, current learning approaches typically require
task-specific knowledge in the form of expert demonstrations, realistic
simulators, pre-shaped policies, or specific knowledge about the underlying
dynamics. In this article, we follow a different approach and speed up
learning by extracting more information from data. In particular, we learn a
probabilistic, non-parametric Gaussian process transition model of the
system. By explicitly incorporating model uncertainty into long-term planning
and controller learning our approach reduces the effects of model errors, a
key problem in model-based learning. Compared to state-of-the-art RL, our
model-based policy search method achieves an unprecedented speed of learning.
We demonstrate its applicability to autonomous learning in real robot and
control tasks.

B. Bischoff, D. Nguyen-Tuong, D. van Hoof, A. McHutchon, Carl Edward Rasmussen,
A. Knoll, and M. P. Deisenroth.
**Policy search
for learning robot control using sparse data**.
In *IEEE International Conference on Robotics and Automation*, pages
3882-3887, Hong Kong, China, 2014. IEEE, doi
10.1109/ICRA.2014.6907422.

** Abstract:** In many complex
robot applications, such as grasping and manipulation, it is difficult to
program desired task solutions beforehand, as robots are within an uncertain
and dynamic environment. In such cases, learning tasks from experience can be
a useful alternative. To obtain a sound learning and generalization
performance, machine learning, especially, reinforcement learning, usually
requires sufficient data. However, in cases where only little data is
available for learning, due to system constraints and practical issues,
reinforcement learning can act suboptimally. In this paper, we investigate
how model-based reinforcement learning, in particular the probabilistic
inference for learning control method (PILCO), can be tailored to cope with
the case of sparse data to speed up learning. The basic idea is to include
further prior knowledge into the learning process. As PILCO is built on the
probabilistic Gaussian processes framework, additional system knowledge can
be incorporated by defining appropriate prior distributions, e.g. a linear
mean Gaussian prior. The resulting PILCO formulation remains in closed form
and analytically tractable. The proposed approach is evaluated in simulation
as well as on a physical robot, the Festo Robotino XT. For the robot
evaluation, we employ the approach for learning an object pick-up task. The
results show that by including prior knowledge, policy learning can be sped
up in the presence of sparse data.

Roger Frigola, Yutian Chen, and Carl E. Rasmussen.
**Variational
Gaussian process state-space models**.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 27*,
2014.

** Abstract:** State-space models have been successfully
used for more than fifty years in different areas of science and engineering.
We present a procedure for efficient variational Bayesian learning of
nonlinear state-space models based on sparse Gaussian processes. The result
of learning is a tractable posterior over nonlinear dynamical systems. In
comparison to conventional parametric models, we offer the possibility to
straightforwardly trade off model capacity and computational cost whilst
avoiding overfitting. Our main algorithm uses a hybrid inference approach
combining variational Bayes and sequential Monte Carlo. We also present
stochastic variational inference and online learning approaches for fast
learning with long time series.

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl Edward
Rasmussen.
**Identification of Gaussian
process state-space models with particle stochastic approximation
EM**.
In *Proceedings of the 19th World Congress of the International Federation
of Automatic Control (IFAC)*, 2014.

** Abstract:**
Gaussian process state-space models (GP-SSMs) are a very flexible family of
models of nonlinear dynamical systems. They comprise a Bayesian nonparametric
representation of the dynamics of the system and additional
(hyper-)parameters governing the properties of this nonparametric
representation. The Bayesian formalism enables systematic reasoning about the
uncertainty in the system dynamics. We present an approach to maximum
likelihood identification of the parameters in GP-SSMs, while retaining the
full nonparametric description of the dynamics. The method is based on a
stochastic approximation version of the EM algorithm that employs recent
developments in particle Markov chain Monte Carlo for efficient
identification.

Yarin Gal, Mark van der Wilk, and Carl Rasmussen.
**Distributed
variational inference in sparse Gaussian process regression and latent
variable models**.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 27*, pages
3257-3265. Curran Associates, Inc., 2014.

** Abstract:**
Gaussian processes (GPs) are a powerful tool for probabilistic inference over
functions. They have been applied to both regression and non-linear
dimensionality reduction, and offer desirable properties such as uncertainty
estimates, robustness to over-fitting, and principled ways for tuning
hyper-parameters. However the scalability of these models to big datasets
remains an active topic of research. We introduce a novel re-parametrisation
of variational inference for sparse GP regression and latent variable models
that allows for an efficient distributed algorithm. This is done by
exploiting the decoupling of the data given the inducing points to
re-formulate the evidence lower bound in a Map-Reduce setting. We show that
the inference scales well with data and computational resources, while
preserving a balanced distribution of the load among the nodes. We further
demonstrate the utility in scaling Gaussian processes to big data. We show
that GP performance improves with increasing amounts of data in regression
(on flight data with 2 million records) and latent variable modelling (on
MNIST). The results show that GPs perform better than many common models
often used for big data.

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl Edward
Rasmussen.
**Bayesian inference and learning in
Gaussian process state-space models with particle MCMC**.
In L. Bottou, C.J.C. Burges, Z. Ghahramani, M. Welling, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 26*.
Curran Associates, Inc., 2013.

** Abstract:** State-space
models are successfully used in many areas of science, engineering and
economics to model time series and dynamical systems. We present a fully
Bayesian approach to inference and learning in nonlinear nonparametric
state-space models. We place a Gaussian process prior over the transition
dynamics, resulting in a flexible model able to capture complex dynamical
phenomena. However, to enable efficient inference, we marginalize over the
dynamics of the model and instead infer directly the joint smoothing
distribution through the use of specially tailored Particle Markov Chain
Monte Carlo samplers. Once a sample from the smoothing distribution is
computed, the state transition predictive distribution can be formulated
analytically. We make use of sparse Gaussian process models to greatly reduce
the computational complexity of the approach.

Roger Frigola and Carl Edward Rasmussen.
**Integrated pre-processing for
Bayesian nonlinear system identification with Gaussian processes**.
In *Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on*,
2013.

** Abstract:** We introduce GP-FNARX: a new model for
nonlinear system identification based on a nonlinear autoregressive exogenous
model (NARX) with filtered regressors (F) where the nonlinear regression
problem is tackled using sparse Gaussian processes (GP). We integrate data
pre-processing with system identification into a fully automated procedure
that goes from raw data to an identified model. Both pre-processing
parameters and GP hyper-parameters are tuned by maximizing the marginal
likelihood of the probabilistic model. We obtain a Bayesian model of the
system's dynamics which is able to report its uncertainty in regions where
the data is scarce. The automated approach, the modeling of uncertainty and
its relatively low computational cost make GP-FNARX a good candidate for
applications in robotics and adaptive control.

Michael A. Osborne, David Duvenaud, Roman Garnett, Carl Edward Rasmussen,
Stephen J. Roberts, and Zoubin Ghahramani.
**Active
learning of model evidence using Bayesian quadrature**.
In *Advances in Neural Information Processing Systems 25*, pages 46-54,
Lake Tahoe, Nevada, USA, December 2012.

** Abstract:**
Numerical integration is a key component of many problems in scientific
computing, statistical modelling, and machine learning. Bayesian Quadrature
is a model-based method for numerical integration which, relative to standard
Monte Carlo methods, offers increased sample efficiency and a more robust
estimate of the uncertainty in the estimated integral. We propose a novel
Bayesian Quadrature approach for numerical integration when the integrand is
non-negative, such as the case of computing the marginal likelihood,
predictive distribution, or normalising constant of a probabilistic model.
Our approach approximately marginalises the quadrature model’s
hyperparameters in closed form, and introduces an active learning scheme to
optimally select function evaluations, as opposed to using Monte Carlo
samples. We demonstrate our method on both a number of synthetic benchmarks
and a real scientific problem from astronomy.

John P. Cunningham, Zoubin Ghahramani, and Carl Edward Rasmussen.
**Gaussian
Processes for time-marked time-series data**.
In *15th International Conference on Artificial Intelligence and
Statistics*, pages 255-263, 2012.

** Abstract:** In many
settings, data is collected as multiple time series, where each recorded time
series is an observation of some underlying dynamical process of interest.
These observations are often time-marked with known event times, and one
desires to do a range of standard analyses. When there is only one time
marker, one simply aligns the observations temporally on that marker. When
multiple time-markers are present and are at different times on different
time series observations, these analyses are more difficult. We describe a
Gaussian Process model for analyzing multiple time series with multiple time
markings, and we test it on a variety of data.

Marc Peter Deisenroth, Ryan D. Turner, Marco F. Huber, Uwe D. Hanebeck, and
Carl Edward Rasmussen.
**Robust
filtering and smoothing with Gaussian processes**.
*IEEE Transactions on Automatic Control*, 57(7):1865-1871, 2012, doi
10.1109/TAC.2011.2179426.

** Abstract:** We propose a
principled algorithm for robust Bayesian filtering and smoothing in nonlinear
stochastic dynamic systems when both the transition function and the
measurement function are described by nonparametric Gaussian process (GP)
models. GPs are gaining increasing importance in signal processing, machine
learning, robotics, and control for representing unknown system functions by
posterior probability distributions. This modern way of "system
identification" is more robust than finding point estimates of a parametric
function representation. Our principled filtering/smoothing approach for GP
dynamic systems is based on analytic moment matching in the context of the
forward-backward algorithm. Our numerical evaluations demonstrate the
robustness of the proposed approach in situations where other
state-of-the-art Gaussian filters and smoothers can fail.

Joseph Hall, Carl Edward Rasmussen, and Jan Maciejowski.
**Modelling and
control of nonlinear systems using Gaussian processes with partial model
information**.
In *51st IEEE Conference on Decision and Control*, 2012.

** Abstract:** Gaussian processes are gaining increasing popularity among the
control community, in particular for the modelling of discrete time state
space systems. However, it has not been clear how to incorporate model
information, in the form of known state relationships, when using a Gaussian
process as a predictive model. An obvious example of known prior information
is position and velocity related states. Incorporation of such information
would be beneficial both computationally and for faster dynamics learning.
This paper introduces a method of achieving this, yielding faster dynamics
learning and a reduction in computational effort from O(Dn^{2}) to
O((D-F)n^{2}) in the prediction stage for a system with D states, F
known state relationships and n observations. The effectiveness of the method
is demonstrated through its inclusion in the PILCO learning algorithm with
application to the swing-up and balance of a torque-limited pendulum and the
balancing of a robotic unicycle in simulation.

Ryan D. Turner and Carl Edward Rasmussen.
**Model based learning
of sigma points in unscented Kalman filtering**.
*Neurocomputing*, 80:47-53, 2012, doi
10.1016/j.neucom.2011.07.029.

** Abstract:** The unscented
Kalman filter (UKF) is a widely used method in control and time series
applications. The UKF suffers from arbitrary parameters necessary for sigma
point placement, potentially causing it to perform poorly in nonlinear
problems. We show how to treat sigma point placement in a UKF as a learning
problem in a model based view. We demonstrate that learning to place the
sigma points correctly from data can make sigma point collapse much less
likely. Learning can result in a significant increase in predictive
performance over default settings of the parameters in the UKF and other
filters designed to avoid the problems of the UKF, such as the GP-ADF. At the
same time, we maintain a lower computational complexity than the other
methods. We call our method UKF-L.
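
For readers unfamiliar with sigma point placement, the following is a minimal sketch of the standard unscented transform, not of UKF-L itself. It assumes the usual scaled parameterisation with parameters `alpha`, `beta` and `kappa`; the function name and the default values are illustrative choices, and the learning procedure described in the abstract is not implemented here.

```python
import numpy as np


def unscented_transform(f, mean, cov, alpha=1.0, beta=0.0, kappa=1.0):
    """Propagate a Gaussian N(mean, cov) through f using 2n+1 sigma points."""
    n = mean.shape[0]
    lam = alpha**2 * (n + kappa) - n  # scaling parameter
    # Columns of the scaled Cholesky factor give the sigma point offsets.
    offsets = np.linalg.cholesky((n + lam) * cov)
    points = np.vstack([mean, mean + offsets.T, mean - offsets.T])  # (2n+1, n)

    # Weights for the mean and covariance estimates.
    w_mean = np.full(2 * n + 1, 1.0 / (2.0 * (n + lam)))
    w_cov = w_mean.copy()
    w_mean[0] = lam / (n + lam)
    w_cov[0] = lam / (n + lam) + (1.0 - alpha**2 + beta)

    outputs = np.array([f(p) for p in points])
    out_mean = w_mean @ outputs
    centred = outputs - out_mean
    out_cov = (w_cov[:, None] * centred).T @ centred
    return out_mean, out_cov


# For a linear map the transform should be exact, which makes a handy check.
A = np.array([[1.0, 2.0], [0.0, 1.0]])
b = np.array([0.5, -1.0])
m = np.array([1.0, 2.0])
P = np.array([[2.0, 0.3], [0.3, 1.0]])
ut_mean, ut_cov = unscented_transform(lambda x: A @ x + b, m, P)
```

The three parameters control how far the sigma points sit from the mean and how the central point is weighted, which is why their choice matters for nonlinear `f`.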

Marc Peter Deisenroth, Carl Edward Rasmussen, and Dieter Fox.
**Learning to
control a low-cost manipulator using data-efficient reinforcement
learning**.
In *9th International Conference on Robotics: Science & Systems*, Los
Angeles, CA, USA, June 2011.

** Abstract:** Over the last
years, there has been substantial progress in robust manipulation in
unstructured environments. The long-term goal of our work is to get away from
precise, but very expensive robotic systems and to develop affordable,
potentially imprecise, self-adaptive manipulator systems that can
interactively perform tasks such as playing with children. In this paper, we
demonstrate how a low-cost off-the-shelf robotic system can learn closed-loop
policies for a stacking task in only a handful of trials - from scratch. Our
manipulator is inaccurate and provides no pose feedback. For learning a
controller in the work space of a Kinect-style depth camera, we use a
model-based reinforcement learning technique. Our learning method is data
efficient, reduces model bias, and deals with several noise sources in a
principled way during long-term planning. We present a way of incorporating
state-space constraints into the learning process and analyze the learning
gain by exploiting the sequential structure of the stacking task.

** Comment:** project site

David Duvenaud, Hannes Nickisch, and Carl Edward Rasmussen.
**Additive
Gaussian processes**.
In *Advances in Neural Information Processing Systems 24*, pages
226-234, Granada, Spain, 2011.

** Abstract:** We introduce a
Gaussian process model of functions which are additive. An additive function
is one which decomposes into a sum of low-dimensional functions, each
depending on only a subset of the input variables. Additive GPs generalize
both Generalized Additive Models, and the standard GP models which use
squared-exponential kernels. Hyperparameter learning in this model can be
seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an
expressive but tractable parameterization of the kernel function, which
allows efficient evaluation of all input interaction terms, whose number is
exponential in the input dimension. The additional structure discoverable by
this model results in increased interpretability, as well as state-of-the-art
predictive power in regression tasks.
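
The claim that all interaction terms can be evaluated efficiently can be illustrated with a short sketch. It assumes the order-d additive kernel is the sum, over all size-d subsets of input dimensions, of the product of one-dimensional base kernels, scaled by a per-order variance. The recursion below is a generic dynamic programme for elementary symmetric polynomials; the recursion used in the paper may differ, and all function names are hypothetical.

```python
import numpy as np
from itertools import combinations


def additive_kernel_orders(base_values):
    """Return e[d] = sum over all d-subsets of the product of base kernel values.

    base_values holds the D one-dimensional kernel evaluations k_i(x_i, x_i')
    for a single pair of inputs. The loop costs O(D^2) rather than O(2^D).
    """
    D = len(base_values)
    e = np.zeros(D + 1)
    e[0] = 1.0
    for i, z in enumerate(base_values, start=1):
        # Update high orders first so each value enters a product at most once.
        for d in range(i, 0, -1):
            e[d] += z * e[d - 1]
    return e[1:]


def additive_kernel(x, y, lengthscales, order_variances):
    """Additive kernel built from one-dimensional squared-exponential kernels."""
    base = np.exp(-0.5 * ((x - y) / lengthscales) ** 2)
    return float(order_variances @ additive_kernel_orders(base))


# Compare against brute-force enumeration of every subset.
base = np.array([0.9, 0.5, 0.3, 0.7])
fast = additive_kernel_orders(base)
brute = np.array([
    sum(np.prod(c) for c in combinations(base, d)) for d in range(1, 5)
])
```

Putting all the variance on the first order gives a generalized additive model, while putting it all on order D recovers the product of the base kernels, that is, the ordinary squared-exponential kernel.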

Marc Peter Deisenroth and Carl Edward Rasmussen.
**PILCO: A
model-based and data-efficient approach to policy search**.
In *28th International Conference on Machine Learning*, 2011.

** Abstract:** In this paper, we introduce PILCO, a practical,
data-efficient model-based policy search method. PILCO reduces model bias,
one of the key problems of model-based reinforcement learning, in a
principled way. By learning a probabilistic dynamics model and explicitly
incorporating model uncertainty into long-term planning, PILCO can cope with
very little data and facilitates learning from scratch in only a few trials.
Policy evaluation is performed in closed form using state-of-the-art
approximate inference. Furthermore, policy gradients are computed
analytically for policy improvement. We report unprecedented learning
efficiency on challenging and high-dimensional control tasks.

** Comment:** web site

Joseph Hall, Carl Edward Rasmussen, and Jan Maciejowski.
**Reinforcement
learning with reference tracking control in continuous state spaces**.
In *Proceedings of 50th IEEE Conference on Decision and Control and European
Control Conference*, 2011.

** Abstract:** The contribution
described in this paper is an algorithm for learning nonlinear, reference
tracking, control policies given no prior knowledge of the dynamical system
and limited interaction with the system through the learning process.
Concepts from the field of reinforcement learning, Bayesian statistics and
classical control have been brought together in the formulation of this
algorithm which can be viewed as a form of indirect self-tuning regulator. On
the task of reference tracking using the inverted pendulum it was shown to
yield generally improved performance over the best controller derived from the
standard linear quadratic method using only 30 s of total interaction with
the system. Finally, the algorithm was shown to work on the double pendulum
proving its ability to solve nontrivial control tasks.

Andrew McHutchon and Carl Edward Rasmussen.
**Gaussian process
training with input noise**.
In J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 24*, pages
1341-1349, Granada, Spain, 2011. Curran Associates, Inc.

** Abstract:** In standard Gaussian Process regression input locations are
assumed to be noise free. We present a simple yet effective GP model for
training on input points corrupted by i.i.d. Gaussian noise. To make
computations tractable we use a local linear expansion about each input
point. This allows the input noise to be recast as output noise proportional
to the squared gradient of the GP posterior mean. The input noise
hyperparameters are trained alongside other hyperparameters by the usual
method of maximisation of the marginal likelihood, and allow estimation of
the noise levels on each input dimension. Training uses an iterative scheme,
which alternates between optimising the hyperparameters and calculating the
posterior gradient. Analytic predictive moments can then be found for
Gaussian distributed test points. We compare our model to others over a range
of different regression problems and show that it improves over current
methods.
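
The recasting of input noise as output noise can be sketched numerically. Under a first-order expansion, an input perturbation with per-dimension variance is scaled by the slope of the function, so the extra output variance at a point is the squared gradient weighted by the input noise variances. The snippet below assumes the slopes are already known, whereas the abstract describes obtaining them from the GP posterior mean in an iterative scheme; the function name is hypothetical.

```python
import numpy as np


def effective_noise_variance(mean_gradients, input_noise_var, output_noise_var):
    """Per-point output noise variance under a first-order expansion.

    mean_gradients:   (N, D) slope of the posterior mean at each training input
    input_noise_var:  (D,) input noise variance for each dimension
    output_noise_var: scalar output noise variance
    """
    mean_gradients = np.asarray(mean_gradients, dtype=float)
    return output_noise_var + (mean_gradients**2) @ np.asarray(input_noise_var)


# Illustration with f(x) = sin(x), whose slope cos(x) is known exactly.
x = np.array([0.0, np.pi / 2])
variances = effective_noise_variance(np.cos(x)[:, None], [0.04], 0.01)
```

Where the function is steep (x = 0) the input noise inflates the effective output noise, and where it is flat (x = π/2) only the ordinary output noise remains.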

Carl Edward Rasmussen and Hannes Nickisch.
**Gaussian
Processes for Machine Learning (GPML) Toolbox**.
*Journal of Machine Learning Research*, 11:3011-3015, December 2010.

** Abstract:** The GPML toolbox provides a wide range of
functionality for Gaussian process (GP) inference and prediction. GPs are
specified by mean and covariance functions; we offer a library of simple mean
and covariance functions and mechanisms to compose more complex ones. Several
likelihood functions are supported including Gaussian and heavy-tailed for
regression as well as others suitable for classification. Finally, a range of
inference methods is provided, including exact and variational inference,
Expectation Propagation, and Laplace's method dealing with non-Gaussian
likelihoods and FITC for dealing with large regression tasks.

** Comment:** Toolbox available from here. Implements algorithms
from Rasmussen and Williams, 2006.

Hannes Nickisch and Carl Edward Rasmussen.
**Gaussian mixture
modeling with Gaussian process latent variable models**.
In *Proceedings of the 32nd DAGM Symposium on Pattern Recognition*,
Lecture Notes in Computer Science (LNCS), Darmstadt, Germany, September 2010.
Springer, doi
10.1007/978-3-642-15986-2_28.

** Abstract:** Density
modeling is notoriously difficult for high dimensional data. One approach to
the problem is to search for a lower dimensional manifold which captures the
main characteristics of the data. Recently, the Gaussian Process Latent
Variable Model (GPLVM) has successfully been used to find low dimensional
manifolds in a variety of complex data. The GPLVM consists of a set of points
in a low dimensional latent space, and a stochastic map to the observed
space. We show how it can be interpreted as a density model in the observed
space. However, the GPLVM is not trained as a density model and therefore
yields bad density estimates. We propose a new training strategy and obtain
improved generalisation performance and better density estimates in
comparative evaluations on several benchmark data sets.

Ryan Turner and Carl Edward Rasmussen.
**Model based learning
of sigma points in unscented Kalman filtering**.
In Samuel Kaski, David J. Miller, Erkki Oja, and Antti Honkela, editors,
*Machine Learning for Signal Processing (MLSP 2010)*, pages 178-183,
Kittilä, Finland, August 2010.

** Abstract:** The
unscented Kalman filter (UKF) is a widely used method in control and time
series applications. The UKF suffers from arbitrary parameters necessary for
a step known as sigma point placement, causing it to perform poorly in
nonlinear problems. We show how to treat sigma point placement in a UKF as a
learning problem in a model based view. We demonstrate that learning to place
the sigma points correctly from data can make sigma point collapse much less
likely. Learning can result in a significant increase in predictive
performance over default settings of the parameters in the UKF and other
filters designed to avoid the problems of the UKF, such as the GP-ADF. At the
same time, we maintain a lower computational complexity than the other
methods. We call our method UKF-L.

Dilan Görür and Carl Edward Rasmussen.
**Dirichlet process
Gaussian mixture models: Choice of the base distribution**.
*Journal of Computer Science and Technology*, 25(4):615-625, July 2010,
doi
10.1007/s11390-010-9355-8.

** Abstract:** In the Bayesian
mixture modeling framework it is possible to infer the necessary number of
components to model the data and therefore it is unnecessary to explicitly
restrict the number of components. Nonparametric mixture models sidestep the
problem of finding the "correct" number of mixture components by assuming
infinitely many components. In this paper Dirichlet process mixture (DPM)
models are cast as infinite mixture models and inference using Markov chain
Monte Carlo is described. The specification of the priors on the model
parameters is often guided by mathematical and practical convenience. The
primary goal of this paper is to compare the choice of conjugate and
non-conjugate base distributions on a particular class of DPM models which is
widely used in applications, the Dirichlet process Gaussian mixture model
(DPGMM). We compare computational efficiency and modeling performance of
DPGMM defined using a conjugate and a conditionally conjugate base
distribution. We show that better density models can result from using a
wider class of priors with no or only a modest increase in computational
effort.

Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen,
and Aníbal Figueiras-Vidal.
**Sparse
spectrum Gaussian process regression**.
*Journal of Machine Learning Research*, 11:1865-1881, June 2010.

** Abstract:** We present a new sparse Gaussian Process (GP)
model for regression. The key novel idea is to sparsify the *spectral
representation* of the GP. This leads to a simple, practical algorithm for
regression tasks. We compare the achievable trade-offs between predictive
accuracy and computational requirements, and show that these are typically
superior to existing state-of-the-art sparse approximations. We discuss both
the weight space and function space representations, and note that the new
construction implies priors over functions which are always stationary, and
can approximate any covariance function in this class.
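
A rough sketch of the weight space view follows. It assumes a squared-exponential target kernel, whose spectral density is Gaussian, and draws the spectral points at random and keeps them fixed; the paper's treatment of the spectral points may differ. Function names and parameter values are illustrative only.

```python
import numpy as np


def sparse_spectrum_features(X, freqs, signal_var):
    """Map inputs (N, D) to 2m trigonometric features for m spectral points."""
    proj = X @ freqs.T  # (N, m)
    scale = np.sqrt(signal_var / freqs.shape[0])
    return scale * np.hstack([np.cos(proj), np.sin(proj)])


def fit_predict(X, y, X_test, freqs, signal_var, noise_var):
    """Posterior mean of Bayesian linear regression on the features."""
    Phi = sparse_spectrum_features(X, freqs, signal_var)
    Phi_test = sparse_spectrum_features(X_test, freqs, signal_var)
    A = Phi.T @ Phi + noise_var * np.eye(Phi.shape[1])
    weights = np.linalg.solve(A, Phi.T @ y)
    return Phi_test @ weights


rng = np.random.default_rng(0)
lengthscale, signal_var = 1.0, 1.0
# Angular frequencies drawn from the spectral density of the SE kernel.
freqs = rng.normal(size=(2000, 1)) / lengthscale
X = np.linspace(-2.0, 2.0, 5)[:, None]

Phi = sparse_spectrum_features(X, freqs, signal_var)
K_approx = Phi @ Phi.T  # function space view of the same model
K_exact = signal_var * np.exp(-0.5 * (X - X.T) ** 2 / lengthscale**2)

# Regression with a much smaller set of spectral points.
pred = fit_predict(X, np.sin(X[:, 0]), X, freqs[:100], signal_var, 1e-2)
```

Because each feature pair contributes a cosine of the input difference, the implied covariance depends only on that difference, which is the stationarity property noted in the abstract.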

Yunus Saatçi, Ryan Turner, and Carl Edward Rasmussen.
**Gaussian process
change point models**.
In *27th International Conference on Machine Learning*, pages 927-934,
Haifa, Israel, June 2010.

** Abstract:** We combine Bayesian
online change point detection with Gaussian processes to create a
nonparametric time series model which can handle change points. The model can
be used to locate change points in an online manner; and, unlike other
Bayesian online change point detection algorithms, is applicable when
temporal correlations in a regime are expected. We show three variations on
how to apply Gaussian processes in the change point context, each with their
own advantages. We present methods to reduce the computational burden of
these models and demonstrate them on several real-world data sets.

Ryan Turner, Marc Peter Deisenroth, and Carl Edward Rasmussen.
**State-space
inference and learning with Gaussian processes**.
In Yee Whye Teh and Mike Titterington, editors, *13th International
Conference on Artificial Intelligence and Statistics*, volume 9 of
*W & CP*, pages 868-875, Chia Laguna, Sardinia, Italy, May 13-15
2010. Journal of Machine Learning Research.

** Abstract:**
State-space inference and learning with Gaussian processes (GPs) is an
unsolved problem. We propose a new, general methodology for inference and
learning in nonlinear state-space models that are described probabilistically
by non-parametric GP models. We apply the expectation maximization algorithm
to iterate between inference in the latent state-space and learning the
parameters of the underlying GP dynamics model.

** Comment:** poster.

Ryan Turner, Marc Peter Deisenroth, and Carl Edward Rasmussen.
**System
identification in Gaussian process dynamical systems**.
In Dilan Görür, editor, *NIPS Workshop on Nonparametric Bayes*,
Whistler, BC, Canada, December 2009.

** Comment:** poster.

Ryan Turner, Yunus Saatçi, and Carl Edward Rasmussen.
**Adaptive
sequential Bayesian change point detection**.
In Zaïd Harchaoui, editor, *NIPS Workshop on Temporal Segmentation*,
Whistler, BC, Canada, December 2009.

** Abstract:** Real-world
time series are often nonstationary with respect to the parameters of some
underlying prediction model (UPM). Furthermore, it is often desirable to
adapt the UPM to incoming regime changes as soon as possible, necessitating
sequential inference about change point locations. A Bayesian algorithm for
online change point detection (BOCPD) has been introduced recently by Adams
and MacKay (2007). In this algorithm, uncertainty about the last change point
location is updated sequentially, and is integrated out to make online
predictions robust to parameter changes. BOCPD requires a set of fixed
hyper-parameters which allow the user to fully specify the hazard function
for change points and the prior distribution over the parameters of the UPM.
In practice, finding the "right" hyper-parameters can be quite difficult. We
therefore extend BOCPD by introducing hyper-parameter learning, without
sacrificing the online nature of the algorithm. Hyper-parameter learning is
performed by optimizing the marginal likelihood of the BOCPD model, a
closed-form quantity which can be computed sequentially. We illustrate
performance on three real-world datasets.
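
As background, the run-length recursion of the baseline BOCPD algorithm can be sketched as follows. The sketch assumes a constant hazard rate and a Gaussian UPM with known noise variance and a conjugate Gaussian prior on the mean; these modelling choices and all names are illustrative. The hyper-parameters are fixed here, so the hyper-parameter learning that the abstract introduces is not shown.

```python
import numpy as np


def bocpd_run_lengths(data, hazard, prior_mean, prior_var, noise_var):
    """Return run_probs[t, r] = P(run length r | first t observations)."""
    T = len(data)
    run_probs = np.zeros((T + 1, T + 1))
    run_probs[0, 0] = 1.0
    # Posterior over the mean for each candidate run length.
    means = np.array([prior_mean])
    variances = np.array([prior_var])

    for t, x in enumerate(data):
        # Predictive density of x under each run length.
        pred_var = variances + noise_var
        pred = np.exp(-0.5 * (x - means) ** 2 / pred_var) / np.sqrt(2 * np.pi * pred_var)

        weighted = run_probs[t, : t + 1] * pred
        growth = weighted * (1.0 - hazard)   # run continues
        change = np.sum(weighted * hazard)   # run resets to length zero
        joint = np.concatenate([[change], growth])
        run_probs[t + 1, : t + 2] = joint / joint.sum()

        # Conjugate update of each run's posterior, plus a fresh prior for r = 0.
        post_var = 1.0 / (1.0 / variances + 1.0 / noise_var)
        post_mean = post_var * (means / variances + x / noise_var)
        means = np.concatenate([[prior_mean], post_mean])
        variances = np.concatenate([[prior_var], post_var])

    return run_probs


# Twenty observations at 0 followed by twenty at 10.
data = np.concatenate([np.zeros(20), np.full(20, 10.0)])
run_probs = bocpd_run_lengths(data, hazard=0.02, prior_mean=0.0,
                              prior_var=100.0, noise_var=1.0)
```

In this toy series the most probable run length grows steadily, collapses when the level shifts, and then grows again from the new regime.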

Marc Peter Deisenroth and Carl Edward Rasmussen.
**Efficient
reinforcement learning for motor control**.
In *10th International PhD Workshop on Systems and Control*, Hluboká
nad Vltavou, Czech Republic, September 2009.

** Abstract:**
Artificial learners often require many more trials than humans or animals
when learning motor control tasks in the absence of expert knowledge. We
implement two key ingredients of biological learning systems, generalization
and incorporation of uncertainty into the decision-making process, to speed
up artificial learning. We present a coherent and fully Bayesian framework
that allows for efficient artificial learning in the absence of expert
knowledge. The success of our learning framework is demonstrated on
challenging nonlinear control problems in simulation and in hardware.

Marc Peter Deisenroth and Carl Edward Rasmussen.
**Bayesian inference
for efficient learning in control**.
In *Multidisciplinary Symposium on Reinforcement Learning*,
Montréal, QC, Canada, June 2009.

** Abstract:** In
contrast to humans or animals, artificial learners often require more trials
when learning motor control tasks solely based on experience. Efficient
autonomous learners will reduce the amount of engineering required to solve
control problems. By using probabilistic forward models, we can employ two
key ingredients of biological learning systems to speed up artificial
learning. We present a consistent and coherent Bayesian framework that allows
for efficient autonomous experience-based learning. We demonstrate the
success of our learning algorithm by applying it to challenging nonlinear
control problems in simulation and in hardware.

Marc Peter Deisenroth, Carl Edward Rasmussen, and Jan Peters.
**Gaussian process
dynamic programming**.
*Neurocomputing*, 72(7-9):1508-1524, March 2009, doi
10.1016/j.neucom.2008.12.019.

** Abstract:** Reinforcement
learning (RL) and optimal control of systems with continuous states and
actions require approximation techniques in most interesting cases. In this
article, we introduce Gaussian process dynamic programming (GPDP), an
approximate value function-based RL algorithm. We consider both a classic
optimal control problem, where problem-specific prior knowledge is available,
and a classic RL problem, where only very general priors can be used. For the
classic optimal control problem, GPDP models the unknown value functions with
Gaussian processes and generalizes dynamic programming to continuous-valued
states and actions. For the RL problem, GPDP starts from a given initial
state and explores the state space using Bayesian active learning. To design
a fast learner, available data have to be used efficiently. Hence, we propose
to learn probabilistic models of the a priori unknown transition dynamics and
the value functions on the fly. In both cases, we successfully apply the
resulting continuous-valued controllers to the under-actuated pendulum swing
up and analyze the performances of the suggested algorithms. It turns out
that GPDP uses data very efficiently and can be applied to problems, where
classic dynamic programming would be cumbersome.

** Comment:** code.

Carl Edward Rasmussen, Bernhard J. de la Cruz, Zoubin Ghahramani, and David L.
Wild.
**Modeling and visualizing
uncertainty in gene expression clusters using Dirichlet process
mixtures**.
*IEEE/ACM Transactions on Computational Biology and Bioinformatics*,
6(4):615-628, 2009, doi
10.1109/TCBB.2007.70269.

** Abstract:** Although the use
of clustering methods has rapidly become one of the standard computational
approaches in the literature of microarray gene expression data, little
attention has been paid to uncertainty in the results obtained. Dirichlet
process mixture (DPM) models provide a nonparametric Bayesian alternative to
the bootstrap approach to modeling uncertainty in gene expression clustering.
Most previously published applications of Bayesian model-based clustering
methods have been to short time series data. In this paper, we present a case
study of the application of nonparametric Bayesian clustering methods to the
clustering of high-dimensional non-time-series gene expression data using full
Gaussian covariances. We use the probability that two genes belong to the
same cluster in a DPM model as a measure of the similarity of these gene
expression profiles. Conversely, this probability can be used to define a
dissimilarity measure, which, for the purposes of visualization, can be input
to one of the standard linkage algorithms used for hierarchical clustering.
Biologically plausible results are obtained from the Rosetta compendium of
expression profiles which extend previously published cluster analyses of
this data.
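
The similarity measure described above can be sketched in a few lines. The example assumes the sampler's output is available as an integer array with one row of cluster labels per posterior sample; the array contents and the function name are made up for illustration.

```python
import numpy as np


def coclustering_probability(assignment_samples):
    """Fraction of posterior samples in which each pair shares a cluster.

    assignment_samples: (S, N) integer array, one row of cluster labels per
    sample. Labels only need to be consistent within a row.
    """
    samples = np.asarray(assignment_samples)
    same = samples[:, :, None] == samples[:, None, :]  # (S, N, N)
    return same.mean(axis=0)


# Four hypothetical samples of cluster labels for four genes.
samples = np.array([
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [2, 2, 5, 5],
    [1, 0, 0, 0],
])
similarity = coclustering_probability(samples)
dissimilarity = 1.0 - similarity
```

The resulting `dissimilarity` matrix is symmetric with a zero diagonal, so it could be handed to a standard linkage routine such as `scipy.cluster.hierarchy.linkage` after conversion to condensed form.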

Carl Edward Rasmussen and Marc Peter Deisenroth.
**Probabilistic
inference for fast learning in control**.
In S. Girgin, M. Loth, R. Munos, P. Preux, and D. Ryabko, editors, *Recent
Advances in Reinforcement Learning*, volume 5323 of *Lecture Notes in
Computer Science (LNCS)*, pages 229-242, Villeneuve d'Ascq, France,
November 2008. Springer-Verlag.

** Abstract:** We provide a
novel framework for very fast model-based reinforcement learning in
continuous state and action spaces. The framework requires probabilistic
models that explicitly characterize their levels of confidence. Within this
framework, we use flexible, non-parametric models to describe the world based
on previously collected experience. We demonstrate learning on the cart-pole
problem in a setting where we provide very limited prior knowledge about the
task. Learning progresses rapidly, and a good policy is found after only a
handful of iterations.

** Comment:** videos and more. slides.

Hannes Nickisch and Carl Edward Rasmussen.
**Approximations
for binary Gaussian process classification**.
*Journal of Machine Learning Research*, 9:2035-2078, October 2008.

** Abstract:** We provide a comprehensive overview of many
recent algorithms for approximate inference in Gaussian process models for
probabilistic binary classification. The relationships between several
approaches are elucidated theoretically, and the properties of the different
algorithms are corroborated by experimental results. We examine both 1) the
quality of the predictive distributions and 2) the suitability of the
different marginal likelihood approximations for model selection (selecting
hyperparameters) and compare to a gold standard based on MCMC. Interestingly,
some methods produce good predictive distributions although their marginal
likelihood approximations are poor. Strong conclusions are drawn about the
methods: The Expectation Propagation algorithm is almost always the method of
choice unless the computational budget is very tight. We also extend existing
methods in various ways, and provide unifying code implementing all
approaches.

Marc Peter Deisenroth, Jan Peters, and Carl Edward Rasmussen.
**Approximate
dynamic programming with Gaussian processes**.
In *2008 American Control Conference (ACC 2008)*, pages 4480-4485,
Seattle, WA, USA, June 2008.

** Abstract:** In general, it is
difficult to determine an optimal closed-loop policy in nonlinear control
problems with continuous-valued state and control domains. Hence,
approximations are often inevitable. The standard method of discretizing
states and controls suffers from the curse of dimensionality and strongly
depends on the chosen temporal sampling rate. The paper introduces Gaussian
Process Dynamic Programming (GPDP). In GPDP, value functions in the Bellman
recursion of the dynamic programming algorithm are modeled using Gaussian
processes. GPDP returns an optimal state-feedback for a finite set of states.
Based on these outcomes, we learn a possibly discontinuous closed-loop policy
on the entire state space by switching between two independently trained
Gaussian processes.

** Comment:** code.

Marc Peter Deisenroth, Carl Edward Rasmussen, and Jan Peters.
**Model-based
reinforcement learning with continuous states and actions**.
In *Proceedings of the 16th European Symposium on Artificial Neural Networks
(ESANN 2008)*, pages 19-24, Bruges, Belgium, April 2008.

** Abstract:** Finding an optimal policy in a reinforcement learning (RL)
framework with continuous state and action spaces is challenging. Approximate
solutions are often inevitable. GPDP is an approximate dynamic programming
algorithm based on Gaussian process (GP) models for the value functions. In
this paper, we extend GPDP to the case of unknown transition dynamics. After
building a GP model for the transition dynamics, we apply GPDP to this model
and determine a continuous-valued policy in the entire state space. We apply
the resulting controller to the underpowered pendulum swing up. Moreover, we
compare our results on this RL task to a nearly optimal discrete DP solution
in a fully known environment.

Sören Sonnenburg, Mikio L. Braun, Cheng Soon Ong, Samy Bengio, Leon Bottou,
Geoffrey Holmes, Yann LeCun, Klaus-Robert Müller, Fernando Pereira,
Carl Edward Rasmussen, Gunnar Rätsch, Bernhard Schölkopf, Alexander
Smola, Pascal Vincent, Jason Weston, and Robert C. Williamson.
**The
need for open source software in machine learning**.
*Journal of Machine Learning Research*, 8:2443-2466, October 2007.

** Abstract:** Open source tools have recently reached a level
of maturity which makes them suitable for building large-scale real-world
systems. At the same time, the field of machine learning has developed a
large body of powerful learning algorithms for diverse applications. However,
the true potential of these methods is not realized, since existing
implementations are not openly shared, resulting in software with low
usability, and weak interoperability. We argue that this situation can be
significantly improved by increasing incentives for researchers to publish
their software under an open source model. Additionally, we outline the
problems authors are faced with when trying to publish algorithmic
implementations of machine learning methods. We believe that a resource of
peer reviewed software accompanied by short articles would be highly valuable
to both the machine learning and the general scientific community.

Joaquin Quiñonero-Candela, Carl Edward Rasmussen, and Christopher K. I.
Williams.
**Approximation
methods for Gaussian process regression**.
In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, *Large-Scale
Kernel Machines*, Neural Information Processing, pages 203-223. The
MIT Press, Cambridge, MA, USA, September 2007.

** Abstract:** A wealth of computationally efficient approximation methods for
Gaussian process regression have been recently proposed. We give a unifying
overview of sparse approximations, following Quiñonero-Candela and Rasmussen (2005), and a
brief review of approximate matrix-vector multiplication methods.

** Comment:** book

Dilan Görür, Frank Jäkel, and Carl Edward Rasmussen.
**A choice model
with infinitely many latent features**.
In W. W. Cohen and Andrew Moore, editors, *23rd International Conference on
Machine Learning*, pages 361-368, New York, NY, USA, June 2006. ACM
Press, doi
10.1145/1143844.1143890.

** Abstract:** Elimination by
aspects (EBA) is a probabilistic choice model describing how humans decide
between several options. The options from which the choice is made are
characterized by binary features and associated weights. For instance, when
choosing which mobile phone to buy the features to consider may be: long
lasting battery, color screen, etc. Existing methods for inferring the
parameters of the model assume pre-specified features. However, the features
that lead to the observed choices are not always known. Here, we present a
non-parametric Bayesian model to infer the features of the options and the
corresponding weights from choice data. We use the Indian buffet process
(IBP) as a prior over the features. Inference using Markov chain Monte Carlo
(MCMC) in conjugate IBP models has been previously described. The main
contribution of this paper is an MCMC algorithm for the EBA model that can
also be used in inference for other non-conjugate IBP models; this may
broaden the use of IBP priors considerably.

Malte Kuß and Carl Edward Rasmussen.
**Assessing
approximations for Gaussian process classification**.
In Y. Weiss, B. Schölkopf, and J. Platt, editors, *Advances in Neural
Information Processing Systems 18*, pages 699-706, Cambridge, MA, USA,
April 2006. The MIT Press.

** Abstract:** Gaussian processes
are attractive models for probabilistic classification but unfortunately
exact inference is analytically intractable. We compare Laplace's method and
Expectation Propagation (EP) focusing on marginal likelihood estimates and
predictive performance. We explain theoretically and corroborate empirically
that EP is superior to Laplace. We also compare to a sophisticated MCMC
scheme and show that EP is surprisingly accurate.

Tobias Pfingsten, Daniel Herrmann, and Carl Edward Rasmussen.
**Model-based design
analysis and yield optimization**.
*IEEE Transactions on Semiconductor Manufacturing*, 19(4):475-486,
2006, doi
10.1109/TSM.2006.883589.

** Abstract:** Fluctuations are
inherent to any fabrication process. Integrated circuits and
microelectromechanical systems are particularly affected by these variations,
and due to high-quality requirements the effect on the devices' performance
has to be understood quantitatively. In recent years, it has become possible
to model the performance of such complex systems on the basis of design
specifications, and model-based sensitivity analysis has made its way into
industrial engineering. We show how an efficient Bayesian approach, using a
Gaussian process prior, can replace the commonly used brute-force Monte Carlo
scheme, making it possible to apply the analysis to computationally costly
models. We introduce a number of global, statistically justified sensitivity
measures for design analysis and optimization. Two models of integrated
systems serve us as case studies to introduce the analysis and to assess its
convergence properties. We show that the Bayesian Monte Carlo scheme can save
costly simulation runs and can ensure a reliable accuracy of the
analysis.

** Comment:** Winner of the 2006 Best Paper Award for the
journal.

Joaquin Quiñonero-Candela, Carl Edward Rasmussen, Fabian Sinz, Olivier
Bousquet, and Bernhard Schölkopf.
**Evaluating
predictive uncertainty challenge**.
In J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc,
editors, *Machine Learning Challenges. Evaluating predictive uncertainty,
visual object classification and recognising textual entailment. First PASCAL
Machine Learning Challenges Workshop*, volume 3944 of *Lecture Notes
in Computer Science (LNCS)*, pages 1-27, Berlin, Germany, April 2006.
Springer, doi
10.1007/11736790_1.

** Abstract:** This Chapter presents
the PASCAL Evaluating Predictive Uncertainty Challenge, introduces the
contributed Chapters by the participants who obtained outstanding results,
and provides a discussion with some lessons to be learnt. The Challenge was
set up to evaluate the ability of Machine Learning algorithms to provide good
"probabilistic predictions", rather than just the usual "point predictions"
with no measure of uncertainty, in regression and classification problems.
Participants had to compete on a number of regression and classification
tasks, and were evaluated by both traditional losses that only take into
account point predictions and losses we proposed that evaluate the quality of
the probabilistic predictions.

Carl Edward Rasmussen and Christopher K. I. Williams.
**Gaussian processes
for machine learning**.
The MIT Press, 2006.

** Abstract:** Gaussian processes (GPs)
provide a principled, practical, probabilistic approach to learning in kernel
machines. GPs have received increased attention in the machine-learning
community over the past decade, and this book provides a long-needed
systematic and unified treatment of theoretical and practical aspects of GPs
in machine learning. The treatment is comprehensive and self-contained,
targeted at researchers and students in machine learning and applied
statistics.

** Comment:** Winner of the 2009 DeGroot Prize. Book web page, chapters and entire book
pdf. GPML Toolbox.

Malte Kuß, Tobias Pfingsten, Lehel Csató, and Carl Edward Rasmussen.
**Approximate
inference for robust Gaussian process regression**.
Technical Report 136, Max Planck Institute for Biological Cybernetics,
Tübingen, Germany, 2005.

** Abstract:** Gaussian process
(GP) priors have been successfully used in non-parametric Bayesian regression
and classification models. Inference can be performed analytically only for
the regression model with Gaussian noise. For all other likelihood models
inference is intractable and various approximation techniques have been
proposed. In recent years expectation-propagation (EP) has been developed as
a general method for approximate inference. This article provides a general
summary of how expectation-propagation can be used for approximate inference
in Gaussian process models. Furthermore we present a case study describing
its implementation for a new robust variant of Gaussian process regression.
To gain further insights into the quality of the EP approximation we present
experiments in which we compare to results obtained by Markov chain Monte
Carlo (MCMC) sampling.

Malte Kuß and Carl Edward Rasmussen.
**Assessing
approximate inference for binary Gaussian process classification**.
*Journal of Machine Learning Research*, 6:1679-1704, 2005.

** Abstract:** Gaussian process priors can be used to define
flexible, probabilistic classification models. Unfortunately exact Bayesian
inference is analytically intractable and various approximation techniques
have been proposed. In this work we review and compare Laplace's method and
Expectation Propagation for approximate Bayesian inference in the binary
Gaussian process classification model. We present a comprehensive comparison
of the approximations, their predictive performance and marginal likelihood
estimates to results obtained by MCMC sampling. We explain theoretically and
corroborate empirically the advantages of Expectation Propagation compared to
Laplace's method.

Joaquin Quiñonero-Candela and Carl Edward Rasmussen.
**Analysis of some
methods for reduced rank Gaussian process regression**.
In R. Murray-Smith and R. Shorten, editors, *Switching and Learning in
Feedback Systems*, pages 98-127. Springer, Berlin, Heidelberg, 2005.

** Abstract:** While there is strong motivation for using
Gaussian Processes (GPs) due to their excellent performance in regression and
classification problems, their computational complexity makes them
impractical when the size of the training set exceeds a few thousand cases.
This has motivated the recent proliferation of a number of cost-effective
approximations to GPs, both for classification and for regression. In this
paper we analyze one popular approximation to GPs for regression: the reduced
rank approximation. While generally GPs are equivalent to infinite linear
models, we show that Reduced Rank Gaussian Processes (RRGPs) are equivalent
to finite sparse linear models. We also introduce the concept of degenerate
GPs and show that they correspond to inappropriate priors. We show how to
modify the RRGP to prevent it from being degenerate at test time. Training
RRGPs consists both in learning the covariance function hyperparameters and
the support set. We propose a method for learning hyperparameters for a given
support set. We also review the Sparse Greedy GP (SGGP) approximation (Smola
and Bartlett, 2001), which is a way of learning the support set for given
hyperparameters based on approximating the posterior. We propose an
alternative method to the SGGP that has better generalization capabilities.
Finally we make experiments to compare the different ways of training a RRGP.
We provide some Matlab code for learning RRGPs.

Joaquin Quiñonero-Candela and Carl Edward Rasmussen.
**A
unifying view of sparse approximate Gaussian process regression**.
*Journal of Machine Learning Research*, 6:1939-1959, 2005.

** Abstract:** We provide a new unifying view, including all
existing proper probabilistic sparse approximations for Gaussian process
regression. Our approach relies on expressing the effective prior which the
methods are using. This allows new insights to be gained, and highlights the
relationship between existing methods. It also allows for a clear
theoretically justified ranking of the closeness of the known approximations
to the corresponding full GPs. Finally we point directly to designs of new
better sparse approximations, combining the best of the existing strategies,
within attractive computational constraints.

Carl Edward Rasmussen and Joaquin Quiñonero-Candela.
**Healing the
Relevance Vector Machine through augmentation**.
In L. De Raedt and S. Wrobel, editors, *22nd International Conference on
Machine Learning*, pages 689-696, 2005.

** Abstract:**
The Relevance Vector Machine (RVM) is a sparse approximate Bayesian kernel
method. It provides full predictive distributions for test cases. However,
the predictive uncertainties have the unintuitive property, that *they
get smaller the further you move away from the training cases*. We give a
thorough analysis. Inspired by the analogy to non-degenerate Gaussian
Processes, we suggest augmentation to solve the problem. The purpose of the
resulting model, RVM*, is primarily to corroborate the theoretical and
experimental analysis. Although RVM* could be used in practical applications,
it is no longer a truly sparse model. Experiments show that sparsity comes at
the expense of worse predictive distributions.

Carl Edward Rasmussen and Malte Kuß.
**Gaussian processes
in reinforcement learning**.
In S. Thrun, L.K. Saul, and B. Schölkopf, editors, *Advances in Neural
Information Processing Systems 16*, pages 751-759, Cambridge, MA, USA,
December 2004. The MIT Press.

** Abstract:** We exploit some
useful properties of Gaussian process (GP) regression models for
reinforcement learning in continuous state spaces and discrete time. We
demonstrate how the GP model allows evaluation of the value function in
closed form. The resulting policy iteration algorithm is demonstrated on a
simple problem with a two dimensional state space. Further, we speculate that
the intrinsic ability of GP models to characterise distributions of functions
would allow the method to capture entire distributions over future values
instead of merely their expectation, which has traditionally been the focus
of much of reinforcement learning.

Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani.
**Warped Gaussian
processes**.
In S. Thrun, L. Saul, and B. Schölkopf, editors, *Advances in Neural
Information Processing Systems 16*, pages 337-344, Cambridge, MA, USA,
December 2004. The MIT Press.

** Abstract:** We generalise
the Gaussian process (GP) framework for regression by learning a nonlinear
transformation of the GP outputs. This allows for non-Gaussian processes and
non-Gaussian noise. The learning algorithm chooses a nonlinear transformation
such that transformed data is well-modelled by a GP. This can be seen as
including a preprocessing transformation as an integral part of the
probabilistic modelling problem, rather than as an ad-hoc step. We
demonstrate on several real regression problems that learning the
transformation can lead to significantly better performance than using a
regular GP, or a GP with a fixed transformation.

A. Dubey, S. Hwang, C. Rangel, Carl Edward Rasmussen, Zoubin Ghahramani, and
David L. Wild.
**Clustering
protein sequence and structure space with infinite Gaussian mixture
models**.
In *Pacific Symposium on Biocomputing 2004*, pages 399-410, Singapore,
2004. World Scientific Publishing.

** Abstract:** We describe
a novel approach to the problem of automatically clustering protein sequences
and discovering protein families, subfamilies etc., based on the theory of
infinite Gaussian mixture models. This method allows the data itself to
dictate how many mixture components are required to model it, and provides a
measure of the probability that two proteins belong to the same cluster. We
illustrate our methods with application to three data sets: globin sequences,
globin sequences with known three-dimensional structures and G-protein coupled
receptor sequences. The consistency of the clusters indicates that our
method is producing biologically meaningful results, which provide a very
good indication of the underlying families and subfamilies. With the
inclusion of secondary structure and residue solvent accessibility
information, we obtain a classification of sequences of known structure which
reflects and extends their SCOP classifications.

Ananya Dubey, Seungwoo Hwang, Claudia Rangel, Carl Edward Rasmussen, Zoubin
Ghahramani, and David L. Wild.
**Clustering
protein sequence and structure space with infinite Gaussian mixture
models**.
In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany A. Jung, and
Teri E. Klein, editors, *Pacific Symposium on Biocomputing*, pages
399-410. World Scientific, 2004.

** Abstract:** We describe a
novel approach to the problem of automatically clustering protein sequences
and discovering protein families, subfamilies etc., based on the theory of
infinite Gaussian mixture models. This method allows the data itself to
dictate how many mixture components are required to model it, and provides a
measure of the probability that two proteins belong to the same cluster. We
illustrate our methods with application to three data sets: globin sequences,
globin sequences with known three-dimensional structures and G-protein
coupled receptor sequences. The consistency of the clusters indicates that our
method is producing biologically meaningful results, which provide a very
good indication of the underlying families and subfamilies. With the
inclusion of secondary structure and residue solvent accessibility
information, we obtain a classification of sequences of known structure which
both reflects and extends their SCOP classifications. A supplementary web
site containing larger versions of the figures is available at
http://public.kgi.edu/approximately wid/PSB04/index.html

Jan Eichhorn, Andreas S. Tolias, Alexander Zien, Malte Kuß, Carl Edward
Rasmussen, Jason Weston, Nikos K. Logothetis, and Bernhard Schölkopf.
**Prediction on
spike data using kernel algorithms**.
In Sebastian Thrun, Lawrence K. Saul, and Bernhard Schölkopf, editors,
*Advances in Neural Information Processing Systems 16*,
volume 16, pages 1367-1374, Cambridge, MA, USA, 2004. The MIT
Press.

** Abstract:** We report and compare the performance of
different learning algorithms based on data from cortical recordings. The
task is to predict the orientation of visual stimuli from the activity of a
population of simultaneously recorded neurons. We compare several ways of
improving the coding of the input (i.e., the spike data) as well as of the
output (i.e., the orientation), and report the results obtained using
different kernel algorithms.

Matthias O. Franz, Younghee Kwon, Carl Edward Rasmussen, and Bernhard
Schölkopf.
**Semi-supervised
kernel regression using whitened function classes**.
In C. E. Rasmussen, H. H. Bülthoff, M. A. Giese, and B. Schölkopf,
editors, *Lecture Notes in Computer Science (LNCS)*, volume 3175,
pages 18-26, Berlin, Germany, 2004. Springer.

** Abstract:**
The use of non-orthonormal basis functions in ridge regression leads to an
often undesired non-isotropic prior in function space. In this study, we
investigate an alternative regularization technique that results in an
implicit whitening of the basis functions by penalizing directions in
function space with a large prior variance. The regularization term is
computed from unlabelled input data that characterizes the input
distribution. Tests on two datasets using polynomial basis functions showed
an improved average performance compared to standard ridge regression.

Dilan Görür, Carl Edward Rasmussen, Andreas S. Tolias, Fabian Sinz, and
Nikos K. Logothetis.
**Modelling
spikes with mixtures of factor analysers**.
In C. E. Rasmussen, H. H. Bülthoff, B. Schölkopf, and M. A. Giese,
editors, *DAGM 2004*, volume 3175 of *Lecture Notes in Computer
Science (LNCS)*, pages 391-398, Berlin, Germany, September 2004. Springer.

** Abstract:** Identifying the action potentials of individual
neurons from extracellular recordings, known as spike sorting, is a
challenging problem. We consider the spike sorting problem using a generative
model, mixtures of factor analysers, which concurrently performs clustering
and feature extraction. The most important advantage of this method is that
it quantifies the certainty with which the spikes are classified. This can be
used as a means for evaluating the quality of clustering and therefore spike
isolation. Using this method, nearly simultaneously occurring spikes can also
be modelled which is a hard task for many of the spike sorting methods.
Furthermore, modelling the data with a generative model allows us to generate
simulated data.

Juš Kocijan, Roderick Murray-Smith, Carl Edward Rasmussen, and Agathe
Girard.
**Gaussian
process model based predictive control**.
In *American Control Conference*, pages 2214-2219, 2004.

** Abstract:** Gaussian process models provide a probabilistic
non-parametric modelling approach for black-box identification of non-linear
dynamic systems. The Gaussian processes can highlight areas of the input
space where prediction quality is poor, due to the lack of data or its
complexity, by indicating the higher variance around the predicted mean.
Gaussian process models contain noticeably less coefficients to be optimised.
This paper illustrates possible application of Gaussian process models within
model-based predictive control. The extra information provided within
Gaussian process model is used in predictive control, where optimisation of
control signal takes the variance information into account. The predictive
control principle is demonstrated on control of pH process benchmark.

Carl Edward Rasmussen.
**Gaussian processes in
machine learning**.
In Olivier Bousquet, Ulrike von Luxburg, and Gunnar Rätsch, editors,
*Advanced Lectures on Machine Learning: ML Summer Schools 2003, Canberra,
Australia, February 2 - 14, 2003, Tübingen, Germany, August 4 - 16, 2003,
Revised Lectures*, volume 3176 of *Lecture Notes in Computer Science
(LNCS)*, pages 63-71. Springer-Verlag, Heidelberg, 2004.

** Abstract:** We give a basic introduction to Gaussian Process regression
models. We focus on understanding the role of the stochastic process and how
it is used to define a distribution over functions. We present the simple
equations for incorporating training data and examine how to learn the
hyperparameters using the marginal likelihood. We explain the practical
advantages of Gaussian Process and end with conclusions and a look at the
current trends in GP work.
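
As an illustration of the equations this abstract refers to, the following is a minimal pure-Python sketch of Gaussian process regression with a squared exponential covariance. The function names, the choice of kernel, the scalar inputs, and the default hyperparameter values are all assumptions made for this example and are not taken from the paper.

```python
import math


def sq_exp_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared exponential covariance between two scalar inputs (assumed kernel)."""
    return signal_var * math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_lower(L, b):
    """Solve L x = b by forward substitution."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x


def solve_lower_transposed(L, b):
    """Solve L^T x = b by back substitution."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def noisy_gram_factor(x_train, length_scale, noise_var):
    """Cholesky factor of K + noise_var * I for the training inputs."""
    n = len(x_train)
    K = [[sq_exp_kernel(x_train[i], x_train[j], length_scale)
          + (noise_var if i == j else 0.0) for j in range(n)] for i in range(n)]
    return cholesky(K)


def gp_predict(x_train, y_train, x_test, length_scale=1.0, noise_var=0.01):
    """Predictive mean and variance of the latent function at one test input."""
    L = noisy_gram_factor(x_train, length_scale, noise_var)
    alpha = solve_lower_transposed(L, solve_lower(L, y_train))  # (K + noise I)^{-1} y
    k_star = [sq_exp_kernel(x_test, xi, length_scale) for xi in x_train]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = sq_exp_kernel(x_test, x_test, length_scale) - sum(vi * vi for vi in v)
    return mean, var


def log_marginal_likelihood(x_train, y_train, length_scale=1.0, noise_var=0.01):
    """Log evidence of the targets; a criterion for choosing hyperparameters."""
    n = len(x_train)
    L = noisy_gram_factor(x_train, length_scale, noise_var)
    alpha = solve_lower_transposed(L, solve_lower(L, y_train))
    data_fit = -0.5 * sum(y * a for y, a in zip(y_train, alpha))
    complexity = -sum(math.log(L[i][i]) for i in range(n))
    return data_fit + complexity - 0.5 * n * math.log(2.0 * math.pi)
```

Calling `gp_predict([0.0, 1.0, 2.0], [0.0, 1.0, 0.0], 1.0)` returns a mean close to the observed target with a small variance, while a test input far from the data returns a mean near zero and a variance near the prior signal variance. Comparing `log_marginal_likelihood` across candidate values of `length_scale` is one simple way to see how the marginal likelihood can guide hyperparameter choice.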

** Comment:** Copyright by Springer, springerlink

Fabian Sinz, Joaquin Quiñonero-Candela, Gökhan H. Bakir, Carl Edward
Rasmussen, and Matthias O. Franz.
**Learning
depth from stereo**.
In C. E. Rasmussen, H. H. Bülthoff, B. Schölkopf, and M. A. Giese,
editors, *26th DAGM Symposium*, volume 3175 of *Lecture Notes in
Computer Science (LNCS)*, pages 245-252, Berlin, Germany, September 2004.
Springer.

** Abstract:** We compare two approaches to the
problem of estimating the depth of a point in space from observing its image
position in two different cameras: 1. The classical photogrammetric approach
explicitly models the two cameras and estimates their intrinsic and extrinsic
parameters using a tedious calibration procedure; 2. A generic machine
learning approach where the mapping from image to spatial coordinates is
directly approximated by a Gaussian Process regression. Our results show that
the generic learning approach, in addition to simplifying the procedure of
calibration, can lead to higher depth accuracies than classical calibration
although no specific domain knowledge is used.

Agathe Girard, Carl Edward Rasmussen, Joaquin Quiñonero-Candela, and
Roderick Murray-Smith.
**Gaussian
process priors with uncertain inputs - application to multiple-step ahead
time series forecasting**.
In S. Becker, S. Thrun, and K. Obermayer, editors, *Advances in Neural
Information Processing Systems 15*, pages 529-536, Cambridge, MA, USA,
December 2003. The MIT Press.

** Abstract:** We consider the
problem of multi-step ahead prediction in time series analysis using the
non-parametric Gaussian process model. k-step ahead forecasting of a
discrete-time non-linear dynamic system can be performed by doing repeated
one-step ahead predictions. For a state-space model of the form
y_{t} = f(y_{t-1}, ..., y_{t-L}), the prediction of y at time t + k
is based on the point estimates of the previous outputs. In this paper, we
show how, using an analytical Gaussian approximation, we can formally
incorporate the uncertainty about intermediate regressor values, thus
updating the uncertainty on the current prediction.
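
The baseline described in this abstract, repeated one-step-ahead prediction from point estimates, can be sketched in a few lines of Python. This is only the naive scheme, not the paper's uncertainty propagation; the function name `iterate_forecast`, its arguments, and the toy model in the usage note are hypothetical and chosen for illustration.

```python
def iterate_forecast(f, history, lag, k):
    """Naive k-step-ahead forecast by repeated one-step-ahead prediction.

    f       -- one-step model taking the `lag` most recent outputs,
               newest first: f(y_{t-1}, ..., y_{t-L})
    history -- observed outputs, oldest first (at least `lag` values)
    lag     -- number of past outputs L used as regressors
    k       -- number of steps to forecast
    """
    if len(history) < lag:
        raise ValueError("history must contain at least `lag` values")
    window = list(history[-lag:])  # oldest first
    predictions = []
    for _ in range(k):
        # Each point prediction is fed back in as if it were an observation,
        # so its uncertainty is ignored in later steps.
        y_next = f(*reversed(window))
        predictions.append(y_next)
        window = window[1:] + [y_next]
    return predictions
```

For example, with the toy model `f = lambda y1, y2: 0.5 * y1 + 0.25 * y2`, the call `iterate_forecast(f, [1.0, 2.0], lag=2, k=3)` produces three successive forecasts, each computed from the previous point estimates rather than from distributions over them.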

Carl Edward Rasmussen and Zoubin Ghahramani.
**Bayesian Monte
Carlo**.
In S. Becker, S. Thrun, and K. Obermayer, editors, *Advances in Neural
Information Processing Systems 15*, pages 489-496, Cambridge, MA, USA,
December 2003. The MIT Press.

** Abstract:** We investigate
Bayesian alternatives to classical Monte Carlo methods for evaluating
integrals. Bayesian Monte Carlo (BMC) allows the incorporation of prior
knowledge, such as smoothness of the integrand, into the estimation. In a
simple problem we show that this outperforms any classical importance
sampling method. We also attempt more challenging multidimensional integrals
involved in computing marginal likelihoods of statistical models (a.k.a.
partition functions and model evidences). We find that Bayesian Monte Carlo
outperformed Annealed Importance Sampling, although for very high dimensional
problems or problems with massive multimodality BMC may be less adequate. One
advantage of the Bayesian approach to Monte Carlo is that samples can be
drawn from any distribution. This allows for the possibility of active design
of sample points so as to maximise information gain.
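
For orientation, the core estimator behind this kind of approach can be written compactly. The block below is a sketch under stated assumptions: a zero-mean Gaussian process prior with covariance function k on the integrand f, noise-free evaluations f_i = f(x_i), and notation chosen for this note rather than copied from the paper.

```latex
\begin{align}
Z &= \int f(x)\, p(x)\, dx, \\
\mathbb{E}[Z \mid \mathcal{D}] &= \mathbf{z}^\top K^{-1} \mathbf{f},
  \qquad z_i = \int k(x, x_i)\, p(x)\, dx, \quad K_{ij} = k(x_i, x_j), \\
\mathbb{V}[Z \mid \mathcal{D}] &= \iint k(x, x')\, p(x)\, p(x')\, dx\, dx'
  - \mathbf{z}^\top K^{-1} \mathbf{z}.
\end{align}
```

Because the estimate depends on the sample locations only through K and z, the points x_i need not be drawn from p(x), which is consistent with the abstract's remark that samples can be drawn from any distribution.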

Ercan Solak, Roderick Murray-Smith, William E. Leithead, Douglas Leith, and
Carl Edward Rasmussen.
**Derivative
observations in Gaussian process models of dynamic systems**.
In S. Becker, S. Thrun, and K. Obermayer, editors, *Advances in Neural
Information Processing Systems 15*, pages 1033-1040, Cambridge, MA, USA,
December 2003. The MIT Press.

** Abstract:** Gaussian
processes provide an approach to nonparametric modelling which allows a
straightforward combination of function and derivative observations in an
empirical model. This is of particular importance in identification of
nonlinear dynamic systems from experimental data. 1) It allows us to combine
derivative information, and associated uncertainty with normal function
observations into the learning and inference process. This derivative
information can be in the form of priors specified by an expert or identified
from perturbation data close to equilibrium. 2) It allows a seamless fusion
of multiple local linear models in a consistent manner, inferring consistent
models and ensuring that integrability constraints are met. 3) It improves
dramatically the computational efficiency of Gaussian process models for
dynamic system identification, by summarising large quantities of
near-equilibrium data by a handful of linearisations, reducing the training
set size - traditionally a problem for Gaussian process models.

Roderick Murray-Smith, Daniel Sbarbaro, Carl Edward Rasmussen, and Agathe
Girard.
**Adaptive,
cautious, predictive control with Gaussian process priors**.
In P. Van den Hof, B. Wahlberg, and S. Weiland, editors, *IFAC SYSID
2003*, pages 1195-1200, Oxford, UK, August 2003. Elsevier Science Ltd.

** Abstract:** Nonparametric Gaussian Process models, a Bayesian
statistics approach, are used to implement a nonlinear adaptive control law.
Predictions, including propagation of the state uncertainty are made over a
k-step horizon. The expected value of a quadratic cost function is minimised,
over this prediction horizon, without ignoring the variance of the model
predictions. The general method and its main features are illustrated on a
simulation example.

Joaquin Quiñonero-Candela, Agathe Girard, Jan Larsen, and Carl Edward
Rasmussen.
**Propagation of
uncertainty in Bayesian kernel models - application to multiple-step ahead
forecasting**.
In *ICASSP 2003*, volume 2, pages 701-704, April 2003.

** Abstract:** The object of Bayesian modelling is the
predictive distribution, which in a forecasting scenario enables improved
estimates of forecasted values and their uncertainties. In this paper we
focus on reliably estimating the predictive mean and variance of forecasted
values using Bayesian kernel based models such as the Gaussian Process and
the Relevance Vector Machine. We derive novel analytic expressions for the
predictive mean and variance for Gaussian kernel shapes under the assumption
of a Gaussian input distribution in the static case, and of a recursive
Gaussian predictive density in iterative forecasting. The capability of the
method is demonstrated for forecasting of time-series and compared to
approximate methods.

Juš Kocijan, Blaž Banko, Bojan Likar, Agathe Girard, Roderick
Murray-Smith, and Carl Edward Rasmussen.
**A case based
comparison of identification with neural network and Gaussian process
models**.
In *IFAC International Conference on Intelligent Control Systems and Signal
Processing*, volume 1, pages 137-142, 2003.

** Abstract:** In this paper an alternative approach to black-box
identification of non-linear dynamic systems is compared with the more
established approach of using artificial neural networks. The Gaussian
process prior approach is a representative of non-parametric modelling
approaches. It was compared on a pH process modelling case study. The purpose
of modelling was to use the model for control design. The comparison revealed
that even though Gaussian process models can be effectively used for
modelling dynamic systems caution has to be exercised when signals are
selected.

Juš Kocijan, Roderick Murray-Smith, Carl Edward Rasmussen, and Bojan
Likar.
**Predictive
control with Gaussian process models**.
In B. Zajc and M. Tkal, editors, *IEEE Region 8 Eurocon 2003: Computer as a
Tool*, pages 352-356, 2003.

** Abstract:** This paper
describes model-based predictive control based on Gaussian processes.
Gaussian process models provide a probabilistic non-parametric modelling
approach for black-box identification of non-linear dynamic systems. It
offers more insight in variance of obtained model response, as well as fewer
parameters to determine than other models. The Gaussian processes can
highlight areas of the input space where prediction quality is poor, due to
the lack of data or its complexity, by indicating the higher variance around
the predicted mean. This property is used in predictive control, where
optimisation of control signal takes the variance information into account.
The predictive control principle is demonstrated on a simulated example of
nonlinear system.

Joaquin Quiñonero-Candela, Agathe Girard, Jan Larsen, and Carl Edward
Rasmussen.
**Propagation of uncertainty in Bayesian
kernel models - application to multiple-step ahead forecasting**.
In C. Molina, T. Adali, J. Larsen, M. Van Hulle, S. C. Douglas, and J. Rouat,
editors, *NNSP 2003*, Piscataway, New Jersey, 2003. IEEE Press.

** Abstract:** The object of Bayesian modelling is the
predictive distribution, which in a forecasting scenario enables improved
estimates of forecasted values and their uncertainties. In this paper we
focus on reliably estimating the predictive mean and variance of forecasted
values using Bayesian kernel based models such as the Gaussian Process and
the Relevance Vector Machine. We derive novel analytic expressions for the
predictive mean and variance for Gaussian kernel shapes under the assumption
of a Gaussian input distribution in the static case, and of a recursive
Gaussian predictive density in iterative forecasting. The capability of the
method is demonstrated for forecasting of time-series and compared to
approximate methods.

** Comment:** Electronic version of Quiñonero-Candela, Girard, Larsen and
Rasmussen, 2003 which should have been presented at ICASSP 03, but was
cancelled due to bird flu epidemic.

Joaquin Quiñonero-Candela, Agathe Girard, and Carl Edward Rasmussen.
**Prediction at an
uncertain input for Gaussian processes and Relevance Vector Machines
application to multiple-step ahead time-series prediction**.
Technical Report IMM-2003-18, Institute for Mathematical Modelling, DTU,
2003.

** Comment:** techreport

Carl Edward Rasmussen.
**Gaussian processes to
speed up Hybrid Monte Carlo for expensive Bayesian integrals**.
In *Bayesian Statistics 7*, pages 651-659. Oxford University Press,
2003.

** Abstract:** Hybrid Monte Carlo (HMC) is often the
method of choice for computing Bayesian integrals that are not analytically
tractable. However the success of this method may require a very large number
of evaluations of the (un-normalized) posterior and its partial derivatives.
In situations where the posterior is computationally costly to evaluate, this
may lead to an unacceptable computational load for HMC. I propose to use a
Gaussian Process model of the (log of the) posterior for most of the
computations required by HMC. Within this scheme only occasional evaluation
of the actual posterior is required to guarantee that the samples generated
have exactly the desired distribution, even if the GP model is somewhat
inaccurate. The method is demonstrated on a 10 dimensional problem, where 200
evaluations suffice for the generation of 100 roughly independent points from
the posterior. Thus, the proposed scheme allows Bayesian treatment of models
with posteriors that are computationally demanding, such as models involving
computer simulation.

Matthew J. Beal, Zoubin Ghahramani, and Carl Edward Rasmussen.
**The infinite
hidden Markov model**.
In *Advances in Neural Information Processing Systems 14*, pages
577-584, Cambridge, MA, USA, December 2002. The MIT Press.

** Abstract:** We show that it is possible to extend hidden Markov models to
have a countably infinite number of hidden states. By using the theory of
Dirichlet processes we can implicitly integrate out the infinitely many
transition parameters, leaving only three hyperparameters which can be
learned from data. These three hyperparameters define a hierarchical
Dirichlet process capable of capturing a rich set of transition dynamics. The
three hyperparameters control the time scale of the dynamics, the sparsity of
the underlying state-transition matrix, and the expected number of distinct
hidden states in a finite sequence. In this framework it is also natural to
allow the alphabet of emitted symbols to be infinite - consider, for
example, symbols being possible words appearing in English text.

Carl Edward Rasmussen and Zoubin Ghahramani.
**Infinite mixtures of
Gaussian process experts**.
In *Advances in Neural Information Processing Systems 14*, pages
881-888, Cambridge, MA, USA, December 2002. The MIT Press.

** Abstract:** We present an extension to the Mixture of Experts (ME) model,
where the individual experts are Gaussian Process (GP) regression models.
Using an input-dependent adaptation of the Dirichlet Process, we implement a
gating network for an infinite number of Experts. Inference in this model may
be done efficiently using a Markov Chain relying on Gibbs sampling. The model
allows the effective covariance function to vary with the inputs, and may
handle large datasets - thus potentially overcoming two of the biggest
hurdles with GP models. Simulations show the viability of this approach.

Irene K. Andersen, Anna Szymkowiak, Carl Edward Rasmussen, L. G. Hanson, J. R.
Marstrand, H. B. W. Larsson, and Lars Kai Hansen.
**Perfusion
quantification using Gaussian process deconvolution**.
*Magnetic Resonance in Medicine*, 48(2):351-361, 2002, doi 10.1002/mrm.10213.

** Abstract:** The quantification of perfusion using dynamic
susceptibility contrast MR imaging requires deconvolution to obtain the
residual impulse-response function (IRF). Here, a method using a Gaussian
process for deconvolution, GPD, is proposed. The fact that the IRF is smooth
is incorporated as a constraint in the method. The GPD method, which
automatically estimates the noise level in each voxel, has the advantage that
model parameters are optimized automatically. The GPD is compared to singular
value decomposition (SVD) using a common threshold for the singular values
and to SVD using a threshold optimized according to the noise level in each
voxel. The comparison is carried out using artificial data as well as using
data from healthy volunteers. It is shown that GPD is comparable to SVD with a
variable optimized threshold when determining the maximum of the IRF, which
is directly related to the perfusion. GPD provides a better estimate of the
entire IRF. As the signal to noise ratio increases or the time resolution of
the measurements increases, GPD is shown to be superior to SVD. This is also
found for large distribution volumes.

Christopher K. I. Williams, Carl Edward Rasmussen, Anton Schwaighofer, and
Volker Tresp.
**Observations
on the Nyström method for Gaussian process prediction**.
Technical report, University of Edinburgh, 2002.

** Abstract:**
A number of methods for speeding up Gaussian Process (GP) prediction have
been proposed, including the Nyström method of Williams and Seeger
(2001). In this paper we focus on two issues: (1) the relationship of the
Nyström method to the Subset of Regressors method (Poggio and Girosi,
1990; Luo and Wahba, 1997) and (2) understanding in what circumstances the
Nyström approximation would be expected to provide a good approximation
to exact GP regression.

Carl Edward Rasmussen and Zoubin Ghahramani.
**Occam's
razor**.
In *Advances in Neural Information Processing Systems 13*, pages
294-300, Cambridge, MA, USA, December 2001. The MIT Press.

**
Abstract:** The Bayesian paradigm apparently only sometimes gives rise to
Occam's Razor; at other times very large models perform well. We give simple
examples of both kinds of behaviour. The two views are reconciled when
measuring complexity of functions, rather than of the machinery used to
implement them. We analyze the complexity of functions for some
linear-in-the-parameter models that are equivalent to Gaussian Processes, and always find
Occam's Razor at work.

Pedro A. d. F. R. Højen-Sørensen, Carl Edward Rasmussen, and Lars Kai
Hansen.
**Bayesian
modelling of fMRI time series**.
In *Advances in Neural Information Processing Systems 12*, pages
754-760. The MIT Press, 2000.

** Abstract:** We present a
Hidden Markov Model (HMM) for inferring the hidden psychological state (or
neural activity) during single trial fMRI activation experiments with blocked
task paradigms. Inference is based on Bayesian methodology, using a
combination of analytical methods and a variety of Markov Chain Monte Carlo (MCMC)
sampling techniques. The advantage of this method is that detection of short
time learning effects between repeated trials is possible since inference is
based only on single trial experiments.

Carl Edward Rasmussen.
**The infinite Gaussian
mixture model**.
In *Advances in Neural Information Processing Systems 12*, pages
554-560. The MIT Press, 2000.

** Abstract:** In a Bayesian
mixture model it is not necessary a priori to limit the number of components
to be finite. In this paper an infinite Gaussian mixture model is presented
which neatly sidesteps the difficult problem of finding the "right" number of
mixture components. Inference in the model is done using an efficient
parameter-free Markov Chain that relies entirely on Gibbs sampling.

Carl Edward Rasmussen.
**Evaluation of
Gaussian processes and other methods for non-linear regression**.
PhD thesis, University of Toronto, Department of Computer Science, Toronto,
Canada, 1996.

** Abstract:** This thesis develops two Bayesian
learning methods relying on Gaussian processes and a rigorous statistical
approach for evaluating such methods. In these experimental designs the
sources of uncertainty in the estimated generalisation performances due to
both variation in training and test sets are accounted for. The framework
allows for estimation of generalisation performance as well as statistical
tests of significance for pairwise comparisons. Two experimental designs are
recommended and supported by the DELVE software environment.

Two new
non-parametric Bayesian learning methods relying on Gaussian process priors
over functions are developed. These priors are controlled by hyperparameters
which set the characteristic length scale for each input dimension. In the
simplest method, these parameters are fit from the data using optimization.
In the second, fully Bayesian method, a Markov chain Monte Carlo technique is
used to integrate over the hyperparameters. One advantage of these Gaussian
process methods is that the priors and hyperparameters of the trained models
are easy to interpret.

The Gaussian process methods are benchmarked
against several other methods, on regression tasks using both real data and
data generated from realistic simulations. The experiments show that small
datasets are unsuitable for benchmarking purposes because the uncertainties
in performance measurements are large. A second set of experiments provides
strong evidence that the bagging procedure is advantageous for the
Multivariate Adaptive Regression Splines (MARS) method.

The simulated
datasets have controlled characteristics which make them useful for
understanding the relationship between properties of the dataset and the
performance of different methods. The dependency of the performance on
available computation time is also investigated. It is shown that a Bayesian
approach to learning in multi-layer perceptron neural networks achieves
better performance than the commonly used early stopping procedure, even for
reasonably short amounts of computation time. The Gaussian process methods
are shown to consistently outperform the more conventional methods.

Carl Edward Rasmussen.
**A practical Monte
Carlo implementation of Bayesian learning**.
In *Advances in Neural Information Processing Systems 8*, pages
598-604, Cambridge, MA, USA, 1996. The MIT Press.

**
Abstract:** A practical method for Bayesian training of feed-forward neural
networks using sophisticated Monte Carlo methods is presented and evaluated.
In reasonably small amounts of computer time this approach outperforms other
state-of-the-art methods on 5 data-limited tasks from real world domains.

Carl Edward Rasmussen, Radford M. Neal, Geoffrey E. Hinton, Drew van Camp, Mike
Revow, Zoubin Ghahramani, Rafal Kustra, and Robert Tibshirani.
**The DELVE
manual**, 1996.

** Abstract:** DELVE - Data for
Evaluating Learning in Valid Experiments - is a collection of datasets from
many sources, an environment within which this data can be used to assess the
performance of methods for learning relationships from data, and a repository
for the results of such experiments.

** Comment:** The DELVE website.

Chris K. I. Williams and Carl Edward Rasmussen.
**Gaussian processes
for regression**.
In *Advances in Neural Information Processing Systems 8*, pages
514-520, Cambridge, MA, USA, 1996. The MIT Press.

**
Abstract:** The Bayesian analysis of neural networks is difficult because a
simple prior over weights implies a complex prior over functions. We
investigate the use of a Gaussian process prior over functions, which permits
the predictive Bayesian analysis for fixed values of hyperparameters to be
carried out exactly using matrix operations. Two methods, using optimization
and averaging (via Hybrid Monte Carlo) over hyperparameters have been tested
on a number of challenging problems and have produced excellent results.

Lars Kai Hansen and Carl Edward Rasmussen.
**Pruning from
adaptive regularization**.
*Neural Computation*, 6(6):1222-1231, 1994.

**
Abstract:** Inspired by the recent upsurge of interest in Bayesian methods
we consider adaptive regularization. A generalization-based scheme for
adaptation of regularization parameters is introduced and compared to
Bayesian regularization. We show that pruning arises naturally within both
adaptive regularization schemes. As a model example we have chosen the simplest
possible: estimating the mean of a random variable with known variance.
Marked similarities are found between the two methods in that they both
involve a "noise limit", below which they regularize with infinite weight
decay, i.e., they prune. However, pruning is not always beneficial. We show
explicitly that both methods in some cases may increase the generalization
error. This corresponds to situations where the underlying assumptions of the
regularizer are poorly matched to the environment.

Carl Edward Rasmussen and David J. Willshaw.
**Presynaptic and postsynaptic
competition in models for the development of neuromuscular
connections**.
*Biological Cybernetics*, 68(5):409-419, 1993, doi 10.1007/BF00198773.

** Abstract:** In the establishment of connections between nerve
and muscle there is an initial stage when each muscle fibre is innervated by
several different motor axons. Withdrawal of connections then takes place
until each fibre has contact from just a single axon. The evidence suggests
that the withdrawal process involves competition between nerve terminals. We
examine in formal models several types of competitive mechanism that have
been proposed for this phenomenon. We show that a model which combines
competition for a presynaptic resource with competition for a postsynaptic
resource is superior to others. This model accounts for many anatomical and
physiological findings and has a biologically plausible implementation.
Intrinsic withdrawal appears to be a side effect of the competitive mechanism
rather than a separate non-competitive feature. The model's capabilities are
confirmed by theoretical analysis and full scale computer simulations.

Giambattista Parascandolo, Niki Kilbertus, Mateo Rojas-Carulla, and Bernhard
Schölkopf.
**Learning
independent causal mechanisms**.
In *35th International Conference on Machine Learning*, Stockholm,
Sweden, July 2018.

** Abstract:** Statistical learning relies
upon data sampled from a distribution, and we usually do not care what
actually generated it in the first place. From the point of view of causal
modeling, the structure of each distribution is induced by physical
mechanisms that give rise to dependences between observables. Mechanisms,
however, can be meaningful autonomous modules of generative models that make
sense beyond a particular entailed data distribution, lending themselves to
transfer between problems. We develop an algorithm to recover a set of
independent (inverse) mechanisms from a set of transformed data points. The
approach is unsupervised and based on a set of experts that compete for data
generated by the mechanisms, driving specialization. We analyze the proposed
method in a series of experiments on image data. Each expert learns to map a
subset of the transformed data back to a reference distribution. The learned
mechanisms generalize to novel domains. We discuss implications for transfer
learning and links to recent trends in generative modeling.

Niki Kilbertus, Mateo Rojas-Carulla, Giambattista Parascandolo, Moritz Hardt,
Dominik Janzing, and Bernhard Schölkopf.
**Avoiding discrimination
through causal reasoning**.
In *Advances in Neural Information Processing Systems 30*, Long Beach,
California, December 2017.

** Abstract:** Recent work on
fairness in machine learning has focused on various statistical
discrimination criteria and how they trade off. Most of these criteria are
observational: They depend only on the joint distribution of predictor,
protected attribute, features, and outcome. While convenient to work with,
observational criteria have severe inherent limitations that prevent them
from resolving matters of fairness conclusively. Going beyond observational
criteria, we frame the problem of discrimination based on protected
attributes in the language of causal reasoning. This viewpoint shifts
attention from "What is the right fairness criterion?" to "What do we want
to assume about our model of the causal data generating process?" Through
the lens of causality, we make several contributions. First, we crisply
articulate why and when observational criteria fail, thus formalizing what
was before a matter of opinion. Second, our approach exposes previously
ignored subtleties and why they are fundamental to the problem. Finally, we
put forward natural causal non-discrimination criteria and develop algorithms
that satisfy them.

Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters.
**A causal perspective on domain
adaptation**.
*arXiv preprint arXiv:1507.05333*, 2015.

** Abstract:**
From training data from several related domains (or tasks), methods of domain
adaptation try to combine knowledge to improve performance. This paper
discusses an approach to domain adaptation which is inspired by a causal
interpretation of the multi-task problem. We assume that a covariate shift
assumption holds true for a subset of predictor variables: the conditional of
the target variable given this subset of predictors is invariant with respect
to shifts in those predictors (covariates). We propose to learn the
corresponding conditional expectation in the training domains and use it for
estimation in the target domain. We further introduce a method which allows
for automatic inference of the above subset in regression and classification.
We study the performance of this approach in an adversarial setting, in the
case where no additional examples are available in the test domain. If a
labeled sample is available, we provide a method for using both the
transferred invariant conditional and task specific information. We present
results on synthetic data sets and a sentiment analysis problem.

Krzysztof Choromanski, Mark Rowland, Wenyu Chen, and Adrian Weller.
**Unifying
orthogonal Monte Carlo methods**.
In *36th International Conference on Machine Learning*, Long Beach, June
2019.

** Abstract:** Many machine learning methods making use
of Monte Carlo sampling in vector spaces have been shown to be improved by
conditioning samples to be mutually orthogonal. Exact orthogonal coupling of
samples is computationally intensive, hence approximate methods have been of
great interest. In this paper, we present a unifying perspective of many
approximate methods by considering Givens transformations, propose new
approximate methods based on this framework, and demonstrate the first
statistical guarantees for families of approximate methods in kernel
approximation. We provide extensive empirical evaluations with guidance for
practitioners.

Mark Rowland, Jiri Hron, Yunhao Tang, Krzysztof Choromanski, Tamas Sarlos, and
Adrian Weller.
**Orthogonal
estimation of Wasserstein distances**.
In *22nd International Conference on Artificial Intelligence and
Statistics*, Okinawa, Japan, April 2019.

** Abstract:**
Wasserstein distances are increasingly used in a wide variety of applications
in machine learning. Sliced Wasserstein distances form an important subclass
which may be estimated efficiently through one-dimensional sorting
operations. In this paper, we propose a new variant of sliced Wasserstein
distance, study the use of orthogonal coupling in Monte Carlo estimation of
Wasserstein distances and draw connections with stratified sampling, and
evaluate our approaches experimentally in a range of large-scale experiments
in generative modelling and reinforcement learning.

Mark Rowland, Krzysztof Choromanski, Francois Chalus, Aldo Pacchiano, Tamas
Sarlos, Richard Turner, and Adrian Weller.
**Geometrically
coupled Monte Carlo sampling**.
In *Advances in Neural Information Processing Systems 32*, Montreal,
Canada, December 2018.

** Abstract:** Monte Carlo sampling in
high-dimensional, low-sample settings is important in many machine learning
tasks. We improve current methods for sampling in Euclidean spaces by
avoiding independence, and instead consider ways to couple samples. We show
fundamental connections to optimal transport theory, leading to novel
sampling algorithms, and providing new theoretical grounding for existing
strategies. We compare our new strategies against prior methods for improving
sample efficiency, including quasi-Monte Carlo, by studying discrepancy. We
explore our findings empirically, and observe benefits of our sampling
schemes for reinforcement learning and generative modelling.

Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard Turner, and
Adrian Weller.
**Structured
evolution with compact architectures for scalable policy
optimization**.
In *35th International Conference on Machine Learning*, Stockholm,
Sweden, July 2018.

** Abstract:** We present a new method of
blackbox optimization via gradient approximation with the use of structured
random orthogonal matrices, providing more accurate estimators than baselines
and with provable theoretical guarantees. We show that this algorithm can be
successfully applied to learn better quality compact policies than those
using standard gradient estimation techniques. The compact policies we learn
have several advantages over unstructured ones, including faster training
algorithms and faster inference. These benefits are important when the policy
is deployed on real hardware with limited resources. Further, compact
policies provide more scalable architectures for derivative-free optimization
(DFO) in high dimensional spaces. We show that most robotics tasks from the
OpenAI Gym can be solved using neural networks with fewer than 300 parameters,
with almost linear time complexity of the inference phase, with up to 13x
fewer parameters relative to the Evolution Strategies (ES) algorithm
introduced by Salimans et al. (2017). We do not need heuristics such as
fitness shaping to learn good quality policies, resulting in a simple and
theoretically motivated training mechanism.

Krzysztof Choromanski, Mark Rowland, Tamas Sarlos, Vikas Sindhwani, Richard E.
Turner, and Adrian Weller.
**The geometry of
random features**.
In *21st International Conference on Artificial Intelligence and
Statistics*, Playa Blanca, Lanzarote, Canary Islands, April 2018.

** Abstract:** We present an in-depth examination of the
effectiveness of radial basis function kernel (beyond Gaussian) estimators
based on orthogonal random feature maps. We show that orthogonal estimators
outperform state-of-the-art mechanisms that use iid sampling under weak
conditions for tails of the associated Fourier distributions. We prove that
for the case of many dimensions, the superiority of the orthogonal transform
can be accurately measured by a property we define called the charm of the
kernel, and that orthogonal random features provide optimal (in terms of mean
squared error) kernel estimators. We provide the first theoretical results
which explain why orthogonal random features outperform unstructured ones on
downstream tasks such as kernel ridge regression by showing that orthogonal
random features provide kernel algorithms with better spectral properties
than the previous state-of-the-art. Our results enable practitioners more
generally to estimate the benefits from applying orthogonal transforms.

Mark Rowland, Marc G. Bellemare, Will Dabney, Rémi Munos, and Yee Whye Teh.
**An analysis of categorical
distributional reinforcement learning**.
In *21st International Conference on Artificial Intelligence and
Statistics*, Playa Blanca, Lanzarote, Canary Islands, April 2018.

** Abstract:** Distributional approaches to value-based
reinforcement learning model the entire distribution of returns, rather than
just their expected values, and have recently been shown to yield
state-of-the-art empirical performance. This was demonstrated by the recently
proposed C51 algorithm, based on categorical distributional reinforcement
learning (CDRL) [Bellemare et al., 2017]. However, the theoretical properties
of CDRL algorithms are not yet well understood. In this paper, we introduce a
framework to analyse CDRL algorithms, establish the importance of the
projected distributional Bellman operator in distributional RL, draw
fundamental connections between CDRL and the Cramér distance, and give a
proof of convergence for sample-based categorical distributional
reinforcement learning algorithms.

Will Dabney, Mark Rowland, Marc G. Bellemare, and Rémi Munos.
**Distributional reinforcement
learning with quantile regression**.
In *32nd AAAI Conference on Artificial Intelligence*, New Orleans,
February 2018.

** Abstract:** In reinforcement learning an
agent interacts with the environment by taking actions and observing the next
state and reward. When sampled probabilistically, these state transitions,
rewards, and actions can all induce randomness in the observed long-term
return. Traditionally, reinforcement learning algorithms average over this
randomness to estimate the value function. In this paper, we build on recent
work advocating a distributional approach to reinforcement learning in which
the distribution over returns is modeled explicitly instead of only
estimating the mean. That is, we examine methods of learning the value
distribution instead of the value function. We give results that close a
number of gaps between the theoretical and algorithmic results given by
Bellemare, Dabney, and Munos (2017). First, we extend existing results to the
approximate distribution setting. Second, we present a novel distributional
reinforcement learning algorithm consistent with our theoretical formulation.
Finally, we evaluate this new algorithm on the Atari 2600 games, observing
that it significantly outperforms many of the recent improvements on DQN,
including the related distributional algorithm C51.

Maria Lomeli, Mark Rowland, Arthur Gretton, and Zoubin Ghahramani.
**Antithetic and Monte Carlo
kernel estimators for partial rankings**.
*arXiv preprint arXiv:1807.00400*, 2018.

** Abstract:**
In the modern age, rankings data is ubiquitous and it is useful for a variety
of applications such as recommender systems, multi-object tracking and
preference learning. However, most rankings data encountered in the real
world is incomplete, which prevents the direct application of existing
modelling tools for complete rankings. Our contribution is a novel way to
extend kernel methods for complete rankings to partial rankings, via
consistent Monte Carlo estimators for Gram matrices: matrices of kernel
values between pairs of observations. We also present a novel variance
reduction scheme based on an antithetic variate construction between
permutations to obtain an improved estimator for the Mallows kernel. The
corresponding antithetic kernel estimator has lower variance and we
demonstrate empirically that it has a better performance in a variety of
Machine Learning tasks. Both kernel estimators are based on extending kernel
mean embeddings to the embedding of a set of full rankings consistent with an
observed partial ranking. They form a computationally tractable alternative
to previous approaches for partial rankings data. An overview of the existing
kernels and metrics for permutations is also provided.

Krzysztof Choromanski, Mark Rowland, and Adrian Weller.
**The
unreasonable effectiveness of structured random orthogonal
embeddings**.
In *Advances in Neural Information Processing Systems 31*, Long Beach,
California, December 2017.

** Abstract:** We examine a class
of embeddings based on structured random matrices with orthogonal rows which
can be applied in many machine learning applications including dimensionality
reduction and kernel approximation. For both the Johnson-Lindenstrauss
transform and the angular kernel, we show that we can select matrices
yielding guaranteed improved performance in accuracy and/or speed compared to
earlier methods. We introduce matrices with complex entries which give
significant further accuracy improvement. We provide geometric and Markov
chain-based perspectives to help understand the benefits, and empirical
results which suggest that the approach is helpful in a wider range of
applications.

Mark Rowland and Adrian Weller.
**Uprooting
and rerooting higher-order graphical models**.
In *Advances in Neural Information Processing Systems 31*, Long Beach,
California, December 2017.

** Abstract:** The idea of
uprooting and rerooting graphical models was introduced specifically for
binary pairwise models by Weller [18] as a way to transform a model to any of
a whole equivalence class of related models, such that inference on any one
model yields inference results for all others. This is very helpful since
inference, or relevant bounds, may be much easier to obtain or more accurate
for some model in the class. Here we introduce methods to extend the approach
to models with higher-order potentials and develop theoretical insights. For
example, we demonstrate that the triplet-consistent polytope TRI is unique in
being 'universally rooted'. We demonstrate empirically that rerooting can
significantly improve accuracy of methods of inference for higher-order
models at negligible computational cost.

Mark Rowland, Aldo Pacchiano, and Adrian Weller.
**Conditions beyond
treewidth for tightness of higher-order LP relaxations**.
In *20th International Conference on Artificial Intelligence and
Statistics*, Fort Lauderdale, Florida, April 2017.

**
Abstract:** Linear programming (LP) relaxations are a popular method to
attempt to find a most likely configuration of a discrete graphical model. If
a solution to the relaxed problem is obtained at an integral vertex then the
solution is guaranteed to be exact and we say that the relaxation is tight.
We consider binary pairwise models and introduce new methods which allow us
to demonstrate refined conditions for tightness of LP relaxations in the
Sherali-Adams hierarchy. Our results include showing that for higher-order LP
relaxations, treewidth is not precisely the right way to characterize
tightness. This work is primarily theoretical, with insights that can improve
efficiency in practice.

Nilesh Tripuraneni, Mark Rowland, Zoubin Ghahramani, and Richard E. Turner.
**Magnetic
Hamiltonian Monte Carlo**.
In *34th International Conference on Machine Learning*, 2017.

** Abstract:** Hamiltonian Monte Carlo (HMC) exploits
Hamiltonian dynamics to construct efficient proposals for Markov chain Monte
Carlo (MCMC). In this paper, we present a generalization of HMC which
exploits non-canonical Hamiltonian dynamics. We refer to this algorithm as
magnetic HMC, since in 3 dimensions a subset of the dynamics map onto the
mechanics of a charged particle coupled to a magnetic field. We establish a
theoretical basis for the use of non-canonical Hamiltonian dynamics in MCMC,
and construct a symplectic, leapfrog-like integrator allowing for the
implementation of magnetic HMC. Finally, we exhibit several examples where
these non-canonical dynamics can lead to improved mixing of magnetic HMC
relative to ordinary HMC.

**Black-box
Alpha Divergence Minimization**.
In *33rd International Conference on Machine Learning*, New York, USA,
June 2016.

** Abstract:** Black-box alpha (BB-$α$) is a
new approximate inference method based on the minimization of
$α$-divergences. BB-$α$ scales to large datasets because it can be
implemented using stochastic gradient descent. BB-$α$ can be applied to
complex probabilistic models with little effort since it only requires as
input the likelihood function and its gradients. These gradients can be
easily obtained using automatic differentiation. By changing the divergence
parameter $α$, the method is able to interpolate between variational Bayes
(VB) ($α \rightarrow 0$) and an algorithm similar to expectation
propagation (EP) ($α = 1$). Experiments on probit regression and neural
network regression and classification problems show that BB-$α$ with
non-standard settings of $α$, such as $α = 0.5$, usually produces
better predictions than with $α \rightarrow 0$ (VB) or $α = 1$
(EP).

Adrian Weller, Mark Rowland, and David Sontag.
**Tightness of LP
relaxations for almost balanced models**.
In *19th International Conference on Artificial Intelligence and
Statistics*, Cadiz, Spain, May 2016.

** Abstract:** Linear
programming (LP) relaxations are widely used to attempt to identify a most
likely configuration of a discrete graphical model. In some cases, the LP
relaxation attains an optimum vertex at an integral location and thus
guarantees an exact solution to the original optimization problem. When this
occurs, we say that the LP relaxation is tight. Here we consider binary
pairwise models and derive sufficient conditions for guaranteed tightness of
(i) the standard LP relaxation on the local polytope LP+LOC, and (ii) the LP
relaxation on the triplet-consistent polytope LP+TRI (the next level in the
Sherali-Adams hierarchy). We provide simple new proofs of earlier results and
derive significant novel results including that LP+TRI is tight for any
model where each block is balanced or almost balanced, and a decomposition
theorem that may be used to break apart complex models into smaller pieces.
An almost balanced (sub-)model is one that contains no frustrated cycles
except through one privileged variable.

Matej Balog, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M. Roy, and
Yee Whye Teh.
**The
Mondrian kernel**.
In *32nd Conference on Uncertainty in Artificial Intelligence*, pages
32-41, Jersey City, New Jersey, USA, June 2016.

**
Abstract:** We introduce the Mondrian kernel, a fast random feature
approximation to the Laplace kernel. It is suitable for both batch and online
learning, and admits a fast kernel-width-selection procedure as the random
features can be re-used efficiently for all kernel widths. The features are
constructed by sampling trees via a Mondrian process [Roy and Teh, 2009], and
we highlight the connection to Mondrian forests [Lakshminarayanan et al.,
2014], where trees are also sampled via a Mondrian process, but fit
independently. This link provides a new insight into the relationship between
kernel methods and random forests.

** Comment:** [Supplementary
Material] [arXiv] [Poster]
[Slides]
[Code]

Gintare Karolina Dziugaite, Daniel M. Roy, and Zoubin Ghahramani.
**Training
generative neural networks via Maximum Mean Discrepancy
optimization**.
In *31st Conference on Uncertainty in Artificial Intelligence*, pages
258-267, Amsterdam, The Netherlands, July 2015.

**
Abstract:** We consider training a deep neural network to generate samples
from an unknown distribution given i.i.d. data. We frame learning as an
optimization minimizing a two-sample test statistic: informally speaking, a
good generator network produces samples that cause a two-sample test to fail
to reject the null hypothesis. As our two-sample test statistic, we use an
unbiased estimate of the maximum mean discrepancy, which is the centerpiece
of the nonparametric kernel two-sample test proposed by Gretton et al.
(2012). We compare to the adversarial nets framework introduced by Goodfellow
et al. (2014), in which learning is a two-player game between a generator
network and an adversarial discriminator network, both trained to outwit the
other. From this perspective, the MMD statistic plays the role of the
discriminator. In addition to empirical comparisons, we prove bounds on the
generalization error incurred by optimizing the empirical MMD.

Creighton Heaukulani and Daniel M. Roy.
**The combinatorial structure of beta
negative binomial processes**.
Technical report, Dept. of Engineering, University of Cambridge, March 2014.

** Abstract:** We characterize the combinatorial structure of
conditionally-i.i.d. sequences of negative binomial processes with a common
beta process base measure. In Bayesian nonparametric applications, such
processes have served as models for unknown multisets of a measurable space.
Previous work has characterized random subsets arising from
conditionally-i.i.d. sequences of Bernoulli processes with a common beta
process base measure. In this case, the combinatorial structure is described
by the Indian buffet process. Our results give a count analogue of the Indian
buffet process, which we call a negative binomial Indian buffet process. As
an intermediate step toward this goal, we provide constructions for the beta
negative binomial process that avoid a representation of the underlying beta
process base measure.

**Random
function priors for exchangeable arrays with applications to graphs and
relational data**.
In *Advances in Neural Information Processing Systems 26*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:** A
fundamental problem in the analysis of structured relational data like
graphs, networks, databases, and matrices is to extract a summary of the
common structure underlying relations between individual entities. Relational
data are typically encoded in the form of arrays; invariance to the ordering
of rows and columns corresponds to exchangeable arrays. Results in
probability theory due to Aldous, Hoover and Kallenberg show that
exchangeable arrays can be represented in terms of a random measurable
function which constitutes the natural model parameter in a Bayesian model.
We obtain a flexible yet simple Bayesian nonparametric model by placing a
Gaussian process prior on the parameter function. Efficient inference
utilises elliptical slice sampling combined with a random sparse
approximation to the Gaussian process. We demonstrate applications of the
model to network data and clarify its relation to models in the literature,
several of which emerge as special cases.

Daniel M. Roy.
**On the computability
and complexity of Bayesian reasoning**.
In *NIPS Workshop on Philosophy and Machine Learning*, 2011.

** Abstract:** If we consider the claim made by some cognitive
scientists that the mind performs Bayesian reasoning, and if we
simultaneously accept the Physical Church-Turing thesis and thus believe that
the computational power of the mind is no more than that of a Turing machine,
then what limitations are there to the reasoning abilities of the mind? I
give an overview of joint work with Nathanael Ackerman (Harvard, Mathematics)
and Cameron Freer (MIT, CSAIL) that bears on the computability and complexity
of Bayesian reasoning. In particular, we prove that conditional probability
is in general not computable in the presence of continuous random variables.
However, in light of additional structure in the prior distribution, such as
the presence of certain types of noise, or of exchangeability, conditioning
is possible. These results cover most of statistical practice. At the
workshop on Logic and Computational Complexity, we presented results on the
computational complexity of conditioning, embedding sharp-P-complete problems
in the task of computing conditional probabilities for diffuse continuous
random variables. This work complements older work. For example, under
cryptographic assumptions, the computational complexity of producing samples
and computing probabilities was separated by Ben-David, Chor, Goldreich and
Luby. In recent work, we also make use of cryptographic assumptions to show
that different representations of exchangeable sequences may have vastly
different complexity. However, when faced with an adversary that is
computationally bounded, these different representations have the same
complexity, highlighting the fact that knowledge representation and
approximation play a fundamental role in the possibility and plausibility of
Bayesian reasoning.

David Sontag and Daniel M. Roy.
**The Complexity of
Inference in Latent Dirichlet Allocation**.
In *Advances in Neural Information Processing Systems 24*, Cambridge,
MA, USA, 2011. The MIT Press.

** Abstract:** We consider the
computational complexity of probabilistic inference in Latent Dirichlet
Allocation (LDA). First, we study the problem of finding the maximum a
posteriori (MAP) assignment of topics to words, where the document's topic
distribution is integrated out. We show that, when the effective number of
topics per document is small, exact inference takes polynomial time. In
contrast, we show that, when a document has a large number of topics, finding
the MAP assignment of topics to words in LDA is NP-hard. Next, we consider
the problem of finding the MAP topic distribution for a document, where the
topic-word assignments are integrated out. We show that this problem is also
NP-hard. Finally, we briefly discuss the problem of sampling from the
posterior, showing that this is NP-hard in one restricted setting, but
leaving open the general question.

E. Gilboa, Yunus Saatçi, and John P. Cunningham.
**Scaling multidimensional inference for structured Gaussian processes**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*, pages
424-436, 2015, doi
10.1109/TPAMI.2013.192.

** Abstract:** Exact Gaussian
process (GP) regression has O(N^{3}) runtime for data size N, making
it intractable for large N. Many algorithms for improving GP scaling
approximate the covariance with lower rank matrices. Other work has exploited
structure inherent in particular covariance functions, including GPs with
implied Markov structure, and inputs on a lattice (both enable O(N) or O(N
log N) runtime). However, these GP advances have not been well extended to
the multidimensional input setting, despite the preponderance of
multidimensional applications. This paper introduces and tests three novel
extensions of structured GPs to multidimensional inputs, for models with
additive and multiplicative kernels. First we present a new method for
inference in additive GPs, showing a novel connection between the classic
backfitting method and the Bayesian framework. We extend this model using two
advances: a variant of projection pursuit regression, and a Laplace
approximation for non-Gaussian observations. Lastly, for multiplicative
kernel structure, we present a novel method for GPs with inputs on a
multidimensional grid. We illustrate the power of these three advances on
several data sets, achieving performance equal to or very close to the naive
GP at orders of magnitude less cost.

** Comment:** arXiv

E. Gilboa, Yunus Saatçi, and John P. Cunningham.
**Scaling
multidimensional Gaussian processes using projected additive
approximations**.
In *30th International Conference on Machine Learning*, 2013.

** Abstract:** Exact Gaussian Process (GP) regression has
O(N^{3}) runtime for data size N, making it intractable for large N.
Many algorithms for improving GP scaling approximate the covariance with
lower rank matrices. Other work has exploited structure inherent in
particular covariance functions, including GPs with implied Markov structure,
and equispaced inputs (both enable O(N) runtime). However, these GP advances
have not been extended to the multidimensional input setting, despite the
preponderance of multidimensional applications. This paper introduces and
tests novel extensions of structured GPs to multidimensional inputs. We
present new methods for additive GPs, showing a novel connection between the
classic backfitting method and the Bayesian framework. To achieve optimal
accuracy-complexity tradeoff, we extend this model with a novel variant of
projection pursuit regression. Our primary result – projection pursuit
Gaussian Process Regression – shows orders of magnitude speedup while
preserving high accuracy. The natural second and third steps include
non-Gaussian observations and higher dimensional equispaced grid methods. We
introduce novel techniques to address both of these necessary directions. We
thoroughly illustrate the power of these three advances on several datasets,
achieving close performance to the naive Full GP at orders of magnitude less
cost.

Yunus Saatçi.
**Scalable inference for
structured Gaussian process models**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2011.

** Abstract:** The generic inference and learning
algorithm for Gaussian Process (GP) regression has O(N^{3}) runtime
and O(N^{2}) memory complexity, where N is the number of observations
in the dataset. Given the computational resources available to a present-day
workstation, this implies that GP regression simply *cannot be run* on
large datasets. The need to use non-Gaussian likelihood functions for tasks
such as classification adds even more to the computational burden
involved.

The majority of algorithms designed to improve the scaling of
GPs are founded on the idea of approximating the true covariance matrix,
which is usually of rank N, with a matrix of rank P, where P<<N.
Typically, the true training set is replaced with a smaller, representative
(pseudo-) training set such that a specific measure of information loss is
minimized. These algorithms typically attain O(P^{2}N) runtime and
O(PN) space complexity. They are also general in the sense that they are
designed to work with *any* covariance function. In essence, they
trade off accuracy with computational complexity. The central contribution of
this thesis is to improve scaling instead by exploiting any structure that is
present in the covariance matrices generated by *particular*
covariance functions. Instead of settling for a kernel-independent
accuracy/complexity trade off, as is done in much of the literature, we often
obtain accuracies close to, or exactly equal to the full GP model at a
fraction of the computational cost.

We define a *structured* GP as
any GP model that is endowed with a kernel which produces structured
covariance matrices. A trivial example of a structured GP is one with the
linear regression kernel. In this case, given inputs living in R^{D},
the covariance matrices generated have rank D - this results in significant
computational gains in the usual case where D<<N. Another case arises
when a stationary kernel is evaluated on equispaced, scalar inputs. This
results in *Toeplitz* covariance matrices and all necessary
computations can be carried out exactly in O(N log N).
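
The two structured cases in this paragraph can be checked numerically. The sketch below is only an illustration under assumed choices (a squared-exponential stationary kernel, the helper names, and the toy sizes are not from the thesis): it confirms that a linear kernel on inputs in R^{D} yields a covariance matrix of rank at most D, and that a stationary kernel on equispaced scalar inputs yields a Toeplitz matrix determined by its first column.

```python
import numpy as np

def linear_kernel(X):
    """Linear regression kernel: K = X X^T, so rank(K) <= D for X of shape (N, D)."""
    return X @ X.T

def stationary_kernel(tau, lengthscale=1.0):
    """A stationary kernel depends on the inputs only through tau = x - x'."""
    return np.exp(-0.5 * (tau / lengthscale) ** 2)

def toeplitz_from_first_column(c):
    """Build the symmetric Toeplitz matrix with T[i, j] = c[|i - j|]."""
    n = len(c)
    idx = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    return c[idx]

# Equispaced scalar inputs: x[i] - x[j] depends only on i - j,
# so the covariance matrix is constant along each diagonal.
x = np.linspace(0.0, 5.0, 8)
K = stationary_kernel(x[:, None] - x[None, :])
print(np.allclose(K, toeplitz_from_first_column(K[:, 0])))
```

The sketch only exhibits the structure; the fast exact computations mentioned above rely on specialised Toeplitz solvers, which are not shown.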

This thesis
studies four more types of structured GP. First, we comprehensively review
the case of kernels corresponding to *Gauss-Markov* processes
evaluated on scalar inputs. Using state-space models we show how
(generalised) regression (including hyperparameter learning) can be performed
in O(N log N) runtime and O(N) space. Secondly, we study the case where we
introduce block structure into the covariance matrix of a GP time-series
model by assuming a particular form of nonstationarity a priori. Third, we
extend the efficiency of scalar Gauss-Markov processes to higher-dimensional
input spaces by assuming *additivity*. We illustrate the connections
between the classical backfitting algorithm and approximate Bayesian
inference techniques including Gibbs sampling and variational Bayes. We also
show that it is possible to relax the rather strong assumption of additivity
without sacrificing O(N log N) complexity, by means of a projection-pursuit
style GP regression model. Finally, we study the properties of a GP model
with a tensor product kernel evaluated on a multivariate grid of input
locations. We show that for an *arbitrary* (regular or irregular) grid
the resulting covariance matrices are Kronecker and full GP regression can be
implemented in O(N) time and memory usage.
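
The Kronecker structure mentioned for tensor product kernels on a grid can be illustrated with a small sketch. The squared-exponential factor kernels, the two-dimensional irregular grid, and the function names are assumptions for the example; it shows only the structure and a matrix-vector product that never forms the full matrix, not the thesis's complete regression algorithm.

```python
import numpy as np

def rbf(u, v, lengthscale=1.0):
    """One-dimensional squared-exponential kernel matrix."""
    return np.exp(-0.5 * ((u[:, None] - v[None, :]) / lengthscale) ** 2)

def kron_matvec(K1, K2, v):
    """Compute (K1 kron K2) @ v without forming the Kronecker product.

    Uses the identity (K1 kron K2) vec(V) = vec(K1 V K2^T) for
    row-major vec, where V has shape (n1, n2).
    """
    n1, n2 = K1.shape[0], K2.shape[0]
    V = v.reshape(n1, n2)
    return (K1 @ V @ K2.T).ravel()

# An irregular grid: the axis points need not be equispaced.
a = np.array([0.0, 0.3, 1.1, 2.0])
b = np.array([-1.0, 0.5, 0.7])
K1, K2 = rbf(a, a), rbf(b, b)

# Full covariance of the product kernel on all grid points (row-major order).
grid = np.array([(ai, bj) for ai in a for bj in b])
K_full = rbf(grid[:, 0], grid[:, 0]) * rbf(grid[:, 1], grid[:, 1])
print(np.allclose(K_full, np.kron(K1, K2)))
```

Because each factor matrix is small, working with `K1` and `K2` separately is what makes grid-structured computations cheap relative to the full N-by-N matrix.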

We illustrate the power of
these methods on several real-world regression datasets which satisfy the
assumptions inherent in the structured GP employed. In many cases we obtain
performance comparable to the generic GP algorithm. We also analyse the
performance degradation when these assumptions are not met, and in several
cases show that it is comparable to that observed for sparse GP methods. We
provide similar results for regression tasks with non-Gaussian likelihoods,
an extension rarely addressed by sparse GP techniques.

Yunus Saatçi, Ryan Turner, and Carl Edward Rasmussen.
**Gaussian process
change point models**.
In *27th International Conference on Machine Learning*, pages 927-934,
Haifa, Israel, June 2010.

** Abstract:** We combine Bayesian
online change point detection with Gaussian processes to create a
nonparametric time series model which can handle change points. The model can
be used to locate change points in an online manner; and, unlike other
Bayesian online change point detection algorithms, is applicable when
temporal correlations in a regime are expected. We show three variations on
how to apply Gaussian processes in the change point context, each with their
own advantages. We present methods to reduce the computational burden of
these models and demonstrate them on several real-world data sets.

Ryan Turner, Yunus Saatçi, and Carl Edward Rasmussen.
**Adaptive
sequential Bayesian change point detection**.
In Zaïd Harchaoui, editor, *NIPS Workshop on Temporal Segmentation*,
Whistler, BC, Canada, December 2009.

** Abstract:** Real-world
time series are often nonstationary with respect to the parameters of some
underlying prediction model (UPM). Furthermore, it is often desirable to
adapt the UPM to incoming regime changes as soon as possible, necessitating
sequential inference about change point locations. A Bayesian algorithm for
online change point detection (BOCPD) has been introduced recently by Adams
and MacKay (2007). In this algorithm, uncertainty about the last change point
location is updated sequentially, and is integrated out to make online
predictions robust to parameter changes. BOCPD requires a set of fixed
hyper-parameters which allow the user to fully specify the hazard function
for change points and the prior distribution over the parameters of the UPM.
In practice, finding the "right" hyper-parameters can be quite difficult. We
therefore extend BOCPD by introducing hyper-parameter learning, without
sacrificing the online nature of the algorithm. Hyper-parameter learning is
performed by optimizing the marginal likelihood of the BOCPD model, a
closed-form quantity which can be computed sequentially. We illustrate
performance on three real-world datasets.
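
The sequential run-length update that BOCPD performs can be sketched as follows. This is a simplified illustration with fixed hyper-parameters, assuming a constant hazard rate and a Gaussian UPM with known observation variance and a conjugate Gaussian prior on the mean; the function and parameter names are hypothetical, and the hyper-parameter learning proposed in the paper is not shown.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def bocpd(data, hazard=0.01, mu0=0.0, var0=1.0, obs_var=1.0):
    """Return R with R[t, r] = P(run length r | first t observations)."""
    T = len(data)
    R = np.zeros((T + 1, T + 1))
    R[0, 0] = 1.0
    # Posterior over the segment mean, one entry per candidate run length.
    means = np.array([mu0])
    variances = np.array([var0])
    for t, x in enumerate(data):
        # Predictive density of x under each current run length.
        pred = gaussian_pdf(x, means, variances + obs_var)
        weighted = R[t, : t + 1] * pred
        # Either the run grows by one, or a change point resets it to zero.
        R[t + 1, 1 : t + 2] = weighted * (1.0 - hazard)
        R[t + 1, 0] = np.sum(weighted * hazard)
        R[t + 1] /= R[t + 1].sum()
        # Conjugate update of the mean posterior for every grown run;
        # run length zero restarts from the prior.
        post_var = 1.0 / (1.0 / variances + 1.0 / obs_var)
        post_mean = post_var * (means / variances + x / obs_var)
        means = np.concatenate(([mu0], post_mean))
        variances = np.concatenate(([var0], post_var))
    return R
```

The most probable run length after the final observation is `R[-1].argmax()`. The values of `hazard`, `mu0` and `var0` are exactly the kind of fixed hyper-parameters that the abstract describes as difficult to choose by hand.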

Richard J. Gibbens and Yunus Saatçi.
**Data,
modelling and inference in road traffic networks**.
*Philosophical Transactions of the Royal Society A: Mathematical, Physical
and Engineering Sciences*, 366(1872):1907-1919, June 2008, doi
10.1098/rsta.2008.0020.

** Abstract:** In this paper, we
study UK road traffic data and explore a range of modelling and inference
questions that arise from them. For example, loop detectors on the M25
motorway record speed and flow measurements at regularly spaced locations as
well as the entry and exit lanes of junctions. An exploratory study of these
data helps us to better understand and quantify the nature of congestion on
the road network. From a traveller's perspective it is crucially important to
understand the overall journey times and we look at methods to improve our
ability to predict journey times given access jointly to both real-time and
historical loop detector data. Throughout this paper we will comment on
related work derived from US freeway data.

Jurgen Van Gael, Yunus Saatçi, Yee-Whye Teh, and Zoubin Ghahramani.
**Beam sampling
for the infinite hidden Markov model**.
In *25th International Conference on Machine Learning*, volume 25,
pages 1088-1095, Helsinki, Finland, 2008. Association for Computing
Machinery.

** Abstract:** The infinite hidden Markov model is
a non-parametric extension of the widely used hidden Markov model. Our paper
introduces a new inference algorithm for the infinite Hidden Markov model
called beam sampling. Beam sampling combines slice sampling, which limits the
number of states considered at each time step to a finite number, with
dynamic programming, which samples whole state trajectories efficiently. Our
algorithm typically outperforms the Gibbs sampler and is more robust. We
present applications of iHMM inference using the beam sampler on changepoint
detection and text prediction problems.

Adam Ścibior.
**Formally justified and
modular Bayesian inference for probabilistic programs**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2019.

** Abstract:** Probabilistic modelling offers a simple
and coherent framework to describe the real world in the face of uncertainty.
Furthermore, by applying Bayes' rule it is possible to use probabilistic
models to make inferences about the state of the world from partial
observations. While traditionally probabilistic models were constructed on
paper, more recently the approach of probabilistic programming enables users
to write the models in executable languages resembling computer programs and
to freely mix them with deterministic code. It has long been recognised that
the semantics of programming languages is complicated and the intuitive
understanding that programmers have is often inaccurate, resulting in
difficult to understand bugs and unexpected program behaviours. Programming
languages are therefore studied in a rigorous way using formal languages with
mathematically defined semantics. Traditionally formal semantics of
probabilistic programs are defined using exact inference results, but in
practice exact Bayesian inference is not tractable and approximate methods
are used instead, posing a question of how the results of these algorithms
relate to the exact results. Correctness of such approximate methods is
usually argued somewhat less rigorously, without reference to a formal
semantics. In this dissertation we formally develop denotational semantics
for probabilistic programs that correspond to popular sampling algorithms
often used in practice. The semantics is defined for an expressive typed
lambda calculus with higher-order functions and inductive types, extended
with probabilistic effects for sampling and conditioning, allowing continuous
distributions and unbounded likelihoods. It makes crucial use of the recently
developed formalism of quasi-Borel spaces to bring all these elements
together. We provide semantics corresponding to several variants of Markov
chain Monte Carlo and Sequential Monte Carlo methods and formally prove a
notion of correctness for these algorithms in the context of probabilistic
programming. We also show that the semantic construction can be directly
mapped to an implementation using established functional programming
abstractions called monad transformers. We develop a compact Haskell library
for probabilistic programming closely corresponding to the semantic
construction, giving users a high level of assurance in the correctness of
the implementation. We also demonstrate on a collection of benchmarks that
the library offers performance competitive with existing systems of similar
scope. An important property of our construction, both the semantics and the
implementation, is the high degree of modularity it offers. All the inference
algorithms are constructed by combining small building blocks in a setup
where the type system ensures correctness of compositions. We show that with
basic building blocks corresponding to vanilla Metropolis-Hastings and
Sequential Monte Carlo we can implement more advanced algorithms known in the
literature, such as Resample-Move Sequential Monte Carlo, Particle Marginal
Metropolis-Hastings, and Sequential Monte Carlo squared. These
implementations are very concise, reducing the effort required to produce
them and the scope for bugs. On top of that, our modular construction enables
in some cases deterministic testing of randomised inference algorithms,
further increasing reliability of the implementation.

Adam Ścibior, Ohad Kammar, and Zoubin Ghahramani.
**Functional programming
for modular Bayesian inference**.
*Proceedings of the ACM on Programming Languages*, 2, 2018.

** Abstract:** We present an architectural design of a library
for Bayesian modelling and inference in modern functional programming
languages. The novel aspects of our approach are modular implementations of
existing state-of-the-art inference algorithms. Our design relies on three
inherently functional features: higher-order functions, inductive data-types,
and support for either type-classes or an expressive module system. We
provide a performant Haskell implementation of this architecture,
demonstrating that high-level and modular probabilistic programming can be
added as a library in sufficiently expressive languages. We review the core
abstractions in this architecture: inference representations, inference
transformations, and inference representation transformers. We then implement
concrete instances of these abstractions, counterparts to particle filters
and Metropolis-Hastings samplers, which form the basic building blocks of our
library. By composing these building blocks we obtain state-of-the-art
inference algorithms: Resample-Move Sequential Monte Carlo, Particle Marginal
Metropolis-Hastings, and Sequential Monte Carlo Squared. We evaluate our
implementation against existing probabilistic programming systems and find it
is already competitively performant, although we conjecture that existing
functional programming optimisation techniques could reduce the overhead
associated with the abstractions we use. We show that our modular design
enables deterministic testing of inherently stochastic Monte Carlo
algorithms. Finally, we demonstrate using OCaml that an expressive module
system can also implement our design.

Adam Ścibior, Ohad Kammar, Matthijs Vákár, Sam Staton, Hongseok Yang,
Yufei Cai, Klaus Ostermann, Sean K. Moss, Chris Heunen, and Zoubin
Ghahramani.
**Denotational
validation of higher-order Bayesian inference**.
*Proceedings of the ACM on Programming Languages*, 2, 2018.

** Abstract:** We present a modular semantic account of Bayesian
inference algorithms for probabilistic programming languages, as used in data
science and machine learning. Sophisticated inference algorithms are often
explained in terms of composition of smaller parts. However, neither their
theoretical justification nor their implementation reflects this modularity.
We show how to conceptualise and analyse such inference algorithms as
manipulating intermediate representations of probabilistic programs using
higher-order functions and inductive types, and their denotational semantics.
Semantic accounts of continuous distributions use measurable spaces. However,
our use of higher-order functions presents a substantial technical
difficulty: it is impossible to define a measurable space structure over the
collection of measurable functions between arbitrary measurable spaces that
is compatible with standard operations on those functions, such as function
application. We overcome this difficulty using quasi-Borel spaces, a recently
proposed mathematical structure that supports both function spaces and
continuous distributions. We define a class of semantic structures for
representing probabilistic programs, and semantic validity criteria for
transformations of these representations in terms of distribution
preservation. We develop a collection of building blocks for composing
representations. We use these building blocks to validate common inference
algorithms such as Sequential Monte Carlo and Markov Chain Monte Carlo. To
emphasize the connection between the semantic manipulation and its
traditional measure theoretic origins, we use Kock’s synthetic measure
theory. We demonstrate its usefulness by proving a quasi-Borel counterpart to
the Metropolis-Hastings-Green theorem.

Johannes Borgström, Andrew D. Gordon, Long Ouyang, Claudio Russo, Adam
Ścibior, and Marcin Szymczak.
**Fabular:
Regression formulas as probabilistic programming**.
In *Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on
Principles of Programming Languages*, POPL 2016, pages 271-283, New
York, NY, USA, 2016. ACM, doi
10.1145/2837614.2837653.

** Abstract:** Regression
formulas are a domain-specific language adopted by several R packages for
describing an important and useful class of statistical models: hierarchical
linear regressions. Formulas are succinct, expressive, and clearly popular,
so are they a useful addition to probabilistic programming languages? And
what do they mean? We propose a core calculus of hierarchical linear
regression, in which regression coefficients are themselves defined by nested
regressions (unlike in R). We explain how our calculus captures the essence
of the formula DSL found in R. We describe the design and implementation of
Fabular, a version of the Tabular schema-driven probabilistic programming
language, enriched with formulas based on our regression calculus. To the
best of our knowledge, this is the first formal description of the core ideas
of R's formula notation, the first development of a calculus of regression
formulas, and the first demonstration of the benefits of composing regression
formulas and latent variables in a probabilistic programming language.

Carl-Johann Simon-Gabriel, Adam Ścibior, Ilya Tolstikhin, and Bernhard
Schölkopf.
**Consistent
kernel mean estimation for functions of random variables**.
In *Advances in Neural Information Processing Systems 30*, 2016.

** Abstract:** We provide a theoretical foundation for
non-parametric estimation of functions of random variables using kernel mean
embeddings. We show that for any continuous function f, consistent estimators
of the mean embedding of a random variable X lead to consistent estimators of
the mean embedding of f(X). For Matérn kernels and sufficiently smooth
functions we also provide rates of convergence. Our results extend to
functions of multiple random variables. If the variables are dependent, we
require an estimator of the mean embedding of their joint distribution as a
starting point; if they are independent, it is sufficient to have separate
estimators of the mean embeddings of their marginal distributions. In either
case, our results cover both mean embeddings based on i.i.d. samples as well
as "reduced set" expansions in terms of dependent expansion points. The
latter serves as a justification for using such expansions to limit memory
resources when applying the approach as a basis for probabilistic
programming.
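
The basic estimator discussed in this abstract can be sketched in a few lines: if a weighted set of points represents the mean embedding of X, applying f to each point (keeping the weights) gives an estimate of the mean embedding of f(X). The Gaussian kernel, the scalar inputs, and the function names below are assumptions for illustration only.

```python
import numpy as np

def gaussian_kernel(u, v, bandwidth=1.0):
    """Kernel matrix between two arrays of scalar points."""
    return np.exp(-0.5 * ((u[:, None] - v[None, :]) / bandwidth) ** 2)

def embedding_at(points, weights, query, bandwidth=1.0):
    """Evaluate the embedding estimate sum_i w_i k(points_i, .) at the query points."""
    return weights @ gaussian_kernel(points, query, bandwidth)

def embedding_of_function_at(f, points, weights, query, bandwidth=1.0):
    """Estimate the embedding of f(X) by pushing the expansion points through f."""
    return embedding_at(f(points), weights, query, bandwidth)
```

With uniform weights on i.i.d. samples this reduces to the usual empirical embedding of f(X); with non-uniform weights on dependent points it corresponds to a "reduced set" style expansion.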

Adam Ścibior, Zoubin Ghahramani, and Andrew D. Gordon.
**Practical
probabilistic programming with monads**.
In *Proceedings of the 8th ACM SIGPLAN Symposium on Haskell*.
Association for Computing Machinery, 2015, doi
10.1145/2804302.2804317.

** Abstract:** The machine
learning community has recently shown a lot of interest in practical
probabilistic programming systems that target the problem of Bayesian
inference. Such systems come in different forms, but they all express
probabilistic models as computational processes using syntax resembling
programming languages. In the functional programming community monads are
known to offer a convenient and elegant abstraction for programming with
probability distributions, but their use is often limited to very simple
inference problems. We show that it is possible to use the monad abstraction
to construct probabilistic models for machine learning, while still offering
good performance of inference in challenging models. We use a GADT as an
underlying representation of a probability distribution and apply Sequential
Monte Carlo-based methods to achieve efficient inference. We define a formal
semantics via measure theory. We demonstrate a clean and elegant
implementation that achieves performance comparable with Anglican, a
state-of-the-art probabilistic programming system.

Rowan McAllister, Yarin Gal, Alex Kendall, Mark van der Wilk, Amar Shah,
Roberto Cipolla, and Adrian Weller.
**Concrete problems
for autonomous vehicle safety: Advantages of Bayesian deep
learning**.
In *International Joint Conference on Artificial Intelligence*,
Melbourne, Australia, August 2017.

** Abstract:** Autonomous
vehicle (AV) software is typically composed of a pipeline of individual
components, linking sensor inputs to motor outputs. Erroneous component
outputs propagate downstream, hence safe AV software must consider the
ultimate effect of each component's errors. Further, improving safety alone
is not sufficient. Passengers must also feel safe to trust and use AV
systems. To address such concerns, we investigate three under-explored themes
for AV research: safety, interpretability, and compliance. Safety can be
improved by quantifying the uncertainties of component outputs and
propagating them forward through the pipeline. Interpretability is concerned
with explaining what the AV observes and why it makes the decisions it does,
building reassurance with the passenger. Compliance refers to maintaining
some control for the passenger. We discuss open challenges for research
within these themes. We highlight the need for concrete evaluation metrics,
propose example problems, and highlight possible solutions.

Sankalp Bhatnagar, Anna Alexandrova, Shahar Avin, Stephen Cave, Lucy Cheke,
Matthew Crosby, Jan Feyereisl, Marta Halina, Bao Sheng Loe, Seán
Ó hÉigeartaigh, Fernando Martínez-Plumed, Huw Price, Henry Shevlin, Adrian
Weller, Alan Winfield, and Jose Hernandez-Orallo.
**Mapping
intelligence: Requirements and possibilities**.
In *Philosophy and Theory of Artificial Intelligence (PT-AI)*, 2017.

** Abstract:** New types of artificial intelligence (AI), from
cognitive assistants to social robots, are challenging meaningful comparison
with other kinds of intelligence. How can such intelligent systems be
catalogued, evaluated, and contrasted, with representations and projections
that offer meaningful insights? To catalyse the research in AI and the future
of cognition, we present the motivation, requirements and possibilities for
an atlas of intelligence: an integrated framework and collaborative open
repository for collecting and exhibiting information of all kinds of
intelligence, including humans, non-human animals, AI systems, hybrids and
collectives thereof. After presenting this initiative, we review related
efforts and present the requirements of such a framework. We survey existing
visualisations and representations, and discuss which criteria of inclusion
should be used to configure an atlas of intelligence.

Matthew W Hoffman, Bobak Shahriari, and Nando de Freitas.
**On
correlation and budget constraints in model-based bandit optimization with
application to automatic machine learning**.
In *17th International Conference on Artificial Intelligence and
Statistics*, pages 365-374, Reykjavik, Iceland, April 2014.

** Abstract:** We address the problem of finding the maximizer
of a nonlinear function that can only be evaluated, subject to noise, at a
finite number of query locations. Further, we will assume that there is a
constraint on the total number of permitted function evaluations. We
introduce a Bayesian approach for this problem and show that it empirically
outperforms both the existing frequentist counterpart and other Bayesian
optimization methods. The Bayesian approach places emphasis on detailed
modelling, including the modelling of correlations among the arms. As a
result, it can perform well in situations where the number of arms is much
larger than the number of allowed function evaluations, whereas the
frequentist counterpart is inapplicable. This feature enables us to develop
and deploy practical applications, such as automatic machine learning
toolboxes. The paper presents comprehensive comparisons of the proposed
approach with many Bayesian and bandit optimization techniques, the first
comparison of many of these methods in the literature.

Amar Shah, Andrew Gordon Wilson, and Zoubin Ghahramani.
**Student-t
processes as alternatives to Gaussian processes**.
In *AISTATS*, JMLR Proceedings. JMLR.org, 2014.

**
Abstract:** We investigate the Student-t process as an alternative to the
Gaussian process as a nonparametric prior over functions. We derive closed
form expressions for the marginal likelihood and predictive distribution of a
Student-t process, by integrating away an inverse Wishart process prior over
the covariance kernel of a Gaussian process model. We show surprising
equivalences between different hierarchical Gaussian process models leading
to Student-t processes, and derive a new sampling scheme for the inverse
Wishart process, which helps elucidate these equivalences. Overall, we show
that a Student-t process can retain the attractive properties of a Gaussian
process - a nonparametric representation, analytic marginal and predictive
distributions, and easy model selection through covariance kernels - but has
enhanced flexibility, and predictive covariances that, unlike a Gaussian
process, explicitly depend on the values of training observations. We verify
empirically that a Student-t process is especially useful in situations where
there are changes in covariance structure, or in applications like Bayesian
optimization, where accurate predictive covariances are critical for good
performance. These advantages come at no additional computational cost over
Gaussian processes.

Tomoharu Iwata, Amar Shah, and Zoubin Ghahramani.
**Discovering
latent influence in online social activities via shared cascade Poisson
processes**.
In *KDD*, pages 266-274. Association for Computing Machinery, 2013.

** Abstract:** Many people share their activities with others
through online communities. These shared activities have an impact on other
users' activities. For example, users are likely to become interested in
items that are adopted (e.g. liked, bought and shared) by their friends. In
this paper, we propose a probabilistic model for discovering latent influence
from sequences of item adoption events. An inhomogeneous Poisson process is
used for modeling a sequence, in which adoption by a user triggers the
subsequent adoption of the same item by other users. For modeling adoption of
multiple items, we employ multiple inhomogeneous Poisson processes, which
share parameters, such as influence for each user and relations between
users. The proposed model can be used for finding influential users,
discovering relations between users and predicting item popularity in the
future. We present an efficient Bayesian inference procedure of the proposed
model based on the stochastic EM algorithm. The effectiveness of the proposed
model is demonstrated by using real data sets in a social bookmark sharing
service.

Amar Shah and Zoubin Ghahramani.
**Determinantal
clustering processes - A nonparametric Bayesian approach to kernel based
semi-supervised clustering**.
*UAI*, 2013.

** Abstract:** Semi-supervised clustering is
the task of clustering data points into clusters where only a fraction of the
points are labelled. The true number of clusters in the data is often unknown
and most models require this parameter as an input. Dirichlet process mixture
models are appealing as they can infer the number of clusters from the data.
However, these models do not deal with high dimensional data well and can
encounter difficulties in inference. We present a novel nonparametric
Bayesian kernel based method to cluster data points without the need to
prespecify the number of clusters or to model complicated densities from
which data points are assumed to be generated. The key insight is to use
determinants of submatrices of a kernel matrix as a measure of how close
together a set of points are. We explore some theoretical properties of the
model and derive a natural Gibbs based algorithm with MCMC hyperparameter
learning. The model is implemented on a variety of synthetic and real world
data sets.
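
The key insight above (kernel-submatrix determinants as a closeness measure) can be illustrated with a minimal sketch. It assumes a squared-exponential kernel and hypothetical toy points; the helper names `rbf_kernel` and `subset_logdet` are illustrative and not taken from the paper. A tightly packed set gives a nearly singular submatrix, so its log-determinant is much lower than that of a spread-out set.

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    """Squared-exponential kernel matrix for the rows of X (assumed kernel)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

def subset_logdet(K, idx):
    """Log-determinant of the kernel submatrix indexed by idx."""
    sub = K[np.ix_(idx, idx)]
    # A small jitter keeps the nearly singular case numerically stable.
    _, logdet = np.linalg.slogdet(sub + 1e-10 * np.eye(len(idx)))
    return logdet

# Hypothetical toy data: three points packed together, three far apart.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [-5.0, 5.0], [5.0, -5.0]])
K = rbf_kernel(X)

tight = subset_logdet(K, [0, 1, 2])   # strongly negative: points are close
spread = subset_logdet(K, [3, 4, 5])  # near zero: points are almost uncorrelated
print(f"tight set: {tight:.2f}, spread set: {spread:.2f}")
```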

Christian Steinruecken.
**Compressing combinatorial
objects**.
arXiv:1401.03689 [cs.IT], January 2016.

** Abstract:**
Most of the world's digital data is currently encoded in a sequential form,
and compression methods for sequences have been studied extensively. However,
there are many types of non-sequential data for which good compression
techniques are still largely unexplored. This paper contributes insights and
concrete techniques for compressing various kinds of non-sequential data via
arithmetic coding, and derives re-usable probabilistic data models from
fairly generic structural assumptions. Near-optimal compression methods are
described for certain types of permutations, combinations and multisets; and
the conditions for optimality are made explicit for each method.

Christian Steinruecken, Zoubin Ghahramani, and David MacKay.
**Improving PPM
with dynamic parameter updates**.
In Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, and James A.
Storer, editors, *Proceedings of the Data Compression Conference*.
IEEE Computer Society, April 2015.

** Abstract:** This article
makes several improvements to the classic PPM algorithm, resulting in a new
algorithm with superior compression effectiveness on human text. The key
differences of our algorithm to classic PPM are that (A) rather than the
original escape mechanism, we use a generalised blending method with explicit
hyper-parameters that control the way symbol counts are combined to form
predictions; (B) different hyper-parameters are used for classes of different
contexts; and (C) these hyper-parameters are updated dynamically using
gradient information. The resulting algorithm (PPM-DP) compresses human text
better than all currently published variants of PPM, CTW, DMC, LZ, CSE and
BWT, with runtime only slightly slower than classic PPM.

Christian Steinruecken.
**Compressing sets and
multisets of sequences**.
*IEEE Transactions on Information Theory*, 61(3):1485-1490, March
2015, doi: 10.1109/TIT.2015.2392093.
A previous version was published at the Data Compression Conference 2014.

** Abstract:** This article describes lossless compression
algorithms for multisets of sequences, taking advantage of the multiset's
unordered structure. Multisets are a generalisation of sets where members are
allowed to occur multiple times. A multiset can be encoded naively by simply
storing its elements in some sequential order, but then information is wasted
on the ordering. We propose a technique that transforms the multiset into an
order-invariant tree representation, and derive an arithmetic code that
optimally compresses the tree. Our method achieves compression even if the
sequences in the multiset are individually incompressible (such as
cryptographic hash sums). The algorithm is demonstrated practically by
compressing collections of SHA-1 hash sums, and multisets of arbitrary,
individually encodable objects.
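
To make the "information wasted on the ordering" concrete, here is a small sketch that counts the bits a sequential encoding spends on order alone. It assumes every arrangement of the multiset is equally likely, so the cost is the base-2 logarithm of the number of distinct arrangements. The function name `ordering_bits` is hypothetical, and this is not the tree-based code described in the article.

```python
import math
from collections import Counter

def ordering_bits(items):
    """Bits needed to specify one arrangement of a multiset.

    The number of distinct arrangements is n! / (c_1! * c_2! * ...),
    where c_i are the multiplicities of the distinct elements.
    """
    counts = Counter(items)
    n = len(items)
    log_arrangements = math.lgamma(n + 1) - sum(
        math.lgamma(c + 1) for c in counts.values()
    )
    return log_arrangements / math.log(2)

# Three distinct elements can be ordered in 3! = 6 ways.
print(ordering_bits(["a", "b", "c"]))
# Repeated elements reduce the number of distinct arrangements.
print(ordering_bits(["a", "a", "b"]))
```

An order-invariant encoding can, in principle, save up to this many bits relative to storing the elements in an arbitrary sequence.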

Felipe Tobar, Thang D. Bui, and Richard E. Turner.
**Learning
stationary time series using Gaussian process with nonparametric
kernels**.
In *Advances in Neural Information Processing Systems 29*, Montréal,
Canada, Dec 2015.

** Abstract:** We introduce the Gaussian
Process Convolution Model (GPCM), a two-stage nonparametric generative
procedure to model stationary signals as the convolution between a
continuous-time white-noise process and a continuous-time linear filter drawn
from a Gaussian process. The GPCM is a continuous-time nonparametric-window
moving average process and, conditionally, is itself a Gaussian process with
a nonparametric kernel defined in a probabilistic fashion. The generative
model can be equivalently considered in the frequency domain, where the power
spectral density of the signal is specified using a Gaussian process. One of
the main contributions of the paper is to develop a novel variational
free-energy approach based on inter-domain inducing variables that efficiently
learns the continuous-time linear filter and infers the driving white-noise
process. In turn, this scheme provides closed-form probabilistic estimates of
the covariance kernel and the noise-free signal both in denoising and
prediction scenarios. Additionally, the variational inference procedure
provides closed-form expressions for the approximate posterior of the
spectral density given the observed data, leading to new Bayesian
nonparametric approaches to spectrum estimation. The proposed GPCM is
validated using synthetic and real-world signals.

Felipe Tobar, Petar M. Djurić, and Danilo P. Mandic.
**Unsupervised
state-space modeling using reproducing kernels**.
*IEEE Transactions on Signal Processing*, 63:5210 - 5221, 2015.

** Abstract:** A novel framework for the design of state-space
models (SSMs) is proposed whereby the state-transition function of the model
is parametrized using reproducing kernels. The nature of SSMs requires
learning a latent function that resides in the state space and for which
input-output sample pairs are not available, thus prohibiting the use of
gradient-based supervised kernel learning. To this end, we then propose to
learn the mixing weights of the kernel estimate by sampling from their
posterior density using Monte Carlo methods. We first introduce an offline
version of the proposed algorithm, followed by an online version which
performs inference on both the parameters and the hidden state through
particle filtering. The accuracy of the estimation of the state-transition
function is first validated on synthetic data. Next, we show that the
proposed algorithm outperforms kernel adaptive filters in the prediction of
real-world time series, while also providing probabilistic estimates, a key
advantage over standard methods.

Felipe Tobar and Danilo P. Mandic.
**Design
of positive-definite quaternion kernels**.
*IEEE Signal Processing Letters*, 22:2117 - 2121, 2015.

** Abstract:** Quaternion reproducing kernel Hilbert spaces (QRKHS) have been
proposed recently and provide a high-dimensional feature space (alternative
to the real-valued multikernel approach) for general kernel-learning
applications. The current challenge within quaternion-kernel learning is the
lack of general quaternion-valued kernels, which are necessary to exploit the
full advantages of the QRKHS theory in real-world problems. This letter
proposes a novel way to design quaternion-valued kernels; this is achieved by
transforming three complex kernels into quaternion ones and then combining
their real and imaginary parts. Building on this general construction, our
emphasis is on a new quaternion kernel of polynomial features, which is
assessed in the prediction of body-sensor network applications.

Felipe Tobar and Danilo P. Mandic.
**High-dimensional kernel regression: A guide for practitioners**.
In Y. C. Lim, H. K. Kwan, and W.-C. Siu, editors, *Trends in Digital Signal
Processing: A Festschrift in Honour of A.G. Constantinides*,
chapter 9, pages 287-310. CRC Press, 2015.

Felipe Tobar and Richard E. Turner.
**Modelling
of complex signals using Gaussian processes**.
In *Proceedings of the IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP)*, pages 2209 - 2213, 2015.

** Abstract:** In complex-valued signal processing, estimation
algorithms require complete knowledge (or accurate estimation) of the
second-order statistics; this makes Gaussian processes (GP) well suited for
modelling complex signals, as they are designed in terms of covariance
functions. Dealing with bivariate signals using GPs requires four covariance
matrices, or equivalently, two complex matrices. We propose a GP-based
approach for modelling complex signals, whereby the second-order statistics
are learnt through maximum likelihood; in particular, the complex GP approach
allows for circularity coefficient estimation in a robust manner when the
observed signal is corrupted by (circular) white noise. The proposed model is
validated using climate signals, for both circular and noncircular cases. The
results obtained open new possibilities for collaboration between the complex
signal processing and Gaussian processes communities towards an appealing
representation and statistical description of bivariate signals.

Jonathan Gordon, Wessel Bruinsma, Andrew Y. K. Foong, James Requeima, Yann
Dubois, and Richard Turner.
**Convolutional
conditional neural processes**.
Addis Ababa, April 2020.

** Abstract:** We introduce the
Convolutional Conditional Neural Process (ConvCNP), a new member of the
Neural Process family that models translation equivariance in the data.
Translation equivariance is an important inductive bias for many learning
problems including time series modelling, spatial data, and images. The model
embeds data sets into an infinite-dimensional function space, as opposed to
finite-dimensional vector spaces. To formalize this notion, we extend the
theory of neural representations of sets to include functional
representations, and demonstrate that any translation-equivariant embedding
can be represented using a convolutional deep-set. We evaluate ConvCNPs in
several settings, demonstrating that they achieve state-of-the-art
performance compared to existing NPs. We demonstrate that building in
translation equivariance enables zero-shot generalization to challenging,
out-of-domain tasks.

John Bronskill, Jonathan Gordon, James Requeima, Sebastian Nowozin, and
Richard E. Turner.
**TaskNorm:
rethinking batch normalization for meta-learning**.
In *37th International Conference on Machine Learning*. Proceedings of
Machine Learning Research, 2020.

** Abstract:** Modern
meta-learning approaches for image classification rely on increasingly deep
networks to achieve state-of-the-art performance, making batch normalization
an essential component of meta-learning pipelines. However, the hierarchical
nature of the meta-learning setting presents several challenges that can
render conventional batch normalization ineffective, giving rise to the need
to rethink normalization in this setting. We evaluate a range of approaches
to batch normalization for meta-learning scenarios, and develop a novel
approach that we call TASKNORM. Experiments on fourteen datasets demonstrate
that the choice of batch normalization has a dramatic effect on both
classification accuracy and training time for both gradient based and
gradient free meta-learning approaches. Importantly, TASKNORM is found to
consistently improve performance. Finally, we provide a set of best practices
for normalization that will allow fair comparison of meta-learning
algorithms.

Wessel Bruinsma, Eric Perim, Will Tebbutt, J. Scott Hosking, Arno Solin, and
Richard E. Turner.
**Scalable
exact inference in multi-output Gaussian processes**.
In *37th International Conference on Machine Learning*. Proceedings of
Machine Learning Research, 2020.

** Abstract:** Multi-output
Gaussian processes (MOGPs) leverage the flexibility and interpretability of
GPs while capturing structure across outputs, which is desirable, for
example, in spatio-temporal modelling. The key problem with MOGPs is their
computational scaling $O(n^3 p^3)$, which is cubic in the number of both
inputs $n$ (e.g., time points or locations) and outputs $p$. For this reason,
a popular class of MOGPs assumes that the data live around a low-dimensional
linear subspace, reducing the complexity to $O(n^3 m^3)$. However, this cost
is still cubic in the dimensionality of the subspace $m$, which is still
prohibitively expensive for many applications. We propose the use of a
sufficient statistic of the data to accelerate inference and learning in
MOGPs with orthogonal bases. The method achieves linear scaling in $m$ in
practice, allowing these models to scale to large $m$ without sacrificing
significant expressivity or requiring approximation. This advance opens up a
wide range of real-world tasks and can be combined with existing GP
approximations in a plug-and-play way. We demonstrate the efficacy of the
method on various synthetic and real-world data sets.

Jonathan Gordon, John Bronskill, Matthias Bauer, Sebastian Nowozin, and Richard
Turner.
**Meta-learning
probabilistic inference for prediction**.
New Orleans, April 2019.

** Abstract:** This paper introduces a
new framework for data efficient and versatile learning. Specifically: 1) We
develop ML-PIP, a general framework for Meta-Learning approximate
Probabilistic Inference for Prediction. ML-PIP extends existing probabilistic
interpretations of meta-learning to cover a broad class of methods. 2) We
introduce Versa, an instance of the framework employing a flexible and
versatile amortization network that takes few-shot learning datasets as
inputs, with arbitrary numbers of shots, and outputs a distribution over
task-specific parameters in a single forward pass. Versa substitutes
optimization at test time with forward passes through inference networks,
amortizing the cost of inference and relieving the need for second
derivatives during training. 3) We evaluate Versa on benchmark datasets
where the method sets new state-of-the-art results, and can handle arbitrary
numbers of shots, and for classification, arbitrary numbers of classes at
train and test time. The power of the approach is then demonstrated through a
challenging few-shot ShapeNet view reconstruction task.

James Requeima, Jonathan Gordon, John Bronskill, Sebastian Nowozin, and
Richard E. Turner.
**Fast
and flexible multi-task classification using conditional neural adaptive
processes**.
In *Advances in Neural Information Processing Systems 33*, 2019.

** Abstract:** The goal of this paper is to design image
classification systems that, after an initial multi-task training phase, can
automatically adapt to new tasks encountered at test time. We introduce a
conditional neural process based approach to the multi-task classification
setting for this purpose, and establish connections to the meta- and few-shot
learning literature. The resulting approach, called CNAPs, comprises a
classifier whose parameters are modulated by an adaptation network that takes
the current task's dataset as input. We demonstrate that CNAPs achieves
state-of-the-art results on the challenging Meta-Dataset benchmark indicating
high-quality transfer-learning. We show that the approach is robust, avoiding
both over-fitting in low-shot regimes and under-fitting in high-shot regimes.
Timing experiments reveal that CNAPs is computationally efficient at
test-time as it does not involve gradient based adaptation. Finally, we show
that trained models are immediately deployable to continual learning and
active learning where they can outperform existing approaches that do not
leverage transfer learning.

James Requeima, William Tebbutt, Wessel Bruinsma, and Richard E. Turner.
**The Gaussian
process autoregressive regression model (GPAR)**.
In *22nd International Conference on Artificial Intelligence and
Statistics*. Proceedings of Machine Learning Research, 2019.

** Abstract:** Multi-output regression models must exploit
dependencies between outputs to maximise predictive performance. The
application of Gaussian processes (GPs) to this setting typically yields
models that are computationally demanding and have limited representational
power. We present the Gaussian Process Autoregressive Regression (GPAR)
model, a scalable multi-output GP model that is able to capture nonlinear,
possibly input-varying, dependencies between outputs in a simple and
tractable way: the product rule is used to decompose the joint distribution
over the outputs into a set of conditionals, each of which is modelled by a
standard GP. GPAR’s efficacy is demonstrated on a variety of synthetic and
real-world problems, outperforming existing GP models and achieving
state-of-the-art performance on established benchmarks.
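
The product-rule decomposition mentioned above can be written out as follows. The notation is an assumption introduced here for illustration: $x$ is the input, $y_1, \dots, y_M$ are the outputs in a chosen order, and $y_{1:m-1}$ denotes the outputs preceding $y_m$.

```latex
p(y_1, \dots, y_M \mid x) \;=\; \prod_{m=1}^{M} p\bigl(y_m \mid y_{1:m-1},\, x\bigr)
```

Under this reading, each factor is a single-output regression problem whose inputs are $x$ together with the earlier outputs, which is where a standard GP can be placed.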

Mark Rowland, Krzysztof Choromanski, Francois Chalus, Aldo Pacchiano, Tamas
Sarlos, Richard Turner, and Adrian Weller.
**Geometrically
coupled Monte Carlo sampling**.
In *Advances in Neural Information Processing Systems 32*, Montreal,
Canada, December 2018.

** Abstract:** Monte Carlo sampling in
high-dimensional, low-sample settings is important in many machine learning
tasks. We improve current methods for sampling in Euclidean spaces by
avoiding independence, and instead consider ways to couple samples. We show
fundamental connections to optimal transport theory, leading to novel
sampling algorithms, and providing new theoretical grounding for existing
strategies. We compare our new strategies against prior methods for improving
sample efficiency, including quasi-Monte Carlo, by studying discrepancy. We
explore our findings empirically, and observe benefits of our sampling
schemes for reinforcement learning and generative modelling.

Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard Turner, and
Adrian Weller.
**Structured
evolution with compact architectures for scalable policy
optimization**.
In *35th International Conference on Machine Learning*, Stockholm,
Sweden, July 2018.

** Abstract:** We present a new method of
blackbox optimization via gradient approximation with the use of structured
random orthogonal matrices, providing more accurate estimators than baselines
and with provable theoretical guarantees. We show that this algorithm can be
successfully applied to learn better quality compact policies than those
using standard gradient estimation techniques. The compact policies we learn
have several advantages over unstructured ones, including faster training
algorithms and faster inference. These benefits are important when the policy
is deployed on real hardware with limited resources. Further, compact
policies provide more scalable architectures for derivative-free optimization
(DFO) in high dimensional spaces. We show that most robotics tasks from the
OpenAI Gym can be solved using neural networks with fewer than 300 parameters,
with almost linear time complexity of the inference phase, with up to 13x
fewer parameters relative to the Evolution Strategies (ES) algorithm
introduced by Salimans et al. (2017). We do not need heuristics such as
fitness shaping to learn good quality policies, resulting in a simple and
theoretically motivated training mechanism.

Yingzhen Li and Richard E. Turner.
**Gradient Estimators
for Implicit Models**.
In *Sixth International Conference on Learning Representations*,
Vancouver, Canada, May 2018.

** Abstract:** Implicit models,
which allow for the generation of samples but not for point-wise evaluation
of probabilities, are omnipresent in real-world problems tackled by machine
learning and a hot topic of current research. Some examples include data
simulators that are widely used in engineering and scientific research,
generative adversarial networks (GANs) for image synthesis, and
hot-off-the-press approximate inference techniques relying on implicit
distributions. The majority of existing approaches to learning implicit
models rely on approximating the intractable distribution or optimisation
objective for gradient-based optimisation, which is liable to produce
inaccurate updates and thus poor models. This paper alleviates the need for
such approximations by proposing the Stein gradient estimator, which directly
estimates the score function of the implicitly defined distribution. The
efficacy of the proposed estimator is empirically demonstrated by examples
that include meta-learning for approximate inference and entropy regularised
GANs that provide improved sample diversities.

Cuong V. Nguyen, Yingzhen Li, Thang D. Bui, and Richard E. Turner.
**Variational
Continual Learning**.
In *Sixth International Conference on Learning Representations*,
Vancouver, Canada, May 2018.

** Abstract:** This paper develops
variational continual learning (VCL), a simple but general framework for
continual learning that fuses online variational inference (VI) and recent
advances in Monte Carlo VI for neural networks. The framework can
successfully train both deep discriminative models and deep generative models
in complex continual learning settings where existing tasks evolve over time
and entirely new tasks emerge. Experimental results show that variational
continual learning outperforms state-of-the-art continual learning methods on
a variety of tasks, avoiding catastrophic forgetting in a fully automatic
way.
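
The online variational inference step that this framework builds on can be summarised as a recursive projection. The symbols below are assumptions introduced for illustration and do not appear in the abstract: $\theta$ are the model parameters, $\mathcal{D}_t$ is the data for task $t$, $\mathcal{Q}$ is the approximating family, $Z_t$ is a normalising constant, and $q_0(\theta)$ is the prior.

```latex
q_t(\theta) \;=\; \arg\min_{q \in \mathcal{Q}} \;
\mathrm{KL}\!\left( q(\theta) \;\Big\|\; \frac{1}{Z_t}\, q_{t-1}(\theta)\, p(\mathcal{D}_t \mid \theta) \right),
\qquad t = 1, 2, \dots
```

In words, the previous approximate posterior plays the role of the prior when the next task's data arrive.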

Krzysztof Choromanski, Mark Rowland, Tamas Sarlos, Vikas Sindhwani, Richard E.
Turner, and Adrian Weller.
**The geometry of
random features**.
In *21st International Conference on Artificial Intelligence and
Statistics*, Playa Blanca, Lanzarote, Canary Islands, April 2018.

** Abstract:** We present an in-depth examination of the
effectiveness of radial basis function kernel (beyond Gaussian) estimators
based on orthogonal random feature maps. We show that orthogonal estimators
outperform state-of-the-art mechanisms that use iid sampling under weak
conditions for tails of the associated Fourier distributions. We prove that
for the case of many dimensions, the superiority of the orthogonal transform
can be accurately measured by a property we define called the charm of the
kernel, and that orthogonal random features provide optimal (in terms of mean
squared error) kernel estimators. We provide the first theoretical results
which explain why orthogonal random features outperform unstructured ones on
downstream tasks such as kernel ridge regression by showing that orthogonal
random features provide kernel algorithms with better spectral properties
than the previous state-of-the-art. Our results enable practitioners more
generally to estimate the benefits from applying orthogonal transforms.
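
As a rough illustration of orthogonal random feature maps, the sketch below approximates the Gaussian kernel $\exp(-\|x - y\|^2 / 2)$ only. The paper covers a broader class of radial basis function kernels, and the function names and toy data here are hypothetical. Each block of frequency vectors is made exactly orthogonal via a QR decomposition, then rescaled with chi-distributed lengths so that every row keeps the marginal distribution of an iid Gaussian vector.

```python
import numpy as np

def orthogonal_gaussian_matrix(n_features, dim, rng):
    """Stack dim x dim blocks whose rows are orthogonal within each block."""
    blocks = []
    n_blocks = -(-n_features // dim)  # ceiling division
    for _ in range(n_blocks):
        G = rng.standard_normal((dim, dim))
        Q, R = np.linalg.qr(G)
        Q = Q * np.sign(np.diag(R))  # sign fix so Q is uniformly distributed
        # Row lengths of an iid Gaussian matrix follow a chi distribution.
        norms = np.sqrt(rng.chisquare(dim, size=dim))
        blocks.append(norms[:, None] * Q)
    return np.vstack(blocks)[:n_features]

def random_fourier_features(X, W):
    """Cosine/sine features whose inner products estimate the Gaussian kernel."""
    proj = X @ W.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(W.shape[0])

rng = np.random.default_rng(0)
X = 0.5 * rng.standard_normal((20, 8))        # hypothetical toy inputs
W = orthogonal_gaussian_matrix(2048, 8, rng)
Phi = random_fourier_features(X, W)

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-0.5 * sq_dists)
K_approx = Phi @ Phi.T
print("max abs error:", np.max(np.abs(K_approx - K_exact)))
```

Replacing `orthogonal_gaussian_matrix` with a plain `rng.standard_normal((2048, 8))` gives the iid baseline that the abstract compares against.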

Thang D. Bui, Cuong V. Nguyen, and Richard E. Turner.
**Streaming
sparse Gaussian process approximations**.
In *Advances in Neural Information Processing Systems 31*,
volume 31, Long Beach, California, USA, December 2017.

** Abstract:** Sparse approximations for Gaussian process models provide a
suite of methods that enable these models to be deployed in large data regimes
and enable analytic intractabilities to be sidestepped. However, the field
lacks a principled method to handle streaming data in which the posterior
distribution over function values and the hyperparameters are updated in an
online fashion. The small number of existing approaches either use suboptimal
hand-crafted heuristics for hyperparameter learning, or suffer from
catastrophic forgetting or slow updating when new data arrive. This paper
develops a new principled framework for deploying Gaussian process
probabilistic models in the streaming setting, providing principled methods
for learning hyperparameters and optimising pseudo-input locations. The
proposed framework is experimentally validated using synthetic and real-world
datasets.

** Comment:** The first two authors contributed equally.

Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Bernhard
Schölkopf, and Sergey Levine.
**Interpolated policy gradient:
Merging on-policy and off-policy policy gradient estimation for deep
reinforcement learning**.
In *Advances in Neural Information Processing Systems 31*, Long Beach,
USA, Dec 2017.

** Abstract:** Off-policy model-free deep
reinforcement learning methods using previously collected data can improve
sample efficiency over on-policy policy gradient techniques. On the other
hand, on-policy algorithms are often more stable and easier to use. This
paper examines, both theoretically and empirically, approaches to merging on-
and off-policy updates for deep reinforcement learning. Theoretical results
show that off-policy updates with a value function estimator can be
interpolated with on-policy policy gradient updates whilst still satisfying
performance bounds. Our analysis uses control variate methods to produce a
family of policy gradient algorithms, with several recently proposed
algorithms being special cases of this family. We then provide an empirical
comparison of these techniques with the remaining algorithmic details fixed,
and show how different mixtures of off-policy gradient estimates with on-policy
samples contribute to improvements in empirical performance. The final
algorithm provides a generalization and unification of existing deep policy
gradient techniques, has theoretical guarantees on the bias introduced by
off-policy updates, and improves on the state-of-the-art model-free deep RL
methods on a number of OpenAI Gym continuous control benchmarks.

Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, José Miguel Hernández-Lobato,
Richard E. Turner, and Douglas Eck.
**Sequence tutor: Conservative
fine-tuning of sequence generation models with KL-control**.
In *34th International Conference on Machine Learning*, Sydney,
Australia, Aug 2017.

** Abstract:** This paper proposes a
general method for improving the structure and quality of sequences generated
by a recurrent neural network (RNN), while maintaining information originally
learned from data, as well as sample diversity. An RNN is first pre-trained
on data using maximum likelihood estimation (MLE), and the probability
distribution over the next token in the sequence learned by this model is
treated as a prior policy. Another RNN is then trained using reinforcement
learning (RL) to generate higher-quality outputs that account for
domain-specific incentives while retaining proximity to the prior policy of
the MLE RNN. To formalize this objective, we derive novel off-policy RL
methods for RNNs from KL-control. The effectiveness of the approach is
demonstrated on two applications: 1) generating novel musical melodies, and
2) computational molecular generation. For both problems, we show that the
proposed method improves the desired properties and structure of the
generated sequences, while maintaining information learned from data.

** Comment:** [MIT
Technology Review] [Video]

Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, and
Sergey Levine.
**Q-prop: Sample-efficient policy
gradient with an off-policy critic**.
In *5th International Conference on Learning Representations*, Toulon,
France, April 2017.

** Abstract:** Model-free deep
reinforcement learning (RL) methods have been successful in a wide variety of
simulated domains. However, a major obstacle facing deep RL in the real world
is its high sample complexity. Batch policy gradient methods offer stable
learning, but at the cost of high variance, which often requires large
batches. TD-style methods, such as off-policy actor-critic and Q-learning,
are more sample-efficient but biased, and often require costly hyperparameter
sweeps to stabilize. In this work, we aim to develop methods that combine the
stability of policy gradients with the efficiency of off-policy RL. We
present Q-Prop, a policy gradient method that uses a Taylor expansion of the
off-policy critic as a control variate. Q-Prop is both sample efficient and
stable, and effectively combines the benefits of on-policy and off-policy
methods. We analyze the connection between Q-Prop and existing model-free
algorithms, and use control variate theory to derive two variants of Q-Prop
with conservative and aggressive adaptation. We show that conservative Q-Prop
provides substantial gains in sample efficiency over trust region policy
optimization (TRPO) with generalized advantage estimation (GAE), and improves
stability over deep deterministic policy gradient (DDPG), the
state-of-the-art on-policy and off-policy methods, on OpenAI Gym's MuJoCo
continuous control environments.
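
The control-variate idea behind this abstract can be shown on a toy Monte Carlo problem. This is a generic sketch, not Q-Prop itself: the integrand, the sampling distribution and all variable names are assumptions chosen for illustration. A first-order Taylor expansion with a known mean is subtracted from the quantity of interest, leaving the expectation unchanged while reducing variance.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10_000)

f = np.exp(x)        # quantity whose mean we want; the true mean is e - 1
g = 1.0 + x          # first-order Taylor expansion of exp(x) around 0
g_mean = 1.5         # known analytically: E[1 + x] for x ~ Uniform(0, 1)

plain = f                    # standard Monte Carlo samples
with_cv = f - (g - g_mean)   # same expectation, correlated part removed

print("plain   mean, variance:", plain.mean(), plain.var())
print("with CV mean, variance:", with_cv.mean(), with_cv.var())
```

Because `g` tracks `f` closely, the corrected samples fluctuate far less, so fewer samples are needed for the same accuracy.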

Alexandre Khae Wu Navarro, Jes Frellsen, and Richard E. Turner.
**The Multivariate Generalised
von Mises distribution: Inference and applications**.
January 2017.

** Abstract:** Circular variables arise in a
multitude of data-modelling contexts ranging from robotics to the social
sciences, but they have been largely overlooked by the machine learning
community. This paper partially redresses this imbalance by extending some
standard probabilistic modelling tools to the circular domain. First we
introduce a new multivariate distribution over circular variables, called the
multivariate Generalised von Mises (mGvM) distribution. This distribution can
be constructed by restricting and renormalising a general multivariate
Gaussian distribution to the unit hyper-torus. Previously proposed
multivariate circular distributions are shown to be special cases of this
construction. Second, we introduce a new probabilistic model for circular
regression, that is inspired by Gaussian Processes, and a method for
probabilistic principal component analysis with circular hidden variables.
These models can leverage standard modelling tools (e.g. covariance functions
and methods for automatic relevance determination). Third, we show that the
posterior distribution in these models is an mGvM distribution, which enables
development of an efficient variational free-energy scheme for performing
approximate inference and approximate maximum-likelihood learning.

Thang D. Bui, Josiah Yan, and Richard E. Turner.
**A unifying framework for
Gaussian process pseudo-point approximations using power expectation
propagation**.
*Journal of Machine Learning Research*, 18(104):1-72, 2017.

** Abstract:** Gaussian processes (GPs) are
flexible distributions over functions that enable high-level assumptions
about unknown functions to be encoded in a parsimonious, flexible and general
way. Although elegant, the application of GPs is limited by computational and
analytical intractabilities that arise when data are sufficiently numerous or
when employing non-Gaussian models. Consequently, a wealth of GP
approximation schemes have been developed over the last 15 years to address
these key limitations. Many of these schemes employ a small set of pseudo
data points to summarise the actual data. In this paper we develop a new
pseudo-point approximation framework using Power Expectation Propagation
(Power EP) that unifies a large number of these pseudo-point approximations.
Unlike much of the previous venerable work in this area, the new framework is
built on standard methods for approximate inference (variational free-energy,
EP and Power EP methods) rather than employing approximations to the
probabilistic generative model itself. In this way all of the approximation
is performed at 'inference time' rather than at 'modelling time', resolving
awkward philosophical and empirical questions that trouble previous
approaches. Crucially, we demonstrate that the new framework includes new
pseudo-point approximation methods that outperform current approaches on
regression and classification tasks.

Nilesh Tripuraneni, Mark Rowland, Zoubin Ghahramani, and Richard E. Turner.
**Magnetic
Hamiltonian Monte Carlo**.
In *34th International Conference on Machine Learning*, 2017.

** Abstract:** Hamiltonian Monte Carlo (HMC) exploits
Hamiltonian dynamics to construct efficient proposals for Markov chain Monte
Carlo (MCMC). In this paper, we present a generalization of HMC which
exploits non-canonical Hamiltonian dynamics. We refer to this algorithm as
magnetic HMC, since in 3 dimensions a subset of the dynamics map onto the
mechanics of a charged particle coupled to a magnetic field. We establish a
theoretical basis for the use of non-canonical Hamiltonian dynamics in MCMC,
and construct a symplectic, leapfrog-like integrator allowing for the
implementation of magnetic HMC. Finally, we exhibit several examples where
these non-canonical dynamics can lead to improved mixing of magnetic HMC
relative to ordinary HMC.

Yingzhen Li and Richard E. Turner.
**Rényi
Divergence Variational Inference**.
In *Advances in Neural Information Processing Systems 29*, Barcelona,
Spain, December 2016.

** Abstract:** This paper introduces the
variational Rényi bound (VR) that extends traditional variational
inference to Rényi's alpha-divergences. This new family of variational
methods unifies a number of existing approaches, and enables a smooth
interpolation from the evidence lower-bound to the log (marginal) likelihood
that is controlled by the value of alpha that parametrises the divergence.
The reparameterization trick, Monte Carlo approximation and stochastic
optimisation methods are deployed to obtain a tractable and unified framework
for optimisation. We further consider negative alpha values and propose a
novel variational inference method as a new special case in the proposed
framework. Experiments on Bayesian neural networks and variational
auto-encoders demonstrate the wide applicability of the VR bound.
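
For orientation, the variational Rényi bound is commonly written as below. The notation is an assumption made here, since the abstract does not fix it: $q(\theta)$ is the approximate posterior, $p(\theta, x)$ the joint density, and $\alpha$ the divergence parameter.

$$
\mathcal{L}_{\alpha}(q; x) \;=\; \frac{1}{1-\alpha}\,\log \mathbb{E}_{q(\theta)}\!\left[\left(\frac{p(\theta, x)}{q(\theta)}\right)^{1-\alpha}\right]
$$

In this form the limit $\alpha \rightarrow 1$ recovers the evidence lower bound $\mathbb{E}_{q(\theta)}[\log p(\theta, x) - \log q(\theta)]$, while $\alpha = 0$ gives $\log p(x)$ whenever $q$ covers the support of the posterior, which is the interpolation the abstract describes.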

**Deep Gaussian
processes for regression using approximate expectation propagation**.
In *33rd International Conference on Machine Learning*, New York, USA,
June 2016.

** Abstract:** Deep Gaussian processes (DGPs) are
multi-layer hierarchical generalisations of Gaussian processes (GPs) and are
formally equivalent to neural networks with multiple, infinitely wide hidden
layers. DGPs are nonparametric probabilistic models and as such are arguably
more flexible, have a greater capacity to generalise, and provide better
calibrated uncertainty estimates than alternative deep models. This paper
develops a new approximate Bayesian learning scheme that enables DGPs to be
applied to a range of medium to large scale regression problems for the first
time. The new method uses an approximate Expectation Propagation procedure
and a novel and efficient extension of the probabilistic backpropagation
algorithm for learning. We evaluate the new method for non-linear regression
on eleven real-world datasets, showing that it always outperforms GP
regression and is almost always better than state-of-the-art deterministic
and sampling-based approximate inference methods for Bayesian neural
networks. As a by-product, this work provides a comprehensive analysis of six
approximate Bayesian methods for training neural networks.

**Black-box
Alpha Divergence Minimization**.
In *33rd International Conference on Machine Learning*, New York, USA,
June 2016.

** Abstract:** Black-box alpha (BB-$α$) is a
new approximate inference method based on the minimization of
$α$-divergences. BB-$α$ scales to large datasets because it can be
implemented using stochastic gradient descent. BB-$α$ can be applied to
complex probabilistic models with little effort since it only requires as
input the likelihood function and its gradients. These gradients can be
easily obtained using automatic differentiation. By changing the divergence
parameter $α$, the method is able to interpolate between variational Bayes
(VB) ($α \rightarrow 0$) and an algorithm similar to expectation
propagation (EP) ($α = 1$). Experiments on probit regression and neural
network regression and classification problems show that BB-$α$ with
non-standard settings of $α$, such as $α = 0.5$, usually produces
better predictions than with $α \rightarrow 0$ (VB) or $α = 1$
(EP).

Alexander G D G Matthews, James Hensman, Richard E. Turner, and Zoubin
Ghahramani.
**On Sparse Variational methods
and the Kullback-Leibler divergence between stochastic processes**.
In *19th International Conference on Artificial Intelligence and
Statistics*, Cadiz, Spain, May 2016.

** Abstract:** The
variational framework for learning inducing variables (Titsias, 2009a) has
had a large impact on the Gaussian process literature. The framework may be
interpreted as minimizing a rigorously defined Kullback-Leibler divergence
between the approximating and posterior processes. To our knowledge this
connection has thus far gone unremarked in the literature. In this paper we
give a substantial generalization of the literature on this topic. We give a
new proof of the result for infinite index sets which allows inducing points
that are not data points and likelihoods that depend on all function values.
We then discuss augmented index sets and show that, contrary to previous
works, marginal consistency of augmentation is not enough to guarantee
consistency of variational inference with the original model. We then
characterize an extra condition where such a guarantee is obtainable. Finally
we show how our framework sheds light on interdomain sparse approximations
and sparse approximations for Cox processes.

Shixiang Gu, Zoubin Ghahramani, and Richard E. Turner.
**Neural adaptive sequential Monte
Carlo**.
In *Advances in Neural Information Processing Systems 28*, Montréal,
Canada, December 2015.

** Abstract:** Sequential Monte Carlo (SMC),
or particle filtering, is a popular class of methods for sampling from an
intractable target distribution using a sequence of simpler intermediate
distributions. Like other importance sampling-based methods, performance is
critically dependent on the proposal distribution: a bad proposal can lead to
arbitrarily inaccurate estimates of the target distribution. This paper
presents a new method for automatically adapting the proposal using an
approximation of the Kullback-Leibler divergence between the true posterior
and the proposal distribution. The method is very flexible, applicable to any
parameterised proposal distribution and it supports online and batch
variants. We use the new framework to adapt powerful proposal distributions
with rich parameterisations based upon neural networks leading to Neural
Adaptive Sequential Monte Carlo (NASMC). Experiments indicate that NASMC
significantly improves inference in a non-linear state space model
outperforming adaptive proposal methods including the Extended Kalman and
Unscented Particle Filters. Experiments also indicate that improved inference
translates into improved parameter learning when NASMC is used as a
subroutine of Particle Marginal Metropolis Hastings. Finally we show that
NASMC is able to train a neural network-based deep recurrent generative model
achieving results that compete with the state-of-the-art for polyphonic
music modelling. NASMC can be seen as bridging the gap between adaptive SMC
methods and the recent work in scalable, black-box variational inference.

Yingzhen Li, José Miguel Hernández-Lobato, and Richard E. Turner.
**Stochastic Expectation
Propagation**.
In *Advances in Neural Information Processing Systems 28*, Montréal,
Canada, December 2015.

** Abstract:** Expectation propagation (EP)
is a deterministic approximation algorithm that is often used to perform
approximate Bayesian parameter learning. EP approximates the full intractable
posterior distribution through a set of local-approximations that are
iteratively refined for each datapoint. EP can offer analytic and
computational advantages over other approximations, such as Variational
Inference (VI), and is the method of choice for a number of models. The local
nature of EP appears to make it an ideal candidate for performing Bayesian
learning on large models in large-scale dataset settings. However, EP has a
crucial limitation in this context: the number of approximating factors needs to
increase with the number of data-points, N, which often entails a
prohibitively large memory overhead. This paper presents an extension to EP,
called stochastic expectation propagation (SEP), that maintains a global
posterior approximation (like VI) but updates it in a local way (like EP).
Experiments on a number of canonical learning problems using synthetic and
real-world datasets indicate that SEP performs almost as well as full EP, but
reduces the memory consumption by a factor of N. SEP is therefore ideally
suited to performing approximate Bayesian learning in the large model, large
dataset setting.

Felipe Tobar, Thang D. Bui, and Richard E. Turner.
**Learning
stationary time series using Gaussian processes with nonparametric
kernels**.
In *Advances in Neural Information Processing Systems 28*, Montréal,
Canada, December 2015.

** Abstract:** We introduce the Gaussian
Process Convolution Model (GPCM), a two-stage nonparametric generative
procedure to model stationary signals as the convolution between a
continuous-time white-noise process and a continuous-time linear filter drawn
from a Gaussian process. The GPCM is a continuous-time nonparametric-window
moving average process and, conditionally, is itself a Gaussian process with
a nonparametric kernel defined in a probabilistic fashion. The generative
model can be equivalently considered in the frequency domain, where the power
spectral density of the signal is specified using a Gaussian process. One of
the main contributions of the paper is to develop a novel variational
free-energy approach based on inter-domain inducing variables that efficiently
learns the continuous-time linear filter and infers the driving white-noise
process. In turn, this scheme provides closed-form probabilistic estimates of
the covariance kernel and the noise-free signal both in denoising and
prediction scenarios. Additionally, the variational inference procedure
provides closed-form expressions for the approximate posterior of the
spectral density given the observed data, leading to new Bayesian
nonparametric approaches to spectrum estimation. The proposed GPCM is
validated using synthetic and real-world signals.

Yarin Gal and Richard Turner.
**Improving the
Gaussian process sparse spectrum approximation by representing uncertainty
in frequency inputs**.
In *Proceedings of the 32nd International Conference on Machine Learning
(ICML-15)*, pages 655-664, 2015.

** Abstract:** Standard
sparse pseudo-input approximations to the Gaussian process (GP) cannot handle
complex functions well. Sparse spectrum alternatives attempt to answer this
but are known to over-fit. We suggest the use of variational inference for
the sparse spectrum approximation to avoid both issues. We model the
covariance function with a finite Fourier series approximation and treat it
as a random variable. The random covariance function has a posterior, on
which a variational distribution is placed. The variational distribution
transforms the random covariance function to fit the data. We study the
properties of our approximate inference, compare it to alternative ones, and
extend it to the distributed and stochastic domains. Our approximation
captures complex functions better than standard approaches and avoids
over-fitting.

Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters.
**A causal perspective on domain
adaptation**.
*arXiv preprint arXiv:1507.05333*, 2015.

** Abstract:**
From training data from several related domains (or tasks), methods of domain
adaptation try to combine knowledge to improve performance. This paper
discusses an approach to domain adaptation which is inspired by a causal
interpretation of the multi-task problem. We assume that a covariate shift
assumption holds true for a subset of predictor variables: the conditional of
the target variable given this subset of predictors is invariant with respect
to shifts in those predictors (covariates). We propose to learn the
corresponding conditional expectation in the training domains and use it for
estimation in the target domain. We further introduce a method which allows
for automatic inference of the above subset in regression and classification.
We study the performance of this approach in an adversarial setting, in the
case where no additional examples are available in the test domain. If a
labeled sample is available, we provide a method for using both the
transferred invariant conditional and task specific information. We present
results on synthetic data sets and a sentiment analysis problem.

Felipe Tobar and Richard E. Turner.
**Modelling
of complex signals using Gaussian processes**.
In *Proceedings of the IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP)*, pages 2209-2213, 2015.

** Abstract:** In complex-valued signal processing, estimation
algorithms require complete knowledge (or accurate estimation) of the
second-order statistics; this makes Gaussian processes (GP) well suited for
modelling complex signals, as they are designed in terms of covariance
functions. Dealing with bivariate signals using GPs requires four covariance
matrices, or equivalently, two complex matrices. We propose a GP-based
approach for modelling complex signals, whereby the second-order statistics
are learnt through maximum likelihood; in particular, the complex GP approach
allows for circularity coefficient estimation in a robust manner when the
observed signal is corrupted by (circular) white noise. The proposed model is
validated using climate signals, for both circular and noncircular cases. The
results obtained open new possibilities for collaboration between the complex
signal processing and Gaussian processes communities towards an appealing
representation and statistical description of bivariate signals.

Thang D. Bui and Richard E. Turner.
**Tree-structured Gaussian process approximations**.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 27*,
volume 27, pages 2213-2221. Curran Associates, Inc., 2014.

** Abstract:** Gaussian process regression can be accelerated by
constructing a small pseudo-dataset to summarize the observed data. This idea
sits at the heart of many approximation schemes, but such an approach
requires the number of pseudo-datapoints to be scaled with the range of the
input space if the accuracy of the approximation is to be maintained. This
presents problems in time-series settings or in spatial datasets where large
numbers of pseudo-datapoints are required since computation typically scales
quadratically with the pseudo-dataset size. In this paper we devise an
approximation whose complexity grows linearly with the number of
pseudo-datapoints. This is achieved by imposing a tree or chain structure on
the pseudo-datapoints and calibrating the approximation using a
Kullback-Leibler (KL) minimization. Inference and learning can then be
performed efficiently using the Gaussian belief propagation algorithm. We
demonstrate the validity of our approach on a set of challenging regression
tasks including missing data imputation for audio and spatial datasets. We
trace out the speed-accuracy trade-off for the new method and show that the
frontier dominates those obtained from a large number of existing
approximation techniques.

Victoria Leong, Michael A Stone, Richard E Turner, and Usha Goswami.
**A
role for amplitude modulation phase relationships in speech rhythm
perception**.
*Journal of the Acoustical Society of America*, 136:366-381, 2014.

** Abstract:** Prosodic rhythm in speech [the alternation of
"Strong" (S) and "weak" (w) syllables] is cued, among others, by slow rates
of amplitude modulation (AM) within the speech envelope. However, it is
unclear exactly which envelope modulation rates and statistics are the most
important for the rhythm percept. Here, the hypothesis that the phase
relationship between "Stress" rate (~2 Hz) and "Syllable" rate (~4 Hz)
AMs provides a perceptual cue for speech rhythm is tested. In a rhythm
judgment task, adult listeners identified AM tone-vocoded nursery rhyme
sentences that carried either trochaic (S-w) or iambic patterning (w-S).
Manipulation of listeners' rhythm perception was attempted by parametrically
phase-shifting the Stress AM and Syllable AM in the vocoder. It was expected
that a 1π radian phase-shift (half a cycle) would reverse the perceived
rhythm pattern (i.e., trochaic -> iambic) whereas a 2π radian shift (full
cycle) would retain the perceived rhythm pattern (i.e., trochaic ->
trochaic). The results confirmed these predictions. Listeners' judgments of
rhythm systematically followed Stress-Syllable AM phase-shifts, but were
unaffected by phase-shifts between the Syllable AM and the Sub-beat AM
(~14 Hz) in a control condition. It is concluded that the Stress-Syllable
AM phase relationship is an envelope-based modulation statistic that supports
speech rhythm perception.
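
The phase-shift logic in the abstract can be checked numerically with a toy modulator. This is a minimal sketch under stated assumptions: the Stress AM is stood in for by a pure cosine at about 2 Hz, and the name `stress_am` is hypothetical. It is not the vocoder processing used in the study.

```python
import math

STRESS_HZ = 2.0  # approximate "Stress" AM rate quoted in the abstract


def stress_am(t, phase_shift=0.0):
    """Cosine stand-in for the Stress-rate amplitude modulator at time t (seconds)."""
    return math.cos(2 * math.pi * STRESS_HZ * t + phase_shift)


times = [i / 100 for i in range(100)]  # one second sampled every 10 ms
base = [stress_am(t) for t in times]
half_cycle = [stress_am(t, math.pi) for t in times]
full_cycle = [stress_am(t, 2 * math.pi) for t in times]

# A pi shift flips the sign of the modulator (strong and weak positions swap).
inverted = all(math.isclose(h, -b, abs_tol=1e-9) for h, b in zip(half_cycle, base))
# A 2*pi shift leaves the modulator unchanged (the pattern is retained).
restored = all(math.isclose(f, b, abs_tol=1e-9) for f, b in zip(full_cycle, base))
print(inverted, restored)
```

In this idealised setting the half-cycle shift exchanges peaks and troughs while the full-cycle shift reproduces the original modulator, mirroring the expected trochaic -> iambic and trochaic -> trochaic outcomes.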

Richard E. Turner and Maneesh Sahani.
**Decomposing
signals into a sum of amplitude and frequency modulated sinusoids using
probabilistic inference**.
In *Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE
International Conference on*, pages 2173-2176, March 2012, doi:
10.1109/ICASSP.2012.6288343.

** Abstract:** There are many
methods for decomposing signals into a sum of amplitude and frequency
modulated sinusoids. In this paper we take a new estimation based approach.
Identifying the problem as ill-posed, we show how to regularize the solution
by imposing soft constraints on the amplitude and phase variables of the
sinusoids. Estimation proceeds using a version of Kalman smoothing. We
evaluate the method on synthetic and natural, clean and noisy signals,
showing that it outperforms previous decompositions, but at a higher
computational cost.

Richard E. Turner and Maneesh Sahani.
**Demodulation
as probabilistic inference**.
*Transactions on Audio, Speech and Language Processing*, 19:2398-2411,
2011.

** Abstract:** Demodulation is an ill-posed problem
whenever both carrier and envelope signals are broadband and unknown. Here,
we approach this problem using the methods of probabilistic inference. The
new approach, called Probabilistic Amplitude Demodulation (PAD), is
computationally challenging but improves on existing methods in a number of
ways. By contrast to previous approaches to demodulation, it satisfies five
key desiderata: PAD has soft constraints because it is probabilistic; PAD is
able to automatically adjust to the signal because it learns parameters; PAD
is user-steerable because the solution can be shaped by user-specific prior
information; PAD is robust to broad-band noise because this is modelled
explicitly; and PAD’s solution is self-consistent, empirically satisfying a
Carrier Identity property. Furthermore, the probabilistic view naturally
encompasses noise and uncertainty, allowing PAD to cope with missing data and
return error bars on carrier and envelope estimates. Finally, we show that
when PAD is applied to a bandpass-filtered signal, the stop-band energy of
the inferred carrier is minimal, making PAD well-suited to sub-band
demodulation.

Richard E. Turner and Maneesh Sahani.
**Probabilistic
amplitude and frequency demodulation**.
In *Advances in Neural Information Processing Systems 24*, pages
981-989. The MIT Press, 2011.

** Abstract:** A number of
recent scientific and engineering problems require signals to be decomposed
into a product of a slowly varying positive envelope and a quickly varying
carrier whose instantaneous frequency also varies slowly over time. Although
signal processing provides algorithms for so-called amplitude- and
frequency-demodulation (AFD), there are well known problems with all of the
existing methods. Motivated by the fact that AFD is ill-posed, we approach
the problem using probabilistic inference. The new approach, called
probabilistic amplitude and frequency demodulation (PAFD), models
instantaneous frequency using an auto-regressive generalization of the von
Mises distribution, and the envelopes using Gaussian auto-regressive dynamics
with a positivity constraint. A novel form of expectation propagation is used
for inference. We demonstrate that although PAFD is computationally
demanding, it outperforms previous approaches on synthetic and real signals
in clean, noisy and missing data settings.

Richard E. Turner and Maneesh Sahani.
**Two
problems with variational expectation maximisation for time-series
models**.
In D. Barber, T. Cemgil, and S. Chiappa, editors, *Bayesian Time series
models*, chapter 5, pages 109-130. Cambridge University Press,
2011.

** Abstract:** Variational methods are a key component
of the approximate inference and learning toolbox. These methods fill an
important middle ground, retaining distributional information about
uncertainty in latent variables, unlike maximum a posteriori methods (MAP),
and yet generally requiring less computational time than Markov chain
Monte Carlo methods. In particular the variational Expectation Maximisation (vEM)
and variational Bayes algorithms, both involving variational optimisation of
a free-energy, are widely used in time-series modelling. Here, we investigate
the success of vEM in simple probabilistic time-series models. First we
consider the inference step of vEM, and show that a consequence of the
well-known compactness property of variational inference is a failure to
propagate uncertainty in time, thus limiting the usefulness of the retained
distributional information. In particular, the uncertainty may appear to be
smallest precisely when the approximation is poorest. Second, we consider
parameter learning and analytically reveal systematic biases in the
parameters found by vEM. Surprisingly, simpler variational approximations
(such as mean-field) can lead to less bias than more complicated structured
approximations.

Richard E. Turner and Maneesh Sahani.
**Statistical
inference for single- and multi-band probabilistic amplitude
demodulation**.
In *Proceedings of the IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP)*, pages 5466-5469, 2010.

** Abstract:** Amplitude demodulation is an ill-posed problem and so it is
natural to treat it from a Bayesian viewpoint, inferring the most likely
carrier and envelope under probabilistic constraints. One such treatment is
Probabilistic Amplitude Demodulation (PAD), which, whilst computationally
more intensive than traditional approaches, offers several advantages. Here
we provide methods for estimating the uncertainty in the PAD-derived
envelopes and carriers, and for learning free-parameters like the time-scale
of the envelope. We show how the probabilistic approach can naturally handle
noisy and missing data. Finally, we indicate how to extend the model to
signals which contain multiple modulators and carriers.

Richard E. Turner.
**Statistical
Models for Natural Sounds**.
PhD thesis, Gatsby Computational Neuroscience Unit, UCL, 2010.

** Abstract:** It is important to understand the rich structure of natural
sounds in order to solve important tasks, like automatic speech recognition,
and to understand auditory processing in the brain. This thesis takes a step
in this direction by characterising the statistics of simple natural sounds.
We focus on the statistics because perception often appears to depend on
them, rather than on the raw waveform. For example the perception of auditory
textures, like running water, wind, fire and rain, depends on
summary-statistics, like the rate of falling rain droplets, rather than on
the exact details of the physical source. In order to analyse the statistics
of sounds accurately it is necessary to improve a number of traditional
signal processing methods, including those for amplitude demodulation,
time-frequency analysis, and sub-band demodulation. These estimation tasks
are ill-posed and therefore it is natural to treat them as Bayesian inference
problems. The new probabilistic versions of these methods have several
advantages. For example, they perform more accurately on natural signals and
are more robust to noise, they can also fill-in missing sections of data, and
provide error-bars. Furthermore, free-parameters can be learned from the
signal. Using these new algorithms we demonstrate that the energy, sparsity,
modulation depth and modulation time-scale in each sub-band of a signal are
critical statistics, together with the dependencies between the sub-band
modulators. In order to validate this claim, a model containing co-modulated
coloured noise carriers is shown to be capable of generating a range of
realistic sounding auditory textures. Finally, we explore the connection
between the statistics of natural sounds and perception. We demonstrate that
inference in the model for auditory textures qualitatively replicates the
primitive grouping rules that listeners use to understand simple acoustic
scenes. This suggests that the auditory system is optimised for the
statistics of natural sounds.

Pietro Berkes, Richard E. Turner, and Maneesh Sahani.
**A
structured model of video reproduces primary visual cortical
organisation**.
*PLoS Computational Biology*, 5(9), September 2009, doi:
10.1371/journal.pcbi.1000495.

** Abstract:** The visual
system must learn to infer the presence of objects and features in the world
from the images it encounters, and as such it must, either implicitly or
explicitly, model the way these elements interact to create the image. Do the
response properties of cells in the mammalian visual system reflect this
constraint? To address this question, we constructed a probabilistic model in
which the identity and attributes of simple visual elements were represented
explicitly and learnt the parameters of this model from unparsed, natural
video sequences. After learning, the behaviour and grouping of variables in
the probabilistic model corresponded closely to functional and anatomical
properties of simple and complex cells in the primary visual cortex (V1). In
particular, feature identity variables were activated in a way that resembled
the activity of complex cells, while feature attribute variables responded
much like simple cells. Furthermore, the grouping of the attributes within
the model closely parallelled the reported anatomical grouping of simple
cells in cat V1. Thus, this generative model makes explicit an interpretation
of complex and simple cells as elements in the segmentation of a visual scene
into basic independent features, along with a parametrisation of their
moment-by-moment appearances. We speculate that such a segmentation may form
the initial stage of a hierarchical system that progressively separates the
identity and appearance of more articulated visual elements, culminating in
view-invariant object recognition.

Jörg Lücke, Richard E. Turner, Maneesh Sahani, and Marc Henniges.
**Occlusive
components analysis**.
In Y Bengio, D Schuurmans, J Lafferty, C K I Williams, and A Culotta, editors,
*Advances in Neural Information Processing Systems 22*, pages 1069-1077. The MIT Press, 2009.

** Abstract:**
We study unsupervised learning in a probabilistic generative model for
occlusion. The model uses two types of latent variables: one indicates which
objects are present in the image, and the other how they are ordered in
depth. This depth order then determines how the positions and appearances of
the objects present, specified in the model parameters, combine to form the
image. We show that the object parameters can be learnt from an unlabelled
set of images in which objects occlude one another. Exact maximum-likelihood
learning is intractable. However, we show that tractable approximations to
Expectation Maximization (EM) can be found if the training images each
contain only a small number of objects on average. In numerical experiments
it is shown that these approximations recover the correct set of object
parameters. Experiments on a novel version of the bars test using colored
bars, and experiments on more realistic data, show that the algorithm
performs well in extracting the generating causes. Experiments based on the
standard bars benchmark test for object learning show that the algorithm
performs well in comparison to other recent component extraction approaches.
The model and the learning algorithm thus connect research on occlusion with
the research field of multiple-causes component extraction methods.

Pietro Berkes, Richard E. Turner, and Maneesh Sahani.
**On
sparsity and overcompleteness in image models**.
In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, *Advances in
Neural Information Processing Systems 20*, volume 20. The MIT Press, 2008.

** Abstract:** Computational models
of visual cortex, and in particular those based on sparse coding, have
enjoyed much recent attention. Despite this currency, the question of how
sparse or how over-complete a sparse representation should be, has gone
without principled answer. Here, we use Bayesian model-selection methods to
address these questions for a sparse-coding model based on a Student-t prior.
Having validated our methods on toy data, we find that natural images are
indeed best modelled by extremely sparse distributions; although for the
Student-t prior, the associated optimal basis size is only modestly
over-complete.

Richard E. Turner and Maneesh Sahani.
**Modeling
natural sounds with modulation cascade processes**.
In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, *Advances in
Neural Information Processing Systems 20*, volume 20. The MIT Press, 2008.

** Abstract:** Natural sounds are
structured on many time-scales. A typical segment of speech, for example,
contains features that span four orders of magnitude: Sentences ($\sim 1$ s);
phonemes ($\sim 10^{-1}$ s); glottal pulses ($\sim 10^{-2}$ s); and formants
($\sim 10^{-3}$ s). The auditory system uses information from each of these
time-scales to solve complicated tasks such as auditory scene analysis [1].
One route toward understanding how auditory processing accomplishes this
analysis is to build neuroscience-inspired algorithms which solve similar
tasks and to compare the properties of these algorithms with properties of
auditory processing. There is however a discord: Current machine-audition
algorithms largely concentrate on the shorter time-scale structures in
sounds, and the longer structures are ignored. The reason for this is
two-fold. Firstly, it is a difficult technical problem to construct an
algorithm that utilises both sorts of information. Secondly, it is
computationally demanding to simultaneously process data both at high
resolution (to extract short temporal information) and for long duration (to
extract long temporal information). The contribution of this work is to
develop a new statistical model for natural sounds that captures structure
across a wide range of time-scales, and to provide efficient learning and
inference algorithms. We demonstrate the success of this approach on a
missing data task.

Richard E. Turner and Maneesh Sahani.
**Probabilistic
amplitude demodulation**.
In *7th International Conference on Independent Component Analysis and
Signal Separation*, pages 544-551, 2007.

** Abstract:**
Auditory scene analysis is extremely challenging. One approach, perhaps that
adopted by the brain, is to shape useful representations of sounds on prior
knowledge about their statistical structure. For example, sounds with
harmonic sections are common and so time-frequency representations are
efficient. Most current representations concentrate on the shorter
components. Here, we propose representations for structures on longer
time-scales, like the phonemes and sentences of speech. We decompose a sound
into a product of processes, each with its own characteristic time-scale.
This demodulation cascade relates to classical amplitude demodulation, but
traditional algorithms fail to realise the representation fully. A new
approach, probabilistic amplitude demodulation, is shown to out-perform the
established methods, and to easily extend to representation of a full
demodulation cascade.

Richard E. Turner and Maneesh Sahani.
**A
maximum-likelihood interpretation for slow feature analysis**.
*Neural Computation*, 19(4):1022-1038, 2007, doi
http://dx.doi.org/10.1162/neco.2007.19.4.1022.

**
Abstract:** The brain extracts useful features from a maelstrom of sensory
information, and a fundamental goal of theoretical neuroscience is to work
out how it does so. One proposed feature extraction strategy is motivated by
the observation that the meaning of sensory data, such as the identity of a
moving visual object, is often more persistent than the activation of any
single sensory receptor. This notion is embodied in the slow feature analysis
(SFA) algorithm, which uses “slowness” as an heuristic by which to
extract semantic information from multi-dimensional time-series. Here, we
develop a probabilistic interpretation of this algorithm showing that
inference and learning in the limiting case of a suitable probabilistic model
yield exactly the results of SFA. Similar equivalences have proved useful in
interpreting and extending comparable algorithms such as independent
component analysis. For SFA, we use the equivalent probabilistic model as a
conceptual spring-board, with which to motivate several novel extensions to
the algorithm.

**Predictive
entropy search for Bayesian optimization with unknown constraints**.
In *32nd International Conference on Machine Learning*, pages
1699-1707, 2015.

** Abstract:** Unknown constraints arise in
many types of expensive black-box optimization problems. Several methods have
been proposed recently for performing Bayesian optimization with constraints,
based on the expected improvement (EI) heuristic. However, EI can lead to
pathologies when used with constraints. For example, in the case of decoupled
constraints—i.e., when one can independently evaluate the objective or the
constraints—EI can encounter a pathology that prevents exploration.
Additionally, computing EI requires a current best solution, which may not
exist if none of the data collected so far satisfy the constraints. By
contrast, information-based approaches do not suffer from these failure
modes. In this paper, we present a new information-based method called
Predictive Entropy Search with Constraints (PESC). We analyze the performance
of PESC and show that it compares favorably to EI-based approaches on
synthetic and benchmark problems, as well as several real-world examples. We
demonstrate that PESC is an effective algorithm that provides a promising
direction towards a unified solution for constrained Bayesian
optimization.

David Duvenaud, Oren Rippel, Ryan P. Adams, and Zoubin Ghahramani.
**Avoiding pathologies in very
deep networks**.
In *17th International Conference on Artificial Intelligence and
Statistics*, Reykjavik, Iceland, April 2014.

**
Abstract:** Choosing appropriate architectures and regularization
strategies for deep networks is crucial to good predictive performance. To
shed light on this problem, we analyze the analogous problem of constructing
useful priors on compositions of functions. Specifically, we study the deep
Gaussian process, a type of infinitely-wide, deep neural network. We show
that in standard architectures, the representational capacity of the network
tends to capture fewer degrees of freedom as the number of layers increases,
retaining only a single degree of freedom in the limit. We propose an
alternate network architecture which does not suffer from this pathology. We
also examine deep covariance functions, obtained by composing infinitely many
feature transforms. Lastly, we characterize the class of models obtained by
performing dropout on Gaussian processes.

Andrew Gordon Wilson and Ryan Prescott Adams.
**Gaussian
process kernels for pattern discovery and extrapolation**.
In *30th International Conference on Machine Learning*, February 18
2013.

** Abstract:** Gaussian processes are rich distributions
over functions, which provide a Bayesian nonparametric approach to smoothing
and interpolation. We introduce simple closed form kernels that can be used
with Gaussian processes to discover patterns and enable extrapolation. These
kernels are derived by modelling a spectral density - the Fourier transform
of a kernel - with a Gaussian mixture. The proposed kernels support a broad
class of stationary covariances, but Gaussian process inference remains
simple and analytic. We demonstrate the proposed kernels by discovering
patterns and performing long range extrapolation on synthetic examples, as
well as atmospheric CO2 trends and airline passenger data. We also show that
we can reconstruct standard covariances within our framework.

** Comment:** arXiv:1302.4245
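
As a rough illustration of the construction described in the abstract above, the following Python sketch evaluates a one-dimensional stationary kernel whose spectral density is a symmetrised mixture of Gaussians. The function name and the parameter names (`weights`, `means`, `variances`) are assumptions made for this example and are not taken from the authors' code.

```python
import math

def spectral_mixture_kernel(tau, weights, means, variances):
    """Evaluate k(tau) = sum_q w_q * exp(-2 pi^2 tau^2 v_q) * cos(2 pi tau mu_q).

    tau       : difference between two scalar inputs
    weights   : mixture weights w_q (hypothetical name)
    means     : spectral means mu_q, i.e. component frequencies
    variances : spectral variances v_q, i.e. inverse squared length-scales
    """
    return sum(
        w * math.exp(-2.0 * math.pi ** 2 * tau ** 2 * v) * math.cos(2.0 * math.pi * tau * mu)
        for w, mu, v in zip(weights, means, variances)
    )
```

With a single component centred at zero frequency the expression reduces to a squared-exponential kernel, which is one way to see how standard covariances fit within this family.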

Marc Peter Deisenroth, Ryan D. Turner, Marco F. Huber, Uwe D. Hanebeck, and
Carl Edward Rasmussen.
**Robust
filtering and smoothing with Gaussian processes**.
*IEEE Transactions on Automatic Control*, 57(7):1865-1871, 2012, doi
10.1109/TAC.2011.2179426.

** Abstract:** We propose a
principled algorithm for robust Bayesian filtering and smoothing in nonlinear
stochastic dynamic systems when both the transition function and the
measurement function are described by nonparametric Gaussian process (GP)
models. GPs are gaining increasing importance in signal processing, machine
learning, robotics, and control for representing unknown system functions by
posterior probability distributions. This modern way of "system
identification" is more robust than finding point estimates of a parametric
function representation. Our principled filtering/smoothing approach for GP
dynamic systems is based on analytic moment matching in the context of the
forward-backward algorithm. Our numerical evaluations demonstrate the
robustness of the proposed approach in situations where other
state-of-the-art Gaussian filters and smoothers can fail.

Ryan D. Turner and Carl Edward Rasmussen.
**Model based learning
of sigma points in unscented Kalman filtering**.
*Neurocomputing*, 80:47-53, 2012, doi
10.1016/j.neucom.2011.07.029.

** Abstract:** The unscented
Kalman filter (UKF) is a widely used method in control and time series
applications. The UKF suffers from arbitrary parameters necessary for sigma
point placement, potentially causing it to perform poorly in nonlinear
problems. We show how to treat sigma point placement in a UKF as a learning
problem in a model based view. We demonstrate that learning to place the
sigma points correctly from data can make sigma point collapse much less
likely. Learning can result in a significant increase in predictive
performance over default settings of the parameters in the UKF and other
filters designed to avoid the problems of the UKF, such as the GP-ADF. At the
same time, we maintain a lower computational complexity than the other
methods. We call our method UKF-L.
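
To make concrete what the sigma point parameters mentioned above control, here is a minimal sketch of the standard scaled unscented transform for a scalar state. It is not the UKF-L method from the paper; the function name and the default values of `alpha`, `beta`, and `kappa` are illustrative assumptions.

```python
import math

def unscented_transform_1d(f, mean, var, alpha=1.0, beta=0.0, kappa=2.0):
    """Propagate a scalar Gaussian N(mean, var) through f using three sigma points."""
    n = 1                                    # state dimension
    lam = alpha ** 2 * (n + kappa) - n       # scaling parameter
    spread = math.sqrt((n + lam) * var)      # distance of outer points from the mean
    points = [mean, mean + spread, mean - spread]

    # Weights for the mean; the centre weight for the variance gets an extra term.
    w_mean = [lam / (n + lam), 0.5 / (n + lam), 0.5 / (n + lam)]
    w_cov = [w_mean[0] + (1.0 - alpha ** 2 + beta), w_mean[1], w_mean[2]]

    ys = [f(x) for x in points]
    y_mean = sum(w * y for w, y in zip(w_mean, ys))
    y_var = sum(w * (y - y_mean) ** 2 for w, y in zip(w_cov, ys))
    return y_mean, y_var
```

For a linear map the transform is exact whatever the parameter values, so any sensitivity to `alpha`, `beta`, and `kappa` shows up only for nonlinear `f`.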

Ryan Darby Turner.
**Gaussian processes for
state space models and change point detection**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2011.

** Abstract:** This thesis details several applications
of Gaussian processes (GPs) for enhanced time series modeling. We first cover
different approaches for using Gaussian processes in time series problems.
These are extended to the state space approach to time series in two
different problems. We also combine Gaussian processes and Bayesian online
change point detection (BOCPD) to increase the generality of the Gaussian
process time series methods. These methodologies are evaluated on predictive
performance on six real world data sets, which include three environmental
data sets, one financial, one biological, and one from industrial well
drilling.

Gaussian processes are capable of generalizing standard linear
time series models. We cover two approaches: the Gaussian process time series
model (GPTS) and the autoregressive Gaussian process (ARGP). We cover a
variety of methods that greatly reduce the computational and memory
complexity of Gaussian process approaches, which are generally cubic in
computational complexity.

Two different improvements to state space based
approaches are covered. First, Gaussian process inference and learning (GPIL)
generalizes linear dynamical systems (LDS), on which the Kalman filter is
based, to general nonlinear systems for nonparametric system identification.
Second, we address pathologies in the unscented Kalman filter (UKF). We use
Gaussian process optimization (GPO) to learn UKF settings that minimize the
potential for sigma point collapse.

We show how to embed the aforementioned
Gaussian process approaches to time series into a change point framework. Old
data, from an old regime, that hinders predictive performance is
automatically and elegantly phased out. The computational improvements for
Gaussian process time series approaches are of even greater use in the change
point framework. We also present a supervised framework learning a change
point model when change point labels are available in training.

These
methodologies significantly improve predictive performance on the
diverse set of data sets selected.

Ryan Turner, Steven Bottone, and Zoubin Ghahramani.
**Fast online
anomaly detection using scan statistics**.
In Samuel Kaski, David J. Miller, Erkki Oja, and Antti Honkela, editors,
*Machine Learning for Signal Processing (MLSP 2010)*, pages 385-390,
Kittilä, Finland, August 2010.

** Abstract:** We present
methods to do fast online anomaly detection using scan statistics. Scan
statistics have long been used to detect statistically significant bursts of
events. We extend the scan statistics framework to handle many practical
issues that occur in application: dealing with an unknown background rate of
events, allowing for slow natural changes in background frequency, the
inverse problem of finding an unusual lack of events, and setting the test
parameters to maximize power. We demonstrate its use on real and synthetic
data sets with comparison to other methods.
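
The basic idea of a scan statistic can be sketched as follows, assuming a known, constant background rate (the paper relaxes exactly this assumption). The function names are hypothetical. The tail probability is for a single window and does not correct for scanning over many overlapping windows.

```python
import math
from collections import deque

def max_window_count(event_times, window):
    """Largest number of events in any interval (t - window, t] ending at an event."""
    recent = deque()
    best = 0
    for t in sorted(event_times):
        recent.append(t)
        # Drop events that have fallen out of the window ending at t.
        while recent[0] <= t - window:
            recent.popleft()
        best = max(best, len(recent))
    return best

def poisson_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean), computed from the complementary sum."""
    cdf = sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))
    return 1.0 - cdf
```

A burst would then be flagged when `poisson_tail(max_window_count(times, w), rate * w)` falls below a chosen threshold, where `rate` is the assumed background rate.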

Ryan Turner and Carl Edward Rasmussen.
**Model based learning
of sigma points in unscented Kalman filtering**.
In Samuel Kaski, David J. Miller, Erkki Oja, and Antti Honkela, editors,
*Machine Learning for Signal Processing (MLSP 2010)*, pages 178-183,
Kittilä, Finland, August 2010.

** Abstract:** The
unscented Kalman filter (UKF) is a widely used method in control and time
series applications. The UKF suffers from arbitrary parameters necessary for
a step known as sigma point placement, causing it to perform poorly in
nonlinear problems. We show how to treat sigma point placement in a UKF as a
learning problem in a model based view. We demonstrate that learning to place
the sigma points correctly from data can make sigma point collapse much less
likely. Learning can result in a significant increase in predictive
performance over default settings of the parameters in the UKF and other
filters designed to avoid the problems of the UKF, such as the GP-ADF. At the
same time, we maintain a lower computational complexity than the other
methods. We call our method UKF-L.

Yunus Saatçi, Ryan Turner, and Carl Edward Rasmussen.
**Gaussian process
change point models**.
In *27th International Conference on Machine Learning*, pages 927-934,
Haifa, Israel, June 2010.

** Abstract:** We combine Bayesian
online change point detection with Gaussian processes to create a
nonparametric time series model which can handle change points. The model can
be used to locate change points in an online manner; and, unlike other
Bayesian online change point detection algorithms, is applicable when
temporal correlations in a regime are expected. We show three variations on
how to apply Gaussian processes in the change point context, each with their
own advantages. We present methods to reduce the computational burden of
these models and demonstrate them on several real world data sets.

Ryan Turner, Marc Peter Deisenroth, and Carl Edward Rasmussen.
**State-space
inference and learning with Gaussian processes**.
In Yee Whye Teh and Mike Titterington, editors, *13th International
Conference on Artificial Intelligence and Statistics*, volume 9 of
*W & CP*, pages 868-875, Chia Laguna, Sardinia, Italy, May 13-15
2010. Journal of Machine Learning Research.

** Abstract:**
State-space inference and learning with Gaussian processes (GPs) is an
unsolved problem. We propose a new, general methodology for inference and
learning in nonlinear state-space models that are described probabilistically
by non-parametric GP models. We apply the expectation maximization algorithm
to iterate between inference in the latent state-space and learning the
parameters of the underlying GP dynamics model.

** Comment:** poster.

Ryan Turner, Marc Peter Deisenroth, and Carl Edward Rasmussen.
**System
identification in Gaussian process dynamical systems**.
In Dilan Görür, editor, *NIPS Workshop on Nonparametric Bayes*,
Whistler, BC, Canada, December 2009.

** Comment:** poster.

Ryan Turner, Yunus Saatçi, and Carl Edward Rasmussen.
**Adaptive
sequential Bayesian change point detection**.
In Zaïd Harchaoui, editor, *NIPS Workshop on Temporal Segmentation*,
Whistler, BC, Canada, December 2009.

** Abstract:** Real-world
time series are often nonstationary with respect to the parameters of some
underlying prediction model (UPM). Furthermore, it is often desirable to
adapt the UPM to incoming regime changes as soon as possible, necessitating
sequential inference about change point locations. A Bayesian algorithm for
online change point detection (BOCPD) has been introduced recently by Adams
and MacKay (2007). In this algorithm, uncertainty about the last change point
location is updated sequentially, and is integrated out to make online
predictions robust to parameter changes. BOCPD requires a set of fixed
hyper-parameters which allow the user to fully specify the hazard function
for change points and the prior distribution over the parameters of the UPM.
In practice, finding the "right" hyper-parameters can be quite difficult. We
therefore extend BOCPD by introducing hyper-parameter learning, without
sacrificing the online nature of the algorithm. Hyper-parameter learning is
performed by optimizing the marginal likelihood of the BOCPD model, a
closed-form quantity which can be computed sequentially. We illustrate
performance on three real-world datasets.

Edward Snelson and Zoubin Ghahramani.
**Local and global
sparse Gaussian process approximations**.
In M. Meila and X. Shen, editors, *11th International Conference on
Artificial Intelligence and Statistics*. Omnipress, 2007.

**
Abstract:** Gaussian process (GP) models are flexible probabilistic
nonparametric models for regression, classification and other tasks.
Unfortunately they suffer from computational intractability for large data
sets. Over the past decade there have been many different approximations
developed to reduce this cost. Most of these can be termed global
approximations, in that they try to summarize all the training data via a
small set of support points. A different approach is that of local
regression, where many local experts account for their own part of space. In
this paper we start by investigating the regimes in which these different
approaches work well or fail. We then proceed to develop a new sparse GP
approximation which is a combination of both the global and local approaches.
Theoretically we show that it is derived as a natural extension of the
framework developed by Quiñonero-Candela and
Rasmussen for sparse GP approximations. We demonstrate the benefits of
the combined approximation on some 1D examples for illustration, and on some
large real-world data sets.

Edward Snelson and Zoubin Ghahramani.
**Sparse Gaussian
processes using pseudo-inputs**.
In Y. Weiss, B. Schölkopf, and J. Platt, editors, *Advances in Neural
Information Processing Systems 18*, pages 1257-1264. The MIT Press,
Cambridge, MA, 2006.

** Abstract:** We present a new Gaussian
process (GP) regression model whose covariance is parameterized by the
locations of M pseudo-input points, which we learn by a gradient based
optimization. We take $M \ll N$, where N is the number of real data points,
and hence obtain a sparse regression method which has $O(NM^{2})$
training cost and $O(M^{2})$ prediction cost per test case. We also
find hyperparameters of the covariance function in the same joint
optimization. The method can be viewed as a Bayesian regression model with
particular input dependent noise. The method turns out to be closely related
to several other sparse GP approaches, and we discuss the relation in detail.
We finally demonstrate its performance on some large data sets, and make a
direct comparison to other sparse GP methods. We show that our method can
match full GP performance with small M, i.e. very sparse solutions, and it
significantly outperforms other approaches in this regime.

Edward Snelson and Zoubin Ghahramani.
**Variable
noise and dimensionality reduction for sparse Gaussian processes**.
In R. Dechter and T. S. Richardson, editors, *22nd Conference on Uncertainty
in Artificial Intelligence*. AUAI Press, 2006.

**
Abstract:** The sparse pseudo-input Gaussian process (SPGP) is a new
approximation method for speeding up GP regression in the case of a large
number of data points N. The approximation is controlled by the gradient
optimization of a small set of M pseudo-inputs, thereby reducing complexity
from $O(N^{3})$ to $O(NM^{2})$. One limitation of the SPGP is
that this optimization space becomes impractically big for high dimensional
data sets. This paper addresses this limitation by performing automatic
dimensionality reduction. A projection of the input space to a low
dimensional space is learned in a supervised manner, alongside the
pseudo-inputs, which now live in this reduced space. The paper also
investigates the suitability of the SPGP for modeling data with
input-dependent noise. A further extension of the model is made to make it
even more powerful in this regard - we learn an uncertainty parameter for
each pseudo-input. The combination of sparsity, reduced dimension, and
input-dependent noise makes it possible to apply GPs to much larger and more
complex data sets than was previously practical. We demonstrate the benefits
of these methods on several synthetic and real world problems.

Edward Snelson and Zoubin Ghahramani.
**Compact
approximations to Bayesian predictive distributions**.
In *22nd International Conference on Machine Learning*, Bonn, Germany,
August 2005. Omnipress.

** Abstract:** We provide a general
framework for learning precise, compact, and fast representations of the
Bayesian predictive distribution for a model. This framework is based on
minimizing the KL divergence between the true predictive density and a
suitable compact approximation. We consider various methods for doing this,
both sampling based approximations, and deterministic approximations such as
expectation propagation. These methods are tested on a mixture of Gaussians
model for density estimation and on binary linear classification, with both
synthetic data sets for visualization and several real data sets. Our results
show significant reductions in prediction time and memory footprint.

Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani.
**Warped Gaussian
processes**.
In S. Thrun, L. Saul, and B. Schölkopf, editors, *Advances in Neural
Information Processing Systems 16*, pages 337-344, Cambridge, MA, USA,
December 2004. The MIT Press.

** Abstract:** We generalise
the Gaussian process (GP) framework for regression by learning a nonlinear
transformation of the GP outputs. This allows for non-Gaussian processes and
non-Gaussian noise. The learning algorithm chooses a nonlinear transformation
such that transformed data is well-modelled by a GP. This can be seen as
including a preprocessing transformation as an integral part of the
probabilistic modelling problem, rather than as an ad-hoc step. We
demonstrate on several real regression problems that learning the
transformation can lead to significantly better performance than using a
regular GP, or a GP with a fixed transformation.

Umang Bhatt, Adrian Weller, and Jose M. F. Moura.
**Evaluating and aggregating
feature-based model explanations**.
In *International Joint Conference on Artificial Intelligence*, 2020.

** Abstract:** A feature-based model explanation denotes how
much each input feature contributes to a model's output for a given data
point. As the number of proposed explanation functions grows, we lack
quantitative evaluation criteria to help practitioners know when to use which
explanation function. This paper proposes quantitative evaluation criteria
for feature-based explanations: low sensitivity, high faithfulness, and low
complexity. We devise a framework for aggregating explanation functions. We
develop a procedure for learning an aggregate explanation function with lower
complexity and then derive a new aggregate Shapley value explanation function
that minimizes sensitivity.

Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan
Jia, Joydeep Ghosh, Ruchir Puri, José M. F. Moura, and Peter Eckersley.
**Explainable machine learning in
deployment**.
In *ACM Conference on Fairness, Accountability, and Transparency
(FAT\*)*, 2020.

** Abstract:** Explainable machine learning
offers the potential to provide stakeholders with insights into model
behavior by using various methods such as feature importance scores,
counterfactual explanations, or influential training data. Yet there is
little understanding of how organizations use these methods in practice. This
study explores how organizations view and use explainability for stakeholder
consumption. We find that, currently, the majority of deployments are not for
end users affected by the model but rather for machine learning engineers,
who use explainability to debug the model itself. There is thus a gap between
explainability in practice and the goal of transparency, since explanations
primarily serve internal stakeholders rather than external ones. Our study
synthesizes the limitations of current explainability techniques that hamper
their use for end users. To facilitate end user interaction, we develop a
framework for establishing clear goals for explainability. We end by
discussing concerns raised regarding explainability.

Krzysztof Choromanski, David Cheikhi, Jared Davis, Valerii Likhosherstov,
Achille Nazaret, Achraf Bahamou, Xingyou Song, Mrugank Akarte, Jack
Parker-Holder, Jacob Bergquist, Yuan Gao, Aldo Pacchiano, Tamas Sarlos,
Adrian Weller, and Vikas Sindhwani.
**Stochastic flows and geometric
optimization on the orthogonal group**.
In *37th International Conference on Machine Learning*, 2020.

** Abstract:** We present a new class of stochastic,
geometrically-driven optimization algorithms on the orthogonal group O(d) and
naturally reductive homogeneous manifolds obtained from the action of the
rotation group SO(d). We theoretically and experimentally demonstrate that
our methods can be applied in various fields of machine learning including
deep, convolutional and recurrent neural networks, reinforcement learning,
normalizing flows and metric learning. We show an intriguing connection
between efficient stochastic optimization on the orthogonal group and graph
theory (e.g. matching problem, partition functions over graphs,
graph-coloring). We leverage the theory of Lie groups and provide theoretical
results for the designed class of algorithms. We demonstrate broad
applicability of our methods by showing strong performance on the seemingly
unrelated tasks of learning world models to obtain stable policies for the
most difficult Humanoid agent from OpenAI Gym and improving convolutional
neural networks.

Botty Dimanov, Umang Bhatt, Mateja Jamnik, and Adrian Weller.
**You
shouldn't trust me: Learning models which conceal unfairness from multiple
explanation methods**.
In *European Conference on Artificial Intelligence (ECAI)*, 2020.

** Abstract:** Transparency of algorithmic systems has been
discussed as a way for end-users and regulators to develop appropriate trust
in machine learning models. One popular approach, LIME [26], even suggests
that model explanations can answer the question "Why should I trust you?"
Here we show a straightforward method for modifying a pre-trained model to
manipulate the output of many popular feature importance explanation methods
with little change in accuracy, thus demonstrating the danger of trusting
such explanation methods. We show how this explanation attack can mask a
model’s discriminatory use of a sensitive feature, raising strong concerns
about using such explanation methods to check model fairness.

Moein Khajehnejad, Ahmad Asgharian Rezaei, Mahmoudreza Babaei, Jessica
Hoffmann, Mahdi Jalili, and Adrian Weller.
**Adversarial graph embeddings for
fair influence maximization over social networks**.
In *International Joint Conference on Artificial Intelligence*, 2020.

** Abstract:** Influence maximization is a widely studied topic
in network science, where the aim is to reach the maximum possible number of
nodes, while only targeting a small initial set of individuals. It has
critical applications in many fields, including viral marketing, information
propagation, news dissemination, and vaccinations. However, the objective
does not usually take into account whether the final set of influenced nodes
is fair with respect to sensitive attributes, such as race or gender. Here we
address fair influence maximization, aiming to reach minorities more
equitably. We introduce Adversarial Graph Embeddings: we co-train an
auto-encoder for graph embedding and a discriminator to discern sensitive
attributes. This leads to embeddings which are similarly distributed across
sensitive attributes. We then find a good initial set by clustering the
embeddings. We believe we are the first to use embeddings for the task of
fair influence maximization. While there are typically trade-offs between
fairness and influence maximization objectives, our experiments on synthetic
and real-world datasets show that our approach dramatically reduces disparity
while remaining competitive with state-of-the-art influence maximization
methods.

Yunfei Teng, Wenbo Gao, Francois Chalus, Anna Choromanska, Donald Goldfarb, and
Adrian Weller.
**Leader stochastic gradient
descent (LSGD) for distributed training of deep learning models**.
In *Advances in Neural Information Processing Systems 32*, Vancouver,
December 2019.

** Abstract:** We consider distributed
optimization under communication constraints for training deep learning
models. We propose a new algorithm, whose parameter updates rely on two
forces: a regular gradient step, and a corrective direction dictated by the
currently best-performing worker (leader). Our method differs from the
parameter-averaging scheme EASGD in a number of ways: (i) our objective
formulation does not change the location of stationary points compared to the
original optimization problem; (ii) we avoid convergence decelerations caused
by pulling local workers descending to different local minima to each other
(i.e. to the average of their parameters); (iii) our update by design breaks
the curse of symmetry (the phenomenon of being trapped in poorly generalizing
sub-optimal solutions in symmetric non-convex landscapes); and (iv) our
approach is more communication efficient since it broadcasts only parameters
of the leader rather than all workers. We provide theoretical analysis of the
batch version of the proposed algorithm, which we call Leader Gradient
Descent (LGD), and its stochastic variant (LSGD). Finally, we implement an
asynchronous version of our algorithm and extend it to the multi-leader
setting, where we form groups of workers, each represented by its own local
leader (the best performer in a group), and update each worker with a
corrective direction comprised of two attractive forces: one to the local,
and one to the global leader (the best performer among all workers). The
multi-leader setting is well-aligned with current hardware architecture,
where local workers forming a group lie within a single computational node
and different groups correspond to different nodes. For training
convolutional neural networks, we empirically demonstrate that our approach
compares favorably to state-of-the-art baselines.
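
The two forces described in the abstract above can be sketched on a toy problem. This is a minimal, single-process illustration of the batch (LGD) idea with exact gradients and scalar parameters, not the authors' distributed implementation. The names `lgd_step`, `run_lgd`, `lr`, and `pull` are assumptions made for the example.

```python
def lgd_step(params, loss, grad, lr=0.1, pull=0.5):
    """One synchronous update for all workers.

    Each worker takes a gradient step plus a corrective step towards the
    current leader, i.e. the worker with the lowest loss.
    """
    leader = min(params, key=loss)
    return [x - lr * (grad(x) + pull * (x - leader)) for x in params]

def run_lgd(params, loss, grad, steps=200, lr=0.1, pull=0.5):
    """Apply lgd_step repeatedly and return the final worker parameters."""
    for _ in range(steps):
        params = lgd_step(params, loss, grad, lr, pull)
    return params

# Toy quadratic objective with its minimum at x = 3.
loss = lambda x: (x - 3.0) ** 2
grad = lambda x: 2.0 * (x - 3.0)
final = run_lgd([-4.0, 0.5, 10.0], loss, grad)
```

Because the corrective term vanishes for the leader itself, a worker sitting at a stationary point of the original objective stays there, which mirrors point (i) in the abstract.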

Niki Kilbertus, Phil Ball, Matt Kusner, Adrian Weller, and Ricardo Silva.
**The sensitivity of counterfactual
fairness to unmeasured confounding**.
In *35th Conference on Uncertainty in Artificial Intelligence*, Tel
Aviv, July 2019.

** Abstract:** Causal approaches to fairness
have seen substantial recent interest, both from the machine learning
community and from wider parties interested in ethical prediction algorithms.
In no small part, this has been due to the fact that causal models allow one
to simultaneously leverage data and expert knowledge to remove discriminatory
effects from predictions. However, one of the primary assumptions in causal
modeling is that you know the causal graph. This introduces a new opportunity
for bias, caused by misspecifying the causal model. One common way for
misspecification to occur is via unmeasured confounding: the true causal
effect between variables is partially described by unobserved quantities. In
this work we design tools to assess the sensitivity of fairness measures to
this confounding for the popular class of non-linear additive noise models
(ANMs). Specifically, we give a procedure for computing the maximum
difference between two counterfactually fair predictors, where one has become
biased due to confounding. For the case of bivariate confounding our
technique can be swiftly computed via a sequence of closed-form updates. For
multivariate confounding we give an algorithm that can be efficiently solved
via automatic differentiation. We demonstrate our new sensitivity analysis
tools in real-world fairness scenarios to assess the bias arising from
confounding.

Tameem Adel and Adrian Weller.
**TibGM: A
transferable and information-based graphical model approach for reinforcement
learning**.
In *36th International Conference on Machine Learning*, Long Beach, June
2019.

** Abstract:** One of the challenges to reinforcement
learning (RL) is scalable transferability among complex tasks. Incorporating
a graphical model (GM), along with the rich family of related methods, as a
basis for RL frameworks provides potential to address issues such as
transferability, generalisation and exploration. Here we propose a flexible
GM-based RL framework which leverages efficient inference procedures to
enhance generalisation and transfer power. In our proposed transferable and
information-based graphical model framework ‘TibGM’, we show the
equivalence between our mutual information-based objective in the GM, and an
RL consolidated objective consisting of a standard reward maximisation target
and a generalisation/transfer objective. In settings where there is a sparse
or deceptive reward signal, our TibGM framework is flexible enough to
incorporate exploration bonuses depicting intrinsic rewards. We empirically
verify improved performance and exploration power.

Krzysztof Choromanski, Mark Rowland, Wenyu Chen, and Adrian Weller.
**Unifying
orthogonal Monte Carlo methods**.
In *36th International Conference on Machine Learning*, Long Beach, June
2019.

** Abstract:** Many machine learning methods making use
of Monte Carlo sampling in vector spaces have been shown to be improved by
conditioning samples to be mutually orthogonal. Exact orthogonal coupling of
samples is computationally intensive, hence approximate methods have been of
great interest. In this paper, we present a unifying perspective of many
approximate methods by considering Givens transformations, propose new
approximate methods based on this framework, and demonstrate the first
statistical guarantees for families of approximate methods in kernel
approximation. We provide extensive empirical evaluations with guidance for
practitioners.
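
The Givens transformations mentioned in the abstract above can be illustrated with a short sketch. This is not the paper's construction: it only shows that a product of Givens rotations, each acting on one pair of coordinates, is always an exactly orthogonal matrix. The function names, the uniform choice of coordinate pairs and angles, and the number of rotations are all assumptions made for illustration.

```python
import math
import random


def apply_givens(rows, i, j, angle):
    """Left-multiply the matrix by a Givens rotation acting on rows i and j (in place)."""
    c, s = math.cos(angle), math.sin(angle)
    for k in range(len(rows[i])):
        a, b = rows[i][k], rows[j][k]
        rows[i][k] = c * a - s * b
        rows[j][k] = s * a + c * b


def random_givens_product(dim, n_rotations, seed=0):
    """Build an orthogonal matrix as a product of random Givens rotations.

    Assumption: coordinate pairs and angles are drawn uniformly; other
    sampling schemes are possible and are not covered here.
    """
    rng = random.Random(seed)
    mat = [[1.0 if r == c else 0.0 for c in range(dim)] for r in range(dim)]
    for _ in range(n_rotations):
        i, j = rng.sample(range(dim), 2)
        apply_givens(mat, i, j, rng.uniform(0.0, 2.0 * math.pi))
    return mat


def gram(mat):
    """Return the matrix of pairwise inner products between rows."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in mat] for r1 in mat]
```

Each rotation touches only two rows, so applying one costs time linear in the dimension. Checking that `gram(random_givens_product(5, 40))` is close to the identity confirms that the rows stay orthonormal.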

Mark Rowland, Jiri Hron, Yunhao Tang, Krzysztof Choromanski, Tamas Sarlos, and
Adrian Weller.
**Orthogonal
estimation of Wasserstein distances**.
In *22nd International Conference on Artificial Intelligence and
Statistics*, Okinawa, Japan, April 2019.

** Abstract:**
Wasserstein distances are increasingly used in a wide variety of applications
in machine learning. Sliced Wasserstein distances form an important subclass
which may be estimated efficiently through one-dimensional sorting
operations. In this paper, we propose a new variant of sliced Wasserstein
distance, study the use of orthogonal coupling in Monte Carlo estimation of
Wasserstein distances and draw connections with stratified sampling, and
evaluate our approaches experimentally in a range of large-scale experiments
in generative modelling and reinforcement learning.
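
As a rough illustration of the one-dimensional sorting idea in the abstract above, the following sketch estimates the standard sliced Wasserstein distance between two equal-size samples using independent random projection directions. It does not implement the new variant or the orthogonal coupling proposed in the paper. The function names, the default of 50 projections, and the equal-sample-size restriction are assumptions made to keep the example short.

```python
import math
import random


def random_direction(dim, rng):
    """Draw a unit vector uniformly on the sphere by normalising a Gaussian draw."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def wasserstein_1d_pow(a, b, p):
    """p-th power of the p-Wasserstein distance between equal-size 1-D samples.

    In one dimension the optimal coupling matches sorted values.
    """
    return sum(abs(x - y) ** p for x, y in zip(sorted(a), sorted(b))) / len(a)


def sliced_wasserstein(xs, ys, n_projections=50, p=2, seed=0):
    """Monte Carlo estimate of the sliced p-Wasserstein distance (iid directions)."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        theta = random_direction(dim, rng)
        px = [sum(t * c for t, c in zip(theta, x)) for x in xs]
        py = [sum(t * c for t, c in zip(theta, y)) for y in ys]
        total += wasserstein_1d_pow(px, py, p)
    return (total / n_projections) ** (1.0 / p)
```

For two identical samples the estimate is exactly zero. For a sample and a copy translated by a fixed vector, it is bounded above by the length of that vector.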

Tameem Adel, Isabel Valera, Zoubin Ghahramani, and Adrian Weller.
**One-network
adversarial fairness**.
In *33rd AAAI Conference on Artificial Intelligence*, Hawaii, January
2019.

** Abstract:** There is currently a great expansion of
the impact of machine learning algorithms on our lives, prompting the need
for objectives other than pure performance, including fairness. Fairness here
means that the outcome of an automated decision-making system should not
discriminate between subgroups characterized by sensitive attributes such as
gender or race. Given any existing differentiable classifier, we make only
slight adjustments to the architecture including adding a new hidden layer,
in order to enable the concurrent adversarial optimization for fairness and
accuracy. Our framework provides one way to quantify the tradeoff between
fairness and accuracy, while also leading to strong empirical
performance.

Stephen Cave, Rune Nyrup, Karina Vold, and Adrian Weller.
**Motivations and risks
of machine ethics**.
*Proceedings of the IEEE*, 107(3):562-574, 2019.

**
Abstract:** This paper surveys reasons for and against pursuing the field
of machine ethics, understood as research aiming to build “ethical
machines.” We clarify the nature of this goal, why it is worth pursuing, and
the risks involved in its pursuit. First, we survey and clarify some of the
philosophical issues surrounding the concept of an “ethical machine” and
the aims of machine ethics. Second, we argue that while there are good prima
facie reasons for pursuing machine ethics, including the potential to improve
the ethical alignment of both humans and machines, there are also potential
risks that must be considered. Third, we survey these potential risks and
point to where research should be devoted to clarifying and managing
potential risks. We conclude by making some recommendations about the
questions that future work could address.

Ofer Meshi, Ben London, Adrian Weller, and David Sontag.
**Train and test tightness of
LP relaxations in structured prediction**.
*Journal of Machine Learning Research*, 20(13):1-34, 2019.

** Abstract:** Structured prediction is used in areas including
computer vision and natural language processing to predict structured outputs
such as segmentations or parse trees. In these settings, prediction is
performed by MAP inference or, equivalently, by solving an integer linear
program. Because of the complex scoring functions required to obtain accurate
predictions, both learning and inference typically require the use of
approximate solvers. We propose a theoretical explanation for the striking
observation that approximations based on linear programming (LP) relaxations
are often tight (exact) on real-world instances. In particular, we show that
learning with LP relaxed inference encourages integrality of training
instances, and that this training tightness generalizes to test data.

Mark Rowland, Krzysztof Choromanski, Francois Chalus, Aldo Pacchiano, Tamas
Sarlos, Richard Turner, and Adrian Weller.
**Geometrically
coupled Monte Carlo sampling**.
In *Advances in Neural Information Processing Systems 32*, Montreal,
Canada, December 2018.

** Abstract:** Monte Carlo sampling in
high-dimensional, low-sample settings is important in many machine learning
tasks. We improve current methods for sampling in Euclidean spaces by
avoiding independence, and instead consider ways to couple samples. We show
fundamental connections to optimal transport theory, leading to novel
sampling algorithms, and providing new theoretical grounding for existing
strategies. We compare our new strategies against prior methods for improving
sample efficiency, including quasi-Monte Carlo, by studying discrepancy. We
explore our findings empirically, and observe benefits of our sampling
schemes for reinforcement learning and generative modelling.

Tameem Adel, Zoubin Ghahramani, and Adrian Weller.
**Discovering
interpretable representations for both deep generative and discriminative
models**.
In *35th International Conference on Machine Learning*, Stockholm,
Sweden, July 2018.

** Abstract:** Interpretability of
representations in both deep generative and discriminative models is highly
desirable. Current methods jointly optimize an objective combining accuracy
and interpretability. However, this may reduce accuracy, and is not
applicable to already trained models. We propose two interpretability
frameworks. First, we provide an interpretable lens for an existing model. We
use a generative model which takes as input the representation in an existing
(generative or discriminative) model, weakly supervised by limited side
information. Applying a flexible and invertible transformation to the input
leads to an interpretable representation with no loss in accuracy. We extend
the approach using an active learning strategy to choose the most useful side
information to obtain, allowing a human to guide what “interpretable” means.
Our second framework relies on joint optimization for a representation which
is both maximally informative about the side information and maximally
compressive about the non-interpretable data factors. This leads to a novel
perspective on the relationship between compression and regularization. We
also propose a new interpretability evaluation metric based on our framework.
Empirically, we achieve state-of-the-art results on three datasets using the
two proposed algorithms.

Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard Turner, and
Adrian Weller.
**Structured
evolution with compact architectures for scalable policy
optimization**.
In *35th International Conference on Machine Learning*, Stockholm,
Sweden, July 2018.

** Abstract:** We present a new method of
blackbox optimization via gradient approximation with the use of structured
random orthogonal matrices, providing more accurate estimators than baselines
and with provable theoretical guarantees. We show that this algorithm can be
successfully applied to learn better quality compact policies than those
using standard gradient estimation techniques. The compact policies we learn
have several advantages over unstructured ones, including faster training
algorithms and faster inference. These benefits are important when the policy
is deployed on real hardware with limited resources. Further, compact
policies provide more scalable architectures for derivative-free optimization
(DFO) in high dimensional spaces. We show that most robotics tasks from the
OpenAI Gym can be solved using neural networks with less than 300 parameters,
with almost linear time complexity of the inference phase, with up to 13x
fewer parameters relative to the Evolution Strategies (ES) algorithm
introduced by Salimans et al. (2017). We do not need heuristics such as
fitness shaping to learn good quality policies, resulting in a simple and
theoretically motivated training mechanism.

Niki Kilbertus, Adria Gascon, Matt Kusner, Michael Veale, Krishna P. Gummadi,
and Adrian Weller.
**Blind
justice: Fairness with encrypted sensitive attributes**.
In *35th International Conference on Machine Learning*, Stockholm,
Sweden, July 2018.

** Abstract:** Recent work has explored how
to train machine learning models which do not discriminate against any
subgroup of the population as determined by sensitive attributes such as
gender or race. To avoid disparate treatment, sensitive attributes should not
be considered. On the other hand, in order to avoid disparate impact,
sensitive attributes must be examined — e.g., in order to learn a fair
model, or to check if a given model is fair. We introduce methods from secure
multi-party computation which allow us to avoid both. By encrypting sensitive
attributes, we show how an outcome based fair model may be learned, checked,
or have its outputs verified and held to account, without users revealing
their sensitive attributes.

Sungsoo Ahn, Michael Chertkov, Jinwoo Shin, and Adrian Weller.
**Gauged
mini-bucket elimination for approximate inference**.
In *21st International Conference on Artificial Intelligence and
Statistics*, Playa Blanca, Lanzarote, Canary Islands, April 2018.

** Abstract:** Computing the partition function Z of a discrete
graphical model is a fundamental inference challenge. Since this is
computationally intractable, variational approximations are often used in
practice. Recently, so-called gauge transformations were used to improve
variational lower bounds on Z. In this paper, we propose a new
gauge-variational approach, termed WMBE-G, which combines gauge
transformations with the weighted mini-bucket elimination (WMBE) method.
WMBE-G can provide both upper and lower bounds on Z, and is easier to
optimize than the prior gauge-variational algorithm. We show that WMBE-G
strictly improves the earlier WMBE approximation for symmetric models
including Ising models with no magnetic field. Our experimental results
demonstrate the effectiveness of WMBE-G even for generic, nonsymmetric
models.

Krzysztof Choromanski, Mark Rowland, Tamas Sarlos, Vikas Sindhwani, Richard E.
Turner, and Adrian Weller.
**The geometry of
random features**.
In *21st International Conference on Artificial Intelligence and
Statistics*, Playa Blanca, Lanzarote, Canary Islands, April 2018.

** Abstract:** We present an in-depth examination of the
effectiveness of radial basis function kernel (beyond Gaussian) estimators
based on orthogonal random feature maps. We show that orthogonal estimators
outperform state-of-the-art mechanisms that use iid sampling under weak
conditions for tails of the associated Fourier distributions. We prove that
for the case of many dimensions, the superiority of the orthogonal transform
can be accurately measured by a property we define called the charm of the
kernel, and that orthogonal random features provide optimal (in terms of mean
squared error) kernel estimators. We provide the first theoretical results
which explain why orthogonal random features outperform unstructured on
downstream tasks such as kernel ridge regression by showing that orthogonal
random features provide kernel algorithms with better spectral properties
than the previous state-of-the-art. Our results enable practitioners more
generally to estimate the benefits from applying orthogonal transforms.
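
To make the notion of orthogonal random feature maps in the abstract above concrete, here is a minimal sketch for the Gaussian kernel exp(-||x - y||^2 / 2) only. The paper treats a broader class of radial basis function kernels, so the Gaussian case, the Gram-Schmidt construction, the single block of `dim` frequencies, and all function names are assumptions chosen for simplicity.

```python
import math
import random


def gram_schmidt(rows):
    """Orthonormalise the rows of a square matrix."""
    basis = []
    for row in rows:
        v = list(row)
        for b in basis:
            proj = sum(x * y for x, y in zip(v, b))
            v = [x - proj * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        basis.append([x / norm for x in v])
    return basis


def orthogonal_frequencies(dim, seed=0):
    """Return dim mutually orthogonal frequency vectors.

    Each direction is rescaled by the norm of a fresh Gaussian vector, so a
    single row has the same marginal law as an iid N(0, I) frequency.
    """
    rng = random.Random(seed)
    gauss = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(dim)]
    freqs = []
    for direction in gram_schmidt(gauss):
        radius = math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim)))
        freqs.append([radius * x for x in direction])
    return freqs


def feature_map(x, freqs):
    """Random Fourier features: cosine and sine of each projection, scaled by 1/sqrt(m)."""
    scale = 1.0 / math.sqrt(len(freqs))
    feats = []
    for w in freqs:
        angle = sum(a * b for a, b in zip(w, x))
        feats.extend([scale * math.cos(angle), scale * math.sin(angle)])
    return feats


def kernel_estimate(x, y, freqs):
    """Inner product of feature maps, an estimate of the Gaussian kernel value."""
    fx, fy = feature_map(x, freqs), feature_map(y, freqs)
    return sum(a * b for a, b in zip(fx, fy))
```

Replacing the Gram-Schmidt step with the raw Gaussian matrix recovers the usual iid random Fourier features, which makes the two estimators easy to compare on the same inputs.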

Nina Grgić-Hlača, Elissa Redmiles, Krishna P. Gummadi, and Adrian
Weller.
**Human
perceptions of fairness in algorithmic decision making: A case study of
criminal risk prediction**.
In *The Web Conference (WWW)*, Lyon, April 2018.

**
Abstract:** As algorithms are increasingly used to make important decisions
that affect human lives, ranging from social benefit assignment to predicting
risk of criminal recidivism, concerns have been raised about the fairness of
algorithmic decision making. Most prior works on algorithmic fairness
normatively prescribe how fair decisions ought to be made. In contrast, here,
we descriptively survey users for how they perceive and reason about fairness
in algorithmic decision making. A key contribution of this work is the
framework we propose to understand why people perceive certain features as
fair or unfair to be used in algorithms. Our framework identifies eight
properties of features, such as relevance, volitionality and reliability, as
latent considerations that inform people’s moral judgments about the
fairness of feature use in decision-making algorithms. We validate our
framework through a series of scenario-based surveys with 576 people. We find
that, based on a person’s assessment of the eight latent properties of a
feature in our exemplar scenario, we can accurately (> 85%) predict if the
person will judge the use of the feature as fair. Our findings have important
implications. At a high-level, we show that people’s unfairness concerns
are multi-dimensional and argue that future studies need to address
unfairness concerns beyond discrimination. At a low-level, we find
considerable disagreements in people’s fairness judgments. We identify root
causes of the disagreements, and note possible pathways to resolve them.

Mahmoudreza Babaei, Juhi Kulshrestha, Abhijnan Chakraborty, Fabricio
Benevenuto, Krishna P. Gummadi, and Adrian Weller.
**Purple
feed: Identifying high consensus news posts on social media**.
In *1st AAAI/ACM Conference on Artificial Intelligence, Ethics and
Society*, New Orleans, February 2018.

** Abstract:**
Although diverse news stories are actively posted on social media, readers
often focus on news which reinforces their pre-existing views, leading to
‘filter bubble’ effects. To combat this, some recent systems expose and
nudge readers toward stories with different points of view. One example is
the Wall Street Journal’s ‘Blue Feed, Red Feed’ system, which presents
posts from biased publishers on each side of a topic. However, these systems
have had limited success. In this work, we present a complementary approach
which identifies high consensus ‘purple’ posts that generate similar
reactions from both ‘blue’ and ‘red’ readers. We define and
operationalize consensus for news posts on Twitter in the context of US
politics. We identify several high consensus posts and discuss their
empirical properties. We present a highly scalable method for automatically
identifying high and low consensus news posts on Twitter, by utilizing a
novel category of audience leaning based features, which we show are well
suited to this task. Finally, we build and publicly deploy our ‘Purple
Feed’ system (twitter-app.mpi-sws.org/purple-feed), which highlights high
consensus posts from publishers on both sides of the political spectrum.

N. Grgić-Hlača, M. B. Zafar, K. P. Gummadi, and A. Weller.
**Beyond
distributive fairness in algorithmic decision making: Feature selection for
procedurally fair learning**.
In *32nd AAAI Conference on Artificial Intelligence*, New Orleans,
February 2018.

** Abstract:** With wide-spread usage of
machine learning methods in numerous domains involving human subjects,
several studies have raised questions about the potential for unfairness
towards certain individuals or groups. A number of recent works have proposed
methods to measure and